
# A Stage Summary of Learning Web Scraping

I plan to stop here with the basics of web scraping and go back to the documentation (and Google) whenever I actually need it. Writing a small-scale crawler is no longer a problem; for distributed crawling, incremental updates, de-duplication and similar needs I will go straight to a framework. Reusing other people's wheels feels pretty good.

- Simple, small-scale: requests + pyquery (a minimal sketch appears at the end of this post)
- Heavily JS-rendered pages: Selenium + PhantomJS
- Frameworks: pyspider or Scrapy. I personally prefer Scrapy, mainly because pyspider's documentation is really thin. The two frameworks are similar; the former has a Web UI, the latter is command-line driven. Pick whichever you like.

Summary of learning materials:

**Part 1: Basics**

1. Environment setup: I suggest simply using an Ubuntu virtual machine (it ships with Python 2 and 3) and typing code in the terminal.

2. Python basics (start here if you have zero background):
   http://www.runoob.com/python3/python3-tutorial.html

3. After finishing the above, read this one (I feel the quality drops after the chapter on error handling and debugging; it starts dumping everything on you at once):
   https://www.liaoxuefeng.com/wiki/… (Liao Xuefeng's Python tutorial)

   Further reading: 《简明Python教程》 (A Byte of Python)
   https://molun.net/byte-of-python-2017-new-translation-edition-release/

4. Git tutorial:
   https://www.liaoxuefeng.com/wiki/… (Liao Xuefeng's Git tutorial)

   This one is well written and is all you need. Later you can clone project code from GitHub, where there are also plenty of crawler projects to learn from. Pay attention to the star and fork counts and prefer recently pushed repositories. It is a much better source than crawlers found through Baidu, many of which are outdated or broken because the target site changed its policy, which wastes a beginner a lot of time.

**Part 2: Crawling**

1. Cui Qingcai's blog and video tutorials:
   https://cuiqingcai.com/

   The teaching videos are on YouTube (I also saved them to my own network drive). You can get the video links by replying "1" to the WeChat public account 【Python数据分析之路】. The videos are good; follow them step by step and you will have the basics of crawling down.

2. Read up on the related background knowledge:
   While following the videos you will keep running into front-end concepts. Cui has a front-end background, so some parts go by quickly; when something is unclear, pause the video and look up these topics: HTTP / HTML / AJAX / JSON / CSS / XPath.
   Tutorials for all of them can be found at http://www.runoob.com/
   Focus on the essentials. For HTML, for example, the basic elements, attributes and the summary are enough; fill in the rest when you need it.

3. Database basics: MySQL / MongoDB / Redis.
   Tutorials are also at http://www.runoob.com/
   Again, the basic usage is enough; a simple crawler does not ask for much, and you can go deeper later.

4. Read the documentation of the relevant libraries and type out the examples as you go. This covers request libraries, parsing libraries, storage libraries and utility libraries, roughly the following (all easy to find via Baidu or Google):
   requests / re / selenium / lxml / BeautifulSoup / pyquery / pyspider / scrapy / pymysql / pymongo

   I think the parsing libraries deserve the most attention: try bs4 / pyquery / lxml with the CSS and XPath selector syntax, get familiar with what type each selection step returns, how to iterate over it and how to pull out the data you want. For every example it helps to call print(type(...)) to see the type and deepen your understanding.

**Part 3: Practice resources**

1. A collection shared on Zhihu (quite a few of the examples no longer work, but you can use the names as keywords to search for newer write-ups):
   https://zhuanlan.zhihu.com/p/27938007

2. The handful of projects in Cui Qingcai's videos. You do not need many.

**Part 4: Other lessons learned**

1. Crawling gives you a quick sense of accomplishment, so do not sink too much time into it (when I actually need data I still ask someone else for it, so join a QQ group; programmers are happy to help). Know when to stop, unless you plan to make a long-term career out of crawling.
2. Join a few reliable Python or crawler QQ groups.
3. Take notes and write code. Writing a crawler may take one hour while writing up an article about it takes two, but writing it up deepens your understanding, just like taking notes while reading.
4. At a minimum, complete the following crawler exercises:
   (1) requests + regular expressions on a static site (ideally with a search keyword), plus multiprocessing, database storage and file downloads (images and text)
   (2) requests + lxml + XPath on a static site, otherwise same as (1)
   (3) requests + bs4 + CSS/XPath on a static site, otherwise same as (1)
   (4) requests + pyquery + CSS on a static site, otherwise same as (1)
   (5) Selenium + PhantomJS on a static site, otherwise same as (1)
   (6) pyspider + Selenium + PhantomJS on a static site, otherwise same as (1) (using pyspider on a static site feels like overkill)
   (7) Scrapy on a dynamic site, otherwise same as (1)
   (8) Find a site that bans by IP and cookies (Weibo, for example), crawl it with Scrapy, use several pipelines, then make the crawl distributed (three cloud servers are enough: one to dispatch tasks, two to crawl), otherwise same as (1)
5. Keep at it, and know when to stop.

**Part 5: Other resources**

1. Articles on anti-crawling:
   A thorough and well-written overview: http://www.freebuf.com/articles/web/137763.html
   A hands-on demonstration: http://www.freebuf.com/news/140965.html
   A nice example of breaking a protection scheme: http://blog.csdn.net/bone_ace/article/details/…
2. A complete crawler with a front-end UI (really quite good!):
   https://github.com/GuozhuHe/webspider
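To make the "requests + pyquery" recommendation above concrete, here is a minimal sketch of what such a small crawler can look like. It is not taken from any of the linked tutorials, and the URL and CSS selector are placeholders you would replace with a real target:

```python
import requests
from pyquery import PyQuery as pq

# Hypothetical listing page and selector; swap in a site you are allowed to crawl.
url = 'https://example.com/list'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
resp.encoding = resp.apparent_encoding  # guard against a mis-detected encoding

doc = pq(resp.text)
for link in doc('a.title').items():     # CSS selector, as the post suggests
    # print(type(...)) at each step, as the post suggests, to see what you are working with
    print(type(link), link.text(), link.attr('href'))
```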
---

I first learned to write crawlers in Node.js; once I translated that into Python, I could write Python crawlers too.

By the time I could write a crawler, I had these skills:

1. The HTTP protocol
2. Sockets
3. HTML (I was actually more familiar with CSS and JavaScript, but they mattered less for my first crawlers)
4. Parsing HTML

Along the way I became very familiar with **the overall flow**: use the socket API to send an HTTP request, get back an HTTP response as a string / Buffer, and parse that string / Buffer to extract what I need. (A rough sketch of this raw-socket flow is shown at the end of this answer.)

That is the principle behind it. If you just want to write a crawler quickly, there are of course easier options:

- Use [Requests: HTTP for Humans](http://docs.python-requests.org/en/master/) to send requests, so you do not have to wrap raw sockets yourself, and handling the headers and body of an HTTP request becomes much more convenient.
- Use [pyquery: a jquery-like library for python](https://pythonhosted.org/pyquery/) to parse the response, so you do not have to parse strings by hand or build a tree first and then walk it.

These two steps are fairly painless. What happens after you have the data depends on your needs: plain text files are fine if you only want to store it, and there are plenty of charting libraries if you want to display it.

At this point you can already write a reasonably simple crawler. Later, when you tackle more complex crawls, you will hit all sorts of problems such as authentication, dynamic rendering and anti-crawling mechanisms. They are all covered in 《Python网络数据采集》 (Web Scraping with Python); flip through it when you need to.

One more remark: once you understand the underlying principles during learning, the simpler and more convenient your tools the better. Some people say writing a crawler is just calling APIs; well, if the API is convenient, you should call it. And even writing your own small network library on top of sockets, or parsing HTML with a tree, does not take that much effort.

If you can save time, save it. So go with **Requests + PyQuery**, and **do not** bother with urllib, urllib2, Beautiful Soup, XPath, re and the like.

The books the original poster read are indeed slightly off: they basically only teach Python syntax without clarifying the workflow of the task, and the workflow is actually the more important part. That, again, you can get from books:

- For a quick result, read 《Python网络数据采集》 (Web Scraping with Python).
- For solid foundations:
  - HTTP: 《图解HTTP》
  - HTML and CSS: 《Head First HTML与CSS(第2版)》
  - PyQuery: 《jQuery基础教程(第4版)》 (the first half is enough)

*(Figure: Python and Node)*
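As a rough illustration of the "socket API, then HTTP request, then response string, then parse" flow described in the answer above (this sketch is mine, not the author's, and example.com is a placeholder host), a bare-bones HTTP GET over a raw socket looks roughly like this:

```python
import socket

host = 'example.com'          # placeholder; HTTPS sites would also need the ssl module
request = (
    'GET / HTTP/1.1\r\n'
    'Host: {}\r\n'
    'Connection: close\r\n'   # ask the server to close, so recv() eventually returns b''
    '\r\n'
).format(host)

with socket.create_connection((host, 80)) as s:
    s.sendall(request.encode('ascii'))
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)

raw = b''.join(chunks)                            # status line + headers + body in one buffer
head, _, body = raw.partition(b'\r\n\r\n')        # the part after the blank line is the HTML to parse
print(head.decode('iso-8859-1').splitlines()[0])  # e.g. "HTTP/1.1 200 OK"
```

Which is exactly why the answer recommends letting Requests and PyQuery handle this plumbing for you.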
---

> At the end of July, Nanjing basically switched into "microwave oven" mode. I had driving lessons during the day, so at night I naturally stayed home and watched live streams. Watching 狗贼叔叔's screen fill up with danmu (bullet comments), I wondered: could I crawl those danmu? So I just did it.

## The result

For now I only capture the danmu text and the user who sent it, and print them to the terminal. If you are interested you can build on this; collecting danmu for some data analysis would work nicely too.

Here is what it looks like:

*(Figure: danmu messages scrolling in the terminal)*

## Gathering information

As a Google-oriented programmer, the first thing I did was type the keywords "Python 弹幕". To my surprise there is already a very complete third-party danmu library, "DanMu". It is super easy to use: a dozen lines of code will happily fetch a live room's danmu. Go search for it if you are curious.

In the spirit of practice (and of not being able to leave things alone), I still wanted to write my own version, and then I found that Douyu actually provides an open API. With a little bit of work on top of it, getting the information I want becomes straightforward.

## Douyu API documentation and access protocol

- 《斗鱼弹幕服务器第三方接入协议v1.4.1》: http://dev-bbs.douyutv.com/forum.php?mod=viewthread&tid=115&extra=page%3D1
- 《斗鱼第三方开放平台API文档v2.1》: http://dev-bbs.douyutv.com/forum.php?mod=viewthread&tid=113&extra=page%3D1

After reading the documents carefully, I found that as long as I implement the protocol header myself I can connect to the danmu server, and once the danmu request is constructed I can receive every danmu in real time.

## Building the request header

First, what the documentation asks for:

*(Figure: screenshot of the protocol header layout from the official document)*

In short, a request consists of three parts: the length, the header and the data section. Each part just has to be built the way the document specifies. Note that both what you send and what you receive are `bytes`.

Code:

```python
def send_req_msg(msgstr):
    '''Build and send a request in the format required by the Douyu API'''
    msg = msgstr.encode('utf8')
    data_length = len(msg) + 8
    code = 689
    # Build the protocol header
    msgHead = int.to_bytes(data_length, 4, 'little') \
        + int.to_bytes(data_length, 4, 'little') \
        + int.to_bytes(code, 4, 'little')
    client.send(msgHead)
    sent = 0
    while sent < len(msg):
        tn = client.send(msg[sent:])
        sent = sent + tn
```

## Receiving the danmu

This part also just follows the documentation: first send a login request, then send a "heartbeat" request at a fixed interval so the connection does not drop.

```python
def DM_start(roomid):
    # Build the login/authorization request
    msg = 'type@=loginreq/roomid@={}/\0'.format(roomid)
    send_req_msg(msg)
    # Build the request for joining the danmu group
    msg_more = 'type@=joingroup/rid@={}/gid@=-9999/\0'.format(roomid)
    send_req_msg(msg_more)
    while True:
        # Data returned by the server
        data = client.recv(1024)
        # Use the re module to find the sender's username and the danmu text
        danmu_username = username_re.findall(data)
        danmu_content = danmu_re.findall(data)
        if not data:
            break
        else:
            for i in range(0, len(danmu_content)):
                try:
                    # Print the message
                    print('[{}]:{}'.format(danmu_username[i].decode(
                        'utf8'), danmu_content[i].decode(encoding='utf8')))
                except:
                    continue


def keeplive():
    '''
    Keep the heartbeat alive: one heartbeat request every 15 seconds
    '''
    while True:
        msg = 'type@=keeplive/tick@=' + str(int(time.time())) + '/\0'
        send_req_msg(msg)
        print('发送心跳包')
        time.sleep(15)
```

## The tricky part

None of the above sounds very hard, but implementing the whole thing properly still needs quite a bit of knowledge:

- sockets
- regular expressions
- signals
- multithreading / multiprocessing

For example, I wanted to catch the Ctrl+C signal so that the program can clean up properly when we exit, and that is where the `signal` module comes in. Honestly, before today I had no idea it could be used this way.

The further you go, the more you realise how much you still have to learn. Python is a powerful and approachable language that lets you implement what you need quickly, but do not assume it is only good for writing crawlers.

## The complete code

With detailed comments:

```python
'''
Use the Douyu danmu API
to try to capture the danmu of a given Douyu TV room
'''
import multiprocessing
import socket
import time
import re
import signal

# Open a socket connection to the Douyu API server
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host = socket.gethostbyname("openbarrage.douyutv.com")
port = 8601
client.connect((host, port))

# Regular expressions for extracting the danmu fields
danmu_re = re.compile(b'txt@=(.+?)/cid@')
username_re = re.compile(b'nn@=(.+?)/txt@')


def send_req_msg(msgstr):
    '''Build and send a request in the format required by the Douyu API'''
    msg = msgstr.encode('utf-8')
    data_length = len(msg) + 8
    code = 689
    # Build the protocol header
    msgHead = int.to_bytes(data_length, 4, 'little') \
        + int.to_bytes(data_length, 4, 'little') \
        + int.to_bytes(code, 4, 'little')
    client.send(msgHead)
    sent = 0
    while sent < len(msg):
        tn = client.send(msg[sent:])
        sent = sent + tn


def DM_start(roomid):
    # Build the login/authorization request
    msg = 'type@=loginreq/roomid@={}/\0'.format(roomid)
    send_req_msg(msg)
    # Build the request for joining the danmu group
    msg_more = 'type@=joingroup/rid@={}/gid@=-9999/\0'.format(roomid)
    send_req_msg(msg_more)
    while True:
        # Data returned by the server
        data = client.recv(1024)
        # Use the re module to find the sender's username and the danmu text
        danmu_username = username_re.findall(data)
        danmu_content = danmu_re.findall(data)
        if not data:
            break
        else:
            for i in range(0, len(danmu_content)):
                try:
                    # Print the message
                    print('[{}]:{}'.format(danmu_username[i].decode(
                        'utf8'), danmu_content[i].decode(encoding='utf8')))
                except:
                    continue


def keeplive():
    '''
    Keep the heartbeat alive: one heartbeat request every 15 seconds
    '''
    while True:
        msg = 'type@=keeplive/tick@=' + str(int(time.time())) + '/\0'
        send_req_msg(msg)
        print('发送心跳包')
        time.sleep(15)


def logout():
    '''
    Disconnect from the Douyu server
    before shutting down
    '''
    msg = 'type@=logout/'
    send_req_msg(msg)
    print('已经退出服务器')


def signal_handler(signal, frame):
    '''
    Catch Ctrl+C, i.e. signal.SIGINT, and in the handler:
    log out from the Douyu server
    and terminate the worker processes
    '''
    p1.terminate()
    p2.terminate()
    logout()
    print('Bye')


if __name__ == '__main__':
    # room_id = input('请输入房间ID: ')
    # 狗贼's room number
    room_id = 208114
    # Register the signal handler
    signal.signal(signal.SIGINT, signal_handler)
    # Start the danmu process and the heartbeat process
    p1 = multiprocessing.Process(target=DM_start, args=(room_id,))
    p2 = multiprocessing.Process(target=keeplive)
    p1.start()
    p2.start()
```

> My daily study notes are also published at:
> WeChat public account: findyourownway
> Zhihu column: https://zhuanlan.zhihu.com/Ehco-python
> Blog: www.ehcoblog.ml
> GitHub: https://github.com/Ehco1996/Python-crawler
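As a quick sanity check of the framing that send_req_msg above builds (the message length written twice, then the message type 689, all little-endian), you can reproduce the header bytes in an interactive session. This is only an illustration of the code above; the room id is the same example value used in the article:

```python
>>> msg = 'type@=loginreq/roomid@=208114/\0'.encode('utf8')
>>> len(msg) + 8
39
>>> int.to_bytes(39, 4, 'little') + int.to_bytes(39, 4, 'little') + int.to_bytes(689, 4, 'little')
b"'\x00\x00\x00'\x00\x00\x00\xb1\x02\x00\x00"
```

The 12-byte header is then followed by the UTF-8 message body, which ends with the trailing '\0'.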
---

**[Editor's note from 通往数据自由之路] Long time no see. I come bearing code: this post covers crawling the news site 一点资讯 (Yidian Zixun), plus data analysis and machine learning on the result. Straight to the code!**

> **Overall approach**: Yidian Zixun is a news feed site similar to Toutiao. We crawl news of different categories from it (5 categories, 1,403 articles, over 1.9 million characters) to obtain the raw material, then analyse the data, build word clouds, train a machine-learning classifier to predict the category, and compute positive and negative sentiment statistics.

## Step 1: crawl the data. The main functions are as follows:

```text
# Crawl the news data. The main code is too long, so it is shown as images here.
```

*(Figures: screenshots of the crawling code)*

The crawl mainly collects links through URLs of the following form, and then fetches each news detail page via the URLs obtained:

> http://www.yidianzixun.com/home/q/news_list_for_channelchannel_id=c9&cstart=20&cend=40&infinite=false&refresh=1

Because Yidian Zixun aggregates content from many different news sites, the detail pages differ in encoding and layout, which needs special attention here.

## Step 2: data analysis and text processing

**When processing Chinese text, pay particular attention to garbled Chinese characters in plots.**

```python
# A few preparation steps before plotting
# Prevent garbled Chinese characters
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font
# Fix the minus sign '-' being rendered as a box when saving figures
mpl.rcParams['axes.unicode_minus'] = False
# Font size
font_size = 10
# Figure size
fig_size = (12, 9)
# Apply the font size
mpl.rcParams['font.size'] = font_size
# Apply the figure size
mpl.rcParams['figure.figsize'] = fig_size
```

Let's look at when the news gets published.

```python
# Sort by the date string
df_time = df.sort_values('date', ascending=[0])
# Drop the old index and rebuild it
df_time = df_time.reset_index(drop=True)
# Aggregate by counting how many articles were published in each time slot
df_time1 = copy.deepcopy(df_time)
df_time1['time'] = [time.strftime("%Y-%m-%d %H", time.strptime(str(postTime), '%Y-%m-%d %H:%M:%S')) for postTime in df_time1['date']]
time_count = (df_time1.loc[:, ['time', 'title']]).groupby(['time']).count()
time_count.index = pd.to_datetime(time_count.index)
# Plot the number of articles per time slot
time_count['title'].plot()
```

*(Figure: articles published per time slot)*

```python
# Zoom in on a shorter time range
time_count['title'][70:].plot()
```

*(Figure: articles published per time slot, shorter range)*

Publishing times roughly follow a normal daily routine: one peak around 9 in the morning and another between 21:00 and 22:00 at night.

Next, among the crawled categories, which news source contributes the most articles?

```python
# Group by source to see which sources are cited most often
source_count = (df_time.loc[:, ['source', 'title']]).groupby(['source']).count()
source_count_sort = source_count.sort_values(['title'], ascending=[0])
# Look at the most-cited sources
print(source_count_sort['title'][:10])
# Plot them
source_count_sort['title'][:10].plot()
```

*(Figure: top 10 news sources)*

> **Now let's look at the likes, comment counts and up-votes per category.**

(By the way, the little trick of adding data labels to the bars is something I only learned while doing this project; feel free to borrow this snippet.)

```python
# Prepare the data for plotting
tuple_c = []
for j in range(len(channel_id)):
    tuple_c.append(tuple([like['like'].iloc[j], comment_count['comment_count'].iloc[j], up['up'].iloc[j]]))
# Bar width
bar_width = 0.15
index = np.arange(3)
# c2 = sports, c3 = entertainment, c5 = finance, c7 = military, c9 = society
channel_name = ['体育', '娱乐', '财经', '军事', '社会']
rects1 = plt.bar(index, tuple_c[0], bar_width, color='#0072BC', label=channel_name[0])
rects2 = plt.bar(index + bar_width, tuple_c[1], bar_width, color='#4E1C2D', label=channel_name[1])
rects3 = plt.bar(index + bar_width*2, tuple_c[2], bar_width, color='g', label=channel_name[2])
rects4 = plt.bar(index + bar_width*3, tuple_c[3], bar_width, color='#ED1C24', label=channel_name[3])
rects5 = plt.bar(index + bar_width*4, tuple_c[4], bar_width, color='c', label=channel_name[4])
plt.xticks(index + bar_width, count_name)
# plt.ylim(ymax=100, ymin=0)
# Chart title
plt.title(u'like,comment,up 对比')
# Show the legend below the chart
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.03), fancybox=True, ncol=5)

# Add data labels
def add_labels(rects):
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height, height, ha='center', va='bottom')
        # Fill the bar edges with white, purely for looks
        rect.set_edgecolor('white')

add_labels(rects1)
add_labels(rects2)
add_labels(rects3)
add_labels(rects4)
add_labels(rects5)
```

*(Figure: likes / comments / up-votes per category)*

Find the title of the article with the most comments:

```python
# Find the article with the highest comment count
df_comment = df.sort_values('comment_count', ascending=[0])
# Drop the old index and rebuild it
df_comment = df_comment.reset_index(drop=True)
print(df_comment.iloc[0])
```

> title              《人民的名义》40戏骨总片酬4800万 不敌一小鲜肉
> source             沈阳晚报
> category           娱乐
> date               07:38:47
> like               159
> comment_count      16252
> up                 5349
> detail_fulltext    湖南卫视《人民的名义》播出劲头越来越足,这部集结陆毅、张丰毅、吴刚、许亚军、张凯丽、张志坚、...

You can really feel how hot the recently aired《人民的名义》is, although paid commenters cannot be ruled out.

> **jieba and the word cloud**

```python
# Word frequency statistics
contentAll = ""
for item in df['detail_fulltext']:
    contentAll = contentAll + item
# How many characters are there in total?
print('此次分析的数据中一共有 %d 个字。' % len(contentAll))
# Output: 此次分析的数据中一共有 1937622 个字。

segment = []
segs = jieba.cut(contentAll)
for seg in segs:
    if len(seg) > 1 and seg != '\r\n':
        segment.append(seg)
words_df = pd.DataFrame({'segment': segment})
words_df.head()

ancient_chinese_stopwords = pd.Series(['我们','没有','可以','什么','还是','一个','就是','这个','怎么','但是','不是','之后','通过','所以','现在','如果','为什么','这些','需要','这样','目前','大多','时候','或者','这样','如果','所以','因为','这些','他们','那么','开始','其中','这么','成为','还有','已经','可能','对于','之后','10','20','很多','其实','自己','当时','非常','表示','不过','出现','认为','利亚','罗斯','& &'])
words_df = words_df[~words_df.segment.isin(ancient_chinese_stopwords)]

# Count the word frequencies and plot them
words_stat = words_df.groupby(by=['segment'])['segment'].agg({"number": np.size})
words_stat = words_stat.reset_index().sort(columns="number", ascending=False)
words_stat_sort = words_stat.sort_values(['number'], ascending=[0])
sns.set_color_codes("muted")
sns.barplot(x='segment', y='number', data=words_stat_sort[:11], color="b")
plt.ylabel('出现次数')
plt.title("前10个最常见词统计")
plt.show()
```

*(Figure: the 10 most frequent words)*

中国 (China), 美国 (USA), 公司 (company), 市场 (market) and similar words rank at the top.

```python
# Word cloud analysis
from scipy.misc import imread
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

wordlist_after_jieba = jieba.cut(contentAll, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)
bimg = imread('hhldata.jpg')
my_wordcloud = WordCloud(
    background_color='white',             # background colour
    mask=bimg,                            # background image
    max_words=200,                        # maximum number of words shown
    stopwords=ancient_chinese_stopwords,  # stop words
    font_path='msyh.ttf',                 # font; Chinese cannot be displayed without it
    max_font_size=120,                    # maximum font size
    random_state=30,                      # number of random colour schemes
).generate(wl_space_split)
# Colour the cloud based on the image
image_colors = ImageColorGenerator(bimg)
# my_wordcloud.recolor(color_func=image_colors)
# Show the image
plt.figure(figsize=(12, 9))
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
```

*(Figure: the generated word cloud)*

> **Naive Bayes text topic classification**

```python
# Imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


class LanguageDetector():
    def __init__(self, classifier=LogisticRegression(penalty='l2')):
        self.classifier = classifier
        self.vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=20000, preprocessor=self._remove_noise)

    def _remove_noise(self, document):
        noise_pattern = re.compile("|".join(["http\S+", "\@\w+", "\#\w+"]))
        clean_text = re.sub(noise_pattern, "", document)
        return clean_text

    def features(self, X):
        return self.vectorizer.transform(X)

    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    def predict(self, x):
        return self.classifier.predict(self.features([x]))

    def score(self, X, y):
        return self.classifier.score(self.features(X), y)


x = df_2['detail_fulltext']
y = df_2['setType']
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
language_detector = LanguageDetector()
language_detector.fit(x_train, y_train)
print(language_detector.score(x_test, y_test))
```

The final score is **66.10%** (it should get higher with more data).

**Sentiment analysis**

```python
# Compute the scores
def sentiment_score(senti_score_list):
    score = []
    for review in senti_score_list:
        score_array = np.array(review)
        Pos = np.sum(score_array[:, 0])
        Neg = np.sum(score_array[:, 1])
        AvgPos = np.mean(score_array[:, 0])
        AvgPos = float('%.1f' % AvgPos)
        AvgNeg = np.mean(score_array[:, 1])
        AvgNeg = float('%.1f' % AvgNeg)
        StdPos = np.std(score_array[:, 0])
        StdPos = float('%.1f' % StdPos)
        StdNeg = np.std(score_array[:, 1])
        StdNeg = float('%.1f' % StdNeg)
        score.append([Pos, Neg, AvgPos, AvgNeg, StdPos, StdNeg])
    return score


contentAll = contentAll.replace(' ', '').replace(',', '')
# Compute Pos, Neg, AvgPos, AvgNeg, StdPos, StdNeg
start_all_time = time.time()
print(sentiment_score(sentiment_score_list(contentAll)))
end_all_time = time.time()
work_all_time = end_all_time - start_all_time
print("试验总共所花时间为:%.2f s" % work_all_time)
```

The resulting positive-word and negative-word scores are:

```text
[[.625, .25, 596.6, 724.4]]
```

Overall, the news still leans positive.

> **If you want the complete Jupyter notebook, reply "一点资讯" to the WeChat public account 【通往数据自由之路】.**

About the author: 何红亮, a fellow traveller on the road of data science. Through 通往数据自由之路 I hope to record my own progress from data rookie to data practitioner and share what I see, do and feel. You can follow the Zhihu column and WeChat public account of the same name: 通往数据自由之路.
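The LanguageDetector above also exposes a predict method, so once it has been fitted you can classify a previously unseen article. The snippet below is only a usage sketch of that existing method; the input text is invented, and the returned label will be whatever values df_2['setType'] contains:

```python
# Assumes language_detector has already been fitted as in the classification snippet above.
new_article = '昨晚的比赛中,主队凭借最后一分钟的进球逆转取胜,球迷沸腾了。'  # made-up sports-style text
predicted = language_detector.predict(new_article)
print(predicted)  # a one-element array with the predicted setType label
```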
---

# Preface

What this post covers: writing the simplest possible crawler in the shortest possible time, one that can grab the titles and contents of forum posts.

Intended audience: complete beginners who have never written a crawler.

# Getting started

### 0. Prerequisites

Things you need: Python, Scrapy, and an IDE or any text editor you like.

### 1. The tech department has decided: you are writing the crawler.

Create a working directory anywhere, then create a project from the command line. The project is called miao here; replace it with any name you like.

```text
scrapy startproject miao
```

You will then get the directory structure created by Scrapy:

*(Figure: the generated project layout)*

Create a Python file in the spiders folder, for example miao.py, as the crawler script, with the following content:

```python
import scrapy


class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # start_urls is the list of initial pages we want to crawl
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # This is the parse callback. Unless told otherwise, pages fetched by Scrapy are handled by this function.
    # All page processing and analysis happens here; in this example we simply print the page content.
    def parse(self, response):
        print response.body
```

### 2. Give it a run?

From the command line:

```text
cd miao
scrapy crawl NgaSpider
```

You will see that the crawler has printed the first page of the Starcraft board of your forum, although with no processing at all it is mixed together with HTML tags and JS scripts.

# Parsing

Next we analyse the page we just fetched and distil the post titles out of that pile of HTML and JS. Parsing pages is honestly grunt work and there are plenty of ways to do it; here we only cover XPath.

### 0. Why not try the magical XPath?

Look at the stuff we grabbed, or open the page in Chrome and press F12 to inspect its structure. Every title is wrapped in an HTML tag like this:

```text
<a href='/read.php?tid=' id='t_tt1_33' class='topic'>[合作模式] 合作模式修改设想</a>
```

The href is the address of the post (with the forum address prepended, of course), and the content wrapped by the tag is the post title. So we use XPath's positioning to pick out every element with class='topic'.

### 1. See what XPath does

Add this import at the top:

```text
from scrapy import Selector
```

and change the parse function to:

```python
    def parse(self, response):
        selector = Selector(response)
        # XPath extracts all tags with class=topic; this is of course a list.
        # Every element of the list is one of the HTML tags we are after.
        content_list = selector.xpath("//*[@class='topic']")
        # Iterate over the list and handle each tag
        for content in content_list:
            # Parse the tag and pull out the post title
            topic = content.xpath('string(.)').extract_first()
            print topic
            # Pull out the post URL
            url = self.host + content.xpath('@href').extract_first()
```

Run it again and you will see the titles and URLs of all posts on the first page of the Starcraft board.

# Recursion

Next we want the content of every post. This is where Python's yield comes in:

```text
yield Request(url=url, callback=self.parse_topic)
```

This tells Scrapy to fetch this URL and then parse the fetched page with the specified parse_topic function. So we need to define a new function to analyse the content inside a post.

The complete code:

```python
import scrapy
from scrapy import Selector
from scrapy import Request


class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # In this example only one page is given as the starting url.
    # Reading the start urls from a database, a file or anywhere else works just as well.
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # Entry point of the crawler. Initialisation can be done here, e.g. reading start urls from a file or database.
    def start_requests(self):
        for url in self.start_urls:
            # Put the start url into Scrapy's crawl queue and specify the parse callback.
            # Scrapy schedules the request, fetches the url and brings the content back.
            yield Request(url=url, callback=self.parse_page)

    # Board parser: extracts the titles and addresses of the posts on a board page
    def parse_page(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='topic']")
        for content in content_list:
            topic = content.xpath('string(.)').extract_first()
            print topic
            url = self.host + content.xpath('@href').extract_first()
            # Put the post url into the crawl queue with its own parse callback
            yield Request(url=url, callback=self.parse_topic)
        # Pagination info could be parsed here to crawl several pages of the board

    # Post parser: extracts the content of every floor of a post
    def parse_topic(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='postcontent ubbcode']")
        for content in content_list:
            content = content.xpath('string(.)').extract_first()
            print content
        # Pagination info could be parsed here to crawl several pages of the post
```

At this point the crawler can fetch the titles of all posts on the first page of the board, and the content of every floor on the first page of every post. Crawling more pages works the same way: parse the pagination URL, set a stop condition, and assign the right parse callback.

# Pipelines

Pipelines process the content that has already been crawled and parsed; through them you can write to local files or databases.

### 0. Define an Item

Create an items.py file in the miao folder:

```python
from scrapy import Item, Field


class TopicItem(Item):
    url = Field()
    title = Field()
    author = Field()


class ContentItem(Item):
    url = Field()
    content = Field()
    author = Field()
```

Here we define two simple classes describing our crawl results.

### 1. Write a handler

Find the pipelines.py file in the miao folder; Scrapy should have generated it already. We can put a handler there:

```python
class FilePipeline(object):

    ## Every parsed result is handed to this function by Scrapy
    def process_item(self, item, spider):
        if isinstance(item, TopicItem):
            ## File writing, database writing, etc. can be done here
            pass
        if isinstance(item, ContentItem):
            ## File writing, database writing, etc. can be done here
            pass
        return item
```

### 2. Call it from the crawler

To invoke this handler we only need to emit items from the crawler, for example by changing the original content parser to:

```python
    def parse_topic(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='postcontent ubbcode']")
        for content in content_list:
            content = content.xpath('string(.)').extract_first()
            ## Everything above is the original code
            ## Create a ContentItem and put what we crawled into it
            item = ContentItem()
            item["url"] = response.url
            item["content"] = content
            item["author"] = ""  ## omitted
            ## And that is all it takes:
            ## Scrapy hands this item to the FilePipeline we just wrote
            yield item
```

### 3. Register the pipeline in the settings

Find settings.py and add:

```text
ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}
```

Now every

```text
yield item
```

in the crawler is handled by this FilePipeline. The number 400 at the end is the priority. You can configure several pipelines here; Scrapy hands each item to them in priority order, and the output of each pipeline is passed on to the next one. Multiple pipelines can be configured like this:

```text
ITEM_PIPELINES = {
    'miao.pipelines.Pipeline00': 400,
    'miao.pipelines.Pipeline01': 401,
    'miao.pipelines.Pipeline02': 402,
    'miao.pipelines.Pipeline03': 403,
}
```

# Middleware

Middleware lets us modify the request information: the user agent, proxies, login details and so on are all commonly configured through middleware.

### 0. Configuring middleware

Similar to pipelines, add the middleware names in settings.py, for example:

```text
DOWNLOADER_MIDDLEWARES = {
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}
```

### 1. The damn site checks the UA, so I want to rotate the UA

Some sites refuse requests without a user agent. Create a middleware.py in the miao folder:

```python
import random


agents = [
    "Mozilla/5.0 (W U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (W U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (W U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (W U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (W U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (W U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
]


class UserAgentMiddleware(object):

    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent
```

This is a simple middleware that randomly rotates the UA; the agents list can be extended as you like.

### 2. The damn site bans my IP, so I want a proxy

If, say, a proxy is listening locally on 127.0.0.1 port 8123, a middleware can likewise make the crawler go through that proxy when reaching the target site. Add to the same middleware.py:

```python
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Fill in your own proxy here.
        # If you bought proxies, fetch the current proxy list from the vendor's API and pick one at random.
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy
```

Many sites limit the number of visits and temporarily ban IPs that request too frequently. If needed you can buy proxies online; vendors usually provide an API that returns the currently available IP pool, and you pick one and fill it in here.

# Some common settings

Frequently used options in settings.py:

```text
# Delay between two requests, in seconds.
DOWNLOAD_DELAY = 5
# Whether to retry on failed requests
RETRY_ENABLED = True
# Retry when these HTTP status codes are returned
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
# Number of retries
RETRY_TIMES = 5
# Pipeline concurrency: how many items can be processed by pipelines at the same time
CONCURRENT_ITEMS = 200
# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 100
# Maximum concurrency per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 50
# Maximum concurrency per IP
CONCURRENT_REQUESTS_PER_IP = 50
```

# But I insist on using PyCharm

If you really want PyCharm as the development and debugging tool, configure the run configuration as follows.

On the Configuration page, set Script to the path of Scrapy's cmdline.py, which for me is:

```text
/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py
```

Then set Script parameters to the crawler's name, in this example:

```text
crawl NgaSpider
```

Finally, set the Working directory to the directory that contains your settings.py.

*(Figure: example PyCharm run configuration)*

Press the little green arrow and debug away.
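The FilePipeline in the Pipelines section above leaves the actual writing as placeholder comments. As a rough sketch of one way to fill it in (not part of the original tutorial, written for Python 3, with arbitrary file names), each item can be dumped as one JSON line:

```python
import json

from miao.items import TopicItem, ContentItem


class FilePipeline(object):

    def open_spider(self, spider):
        # Called once when the spider starts; open one output file per item type
        self.topic_file = open('topics.jl', 'w')
        self.content_file = open('contents.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.topic_file.close()
        self.content_file.close()

    def process_item(self, item, spider):
        # Serialise the item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        if isinstance(item, TopicItem):
            self.topic_file.write(line)
        elif isinstance(item, ContentItem):
            self.content_file.write(line)
        return item
```

For quick experiments, Scrapy's built-in feed export can also skip the custom pipeline entirely, e.g. `scrapy crawl NgaSpider -o topics.json`.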
---

This post is the detailed write-up of a free open course published by Cui Qingcai; you can watch the video here: 爬取知乎所有用户详细信息 (https://edu.hellobi.com/course/163). Published with Cui Qingcai's authorisation; reproduction without permission is prohibited. It works best if you read along while watching the video.

This section walks through a practical Scrapy crawler that collects detailed information about all Zhihu users.

## Goals of this section

- Starting from one big V (a high-profile user), recursively crawl the follower list and the following list, and thereby collect detailed information about all Zhihu users.
- Store the results in MongoDB and de-duplicate them.

## Approach

Everyone has a following list and a follower list, and big Vs in particular have lots of followers and followings.

If we start from one big V, we can first fetch his profile, then his follower and following lists, then walk through every user in those lists, fetch each of their profiles plus their own follower and following lists, then walk through those lists in turn, and keep recursing. One user leads to a hundred, a hundred to ten thousand, ten thousand to a million: the social graph naturally forms a crawl net that can reach every user. Users with zero followers and zero followings can simply be ignored.

How do we obtain the information? No need to worry: by inspecting Zhihu's requests we can find the relevant API endpoints, and requesting those endpoints gives us the user details and the follower and following lists.

Now let's start the crawl.

## Requirements

**Python 3**

This project uses Python 3; make sure it is installed before you start.

**Scrapy**

Scrapy is a powerful crawling framework; install it with:

```text
pip3 install scrapy
```

**MongoDB**

A non-relational database. Install MongoDB and start its service before the project begins.

**PyMongo**

Python's MongoDB driver; install it with:

```text
pip3 install pymongo
```

## Create the project

With the environment ready we can start the project. First create a project from the command line:

```text
scrapy startproject zhihuuser
```

## Create the spider

Next we create a spider, again from the command line, but this time from inside the project directory:

```text
cd zhihuuser
scrapy genspider zhihu www.zhihu.com
```

## Disable ROBOTSTXT_OBEY

Open settings.py and change ROBOTSTXT_OBEY to False:

```text
ROBOTSTXT_OBEY = False
```

It defaults to True, meaning the crawler obeys the rules in robots.txt. What is robots.txt? Roughly speaking, it is a file following the Robots protocol, stored on the site's server, whose job is to tell search-engine crawlers which directories of the site should not be crawled and indexed. When Scrapy starts, it first fetches the site's robots.txt and uses it to decide the crawl scope.

Of course, we are not building a search engine, and in some cases the content we want is exactly what robots.txt forbids. So at times we set this option to False and decline to follow the Robots protocol. This particular crawl might not even be restricted by it, but as a rule we disable it first.

## A first crawl attempt

Without changing any code, run the crawl:

```text
scrapy crawl zhihu
```

You will see this error in the crawl result:

```text
500 Internal Server Error
```

The status code returned by Zhihu is 500, so the crawl did not succeed. The reason is that we did not add any request headers: Zhihu looks at the User-Agent, notices it is not a browser, and returns an error response.

So the next step is to add header information. You can add it in the Request parameters, or in the spider's custom_settings, but the simplest place of all is the global settings. Open settings.py, uncomment DEFAULT_REQUEST_HEADERS, and add:

```text
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (M Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
```

This adds headers to your requests: if you do not set headers yourself, this request header is used, and with the User-Agent in place our crawler can pass itself off as a browser.

Re-run the crawler:

```text
scrapy crawl zhihu
```

This time the returned status code is normal. With that problem solved, we can analyse the page logic and implement the crawler properly.

## Crawl flow

Next, we need to start by exploring how to get the user…
