Learning Scrapy Crawlers (Part 1): The wooyun White Hat Essence Leaderboard

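1. Analysis

wooyun has a little over two thousand white hats. The wooyun white hats page is split into three leaderboards: essence, popularity, and contribution. All three hold the same data and differ only in sort order, so the essence leaderboard is used as the data source here. Its URL is http://wooyun.org/whitehats/do/1/page/n, where n is the page number. The fields that can be extracted are registration date, nickname, level, number of essence vulnerabilities, essence ratio, and wooyun homepage.

Manual crawling: every URL has to be written into start_urls. There are 104 pages, so that means 104 links.

Automatic crawling: start_urls holds a single link, and the program discovers the remaining links by itself.

2. Program design

2.1 Manual crawling

Define the Item container in items.py: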
    import scrapy


    class WooyunRankItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        '''
        Fields, in order: registration date, wooyun nickname, level,
        number of essence vulnerabilities, essence ratio, wooyun homepage.
        '''
        register_date = scrapy.Field()
        nick_name = scrapy.Field()
        rank_level = scrapy.Field()
        essence_count = scrapy.Field()
        essence_ratio = scrapy.Field()
        homepage = scrapy.Field()

Writing the spider:
    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    from scrapy.spiders import Spider
    from scrapy.selector import Selector

    from wooyun_rank.items import WooyunRankItem


    class WooyunSpider(Spider):
        '''
        Crawl the wooyun essence leaderboard.
        '''
        name = "wooyunrank"
        # Wait 5 seconds between requests
        download_delay = 5
        allowed_domains = ["wooyun.org"]
        # The leaderboard has 104 pages, so list pages 1 through 104
        start_urls = [
            "http://wooyun.org/whitehats/do/1/page/" + str(page)
            for page in range(1, 105)
        ]

        def parse(self, response):
            sel = Selector(response)
            infos = sel.xpath("/html/body/div[5]/table/tbody/tr")
            for info in infos:
                # extract() returns a list of unicode strings. Each list here
                # has a single element, so the first element can be taken
                # directly. Lists with several elements have to be reassembled;
                # they can also be encoded first, for example:
                #     urls = [url.encode('utf-8') for url in urls]
                # Python supports the list comprehension [url for url in urls].

                # Create a fresh item for every table row
                item = WooyunRankItem()
                item["register_date"] = info.xpath("th[1]/text()").extract()[0]
                item["rank_level"] = info.xpath("th[2]/text()").extract()[0]
                item["essence_count"] = info.xpath("th[3]/text()").extract()[0]
                item["essence_ratio"] = info.xpath("th[4]/text()").extract()[0]
                item["nick_name"] = info.xpath("td/a/text()").extract()[0]
                yield item

Writing pipelines.py (first attempt):
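    import os
    import csv


    class WooyunRankPipeline(object):
        '''
        process the item returned from the spider
        '''
        def process_item(self, item, spider):
            if not os.path.exists("wooyunrank.csv"):
                with open("wooyunrank.csv", "w") as f:
                    dict_writer = csv.DictWriter(f)
                    dict_writer.writerow("register_date", "nick_name", "rank_level", "essence_count", "essence_ratio")
                    f.close()
            with open("wooyunrank.csv", "a") as f:
                dict_writer = csv.DictWriter(f)
                dict_writer.writerow(item)
                f.close()
            return item

This code has an obvious problem: the file is reopened for every single item, which leads to resource contention and can leave the file impossible to open. The code is also wrong even when no contention occurs, because csv.DictWriter is constructed without its required fieldnames argument and writerow() is called with five separate arguments instead of one row. That is beside the point here, though. The improved version follows: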
    import csv


    class WooyunRankPipeline(object):
        '''
        process the item returned from the spider
        '''
        def __init__(self):
            # Open the file once and keep it open for the whole crawl
            self.file_obj = open("wooyunrank.csv", "w", newline="", encoding="utf-8")
            fieldnames = ["register_date", "nick_name", "rank_level", "essence_count", "essence_ratio"]
            self.dict_writer = csv.DictWriter(self.file_obj, fieldnames=fieldnames)
            self.dict_writer.writeheader()

        def process_item(self, item, spider):
            self.dict_writer.writerow(item)
            return item

2.2 Automatic crawling

Writing spider.py:
    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    import sys

    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    from wooyunrankauto.items import WooyunrankautoItem


    class WooyunSpider(CrawlSpider):
        '''
        Crawl the wooyun essence leaderboard.
        '''
        name = "wooyunrankauto"
        # Wait 1 second between requests
        download_delay = 1
        allowed_domains = ["wooyun.org"]
        start_urls = [
            "http://wooyun.org/whitehats/do/1/page/1"
        ]
        rules = [
            Rule(LinkExtractor(allow=r"/whitehats/do/1/page/\d+"), follow=True, callback='parse_item')
        ]

        # def __init__(self):
        #     reload(sys)
        #     if sys.getdefaultencoding() != "utf-8":
        #         sys.setdefaultencoding("utf-8")

        def parse_item(self, response):
            sel = Selector(response)
            infos = sel.xpath("/html/body/div[5]/table/tbody/tr")
            for info in infos:
                # extract() returns a list of unicode strings. Each list here
                # has a single element, so the first element can be taken
                # directly. Lists with several elements have to be reassembled;
                # they can also be encoded first, for example:
                #     urls = [url.encode('utf-8') for url in urls]
                # Python supports the list comprehension [url for url in urls].
                item = WooyunrankautoItem()
                item["register_date"] = info.xpath("th[1]/text()").extract()[0]
                item["rank_level"] = info.xpath("th[2]/text()").extract()[0]
                item["essence_count"] = info.xpath("th[3]/text()").extract()[0]
                item["essence_ratio"] = info.xpath("th[4]/text()").extract()[0]
                item["nick_name"] = info.xpath("td/a/text()").extract()[0]
                return item
                # yield item

A problem shows up here. With return item, each page obviously returns only one person's record, although every page does get crawled. With yield item, every record on a page is captured, but only a few pages get crawled. One workaround is to store the whole extracted list for each field in item["xxx"] and split it into individual records in pipelines.py.

3. Open problems

When writing the file, opening it for every item causes resource contention, while keeping it open without ever closing it ties up the resource. Is there a middle ground?

The automatic crawling problem described above has no solution yet. A guess: is it because the pages to scrape and the pages to follow are matched by the same expression?

4. Notes

After writing the pipeline, enable it by editing the ITEM_PIPELINES setting in settings.py.

Set download_delay to reduce the load on the server and avoid getting banned.

If the file encoding needs special handling, do it in pipelines.py. Do not override the constructor unless you understand what it does.
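On the file-handling question in section 3, one possible middle ground is to use the open_spider and close_spider hooks that Scrapy calls on an item pipeline when a spider starts and finishes, so the file is opened once and closed once. The sketch below is only an illustration under assumptions: the class name CsvLifecyclePipeline, the path parameter, the demo() helper, and the sample rows are all made up for this example and are not part of the original project.

```python
import csv
import os
import tempfile

FIELDNAMES = ["register_date", "nick_name", "rank_level",
              "essence_count", "essence_ratio"]


class CsvLifecyclePipeline(object):
    """Hypothetical pipeline: one file handle per crawl, closed at the end."""

    def __init__(self, path="wooyunrank.csv"):
        self.path = path
        self.file_obj = None
        self.dict_writer = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider is opened
        self.file_obj = open(self.path, "w", newline="", encoding="utf-8")
        self.dict_writer = csv.DictWriter(self.file_obj, fieldnames=FIELDNAMES)
        self.dict_writer.writeheader()

    def process_item(self, item, spider):
        # Scrapy calls this for every item the spider produces
        self.dict_writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider is closed
        self.file_obj.close()


def demo():
    """Drive the pipeline by hand with made-up rows and read the file back."""
    path = os.path.join(tempfile.mkdtemp(), "wooyunrank.csv")
    pipeline = CsvLifecyclePipeline(path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"register_date": "2012-01-01", "nick_name": "alice",
                           "rank_level": "100", "essence_count": "3",
                           "essence_ratio": "30%"}, spider=None)
    pipeline.process_item({"register_date": "2013-05-20", "nick_name": "bob",
                           "rank_level": "80", "essence_count": "1",
                           "essence_ratio": "10%"}, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Because the file is opened in open_spider rather than in process_item, nothing is reopened per item, and because close_spider closes it, the handle is released when the crawl ends.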

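The workaround mentioned in section 2.2 (passing the whole list for each field and splitting it in pipelines.py) could look roughly like the sketch below. It is a guess at one way to do the splitting, not the author's implementation: the helper name split_columns and the sample page_item are hypothetical, and the code assumes every field list on a page has the same length.

```python
FIELDNAMES = ["register_date", "nick_name", "rank_level",
              "essence_count", "essence_ratio"]


def split_columns(item):
    """Turn {"field": [v1, v2, ...]} into a list of one dict per table row."""
    columns = [item[name] for name in FIELDNAMES]
    # Every column must describe the same number of rows
    if len(set(len(column) for column in columns)) > 1:
        raise ValueError("field lists have different lengths")
    # zip(*columns) regroups the values row by row
    return [dict(zip(FIELDNAMES, values)) for values in zip(*columns)]


# Made-up data standing in for one leaderboard page
page_item = {
    "register_date": ["2012-01-01", "2013-05-20"],
    "nick_name": ["alice", "bob"],
    "rank_level": ["100", "80"],
    "essence_count": ["3", "1"],
    "essence_ratio": ["30%", "10%"],
}
rows = split_columns(page_item)
```

In a pipeline, process_item would call split_columns(item) and write each resulting row with the CSV writer, so the spider only has to return one item per page.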