Scraping means having the crawler extract the data of interest from each downloaded page so that it can be used in an application.
Three ways to scrape a web page
1. Regular expressions
View the page source and use the browser's developer tools to study its structure, then write a regular expression that matches the data you want.
Take the following page as an example: http://example.webscraping.com/view/United-Kingdom-239
To extract the country's area, the code is as follows:
import urllib2
import re

def scrape(html):
    area = re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
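To see what this expression matches, here is a quick self-contained check against a simplified copy of the target table row (the markup below is an approximation for illustration, not copied from the live page):

import re

# simplified approximation of the row on the country page (not the exact live markup)
sample_row = ('<tr id="places_area__row">'
              '<td class="w2p_fl">Area: </td>'
              '<td class="w2p_fw">244,820 square kilometres</td>'
              '</tr>')
print re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>',
                 sample_row)[0]
# prints: 244,820 square kilometres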
Regular expressions give a very quick way to grab the data, but they are brittle: a small change to the page layout is enough to break them.
2. Beautiful Soup
Beautiful Soup is a very popular Python module that parses a web page and provides a convenient interface for locating content within it.
Install it with: pip install beautifulsoup4
import urllib2
from bs4 import BeautifulSoup

def scrape(html):
    soup = BeautifulSoup(html, "html.parser")  # parse the page, fixing up badly formed tags
    tr = soup.find(attrs={'id': 'places_area__row'})  # locate the area row
    td = tr.find(attrs={'class': 'w2p_fw'})  # locate the area cell
    area = td.text
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
This code is more verbose than the regular expression, but it is easier to understand and to construct.
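Part of what makes Beautiful Soup convenient is that it repairs badly formed markup before you query it. A minimal sketch with made-up broken HTML (an unquoted attribute and unclosed tags); exactly how the nesting is resolved depends on the underlying parser:

from bs4 import BeautifulSoup

# deliberately broken markup: unquoted attribute value and unclosed <li>/<ul> tags
broken_html = '<ul class=country><li>Area<li>Population'
soup = BeautifulSoup(broken_html, 'html.parser')
print soup.prettify()  # missing closing tags are filled in and the attribute value is quoted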
3. Lxml
Lxml is a Python wrapper around the libxml2 XML parsing library; it is fast, but its installation can be more involved than for the pure-Python alternatives.
import urllib2
import lxml.html

def scrape(html):
    tree = lxml.html.fromstring(html)
    td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]  # locate the area cell
    area = td.text_content()
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
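lxml also exposes XPath directly, so the same cell can be located without the cssselect helper. A small sketch under the same assumptions about the page structure (scrape_xpath is just an illustrative name):

import lxml.html

def scrape_xpath(html):
    tree = lxml.html.fromstring(html)
    # same row as above, located with XPath instead of a CSS selector
    td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
    return td.text_content()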
Next, let's compare the performance of the three approaches.
# -*- coding: utf-8 -*-
import csv
import time
import urllib2
import re
import timeit
from bs4 import BeautifulSoup
import lxml.html
FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_{}__row">.*?<td class="w2p_fw">(.*?)</td>'.format(field), html).groups()[0]
    return results

def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_{}__row'.format(field)).find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results
def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    NUM_ITERATIONS = 1000  # number of times to test each scraper
    for name, scraper in ('Regular expressions', regex_scraper), ('Beautiful Soup', beautiful_soup_scraper), ('Lxml', lxml_scraper):
        times[name] = []
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == regex_scraper:
                # the regular expression module will cache results
                # so need to purge this cache for meaningful timings
                re.purge()
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record end time of scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)
    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)

if __name__ == '__main__':
    main()
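Incidentally, the script imports timeit but times each run with time.time(). A roughly equivalent measurement with timeit would look like the sketch below (it reuses the html variable and scraper functions defined above; note that without re.purge() the regex timings benefit from re's internal cache):

import timeit

# time 1000 repetitions of each scraper on the already-downloaded page
for name, scraper in [('Regular expressions', regex_scraper),
                      ('Beautiful Soup', beautiful_soup_scraper),
                      ('Lxml', lxml_scraper)]:
    seconds = timeit.timeit(lambda: scraper(html), number=1000)
    print '{}: {:.2f} seconds'.format(name, seconds)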
Results of one run:
Regular expressions: 5.36 seconds
Beautiful Soup: 39.40 seconds
Lxml: 7.09 seconds
Beautiful Soup is implemented in pure Python, while the other two are backed by C libraries, which is why they are faster.
Conclusion:

Scraping method        Performance   Ease of use   Ease of installation
Regular expressions    fast          hard          easy (built-in module)
Beautiful Soup         slow          easy          easy (pure Python)
Lxml                   fast          easy          relatively hard

Adding a scrape callback to the link crawler
A previous article implemented a link crawler. To reuse it for scraping, we add a scrape_callback parameter that handles the scraping behaviour: a callback is a function that gets invoked after a particular event (here, after each page has been downloaded).
Add the following to link_crawler:
def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='wswp', proxy=None, num_retries=1, scrape_callback=None):
    ……
    if scrape_callback:
        links.extend(scrape_callback(url, html) or [])
    ……
Now, simply by customising scrape_callback, the same crawler can be used to scrape other sites.
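For example, a purely hypothetical callback (not part of the original code) that records each page's title instead of country data could look like this:

import lxml.html

def title_callback(url, html):
    # hypothetical callback: print each downloaded page's <title>
    tree = lxml.html.fromstring(html)
    print url, tree.findtext('.//title')
    return []  # no extra links to add to the crawl queue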
Below, the earlier lxml scraping example is adapted into such a callback:
def scrape_callback(url, html):
    if re.search('/view/', url):
        tree = lxml.html.fromstring(html)
        row = [tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content() for field in FIELDS]
        print url, row
Extending this further, the scraped data can be written to a CSV file; the complete code is as follows:
import csv
import re
import urlparse
import lxml.html
from link_crawler import link_crawler
class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)

if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/', '/(index|view)', scrape_callback=ScrapeCallback())
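After the crawl finishes, the saved rows can be checked by reading countries.csv back; a small usage sketch:

import csv

with open('countries.csv') as f:
    reader = csv.reader(f)
    header = next(reader)  # the field names written in __init__
    for row in reader:
        print dict(zip(header, row))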