Scraping means having the crawler extract the data of interest from each downloaded page so that it can be used in an application.
Three ways to scrape a web page
1. Regular expressions
View the page source and use the browser's developer tools to study its structure, then write a regular expression that matches the data you want.
Take the following page as an example: http://example.webscraping.com/view/United-Kingdom-239
To extract the country's area, the code is as follows:
import urllib2
import re

def scrape(html):
    area = re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
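To see what this expression matches, here is a quick self-contained check against a simplified copy of the target table row (the markup below is an approximation for illustration, not copied from the live page):

import re

# simplified approximation of the row on the country page (not the exact live markup)
sample_row = ('<tr id="places_area__row">'
              '<td class="w2p_fl">Area: </td>'
              '<td class="w2p_fw">244,820 square kilometres</td>'
              '</tr>')
print re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>',
                 sample_row)[0]
# prints: 244,820 square kilometres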
Regular expressions give a very quick way to grab the data, but they are brittle: a small change to the page layout is enough to break them.
2. Beautiful Soup
Beautiful Soup is a very popular Python module that parses a web page and provides a convenient interface for locating content within it.
Install it with: pip install beautifulsoup4
import urllib2
from bs4 import BeautifulSoup

def scrape(html):
    soup = BeautifulSoup(html, "html.parser")  # parse the page, fixing up badly formed tags
    tr = soup.find(attrs={'id': 'places_area__row'})  # locate the area row
    td = tr.find(attrs={'class': 'w2p_fw'})  # locate the area cell
    area = td.text
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
This code is more verbose than the regular expression, but it is easier to understand and to construct.
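Part of what makes Beautiful Soup convenient is that it repairs badly formed markup before you query it. A minimal sketch with made-up broken HTML (an unquoted attribute and unclosed tags); exactly how the nesting is resolved depends on the underlying parser:

from bs4 import BeautifulSoup

# deliberately broken markup: unquoted attribute value and unclosed <li>/<ul> tags
broken_html = '<ul class=country><li>Area<li>Population'
soup = BeautifulSoup(broken_html, 'html.parser')
print soup.prettify()  # missing closing tags are filled in and the attribute value is quoted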
3. Lxml
Lxml is a Python wrapper around the libxml2 XML parsing library; it is fast, but its installation can be more involved than for the pure-Python alternatives.
import urllib2
import lxml.html

def scrape(html):
    tree = lxml.html.fromstring(html)
    td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]  # locate the area cell
    area = td.text_content()
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)
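lxml also exposes XPath directly, so the same cell can be located without the cssselect helper. A small sketch under the same assumptions about the page structure (scrape_xpath is just an illustrative name):

import lxml.html

def scrape_xpath(html):
    tree = lxml.html.fromstring(html)
    # same row as above, located with XPath instead of a CSS selector
    td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
    return td.text_content()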
Next, let's compare the performance of the three approaches.
# -*- coding: utf-8 -*-
import csv
import time
import urllib2
import re
import timeit
from bs4 import BeautifulSoup
import lxml.html
FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_{}__row">.*?<td class="w2p_fw">(.*?)</td>'.format(field), html).groups()[0]
    return results

def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_{}__row'.format(field)).find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content()
    return results
def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    NUM_ITERATIONS = 1000  # number of times to test each scraper
    for name, scraper in ('Regular expressions', regex_scraper), ('Beautiful Soup', beautiful_soup_scraper), ('Lxml', lxml_scraper):
        times[name] = []
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == regex_scraper:
                # the regular expression module will cache results
                # so need to purge this cache for meaningful timings
                re.purge()
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record end time of scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)
    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)

if __name__ == '__main__':
    main()
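Incidentally, the script imports timeit but times each run with time.time(). A roughly equivalent measurement with timeit would look like the sketch below (it reuses the html variable and scraper functions defined above; note that without re.purge() the regex timings benefit from re's internal cache):

import timeit

# time 1000 repetitions of each scraper on the already-downloaded page
for name, scraper in [('Regular expressions', regex_scraper),
                      ('Beautiful Soup', beautiful_soup_scraper),
                      ('Lxml', lxml_scraper)]:
    seconds = timeit.timeit(lambda: scraper(html), number=1000)
    print '{}: {:.2f} seconds'.format(name, seconds)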
Results of one run:
Regular expressions: 5.36 seconds
Beautiful Soup: 39.40 seconds
Lxml: 7.09 seconds
Beautiful Soup is implemented in pure Python, while the other two are backed by C libraries, which is why they are faster.
Conclusion:

Scraping method        Performance   Ease of use   Ease of installation
Regular expressions    fast          hard          easy (built-in module)
Beautiful Soup         slow          easy          easy (pure Python)
Lxml                   fast          easy          relatively hard

Adding a scrape callback to the link crawler
A previous article implemented a link crawler. To reuse it for scraping, we add a scrape_callback parameter that handles the scraping behaviour: a callback is a function that gets invoked after a particular event (here, after each page has been downloaded).
Add the following to link_crawler:
def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='wswp', proxy=None, num_retries=1, scrape_callback=None):
    ……
    if scrape_callback:
        links.extend(scrape_callback(url, html) or [])
    ……
Now, simply by customising scrape_callback, the same crawler can be used to scrape other sites.
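For example, a purely hypothetical callback (not part of the original code) that records each page's title instead of country data could look like this:

import lxml.html

def title_callback(url, html):
    # hypothetical callback: print each downloaded page's <title>
    tree = lxml.html.fromstring(html)
    print url, tree.findtext('.//title')
    return []  # no extra links to add to the crawl queue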
Below, the earlier lxml scraping example is adapted into such a callback:
def scrape_callback(url, html):
    if re.search('/view/', url):
        tree = lxml.html.fromstring(html)
        row = [tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content() for field in FIELDS]
        print url, row
Extending this further, the scraped data can be written to a CSV file; the complete code is as follows:
import csv
import re
import urlparse
import lxml.html
from link_crawler import link_crawler
class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)

if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/', '/(index|view)', scrape_callback=ScrapeCallback())
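After the crawl finishes, the saved rows can be checked by reading countries.csv back; a small usage sketch:

import csv

with open('countries.csv') as f:
    reader = csv.reader(f)
    header = next(reader)  # the field names written in __init__
    for row in reader:
        print dict(zip(header, row))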