Python Web Scraping in Practice: Scraping Pictures of Beautiful Women from Toutiao

Yidianhao (一点号) · 天善智能 · yesterday

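I am a heavy Toutiao user and often rely on it for my picture-browsing habit. If you don't believe me, try searching Toutiao for 街拍 (street snaps): every result is a lovely sight.

What if you want to save those images? We can use a Python crawler.

Life is short, I use Python!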

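1. Tools

Python 3.5, Sublime Text, Windows 7

2. Analysis (the complete code is in step 3)

The search results return 20 articles by default. When the page is scrolled to the bottom, Toutiao loads more articles via AJAX. Press F12 in the browser to open the developer tools (I use Chrome), click the Network tab, and trigger loading more articles. The corresponding HTTP request then shows up: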


The Request URL this time is:

toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=1

Let's see what it returns:

    import json
    from urllib import request

    url = "http://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=1"

    # Fetch the search endpoint and parse the JSON body (decode() defaults to UTF-8)
    with request.urlopen(url) as res:
        d = json.loads(res.read().decode())
        print(d)

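What we need turns out to be under the 'data' key. Next, open one article and try downloading the images of a single article.

    from urllib import request

    from bs4 import BeautifulSoup

    url = 'http://www.toutiao.com/a6314996711535444226/#p=1'

    with request.urlopen(url) as res:
        # errors='ignore' skips any bytes that are not valid UTF-8
        soup = BeautifulSoup(res.read().decode(errors='ignore'), 'html.parser')
        # The article body lives in <div id="article-main">
        article_main = soup.find('div', id='article-main')
        # Collect the src of every <img> that actually has one
        photo_list = [photo.get('src') for photo in article_main.find_all('img') if photo.get('src')]
        print(photo_list)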

Output:

['p3.pstatp.com/large/159f00010b30d6736512', 'p1.pstatp.com/large/1534000488c40143b9ce', 'p3.pstatp.com/large/159d0001834ff61ccb8c', 'p1.pstatp.com/large/1534000488c1cd02b5ed']

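First we parse the page with BeautifulSoup and use the find method to locate the div block whose id is article-main. Inside that div we use find_all to collect every img tag and read the value of its src attribute, which gives us the list of URLs for all images in the article.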
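If BeautifulSoup is not installed, the same "find the container, then collect img src values" idea can be sketched with the standard library's html.parser. This is a substitute technique, not what the article uses. The class name ImgSrcCollector, the function extract_img_sources, and the sample HTML are all made up for illustration, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> nested inside <div id=container_id>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # <div> nesting depth inside the target div; 0 means outside
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # a nested div inside the target
            elif attrs.get("id") == self.container_id:
                self.depth = 1           # entering the target div
        elif tag == "img" and self.depth > 0 and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_img_sources(html, container_id="article-main"):
    parser = ImgSrcCollector(container_id)
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical page fragment, only for demonstration
SAMPLE_HTML = """
<div id="sidebar"><img src="http://example.com/ad.jpg"></div>
<div id="article-main">
  <h1>Sample</h1>
  <div><img src="http://example.com/a.jpg"></div>
  <img src="http://example.com/b.jpg">
  <img alt="no source">
</div>
<img src="http://example.com/outside.jpg">
"""

if __name__ == "__main__":
    print(extract_img_sources(SAMPLE_HTML))
```

Only the two images inside the article-main div are collected; the sidebar image, the image without a src, and the image outside the div are skipped.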

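Next, save the image.

    from urllib import request

    photo_url = "http://p3.pstatp.com/large/159f00010b30d6736512"
    # Use the last path segment of the URL as the file name
    photo_name = photo_url.rsplit('/', 1)[-1] + '.jpg'

    # Download the image bytes and write them to a local file
    with request.urlopen(photo_url) as res, open(photo_name, 'wb') as f:
        f.write(res.read())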
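One detail worth noting: the URLs in the printed list appear without a scheme (possibly just an artifact of how the output was reproduced here), while urlopen needs a full URL such as http://p3.pstatp.com/.... A minimal sketch of tidying a URL and deriving the file name follows. The helper names normalize_photo_url and photo_filename are hypothetical, and defaulting to http simply mirrors the example above.

```python
def normalize_photo_url(url, default_scheme="http"):
    """Return url with a scheme, adding default_scheme when it is missing."""
    if url.startswith("//"):
        # Protocol-relative form: //host/path
        return default_scheme + ":" + url
    if "://" not in url:
        # Bare form: host/path
        return default_scheme + "://" + url
    return url


def photo_filename(url, extension=".jpg"):
    """Use the last path segment of the URL as the file name."""
    return url.rstrip("/").rsplit("/", 1)[-1] + extension


if __name__ == "__main__":
    raw = "p3.pstatp.com/large/159f00010b30d6736512"
    print(normalize_photo_url(raw))
    print(photo_filename(raw))
```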

Those are the basic steps. To summarize the crawl flow:

1) Specify the query parameters and submit the search request to toutiao.com/search_content/.
2) Parse the URLs of all articles out of the returned data (JSON), and send a request to each article.
3) Extract the article title and all image links from the returned data (HTML).
4) Send a request to each image link and save the returned image data locally (E:\jiepai).
5) Change the query parameters so that the server returns new articles, then go back to step 1.

3. Complete code

    import re
    import json
    import time
    import random
    from pathlib import Path
    from urllib import parse
    from urllib import error
    from urllib import request
    from datetime import datetime
    from http.client import IncompleteRead
    from socket import timeout as socket_timeout

    from bs4 import BeautifulSoup


    def _get_timestamp():
        """
        Requests sent to http://www.toutiao.com/search_content/ carry a timestamp
        parameter. This function takes the current timestamp and formats it the way
        Toutiao expects: the value of datetime.today() as a timestamp, with the
        decimal point removed and the last three digits dropped.
        """
        row_timestamp = str(datetime.timestamp(datetime.today()))
        return row_timestamp.replace('.', '')[:-3]


    def _create_dir(name):
        """
        Create a directory with the given name, using the pathlib library
        introduced in Python 3.4.
        """
        directory = Path(name)
        if not directory.exists():
            directory.mkdir()
        return directory


    def _get_query_string(data):
        """
        Encode the query parameters as a URL query string. For example:
        data = {
            'offset': offset,
            'format': 'json',
            'keyword': '街拍',
            'autoload': 'true',
            'count': 20,
            '_': 1480675595492
        }
        returns (the caller adds the leading '?'):
        offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&_=1480675595492
        """
        return parse.urlencode(data)


    def get_article_urls(req, timeout=10):
        with request.urlopen(req, timeout=timeout) as res:
            d = json.loads(res.read().decode()).get('data')

            # A missing or empty 'data' means there is nothing more to fetch
            if not d:
                print("All data has been requested...")
                return

            urls = [article.get('article_url') for article in d if article.get('article_url')]
            return urls


    def get_photo_urls(req, timeout=10):
        with request.urlopen(req, timeout=timeout) as res:
            # decode() defaults to UTF-8, but the response contains some non-UTF-8
            # content that would make decoding fail, so we use errors='ignore'
            # to skip those bytes
            soup = BeautifulSoup(res.read().decode(errors='ignore'), 'html.parser')
            article_main = soup.find('div', id='article-main')

            if not article_main:
                print("Could not locate the article body...")
                return

            # Guard against a missing <h1> instead of raising AttributeError
            h1 = article_main.h1
            heading = h1.get_text(strip=True) if h1 else ''

            if '街拍' not in heading:
                print("This is not a street-snap article!!!")
                return

            img_list = [img.get('src') for img in article_main.find_all('img') if img.get('src')]
            return heading, img_list


    def save_photo(photo_url, save_dir, timeout=10):
        photo_name = photo_url.rsplit('/', 1)[-1] + '.jpg'

        # pathlib's / operator joins save_dir and photo_name into a full path. For example:
        # save_dir = 'E:\jiepai'
        # photo_name = '11125841455748.jpg'
        # gives save_path = 'E:\jiepai\11125841455748.jpg'
        save_path = save_dir / photo_name

        with request.urlopen(photo_url, timeout=timeout) as res, save_path.open('wb') as f:
            f.write(res.read())
            # Report using the function's own arguments rather than module-level globals
            print('Downloaded image: {dir_name}/{photo_name}, requested URL: {url}'
                  .format(dir_name=save_dir.name, photo_name=photo_name, url=photo_url))


    if __name__ == '__main__':
        ongoing = True
        offset = 0  # request offset, increased by 20 each round
        root_dir = _create_dir(r'E:\jiepai')  # root directory for the saved images
        request_headers = {
            'Referer': 'http://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
        }

        while ongoing:
            timestamp = _get_timestamp()
            query_data = {
                'offset': offset,
                'format': 'json',
                'keyword': '街拍',
                'autoload': 'true',
                'count': 20,  # 20 articles per response
                '_': timestamp
            }
            query_url = 'http://www.toutiao.com/search_content/' + '?' + _get_query_string(query_data)
            article_req = request.Request(query_url, headers=request_headers)
            article_urls = get_article_urls(article_req)

            # If no more data comes back, everything has been requested: leave the loop
            if article_urls is None:
                break

            # Send a request to each article
            for a_url in article_urls:
                # Requesting an article can raise two exceptions: a connection timeout
                # (socket_timeout) and HTTPError, for example when the page does not exist.
                # On a timeout we rest for a while; on HTTPError we just skip the article.
                try:
                    photo_req = request.Request(a_url, headers=request_headers)
                    photo_urls = get_photo_urls(photo_req)

                    # No images in this article? Move on to the next one
                    if photo_urls is None:
                        continue

                    article_heading, photo_urls = photo_urls

                    # Use the article title as the directory for all of its images,
                    # with characters that Windows forbids in directory names removed.
                    dir_name = re.sub(r'[\\/:*?"<>|]', '', article_heading)
                    download_dir = _create_dir(root_dir / dir_name)

                    # Download the images in the article
                    for p_url in photo_urls:
                        # Image data arrives in chunks, so reading it may raise IncompleteRead
                        try:
                            save_photo(p_url, save_dir=download_dir)
                        except IncompleteRead as e:
                            print(e)
                            continue
                except socket_timeout:
                    print("Connection timed out, taking a break...")
                    time.sleep(random.randint(15, 25))
                    continue
                except error.HTTPError:
                    continue

            # One round is done: add 20 to the offset and fetch the next 20 articles
            offset += 20
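A side note on the timestamp: building it by string slicing depends on how many decimal digits str() happens to print for the float. Below is a small sketch of computing it arithmetically instead. It assumes the `_` parameter is simply the Unix time in milliseconds, which fits the 13-digit example 1480675595492 but is my reading rather than something the server documents. The name current_millis is hypothetical.

```python
import time


def current_millis(now=None):
    """Return Unix time in milliseconds as an int.

    `now` is an optional number of seconds since the epoch, mainly for testing;
    when omitted, the current time is used.
    """
    seconds = time.time() if now is None else now
    # round() avoids off-by-one results from float representation error
    return int(round(seconds * 1000))


if __name__ == "__main__":
    print(current_millis())
```

The result could be passed as the '_' value in query_data in place of the string returned by the slicing approach.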

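Likewise, with a small change to the code you can download images for any keyword you like. Roll up your sleeves and get whatever you want.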
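To make the keyword swap concrete, here is a minimal sketch that builds the search URL for an arbitrary keyword and offset. The parameter names are the ones visible in the Request URL captured earlier; the function name build_search_url and the constant SEARCH_ENDPOINT are made up for illustration, and whether the endpoint still accepts these parameters is not something this sketch can promise.

```python
from urllib import parse

SEARCH_ENDPOINT = "http://www.toutiao.com/search_content/"


def build_search_url(keyword, offset=0, count=20):
    """Build a search URL for the given keyword and paging offset."""
    query = {
        "offset": offset,
        "format": "json",
        "keyword": keyword,   # urlencode percent-encodes non-ASCII text as UTF-8
        "autoload": "true",
        "count": count,
        "cur_tab": 1,
    }
    return SEARCH_ENDPOINT + "?" + parse.urlencode(query)


if __name__ == "__main__":
    # Walk through the first three pages of results for one keyword
    for page in range(3):
        print(build_search_url("街拍", offset=page * 20))
```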

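A quick plug: I'm looking for fellow crawler enthusiasts.

I plan to scrape Toutiao's comments next, because the comments are more fun to read than the articles.

Since some people asked for the pictures, fine. This is only part of what was downloaded.

Link: pan.baidu.com/s/1qYuD20k Password: tr88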
