Python Web Crawler, Part 1: Simple HTTP Requests

Recently I have needed to write a web crawler, so I will record my experience implementing one here.

"Crawler" is just a name, really. Put simply, it all starts with HTTP requests.

Python's support for HTTP requests comes mainly from the urllib.request module. For example, to send an HTTP request:

from urllib import request

url = 'http://www.zhyea.com/2016/07/17/memory-analyzer-all.html'
response = request.urlopen(url)
content = response.read()
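Note that response.read() returns raw bytes. To work with the page as text it has to be decoded first; the complete program below assumes the page is UTF-8 encoded:

print(content.decode("utf8"))  # print the page source as text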

It's that simple.

What usually gets printed at the command line is the source code of the page being crawled. To filter out the information you need, you can do further matching and filtering on it, for example using regular expressions to extract the contents of the title and body tags:

import re

def get_target(pattern, content):
    """Return the first match of pattern in content (decoded text), or an empty string."""
    m = re.search(pattern, content)
    target = ""
    if m:
        target = m.group(0)
    return target

title = get_target(r"<title>.*</title>", content)
body = get_target(r"<body[\w\W]*</body>", content)
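The matches above include the surrounding tags. If only the inner text is wanted, a capturing group works as well; this small variation is just an illustration and not part of the original script:

m = re.search(r"<title>(.*?)</title>", content)
title_text = m.group(1) if m else ""  # page title without the <title> tags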

For some collection programs, going this far is already enough. But if what we want is the page's content rather than its HTML, we need a tool more powerful than regular expressions. The next section will introduce this with a concrete example.
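As a preview of what such a tool looks like, here is a minimal sketch built on the standard library's html.parser; it simply collects the text found between tags. The tool actually used in this series appears in the next section, so treat this only as an illustration:

from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the plain text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

parser = TextCollector()
parser.feed(content)  # content is the decoded HTML from above
print("\n".join(parser.chunks))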

The complete program is attached below:

#!python
# encoding: utf-8

import re
from urllib import request
from urllib import parse


def get(url):
    """Send a GET request and return the response body as text."""
    response = request.urlopen(url)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def post(url, **paras):
    """Send a POST request with the given form fields and return the response body as text."""
    param = parse.urlencode(paras).encode("utf8")
    req = request.Request(url, param)
    response = request.urlopen(req)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def get_target(pattern, content):
    """Return the first match of pattern in content, or an empty string."""
    m = re.search(pattern, content)
    target = ""
    if m:
        target = m.group(0)
    return target


def main():
    url = 'http://www.zhyea.com/2016/07/17/memory-analyzer-all.html'
    content = get(url)
    title = get_target(r"<title>.*</title>", content)
    body = get_target(r"<body[\w\W]*</body>", content)
    print(title)
    print(body)


if __name__ == "__main__":
    main()
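The post() helper is not exercised in main(), but it follows the same pattern, encoding the form fields with urllib.parse.urlencode before sending them. A hypothetical call (the URL and field names here are made up purely for illustration) would look like this:

result = post("http://www.example.com/search", keyword="python", page=1)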
