A Lightweight Crawler: A Python Implementation

More sharing, less lecturing.

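Basic framework:

Dispatcher: the program's entry point. It initializes the program and controls its overall logic.

URL manager: maintains the pool of URLs still to be visited and the pool of URLs already visited. This keeps the crawler on track: it neither visits a page twice nor visits irrelevant pages.

Page downloader: downloads the page at a given URL to the local machine for the program to use.

Page parser: extracts the URLs that match our criteria from a page, along with the data we want.

Final data: the collected data can be stored in a database or saved in various other forms; this program saves it as an HTML page.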

Implementation:

Dispatcher (main program):

Controls the overall logic of the program.

# -*- coding:utf-8 -*-
import manager
import downloader
import html_parser
import outputer


class MainProgram(object):
    def __init__(self):
        self.urls = manager.Manager()
        self.downloader = downloader.Downloader()
        self.parser = html_parser.Parser()
        self.outputer = outputer.Outputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            new_url = self.urls.get_new_url()
            print("craw %d : %s" % (count, new_url))
            new_html = self.downloader.download(new_url)
            new_urls, new_data = self.parser.parse(new_url, new_html)
            self.urls.add_new_urls(new_urls)
            self.outputer.collect_data(new_data)
            # Stop after 100 pages
            if count == 100:
                break
            count = count + 1
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "http://www.haha365.com/joke/"
    obj_spider = MainProgram()
    obj_spider.craw(root_url)
    print("end")

URL manager

Manages the program's URL pools and keeps the crawler on track.
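The class in this section keeps pending URLs in a set, so the order in which they come back out is arbitrary. If a predictable first-in, first-out order is wanted, one possible variant is a deque plus a set of URLs already seen. This is only a sketch and not the author's code: the name FifoManager is hypothetical, and it simply mirrors the method names of Manager.

```python
from collections import deque


class FifoManager(object):
    """Hypothetical variant of Manager: same methods, first-in first-out order."""

    def __init__(self):
        self.queue = deque()  # pending URLs, oldest first
        self.seen = set()     # every URL ever added, pending or visited

    def add_new_url(self, url):
        # Ignore missing URLs and anything already queued or visited
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.queue) != 0

    def get_new_url(self):
        return self.queue.popleft()


m = FifoManager()
m.add_new_urls(["a", "b", "a"])  # the second "a" is a duplicate
first = m.get_new_url()
m.add_new_url("a")               # already visited, so ignored
second = m.get_new_url()
```

A single seen set replaces the two membership checks, because a URL enters it the moment it is queued and never leaves.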

# manager.py
class Manager(object):
    # URL manager
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be visited
        self.old_urls = set()  # URLs already visited

    def add_new_url(self, url):
        # Add a single URL if it has not been seen before
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Call add_new_url for each URL in a new collection
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        # True while the pool of pending URLs is not empty
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Take a pending URL and mark it as visited
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

Page downloader

Fetches a web page to the local machine so its content can be read.
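One thing to keep in mind for this step: urllib.request.urlopen raises an exception (for example urllib.error.HTTPError on a 404) instead of returning a response for most failures, so a single bad link can stop the whole crawl. A minimal sketch of a more forgiving fetch follows. The function name safe_download and the 10-second default timeout are assumptions made for illustration, not part of the original program.

```python
import urllib.request


def safe_download(url, timeout=10):
    """Return the response body as bytes, or None if the fetch fails."""
    if url is None:
        return None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError is raised for strings that are not valid URLs.
        return None
```

Returning None on failure matches what the downloader in this section does for a missing URL, so the caller only needs one check.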

# downloader.py
import urllib.request


class Downloader(object):
    # HTML downloader
    def download(self, url):
        # Download the page and return its raw bytes
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()

Page parser

1. Extract the URLs from the page and hand them to the URL manager.

2. Extract the data from the page to obtain the target data.
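The first step can be tried in isolation with only the standard library. The sketch below filters a list of href values with the same regular expression the parser uses (BeautifulSoup applies a compiled pattern with its search() method) and resolves each match against the page URL. The function name select_links and the sample href values are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Same pattern the parser passes to find_all()
LINK_PATTERN = re.compile(r"/index_+\d+\.htm")


def select_links(page_url, hrefs):
    """Keep hrefs that match LINK_PATTERN and resolve them against page_url."""
    return {urljoin(page_url, h) for h in hrefs if LINK_PATTERN.search(h)}


# Hypothetical href values, not taken from the real site
sample_hrefs = [
    "/joke/index_2.htm",
    "/joke/index_3.htm",
    "/about.htm",
    "/joke/index_2.htm",  # duplicate, collapsed by the set
]
links = select_links("http://www.haha365.com/joke/", sample_hrefs)
print(sorted(links))
```

Collecting the results in a set removes duplicate links within a page before they ever reach the URL manager.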

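# html_parser.py
from bs4 import BeautifulSoup
import urllib.parse
import re


class Parser(object):
    # Page parser
    def parse(self, url, html):
        # Page parsing = URL extraction + data extraction
        if url is None or html is None:
            # Return a pair so the caller can always unpack the result
            return set(), None
        soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(url, soup)
        new_data = self._get_new_data(url, soup)
        return new_urls, new_data

    def _get_new_urls(self, url, soup):
        # URL extraction
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"/index_+\d+\.htm"))
        for link in links:
            new_url = link['href']
            # Resolve relative links against the current page URL
            new_urls.add(urllib.parse.urljoin(url, new_url))
        return new_urls

    def _get_new_data(self, url, soup):
        # Data extraction
        res_data = {}
        try:
            res_data['url'] = url
            text_node = soup.find('div', class_="cat_llb")
            res_data['text'] = text_node.find_all(string=True)
        except AttributeError:
            # The page has no matching div (text_node is None): no data.
            # The original used "continue" here, which is a syntax error outside a loop.
            return None
        return res_data

Saving the data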

This program saves the data to an HTML file.

# outputer.py
class Outputer(object):
    # Collects the data and writes it out
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        # Gather the data from one page
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Write everything to an HTML page
        with open("output.html", "w", encoding='utf-8') as fileout:
            fileout.write("<html>")
            fileout.write("<head>")
            fileout.write("<meta charset='utf-8'>")
            fileout.write("</head>")
            fileout.write("<body>")
            fileout.write("<table>")
            for data in self.datas:
                try:
                    fileout.write("<tr>")
                    fileout.write("<td>%s</td>" % data['url'])
                    fileout.write("<td>%s</td>" % data['text'])
                    fileout.write("</tr>")
                except KeyError:
                    # Skip records that lack an expected field
                    continue
            fileout.write("</table>")
            fileout.write("</body>")
            fileout.write("</html>")
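The output code writes the scraped text into the table as-is, so any `<`, `>` or `&` in a URL or in the text ends up as markup in output.html. A possible refinement, shown here only as a sketch, is to escape each cell with html.escape from the standard library. The helper name render_row and the sample record are assumptions for illustration.

```python
import html


def render_row(data):
    """Build one table row with both cells HTML-escaped."""
    url = html.escape(data["url"])
    # data["text"] is a list of strings, so join it before escaping
    text = html.escape("".join(data["text"]))
    return "<tr><td>%s</td><td>%s</td></tr>" % (url, text)


# Hypothetical record in the same shape the parser produces
row = render_row({
    "url": "http://example.com/?a=1&b=2",
    "text": ["<b>joke</b>", " text"],
})
print(row)
```

Joining the list first also avoids writing Python's list representation (brackets and quotes) into the page.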
