
A Summary of Crawling Zhihu with a Python Web Scraper


I recently spent some time learning about web crawling and implemented a scraper for Zhihu in Python; this post is a short summary of that work. A web crawler is a program or script that automatically harvests information from the web according to a set of rules. Machine learning, data mining, and similar fields all start from large amounts of data and look for valuable patterns in it; a crawler solves the problem of acquiring that data in the first place, which makes web scraping a skill worth mastering.

Python offers many open-source packages for this. Here I use requests, BeautifulSoup4, and json, among others: the requests module issues the HTTP requests, while the bs4 and json modules extract the information we want from the responses. I won't expand on each module's details here; below I describe, feature by feature, how to crawl Zhihu.
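To show how the three modules divide the work before we get to Zhihu itself, here is a minimal, self-contained sketch against a placeholder URL: requests fetches the page, BeautifulSoup parses the HTML, and json serializes whatever we extract.

import json

import requests
from bs4 import BeautifulSoup

resp = requests.get('http://example.com')       # requests: issue the HTTP request
soup = BeautifulSoup(resp.text, 'html.parser')  # bs4: parse the returned HTML
data = {'title': soup.title.string}             # pull out the <title> text
with open('data.json', 'w') as f:
    json.dump(data, f)                          # json: persist the extracted data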

Simulating Login

To crawl Zhihu at all, we first have to simulate logging in, because much of the site is inaccessible without a login. The login function below is taken directly from Zhihu user fireling. Fill your own account and password into the data dict inside the function, then call it before any scraping; barring surprises you will be logged in and can go on to fetch the data you want. Note that on first use the program asks you to type in the captcha by hand; afterwards the current directory gains a cookiefile (which stores the cookie information) and a zhihucaptcha.gif (which stores the captcha image), and on later runs the program reuses the saved cookies to log in for us automatically.

import json
import os
import time

import requests
from bs4 import BeautifulSoup

def login():
    url = 'http://www.zhihu.com'
    loginURL = 'http://www.zhihu.com/login/email'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:41.0) Gecko/20100101 Firefox/41.0',
        'Referer': 'http://www.zhihu.com/',
        'Host': 'www.zhihu.com',
    }
    data = {
        'email': 'you@example.com',
        'password': '**************',
        'rememberme': 'true',
    }
    global s
    s = requests.session()
    global xsrf
    if os.path.exists('cookiefile'):
        # A saved cookiefile exists: reuse the cookies instead of logging in again.
        with open('cookiefile') as f:
            cookie = json.load(f)
        s.cookies.update(cookie)
        req1 = s.get(url, headers=headers)
        soup = BeautifulSoup(req1.text, 'html.parser')
        xsrf = soup.find('input', {'name': '_xsrf', 'type': 'hidden'}).get('value')
        # Save the page to zhihu.html so we can check whether login succeeded.
        with open('zhihu.html', 'w') as f:
            f.write(req1.text)
    else:
        # First run: fetch the _xsrf token, download the captcha, and log in.
        req = s.get(url, headers=headers)
        print(req)
        soup = BeautifulSoup(req.text, 'html.parser')
        xsrf = soup.find('input', {'name': '_xsrf', 'type': 'hidden'}).get('value')
        data['_xsrf'] = xsrf
        timestamp = int(time.time() * 1000)
        captchaURL = 'http://www.zhihu.com/captcha.gif?=' + str(timestamp)
        print(captchaURL)
        with open('zhihucaptcha.gif', 'wb') as f:
            captchaREQ = s.get(captchaURL, headers=headers)
            f.write(captchaREQ.content)
        loginCaptcha = input('input captcha:\n').strip()
        data['captcha'] = loginCaptcha
        print(data)
        loginREQ = s.post(loginURL, headers=headers, data=data)
        if not loginREQ.json()['r']:
            # r == 0 means success; persist the cookies for the next run.
            print(s.cookies.get_dict())
            with open('cookiefile', 'w') as f:
                json.dump(s.cookies.get_dict(), f)
        else:
            print('login fail')

Note the global variable s = requests.session() inside the login function: we use this session object for every request to Zhihu, and it keeps our simulated login alive for the entire crawl.
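To make the role of the Session concrete, here is a tiny sketch: any cookies a response sets are stored on the Session object and sent automatically with every later request made through it, which is exactly what keeps the simulated login alive.

import requests

s = requests.session()
s.get('http://www.zhihu.com')       # Set-Cookie headers from the response are stored
print(s.cookies.get_dict())         # the session now holds those cookies...
s.get('http://www.zhihu.com/people/marcovaldong')  # ...and sends them automatically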

Getting a User's Basic Information

Every Zhihu user has a unique ID; mine, for example, is marcovaldong, so my profile lives at https://www.zhihu.com/people/marcovaldong . A profile page lists the user's location, industry, gender, education, upvotes received, thanks received, who they follow, who follows them, and so on. So I will first show how a crawler can collect a given user's information. The function get_userInfo(userID) below scrapes one user's profile: pass it a user ID and it returns a tuple of 19 fields, namely nickname, ID, location, industry, gender, employer, position, school, major, upvote count, thanks count, number of questions asked, answers, posts, collections, public edits, followees, followers, and how many people have viewed the profile.

# header_info mirrors the headers used in login(); the original post never
# shows its definition, so this is a reconstruction.
header_info = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:41.0) Gecko/20100101 Firefox/41.0',
    'Referer': 'http://www.zhihu.com/',
    'Host': 'www.zhihu.com',
}

def get_userInfo(userID):
    user_url = 'https://www.zhihu.com/people/' + userID
    response = s.get(user_url, headers=header_info)
    soup = BeautifulSoup(response.content, 'lxml')
    name = soup.find_all('span', {'class': 'name'})[1].string
    ID = userID
    # Optional profile fields: fall back to the string 'None' when missing.
    location = soup.find('span', {'class': 'location item'})
    location = 'None' if location is None else location.string
    business = soup.find('span', {'class': 'business item'})
    business = 'None' if business is None else business.string
    gender = soup.find('input', {'checked': 'checked'})
    gender = 'None' if gender is None else gender['class'][0]
    employment = soup.find('span', {'class': 'employment item'})
    employment = 'None' if employment is None else employment.string
    position = soup.find('span', {'class': 'position item'})
    position = 'None' if position is None else position.string
    education = soup.find('span', {'class': 'education item'})
    education = 'None' if education is None else education.string
    major = soup.find('span', {'class': 'education-extra item'})
    major = 'None' if major is None else major.string
    agree = int(soup.find('span', {'class': 'zm-profile-header-user-agree'}).strong.string)
    thanks = int(soup.find('span', {'class': 'zm-profile-header-user-thanks'}).strong.string)
    # Counters in the profile sidebar: questions, answers, posts, collections, edits.
    infolist = soup.find_all('a', {'class': 'item'})
    asks = int(infolist[1].span.string)
    answers = int(infolist[2].span.string)
    posts = int(infolist[3].span.string)
    collections = int(infolist[4].span.string)
    logs = int(infolist[5].span.string)
    followees = int(infolist[-2].strong.string)
    followers = int(infolist[-1].strong.string)
    scantime = int(soup.find_all('span', {'class': 'zg-gray-normal'})[-1].strong.string)
    info = (name, ID, location, business, gender, employment, position,
            education, major, agree, thanks, asks, answers, posts,
            collections, logs, followees, followers, scantime)
    return info

if __name__ == '__main__':
    login()
    userID = 'marcovaldong'
    info = get_userInfo(userID)
    print('The information of ' + userID + ' is:')
    for item in info:
        print(item)
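As a usage example, the sketch below (the list of target IDs is hypothetical) calls get_userInfo for several users and writes one CSV row per user with Python's standard csv module:

import csv

fields = ['name', 'ID', 'location', 'business', 'gender', 'employment',
          'position', 'education', 'major', 'agree', 'thanks', 'asks',
          'answers', 'posts', 'collections', 'logs', 'followees',
          'followers', 'scantime']
userIDs = ['marcovaldong', 'some-other-id']  # hypothetical list of target users

login()
with open('users.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(fields)                  # header row: the 19 fields above
    for uid in userIDs:
        writer.writerow(get_userInfo(uid))   # one row per user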
