
Write your first web crawler in Python Scrapy


The scraping series would not be complete without discussing Scrapy. In this post I am going to write a web crawler that scrapes data from the Electronics & Appliances listings on OLX. Before I get into the code, how about a brief intro to Scrapy itself?

What is Scrapy?

From Wikipedia:

Scrapy (/skrepi/ skray-pee)[1] is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.

In other words, it is a web crawling framework that has already done the heavy lifting needed to write a crawler. I will explore what that heavy lifting includes below.

Read on!

Creating Project

Scrapy introduces the idea of a project that can hold multiple crawlers, or spiders. This is especially helpful if you are writing several crawlers for different sections or sub-domains of a site. So, first create the project:

Adnans-MBP:ScrapyCrawlers AdnanAhmad$ scrapy startproject olx
New Scrapy project 'olx', using template directory '//anaconda/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Development/PetProjects/ScrapyCrawlers/olx

You can start your first spider with:
    cd olx
    scrapy genspider example example.com

Creating Crawler

I ran the command scrapy startproject olx, which creates a project named olx and prints helpful information about the next steps. Go to the newly created folder and run the command that generates the first spider, passing the spider's name and the domain of the site to be crawled:

Adnans-MBP:ScrapyCrawlers AdnanAhmad$ cd olx/
Adnans-MBP:olx AdnanAhmad$ scrapy genspider electronics www.olx.com.pk
Created spider 'electronics' using template 'basic' in module:
  olx.spiders.electronics

I generated the code for my first spider and named it electronics, since I am accessing the electronics section of OLX. You can name yours anything you want, or dedicate your first spider to your spouse or (girl|boy)friend.

The final project structure will look something like this:


[Image: project directory structure]

As you can see, there is a separate folder just for spiders; as mentioned, you can add multiple spiders within a single project. Let’s open the electronics.py spider file. You will find something like this:

# -*- coding: utf-8 -*-
import scrapy


class ElectronicsSpider(scrapy.Spider):
    name = "electronics"
    allowed_domains = ["www.olx.com.pk"]
    start_urls = ['http://www.olx.com.pk/']

    def parse(self, response):
        pass

As you can see, ElectronicsSpider is a subclass of scrapy.Spider. The name property is the name of the spider that was given in the spider generation command; you will use it when running the crawler. The allowed_domains property tells which domains this crawler may access, and start_urls lists the initial URLs to be requested first. Besides the file structure, this is a good feature for drawing the boundaries of your crawler.
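To make the allowed_domains boundary concrete, here is a rough stdlib-only sketch of the kind of host check it implies. The helper is_allowed is my own illustration, not part of Scrapy, and Scrapy's real offsite filtering does more than this. I am only assuming that a URL counts as in-bounds when its host equals an allowed domain or is a subdomain of it.

```python
from urllib.parse import urlparse

# Same value as the allowed_domains property in the generated spider.
ALLOWED_DOMAINS = ["www.olx.com.pk"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Hypothetical helper: True if the URL's host is an allowed domain
    or a subdomain of one. Not Scrapy API, just an illustration."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed("https://www.olx.com.pk/tv-video-audio/"))
print(is_allowed("https://www.example.com/"))
```

Under this assumption the first URL is inside the boundary and the second one is not, so a link to it would simply not be followed.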

The parse method, as the name suggests, parses the content of the page being accessed. Since I am going to write a crawler that goes to multiple pages, I am going to make a few changes.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ElectronicsSpider(CrawlSpider):
    name = "electronics"
    allowed_domains = ["www.olx.com.pk"]
    start_urls = [
        'https://www.olx.com.pk/computers-accessories/',
        'https://www.olx.com.pk/tv-video-audio/',
        'https://www.olx.com.pk/games-entertainment/'
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_css=('.pageNextPrev',)),
             callback="parse_item",
             follow=True),)

    def parse_item(self, response):
        print('Processing..' + response.url)
        # print(response.text)

To make the crawler navigate to many pages, I subclassed my spider from CrawlSpider instead of scrapy.Spider. This class makes crawling many pages of a site easier. You can do something similar with the generated code, but you would need to take care of the recursion to navigate to the next pages yourself.
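To show what "taking care of the recursion" involves, here is a framework-free sketch of the bookkeeping: a queue of pages still to visit and a set of pages already seen, so that "previous" links do not send the crawler in circles. FAKE_SITE and crawl are hypothetical names of mine, and the in-memory dictionary stands in for real downloads; this is not how Scrapy is implemented internally.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the pagination links on it.
BASE = "https://www.olx.com.pk/tv-video-audio/"
FAKE_SITE = {
    BASE: [BASE + "?page=2"],
    BASE + "?page=2": [BASE, BASE + "?page=3"],  # has a "previous" link too
    BASE + "?page=3": [BASE + "?page=2"],
}


def crawl(start_urls, get_links):
    """Visit every page reachable from start_urls exactly once."""
    seen = set(start_urls)
    queue = deque(start_urls)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # stand-in for "download and parse the page"
        for link in get_links(url):
            if link not in seen:  # skip pages that are already scheduled
                seen.add(link)
                queue.append(link)
    return visited


order = crawl([BASE], lambda url: FAKE_SITE.get(url, []))
for url in order:
    print("Processing.." + url)
```

With a plain scrapy.Spider you would have to arrange the equivalent of this loop yourself by requesting the next page from inside parse.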

The next step is to set the rules variable, where you define the rules for navigating the site. LinkExtractor takes parameters that draw the navigation boundaries. Here I am using the restrict_css parameter to set the class of the NEXT page link. If you go to this page and inspect the element, you will find something like this:


[Image: inspected pagination element showing the pageNextPrev class]

pageNextPrev is the class used to fetch the links to the next pages. The callback parameter tells which method to use to access the page elements. We will work on this method soon.
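If it helps to picture what restricting link extraction to a CSS class means, here is a rough sketch using only the standard library. It collects links only from elements that carry the class (or sit inside one) and ignores every other link on the page. The sample HTML is made up by me and is not copied from OLX, RestrictedLinkParser is a hypothetical name, and Scrapy's LinkExtractor works differently under the hood.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class RestrictedLinkParser(HTMLParser):
    """Collect hrefs only from inside elements that carry css_class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.depth = 0  # > 0 while we are inside a matching element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        in_region = self.depth > 0 or self.css_class in classes
        if in_region and tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        if in_region and tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


# Made-up markup for illustration only.
SAMPLE_HTML = """
<div class="pager">
  <a class="link pageNextPrev" href="/computers-accessories/?page=2"><span>Next</span></a>
</div>
<a href="/item/some-ad">An ad that should be ignored</a>
"""

parser = RestrictedLinkParser("https://www.olx.com.pk/computers-accessories/", "pageNextPrev")
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

Only the pagination link survives; the ad link is outside the restricted region, which is the boundary the rule is meant to draw.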

Do remember: you need to change the name of the method from parse() to parse_item() (or whatever you like) to avoid overriding the base class's own parse() method. CrawlSpider relies on that method to apply your rules, so otherwise your rule will not work, even if you set follow=True.
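Here is a toy model of that naming conflict. ToyCrawlSpider is a made-up stand-in and not Scrapy's actual class; I am only assuming, as the Scrapy documentation warns, that the base class uses parse() itself to drive the rules. Once a subclass defines its own parse(), the rule handling silently disappears.

```python
class ToyCrawlSpider:
    """Made-up stand-in for a rule-driven base spider."""

    callback = "parse_item"  # name of the user's callback, as in Rule(callback=...)

    def parse(self, page):
        # The base class calls the user's callback AND follows the links.
        results = [getattr(self, self.callback)(page)]
        results += ["FOLLOW " + link for link in page["next_links"]]
        return results


class GoodSpider(ToyCrawlSpider):
    def parse_item(self, page):
        return "ITEM " + page["url"]


class BrokenSpider(ToyCrawlSpider):
    def parse(self, page):  # shadows the base-class logic above
        return ["ITEM " + page["url"]]


page = {"url": "/tv-video-audio/", "next_links": ["/tv-video-audio/?page=2"]}
good = GoodSpider().parse(page)
broken = BrokenSpider().parse(page)
print(good)
print(broken)
```

The good spider both processes the page and follows the next link, while the broken one only processes the first page and stops.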

So far so good. Let’s test the crawler as it stands. Again, go to the terminal and write:

Adnans-MBP:olx AdnanAhmad$ scrapy crawl electronics

The last argument is the name of the spider, which was set earlier in the name property of the ElectronicsSpider class. On the console you will find lots of information that is helpful for debugging your crawler. You can turn the log output off if you don’t want to see the debugging information. The command is the same, with the --nolog switch added:

Adnans-MBP:olx AdnanAhmad$ scrapy crawl --nolog electronics

If you run it now, it will print something like:

Adnans-MBP:olx AdnanAhmad$ scrapy crawl --nolog electronics
Processing..https://www.olx.com.pk/computers-accessories/?page=2
Processing..https://www.olx.com.pk/tv-video-audio/?page=2
Processing..https://www.olx.com.pk/games-entertainment/?page=2
Processing..https://www.olx.com.pk/computers-accessories/
Processing..https://www.olx.com.pk/tv-video-audio/
Processing..https://www.olx.com.pk/games-entertainment/
Processing..https://www.olx.com.pk/computers-accessories/?page=3
Processing..https://www.olx.com.pk/tv-video-audio/?page=3
Processing..https://www.olx.com.pk/games-entertainment/?page=3
Processing..https://www.olx.com.pk/computers-accessories/?page=4
Processing..https://www.olx.com.pk/tv-video-audio/?page=4
Processing..https://www.olx.com.pk/games-entertainment/?page=4
Processing..https://www.olx.com.pk/computers-accessories/?page=5
Processing..https://www.olx.com.pk/tv-video-audio/?page=5
Processing..https://www.olx.com.pk/games-entertainment/?page=5
Processing..https://www.olx.com.pk/computers-accessories/?page=6
Processing..https://www.olx.com.pk/tv-video-audio/?page=6
Processing..https://www.olx.com.pk/games-entertainment/?page=6
Processing..https://www.olx.com.pk/computers-accessories/?page=7
Processing..https://www.olx.com.pk/tv-video-audio/?page=7
Processing..https://www.olx.com.pk/games-entertainment/?page=7
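If --nolog hides more than you would like, a middle ground is Scrapy's LOG_LEVEL setting, which can go in the project's settings.py. This is a minimal fragment, and the choice of "WARNING" is only an example of mine, not something the project template sets for you.

```python
# settings.py of the olx project (fragment)
# Show only warnings and errors instead of the full debug output.
LOG_LEVEL = "WARNING"
```

The print() calls in parse_item are not log messages, so the Processing.. lines still appear either way. The same level can also be passed for a single run with scrapy crawl electronics -L WARNING.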
