
Background
Let’s assume that we have two competitors selling similar pairs of shoes in the same area. In the past, if competitor A wanted to know competitor B’s pricing, A would have to enquire from someone close to B.
These days, it is quite different. If we want to purchase a bouquet of roses, we just check the seller’s platform for the price. This is the essence of web scraping―the art of extracting data from a website. We can automate examples like these in Python with the Beautiful Soup module.
Dos and don’ts of web scraping
Web scraping is legal in one context and illegal in another. For example, it is legal when the extracted data is composed of directories and telephone listings for personal use. However, if the extracted data is for commercial use―without the consent of the owner―it would be illegal. Thus, we should be careful when extracting data from a website and always be mindful of the law.
Getting started
There are three standard methods we can use to scrape data from a web page: regular expressions, Beautiful Soup, and CSS selectors. If you know of any other approach to scrape data from a web page, kindly share it in the comments section.
Before we dive straight into scraping data from a stock exchange site, let’s understand a few basic terms in web scraping.
Web Crawling: Web crawling simply refers to downloading HTML pages on a website via user agents known as crawlers, such as Googlebot, Baiduspider, and Bingbot.
Robots.txt: Robots.txt is a file which contains a set of suggestions/instructions purposely for crawlers. These suggestions specify whether a crawler has the right to access a particular web page on a website or not.
Sitemap Files: Sitemap files are provided by websites to make crawling a bit easier for crawlers. They help crawlers locate the updated content of pages on a website. Instead of crawling every web page of a website, crawlers check the updated content of a website via its sitemap files. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html
Beautiful Soup: Beautiful Soup is a popular module in Python that parses (or examines) a web page and provides a convenient interface for navigating content. I prefer Beautiful Soup to regular expressions and CSS selectors when scraping data from a web page. It is also one of the Python libraries recommended by the #1 Stack Overflow answerer, Martijn Pieters. But if you want, you can also build a web scraper in Node.js.
Apart from Beautiful Soup, which we will use to scrape data from a web page, there are modules in Python that help us learn the technical aspects of our web target. We can use the builtwith module to find out more of our target’s technical details. You can install the builtwith module by doing the following:
pip install builtwith

The builtwith module exposes an array of technologies a website was built upon. Web intermediaries (i.e. WAFs or proxies) may hide some of these technical aspects for security reasons. For instance, let’s try to examine Bloomberg’s website:
import builtwith
builtwith.parse("http://www.bloomberg.com")
Below is a screenshot of the output:

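If you would rather read the result programmatically than off the interactive shell, here is a minimal sketch; the categories and technologies returned depend entirely on the site being examined:

import builtwith

# builtwith.parse() returns a dictionary mapping categories to lists of technologies
technologies = builtwith.parse("http://www.bloomberg.com")

for category, names in technologies.items():
    print category, "->", ", ".join(names)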
Before we scrape the name and price of the index on Bloomberg, we need to check the robots.txt file of our target before taking any further steps. To remind us of its purpose, I explained earlier that robots.txt is a file composed of suggestions for crawlers (or web robots).
For this project, our target is Bloomberg. Let’s check out Bloomberg’s restrictions for web crawlers.
Just type the following into the URL address bar:
http://www.bloomberg.com/robots.txt

This simply sends a request to the web server to retrieve the robots.txt file. Below is the robots.txt file retrieved from the web server. Now let’s check Bloomberg’s rules for web robots.

Crawling our target
With the help of the robots.txt file, we know where we can allow our crawler to download HTML pages and where our crawler should not tread. As good web citizens, it is advisable to obey the bots’ rules. It is not impossible to let our crawler venture into restricted areas, but Bloomberg may then ban our IP address for an hour or a longer period.
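If we want to check those rules programmatically rather than by eye, the robotparser module from the Python 2 standard library can read the file for us. The path used below is only an illustration; the actual rules are whatever Bloomberg publishes in its robots.txt:

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url("http://www.bloomberg.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# can_fetch() reports whether a given user agent may crawl a given path
print rp.can_fetch("*", "http://www.bloomberg.com/markets/")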
For this project, it is not necessary to download/crawl a specific web page. We can use the Firebug extension to check or inspect the page we want to scrape our data from.

Now let’s use Firebug to find the HTML related to the index’s name and price of the day. Alternatively, we can use the browser’s native inspector; I prefer to use both.
Just hover or move your cursor over the index name and click the related HTML tags. We can see the name of the index, which should look something like the one below:

Let’s examine the sitemap file of our target
Sitemap files simply provide links to the updated content of a website, which allows crawlers to efficiently crawl the web pages of interest. Below are a number of Bloomberg’s sitemap files:

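If we want to pull the page links out of a sitemap with code instead of reading them in the browser, a minimal sketch with urllib2 and a regular expression could look like the following. The sitemap path here is only an assumption based on the common /sitemap.xml convention; use the path actually listed in the site’s robots.txt:

import re
import urllib2

# The sitemap location below is an assumption; take the real one from robots.txt.
sitemap = urllib2.urlopen("http://www.bloomberg.com/sitemap.xml").read()

# Sitemap files are XML; each page URL sits inside a <loc> element.
links = re.findall("<loc>(.*?)</loc>", sitemap)
for link in links:
    print link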
Let’s scrape data from our target:
Now it is time to scrape particular data from our target site: www.bloomberg.com. There are diverse ways to scrape data from a web page: CSS selectors, regular expressions, and the popular BeautifulSoup module. Among these three approaches, we are going to use BeautifulSoup. Note that the name we use to install the module via pip is different from the name we use to import it. As for text editors, you can choose Sublime, Atom, or Notepad++; others are available, too.
Now let’s assume we don’t have BeautifulSoup installed. Let’s install it via pip:
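The package is published on PyPI as beautifulsoup4, even though we import it as bs4 (this is the naming difference mentioned above):

pip install beautifulsoup4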

Next, we import urllib2, BeautifulSoup, and datetime:
# import libraries
import urllib2  # urllib2 is used to fetch URLs via urlopen()
from bs4 import BeautifulSoup  # when importing Beautiful Soup, don't add the 4
from datetime import datetime  # functions and classes for working with dates and times
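Putting the pieces together, below is a minimal sketch of the kind of scraper these imports set up. The quote URL and the class names bloomberg-name and bloomberg-price are placeholders made up for illustration; use Firebug, as described above, to find the actual tags and class attributes on the page you are inspecting:

import urllib2
from bs4 import BeautifulSoup
from datetime import datetime

# fetch the page and hand its HTML to Beautiful Soup
page = urllib2.urlopen("http://www.bloomberg.com/quote/SPX:IND")  # placeholder URL
soup = BeautifulSoup(page, "html.parser")

# the class names below are placeholders; inspect the page to find the real ones
name_tag = soup.find("h1", attrs={"class": "bloomberg-name"})
price_tag = soup.find("div", attrs={"class": "bloomberg-price"})

if name_tag and price_tag:
    print name_tag.text.strip(), price_tag.text.strip(), datetime.now()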