
Background
Let’s assume that we have two competitors selling similar pairs of shoes in the same area. In the past, if competitor A wanted to know competitor B’s pricing, A would have to enquire from someone close to B.
These days, it is quite different. If we want to purchase a bouquet of roses, we just check the seller’s platform for the price. This is the essence of web scraping―the art of extracting data from a website. We can automate examples like these in Python with the Beautiful Soup module.
Dos and don’ts of web scraping
Web scraping is legal in one context and illegal in another. For example, it is legal when the extracted data is composed of directories and telephone listings for personal use. However, if the extracted data is for commercial use―without the consent of the owner―it would be illegal. Thus, we should be careful when extracting data from a website and always be mindful of the law.
Getting started
There are three standard methods we can use to scrape data from a web page: regular expressions, Beautiful Soup, and CSS selectors. If you know of any other approach to scrape data from a web page, kindly share it in the comments section.
Before we dive straight into scraping data from a stock exchange site, let’s understand a few basic terms in web scraping.
Web Crawling: Web crawling simply refers to downloading HTML pages on a website via user agents known as crawlers, such as Googlebot, Baiduspider, and Bingbot.
Robots.txt: Robots.txt is a file which contains a set of suggestions/instructions purposely for crawlers. These suggestions specify whether a crawler has the right to access a particular web page on a website or not.
Sitemap Files: Sitemap files are provided by websites to make crawling a bit easier for crawlers. They help crawlers locate the updated content of pages on a website. Instead of crawling every web page of a website, crawlers check the updated content of a website via its sitemap files. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html
Beautiful Soup: Beautiful Soup is a popular module in Python that parses (or examines) a web page and provides a convenient interface for navigating content. I prefer Beautiful Soup to regular expressions and CSS selectors when scraping data from a web page. It is also one of the Python libraries recommended by the #1 Stack Overflow answerer, Martijn Pieters. But if you want, you can also build a web scraper in Node.js.
Apart from Beautiful Soup, which we will use to scrape data from a web page, there are modules in Python that help us learn the technical aspects of our web target. We can use the builtwith module to find out more of our target’s technical details. You can install the builtwith module by doing the following:
pip install builtwith

The builtwith module exposes an array of technologies a website was built upon. Web intermediaries (i.e. WAFs or proxies) may hide some of these technical aspects for security reasons. For instance, let’s try to examine Bloomberg’s website:
import builtwith
builtwith.parse("http://www.bloomberg.com")
Below is a screenshot of the output:

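If you would rather read the result programmatically than off the interactive shell, here is a minimal sketch; the categories and technologies returned depend entirely on the site being examined:

import builtwith

# builtwith.parse() returns a dictionary mapping categories to lists of technologies
technologies = builtwith.parse("http://www.bloomberg.com")

for category, names in technologies.items():
    print category, "->", ", ".join(names)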
Before we scrape the name and price of the index on Bloomberg, we need to check the robots.txt file of our target before taking any further steps. To remind us of its purpose, I explained earlier that robots.txt is a file composed of suggestions for crawlers (or web robots).
For this project, our target is Bloomberg. Let’s check out Bloomberg’s restrictions for web crawlers.
Just type the following into the URL address bar:
http://www.bloomberg.com/robots.txt

This simply sends a request to the web server to retrieve the robots.txt file. Below is the robots.txt file retrieved from the web server. Now let’s check Bloomberg’s rules for web robots.

Crawling our target
With the help of the robots.txt file, we know where we can allow our crawler to download HTML pages and where our crawler should not tread. As good web citizens, it is advisable to obey the bots’ rules. It is not impossible to let our crawler venture into restricted areas, but Bloomberg may then ban our IP address for an hour or a longer period.
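If we want to check those rules programmatically rather than by eye, the robotparser module from the Python 2 standard library can read the file for us. The path used below is only an illustration; the actual rules are whatever Bloomberg publishes in its robots.txt:

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url("http://www.bloomberg.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# can_fetch() reports whether a given user agent may crawl a given path
print rp.can_fetch("*", "http://www.bloomberg.com/markets/")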
For this project, it is not necessary to download/crawl a specific web page. We can use the Firebug extension to check or inspect the page we want to scrape our data from.

Now let’s use Firebug to find the HTML related to the index’s name and price of the day. Alternatively, we can use the browser’s native inspector; I prefer to use both.
Just hover or move your cursor over the index name and click the related HTML tags. We can see the name of the index, which should look something like the one below:

Let’s examine the sitemap file of our target
Sitemap files simply provide links to the updated content of a website, which allows crawlers to efficiently crawl the web pages of interest. Below are a number of Bloomberg’s sitemap files:

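If we want to pull the page links out of a sitemap with code instead of reading them in the browser, a minimal sketch with urllib2 and a regular expression could look like the following. The sitemap path here is only an assumption based on the common /sitemap.xml convention; use the path actually listed in the site’s robots.txt:

import re
import urllib2

# The sitemap location below is an assumption; take the real one from robots.txt.
sitemap = urllib2.urlopen("http://www.bloomberg.com/sitemap.xml").read()

# Sitemap files are XML; each page URL sits inside a <loc> element.
links = re.findall("<loc>(.*?)</loc>", sitemap)
for link in links:
    print link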
Let’s scrape data from our target:
Now it is time to scrape particular data from our target site: www.bloomberg.com. There are diverse ways to scrape data from a web page: CSS selectors, regular expressions, and the popular BeautifulSoup module. Among these three approaches, we are going to use BeautifulSoup. Note that the name we use to install the module via pip is different from the name we use to import it. As for text editors, you can choose Sublime, Atom, or Notepad++; others are available, too.
Now let’s assume we don’t have BeautifulSoup installed. Let’s install it via pip:
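The package is published on PyPI as beautifulsoup4, even though we import it as bs4 (this is the naming difference mentioned above):

pip install beautifulsoup4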

Next, we import urllib2, BeautifulSoup, and datetime:
# import libraries
import urllib2  # urllib2 is used to fetch URLs via urlopen()
from bs4 import BeautifulSoup  # when importing Beautiful Soup, don't add the 4
from datetime import datetime  # functions and classes for working with dates and times
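Putting the pieces together, below is a minimal sketch of the kind of scraper these imports set up. The quote URL and the class names bloomberg-name and bloomberg-price are placeholders made up for illustration; use Firebug, as described above, to find the actual tags and class attributes on the page you are inspecting:

import urllib2
from bs4 import BeautifulSoup
from datetime import datetime

# fetch the page and hand its HTML to Beautiful Soup
page = urllib2.urlopen("http://www.bloomberg.com/quote/SPX:IND")  # placeholder URL
soup = BeautifulSoup(page, "html.parser")

# the class names below are placeholders; inspect the page to find the real ones
name_tag = soup.find("h1", attrs={"class": "bloomberg-name"})
price_tag = soup.find("div", attrs={"class": "bloomberg-price"})

if name_tag and price_tag:
    print name_tag.text.strip(), price_tag.text.strip(), datetime.now()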