
Ugly soup with Python, requests & Beautiful Soup

Web scraping has never been a coveted nor favorite discipline of mine; in fact, for me web scraping is an unfortunate but sometimes necessary evil. Scraping web pages is, at least for me, a very unstructured process, basically pure trial & error. Perhaps, with more experience and more interest in HTML and the web in general, it might become more likable, akin to the other types of programming I really enjoy doing… Who knows…

Anyway, I wanted to scrape some data for further analysis down the line. Normally I'd put Pandas to heavy use for a lot of the data munging tasks, but since my neighborhood is currently experiencing the mother of all power outages, after the storm of the century last Tuesday, I don't have power to run my computer. Thus, this stuff is done with Pythonista (2!) on my old & tired iPad. Btw, Pythonista is a really great Python environment for i*, and I guess some day I should upgrade to Pythonista 3… But I sure miss Pandas; it makes structured data manipulation so very much more convenient than using lists or even numpy arrays…!
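Once the grid (and with it a Pandas-equipped computer) is back, the parsed result tuples would drop straight into a DataFrame. A minimal sketch, assuming pandas is installed — the rows and column names here are made up for illustration, shaped like the tuples the scraping script below produces:

```python
import pandas as pd

# Two hypothetical rows: (pos, name, nat, gender, age_group, seconds)
rows = [(1, 'SKIER ONE', 'NOR', 'M', 21, 12345.6),
        (2, 'SKIER TWO', 'SWE', 'F', 21, 13456.7)]
df = pd.DataFrame(rows, columns=['pos', 'name', 'nat',
                                 'gender', 'age_group', 'secs'])

# One line replaces a lot of list/array bookkeeping
print(df.groupby('gender')['secs'].mean())
```

That one `groupby` is the kind of thing that takes masks and index juggling with plain numpy arrays.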

So, for this exercise I wanted to collect the results from a very famous long-distance cross-country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I'd have preferred the organisers of the race to publish the results in an easier-to-grab format than HTML, but I couldn't find anything but the web page — which furthermore splits the 5600 result entries across 56 pages, so it took a while to figure out how to scrape multiple linked pages.
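The pagination turned out to be just a query parameter, so the 56 page URLs can be built up front. A small Python 3 sketch, stdlib only — the `pagenum` parameter is the one the site actually uses; the helper name `page_urls` is mine:

```python
from urllib.parse import urlencode

BASE = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'

def page_urls(n_pages=56):
    """Build the URL for each of the paginated result pages."""
    return ['%s?%s' % (BASE, urlencode({'pagenum': p}))
            for p in range(1, n_pages + 1)]

urls = page_urls()
print(len(urls))   # 56
print(urls[0])     # ...EN_results.php?pagenum=1
```

In the actual script below, requests builds the same URLs itself via its `params` argument.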

Below are a couple of screenshots of an initial, very basic analysis. I'll do quite a bit more statistical analysis if and when the power grid resumes operations…



# coding: utf-8
import requests
import re
from bs4 import BeautifulSoup as bs
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

comp_list = []

# The results are split across 56 separate pages; fetch them one by one
url = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'
for page in range(1, 57):
    payload = {'pagenum': page}
    r = requests.get(url, params=payload)
    print page, r.status_code
    soup = bs(r.content, 'html.parser')
    # Each result entry lives in a table cell with class 'SP'
    main_table = soup.find('table')
    competitors = main_table.find_all(class_='SP')
    for comp in competitors:
        comp_list.append(comp.get_text())

comp_list = list(map(lambda x: x.encode('utf-8'), comp_list))
print len(comp_list)


def parse_item(i):
    """Pull position, name, nationality, gender, age group and
    finishing time (in seconds) out of one result string."""
    res_pattern = r'[0-9]+'                        # leading digits = position
    char_pattern = r'[A-Z]+'                       # words: name, nationality, gender
    num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'   # clock times
    age_pattern = r'[0-9]+/'                       # age group, e.g. '21/'

    res = re.match(res_pattern, i).group()
    chars = re.findall(char_pattern, i, flags=re.IGNORECASE)
    nat = chars[-1][1:]
    nums = re.findall(num_pattern, i)
    age = re.findall(age_pattern, i)
    age_over = age[0][:-1]
    # The finishing time is the clock time anchored at the end
    time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
    time = re.findall(time_pattern, nums[0])
    t = datetime.strptime(time[0], '%H:%M:%S.%f')
    td = (t - datetime(1900, 1, 1)).total_seconds()
    name = chars[0] + ' ' + chars[1]
    gender = chars[-2]
    return (res, name, nat, gender, age_over, td)

results = []
for comp in comp_list:
    results.append(parse_item(comp))

# (an attempt at a structured array, left disabled)
# results = np.array(results, dtype=[('res','i4'),('name','U100'),('nat','U3'),
#                                    ('gender','U1'),('age','i4'),('time','datetime64[us]')])

gender = np.array([results[i][3] for i in range(len(results))])
pos = np.array([results[i][0] for i in range(len(results))])
ages = np.array([results[i][4] for i in range(len(results))]).astype(int)
secs = np.array([results[i][5] for i in range(len(results))])
print ages.size
print secs.size
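The fiddliest part of `parse_item` is turning a clock time like `'01:32:07.4'` into seconds. Isolated as a Python 3 helper (the name `to_seconds` is mine), the trick is that `strptime` anchors the parsed time on 1900-01-01:

```python
from datetime import datetime

def to_seconds(clock):
    """Convert an 'HH:MM:SS.f' finishing time to total seconds."""
    t = datetime.strptime(clock, '%H:%M:%S.%f')
    # strptime fills in the date 1900-01-01, so subtracting that
    # epoch leaves just the time-of-day as a timedelta
    return (t - datetime(1900, 1, 1)).total_seconds()

print(to_seconds('01:32:07.4'))   # 5527.4
```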
friend_1 = 18844   # finishing times (in seconds) of two friends
friend_2 = 24446

male_mask = gender == 'M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]
male_mean = male_secs.mean()
female_mean = female_secs.mean()

bins = range(10000, 38000, 1000)

# Top plot: relative frequencies, so the two groups are comparable
plt.subplot(211)
plt.hist(male_secs, color='b', alpha=0.5, label='Men', bins=bins,
         weights=np.zeros_like(male_secs) + 1. / male_secs.size)
plt.hist(female_secs, color='r', alpha=0.5, label='Women', bins=bins,
         weights=np.zeros_like(female_secs) + 1. / female_secs.size)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')
plt.axvline(friend_1, ls='dashed', color='cyan', label='Friend_1', lw=5)
plt.axvline(friend_2, ls='dashed', color='magenta', label='Friend_2', lw=5)
plt.axvline(male_mean, ls='dashed', color='darkblue', label='Men mean', lw=5)
plt.axvline(female_mean, ls='dashed', color='darkred', label='Women mean', lw=5)
plt.legend(loc='upper right')

def colors(x):
    if gender[x] == 'M':
        return 'b'
    else:
        return 'r'

print secs.min(), secs.max()
colormap = list(map(colors, range(len(gender))))

# Bottom plot: raw counts
plt.subplot(212)
plt.hist(male_secs, color='b', alpha=0.5, label='Men', bins=bins)
plt.hist(female_secs, color='r', alpha=0.5, label='Women', bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
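The `weights=np.zeros_like(x) + 1./x.size` trick in the top histogram deserves a note: it gives every sample the weight 1/N, so bar heights become relative frequencies, which makes the men's and women's fields comparable despite their different sizes. A tiny standalone demo of the same idea (the toy data is mine):

```python
import numpy as np

x = np.array([1., 1., 2., 2., 2., 3.])
# Each of the 6 samples gets weight 1/6, so bin heights sum to 1
w = np.zeros_like(x) + 1. / x.size
counts, edges = np.histogram(x, bins=[0.5, 1.5, 2.5, 3.5], weights=w)
print(counts)        # fractions per bin, roughly [0.333, 0.5, 0.167]
print(counts.sum())  # 1.0
```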

the end :wink:

