Parsing Text: Numerical Data from Historical Accounts

In a recent article on Data Shop Talk, I introduced an interesting set of analyses on Olympic data. One of the analyses focused on Olympians who had died or gone missing due to war. The data from Sports-Reference.com came as a CSV file with the following columns: Athlete, Gender, Country, Sport, and Notes. This was a great start to my analysis, but something critical was missing for properly illustrating the timeline of deaths: the date of death.

The data was there, but it was embedded in the description under “Notes”. These notes, however, were not standardized, and the date of death could not be easily pulled out with Spotfire alone. So I wrote a simple Python script to parse the Notes and get the year of death.

The data I used can be found here. Let’s take a quick look at some example entries:

27 December 1942. Died in a Japanese internment camp in Burma.

In January 1945 Bevan a lieutenant in the Hampshire regiment of the British Army was involved in an operation to capture the towns of Putt and Walderath in the Heisberg district of Western Germany. He was fatally wounded by a mine and was buried across the border in the Netherlands at the military cemetery in the town of Brunssum. The Alick Bevan Plate Meeting at the Crystal Palace circuit in London is contested in his honour.

Died in World War II (per Volker Kluge) date and place not known.

As we can see, some dates come before the description, some are embedded in the story, and others are generalized to the war. However, we can use all of these clues to give us an answer as to when they died. All we need to do is look for a date or see if the war is mentioned.

Here’s the code I quickly came up with:

import csv

# Python 3: open CSV files in text mode with newline=''
with open('warLossOlympians.csv', newline='') as source, \
        open('warLossOlympiansDate.csv', 'w', newline='') as target:
    reader = csv.reader(source)
    # csv.writer quotes Notes that contain commas
    writer = csv.writer(target)
    for rownum, row in enumerate(reader):
        rowData = row[:5]
        if rownum == 0:
            # Save header row and add year of death
            writer.writerow(rowData + ["Year of Death"])
            continue
        # Take the Notes section and divide it into words
        words = row[4].split()
        year = ""
        # Go through each word and use clues to find the date
        for word in words:
            # Ignore punctuation around dates, e.g. "1942." or "(1945)"
            word = word.strip(".,;:()")
            # Can we find a date that looks like 19xx or 20xx?
            if len(word) == 4 and word.isdigit() and word[:2] in ("19", "20"):
                # An explicit year wins over a war mention; stop at the first one
                year = word
                break
            # Can we find a description that mentions WW1 or WW2?
            elif word == "I" and not year:
                year = "1918"
            elif word == "II" and not year:
                year = "1945"
        # If no date was found, the Year of Death entry is left empty
        writer.writerow(rowData + [year])

It’s nothing fancy, but it gets the job done. I go through and look for a date in the Notes. If I can’t find that, then I look for a mention of either World War I or World War II. I decided to generalize those dates to the end of the war (1918 or 1945), since we know that they were definitely deceased by then. This covers pretty much all the entries in this data set. There are a few outliers that I had to look up and do by hand, but that is definitely better than going through all 575 entries.
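For comparison, the same two clues (an explicit year first, a war mention as the fallback) could also be expressed with regular expressions. This is only a sketch of an alternative, not the script used for the analysis: the function name `year_of_death`, the pattern names, and the "WWI"/"WW2" shorthand handling are assumptions added for illustration.

```python
import re

# Four-digit years from 1900 to 2099 that are not part of a longer number
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
# "World War I" / "World War II", plus shorthand such as "WWII" or "WW2"
# (the shorthand forms are an assumption about how notes might be written)
WAR_RE = re.compile(r"\b(?:World War|WW)\s?(II|I|2|1)\b")

# Generalize a war mention to the year the war ended
WAR_END_YEAR = {"I": "1918", "1": "1918", "II": "1945", "2": "1945"}


def year_of_death(notes):
    """Return a year as a string, or "" when the notes hold no clue."""
    match = YEAR_RE.search(notes)
    if match:
        return match.group(0)
    match = WAR_RE.search(notes)
    if match:
        return WAR_END_YEAR[match.group(1)]
    return ""


print(year_of_death("27 December 1942. Died in a Japanese internment camp in Burma."))
print(year_of_death("Died in World War II (per Volker Kluge) date and place not known."))
```

Like the word-by-word version, this takes the first year that appears in the text, so a note that mentions a birth year or an Olympic year before the death would still need a manual check.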

And that’s it! Run the code and you have yourself a new CSV with the Year of Death included.
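To find the few entries that still need to be looked up by hand, one option is to list the rows where the Year of Death column came out empty. The sketch below is an assumption about how that check could look: `rows_missing_year` is a hypothetical helper, and the sample rows are made up so the example runs without the real file.

```python
import csv
import io


def rows_missing_year(csv_file, column="Year of Death"):
    """Return the rows whose year column is empty or missing."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if not (row.get(column) or "").strip()]


# Small in-memory stand-in for warLossOlympiansDate.csv (made-up rows)
sample = io.StringIO(
    "Athlete,Gender,Country,Sport,Notes,Year of Death\n"
    "Athlete A,Male,GBR,Cycling,Died in January 1945.,1945\n"
    "Athlete B,Male,GER,Athletics,Date and place not known.,\n"
)

for row in rows_missing_year(sample):
    print(row["Athlete"], "->", row["Notes"])
```

With the real output, the same function could be called on `open('warLossOlympiansDate.csv', newline='')` instead of the in-memory sample.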

The full code and data can be found here.

Next time you find yourself having to carve through text to get numerical data, try this technique. If you found this data interesting, check out the full Spotfire analysis on Exchange.AI!
