
Parsing Text: Numerical Data from Historical Accounts


In a recent article on Data Shop Talk, I introduced an interesting set of analyses on Olympic data. One of the analyses focused on Olympians who had died or gone missing due to war. The data from Sports-Reference.com came as a CSV file with the following columns: Athlete, Gender, Country, Sport, and Notes. This was a great start to my analysis, but something critical was missing for illustrating the timeline of deaths: the date of death.

The data was there, but it was embedded in the description under “Notes”. These notes were not standardized, and the date of death could not be easily pulled out with Spotfire alone. So I wrote a simple Python script to parse the Notes and get the year of death.

The data I used can be found here. Let’s take a quick look at some example entries:

27 December 1942. Died in a Japanese internment camp in Burma.

In January 1945 Bevan a lieutenant in the Hampshire regiment of the British Army was involved in an operation to capture the towns of Putt and Walderath in the Heisberg district of Western Germany. He was fatally wounded by a mine and was buried across the border in the Netherlands at the military cemetery in the town of Brunssum. The Alick Bevan Plate Meeting at the Crystal Palace circuit in London is contested in his honour.

Died in World War II (per Volker Kluge) date and place not known.

As we can see, some dates come before the description, some are embedded in the story, and other entries only say which war the athlete died in. However, we can use all of these clues to estimate when they died. All we need to do is look for a date or see if a war is mentioned.

Here’s the code I came up with real quick:

import csv

# Python 3: open CSV files in text mode with newline=''
with open('warLossOlympians.csv', newline='') as source, \
        open('warLossOlympiansDate.csv', 'w', newline='') as target:
    reader = csv.reader(source)
    # csv.writer quotes fields, so commas inside Notes don't break columns
    writer = csv.writer(target)
    for rownum, row in enumerate(reader):
        rowData = row[:5]
        if rownum == 0:
            # Save header row and add year of death
            writer.writerow(rowData + ["Year of Death"])
            continue
        # Take the Notes section and divide it into words
        words = row[4].split()
        year = ""  # If no date is found, leave entry empty
        # Go through each word and use clues to find the date
        for word in words:
            # Ignore punctuation around the word, e.g. "1942." or "II)"
            word = word.strip(".,;:()")
            # Can we find a date that starts with a 19xx or 20xx?
            if len(word) >= 4 and word[:2] in ("19", "20") and word[:4].isdigit():
                year = word[:4]
            # Can we find a description that mentions WW1 or WW2?
            elif word == "I":
                year = "1918"
            elif word == "II":
                year = "1945"
            # Stop at the first clue so each row is written exactly once
            if year:
                break
        # Add the new data
        writer.writerow(rowData + [year])

It’s nothing fancy but it gets the job done. I go through and look for a date in the Notes. If I can’t find that, then I look for a mention of either World War I or World War II. I decided to generalize those dates to the end of the war (1918 or 1945) since we know that they were definitely deceased by then. This covers pretty much all the entries in this data set. There were a few outliers that I had to look up and fix by hand, but that is definitely better than going through all 575 entries manually.
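
For comparison, the same clues can also be written as regular expressions. This is only a sketch, not the script used for the analysis: the helper name year_of_death and both patterns are my own assumptions. It also behaves a little differently, since any year found in the Notes wins over a war mention, and the war has to be spelled out as "World War I" or "World War II".

```python
import re

# Hypothetical helper and patterns, not part of the original script.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")       # a standalone 19xx or 20xx year
WAR_RE = re.compile(r"\bWorld War (II|I)\b")      # try "II" before "I"
WAR_END = {"I": "1918", "II": "1945"}             # end-of-war fallback years


def year_of_death(notes):
    """Return the first 19xx/20xx year in the notes, else the end year
    of a named world war, else an empty string."""
    match = YEAR_RE.search(notes)
    if match:
        return match.group(0)
    war = WAR_RE.search(notes)
    if war:
        return WAR_END[war.group(1)]
    return ""


examples = [
    "27 December 1942. Died in a Japanese internment camp in Burma.",
    "Died in World War II (per Volker Kluge) date and place not known.",
    "Date and place of death not known.",  # made-up entry with no clue
]
for notes in examples:
    print(repr(year_of_death(notes)))
```

The word-boundary anchors mean a trailing period after the year is ignored without any special case, at the cost of skipping forms such as "1940s".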

And that’s it! Run the code and you have yourself a new CSV with the Year of Death included.
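
To find the entries that still need a manual lookup, one option is to scan the new file for rows where the year is empty. The sketch below assumes the output has a header row with the columns Athlete, Notes, and Year of Death; the function name rows_missing_year and the sample rows are made up for illustration.

```python
import csv
import io


def rows_missing_year(lines):
    """Return the rows whose 'Year of Death' column is empty."""
    reader = csv.DictReader(lines)
    return [row for row in reader if not row["Year of Death"].strip()]


# Made-up stand-in for warLossOlympiansDate.csv. With the real file, pass
# open('warLossOlympiansDate.csv', newline='') instead of this StringIO.
sample = io.StringIO(
    "Athlete,Gender,Country,Sport,Notes,Year of Death\n"
    "Athlete A,Male,GBR,Cycling,27 December 1942. Died in a camp.,1942\n"
    "Athlete B,Male,GER,Rowing,Date and place not known.,\n"
)
for row in rows_missing_year(sample):
    print(row["Athlete"], "->", row["Notes"])
```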

The full code and data can be found here.

Next time you find yourself having to carve through text to get numerical data, try this technique. If you found this data interesting, check out the full Spotfire analysis on Exchange.AI!

