
Reading Wikipedia XML Dumps with Python

Wikipedia contains a vast amount of data, and it is possible to make use of this data in computer programs for a variety of purposes. However, the sheer size of Wikipedia makes this difficult. You should not scrape the live Wikipedia site programmatically: doing so would generate a large volume of additional traffic for Wikipedia and would likely get your IP address banned. Instead, download an offline copy of Wikipedia for your use. A variety of Wikipedia dump files are available; for this demonstration we will use the XML file that contains just the latest version of each Wikipedia article. The file that you will need to download is named:

enwiki-latest-pages-articles.xml

This file can be found at the following location:

https://dumps.wikimedia.org/enwiki/latest/

The file is distributed in compressed form (bzip2), so you must decompress it before use.
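If you prefer to do the decompression from Python rather than with a command-line tool, a minimal sketch such as the following works. It assumes the download is the bzip2-compressed file enwiki-latest-pages-articles.xml.bz2 sitting in the current directory; adjust the names to match your download.

import bz2
import shutil

# Hypothetical local file names; adjust to match where you saved the download.
COMPRESSED = 'enwiki-latest-pages-articles.xml.bz2'
DECOMPRESSED = 'enwiki-latest-pages-articles.xml'

# Stream-decompress in chunks so the whole file never has to fit in memory.
with bz2.open(COMPRESSED, 'rb') as src, open(DECOMPRESSED, 'wb') as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)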

Format of the Wikipedia XML Dump

Do not try to open the enwiki-latest-pages-articles.xml file directly with an XML or text editor, as it is very large. The listing below shows the beginning of this file. As you can see, the file is made up of page elements that contain revision elements.

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
  ...
</mediawiki>
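If you would like to peek at the start of the dump yourself, a few lines of Python are safer than a text editor. This is just a convenience sketch, not part of the example program; adjust the path to point at your copy of the file.

# Print the first 50 lines of the dump without loading the whole file.
with open('enwiki-latest-pages-articles.xml', encoding='utf-8') as fh:
    for _ in range(50):
        print(fh.readline(), end='')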

To read this file, it is important that the XML is streamed rather than read into memory all at once, as a DOM parser would do. The iterparse function in the xml.etree.ElementTree module can be used to do this. The following imports are needed for this example. For the complete source code, see the following GitHub link.

import xml.etree.ElementTree as etree
import codecs
import csv
import time
import os

The following constants specify the path, the input XML file, the three output CSV files, and the text encoding. Adjust the path to the location on your computer that holds the Wikipedia articles XML dump.

PATH_WIKI_XML = 'C:\\Users\\jeffh\\data\\'
FILENAME_WIKI = 'enwiki-latest-pages-articles.xml'
FILENAME_ARTICLES = 'articles.csv'
FILENAME_REDIRECT = 'articles_redirect.csv'
FILENAME_TEMPLATE = 'articles_template.csv'
ENCODING = "utf-8"

This example program will separate the articles, redirects and templates into three CSV files.

I use the following function to display the elapsed time. This program typically took about 30 minutes to run on my computer.

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
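For example, 3,725.5 seconds formats as follows:

print(hms_string(3725.5))  # prints 1:02:05.50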

The following function is used to strip the namespaces from the tags.

def strip_tag_name(t):
    # Remove the '{namespace}' prefix that ElementTree adds to tag names
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t
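To see why this is needed: iterparse reports tag names qualified with the namespace declared on the dump's root element, so a quick check (a throwaway sketch, not part of the program) looks like this:

# ElementTree prefixes each tag with the export namespace shown above.
qualified = '{http://www.mediawiki.org/xml/export-0.10/}page'
print(strip_tag_name(qualified))  # prints: page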

Set up the filenames according to the path:

pathWikiXML = os.path.join(PATH_WIKI_XML, FILENAME_WIKI)
pathArticles = os.path.join(PATH_WIKI_XML, FILENAME_ARTICLES)
pathArticlesRedirect = os.path.join(PATH_WIKI_XML, FILENAME_REDIRECT)
pathTemplateRedirect = os.path.join(PATH_WIKI_XML, FILENAME_TEMPLATE)

Reset counters to track the types of pages found.

totalCount = 0
articleCount = 0
redirectCount = 0
templateCount = 0
title = None
start_time = time.time()

Open the three CSV output files and write their headers. The XML file will then be streamed and the data found in it written to these files.

with codecs.open(pathArticles, "w", ENCODING) as articlesFH, \
        codecs.open(pathArticlesRedirect, "w", ENCODING) as redirectFH, \
        codecs.open(pathTemplateRedirect, "w", ENCODING) as templateFH:
    articlesWriter = csv.writer(articlesFH, quoting=csv.QUOTE_MINIMAL)
    redirectWriter = csv.writer(redirectFH, quoting=csv.QUOTE_MINIMAL)
    templateWriter = csv.writer(templateFH, quoting=csv.QUOTE_MINIMAL)

    articlesWriter.writerow(['id', 'title', 'redirect'])
    redirectWriter.writerow(['id', 'title', 'redirect'])
    templateWriter.writerow(['id', 'title'])

Process all of the start/end events and obtain the name (tname) of each tag.

    for event, elem in etree.iterparse(pathWikiXML, events=('start', 'end')):
        tname = strip_tag_name(elem.tag)

        if event == 'start':
            if tname == 'page':
                title = ''
                id = -1
                redirect = ''
                inrevision = False
                ns = 0
            elif tname == 'revision':
                # Do not pick up on revision id's
                inrevision = True

For end tags, collect the title, id, redirect, ns, and page elements, which mean:

title - The title of the page.
id - The internal Wikipedia ID for the page.
redirect - What this page redirects to.
ns - The namespace, which helps identify what type of page this is. Namespace 10 is a template page.
page - The actual page (contains the previously listed tags).

The following code processes these tag types:

        else:
            if tname == 'title':
                title = elem.text
            elif tname == 'id' and not inrevision:
                id = int(elem.text)
            elif tname == 'redirect':
                redirect = elem.attrib['title']
            elif tname == 'ns':
                ns = int(elem.text)

Once a page element ends, we can write out the values collected for that page.

            elif tname == 'page':
                totalCount += 1

                if ns == 10:
                    templateCount += 1
                    templateWriter.writerow([id, title])
                elif len(redirect) > 0:
                    # Pages with a redirect target go to the redirect file
                    redirectCount += 1
                    redirectWriter.writerow([id, title, redirect])
                else:
                    articleCount += 1
                    articlesWriter.writerow([id, title, redirect])

Display a status update every 100,000 pages, and clear the element to free memory.

                if totalCount > 1 and (totalCount % 100000) == 0:
                    print("{:,}".format(totalCount))

                elem.clear()

Display final stats.

elapsed_time = time.time() - start_time

print("Total pages: {:,}".format(totalCount))
print("Template pages: {:,}".format(templateCount))
print("Article pages: {:,}".format(articleCount))
print("Redirect pages: {:,}".format(redirectCount))
print("Elapsed time: {}".format(hms_string(elapsed_time)))
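As a quick sanity check on the output, the generated CSV files can be read back with the standard csv module. This is a minimal sketch that reuses the pathArticles and ENCODING values defined above and simply counts the rows in articles.csv:

import csv

with open(pathArticles, encoding=ENCODING, newline='') as fh:
    reader = csv.reader(fh)
    next(reader)  # skip the header row: ['id', 'title', 'redirect']
    row_count = sum(1 for _ in reader)

print("Rows in {}: {:,}".format(FILENAME_ARTICLES, row_count))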

