
Reading Wikipedia XML Dumps with Python

Wikipedia contains a vast amount of data, and it is possible to make use of this data in computer programs for a variety of purposes. However, the sheer size of Wikipedia makes this difficult. You should not scrape the live Wikipedia site programmatically: doing so would generate a large volume of additional traffic for Wikipedia and would likely get your IP address banned. Instead, download an offline copy of Wikipedia for your use. A variety of Wikipedia dump files are available; for this demonstration we will use the XML file that contains just the latest version of each Wikipedia article. The file that you will need to download is named:

enwiki-latest-pages-articles.xml

This file can be found at the following location:

https://dumps.wikimedia.org/enwiki/latest/

The file is distributed in compressed form (bzip2), so you must decompress it before use.
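If you prefer to do the decompression from Python rather than with a command-line tool, a minimal sketch such as the following works. It assumes the download is the bzip2-compressed file enwiki-latest-pages-articles.xml.bz2 sitting in the current directory; adjust the names to match your download.

import bz2
import shutil

# Hypothetical local file names; adjust to match where you saved the download.
COMPRESSED = 'enwiki-latest-pages-articles.xml.bz2'
DECOMPRESSED = 'enwiki-latest-pages-articles.xml'

# Stream-decompress in chunks so the whole file never has to fit in memory.
with bz2.open(COMPRESSED, 'rb') as src, open(DECOMPRESSED, 'wb') as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)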

Format of the Wikipedia XML Dump

Do not try to open the enwiki-latest-pages-articles.xml file directly with an XML or text editor, as it is very large. The listing below shows the beginning of this file. As you can see, the file is made up of page elements that contain revision elements.

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
  ...
</mediawiki>
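If you would like to peek at the start of the dump yourself, a few lines of Python are safer than a text editor. This is just a convenience sketch, not part of the example program; adjust the path to point at your copy of the file.

# Print the first 50 lines of the dump without loading the whole file.
with open('enwiki-latest-pages-articles.xml', encoding='utf-8') as fh:
    for _ in range(50):
        print(fh.readline(), end='')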

To read this file, it is important that the XML is streamed rather than read into memory all at once, as a DOM parser would do. The iterparse function in the xml.etree.ElementTree module can be used to do this. The following imports are needed for this example. For the complete source code, see the following GitHub link.

import xml.etree.ElementTree as etree
import codecs
import csv
import time
import os

The following constants specify the path, the input XML file, the three output CSV files, and the text encoding. Adjust the path to the location on your computer that holds the Wikipedia articles XML dump.

PATH_WIKI_XML = 'C:\\Users\\jeffh\\data\\'
FILENAME_WIKI = 'enwiki-latest-pages-articles.xml'
FILENAME_ARTICLES = 'articles.csv'
FILENAME_REDIRECT = 'articles_redirect.csv'
FILENAME_TEMPLATE = 'articles_template.csv'
ENCODING = "utf-8"

This example program will separate the articles, redirects and templates into three CSV files.

I use the following function to display the elapsed time. This program typically took about 30 minutes to run on my computer.

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
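For example, 3,725.5 seconds formats as follows:

print(hms_string(3725.5))  # prints 1:02:05.50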

The following function is used to strip the namespaces from the tags.

def strip_tag_name(t):
    # Remove the '{namespace}' prefix that ElementTree adds to tag names
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t
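To see why this is needed: iterparse reports tag names qualified with the namespace declared on the dump's root element, so a quick check (a throwaway sketch, not part of the program) looks like this:

# ElementTree prefixes each tag with the export namespace shown above.
qualified = '{http://www.mediawiki.org/xml/export-0.10/}page'
print(strip_tag_name(qualified))  # prints: page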

Set up the filenames according to the path:

pathWikiXML = os.path.join(PATH_WIKI_XML, FILENAME_WIKI)
pathArticles = os.path.join(PATH_WIKI_XML, FILENAME_ARTICLES)
pathArticlesRedirect = os.path.join(PATH_WIKI_XML, FILENAME_REDIRECT)
pathTemplateRedirect = os.path.join(PATH_WIKI_XML, FILENAME_TEMPLATE)

Reset counters to track the types of pages found.

totalCount = 0
articleCount = 0
redirectCount = 0
templateCount = 0
title = None
start_time = time.time()

Open the three CSV output files and write their headers. The XML file will then be streamed and the data found in it written to these files.

with codecs.open(pathArticles, "w", ENCODING) as articlesFH, \
        codecs.open(pathArticlesRedirect, "w", ENCODING) as redirectFH, \
        codecs.open(pathTemplateRedirect, "w", ENCODING) as templateFH:
    articlesWriter = csv.writer(articlesFH, quoting=csv.QUOTE_MINIMAL)
    redirectWriter = csv.writer(redirectFH, quoting=csv.QUOTE_MINIMAL)
    templateWriter = csv.writer(templateFH, quoting=csv.QUOTE_MINIMAL)

    articlesWriter.writerow(['id', 'title', 'redirect'])
    redirectWriter.writerow(['id', 'title', 'redirect'])
    templateWriter.writerow(['id', 'title'])

Process all of the start/end events and obtain the name (tname) of each tag.

    for event, elem in etree.iterparse(pathWikiXML, events=('start', 'end')):
        tname = strip_tag_name(elem.tag)

        if event == 'start':
            if tname == 'page':
                title = ''
                id = -1
                redirect = ''
                inrevision = False
                ns = 0
            elif tname == 'revision':
                # Do not pick up on revision id's
                inrevision = True

For end tags, collect the title, id, redirect, ns, and page elements, which mean:

title - The title of the page.
id - The internal Wikipedia ID for the page.
redirect - What this page redirects to.
ns - The namespace, which helps identify what type of page this is. Namespace 10 is a template page.
page - The actual page (contains the previously listed tags).

The following code processes these tag types:

        else:
            if tname == 'title':
                title = elem.text
            elif tname == 'id' and not inrevision:
                id = int(elem.text)
            elif tname == 'redirect':
                redirect = elem.attrib['title']
            elif tname == 'ns':
                ns = int(elem.text)

Once a page element ends, we can write out the values collected for that page.

            elif tname == 'page':
                totalCount += 1

                if ns == 10:
                    templateCount += 1
                    templateWriter.writerow([id, title])
                elif len(redirect) > 0:
                    # Pages with a redirect target go to the redirect file
                    redirectCount += 1
                    redirectWriter.writerow([id, title, redirect])
                else:
                    articleCount += 1
                    articlesWriter.writerow([id, title, redirect])

Display a status update every 100,000 pages, and clear the element to free memory.

                if totalCount > 1 and (totalCount % 100000) == 0:
                    print("{:,}".format(totalCount))

                elem.clear()

Display final stats.

elapsed_time = time.time() - start_time

print("Total pages: {:,}".format(totalCount))
print("Template pages: {:,}".format(templateCount))
print("Article pages: {:,}".format(articleCount))
print("Redirect pages: {:,}".format(redirectCount))
print("Elapsed time: {}".format(hms_string(elapsed_time)))
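As a quick sanity check on the output, the generated CSV files can be read back with the standard csv module. This is a minimal sketch that reuses the pathArticles and ENCODING values defined above and simply counts the rows in articles.csv:

import csv

with open(pathArticles, encoding=ENCODING, newline='') as fh:
    reader = csv.reader(fh)
    next(reader)  # skip the header row: ['id', 'title', 'redirect']
    row_count = sum(1 for _ in reader)

print("Rows in {}: {:,}".format(FILENAME_ARTICLES, row_count))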

