Wikipedia contains a vast amount of data, and it is possible to make use of this data in computer programs for a variety of purposes. However, the sheer size of Wikipedia makes this difficult. You should not pull data from Wikipedia's servers programmatically; such access would generate a large volume of additional traffic for Wikipedia and would likely result in your IP address being banned. Rather, you should download an offline copy of Wikipedia for your use. There are a variety of Wikipedia dump files available; for this demonstration we will make use of the XML file that contains just the latest version of each Wikipedia article. The file that you will need to download is named:

enwiki-latest-pages-articles.xml.bz2
The file is compressed with bzip2, so you must decompress it before use.
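As an aside, if you would rather not store the decompressed file, which is considerably larger than the download, Python's bz2 module can stream the compressed dump directly into the parser used later in this article. A minimal sketch, assuming the file name above:

import bz2
import xml.etree.ElementTree as etree

# Stream-parse the compressed dump; bz2.open returns a file object
# that iterparse can consume incrementally, without decompressing
# the whole file to disk first.
with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
    for event, elem in etree.iterparse(f, events=('start', 'end')):
        pass  # process each element here, as shown below

The remainder of this article assumes the decompressed XML file.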
Format of the Wikipedia XML Dump
Do not try to open the enwiki-latest-pages-articles.xml file directly with an XML or text editor, as it is very large. The listing below shows the beginning of this file. As you can see, the file is made up of page tags that contain revision tags.
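The original listing is not reproduced here; the following sketch shows the general shape of the dump (the namespace version, attributes, and the elements abbreviated as "..." vary by dump):

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    ...
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>...</id>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
  ...
</mediawiki>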
To read this file it is important that the XML be streamed, not read directly into memory as a DOM parser would do. The iterparse function from the xml.etree.ElementTree module can be used to do this. The following imports are needed for this example. For the complete source code see the following GitHub link.
import xml.etree.ElementTree as etree
import codecs
import csv
import time
import os
The following constants are defined to specify the three export files and the path. Adjust the path to the location on your computer that holds the Wikipedia articles XML dump.
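This excerpt omits the constants block itself. A minimal sketch, assuming the three export files are CSV files for articles, redirects, and templates (the file and variable names, other than pathWikiXML, which the parsing loop below uses, are illustrative):

# Adjust PATH_WIKI_XML to your own system; the export file names
# below are assumptions made for this sketch.
PATH_WIKI_XML = '/path/to/wikipedia/'
FILENAME_WIKI = 'enwiki-latest-pages-articles.xml'
FILENAME_ARTICLES = 'articles.csv'
FILENAME_REDIRECT = 'articles_redirect.csv'
FILENAME_TEMPLATE = 'articles_template.csv'

pathWikiXML = os.path.join(PATH_WIKI_XML, FILENAME_WIKI)
pathArticles = os.path.join(PATH_WIKI_XML, FILENAME_ARTICLES)
pathArticlesRedirect = os.path.join(PATH_WIKI_XML, FILENAME_REDIRECT)
pathTemplateRedirect = os.path.join(PATH_WIKI_XML, FILENAME_TEMPLATE)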
Process all of the start/end tags and obtain the name (tname) of each tag.
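The loop below relies on a helper, strip_tag_name, that is not shown in this excerpt. ElementTree reports each tag with the MediaWiki export namespace prefixed in braces, for example {http://www.mediawiki.org/xml/export-0.10/}page, so a minimal sketch of the helper simply strips everything up to the closing brace:

def strip_tag_name(t):
    # Tags arrive as '{namespace}name'; keep only the name part.
    idx = t.rfind('}')
    if idx != -1:
        t = t[idx + 1:]
    return t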
for event, elem in etree.iterparse(pathWikiXML, events=('start', 'end')):
    tname = strip_tag_name(elem.tag)

    if event == 'start':
        if tname == 'page':
            title = ''
            id = -1
            redirect = ''
            inrevision = False
            ns = 0
        elif tname == 'revision':
            # Do not pick up on revision id's
            inrevision = True
For end tags, collect the values of the title, id, redirect, ns, and page tags, which mean the following:
title - The title of the page.
id - The internal Wikipedia ID for the page.
redirect - What this page redirects to.
ns - The namespace, which identifies what type of page this is. Namespace 10 is a template page.
page - The actual page (contains the previously listed tags).
The following code processes these tag types:
    else:
        # End events: the element's text is now complete, so capture it.
        if tname == 'title':
            title = elem.text
        elif tname == 'id' and not inrevision:
            # Only the page-level id, not the ids inside <revision>.
            id = int(elem.text)
        elif tname == 'redirect':
            redirect = elem.attrib['title']
        elif tname == 'ns':
            ns = int(elem.text)
Once a page ends, we can write out the values collected for it.
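The rest of the loop body is not included in this excerpt. A sketch of the idea, continuing the elif chain above (the CSV writer names are assumptions; they would be opened before the loop): a page whose ns is 10 goes to the template file, a page with a redirect target goes to the redirect file, and everything else is treated as an article. Calling elem.clear() afterwards is what keeps memory use flat, since iterparse otherwise retains every parsed element.

        elif tname == 'page':
            if ns == 10:
                # Namespace 10 marks a template page.
                templateWriter.writerow([id, title])
            elif len(redirect) > 0:
                # Redirect pages record their target title.
                redirectWriter.writerow([id, title, redirect])
            else:
                articleWriter.writerow([id, title, redirect])

            # Essential for streaming: discard the processed element
            # so the parse does not accumulate the whole dump in memory.
            elem.clear()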