Tags: data / web

In this hands-on workshop, you will learn to parse wikitext from beginning to end. Using the German Wiktionary as our example, we will fetch the data, parse the XML, and process the wikitext to extract linguistic content such as parts of speech, meanings, and inflections.

We will cover the following points:

  • Fetching the Data: Learn two ways to retrieve wiki data, either through the wiki's Special:Export tool or by downloading wiki dump files (see the first sketch after this list).
  • Parsing the XML Files: Once the data is retrieved in XML format, this section explains how to parse the files and extract the wikitext (see the second sketch below).
  • Parsing the Wikitext: In the final part, we will parse the wikitext itself and extract elements such as headings, sections, word forms, meanings, and inflections (see the third sketch below).
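
As a small taste of the first step, here is a minimal sketch that fetches a single page through Special:Export using the requests library. The page title schön and the timeout value are illustrative choices, not part of the workshop materials.

```python
import requests

# Hypothetical example entry; any German Wiktionary page title works.
TITLE = "schön"

# Special:Export returns a page, including its wikitext, wrapped in XML.
# The canonical "Special:Export" alias also works on the German Wiktionary,
# where the page is localized as "Spezial:Exportieren".
url = f"https://de.wiktionary.org/wiki/Special:Export/{TITLE}"

response = requests.get(url, timeout=30)
response.raise_for_status()

xml_text = response.text  # MediaWiki export XML, parsed in the next step
print(xml_text[:200])
```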
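
For the XML step, a minimal sketch using the standard library's xml.etree.ElementTree. The export XML declares a versioned default namespace (for example http://www.mediawiki.org/xml/export-0.11/; the exact version can vary), so the sketch matches the <text> element by its local name instead of hard-coding a namespace.

```python
import requests
import xml.etree.ElementTree as ET

# Fetch the same illustrative page as in the previous sketch.
resp = requests.get(
    "https://de.wiktionary.org/wiki/Special:Export/schön", timeout=30
)
resp.raise_for_status()
root = ET.fromstring(resp.content)

# The wikitext sits in a <text> element inside <page>/<revision>.
# Match on the local tag name so a namespace version bump does not break us.
wikitext = next(
    (el.text for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "text"),
    None,
)
print(wikitext[:300] if wikitext else "no <text> element found")
```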
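
Finally, a regex-based sketch of the kind of extraction the last part covers. This is not the workshop's parser, just a minimal illustration: wikitext marks section headings with matching runs of equals signs, and the fragment below only mimics the shape of a German Wiktionary entry.

```python
import re

# Tiny made-up fragment shaped like a German Wiktionary entry.
wikitext = """\
== schön ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===
{{Bedeutungen}}
"""

# A heading looks like "== Title ==": the same number of equals signs
# (2 to 6 in practice) on both sides of the title.
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)

for match in HEADING.finditer(wikitext):
    level = len(match.group(1))
    title = match.group(2)
    print(level, title)
```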

What You Need

We will use Google Colab, a free, cloud-based platform for running Python code in a Jupyter Notebook environment.

To participate:

  • Sign in with your Google account.
  • Have a stable Internet connection.
  • You can follow along on a tablet, but a laptop is recommended for editing code.

All materials will be available at: https://lennon-c.github.io/python-wikitext-parser-guide/pycon_at

If you prefer to run the code on your own machine, the website also includes installation instructions and download links for the source code and data.

Speaker

Carolina Lennon

An economist by training with a passion for Python programming, Carolina is deeply grateful to the Python and open-source community for countless hours of learning and joy.