Wikipedia Multistream Database Dumps
Introduction

About a month ago, I decided I wanted to download a database dump of Wikipedia. Why, you ask? Science isn't about why, it's about why not! Of course, I would like this to be somewhat useful, so in this article we'll be figuring out how the dump is structured and how we can read it with a short script.

Wikipedia archive downloads

In order to process these archives, we first need to download them. Somewhat unsurprisingly, we can get them from Wikipedia itself. This page has links to various mirrors that host the database dumps. They are fairly large even when compressed (the May 2025 dump is 25 GB), which makes it difficult to parse the dump as a whole. However, the folks over at Wikipedia thought of that, and they provide a "multistream" version of the dump that doesn't need to be fully decompressed in order to view its contents. With the help of a supplementary index file, we can decompress individual chunks of the database and parse those on their own. For this article, we'll be using the dump from May 2025. ...
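To make the multistream idea concrete, here is a minimal sketch in Python of how the index file can be used to decompress a single chunk. The file names below are assumptions for a May 2025 English dump; the index format of one "offset:page_id:title" entry per line, where the offset is the byte position of the compressed stream containing that page, is what the multistream index provides.

import bz2

# Assumed file names; adjust to the dump you actually downloaded.
INDEX_FILE = "enwiki-20250501-pages-articles-multistream-index.txt.bz2"
DUMP_FILE = "enwiki-20250501-pages-articles-multistream.xml.bz2"


def find_offset(title):
    """Scan the index for a page title and return its stream's byte offset.

    Each index line has the form 'offset:page_id:title'.
    """
    with bz2.open(INDEX_FILE, "rt", encoding="utf-8") as index:
        for line in index:
            offset, _, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    return None


def read_stream(offset):
    """Decompress one stream (roughly 100 pages) starting at the given offset."""
    with open(DUMP_FILE, "rb") as dump:
        dump.seek(offset)
        decompressor = bz2.BZ2Decompressor()
        chunks = []
        # Feed data until the decompressor reports the end of this stream.
        while not decompressor.eof:
            block = dump.read(256 * 1024)
            if not block:
                break
            chunks.append(decompressor.decompress(block))
        return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    offset = find_offset("Python (programming language)")
    if offset is not None:
        xml_fragment = read_stream(offset)
        # The result is a run of <page> elements with no root wrapper.
        print(xml_fragment[:500])

The key point is that each compressed stream is a complete bz2 member, so seeking to the offset from the index and decompressing until end-of-stream yields a small, self-contained XML fragment instead of the whole 25 GB archive.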
Initial Commit
Hello everyone, and welcome to my website! My name is Ethan Clark. I am currently an undergraduate student at Dakota State University, majoring in Computer Science and Cyber Operations. On this site I hope to post about projects and experiences I have, most of which will be computer-related. More specifically, some of my areas of interest are system administration, programming, reverse engineering, and self-hosting. There might be some other things that sneak in from time to time, though. ...