Archive Team: Shutdowns don’t stop during the weekends

The Internet does not forget? The opposite is the case. Huge mountains of data are lost on the internet every day. That’s why the Archive Team scours the vastness of the WWW and rescues data from extinction – around the clock and on a voluntary basis.

Logo shown after the start of a Warrior VM CC-BY-SA 4.0

The Archive Team is not an official organization but a loose group. It consists of people who have organized themselves through IRC channels. They are united by one goal: archiving the Internet.

The team develops its own web scrapers to collect data and hosts them at the Internet Archive. The data is accessible through the so-called Wayback Machine. With the help of the Wayback Machine, users can then scroll through the Internet over several decades. In most cases, there is a note about when the Archive team made the data available. Even though they provide the scraped data to this archive, the group is not affiliated with the Internet Archive.

Without the Archive Team’s tireless work, much of the historical Internet would likely be lost. We spoke with three members of the team, which does not have an official spokesperson, about their work.

How was the Archive Team started?

arkiver: Archive Team was founded by Jason Scott in 2009. These days, though, he is not around much. The collection on the Internet Archive (IA) was created in 2011. I joined the Archive Team almost 10 years ago.

Running servers with a lot of bandwidth can be expensive. How do you pay for it?

JustAnotherArchivist: So far, there have been no large monetary contributions. Rather, individual people rent servers for use by our scripts and software, for example. They pay for this with their own money from their regular jobs. It would be quite accurate to call the whole thing an expensive hobby.

The team has been pursuing this volunteer hobby for 14 years now. That is a really long time span. You don’t receive any monetary reward for this voluntary work. What keeps you going?

arkiver: Archiving the web is incredibly important and interesting work. New technical problems keep emerging, and they are interesting to solve. Furthermore, the people at the Archive Team are very motivated, which also keeps me going.

Why is your work so important?

arkiver: I believe that our work is an important contribution to preserving history. That applies especially to the content of websites that belong to media outlets, governments, and others. This involves so-called outlinks: links from a specific URL on one domain to a URL on another domain (links going out). In our case, we’re especially interested in outlinks that are posted by humans, because if a human posts a link to some other place online, it means that this place has some value to that user and may thus be valuable overall.

Of course, we can’t store everything. The Internet is huge, and if we tried to follow every single URL to archive everything, we would fail. The number of URLs and data out there becomes unmanageable rather quickly.

Therefore, we have to make a selection. For example, we can see what people are linking to on Reddit or on their blogs, or where there are links to government websites or reputable media. If a website is linked in certain places, there is a reason for it. Usually, the data behind these links has some value, and this usually makes it worth archiving.
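To make the notion of an outlink concrete, here is a minimal Python sketch. It is an illustration only, not Archive Team code: the names `LinkCollector` and `find_outlinks` are hypothetical, and it compares hostnames as a simplified stand-in for "domain".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_outlinks(page_url, html):
    """Return absolute http(s) links that point to a different host."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlsplit(page_url).hostname
    outlinks = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname != own_host:
            outlinks.append(absolute)
    return outlinks


page = '<a href="/about">About</a> <a href="https://example.org/report">Report</a>'
print(find_outlinks("https://blog.example.com/post/1", page))
# → ['https://example.org/report']
```

The relative link `/about` stays on the same host and is dropped; only the link to the other host counts as an outlink.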

“This whole thing is a group effort.”

JustAnotherArchivist: I fully agree with everything arkiver said. It’s a meaningful task. We don’t often hear from users of our archives, but when we do, it’s usually because they’re overjoyed to find that their content on some long-dead site still exists; was one example of a project in the last few years where particularly many people showed up.

Apart from that, with more and more information being born digitally and on the web, it’s important to preserve it for the future. One great example here is political events (government activities, elections, parties), where the relevant resources like public statements or campaign promises are often tricky to find in detail mere years later.

Then there’s the technical part: each project has its own challenges for which solutions don’t generally exist. Finally, there’s also an artistic aspect to it, as the trickier challenges often require creative solutions. And yeah, it’s relatively easy to make a significant contribution to this because it’s a very niche activity with few actors worldwide.

rewby: Doing archivism at the scale we do requires so much knowledge that it’s too much for any one person to contain. I have been with the Archive Team since 2021-ish. We all have areas we specialize in. Some of us know parts of the tangled mess of spaghetti code that holds this place together better than anyone else. This whole thing is a group effort.

I mostly handle uploading stuff to the IA, for example, but even just “uploading some files” gets really complicated when you’re dealing with gigabits of data and thousands to millions of files per minute.

On the other hand, I am not very knowledgeable about the code that actually runs on workers and grabs the pages that I end up having to upload.

arkiver: Among other things, we use so-called warriors for archiving. A Warrior is an instance of code on a person’s machine. The code runs a specific project that we make available to people. This can be a scraper that is deployed using that person’s IP address. Often, IP addresses are blocked for scraping. So we appreciate it if there are as many Warriors with different IPs as possible.

JustAnotherArchivist: I think we see a collision of terms here. I personally use “Warrior” only in a rather narrow sense for specific software (and its container images/virtual machine distributions) that can be used to simply contribute to our projects. I distinguish it from “workers” in general, which includes project-specific container images.

One petabyte of Telegram data

You scraped more than one petabyte of the Telegram web interface. How did you achieve this?

rewby: That depends on how deep you want to go. I haven’t looked at the Telegram scraping code that closely, but all warrior projects are fairly similar at a high level.

The basic process is as follows: We have a tracker that keeps track of units of work, called items. These items usually consist of things like single posts, forum threads, or other similar units. The definition of an “item” depends on the project.

Warriors are where the public’s participation comes in. Volunteers run these so-called warriors on anything from their family PC at home to clusters of high-performance servers in data centers. The warriors request items from the tracker and then run a specialized (project-specific) piece of code that grabs that particular piece of the site being archived.

It saves those into so-called Web ARChive files (WARC files). These files contain a byte-for-byte accurate copy of both the request sent to fetch something and the response from the server. Depending on how the project is structured, it may then attempt to parse the retrieved data to find more things to archive.

Then, depending on the project, it may either archive those immediately or submit them back to the tracker for someone else to pick up.
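The request-grab-report loop described so far can be sketched in a few lines of Python. This is a toy model under stated assumptions: `Tracker`, `run_worker`, and `FAKE_SITE` are hypothetical names, everything lives in memory, and the real tracker and project code are certainly more involved.

```python
from collections import deque


class Tracker:
    """Hypothetical in-memory tracker: hands out items and records results."""

    def __init__(self, seed_items):
        self.todo = deque(seed_items)
        self.seen = set(seed_items)
        self.done = []

    def request_item(self):
        # Hand out the next item, or None when the queue is empty.
        return self.todo.popleft() if self.todo else None

    def submit_discovered(self, items):
        # Queue newly discovered items, skipping ones already known.
        for item in items:
            if item not in self.seen:
                self.seen.add(item)
                self.todo.append(item)

    def mark_done(self, item):
        self.done.append(item)


def run_worker(tracker, grab):
    """Request items until the tracker has none left."""
    while (item := tracker.request_item()) is not None:
        discovered = grab(item)  # project-specific grabbing code
        tracker.submit_discovered(discovered)
        tracker.mark_done(item)


# Fake site: each item maps to the new items found while grabbing it.
FAKE_SITE = {"thread:1": ["post:10", "post:11"], "post:10": ["post:11"], "post:11": []}

tracker = Tracker(["thread:1"])
run_worker(tracker, lambda item: FAKE_SITE.get(item, []))
print(tracker.done)
# → ['thread:1', 'post:10', 'post:11']
```

Here the worker submits discovered items back to the tracker, which is one of the two options mentioned above; a project could instead archive them immediately inside `grab`.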

Finally, it takes the finalized WARC file for each item and uploads it to one of the target servers for a specific project. These files tend to be quite small – think kilobytes or megabytes. Since files that small come with a lot of overhead for places like the IA to store and process, we combine them into “megaWARCs” on the targets.

This process takes millions of tiny WARCs and combines them into a single, gigabyte-large WARC file. These megaWARCs are then uploaded to the IA by the targets. The IA then ingests them into their pipeline, which performs various indexing operations to make them usable and viewable on the Wayback Machine. This process usually takes a few days.
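One reason many small compressed files can be merged cheaply is that gzip streams can simply be concatenated: the result is still a valid multi-member gzip stream. The sketch below demonstrates that property in Python. Whether the Archive Team's tooling combines files this way is an assumption, and `make_record` produces a simplified placeholder, not a real WARC record.

```python
import gzip


def make_record(url, body):
    """Build a simplified stand-in for a record (NOT the real WARC format)."""
    header = f"RECORD {url} {len(body)}\r\n".encode("utf-8")
    return header + body + b"\r\n\r\n"


def combine(small_files):
    """Concatenate gzip-compressed files into one larger gzip stream."""
    return b"".join(small_files)


# Three tiny, individually compressed files, as a worker might upload them.
small = [
    gzip.compress(make_record(f"https://example.com/{i}", b"hello %d" % i))
    for i in range(3)
]

mega = combine(small)
restored = gzip.decompress(mega)  # reads all concatenated members
print(restored.count(b"RECORD "))
# → 3
```

Because each member stays independently compressed, a reader that knows a member's byte offset can decompress just that part, which is convenient for indexing.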

How the ArchiveBot works

Could you elaborate on using IRC for communication and the actual archiving?

JustAnotherArchivist: ArchiveBot is typically used for recursive crawls of websites. It starts from some URL, usually the homepage, and follows links within the site in a breadth-first recursive descent to exhaustion. In the default configuration, it also follows one layer of links to external hosts. These are external links that appear on the target site.

There are also modes for retrieving a single page or a list of URLs. Users control it via IRC, both to launch jobs and manipulate them as they run. For example, they can ignore URLs based on regex patterns or change the request rate. ArchiveBot has a public web interface that allows monitoring everything in detail, such as each URL being retrieved.
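The crawl strategy described here (breadth-first within the site, one layer of external links, URLs skipped by regex) can be sketched as follows. This is not ArchiveBot's code: `crawl`, `get_links`, and `FAKE_WEB` are hypothetical, and a dictionary stands in for real HTTP requests.

```python
import re
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url, get_links, ignore_patterns=()):
    """Breadth-first crawl of one site plus one layer of external links."""
    site_host = urlsplit(start_url).hostname
    ignores = [re.compile(p) for p in ignore_patterns]
    queue = deque([start_url])
    seen = {start_url}
    fetched = []
    while queue:
        url = queue.popleft()
        fetched.append(url)
        if urlsplit(url).hostname != site_host:
            continue  # external page: fetch it, but do not follow its links
        for link in get_links(url):
            if link in seen or any(p.search(link) for p in ignores):
                continue
            seen.add(link)
            queue.append(link)
    return fetched


# Stand-in for the web: maps each URL to the links found on that page.
FAKE_WEB = {
    "https://site.example/": ["https://site.example/a", "https://other.example/x"],
    "https://site.example/a": ["https://site.example/calendar?day=1", "https://site.example/"],
    "https://other.example/x": ["https://other.example/y"],
}

order = crawl("https://site.example/", lambda u: FAKE_WEB.get(u, []), [r"/calendar\?"])
print(order)
# → ['https://site.example/', 'https://site.example/a', 'https://other.example/x']
```

The external page `https://other.example/x` is fetched once, but its own link to `/y` is not followed, and the calendar URL is dropped by the ignore pattern.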

Jobs are not distributed, but the system as a whole consists of several servers, each running some number of jobs concurrently. A central server acts as the IRC and web interface and keeps track of what is running and where. The WARC data from each job is continuously uploaded to IA in chunks of a few gigabytes to reduce the disk space requirements.

This approach is suitable for small- to medium-sized websites. For smoothly running jobs, a typical rate is a million URLs every few days. When that isn’t sufficient for archiving a site before a deadline, we need to bring out the bigger guns, like a DPoS (distributed preservation of service) project (what rewby called the “warrior project” above) or other specialized software. [Note: The ironic acronym DPoS alludes to DDoS and originates from the group.]

The protection of anonymity

Why do you appear anonymously as the Archive Team, even in this interview?

arkiver: I want to stay anonymous so I don’t have to worry about possible negative impacts on my offline life. People often censor themselves, and it wouldn’t be good if we did that in our work at the Archive Team.

Especially since some people outside of the Archive Team might not understand the context in which we work and the reasoning behind what we do. It’s very easy for others to put a negative label on our activities that could have repercussions in society or offline life. I don’t want to expose myself to that directly.

Therefore, I prefer not to have my real name attached to everything here. It’s not out of fear, per se, but rather to avoid having to consider society’s interpretation and judgment of everything I do here.

What are your goals for the future?

rewby: Personally, I see our future goals as “improving our software quality, enhancing our pipelines, supporting more types of sites (as many of our current tools are limited to HTTP/1.1), and archiving more things.” Or, as our little tagline goes, “We will rescue more of your shit”.
