The Archivist’s Nook: Archiving the Internet – Do Web Crawlers Dream of Electric Sheep?

Still from "The X Files" — That’s a whole lot of files!

The following post was authored by Digital Archivist Paul Kelly.

My previous blog entry covered the digitization of physical things. Well, paper at least. We’re pretty on top of that! You take a page, scan it, and kablammo – it’s online, so to speak. But how do we deal with records that are already digital, like, say, web pages? Do we print them? Stick them in folders? What would that even look like? 136 billion pages, apparently. Granted, CUA’s site is tiny compared to the internet as a whole, but you get the general idea.

As it turns out, though, non-profit organization The Internet Archive has been doing more all these years than hosting bootlegs of old Smashing Pumpkins gigs. In fact, since 1996 they’ve also been saving snapshots of websites (47 billion and counting), all of which are accessible through the Wayback Machine. If you know the URL, chances are you can score some older version of the page. I dare you to find my Dead Journal from 2001.

Now, don’t get me wrong, Wayback is amazing, but it’s not without limitations. For example, you have to know the link that you’re looking for, archived sites are not particularly organized, content is near-impossible to search, and users have little to no control over what is saved. All of this is a problem for archivists, who like to organize things into oblivion. So, yes, maybe the content is there (or maybe not – good luck finding out), but how do we make it useful?

Internet Archive servers in San Francisco, CA

In 2005, Internet Archive provided a solution: Archive-It, a tool that allows librarians and archivists to point Wayback’s crawlers to specific sites, and create groups of web content in the same way we would decide what to retain in a paper collection. Suddenly web history is manageable and searchable, even for sites that, like tears in rain, no longer exist. In 2014, CUA decided to jump aboard.

In spring 2015, the Archives created five webpage collections: Schools and Departments, Social Media, Student Organizations, University Athletics, and University Libraries. Some of this isn’t that different to documents we have from 1890, such as class schedules, syllabi and so on, but more exciting is the student-generated content we can now collect. YouTube? We’ve got it. Twitter? Yep. Facebook pages? Mhm. Instagram? That too. We’re now truly documenting both sides of the college equation – both the university, and the experience of being a student. We’re also fortifying these collections with archival principles, and implementing real intellectual control (page-level Dublin Core, for the metadata nerds out there). Just think – in years to come, when we’re having smart lenses grafted to our retinas, researchers will be able to look back and see what our web presence looked like in 2015. Pretty cool, right?

So what’s up next for the web archiving program? To tease – so far, we’ve only been collecting on the university end, but what if we collected what others said about CUA? Think scrapbooks, only cooler. In any case, you’ll read about it here first. As before – stay tuned, folks.
