Web archiving: where does it leave us?

While brainstorming topics for this – my final – blog post for the “Introduction to Social Media” course, I found myself lamenting the conclusion of an unexpectedly fulfilling first experience as a blogger. Creative writing has proven to be an arduous task for me in the past, so when I learned that my graded course work for COM0011 would involve producing a series of six blog posts throughout the term, I was naturally less than enthused. Yet surprisingly, I am leaving the course with a body of work of which I am quite proud, both for the way in which I expressed myself in writing and for some of the aesthetic choices I made so that my posts would present visually in certain ways on this WordPress blog. This led me to consider my options for saving a record of my modest attempt at blogging. Although a rendition of my work could be captured through an awkward combination of representative screenshots, Microsoft Word files and downloaded graphics, web archiving appears to provide a tidier alternative for preserving online content.

Web archiving (sometimes also referred to as web harvesting) has matured over the past two decades into a robust and standardized “process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use” [1]. Through a description of its own web archiving program, the United States (US) Library of Congress highlights how web archives result from deliberate collection mandates and evaluation and selection efforts (as opposed to ad hoc or random selection) and demonstrate adherence to cataloguing standards and other best practices [2].

In “The History of Web Archiving”, Masashi Toyoda and Masaru Kitsuregawa describe how the advent of the World Wide Web in the 1990s was quickly followed by the creation in the US of the Internet Archive in 1996, with a search interface – the “Wayback Machine” – introduced in 2001 to facilitate access to the web pages it held [3]. The Internet Archive gained recognition as a library in 2007 in the State of California, and remains to this day a San Francisco-based non-profit digital library “offering access to millions of free books, movies, and audio files, plus an archive of 450+ billion web pages” [4]. Traditional bricks-and-mortar libraries and academic institutions all over the world have since followed suit by implementing their own web archiving programs, which are further complemented by international collaborative efforts to archive the World Wide Web. Although the gap between what has been archived and what has been made publicly accessible makes it difficult to confirm current numbers, a Wikipedia entry last updated on 9 April 2016 lists 81 web archiving initiatives in place to date [5]. When you consider their varied collection targets and strategies, it is clear that vast expanses of the Internet are being saved for generations to come.

Interestingly, those involved in or preoccupied with web archiving are adding their own layers to the cyberscape they are striving to preserve. A quick review of the 81 web archiving initiatives listed in the Wikipedia entry I consulted revealed that most of the web archives they represent – even if only by virtue of the entities that administer them – have a presence on social media. The Internet Archive is no exception, with its own blog (https://blog.archive.org/) and Twitter account (@internetarchive). The same is true for the International Internet Preservation Consortium (IIPC) – an organization composed of member institutions from over 45 countries “dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage” [6]. Like the Internet Archive, the IIPC maintains a WordPress blog (https://netpreserveblog.wordpress.com/), is active on Twitter (@NetPreserve) and has its own YouTube channel. Anticipated beneficiaries of web archiving have even taken to social media, with Web Archive History – a group “[f]or historians who use, think about, and work with web archives” [7] – having established a Twitter account (@HistWebArchives) for itself in 2014.

While all of the aforementioned activity by web archiving specialists and enthusiasts suggests their confidence in the staying power of online content, web archiving may be far from fail-safe when it comes to capturing content generated through social media. In 2010, the US Library of Congress acquired the Twitter Archive under what came to be known as the Twitter Research Access Project; but as of 2015, it was “still grappling with how to manage an archive that amounts to something like half a trillion tweets” [8] and has yet to provide researchers with access to it. Jill Lepore inspires even less hope in her New Yorker article, “The Cobweb: Can the Internet be archived?”, by reminding us that the average lifespan of a web page is only 100 days, with other web content falling victim to reference rot (i.e., link rot and content drift) and overwriting on a regular basis [9]. When you combine this reality with the limitless information exchange and self-publication made possible by social media, what legacy can be left by the average site owner and blogger such as you and me?

Although web archiving initiatives around the world are undoubtedly dedicated to frequent capture of the Internet, change remains a constant in cyberspace with which it is difficult to contend. If we are not on a web archive’s radar, where does this leave us? As COM0011 comes to an end, this question weighs heavily on my mind. What will become of the content we generated here?

References (that I hope will still be retrievable following publication of this post):

  1. Web Archiving [ca. 2012]. Retrieved from http://www.netpreserve.org/web-archiving/overview
  2. Web Archiving (n.d.). Retrieved from https://www.loc.gov/webarchiving/
  3. Toyoda, M. & Kitsuregawa, M. (2012). The History of Web Archiving. Proceedings of the IEEE, 100, 1441-1443. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6182575
  4. Internet Archive [ca. 2009]. In Twitter. Retrieved 15 April 2016, from https://twitter.com/internetarchive?lang=en
  5. List of Web archiving initiatives (9 April 2016). In Wikipedia, The Free Encyclopedia. Retrieved 16 April 2016, from https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
  6. Mission & Goals [ca. 2012]. Retrieved from http://www.netpreserve.org/about-us/mission-goals
  7. Web Archive History [ca. 2014]. In Twitter. Retrieved 16 April 2016, from https://twitter.com/histwebarchives
  8. Scola, N. (11 July 2015). Library of Congress’ Twitter archive is a huge #FAIL. Politico Magazine. Retrieved from http://www.politico.com/story/2015/07/library-of-congress-twitter-archive-119698.html
  9. Lepore, J. (26 January 2015). The Cobweb: Can the Internet be archived? The New Yorker. Retrieved from http://www.newyorker.com/magazine/2015/01/26/cobweb

3 thoughts on “Web archiving: where does it leave us?”

  1. Hi there. What an interesting topic, lots of reference to back up your information on your blog. Great question, and you have given us something to think about. A well written post also, thank you very much.

  2. Makes you think about how the web has become an intrinsic part of our work and social lives when you think about where all this work will go, and whether its future relevance will be captured 5, 10 or 15 years from now.

  3. Interesting post. To be honest, I had not really given web archiving much thought, which is a serious oversight on my part since we know that our posts are never really lost on the Internet. Best to keep everything organized for future reference – something better than cut and paste on a USB stick. lol

    Thanks for sharing,
