Web Artifact Permanence

What are our best options for preservation?

A Provocation for the Open Pedagogy Community

I do want to point out a big reason I moved to self-hosted and institutional solutions was this idea that commercially hosted stuff was too fickle. In 2006, it seemed that every week a new site shut down. For better or worse (mostly worse) monopoly consolidation has changed that dynamic a bit. There are other good reasons for self-hosting or doing institutional hosting, but durability is more downside than upside of these options, and we might want to let our students know that if they want something to stay up, self-hosting may not be the best choice.

I don’t necessarily agree. I agree more with his first instinct:

My galaxy brain goes towards the idea of federation, of course. The idea that everything referencing something should store a copy of what it references connected by unique global identifiers (if permissions and author preferences permit), and that we need a web that makes as many copies of things as the print world did, otherwise old copies of the Tuscaloosa News will outlast anything you are reading today on a screen. Profligate copying, as Ward Cunningham has pointed out, is biology’s survival strategy and it should be ours as well.

This is a topic near and dear to my heart. Of course, my technical abilities allowed me to move my blog from Blogger to homespun software to WordPress, along with a separate LiveJournal and a separate Jekyll blog, and I now have a consolidated archive in Jekyll. My blog is a git repo, which is mirrored on GitHub. The posts are entirely contained in a directory of flat files (3326 of them, not including this one) that I can make infinite copies of. After 17 years of doing this, I’ve found flat files to be the most dependable, portable data storage format for blog posts. I’ve hosted the site at a total of two different hosts. The first one went out of business in dramatic, Web 1.0 fashion. I’ve been on Dreamhost for over 12 years.
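
For anyone wondering what "infinite copies" of flat files looks like in practice, here is a minimal sketch in Python. It assumes nothing more than a directory of post files; the names `mirror_posts` and `file_digest` and the paths in the usage note are hypothetical, made up for illustration, and this is not the actual setup behind this blog. It copies every file into a second directory and checks each copy against its original with a SHA-256 hash.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mirror_posts(source: Path, mirror: Path) -> int:
    """Copy every file under `source` into `mirror`, verifying each copy.

    Assumes `mirror` is not located inside `source`.
    Returns the number of files copied and raises ValueError if any
    copy does not match its original byte for byte.
    """
    copied = 0
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue  # directories are recreated as needed below
        target = mirror / original.relative_to(source)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(original, target)  # copies contents and timestamps
        if file_digest(original) != file_digest(target):
            raise ValueError(f"copy of {original} does not match the original")
        copied += 1
    return copied
```

Something like `mirror_posts(Path("posts"), Path("backup/posts"))` could then be run against as many destinations as you like (both paths are placeholders). The point is less the script than the format: because each post is an ordinary file, any tool that can copy files can preserve the blog.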

But I still think about the impermanence of this thing quite a bit. I actually think about future audiences more than I think about the current audience. (Helped by the fact that there is almost no current audience.) I think about it in relation to SWIM, especially the parts of SWIM that are informed by Vannevar Bush’s Memex. The Memex was to be a machine in the home, with the data kept as a local copy. But copying from others was paramount. Obviously, the web changes that, but I still have strong feelings that this data should first belong to the user, and that the user should be able to access, manipulate, import and export that data however they see fit.

The two issues, data ownership and data permanence, are somewhat intertwined. We need to own our data, but we also need to know how to be responsible with our own data, and we need mechanisms to preserve that data. We need organizations thinking about these issues.