This is the second paper of two from the Endings Project at the University of Victoria. The first paper, by Stewart Arneil, outlines the challenges of building a sustainable digital humanities (DH) project from the perspective of a programmer who has worked with researchers to build a number of DH project sites. This second paper, written by a librarian, considers the role of libraries as DH preservation partners. A recent special issue of Digital Humanities Quarterly, edited by the Endings Project team, provides an in-depth discussion of the many sustainability challenges faced by DH researchers (Holmes et al.).
In 2018, the Endings Project undertook a survey of 127 DH projects with the goal of better understanding the various ways in which DH projects come to an end. The earliest projects represented in the survey began in the 1980s, but the vast majority were started after 2001. Of those projects, spanning a four-decade period, only 24% were considered by their principal investigators to be “complete,” and only about 10% were archived in a stable, long-term environment with active preservation services (“Survey Results”).
As digital projects proliferate in the humanities, the question of preserving non-traditional research outputs like websites, databases, and software tools becomes a pressing one. Many researchers turn to academic libraries for solutions. A recent discussion on the Humanist listserv underscores the gap between faculty expectations and library capacity (Wall et al.). In most cases, faculty are hopeful that their libraries will adopt a project wholesale and agree to keep the entire software stack—all of the different applications and dependencies that allow the application to function—viable over the long term. This is not a scalable proposition for even very well-funded libraries. The gap between what is desired by faculty and what is sustainable for libraries creates a tension that is difficult to resolve in a way that is satisfactory to both parties.
When we talk about the preservation of digital objects and platforms, we must first acknowledge that “persistence is a function of organizations, not a function of technology” (DOI Foundation). This may be a slight overstatement, because of course organizations do use technology in order to preserve digital content, but the point is that technology is just that—a set of tools that are developed and used by human beings who are funded by organizations to carry out specific functions. There is no technical design choice that will absolutely future-proof information containers, particularly over the very long term. GLAM organizations (galleries, libraries, archives, and museums) are unique in their mission to collect, organize, and store information in ways that can preserve access to knowledge over hundreds or thousands of years. The deluge of digital information raises many new questions about what should be preserved, and about how libraries can organize their limited resources to take on this work.
In this paper, we will examine six different approaches to digital preservation to determine the strengths and weaknesses of each, considering both technical and resource implications. The approaches are dark archiving, preserving objects and metadata in a repository, web harvesting, emulation, preservation of dynamic/social sites, and archiving static versions.
Dark Archiving is the practice of taking bit-level copies of digital files and preserving them in storage that is not publicly accessible. This may mean that the library takes a copy of a project’s entire file and folder infrastructure, a copy of the database, or a disk image from a server. These files are then put into deep storage in a local storage network, into cloud-based storage (e.g., Amazon Glacier), or into offline storage like tape drives. There are a couple of benefits to this approach: it is a manageable amount of work for researchers and librarians, and it ensures that at least one redundant copy of the project exists on fairly reliable storage infrastructure. This is a strategy that libraries may leverage as a kind of relationship appeasement where a better option is not available.
From the library perspective, the disadvantages of dark archiving significantly outweigh the advantages. Project file systems are often not designed in a way that will be easy to understand for someone outside of the project team and are rarely well documented. Databases and applications suffer from obsolescence as computing environments evolve, and these are also often poorly documented for outside users. Almost all file formats beyond unadorned text files will eventually need to be migrated to avoid obsolescence. Network and cloud storage have ongoing annual costs. Once a library has accepted hundreds of disk images or file systems it will be very difficult for librarians and archivists to understand what is in each of those data dumps, let alone try to recreate the projects in a way that can be accessible to users. This final point is really the critical one. Libraries are in the business of collecting information for the benefit of future scholars. Uncatalogued and poorly documented dumps of hundreds of project files stand very little chance of ever being reanimated into something that will be usable in the future. By accepting data dumps for dark archiving, libraries are using limited resources to assume responsibility and liability for collections that will ultimately provide no benefit to our user base. When weighed against the many competing priorities for time and money, this type of “archiving” does not meet our basic criteria for discoverability and accessibility.
Of the DH projects surveyed by the Endings Project, most of those with active long-term archiving solutions are relying on library repositories to store the research objects and metadata that have been produced (Goddard). In this scenario, the project exports the digital objects that were either created or collected as objects of study. This may include texts, images, and audio/video files, along with any structured metadata that has been produced to describe them. Libraries will often enhance object-level metadata in order to bring it into line with international standards for the automated processing and exchange of information—e.g., Dublin Core, the Extended Date Time Format, Library of Congress Subject Headings, and FAST Authority records.
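As an illustration of what object-level metadata aligned to these standards might look like, the following minimal sketch builds a simple Dublin Core record with an Extended Date Time Format date. The field values, the wrapping `record` element, and the function name are assumptions made for the example; they are not drawn from any particular library's repository.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical research object; every value here is invented for illustration.
record = dublin_core_record([
    ("title", "Letter from a correspondent, page 1"),
    ("creator", "Unknown"),
    ("date", "1872-05~"),  # EDTF: approximately May 1872
    ("type", "Text"),
    ("format", "image/tiff"),
])

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Recording the uncertain date in a machine-readable form, rather than as free text such as "circa May 1872," is what allows the record to be processed and exchanged automatically.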
There are a number of advantages to this approach. The main one is that most libraries already have a publishing platform for digital collections, and workflows are in place to manage those objects over the long term. Repositories ensure that a project’s research objects will be discoverable and accessible online, and that they will be available for the use of scholars around the globe in perpetuity. This archiving solution also relieves the researcher of any further responsibility for funding or maintaining their own online platforms.
The repository approach tends to be preferred by academic libraries because it leverages existing staff, workflows, and infrastructure while ensuring that objects are available for use. From the researcher perspective, however, this approach has some serious flaws that compromise the integrity of the archived project. The most significant problem is that research objects are divorced from the original context of the project website. The website is not preserved, the original URLs are not preserved, and any interpretive information that was provided by the project website is lost. This may include the erasure of information about the original research objectives, research methods and process, and project contributors. The loss of contributor information is especially troubling given the amount of precarious labour used in most research projects. When the project website disappears, it is difficult for members of the project team to demonstrate their contributions to potential employers, or to show the scope and impact of the project.
Web harvesters are software applications that capture static copies of websites at different points in time. They follow links through a website, scraping the content and saving the site as a WARC (Web ARChive) file that is a self-contained, portable copy of the content. The Internet Archive’s Wayback Machine is one of the best-known web archives, documenting harvested versions of some websites as far back as 1996. WARC files can be viewed in a browser via a WARC viewer, such as those available at the Wayback Machine or Perma. Many academic libraries subscribe to the Archive-It platform, which uses the Heritrix web harvester. Archive-It permits libraries to maintain accessible copies of important web information from institutional and government websites that change often.
WARC files can be hosted in a web service like Archive-It, which will allow them to be viewed in a standard web browser, or the WARC files themselves can be submitted to a data archive for purposes of preservation. In this scenario, a user would have to download the WARC file in order to open it from their local computer. WARC files, like most file formats, will have to be repackaged and migrated as standards change or they will become obsolete and inaccessible to the average end-user.
Most Canadian research libraries already have web harvesters available (see the Council of Prairie and Pacific University Libraries’ Archive-It membership list and the Canadian Web Archiving Coalition site), and most DH research projects are not so large as to create a significant storage cost burden within a subscription service like Archive-It. On the surface, web harvesting would appear to be an ideal solution for archiving DH projects, as this permits the entire project site to be preserved, along with all of the contextual and interpretive content that it contains.
Web harvesters, however, have some known challenges. Dynamic design features on a website that require a human to click (e.g., to play a video, to zoom a map) or mouseover (e.g., to access a menu) may not be archived properly. Streaming media is not always available for download (Germain). Most critically for a lot of DH projects, a web crawler cannot launch database searches from search boxes. Since web crawlers follow links, they cannot directly access most content that is held in a backend database. This limitation is particularly serious “as database-driven websites become the norm, and where content is increasingly generated dynamically based on user-initiated HTTP requests, either through form-based queries or via AJAX-driven user interactions” (Davis).
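The link-following limitation can be seen in a simplified sketch of the first step a crawler performs on each page. This is not how Heritrix or Archive-It is implemented; the class name and the sample page are assumptions made for illustration. The parser collects the hyperlinks it can queue for harvesting and separately notes any search form, which it has no way to fill in and submit.

```python
from html.parser import HTMLParser


class LinkAndFormFinder(HTMLParser):
    """Collect hyperlinks (crawlable) and form actions (not crawlable)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append(attrs.get("action", ""))


# Hypothetical project home page with two links and one search box.
PAGE = """
<html><body>
  <a href="/about.html">About the project</a>
  <a href="/texts/poem-001.html">Poem 1</a>
  <form action="/search" method="get">
    <input type="text" name="q"><input type="submit" value="Search">
  </form>
</body></html>
"""

finder = LinkAndFormFinder()
finder.feed(PAGE)

crawl_queue = finder.links   # pages the crawler will go on to capture
unreachable = finder.forms   # entry points that need a human-typed query

print("Will harvest:", crawl_queue)
print("Cannot query:", unreachable)
```

Any record that can be reached only by typing a query into the form at `/search` never enters the crawl queue, and so never appears in the resulting web archive.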
While web capture using Archive-It might be a viable strategy for some DH research outputs, effective capture of others might require the use of additional web archiving technologies, together with more labour-intensive interventions, and still other web-based projects might evade capture persistently. Alternatively, it may be possible to web-archive select parts of a project, and to publish those as objects that preserve some sense of the whole project. Web archiving is a useful component of an overall digital preservation strategy, but it is not a complete solution for most projects.
Emulation involves making a bit-level copy of a piece of software, including all of the data that it contains. The hardware of the machine on which that piece of code originally ran is then emulated in software to allow the program to be executed and experienced exactly as it was originally designed. This solution will allow a user to interact with an obsolete piece of software without rewriting the code for contemporary hardware and operating systems.
Emulation can be a good approach for a small, self-contained program like an interactive CD-ROM or a video game. The Internet Arcade, from the Internet Archive, provides a number of old arcade games in emulation, so that they can be played in a modern web browser. Emulated software provides a faithful replication of the original application without requiring that code or file formats be rewritten or migrated.
There are several challenges with emulation. The software container must include the entire environment on which a piece of software ran, including the original operating system, which is sometimes legally difficult, since libraries cannot purchase licenses for obsolete versions of commercial operating systems. Emulated software is like a walled garden. It is not connected to the internet in any true sense, so data cannot go in or out. Corrections, edits, or additions to the data are impossible, and it is not easy to get the data out of the system in a usable way. Emulation is also an expensive solution that requires significant technical intervention. According to David Rosenthal, “emulation is more expensive than migration-based strategies[…]. There is a risk that diverting resources to emulation, with its higher per artefact ingest cost, will exacerbate the lack of resources” (Rosenthal 1).
Ultimately, emulation is a strategy that can work quite well for small pieces of software that are self-contained and not very technically complex. This does not describe most modern DH projects, many of which are built on complex stacks of software like Drupal or WordPress that have many software dependencies and are built for highly networked environments: “Technical, scale and intellectual property difficulties make many current digital artefacts infeasible to emulate” (Rosenthal 31). Few libraries have sufficient on-staff expertise to manage emulation projects, and no library has the resources to offer emulation for faculty projects at scale (Hagenmaier et al.).
Containerization using software like Docker is akin to emulation. A container packages up all software and dependencies, allowing a project to be easily redeployed on a new server or in a new environment. A containerized project is easy to transfer into the stewardship of a library or archive compared to a project that is not virtualized, because it does not require a library to directly procure, install, and configure every application and dependency.
Software that is containerized, however, still has to be upgraded and managed, or it can become a serious security liability. Since the container itself is software, it will also have to be upgraded over time. Containerization therefore does not address the major sustainability concerns, namely that monitoring, patching, and upgrading a large number of applications is expensive in staff time and transfers a huge amount of technical debt and cybersecurity liability to the library.
A container could be archived in a relatively static way, by publishing the container file in a data archive or repository. In this scenario, an end user would have to download the container from the repository in order to run it on their local hardware infrastructure. This would certainly create a barrier to access for all but the most determined users, but it would be a means of preserving a faithful copy of the project with all functionality intact. Unless the container is intermittently repackaged, however, it will eventually become obsolete and unusable.
Social sites, those dynamic sites that invite user contributions, pose some of the most complicated archiving challenges. One significant challenge is that social sites usually do not present the same information to all users. A user’s experience of a social site will be dictated by algorithms that consider their stated interests, previous activity, and the other users with whom they are connected, either explicitly (e.g., as friends or by following) or by merit of interacting with their content (e.g., liking or commenting). Sites with a lot of user-contributed data often have complex issues related to privacy, especially those that allow users to limit access to certain aspects of their account or activity. Intellectual property issues may also be present on sites where users may have uploaded articles, images, or videos to which they do not own the copyright.
It is almost impossible to archive a social site in any comprehensive way, unless you effectively just keep the software stack running. Researchers often ask libraries to do just this, but there is a great deal of literature explaining why that is an impossibly complex and expensive proposition for any library, or really for any organization (Bote et al.; Johnston; Corrado; Goddard and Seeman; Chassanoff and Altman; Weber et al.). Even multi-national technology companies like Google and Microsoft do not keep all of their platforms running forever. The largest library-led social media archiving project ever undertaken was the Library of Congress effort to archive all public tweets made on Twitter. Even one of the world’s largest and most well-funded libraries has still not been able to make this archive public due to concerns about legal liability, and due to the cost of storage and computational resources that would be needed to provide useful access to this collection (Library of Congress).
In his article “Stewardship in the Age of Algorithms,” Clifford Lynch argues that, when it comes to dynamic social sites, “the real stewardship goal[…] must be to capture some meaningful sense of the system’s behavior and results for the present and the future. Comprehensiveness is quixotic.” This might entail video walkthroughs of the platform, screenshots, and other documentation about the design, purpose, and features of the site. This brings us back to the main argument of this paper, which is that libraries can realistically preserve only static representations of digital projects, but cannot adopt the whole project software stack, particularly when that project has interactive features that encourage users to submit their own content.
There is no archiving option that is without cost to libraries. Putting objects in a repository requires initial migration and metadata efforts, and over time requires platform migrations, fidelity checks, format migrations, storage, and backup. Web archiving services like Archive-It charge according to the size of the collection, so costs go up as new sites are added. It is likely that WARC files will also have to be repackaged and migrated to new platforms over time. Long-term preservation strategies including backups, the creation of standards-compliant storage containers, and off-site redundancy all have costs related to software, computation, network capacity, and storage. Maintaining these services and infrastructure requires the time of programmers, sysadmins, librarians, archivists, and other costly experts.
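The fidelity checks mentioned above are often carried out as checksum comparisons: a digest is recorded for every file at ingest and recomputed later to detect silent corruption or unintended change. The sketch below shows the general idea under stated assumptions. The function names and the throwaway directory are hypothetical, and production preservation systems wrap this kind of check in scheduling, logging, and repair from redundant copies.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List files that were added, removed, or altered since the manifest."""
    current = build_manifest(root)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )


# Demonstration on a temporary directory standing in for an archived site.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_text("<html></html>", encoding="utf-8")
    manifest = build_manifest(tmp)              # recorded at ingest
    clean_report = changed_files(tmp, manifest)
    page.write_text("<html>corrupted</html>", encoding="utf-8")
    damaged_report = changed_files(tmp, manifest)

print(clean_report, damaged_report)
```

Each pass of such a check reads every byte in the collection, which is one reason storage and computation costs recur for as long as the material is held.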
As University of Victoria librarians have explored long-term archiving options with the DH researchers and programmers on the Endings Project, it has become very clear that DH projects with the best chance of surviving over time are those that have been designed, from the beginning, to end gracefully. In order to articulate what that looks like, we have developed the Endings Principles for Digital Longevity (Endings Project). These principles express the desired end-state for DH projects that can be archived in their entirety without losing functionality or data. The Endings Principles underscore five key areas of work: data, documentation, processing, products, and release management.
Data must be stored in open formats to ensure ongoing accessibility. The integrity and accuracy of data must be verified over the course of the project to ensure that the final, archived product is of the highest possible quality. This is particularly important because once a project is archived, the site will sit on a library server with no graphical editing interface, and no easy provision for the researcher to have access to fix errors or replace objects.
Documentation should be complete and comprehensive. The archived project will, ideally, outlive the project team. Scholars of the future will require documentation to understand the scope of the project including the ways in which data was collected and described, and decisions that were made about what to include and how to present it.
Processing code requires “relentless validation” so that the static files produced will conform strictly to standards for HTML/CSS/XML tags and data structures (“Endings Principles”). While modern browsers are fairly tolerant of code errors, software clients of the future may be less forgiving. Even a small error can prevent a site from loading, rendering the whole project inaccessible. It is not easy, and may not be possible, for library programmers to track down and fix coding errors across potentially hundreds or thousands of archived sites using nonstandard syntax. Martin Holmes and Joey Takeda, DH programmers from the Endings Project, have built diagnostic tools for checking and validating TEI and HTML output (see the Endings Project GitHub). Diagnostics should be run against code to ensure conformance to standards and to check for security holes. They should be run against data to ensure validity and conformance to standards and field definitions.
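To give a sense of what a basic automated diagnostic could look like, the following sketch checks whether a set of XHTML pages is well-formed XML. It is not the Endings Project's diagnostic tooling; the function name and the sample pages are assumptions made for illustration. Well-formedness is also only the lowest bar, since full validation against a schema would require additional tools.

```python
import xml.etree.ElementTree as ET


def well_formedness_errors(pages):
    """Return {page name: parser message} for pages that fail to parse as XML."""
    errors = {}
    for name, markup in pages.items():
        try:
            ET.fromstring(markup)
        except ET.ParseError as exc:
            errors[name] = str(exc)
    return errors


# Hypothetical build output: one sound page and one with an unclosed <p>.
PAGES = {
    "good.html": (
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
        "<head><title>Good</title></head>"
        "<body><p>Fine.</p></body></html>"
    ),
    "bad.html": (
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
        "<body><p>Unclosed paragraph.</body></html>"
    ),
}

report = well_formedness_errors(PAGES)
for name, message in report.items():
    print(f"{name}: {message}")
```

A tolerant browser would probably display the second page without complaint, which is exactly why errors of this kind go unnoticed unless a check is run as part of every build.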
Release Management is an approach wherein changes are packaged up into periodic releases, so that all changes to data and code are tested and validated before they are released to the production site. Clear versioning and citation information should be available on every page of the site.
One significant challenge in creating a static site is that many DH projects require end-user search in order to allow scholars to find relevant project resources. As mentioned above, Holmes and Takeda have created a codebase to address this, which supports a pure JSON search engine requiring no backend for any XHTML5 document collection (Endings Project GitHub). There are also existing tools that can turn a relational database into a static web document. DeepArc, developed by the Bibliothèque nationale de France, maps the structure of a database to an XML schema and exports the content as an XML document.
Creating a DH project that is fully archivable requires technical and design decisions that work towards this goal from the very outset of a project. Researchers who would like to rely on their college or university libraries as an archiving partner must talk to librarians during the planning phases of their project to see what is possible at any given institution. The University of Victoria Libraries have developed a set of Library Services for Grant-Funded Research Projects (fondly called the “Grants Menu”) to encourage faculty members to talk to their libraries as they develop grant applications. The Grants Menu lays out the in-kind value of a variety of library preservation services, so these can be included as institutional contributions in grant proposals. These early conversations allow researchers to make informed choices about how their technology decisions will impact the long-term sustainability of project outputs.
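The general idea behind backend-free search can be sketched as follows: at build time, an index mapping words to pages is written out as static JSON, and at search time the client simply looks words up in that file. This is a simplified illustration, not the Endings Project codebase. The page contents and function names are hypothetical, and Python stands in here for the browser-side script that would read the JSON on a live site.

```python
import json
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted list of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return {word: sorted(names) for word, names in sorted(index.items())}


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Hypothetical text content extracted from three static pages.
PAGES = {
    "map-1862.html": "Map of Victoria harbour, 1862",
    "letter-14.html": "Letter describing the harbour and the fort",
    "about.html": "About the project team",
}

# Build step: serialize the index as a static JSON file's contents.
index_json = json.dumps(build_index(PAGES))

# Search step: load the JSON and look up words; no database or server code.
index = json.loads(index_json)
results = search(index, "harbour map")
print(results)
```

Because the index is just another static file, it can be archived alongside the HTML pages and served by any web server without additional software.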
Not all libraries have mature service offerings around digital preservation for research projects, although academic libraries increasingly offer at least repository and web archiving services. Emulation is rarely offered, and very few libraries will consider adopting an entire software stack, particularly one that permits end-user content creation. The Endings Project lays out a set of principles that will satisfy the researcher’s desire to preserve both research objects and the contextual project website, while respecting the resource constraints of libraries. Ultimately, the Endings Project and this paper argue that DH projects should be designed, from the beginning, to eventually render static outputs that can be archived in a scalable, sustainable way that limits technical overhead, maintenance time, and cost.
Bote, Juanjo, et al. “The Cost of Digital Preservation: A Methodological Analysis.” Procedia Technology, vol. 5, 2012, pp. 103–111, doi.org/10.1016/j.protcy.2012.09.012.
Chassanoff, Alexandra, and Micah Altman. “Curation as ‘Interoperability with the Future’: Preserving Scholarly Research Software in Academic Libraries.” Journal of the Association for Information Science and Technology, vol. 71, 2020, pp. 325–337, doi.org/10.1002/asi.24244.
Corrado, Edward M. “Software Preservation: An Introduction to Issues and Challenges.” Technical Services Quarterly, vol. 36, no. 2, Apr. 2019, pp. 177–189, doi.org/10.1080/07317131.2019.1584983.
Davis, Corey. “Archiving the Web: A Case Study from the University of Victoria.” Code4Lib Journal, vol. 26, 2014, journal.code4lib.org/articles/10015.
DOI Foundation. “DOI System and the Handle System.” The Identifier Resources: Factsheets, 2017, www.doi.org/factsheets/DOIHandle.html.
The Endings Project. “Endings Principles for Digital Longevity.” Version 2.2.1, Mar. 2023, endings.uvic.ca/principles.html.
The Endings Project. “Endings Project Survey Results.” Version 2.2.1, Mar. 2023, endings.uvic.ca/principles.html.
Endings Project GitHub. “projectEndings.” GitHub, 2023, github.com/projectEndings/.
Germain, Raven. “Known Web Archiving Challenges.” Archive-It User Guide, 28 Oct. 2021, support.archive-it.org/hc/en-us/articles/209637043-Known-Web-Archiving-Challenges.
Goddard, Lisa. “Developing the Read/Write Library.” Scholarly and Research Communication, vol. 7, no. 2/3, 2016, doi.org/10.22230/src.2016v7n2/3a255.
Goddard, Lisa, and Dean Seeman. “Negotiating Sustainability: Building Digital Humanities Projects that Last.” Doing More Digital Humanities, Routledge, 2019, pp. 38–57.
Hagenmaier, Wendy, et al. “Software Preservation Services in Cultural Heritage Organizations: Mapping the Landscape.” iPRES 2019: 16th International Conference on Digital Preservation, Amsterdam, 16–20 Sept. 2019, pp. 417–419, ipres2019.org/static/proceedings/iPRES2019.pdf.
Holmes, Martin, et al. Project Resiliency. Special issue of Digital Humanities Quarterly, vol. 17, no. 1, 2023, www.digitalhumanities.org/dhq/vol/17/1/index.html.
Johnston, Leslie. “Digital Humanities and Digital Preservation.” The Signal, Apr. 2013, blogs.loc.gov/thesignal/2013/04/digital-humanities-and-digital-preservation/.
Library of Congress. “Update on the Twitter Archive at the Library of Congress.” Dec. 2017, blogs.loc.gov/loc/files/2017/12/2017dec_twitter_white-paper.pdf.
Lynch, Clifford. “Stewardship in the Age of Algorithms.” First Monday, vol. 22, no. 12, Dec. 2017, dx.doi.org/10.5210/fm.v22i12.8097.
Rosenthal, David S. H. Emulation & Virtualization as Preservation Strategies. LOCKSS Program, Stanford University Libraries, 2015. UNT Digital Library, digital.library.unt.edu/ark:/67531/metadc799755/.
Wall, John, et al. “Institutional Support for DH Websites.” Humanist Discussion Group, Sept. 2021, www.dhhumanist.org/volume/35/245/.
Weber, Chela Scott, et al. “Total Cost of Stewardship: Responsible Collection Building in Archives and Special Collections.” OCLC Research, Mar. 2021, doi.org/10.25333/zbh0-a044.