From Archive to Interaction: Two Case-Studies in Exhibiting Digital Collections

Published on Mar 19, 2024

Far too often, argues Ryan Cordell, “the computer” has been “treated as a window to the physical archive rather than as an integrated remediation of the archive.” He implores scholars to “reckon with mass digitized historical texts as new and discrete bibliographic objects” (190). But while curated archives mediate the histories they represent, they nevertheless play a necessary role in connecting end users—be they researchers, librarians, or the public—with primary materials (Blouin 102–103). Such acts of mediation have become all the more fraught in the context of the digital humanities, as archivists and scholars use archival holdings not only to access materials, but also to prepare and analyze them for exhibition.

Cordell’s call to action, for us to “take the digitized text seriously within its own medium” (217), foregrounds how due excitement over material made available through mass digitization must be tempered by an acknowledgement of the practical limitations of exhibiting material from digital collections. These limits are apparent not only in the application of computer-mediated analyses to questions of traditionally humanist inquiry, as Nan Z. Da argues, but also in the early stages of corpus creation.1 Nowhere is the potential for reduction more relevant than in the context of historical documents, for which curated research outputs such as exhibitions remain, for many end users, their only form of interaction with archival materials.

Optical Character Recognition (OCR), the computer-assisted method of deriving text from image files, is a critical step in the many levels of mediation between a primary source and its appearance as digital object. OCR creates a new layer of machine-readable text, a format of structured data that can be read by a computer, which lies atop the primary source text contained within image files. In the context of corpus creation and later, exhibition, researchers add additional layers of mediation when extracting and transforming data from the digital object. It is these layers, and specifically how the limitations posed by OCR outputs impact corpus collection, with which we are primarily concerned.

This study seeks to outline the hurdles, benefits, and impacts of archival analysis at scale by comparing two case studies, each with a different approach to corpus creation and exhibition. The first project, Food Riddles and Riddling Ways (the Riddle Project),2 follows a top-down approach using search strings of relevant keywords to aggregate data from existing primary source databases. The second project, Ciphers of “The Times,”3 uses a bottom-up approach that focuses exclusively on one digital collection to create a machine-readable corpus for syntax-level computational analysis. While the two approaches create datasets from similar source material, they introduce mediation from opposite directions—the top-down approach by narrowing an existing dataset and the bottom-up approach by constructing a corpus through acts of transcription. We identify the information-seeking behaviours directing each method and how they negotiate the uncertainties of compiling imperfect OCR data from historical collections. In both cases, we understand OCR not as a passive interlocutor but rather as an invisible curator in its own right, revealing and obscuring data with substantial impact on curated outputs.

This layer of unseen (or yet to be fully recognized) mediation at the technological level will only exacerbate existing archival silences that, as Rodney G. S. Carter writes, are introduced by the “groups [who] create the records that will eventually enter the archives” and who “use their power to define the shape an archive takes” (217). Whereas archival bias is increasingly understood to be visible in the absence of marginalized voices in a collection, digital silences that manifest as null search results or lost data points are deceptively inconspicuous. OCR errors and omissions are silences introduced by human actors: they cannot be clearly attributed to any one group but are, as Cordell writes, “remediated by mass digitization, a phrase that shorthands elaborate systems of scholarship, preservation, bureaucracy, human labour, machine processes, and economics” (188). Without clear attribution, the challenge of how to identify OCR-induced silences and, moreover, how to address them at both the level of collections and their curated outputs becomes all the more difficult.

Originating from the context of library outreach, both projects aim to create meaningful interactions between end users and primary materials through exhibitions. Despite the difficulties of working with digitally-mediated collections, we maintain that digital curation allows users to engage with primary sources at a scale and with a broader perspective than possible for those scrutinizing a single page or object. While both projects draw from similar primary materials—historical manuscript and printed documents—the secondary interactions they mediate are quite different. As such, benefits of data collection, analysis, and exhibition are not equally distributed. They depend as much on the initial quality of OCR as they do on the methodology used to extract textual data, and ultimately, on the ways in which that data is presented to the public, all of which we make apparent in the two case studies that follow.

“Dirty” OCR and the Question of “Imperfect” Data

Typically built for simplicity and ease of access, digital databases' many layers are hidden below the surface, invisible to end users who interact with only the most topical levels of the browsing interface. Those hidden layers are unique to every database and include metadata description, indexing standards, and the OCR software used to extract textual data from image files, not to mention which corrective measures, if any, are employed. OCR information and accuracy ratings are still infrequently shared with end users. Neither are the specific parameters of the software and processes used to generate machine-readable text consistently publicized. The question then remains of whose responsibility it is—researchers or the institutions and publishers who offer digital collections—to resolve or make apparent this issue to end users?

Depending on variables unique to every database and, indeed, to every individual document, creating accurate OCR-rendered text can be one of the most time-intensive and challenging steps in producing a dataset from a digital archive.4 Some projects use paid labour to correct OCR-generated transcriptions. Others, like the “What’s on the Menu?” project at the New York Public Library, use a hybrid model, relying on OCR to generate bounding boxes but “discarding the OCR text in favor of text supplied by human volunteers” (Rawson and Muñoz 286). In the NYPL model, OCR still determines what text from the original image is to be transcribed, mediating what content appears in the resulting transcriptions. This returns us to Cordell’s observation that OCR is not merely an auxiliary feature of the primary archival object, but rather constitutes “a new edition of that text” (196).

Many historical newspapers and periodicals and their OCR outputs are now available thanks to ongoing mass digitization projects.5 Despite such enhanced searchability of historical titles, the quality of OCR ranges widely, such that any machine-readable corpus derived from digitized materials is only as “complete” as the OCR outputs from which it is sourced. Media theorists argue against such language of completion, imploring researchers to understand and contend with error-prone digital collections as new editions, in the bibliographic sense, of their material sources (see Cordell; see Rawson and Muñoz).

Such theoretical considerations, however important, need to be tempered by the practical aims of using digitized historical materials for public-facing research.6 In order to create exhibitions, we need source material, and the material of digital exhibitions is data. This data need not be perfect, but must be good enough to be collected, processed, and transformed for exhibition, typically via interactive features such as dashboards or visualizations. The notion that scholars must work with “clean” data has been critiqued by Rawson and Muñoz who, drawing on their experience with the NYPL menu crowdsourcing project, argue that cleaning insists on “a normative order by wiping away what is different… suppos[ing] things already have a rightful place, but they are not in it” (280). Again, however important it is for researchers to take such matters into consideration, the realities of exhibiting data sourced from digital collections necessitate certain pre-processing steps exemplified by our two case studies. We do not agree that accounting for incorrect OCR transcripts and maintaining the uniqueness of the digitized object are mutually exclusive goals. Silences created by erroneous OCR will persist whether we address them or not. And so, rather than refusing to process textual data (which would in effect leave us unable to analyze or exhibit it), we proceed on the belief that fostering some engagement, however limited, is better than no engagement at all. That these steps are necessarily reductive remains a reality of exhibiting digitized materials within the limitations of current digital infrastructures. Let us turn now to a discussion of the creative approaches to data management and visualization leveraged in our two case studies.

The Riddle Project—A Top-Down Approach

In 2018, the McGill Library acquired over 1,300 nineteenth-century culinary and medical receipt manuscripts from a mansion in England’s Doncaster region. The collection contained a curiosity: a handwritten diagram depicting a plan for a meal table setting (Figure 1), in which each food dish was described in riddle form.7 It was with this Doncaster Ur-text that the Riddle Project began. To place this curiosity, called an Enigmatic Bill of Fare (EBoF), in context, we sought to curate a dataset of riddling menus and of newspaper and periodical content advertising or reporting on these events that could answer such questions as: How were these riddles exchanged and how did they influence each other (see Drezner)? Where and how could we find others? Was this a local phenomenon? How did it fit into the larger practice of riddling? How could we exhibit this material and any emerging trends in a meaningful way?

Figure 1. Eliza Smithson’s 1805 notebook contained riddling place settings, above left, as well as some solutions, above right (33).

While the Doncaster EBoF was our starting point, investigation proved that EBoFs were but one of many culinary riddling practices. Initial instances of riddling menus and events were most often found in manuscript form, requiring meticulous manual transcription by student research assistants. In 2019-20, the McGill Library ran a pilot project using Handwritten Text Recognition powered by Quartex (an Adam Matthew Digital tool) on the Doncaster Recipes Collection. Powered by artificial intelligence, the software was in its beta phase and manual transcriptions were required to train the AI as well as generate word and character accuracy statistics. The team therefore invested hundreds of hours in manual transcription—adding a corrective layer to any source material that was derived from OCR and generating new transcripts as necessary. With these reliable transcripts for the original manuscript, new and sometimes unexpected keywords emerged that enabled discovery of other riddling events across time and geographic location.
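
For readers unfamiliar with how word and character accuracy statistics of this kind are typically derived, the following is a minimal sketch in Python. It assumes that accuracy is measured as one minus the edit distance between a manual transcript and a machine transcript, divided by the length of the manual transcript. This is a common convention, not necessarily the calculation used by Quartex, and the function names and sample strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis):
    """One minus the error rate, floored at zero. The reference must be non-empty."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


def character_accuracy(manual, machine):
    """Compare transcripts character by character."""
    return accuracy(manual, machine)


def word_accuracy(manual, machine):
    """Compare transcripts word by word, splitting on whitespace."""
    return accuracy(manual.split(), machine.split())


if __name__ == "__main__":
    # Hypothetical transcripts: the machine output misreads one letter.
    manual = "a dish of riddles"
    machine = "a difh of riddles"
    print(round(character_accuracy(manual, machine), 3))
    print(round(word_accuracy(manual, machine), 3))
```

A single misread letter costs far more at the word level than at the character level, which is one reason keyword searching is so sensitive to OCR quality.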

The Riddle Project used a top-down approach, aggregating individual instances of keyword search matches from across different databases. This method pulled together documents from various existing historical sources and collections. Serendipitous discovery played an important role in uncovering related and relevant keywords,8 and the happy accident of discovery was crucial to uncovering “hidden connections,” to use Allen Foster and Nigel Ford’s term for creative research connections (322). Specifically, early serendipitous discoveries of EBoFs in primary sources—notebooks, menus, and historical publications—led the team to use iterative keyword searches in other historical databases including Gale, EBSCO, and JSTOR to uncover other instances of culinary riddling practices as advertised in journals and newspapers. Chaining these searches, moving from one newly discovered keyword search to the next, substantially increased the size of the data set, particularly when the chain led from EBoFs in the UK to the “Conundrum Social” in North America. The Conundrum Socials were events built around a riddling menu, a phenomenon that became popular in the nineteenth century.

Our data set remained limited by geographical logistics for on-site visits, and also by what was digitized and the quality of OCR outputs. As new databases of interest emerged (new “berry patches,” to use Marcia Bates’ metaphor for evolving searches9), so too did new data clusters scattered through digitized historical newspaper databases across the UK, Canada, and the United States. As the dataset grew, so did the human hours required to clean, transcribe, and process the data. At its largest, the team included five research assistants working exclusively on processing instances of riddling events, from the initial instance in 1733 in the United Kingdom through events in the 1970s in Ontario, Canada. This top-down approach combined results from machine-readable editions of the newspapers’ riddling events and manuscript EBoFs. The Riddle Project eventually curated a corpus of 1,406 instances of riddling events, as advertised in newspapers accessed via twelve different repositories (Figure 2).

Figure 2: This interactive map shows all the identified riddling events, and the temporal filters and facets through which users can interact with the data. Hosted on GitHub, this content, created by the Riddle Project team, was displayed on a Touch Table accompanying the physical exhibition at the McGill Library.

Curating these findings for exhibition prompted us to present data visually; mapping the changing locations of these events therefore emerged as a priority. Geospatial information was needed for every riddling event, and this was entered manually into the database. The newspapers in which instances of riddling events were sourced offered a definite publication location, even if location of the specific conundrum was unavailable. In the exhibition phase and beyond, relevant GIS data for each entry powered an interactive map hosted on GitHub, alongside timeline visualizations10 that allowed end users to modulate the underlying data, filtering by year or source database. The final texts with which users interact are the map, the accompanying curated “storymaps” offering specific narrative trends emerging from the data, and the interactive timelines.11 To use the language of Sadia Khan and Ibrar Bhatt, in curating the Riddle Project data set we have used a top-down method to “harness[...] pre-existing content, transforming it through the application of criteria that assess and promotes belief” (1). We then published these new “packets of filtered information” for a broader audience through the exhibition and interactive elements (Khan and Bhatt 1). Although many levels removed from the original Doncaster Ur-Text, these curated online offerings connect end users with riddling practices in a multi-layered and information-rich context.
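
The filtering described above, by year or by source database, can be illustrated with a short Python sketch. It assumes that each riddling event is stored as a record with a year, a source, and a latitude and longitude, and that the map consumes GeoJSON. These are assumptions made for illustration: the record layout, the field names, and every sample value below are hypothetical and are not drawn from the Riddle Project's database.

```python
import json

# Hypothetical records: every field name and value is illustrative only.
EVENTS = [
    {"title": "Event A", "year": 1805, "source": "Repository 1", "lat": 53.52, "lon": -1.13},
    {"title": "Event B", "year": 1894, "source": "Repository 2", "lat": 43.65, "lon": -79.38},
    {"title": "Event C", "year": 1921, "source": "Repository 2", "lat": 45.50, "lon": -73.57},
]


def filter_events(events, start_year=None, end_year=None, source=None):
    """Keep events inside an inclusive year range and, optionally, from one source."""
    kept = []
    for event in events:
        if start_year is not None and event["year"] < start_year:
            continue
        if end_year is not None and event["year"] > end_year:
            continue
        if source is not None and event["source"] != source:
            continue
        kept.append(event)
    return kept


def to_geojson(events):
    """Convert event records to a GeoJSON FeatureCollection of points."""
    features = []
    for event in events:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [event["lon"], event["lat"]]},
            "properties": {
                "title": event["title"],
                "year": event["year"],
                "source": event["source"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    nineteenth_century = filter_events(EVENTS, start_year=1801, end_year=1900)
    print(json.dumps(to_geojson(nineteenth_century), indent=2))
```

Because each point carries only the coordinates entered for it, an event located by its newspaper's place of publication is indistinguishable on the map from one located precisely, unless that distinction is recorded as a further property.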

Ciphers of The Times—A Bottom-up Approach

The Times of London began publication in 1785 as one of the first British newspapers in public circulation. In 2002, The Times became available to readers and researchers through The Times Digital Archive, a Gale-Cengage resource that today offers search capabilities and full-text facsimiles of more than 1.4 million newspaper issues, or some 12 million pages. This critical mass of data, we initially presumed, presented an unparalleled opportunity for public exhibition.

The existence of a digitized archive did not, in this case, imply that the collected materials were immediately useful for computational analysis and exhibition. To read The Times at scale, even within a limited time range, requires multiple steps of data extraction and processing. We therefore narrowed our search focus to the “Agony Column.” In the nineteenth century, the front page of The Times was made up of six columns, of which “the agonies” occupied the second, couched between commercial advertisements, shipping news, and birth and death notices. As one contemporary commentator described it in 1843, the Agony Column, then still in its infancy, was “devoted to advertisements of the pathetic, appealing, interesting, remonstrating, despairing, or denouncing order” (Hayward 35). Matthew Rubery writes that personal ads akin to those found in the Agony Column were typically anonymous, addressed to someone familiar to the author, short, and frequently “deliberately cryptic” (54). In the heyday of the Agony Column, from 1850 to 1870, correspondents would often place messages in the form of riddles, ciphers, and other kinds of encoded language. One example appears below, giving both the flavour of such messages and a glimpse of the poor quality of OCR (Figure 3).

Figure 3. An example agony advertisement from February 1853 addresses “M. L. L.” and is sent by a despondent mother begging her runaway daughter to return home (TO M. L. L., 1).

The agonies are a paradoxical site of publicly-accessible yet cryptic correspondence. Critics note that the closest contemporary analog to this kind of public attention to private life was the Victorian novel. Agony advertisements, says Rubery, likewise “appeared to conceal hidden meanings” accessible only through the sustained attention of an astute readership (55). As another nineteenth century commentator put it, agony ads became, in a sense, a “novel in embryo” (“Our Novels” 417). Despite the discursive nature of the Agony Column, which eludes clear description, no one has yet used digital tools to explore how the literary character of the Agony Column impacted or drew influence from the dominant form of narrative media at the time: the novel.

Our interest in the interrelationship between these two emerging textual forms expanded into a more general need to compare and distinguish word- and sentence-level syntactic features of the Agony Column to contemporary novels. The project took a more definitive shape upon our realization that with a critical mass of agony ads at our disposal we, at least in theory, had access to a requisite volume of data to describe the similarities and differences between novels and the agony advertisements, their literary counterparts in newspapers. To test this relationship—the extent to which novels and newspaper agony ads share similar thematic and stylistic features—we used a classification model to characterize descriptive and syntactic elements that distinguished nineteenth century novels from newspaper ads.12 While the specifics of this model and its methodology are beyond the scope of discussion here, it suffices to say that running this kind of analysis requires a corpus of newspaper text comparable in size and interoperability to the large corpus of Victorian novels amassed for the same purpose.

We define building a corpus from the bottom up as a process by which researchers aim to cull a limited quantity of data from a critical mass of digitized archival material. Creating a corpus from the bottom up gives researchers control of selecting material of interest, choosing an appropriate extraction method, and then filtering and validating resulting data. We can map out the information-seeking process of this bottom-up approach using the model of Alistair Sutcliffe and Mark Ennis, which breaks the process into four activities: problem identification, articulation of information need, query formulation, and results evaluation (322). Our “problem,” to use the language of Sutcliffe and Ennis, arose out of our access to the archive: the text of the Agony Column was available to us, but not in a useful format for computational analysis.

From this point, our need was for a clean corpus to query. The affordances of bespoke corpus creation can, from the perspective of ensuring machine-readability, pose its own set of challenges, as was our experience with extracting the text of The Times from its digital record.

To overcome OCR-related issues, we chose to take the labour-intensive step of transcribing a limited portion of The Times. Gale does not make statistics on the overall OCR accuracy of The Times Digital Archive publicly available. However, their Digital Scholar Lab permits sorting pages by their OCR score, which provides insight into the general quality of OCR transcripts. Between 1850 and 1880, the height of the Agony Column’s popularity, fewer than fifty front pages of The Times have an OCR score of 98% or greater, the minimum benchmark for machine readability with most machine learning models.13 Compare this with the nearly thirty-five thousand pages that have an OCR score of 80% or less, the majority being largely unusable to us in their existing state. Understanding the limitations posed by poor OCR outputs, Gale made available to us the entirety of The Times Digital Archive in the form of high-quality image files. We experimented with a number of image-to-text platforms, ultimately using Transkribus, a proprietary AI-powered text recognition platform, because it allows users to create and train OCR models using their own data. Over several months, we used manually typed transcripts of sample Agony Columns from the 1860s to train and validate a number of transcription models. We then tested each on a slightly expanded image set of one hundred Agony Columns. Our final model, although 97% accurate in matching printed letters to machine text in our training material, performed much more variably on the first 3000 pages of Agony Columns transcribed at scale. In response, we expanded our testing dataset by adding another decade’s worth of printed material, approximately doubling the input to 6000 pages. Even then, the inconsistency of our results, and our enduring need for uncorrupted text, meant further filtering steps were necessary at the evaluation stage, progressively whittling down the size of our useable corpus (Figure 4).

Figure 4. (A) “THE PERSON who ADVERTISED for the NEXT of KIN”. A sample of the Agony Column published on April 15, 1861 from The Times Digital Archive. (B) The “dirty” OCR-text provided by Gale via the Digital Scholar Lab site. (C) A sample pane from the Transkribus software showing the corrected baselines on the same Agony Column prior to OCR. (D) The improved OCR output following model training in Transkribus.
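
The sorting of pages by OCR score described above can be sketched in a few lines of Python. The two thresholds are the figures cited in the text (98% or greater, 80% or less). Everything else is an assumption made for illustration: the record layout, the field name `ocr_score`, and the sample scores are hypothetical and do not reflect how Gale's Digital Scholar Lab stores or exposes this information.

```python
# Thresholds follow the two figures cited in the text (98% and 80%).
USABLE_AT = 0.98
UNUSABLE_AT = 0.80


def partition_by_ocr_score(pages, usable_at=USABLE_AT, unusable_at=UNUSABLE_AT):
    """Sort page records into three groups according to their OCR score."""
    groups = {"usable": [], "uncertain": [], "unusable": []}
    for page in pages:
        score = page["ocr_score"]
        if score >= usable_at:
            groups["usable"].append(page)
        elif score <= unusable_at:
            groups["unusable"].append(page)
        else:
            groups["uncertain"].append(page)
    return groups


if __name__ == "__main__":
    # Hypothetical page records with made-up scores.
    sample_pages = [
        {"page_id": "page-1", "ocr_score": 0.99},
        {"page_id": "page-2", "ocr_score": 0.91},
        {"page_id": "page-3", "ocr_score": 0.62},
    ]
    for label, members in partition_by_ocr_score(sample_pages).items():
        print(label, [page["page_id"] for page in members])
```

The middle group is the one that calls for judgment: pages that are neither reliably machine-readable nor obviously unusable are where retranscription is most likely to repay the effort.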

Top down or bottom up?

There are a number of limitations to the top-down approach used in the Riddle Project. Some are readily apparent even from a theoretical standpoint. When researchers build a corpus primarily from user-queried data they risk narrowing the scope of their results to the kinds of terms for which they are already searching. And while such a method may be adequate to identify the prevalence of key terms, it risks overlooking relevant information hiding just beyond the capabilities of even the most advanced search tools. In the context of OCR, the limitations of the top-down approach become even more evident. While it is possible to rectify some OCR mistakes using wildcard searches to account for common errors or variations in spelling, the question of poor OCR accuracy leads us to acknowledge that minor corrections may not compensate for most OCR-related search issues.
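
To make the limits of wildcard searching concrete, the following Python sketch expands a keyword into a regular expression that tolerates a handful of character confusions. The confusion table is an assumption made for illustration, a short list of substitutions often associated with OCR of historical print (such as the long s read as "f", or "m" read as "rn"). It is neither exhaustive nor derived from the databases discussed here, and the function name is hypothetical.

```python
import re

# Illustrative confusion table: not exhaustive and not empirically derived.
CONFUSIONS = {
    "s": "[sf]",        # long s is often read as f
    "m": "(?:m|rn)",    # m is often read as rn
    "e": "[ec]",
    "i": "[il1]",
    "l": "[l1i]",
    "o": "[o0]",
}


def ocr_tolerant_pattern(keyword):
    """Build a case-insensitive pattern that accepts common misreadings of keyword."""
    parts = []
    for char in keyword.lower():
        if char.isspace():
            parts.append(r"\s+")
        else:
            parts.append(CONFUSIONS.get(char, re.escape(char)))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


if __name__ == "__main__":
    pattern = ocr_tolerant_pattern("conundrum social")
    # Hypothetical snippets of OCR text.
    for text in ["A Conundrum Social was held", "a conundrurn focial", "a conundrum supper"]:
        print(text, "->", bool(pattern.search(text)))
```

A pattern of this kind recovers only the errors its author anticipates. Dropped lines, merged columns, and badly garbled words remain invisible to it, which is the limitation the paragraph above describes.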

Even more troubling is the fact that many digital databases and collections permitting keyword searching do not readily disclose OCR statistics, suggesting that these issues are not immediately apparent to the many researchers who use them.14 Currently, it rests with the researcher to search out OCR accuracy ratings and processes, and to decide how to proceed with the necessarily incomplete data available. Surely more can be done by both database managers and researchers to increase the visibility of OCR issues and of OCR’s resultant curatorial role. And while page-level accuracy scores require accurate transcriptions for comparison, surely digitization and transcription initiatives could make further effort to provide information on what is excluded by the process of OCR. While many scholars with a background in the digital humanities are innately attuned to such issues, this is not the case for all end users. For example, Gale’s Digital Scholar Lab is increasingly geared toward facilitating student research. The onus to address OCR-related concerns rests not solely with the institutions and publishers in charge of digital collections. Systematically surveying the state of our digital archives is an effective first step toward improving their use and a goal to which researchers can productively contribute. Such meta-analyses of research methodology are common in the natural and physical sciences and serve as a necessary vantage point to better collective use of emerging technologies. Researchers will increasingly take the lead on pushing for transparency around the tools used by publishers in the creation of OCR statistics, as stakeholders actively using digital collections. Although transparency does not immediately fix the issues posed by OCR, it would allow researchers to make informed decisions about which collections they use.

In contrast, the bottom-up approach becomes especially valuable in cases where the administrators of a digital collection do not make OCR statistics available. Working from the bottom up, the Ciphers Project team was obliged to examine all aspects of the archival image file including its size, the number of pixels per inch, and differences in how individual newspaper issues were scanned to microfilm. While it was immediately obvious that the OCR outputs available via The Times Digital Archive site were inconsistent (see Figure 4B), it was only when we scrutinized the digital object that the underlying issues became apparent, as well as potential solutions for improving OCR.

A final point of comparison between these two approaches involves differences in the kind of data they furnish and their potential for curatorial outreach or research output. While the top-down approach uses strategies nearly ubiquitous in database-centric research, the variety of source materials, which ranged from handwritten historical ephemera to digital surrogates, made it close to impossible to work with next-generation DH methodologies on the resulting dataset. It was too labour-intensive to create a data set of the required size. Simply put, the range of analyses (and thus the range of potential avenues for exhibition) available to researchers with a corpus the size of that made available by the bottom-up approach is many times greater than that sourced using the top-down method. The bottom-up approach can furnish detailed transcripts at a scale unavailable through traditional search-chaining strategies.15 The Ciphers project, for example, is currently using complete sentences taken from The Times to compare syntax-level features of newspapers to novels (for which reliable data already exists) from the same period.

Nevertheless, the top-down method is a valuable methodology for use in curatorial outreach projects when the aim is not comprehensiveness but rather to highlight novel or interesting aspects of historical collections and their times. So long as curators make end users aware of the many layers of mediation present, curated exhibits are important tools for connecting the public with historical records. The Riddle Project attempted to strike this balance by exhibiting primary sources alongside the interactive, data-based visualisations.16

The choice of whether to use a top-down or bottom-up approach to collect text data from digitized historical archives is one of balancing trade-offs with available resources and assessing their affordances to exhibition with the extant limitations of machine-readability. Access to proprietary collections, even when their contents are sub-standard, presents another, non-trivial issue. Without institutional subscriptions or financial support, many of the collections used in both projects remain cloistered behind paywalls and thus cannot be searched nor a preferable method of data collection determined. While each method includes labour-intensive steps at different points in the process of corpus creation, both face the challenge of mitigating data loss due to poor OCR scores. And while the bottom-up approach can be time and resource intensive for first-time users of OCR software, it is currently one of a limited number of ways for researchers to address data accuracy issues at their source.

Creative Exhibiting

The top-down and bottom-up approaches each attempt to balance the limitations of digital collections with the intention of exhibiting them to both scholarly and general audiences. Both examples show that the most effective modes of exhibition are not usually the most convenient. While it would be relatively simple to fill a display case with paraphernalia relating to historical newspapers or riddling menus, the growing availability of digital collections and the many potential modes of analyzing them compel us to revisit how we approach exhibitions, and our expectations of what end-users will take away from them.

Not every end user will be a scholar of digital media, aware of the complex layers of mediation imposed by digitization. With this in mind, the curated outputs of both projects were intended to make their mediated nature a feature of the exhibition, readily apparent to end users, rather than bury these limitations in technical outputs. The data visualization component of the Ciphers project, for instance, includes a data table, where users can search the corpus for individual words and phrases. Outstanding transcription errors, even those that contribute to and potentially distort analysis, are available to view. These transcripts are displayed next to graphic visualizations, enabling attentive end-users to begin to grasp the limitations of our analysis. The digital Riddles exhibit, in addition to offering visualizations of riddling events over space and time, connects individual events (as data points) back to the source material from which that event was recognized, establishing a link between the primary object and its digital surrogate. While it would be nearly impossible for end users to view the magnitude of primary material that goes into the creation of a single datapoint, connecting them to exemplars reveals the otherwise intangible materiality of the objects behind the screen. In both cases, digitization is repurposed in service of fostering new kinds of connections through the affordances of digital outputs.

Any exposition as to the nuances of digital research must be balanced against an exhibition’s primary aim: connecting and engaging end users with library collections. Embedding methodological considerations into research outputs, while interesting to scholars, can quickly become dry for lay audiences. Even when it is possible to do so, what does taking these extra steps achieve? Does it dispel a layer of mediation that otherwise stands in the way of understanding? Do users even care about these layers in the first place?

Scale is exciting. It draws an audience. Part of the wonder on which both the Riddles and Ciphers projects capitalize comes from our ability to say we analyzed hundreds or thousands of historical documents and that by attending our exhibit you, viewer, will have read them too. But it is our view that we as curators owe more to our audience as well as to our objects. In addition to inspiring reverence or wonder, exhibits of digital collections must also foster discussion, and even problematize their own methodologies and findings. If not, we risk ignoring layers of mediation or exacerbating their misapprehension.

To this end, there remains much room for creativity and innovation in how we as researchers and curators communicate the nuances of method to end users. As the example of the Twine game (one component of the Ciphers exhibition) shows, digital research findings need not always be translated at scale. Often, offering a mode of engagement that acts to summarize larger, more discursive outputs is sufficient. Nor are these engagements limited to the final, exhibition stage of public-facing research. End users can be involved in the preliminary stages of data collection, something the New York Public Library menu crowdsourcing project aimed to do by soliciting transcription help, or at the analysis step, as txtlab’s Citizen Readers project encourages.17 The former approach puts users into direct contact with digital collections, bridging one of the many layers of mediation between primary source and curated output. Ironically both approaches reduce scale to a singular, interactive object, without sacrificing its integrity or nuance. But the curators themselves both exhibit and reap the rewards of visualizing user engagement at scale.


Opining on the “fraught” practice of describing archives, Wendy M. Duff and Verne Harris write that acknowledgement is but the first step toward reconciling archival silences, that speaking out on erasure “breaches a circle of knowledge, allows in, invites in, fresh and disturbing energies” (278). It is thus imperative, they conclude, that “we document and make visible these biases” if only to lay bare “the values of the archivists who create them” (277–278). This logic persists in the digital context, where practices once assumed by archivists alone are now the work of people and digital tools in tandem. Unseen acts of mediation, wherever introduced, are silent for only as long as we fail to speak of them.

Such acknowledgement compels us to consider how current digitization procedures shape both the way we interact with digital collections and the research produced from those interactions. Equally, it forces us to be creative in the ways we exhibit and attempt to reconcile these layers of mediation in digital collections. Recast in this light, curators may see the hurdles introduced by OCR not as roadblocks to exhibition but as opportunities for imaginative recourse. Whether or not we choose to correct OCR or manipulate the databases we compile, curatorial end products must in one way or another reflect those decisions, momentous if only for the fact that they can silently reinscribe machine-introduced bias. We are of course cognisant of the many practical limitations (temporal, monetary, or otherwise) that interfere with such work. We do, however, owe it to our expanding digital archives and the growing audiences they inspire to try.

Works Cited

Anderson, Ian G. “Are You Being Served? Historians and the Search for Primary Sources.” Archivaria, vol. 58, Fall 2004, pp. 81–129.

Bates, Marcia J. “The Design of Browsing and Berrypicking Techniques for the Online Search Interface.” Online Review, vol. 13, no. 5, 1989, pp. 407–424.

Blouin, Francis X. “Archivists, Mediation, and Constructs of Social Memory.” Archival Issues, vol. 24, no. 2, 1999, pp. 101–112.

Carter, Rodney G. S. “Of Things Said and Unsaid: Power, Archival Silences, and Power in Silence.” Archivaria, no. 61, 2006, pp. 215–234.

Cordell, Ryan. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Book History, vol. 20, no. 1, 2017, pp. 188–225. Project MUSE,

Da, Nan Z. “The Digital Humanities Debacle.” The Chronicle, 27 Mar. 2019,

Dash. 2015. Plotly, 27 Mar. 2023. GitHub,

Doncaster Recipes Collection—Archival Collections Catalogue. 27 Sep. 2022,

Drezner, Nathan. “How Do Riddles Move from EBoF to EBoF?” The Riddle Project, 24 Nov. 2019,

Duff, Wendy M., and Verne Harris. “Stories and Names: Archival Description as Narrating Records and Constructing Meanings.” Archival Science, vol. 2, no. 3, Sep. 2002, pp. 263–85,

Foster, Allen, and Nigel Ford. “Serendipity and Information Seeking: An Empirical Study.” Journal of Documentation, vol. 59, no. 3, 2003, pp. 321–340.

Fyfe, Paul. “An Archaeology of Victorian Newspapers.” Victorian Periodicals Review, vol. 49, no. 4, 2016, pp. 546–577.

Hayward, Abraham (unsigned). “The Advertising System.” Edinburgh Review, vol. 77, 1843, p. 35.

Holley, Rose. “How Good Can It Get?: Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs.” D-Lib, vol. 15, no. 3/4, 2009,

Khan, Sadia, and Ibrar Bhatt. “Curation.” The International Encyclopedia of Media Literacy, John Wiley & Sons, Ltd, 2018, pp. 1–9. Wiley Online Library,

Martin, Kim, and Anabel Quan-Haase. “‘A Process of Controlled Serendipity’: An Exploratory Study of Historians’ and Digital Historians’ Experiences of Serendipity in Digital Environments.” Proceedings of the Association for Information Science and Technology, vol. 54, no. 1, 2017, pp. 289–297. Wiley Online Library,

McGill University Library, et al. “The Riddle Project.” The Riddle Project, Accessed 25 Oct. 2022.

“Our Novels.” Temple Bar: A London Magazine for Town and Country Readers, vol. 29, July 1870, pp. 410–424.

Rawson, Katie, and Trevor Muñoz. “Against Cleaning.” Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, University of Minnesota Press, 2019, pp. 279–292. JSTOR,

Riddle Project, The. Riddling Events, 1733-1971. McGill University Library. Accessed 15 May 2023,

Rubery, Matthew. The Novelty of Newspapers: Victorian Fiction After the Invention of the News, Oxford University Press, 2009.

Smithson, Eliza. Smithson Riddle Book. 1805, McGill University Rare Books and Special Collections, MSG-1230-1-1,

So, Richard Jean, and Edwin Roland. “Race and Distant Reading.” PMLA, vol. 135, no. 1, 2020, pp. 59–73,

Sutcliffe, Alistair, and Mark Ennis. “Towards a Cognitive Theory of Information Retrieval.” Interacting with Computers, vol. 10, no. 3, June 1998, pp. 321–351. Silverchair,

Tanner, Simon, et al. “Measuring Mass Text Digitization Quality and Usefulness Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive.” D-Lib Magazine, vol. 15, no. 7/8, 2009,

Citizen Readers, 21 September 2023,

“THE PERSON who ADVERTISED for the NEXT of KIN.” Times, 15 Apr. 1861, p. 1. The Times Digital Archive. Accessed 18 May 2023.

“TO M. L. L.” Times, February 1853, p. 1. The Times Digital Archive. Accessed 18 May 2023.
