Order of Multitudes

From Victorian Novels to Data Visualization: A Conversation with Catherine DeRose

Catherine DeRose is the Program Manager for Yale University’s Digital Humanities Lab, where she teaches workshops on data visualization and analysis, and works with Yale faculty and students to help them use digital tools to expand their research. She is also a Lecturer in Yale’s Department of Statistics and Data Science, and has published widely on pedagogy and digital humanities. Dr. DeRose earned her Ph.D. in English from the University of Wisconsin–Madison, where she focused on issues of text preservation in the work of Victorian authors. In this interview she discusses the history of information overload, developing digital humanities research questions, and how a nineteenth-century racehorse foiled a digital analysis tool.

Sarah Pickman: How did you first become interested in using digital tools as part of humanities research? Can you speak briefly about what tools or techniques you used in your own dissertation work?

Catherine DeRose: My engagement with digital humanities was very much a snowball effect. The first year in my PhD program, I was a project assistant for Susan Bernstein, an English professor who was interested in whether a computer could detect what she referred to as “the signal of seriality.” Many Victorian novels were initially published as serial installments, and Susan hypothesized that there may be something linguistically distinctive about them that a computer could identify, possibly informing our understanding of what separates monthly from weekly installments or serialized from nonserialized novels. I was tasked with learning a digital tool for testing that hypothesis. I learned the tool, DocuScope, by working with another professor in my program, Michael Witmore, who was himself beginning a long-term, multi-institutional project to create digital tools for studying the development of early English print. I quickly became part of that project team, too, for which I collaborated with colleagues in English, Computer Science, and the Library. Through those partnerships, new projects emerged, and by the time I reached the dissertation stage, I was a digital humanist. 

During graduate school, I was mostly working with prepackaged software for text and network analysis. It wouldn’t be until I joined the Yale Digital Humanities Lab that I would learn programming languages and begin working with maps and image computation. Although I was increasingly drawn to digital tools, my dissertation work was largely devoid of them! Instead, my dissertation tackled questions at the center of the digital humanities—what happens to texts when old and new media collide? How do we prioritize, access, and archive materials when we can’t possibly read or store them all?

Sarah Pickman: Your dissertation examined how Victorian authors thought about preserving their own work during an explosion of print materials. Some historians have recently raised concerns that most written media, correspondence, etc. from the early twenty-first century onward exists only digitally, which may make it vulnerable to loss for future generations. Are there any lessons we might draw from the nineteenth-century actors you’ve studied?

Catherine DeRose: Even though we try to understand today’s transition from print to electronic forms by studying what came before us, we tend not to acknowledge that Victorians engaged in a very similar process of looking back to grapple with their own changing media ecology. When we use phrases such as the “print explosion” and “information revolution,” we’re usually foregrounding technological innovation—improved printing technologies, the electric telegraph, the railway. I’m interested in the ways Victorian authors looked back to explore the process by which information is collected and preserved across material forms. Older media—such as inscribed clay tablets and handwritten journal entries—are juxtaposed with (then modern) telegrams and typewritten documents. The results often serve to remind us of the durability of older, pre-print forms and the importance of their continued preservation to future knowledge. 

They also challenge us, as consumers of texts, to make a conscious effort to consider how form and presentation influence our readings. Wilkie Collins’s The Woman in White is fundamentally a novel about texts—how they are created, managed, and interpreted. We have eleven narrators and a multitude of texts that come in the form of memoranda, letters, newspapers, books, legal settlements, medical certificates, church subscriptions, journal entries, church registers, messages in sand, and inscriptions on bodies and a tombstone. How do we assemble a coherent narrative from such an array? What criteria do we use to determine what information is trustworthy? If these questions feel familiar, it’s because they’ve become even more pressing with recent advances in text and image generation. Software can now produce “fake” content that can—on the surface—pass as authentic.

Sarah Pickman: Similarly, earlier in our “Conversations” series we spoke with John Durham Peters, who noted how our contemporary concern with “information overload” was also a problem for nineteenth-century publics. How did Victorians make sense of the explosion of text and information they encountered?

Catherine DeRose: Information overload was definitely a concern for the Victorians, too! Instead of mass digitization, they were contending with an explosion of print in the middle of the century. Aileen Fyfe points to the popularity of quarterly—and then weekly and daily—reviews that emerged to help readers navigate the surge. They also had a lot of new techniques for aggregating information that developed or improved over the period—statistics, mapping techniques, timetables, library cataloging systems. 

My favorite Victorian response, however, is one that prefigures the digital humanities. The passage, which I first came across in an article by Andrew Stauffer, appears in London’s Daily News in 1869. The writer begins by lamenting, “Must we not pity the historians of the future if they should at any time be so conscientious as to turn over the mountains of waste paper which are now being shot by cartloads into the [British] Museum?” (At the time, the British Library was part of the Museum.) That lament was pretty typical for the period; other writers likened the work of future historians to that of sifting through sand to find what’s important. But what I like about the Daily News writer is that they go on to propose a technical solution: “May we hope that when things come to such a crisis, human labor of the literary sort may be in part superseded by machinery? Machinery has done wonders, and when we think of what literature is becoming, it is certainly to be wished that we could read it by machinery, and by machinery digest it” (September 15, 1869). It would take another seventy or so years, but half of their hope—that we would, at least in part, read and consume texts by machinery—would come true. For the other half, I do not think that computers should supplant human readers!

Sarah Pickman: In general, what are some things you think are important for researchers to understand or think about before embarking on a digital humanities project—what makes for a research question that lends itself well to methods like text mining or creating visualizations?

Catherine DeRose: First and foremost, researchers should be prepared for just how time consuming it is to create a dataset. Even if the texts you want to work with have already been digitized, they might not have gone through optical character recognition (OCR) software, meaning they’re not yet amenable to computation. If they’re handwritten or the font is unusually intricate or the scan quality is poor, you might not even be able to run OCR software on them reliably, in which case you’ll have to transcribe the texts by hand. Once you have your texts, there will likely still be some trial and error to determine how best to process them—should you set everything to lowercase? What stop words (words you want the computer to ignore) should you use? At what level do you want to organize the texts and count (word, sentence, chapter, volume)? The list goes on! My main advice, then, is to mentally prepare for, and build into your project timeline, a lengthy curation period. While it’s not the most glamorous part of a digital project (visualizing the results is much more fun!), you will learn a lot about your texts and the kinds of questions you want to ask of them by putting in the time to curate them carefully.
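The curation decisions DeRose lists here—lowercasing, choosing stop words, deciding what unit to count—can be made concrete with a minimal sketch. This is not any particular tool she used; it is a toy Python example with an invented two-text “corpus” and an invented stop-word list, just to show where each decision enters the pipeline:

```python
import re
from collections import Counter

# Invented mini-corpus standing in for digitized, OCR'd texts.
texts = {
    "installment_1": "The Woman in White appeared in weekly installments.",
    "installment_2": "Weekly installments shaped how the novel was read.",
}

# A curation decision: which words should the computer ignore?
STOP_WORDS = {"the", "in", "how", "was", "a", "of"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase a text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Another curation decision: counting at the word level, across documents.
counts = Counter()
for doc in texts.values():
    counts.update(tokenize(doc))

print(counts.most_common(3))
```

Changing any one of these choices—keeping case, adding a stop word, counting per chapter instead of per corpus—changes the counts, which is why the trial-and-error period she describes matters.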

My other piece of advice is to make sure you return to the source texts themselves for some spot-checking. Computers are very good at counting, but sometimes they may be counting something you didn’t expect. A fellow Victorianist, Meredith Martin, has a terrific anecdote about performing an n-gram search (a way of tracking the use of a word or phrase over time) for the word “syntax” in the nineteenth century. She saw a huge spike in usage during the first few decades. But upon closer inspection (i.e., upon reading passages featuring the word), she discovered that there was a really famous racehorse called Doctor Syntax, and he won a lot of races between 1814 and 1823. Computers are great at showing patterns, but we—the human researchers—have to determine what that pattern means and why it might be significant. For making those determinations, we often return to traditional humanities methods.
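The spot-checking workflow described above can be sketched in a few lines of Python. The corpus here is invented (three made-up sentences tagged with years), but it shows the two complementary views: a per-year count of a term (the n-gram trend) and the underlying passages that a human reader must inspect to interpret a spike:

```python
import re
from collections import defaultdict

# Invented (year, sentence) pairs standing in for a digitized archive.
corpus = [
    (1815, "Doctor Syntax won the race at York by two lengths."),
    (1816, "Doctor Syntax is the favourite again this season."),
    (1850, "The grammarian devoted a chapter to English syntax."),
]

def yearly_counts(term, docs):
    """Count occurrences of a term per year: a 1-gram trend line."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for year, text in docs:
        counts[year] += len(pattern.findall(text))
    return dict(counts)

def spot_check(term, docs):
    """Return the passages behind the counts, so a human reader can see
    *why* the term appears (e.g., a racehorse, not grammar)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(year, text) for year, text in docs if pattern.search(text)]

trend = yearly_counts("syntax", corpus)
hits = spot_check("syntax", corpus)
```

The trend alone would suggest steady interest in “syntax”; only reading the matched passages reveals that the early hits are about a horse.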

Sarah Pickman: You’re currently the Program Manager of Yale’s DHLab, which means that you spend a great deal of time helping Yale students and faculty expand their research with digital tools. Do you have any favorite projects that you’ve worked on, things that you think have asked particularly provocative questions? 

Catherine DeRose: Students and faculty bring such interesting projects to the DHLab that it’s hard to choose a favorite! I can, however, share two examples that I tend to draw on to show the range of creative, engaging work underway on campus. The first, BlakeTint, is a recent project led by Sarah Weston (PhD student in English and Art History) in collaboration with the DHLab to explore William Blake’s use of color. Scholars have long studied Blake and his use of color, but this interface provides a new way to analyze his work on micro and macro levels. Users can search by time, illuminated manuscript, or individual plate to see how the color palettes change. The second project, Glitch Lyric: Neural Networks and Poetry, by Roger Pellegrini (Yale College ‘16) explores computer-generated poetry and what makes “human-generated writing human.” His work is interested in moments where computers sound like humans, humans sound like computers, and we can’t readily tell the difference. He presented his project at the DHLab’s first Beyond Boundaries symposium (a half-day conference at which digital work happening across campus is showcased), where he put up pairs of stanzas and asked the audience—students, faculty, and staff—to identify which were written by humans and which were generated by a computer. For the pair that included Walt Whitman, over half of the room guessed wrong. What are the implications of that for how we think of voice and creativity? While the digital humanities gives us the ability to ask new kinds of questions, it also gives us opportunities to revisit older assumptions through a new lens.