Colton Valentine is a PhD candidate in English at Yale. His research focuses on travel writing, affect, and the reception of French literature in Victorian England. His writing has recently appeared or is forthcoming in MLQ, Victorian Poetry, The Henry James Review, and the Los Angeles Review of Books.
One of my longstanding research interests in Yale’s English Department has been nineteenth-century literary cosmopolitanism: the ways Victorian writers interpreted and drew influence from their continental counterparts. There’s an old chestnut in this field that French literature was considered “poisonous,” i.e., decadent, immoral, and lascivious. Recent work by scholars like Margaret Cohen, Sharon Marcus, and Juliette Atkinson has challenged that cliché on a granular scale, and I’ve contributed with papers that attend to Henry James’s criticism and translations of novels by Joris-Karl Huysmans and Pierre Louÿs. Yet I’ve long felt that working with broader corpora is essential for assessing the cultural status of French literature in Victorian England. Distant reading has many epistemic pitfalls, but it’s a useful tool for probing our enduring associations between nineteenth-century literature and national(ist) insularity. So I decided to dust off my old coding skills, learn Python, and begin surveying database options for a project at Yale’s Digital Humanities Laboratory (DH Lab) called “The Poisonous French Book?”
Through my work at the Lab, I began to encounter a new set of methodological issues around data access. We hear often in the academy today, indeed in the Sawyer project title (The Order of Multitudes: Atlas, Encyclopedia, Museum), that we live in an era of unprecedented multitudes—that scholars must contend with ever-expanding digitized archives and their surfeits of (hyper)textual input. Yet the opposite problem was true for my project. I had planned to build my corpus from HathiTrust’s data—which is generously available to partner institutions—but found it held only a third of the relevant periodical titles offered by ProQuest, the leading academic database aggregator. Further, while Yale subscribes to ProQuest, a separate access level is required for the raw data needed to run text mining analyses. At my request, Yale’s DH Lab generously purchased the ProQuest permissions, but many researchers lack that level of institutional support.
Initial corpus purchase, however, is only the beginning of the text mining process. This past semester, my goal was to move from corpus identification and early pipeline development to a pipeline that processed large quantities of data and presented initial results to a public audience. Analysis, it turned out, was the easy step; the devil was in the data processing. Relatively speaking, ProQuest delivers superbly clean raw data: individual XML files for each article that include metadata on title, author, date, periodical, etc. But there is still plenty of noise: accents that render as bug-triggering characters; ambiguous signifiers, like a last-name search for Anatole France that returns all mentions of the country; and misspelled author names, like Balzac rendered as Balsac, or Huysmans and Louÿs spelled with practically any combination of vowels.
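To give a sense of the first processing step, here is a minimal sketch of parsing one article’s XML into a metadata record, with Unicode normalization to tame troublesome accented characters. The tag names (Title, Author, Date, Periodical, FullText) are placeholders of my own devising; ProQuest’s actual schema differs, and this is an illustration of the approach rather than the project’s real code.

```python
import unicodedata
import xml.etree.ElementTree as ET

def parse_article(xml_string):
    """Extract metadata and main text from one article's XML.

    Tag names here are hypothetical stand-ins for the real
    ProQuest schema; the structure, not the names, is the point.
    """
    root = ET.fromstring(xml_string)

    def field(tag):
        node = root.find(tag)
        return node.text.strip() if node is not None and node.text else ""

    record = {t.lower(): field(t) for t in ("Title", "Author", "Date", "Periodical")}
    # Normalize accented characters (e.g. the y-umlaut in Louÿs) to a
    # single canonical form so they don't trigger bugs downstream.
    record["text"] = unicodedata.normalize("NFC", field("FullText"))
    return record

sample = """<Article>
  <Title>French Fiction of the Day</Title>
  <Author>Anonymous</Author>
  <Date>1888-05-12</Date>
  <Periodical>The Saturday Review</Periodical>
  <FullText>M. Huysmans has published a new volume...</FullText>
</Article>"""

print(parse_article(sample)["periodical"])  # The Saturday Review
```

One record per article, keyed by lowercase field names, makes it straightforward to save each result as its own JSON file later in the pipeline.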
By the semester’s end, with the aid of a guided research grant at the DH Lab, I’d developed a full data processing pipeline that could deal with these bugs. It accepts ProQuest XML files, stores the metadata, extracts the main texts, and runs what’s called Named Entity Recognition (NER) on each main text to extract information on “entities” like people and locations. The pipeline then tests whether any of the entities matches a list of ~100 French authors, with a fuzziness modifier added to deal with the Huysmans vowel issues. If any positive matches are found, the sentence around the entity is extracted as a string and sentiment analysis is performed—with scores saved for both the author sentences and the full article. All that information is saved to a JSON file, one per article with at least one author match. The data analysis pipeline then extracts information from the JSON files to calculate sentiment averages—and it will eventually also use the strings to study collocates and topic clusters. Collocates and topic clusters offer two sources of information on the words that tend to appear together in text files; they point us to trends in, for instance, the adjectives or metaphors that qualify an author’s name. The next stage in the project will be to develop an initial visualization model that presents these results to other students and researchers.
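The fuzzy matching step can be sketched with Python’s standard-library difflib, which scores string similarity and so catches misspellings like “Balsac” or vowel-mangled Huysmans. This is an illustrative stand-in, not the pipeline’s actual matcher, and the 0.8 threshold and the short author list are assumptions of my own for the example.

```python
from difflib import SequenceMatcher

# A small illustrative subset of the ~100-author list.
AUTHORS = ["Balzac", "Huysmans", "Louÿs", "Zola", "Flaubert"]

def fuzzy_author_match(entity, authors=AUTHORS, threshold=0.8):
    """Return the best-matching author for an NER entity, or None.

    SequenceMatcher.ratio() gives a similarity score in [0, 1];
    the 0.8 cutoff is an example value, not the pipeline's tuned one.
    """
    best, best_score = None, 0.0
    for author in authors:
        score = SequenceMatcher(None, entity.lower(), author.lower()).ratio()
        if score > best_score:
            best, best_score = author, score
    return best if best_score >= threshold else None

print(fuzzy_author_match("Balsac"))     # Balzac
print(fuzzy_author_match("Huijsmans"))  # Huysmans
```

Note that string similarity alone cannot resolve ambiguities like Anatole France versus the country, which is why the pipeline also inspects the surrounding sentence context.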
Throughout this work, I’ve learned that access is not merely a question of purchasing ProQuest data files. It is a question of having that data processed into clean, curated corpora so that you can begin to perform analysis. It is a question of having the time on your doctoral program or tenure clock to step away from the normal cycle of reading, publishing, and teaching to build archives to which, due to copyright issues, you may not have access when you leave the given institution. And, when you’re finally ready for analysis, it is a question of having the raw computing power necessary to run your files. As built, the processing pipeline for “The Poisonous French Book?” currently takes 5-10 minutes per XML file, which would require weeks of personal computer time to run at scale without the GPU-accelerated hardware at Yale’s DH Lab. It seems likely that, in the future, such services will become part and parcel of the library resources at top universities—usable so long as you have the right VPN connection.
The more dimensions I found in access, the more I came to feel that a corrective was needed for our associations around contemporary multitudes—a corrective written not just for specialists in Victorian literary culture but for a broader audience. The result is the piece I’ve developed with the support of the Sawyer grant, provisionally titled “Access Precedes Essence,” which is slated to appear in the Stanford Humanities Review next year. The piece will draw on my own experience with “The Poisonous French Book?” but will situate that narrative in a broader scholarly and public conversation around text mining and digital access. I’m particularly interested in how legal and institutional barriers inform corpus construction and preservation—how data can end up siloed at universities that pay for raw data, fund research for pipeline development, and then run analysis on local servers. While DH work is often associated with the start-up maxim “move fast and break things,” it can also be painstakingly manual and rife with litigation fears. When it comes to data mining, I’d suggest, our experience of contemporary multitudes is also one of antiquarian scarcity. More than a cautionary corrective, I envision this piece as a call for universities and legislators to work together to develop new solutions around data access.