Order of Multitudes

Building an Interactive Database

Dr. William Watson may be a musicologist by formal training, but his work demonstrates his ambitions to contribute to the broader project of interactive digital humanities. As part of his dissertation work at Yale University, Dr. Watson built the Digital Index of Late Medieval Song (DILMS), a comprehensive database that describes every copy of every vernacular polyphonic song written down in fifteenth-century Europe. In our conversation, he revealed how he hopes the humanities will engage with data and serve as a model for other fields in the future. 

Allison Chu: How did you get the idea for DILMS and how does this compare to other manuscript collections and other sources of information that were already available?

William Watson: I got the idea for DILMS in a seminar that Anna Zayaruznaya led, where I interacted a lot with this peculiar book called A Catalogue of Polyphonic Songs, 1415-1485 by David Fallows. It’s 500 or 600 pages of lists of songs and the manuscripts in which they appear. It’s basically every copy of every song that Fallows knew about when this book was published in 1999. It’s an incredible accomplishment, but its uses are constrained. I found myself having to go to DIAMM [the Digital Image Archive of Medieval Music] to get more information about the original manuscripts in which the copies of songs were located, and I had to cross check this with other databases online. This process became extremely frustrating—I was looking at many places trying to find this information. About a year later, when I was looking for my dissertation project, I thought that I would start by digitizing this catalogue, because that would give me a sort of panoptic power over the repertoire.

AC: Tell me about the process of building DILMS. It sounds like you had to consult multiple other sources such as DIAMM to find the information you needed.

WW: I first spent a long time thinking about how I wanted to structure it. If the lore is true, there is only one PDF of the Fallows catalogue, made by Michael Cuthbert in the early years of this century while he was still a graduate student at Harvard. He shared it, but it’s not a very good scan. There are lots of pages that the early 2000s OCR engine missed; you can’t effectively search it. I gradually landed on the idea that I needed to make this into a SQL database (a relational or tabular database) that would allow me to do a lot more with it. For the manuscripts, I did end up consulting DIAMM a lot. I also came to the realization that DIAMM has to be used carefully and critically because they want to collect all that has been said about a particular manuscript. This means they don’t have a commitment to taking a stand on what information is to be trusted and what is not, what has been supplanted, and what we should be looking at. To get the information about the songs, I took the scans of the Fallows catalogue that I had, dumped as much information as I could into raw text files, and went through them to do a little cleaning and reformatting. Then I wrote a Python script that took in those text files and exported them as .csv files containing all the information I wanted, formatted the way I wanted, so that I could import them into a SQL database. In data science terms, you might say that I built a one-off “data pipeline.”
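As a rough illustration of the kind of pipeline described here (cleaned text in, CSV out, SQL import), the following is a minimal Python sketch. It is not Watson's script: the pipe-separated line layout, the column names, the `inscription` table, and the placeholder entries ("Song A", "MS X") are all assumptions made up for the example, and SQLite stands in for whatever SQL system was actually used.

```python
import csv
import io
import sqlite3

# Hypothetical cleaned-up text: one inscription per line, fields separated by "|".
# The layout and the entries are invented placeholders, not catalogue data.
RAW_TEXT = """\
Song A | MS X | f. 12v
Song A | MS Y | f. 3r
Song B | MS X | f. 40r
"""


def parse_lines(text):
    """Turn each non-empty line into a (song, manuscript, folio) tuple."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            raise ValueError(f"unexpected line format: {line!r}")
        rows.append(tuple(fields))
    return rows


def to_csv(rows):
    """Serialize the parsed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["song", "manuscript", "folio"])
    writer.writerows(rows)
    return buffer.getvalue()


def load_csv(csv_text):
    """Import the CSV text into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inscription (song TEXT, manuscript TEXT, folio TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row is a dict, so the named placeholders pick out its columns.
    conn.executemany(
        "INSERT INTO inscription VALUES (:song, :manuscript, :folio)",
        reader,
    )
    conn.commit()
    return conn


rows = parse_lines(RAW_TEXT)
csv_text = to_csv(rows)
conn = load_csv(csv_text)
count = conn.execute("SELECT COUNT(*) FROM inscription").fetchone()[0]
print(count)
```

Keeping the three stages as separate functions mirrors the "one-off pipeline" idea: the parsing step can be adjusted when the raw text turns out messier than expected, without touching the CSV export or the database import.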

AC: In building this database, you had to make decisions about what to include and what not to include. To go back a little bit, how did you decide what was important and what was extraneous to your project?

WW: One of the ways I can answer this question is to ask, what counts as a copy of a song? Or rather, what counts as a song inscription? That is not an easy or an obvious question to answer, but I ended up settling on the fuzzy criterion that a song inscription must aspire to a complete status as a written record or instantiation of a song. If this song is copied into a manuscript, fine. If this song was contrafacted and is in a different manuscript but is now a sacred hymn with Latin text, that’s also fine, because it aspires to completeness as a thing in itself. If a song has been painted on the ceiling of a church in France, and you can look up at it and sing it? Also fine, because it aspires to completeness! But one line of a song quoted in a poem does not aspire to “completeness” in the same way—intertextuality is something different. In fact, literary and musical citations are the most numerous things that did not make it into the database’s initial build. 

AC: In order to categorize all of this information, what organizing principles did you decide on? Why did you end up using SQL, and could you talk a bit about your process?

WW: A SQL database consists of a collection of data storage tables that can be queried in various ways. Actually, at the core, everything in a SQL database is a table (sort of). The powerful thing about it is that you can join tables together horizontally or vertically, you can aggregate rows in tables by counting and adding and conditioning on values of fields, you can do lots and lots of transformative things. So “organizing” in SQL really means: what do I want the data storage tables to be like? I like to think about the data storage tables as SQL’s understanding of the object types in that universe. Most SQL databases are built for commercial purposes, so in the classic example a customer is an object type, and a product is an object type, and perhaps an order is an object type. All customers have attributes like address and name. In my case, an inscription became an object type, a manuscript became an object type, a layer of a manuscript is an object type, and a song is an object type (sort of). When you query the database you can reorder it any way you want, which is a significant advantage over things like printed catalogues. Both printed catalogues and SQL store information according to a well-ordered scheme, but when you interact with a printed catalogue or even some simple electronic databases, you have to adjust your query to the well-ordering principle of the catalogue. With SQL, there is some of that going on, but it’s more about “what is the organization that I care about, and how can I transform what is available to me in order to produce that.”
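To make the idea of tables as object types concrete, here is a small sketch using Python's built-in `sqlite3` module. The schema is a guess at the general shape, not the actual DILMS design: the table and column names are hypothetical, the rows are invented placeholders, and manuscript layers are left out for brevity. The same three tables are joined and aggregated in two different ways, once organized by song and once by manuscript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table per object type, with inscriptions
# linking songs to the manuscripts that contain them.
conn.executescript("""
CREATE TABLE song (song_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE manuscript (ms_id INTEGER PRIMARY KEY, siglum TEXT);
CREATE TABLE inscription (
    inscription_id INTEGER PRIMARY KEY,
    song_id INTEGER REFERENCES song(song_id),
    ms_id INTEGER REFERENCES manuscript(ms_id)
);
-- Placeholder rows, invented for illustration only.
INSERT INTO song VALUES (1, 'Song A'), (2, 'Song B');
INSERT INTO manuscript VALUES (1, 'MS X'), (2, 'MS Y');
INSERT INTO inscription VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")

# Organize by song: how many copies of each song survive?
copies_per_song = conn.execute("""
    SELECT s.title, COUNT(*)
    FROM inscription AS i
    JOIN song AS s ON s.song_id = i.song_id
    GROUP BY s.title
    ORDER BY s.title
""").fetchall()

# Organize by manuscript: how many inscriptions does each one hold?
songs_per_manuscript = conn.execute("""
    SELECT m.siglum, COUNT(*)
    FROM inscription AS i
    JOIN manuscript AS m ON m.ms_id = i.ms_id
    GROUP BY m.siglum
    ORDER BY m.siglum
""").fetchall()

print(copies_per_song)       # [('Song A', 2), ('Song B', 1)]
print(songs_per_manuscript)  # [('MS X', 2), ('MS Y', 1)]
```

Neither ordering is built into the stored data; each is produced at query time, which is the contrast with a printed catalogue's single fixed arrangement.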

AC: Of the three guiding metaphors of the Sawyer Seminar (atlas, encyclopedia, and museum), would you say that DILMS is the closest to an encyclopedia?

WW: There are aspects to DILMS that are very much like an encyclopedia, like its aspirations to comprehensiveness within a defined domain, but there are other aspects that are very much not like that. I think of an encyclopedia as a collection of chunks of information that are discrete but may be internally very complex. An encyclopedia is there to be perused. But if you’re just using DILMS to look things up, that’s not exciting. It wants you to do things with it, it wants you to manipulate what’s there and be creative with it, to use it to produce things for yourself. For instance, I was able to use it to build a probabilistic model that estimates the likelihood that songs from the 15th century have survived into the present day, and how that likelihood changes based on language, geography, or chronology. What do those differences tell us about reception history? Both this probabilistic model and a social network analysis, which are the two major computational things I did with DILMS for the dissertation, are more than collections of information organized into tables. And I hope that DILMS will ultimately act more as something to be interacted with and reshaped—so maybe closer to a museum exhibit or an archive in the way that people can construct questions to ask the database and thereby change its meaning. 

AC: What’s the future of DILMS? Is this something that you imagine to be publicly accessible? 

WW: Eventually it will be publicly accessible. The idealistic part of me says this knowledge should be published as soon as possible, because a lot of this has been aggregated and reformatted from other collections of knowledge, and publication would be a way of acknowledging my indebtedness to the long lineage of scholars that came before me. But if I want to publish DILMS, then I want people to be able to actually engage with it through SQL, because I think of DILMS as a series of arguments about how one should engage with data broadly speaking, and also how one should engage with the fifteenth-century song repertoire in particular. 

AC: Let’s talk more about interacting and engaging with data. You have this very specific focus on fifteenth-century song, but you also seem to have this much larger ambition too. What are you hoping people will be able to take away from your project?

WW: I have a degree in math and a background in tech, and I just completed a data science training fellowship, so I’ve been thinking about “data” quite seriously for most of my adult life. I’ve consistently observed that the easiest thing to ignore about data (also the most important not to ignore, and the hardest to sustain focus on) is what the data really are and where they come from. There are lots of ways to frame this problem depending on the field you’re coming from. One popular way is to observe that, etymologically speaking, “data” are “things that have been given,” but there’s a pretty long-running consensus that a better word would be “capta,” or “things that have been taken.” And this distinction speaks to our temptation to dive in and let the numbers “speak for themselves.”

Sometimes I can get frustrated with a tendency within the humanities to use “reading” as a metonym for all meaning-making. You read a text, you read a work of art, you can even read an event. Yet, as a musicologist I’m deeply invested in listening as a form of meaning-making, and as a computationally-minded person I’m pretty committed to the idea that counting is just as powerful. In order to count things, you need to know why this is one thing, and why that is another thing, and what kinds of general similarity they share that justify aggregating them together. That is a nontrivial form of meaning-making; it’s a “theoretically involved” activity.

This is all to say that, while I want people to get excited about particular datasets, I really want people to figure out for themselves what it means to count things. And I think DILMS is well positioned to help people do that, partially because (let’s be frank) fifteenth-century vernacular polyphony is not an area of study with super-high stakes. There are no time limitations or external pressures that come along with DILMS, especially when compared to biomedical data or politically sensitive data. The “unimportance” of the humanities gives us the luxury of thinking about these things, but it comes with an obligation to model how they can be done well, so that people in other fields can gain a new set of perspectives on how to do those things in their own work. After all, isn’t that sort of the job of the humanities, to model reflexive meaning-making so that people who aren’t humanists can also take these methods on and change how they interact with the world? I like to think that it is.