I work heavily with technology as part of my creative practice, where there is an extremely strong relationship between data or information and the eventual outcome. For the past five or so years I have been working with machine learning, where the idea of a dataset—the information that you give to an algorithm from which it learns—is fundamental. At the same time, the dataset is also contentious—from the content and the categories to its very definition.
If a dataset is defined, as the Cambridge Dictionary does, as “a collection of separate sets of information that is treated as a single unit by a computer”, does that mean all data (“information [that] can be read, manipulated, and programmatically transformed by a computer”) can become a dataset? Or does a dataset stop being one when it is taken out of an algorithmic context? Increasingly the idea of the dataset has become inseparable from machine learning and artificial intelligence because of its crucial role in how they function: what is contained in a dataset becomes the knowledge that an algorithm has in order to create its world. If a dataset only contains cats, whatever is trained on it will only be able to understand cats: it sees or creates everything as a cat.
Datasets need to be extremely large in order for the algorithms to have enough information to make inferences; they also need to be cleaned and standardized in order for them to be usable. In this way, parallels with encyclopaedias start to emerge. Encyclopaedias are a prime example of how various objects are classified; they illustrate the search for a universal system to describe the world. Datasets can then be understood as a type of “contemporary encyclopaedia.” Both datasets and encyclopaedias try to record everything in the world, both make decisions as to what is important enough to be recorded, and significantly, both are systems of organizing knowledge that may not have immediate author attributions following individual entries. There is a further parallel: when encyclopaedias start to describe the world, they freeze it in a way. There is no immediate way to update a book once it has been printed. In a similar way, datasets offer a frozen snapshot of the world.
There are a number of famous datasets, frequently mined for information in research papers or used to run and test code, that are ten, sometimes fifteen years old. They do not reflect the world that we live in now. Once produced, they are very rarely reviewed or updated. People assume that because algorithms and models are using these datasets as benchmarks, they will be constantly refreshed, but this is often not the case. These datasets are also now very hard to find in their entirety. Despite their importance, both to the machine learning community and to broader society as cultural artifacts, no one is really looking after them, archiving them, or making updates to them. ImageNet, a large canonical dataset, is now almost impossible to find in its entirety. It has been offline for all but 1000 categories for over a year now. With an account on ImageNet, it is still possible to download the images from the original links, but many of the original files will have disappeared, as many of them originally came from Flickr. This means it is now impossible to trace back its fifteen years as a dataset: what it might have been used in and how it might have impacted other systems. These datasets are working objects that degrade and fall apart over time. In this sense, datasets are material objects as well. They’re not complete, and they degrade. These datasets are just like the encyclopaedia, a physical material book object; they’re both static. They both exist as snapshots of a moment in time. They need to be updated to reflect the new things going on. They both need to be cared for, and if they’re not used for their knowledge, they will disintegrate.
The construction of datasets mirrors that of encyclopaedias (it is anonymous, hidden; all of the labor disappears and becomes invisible), but it does have a key difference. People doing the work of labelling, or finding the imagery for the datasets, tend not to be experts in the field; it is done for the most part by Mechanical Turkers who are paid very small amounts of money per task and who want to work as rapidly as possible. Imagery tends to come from the Internet, which is one layer of standardization. Then the images are returned to individuals who decide what image “fits” the term in question. Each time that decision is made, it falls more and more towards the conventional, and the options are skewed. This causes datasets to reflect society in the same way that encyclopaedias reflect the people who construct them, and datasets can reinforce cultural stereotypes. People assume that with technology and the Internet, it is easier to have a more inclusive database with a nuanced approach to representation, but all of the choices that occur when constructing the encyclopaedia or the dataset are still there. Data can still be warped, manipulated, ignored, or lost, whether it was generated five, fifteen, or five hundred years ago.
Artist and Researcher
Anna Ridler is a London-based artist and researcher who works with information and data. She holds an MA in Information Experience Design from the Royal College of Art and a BA in English Literature and Language from Oxford University. In her creative work, she creates her own handmade datasets through a process of classifying and selecting materials, to explore themes of data collection, storytelling, and technology. In this interview, Anna explains how datasets, both electronic and print, are material objects that require regular updates and maintenance. Despite our attempts to organize information through new crowd-sourced technological media, we still need to consider the effects of standardization and bias.