Could social scientists and humanities scholars be replaced by bots?

From the December 17, 2010, issue of the journal Science comes a News of the Week piece “Google Opens Books to New Cultural Studies.” It sketches the ongoing research of a mathematician, Erez Lieberman Aiden, who is studying word frequencies using all of Google Books as his data source. Here’s the abstract of the technical publication.

By analyzing the growth, change, and decline of published words over the centuries, the mathematician argued, it should be possible to rigorously study the evolution of culture on a grand scale.

The researchers have revealed 500,000 English words missed by all dictionaries, tracked the rise and fall of ideologies and famous people, and, perhaps most provocatively, identified possible cases of political suppression unknown to historians. “The ambition is enormous,” says Nicholas Dames, a literary scholar at Columbia University.

Just what the humanities needs! More studies of domination and resistance.

I already tend to view quantitative research with a skeptical eye, although I appreciate the insight statistics can provide when they’re well done. However, this project immediately rubs me the wrong way, and it’s not just the fawning praise of its massiveness. (Quantitative research motto: “Size Matters.”)

Why are word frequencies significant? The whole thing seems like a glorified version of those inane word count “studies” done by OkCupid, like this one on gays and straights or this one on whites and non-whites.

To understand a word’s meaning (as opposed to its definition) you have to look at context, style, and tone. In short, you have to read to interpret. Which is why I was particularly intrigued by how the researchers and their collaborators at Google navigated the copyright controversy associated with the Google Books project.

The project almost didn’t get off the ground because of the legal uncertainty surrounding Google Books. Most of its content is protected by copyright, and the entire project is currently under attack by a class action lawsuit from book publishers and authors. [Peter Norvig, head of research at Google] admits he had concerns about the legality of sharing the digital books, which cannot be distributed without compensating the authors. But Lieberman Aiden had an idea. By converting the text of the scanned books into a single, massive “n-gram” database – a map of the context and frequency of words in history – scholars could do quantitative research on the tomes without actually reading them.

Take that hermeneutics! Now we can interpret texts without reading them.
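For anyone wondering what an “n-gram” database amounts to in practice, here is a minimal sketch in Python. It is only an illustration of the general idea (counting runs of n consecutive words per publication year), not the researchers’ actual pipeline. The function names, the whitespace tokenizer, and the two toy “books” are all made up for the example.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def build_table(docs, n=1):
    """Map year -> Counter of n-grams, given (year, text) pairs.

    Tokenizing by lowercasing and splitting on whitespace is a
    simplifying assumption; a real corpus needs far more care.
    """
    table = defaultdict(Counter)
    for year, text in docs:
        tokens = text.lower().split()
        table[year].update(ngrams(tokens, n))
    return table

def relative_frequency(table, gram, year):
    """Share of all n-grams in a given year that match `gram`."""
    total = sum(table[year].values())
    return table[year][gram] / total if total else 0.0

# Hypothetical toy corpus, invented purely for illustration.
toy_docs = [
    (1900, "the milk was spilt and the milk was lost"),
    (2000, "the milk was spilled"),
]

unigrams = build_table(toy_docs, n=1)
bigrams = build_table(toy_docs, n=2)

print(relative_frequency(unigrams, ("spilt",), 1900))
print(relative_frequency(unigrams, ("spilled",), 2000))
print(bigrams[1900][("the", "milk")])
```

Notice that once the table is built, the original sentences are gone. All that remains is counts per year, which is exactly why nobody has to read anything.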

I’ll allow that there is a significance to word frequencies. After all, stuff like this (from the online supplement to the technical report) is pretty cool:

Why it matters, I’m not so sure. But it is kinda neat.

Lieberman Aiden is a student of genomics by training, so by naming his new study “culturomics” he’s playing a little word game. Genomics is a very broad field of study whose principal methodology is using math and statistical modeling to draw conclusions about gene frequency within a population. For instance, genomics might inform clinal studies showing how gene frequencies vary geographically among humans. Culturomics seems to apply this idea through analogy, taking words for genes and language for the genome.

But there are limitations to how far this analogy can go. There is no part of the genome that is beyond the purview of genomics, but culturomics, with its data set limited to scanned pages of Google Books, does not consider “culture” in its entirety. Admittedly, Google Books is great in scope: “It currently includes 2 trillion words from 15 million books, about 12% of every book in every language published since the Gutenberg Bible in 1450.” But even if you intentionally limit the study to language and set aside all non-linguistic aspects of culture, you are still bound only to the written word. And the published written word at that.

More importantly, genetics, of which genomics is but a subfield, is only one of many different perspectives for understanding an organism or population. You are not your genes. There is much, much more that goes into making a living organism what it is than just its genetic composition. As the father of identical twin girls I can testify to this. Although each sister’s DNA is a perfect copy of the other’s, they are very different people. Genomics helps geneticists draw conclusions about gene frequencies, but it does not tell us which genes are turned on and off or how genes interact with their environment.

Culturomics, by analogy, doesn’t come close to its grand claim to be a rigorous study of the evolution of culture. It can provide us with some interesting information about word frequency, however. The question is, what are you going to use this new tool for?

“This is a wake-up call to the humanities that there is a new style of research that can complement the traditional styles,” says Jon Orwant, a computer scientist and director of digital humanities initiatives at Google.

Humanities scholars are reacting with a mix of excitement and frustration. If the available tools can be expanded beyond word frequency, “it could become extremely useful,” says Geoffrey Nunberg, a linguist at the University of California at Berkeley. “But calling it ‘culturomics’ is arrogant.” Nunberg dismisses most of the study’s analyses as “almost embarrassingly crude.”

Perhaps a problem lies in how the culturomic search has been limited to only variation in word use over time, when there are so many other variables to be considered. In the results above, showing that “spilt” became less popular as “spilled” became more dominant, one is left wondering… So what? That’s an interesting description of language and the documentary function is valuable, but what good is it? How can such descriptions inform what we know about the way humans use language to interact with one another?

What questions would you ask of culturomics? You can run your own culturomic experiments here.

Culture (blue) vs. society (red), 1800-2000:

Matt Thompson

Matt Thompson is Project Cataloger at The Mariners’ Museum in Newport News, Virginia, and currently working on a CLIR ‘hidden collections’ grant to describe the museum’s collection of early 20th Century photography. He has a doctorate in anthropology from the University of North Carolina and a Masters in information science from the University of Tennessee.

8 thoughts on “Culturomics?”

  1. Matt: “one is left wondering… So what? That’s an interesting description of language and the documentary function is valuable, but what good is it? How can such descriptions inform what we know about the way humans use language to interact with one another? ”

    That’s what I was thinking, is there really nothing more to it than this?

  2. ‘Revolution’ yields quite an interesting result. But then of course you have to correlate such a quantitative pattern with a qualitative explanation rooted in the specific conditions of each cycle.

    That said, the decline after 1977 is telling, isn’t it?

  3. (SarcOn)Once the models are adequately developed they will have predictive value as well.(SarcOn)

    There’s actually nothing you can’t do with a computer! (And a nice, cushy, Federal Grant.)

  4. Perhaps if you actually read the primary source, rather than news reports about the primary source, you’d understand why it matters.

    Also, you say that “there is no part of the genome that is beyond the purview of genomics, but culturomics, with its data set limited to scanned pages of Google Books does not consider “culture” in its entirety”. That completely misrepresents the fact that the number of genomes “published” is almost infinitesimally small compared with the number of books. Like the interaction between language and culture, the interactions between genes and environment/culture are also fascinating. In both cases, studying the one is not an end in itself, it’s a means towards trying to understand the much harder questions.

  5. If anyone is interested in thinking further about what happens when algorithms and hermeneutics collide in the culturomics mode, you might find a recent essay of mine of interest:

    Culturomics is just one expression among others (e.g., sociocultural modeling and simulation) where “big data” is actively seeking to colonize the “sociocultural.” As such, it raises basic questions about the distinctiveness of the ethnographic project and the grounding of different interpretive claims to knowledge.
