When you think of scholarship you might think first of publications: articles and books. But that is just the final product. Yes, it is polished through countless hours of research, writing, and responding to reviewers, but all of that work rests on an even more time-consuming foundation of collecting raw materials. In cultural anthropology this includes field notes, journals, marked-up literature, audio recordings, transcripts, and perhaps photographs and video. I think I even have a few 3-D objects squirreled away in banker’s boxes. Although we seldom call it that, all of this is “data”: information awaiting interpretation.
We take great pride in our finished products; peer-reviewed publications are still the coin of the realm. Our attitudes toward data in cultural anthropology are less clear. Are our data worth saving? What have you done with your data? How would you feel about sharing your data with others?
For Harvard political science professor Gary King there is a problem with data in scholarship generally: the data used to produce most published works are not publicly available. Typically they sit on the researcher’s hard drive, and only he or she has access to them. That might sound like normal behavior to you, but keep in mind that hard drives are fragile things. For the purpose of this report I went in search of my dissertation data and found it on an old PC in the attic…
In cultural anthropology, a discipline where so much of the self is invested and sacrificed in the work of collecting data, I suspect that many of my colleagues would be very reluctant to share their raw materials. They are private for many reasons. Perhaps we see them as full of mistakes and prejudices. Perhaps they include confidential information. Perhaps they are as embarrassing as that unfinished novel in your top desk drawer. We don’t write fieldnotes for an audience!
After all, what would you have to gain from open data? At least with a publication you might burnish your reputation by earning a citation in the works of others. If someone ever did use our data, we’d want to get credit for it. But mostly we want to be in control of our data and say what parts, if any, will be seen by others.
At first glance this might seem like a problem primarily for the hard sciences. All those tables and graphs are based on something, but once they’re published you can’t pop open the hood and see what’s making them run. Tables and graphs are just images; they’re not interactive (yet). Shouldn’t readers be able to check whether an author’s data match their claims? That’s just good empiricism.
Okay, there’s a similar problem in cultural anthropology. You read an article and the author inserts a quote from an informant or muses on an observation. But what was the broader context? What came before or after? Maybe it’s irrelevant. The point is you can’t go back and check; unless you’re an expert in the same field site and know the same locals, you won’t have access to that information. What if you could? Where could anthropology take us next if ethnographers produced not just finished products but also datasets that could be mined by new eyes, trained with a different focus to pick out details for another agenda?
We may have to tackle this problem sooner than you think. NSF, NIH, and many others are requiring that funding recipients implement a plan for data archiving, even making their data publicly available. Fortunately there is a cheap and easy way to do this.
Dataverse is an open-source, online repository for data. It’s free to use and very user friendly; signing up is about as complicated as signing up for Gmail. Dataverse is for creators: it allows authors to control which data audiences are allowed to view, it preserves data files into the future, and it generates citations for datasets. Dataverse is for readers too: it documents changes in data over time, lets audiences check the data behind the claims, and allows an author to distribute data publicly without readers having to ask for permission. Plus there’s a nice GUI to simplify interaction with the digital archive.
EXAMPLE 1: Using Dataverse at the end of a study.
For existing publications, archiving your raw data is as simple as signing up for an account and uploading your files. This preserves your files in a digital repository, making it easy to return to them when composing future publications. If you want to make parts of your dataset available to others you can do that, or you can keep it private. Recently Dataverse has created a plugin that integrates with OJS (Open Journal Systems), the publishing platform behind many of the best open-access publications, allowing authors to submit their data for archiving alongside the article submission.
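Whatever repository you use, the core of archiving is pairing each file with descriptive metadata and a way to verify the file’s integrity later. Here is a minimal sketch in Python; the field names are generic, Dublin Core-style choices for illustration, not the exact Dataverse metadata schema, and the file contents and names are made up:

```python
import hashlib
import json

def make_metadata(file_bytes, title, creator, description):
    """Build a simple metadata record for one data file.

    The fields here are generic (Dublin Core-style); a real
    Dataverse deposit uses its own citation metadata block.
    """
    return {
        "title": title,
        "creator": creator,
        "description": description,
        # A checksum lets future readers verify the file is intact.
        "md5": hashlib.md5(file_bytes).hexdigest(),
    }

# Hypothetical example file content for illustration only.
record = make_metadata(
    b"field notes, June 2012 ...",
    title="Dissertation field notes",
    creator="A. Nthropologist",
    description="Scanned notebooks from the dissertation field site",
)
print(json.dumps(record, indent=2))
```

Even if you never touch a repository’s API directly, keeping a sidecar record like this next to each file makes the archive legible to someone other than you.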
EXAMPLE 2: Using Dataverse from the beginning of a study.
The best way to ensure that your data are ready for long-term preservation is to plan for it from the start. Create a Dataverse account at the beginning of your research project and build the data archive as each stage of the project is completed. This is better than generic cloud storage because files are enriched with metadata, which allows faceted searching. You can choose to release all or part of the study whenever you are ready. And if you are working as part of a research team, you can use Dataverse as a platform to share data files with authorized team members while still excluding the general public.
Where are my bits?
Your bits are housed in a Dataverse Network, and there are many preexisting networks to choose from, including those of Harvard University and the University of North Carolina’s Odum Institute. When you create an account with one of these institutions, your files will live on their servers. Yes, you can create your own Dataverse Network on your own server: it is open-source software, so if you have the technical expertise to install it and the resources to pay for server space, you can run your own Network.
Most users are going to be satisfied with having a Dataverse on someone else’s Dataverse Network. Once you have your account set up, you begin by creating a Study. Studies are composed of data files and their associated metadata. You can easily sort your files and describe them to your heart’s content. Collections can be formed out of groups of Studies.
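The hierarchy just described (files plus metadata grouped into Studies, Studies grouped into Collections) can be sketched as a simple data model. The class and field names below are illustrative, mirroring Dataverse’s terminology rather than its actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Study:
    """A Study pairs data files with their descriptive metadata."""
    title: str
    files: List[str] = field(default_factory=list)       # file names/paths
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class Collection:
    """A Collection is simply a named group of Studies."""
    name: str
    studies: List[Study] = field(default_factory=list)

# Hypothetical example content, for illustration only.
interviews = Study(
    title="Interview transcripts 2011",
    files=["interview01.pdf", "interview02.pdf"],
    metadata={"site": "Village X", "language": "English"},
)
fieldwork = Collection(name="Dissertation fieldwork", studies=[interviews])
print(fieldwork.name, "contains", len(fieldwork.studies), "study")
```

The point of the sketch is that the unit of deposit is the Study, not the individual file: metadata attaches at the Study level, and Collections are just groupings on top.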
Still sound crazy? If you’re at an elite R1 you might already have a full-time “data archivist” on staff who can help you with the process; it’s an increasingly common library service even at second-tier R1 schools. Some schools, like Emory and George Mason, already have Dataverses set up on the big Networks, so talk to your librarian and you may be able to skip a step and go right to Collections and Studies. If you’re flying without professional librarian support, give the good people at Harvard or UNC a call. They have full-time employees to help researchers use their online tools.
Limitations of Dataverse
A word of caution: Dataverse has no curatorial oversight, so the researcher must take primary responsibility for managing their own data. You get the server space and software, but it’s on you to use them properly. Researchers can upload missing or inadequate data, or they may use proprietary file formats. There is no built-in migration of file formats, so obsolete formats will age and become more difficult to read over time. That researchers have uploaded their data does not guarantee that they have complied with standard ethical requirements such as IRB approval and confidentiality agreements. Because researchers write their own metadata, that metadata may be inaccurate. And raw data may be impossible to interpret if researchers fail to share their coding documentation.
Additionally, Dataverse was not created with cultural anthropologists in mind. It was originally developed by Harvard’s Institute for Quantitative Social Science to hold tabular data. It’s meant to hold the files for polls, surveys, and experiments, and it boasts robust, built-in statistical tools that won’t be useful to all of us. However, because the software is open source, development of new tools for qualitative data is already underway.
While you can upload any kind of file to the archive, some file types are better supported than others. For example, I tried and failed to upload a 2.8GB .avi file. The user interface offers no progress bar to show how far along the upload is, so after an hour I gave up. This was not a problem when I uploaded .docx and .pdf files, which passed through in a flash. If you need to archive large files such as video, you will have to contact your Dataverse Network host and request assistance. They will be happy to help by archiving files over 2GB in size on the back end.
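Given an upload form with no progress bar, it is worth sorting your files by size before you start, so large files go straight to the network host instead of stalling in the browser. A small sketch, assuming the 2GB limit I ran into (your installation’s limit may differ); the file names and sizes are made up, and in practice you would read sizes with `os.path.getsize`:

```python
TWO_GB = 2 * 1024 ** 3  # the per-file limit I hit; yours may differ

def partition_by_size(files, limit=TWO_GB):
    """files: iterable of (name, size_in_bytes) pairs.

    Returns (uploadable, too_large): files safe to push through the
    web interface, and files to hand to the Network host for
    back-end archiving.
    """
    small = [name for name, size in files if size <= limit]
    large = [name for name, size in files if size > limit]
    return small, large

# Hypothetical files for illustration.
ok, big = partition_by_size([
    ("notes.docx", 120_000),            # well under the limit
    ("interview.avi", 3 * 1024 ** 3),   # 3GB, over the limit
])
print(ok, big)  # → ['notes.docx'] ['interview.avi']
```

A two-minute check like this would have saved me the hour I spent waiting on that .avi file.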
Although Dataverse can simply and easily be used as a private archive, changing the working culture of cultural anthropology to embrace an open-data framework may be the biggest obstacle. We all have our own idiosyncratic ways of organizing files on our hard drives. This is what real science looks like: it’s messy, and it sweeps bad empiricism under the rug. All the rhetorical flourishes in the world won’t hide the fact that organized data make for better science, even if it’s just “science.” Perhaps in the future graduate programs will include practical training in creating metadata and proactively caring for data as the research proceeds. Until then it may be a hard sell to convince non-quantitative anthropologists that they need to do more than back up their files to an external hard drive every now and then.
When you put something on paper, that information is already in a fairly good state for long-term storage. Digital information is far more precarious than we tend to assume. As the days of anthropologists composing their notes on index cards fade, we should all be more concerned about the long-term organization and storage of our digital data. These files will be the historical record of twenty-first-century anthropology, but only if we care for them properly.
Thanks to Ole Villadsen, who contributed research for this post.