Dataverse: an open source solution for data sharing

When you think of scholarship you might think first of publications, articles and books, but those are just the final product. Yes, they are polished through countless hours of research, writing, and responding to reviewers, but all that work rests on an even more time-consuming foundation of collecting raw materials. In cultural anthropology this includes field notes, journals, marked-up literature, audio recordings, transcripts, and maybe photographs and video. I think I even have a few 3-D objects squirreled away in banker’s boxes. Although we seldom refer to it as such, all of this is “data”: information awaiting interpretation.

We take great pride in our finished products. Peer-reviewed publications are still the coin of the realm. Our attitudes towards data in cultural anthropology are less clear. Are our data worth saving? What have you done with your data? How would you feel about sharing your data with others?

For Harvard political science professor Gary King there is a problem with data in scholarship generally: the data used to produce most published work are not publicly available. Typically they sit on the researcher’s hard drive, and only he or she has access to them. That might sound like normal behavior to you, but keep in mind that hard drives are fragile things. For the purpose of this report I went in search of my dissertation data and found it on an old PC in the attic…

This is not secure storage.

In cultural anthropology, a discipline where so much of the self is invested and sacrificed in the work of collecting data, I intuit that many of my colleagues would be very reluctant to share their raw materials. They are private for many reasons. Perhaps we see them as full of mistakes and prejudices. Perhaps they include confidential information. Perhaps they are as embarrassing as that unfinished novel in your top desk drawer. We don’t write fieldnotes for an audience!

After all, what would you have to gain from open data? At least with a publication you might burnish your reputation by earning a citation in the works of others. If someone ever did use our data, we’d want to get credit for it. But mostly we want to be in control of our data and say what parts, if any, will be seen by others.

At first glance this might seem like a problem primarily for the hard sciences. All those tables and graphs are based on something, but once they’re published you can’t pop open the hood and see what’s making them run. Tables and graphs are just images; they’re not interactive (yet). Shouldn’t readers be able to check whether an author’s data match their claims? That’s just good empiricism.

Okay, similar problem in cultural anthropology. You read an article and the author inserts a quote from an informant or muses on an observation. But what was the broader context? What came before or after? Maybe it’s irrelevant. The point is you can’t go back and check; unless you’re an expert in the same field site and know the same locals, you won’t have access to that information. What if you could? Where could anthropology take us next if ethnographers produced not just finished products but also datasets that could be mined by new eyes, trained with a different focus, to pick out details for another agenda?

We may have to tackle this problem sooner than you think. The NSF, NIH, and many other funders are requiring that funding recipients implement a plan for data archiving, and in some cases make their data publicly available. Fortunately there is a cheap and easy way to do this.

Enter Dataverse

Dataverse is an open source, online repository for data. It’s free to use and very user friendly; signing up is about as complicated as creating a Gmail account. Dataverse is for creators. It allows authors to control which data audiences are allowed to view, it preserves data files into the future, and it generates citations for datasets. Dataverse is for readers too. It documents changes in data over time, lets audiences check the data behind the claims, and allows an author to publicly distribute data without readers having to ask for permission. Plus there’s a nice GUI to simplify interaction with the digital archive.

EXAMPLE 1: Using Dataverse at the end of a study.
For existing publications archiving your raw data is as simple as signing up for an account and uploading your files. This will preserve your files in a digital repository, making it easy to return to them for composing future publications. If you want to make parts of your dataset available to others you can do that, or you can keep it private. Recently Dataverse has created a plugin that integrates with OJS (Open Journal Systems), the publishing platform behind many of the best open access publications, allowing authors to submit their data for archiving alongside the article submission.


EXAMPLE 2: Using Dataverse from the beginning of a study.
The best way to ensure that your data are ready for long-term preservation is to plan for it from the start. Create a Dataverse account at the beginning of your research project and build the data archive as each stage of the project is completed. This is better than the cloud because files are enriched with metadata, which allows for faceted searching. You can choose to release all or part of the study whenever you are ready. Also, if you are working as part of a research team you can use Dataverse as a platform to share data files with authorized team members while still excluding the general public.

Techy stuff

Where are my bits? Your bits are housed in a Dataverse Network, and there are many preexisting networks for you to choose from, including those at Harvard University and the University of North Carolina’s Odum Institute. When you create an account with one of these institutions your files are going to live on their servers. Yes, you can create your own Dataverse Network on your own server. It is open source software; all you need is the technical expertise to install it and the resources to pay for server space, and you can have your own Network.

Most users are going to be satisfied with having a Dataverse on someone else’s Dataverse Network. Once you have your account set up you begin by creating a Study. Studies are composed of data files and their associated metadata. You can easily sort your files and describe them to your heart’s content. Collections can be formed out of groups of Studies.
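The Study-plus-metadata model described above can be sketched in a few lines of code. This is a hypothetical illustration of the idea, not the actual Dataverse schema: the field names (`title`, `author`, `description`, `label`) are stand-ins for the descriptive fields Dataverse prompts you for when you create a Study.

```python
import json

def make_study(title, author, description, files):
    """Bundle data files with descriptive metadata, the way a
    Dataverse Study pairs uploaded files with their descriptions."""
    return {
        "metadata": {
            "title": title,
            "author": author,
            "description": description,
        },
        # each file carries its own human-readable label
        "files": [{"filename": f, "label": lbl} for f, lbl in files],
    }

study = make_study(
    title="Fieldwork interviews, 2010-2011",
    author="A. Researcher",
    description="Anonymized transcripts with coding documentation.",
    files=[
        ("interview01.txt", "Transcript, informant A (pseudonymized)"),
        ("codebook.txt", "Coding documentation for all transcripts"),
    ],
)
print(json.dumps(study, indent=2))
```

Structured description like this is what separates a repository from a folder of files in the cloud: every field becomes a facet that a future reader can search and filter on.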

Architecture

Still sound crazy? If you’re at an elite R1 you might already have a full-time “data archivist” on staff who can help you with the process. It’s an increasingly common library service even at second-tier R1 schools. Some schools, like Emory and George Mason, already have Dataverses set up on the big Networks, so talk to your librarian and you may be able to skip a step and go right to Collections and Studies. If you’re flying without professional librarian support, give the good people at Harvard or UNC a call. They have full-time employees to help researchers use their online tools.

Limitations of Dataverse

A word of caution: Dataverse has no curatorial oversight; the researcher must take primary responsibility for the management of their own data. You get the server space and software, but it’s on you to use them properly. Researchers can upload missing or inadequate data, or they may use proprietary file formats. There is no built-in migration of file formats, so expired formats will age and become more difficult to read with time. That researchers have uploaded their data does not guarantee that they have complied with standard ethical requirements such as IRB approval and confidentiality agreements. Because researchers write their own metadata, that metadata may be inaccurate. And it may be impossible to interpret raw data if researchers fail to share their coding documentation.

Additionally, Dataverse was not created with cultural anthropologists in mind. It was originally developed by Harvard’s Institute for Quantitative Social Science to hold tabular data. It’s meant to hold the files for polls, surveys, and experiments, and it boasts robust, built-in statistical tools that won’t be useful to all of us. However, because the software is open source, development of new tools for qualitative data is already underway.

While you can archive any kind of file, some file types are going to be better supported than others. For example, I tried and failed to upload a 2.8 GB .avi file. The user interface offers no progress bar to show how far along the upload is, so after an hour I gave up. This was not a problem when I uploaded .docx and .pdf files, which passed through in a flash. If you need to archive large files such as video you will have to contact your Dataverse Network host and request assistance. They will be happy to help you by archiving files over 2 GB in size on the back end.
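To avoid a stalled upload, a quick pre-flight check of file sizes can save you an hour of waiting. A minimal sketch follows; the roughly 2 GB ceiling is my observation from this one test, not a documented constant, so treat it as an assumption and confirm the real limit with your Network host.

```shell
#!/bin/sh
# Flag files too big for the web uploader before you try them.
limit=$((2 * 1024 * 1024 * 1024))   # ~2 GiB, in bytes (assumed ceiling)

check_upload_size() {
    size=$(wc -c < "$1")
    if [ "$size" -gt "$limit" ]; then
        echo "too large, ask your Network host: $1"
    else
        echo "ok to upload: $1"
    fi
}

# demo with a small temporary file
tmp=$(mktemp)
printf 'field notes, day 1\n' > "$tmp"
check_upload_size "$tmp"
rm -f "$tmp"
```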

Although Dataverse can simply and easily be used as a private archive, changing the working culture of cultural anthropology to embrace an open data framework may be the biggest obstacle. We all have our own idiosyncratic ways of organizing the files on our hard drives. This is what real science looks like: it’s messy, and it sweeps bad empiricism under the rug. All the rhetorical flourishes in the world won’t hide the fact that organized data makes for better science, even if it’s just “science.” Perhaps in the future graduate programs will include practical training for student researchers in creating metadata and proactively caring for their data as they go. Until then it may be a hard sell to convince non-quantitative anthropologists that they need to do more than back up their files to an external hard drive every now and then.

When you put something on paper, that information is already in a fairly good state for long-term storage. Digital information is far more precarious than we tend to assume. As the days of anthropologists composing their notes on index cards fade, we should all be more concerned about the long-term organization and storage of our digital data. These files will be the historical record of twenty-first-century anthropology, but only if we care for them properly.

Thanks to Ole Villadsen, who contributed research for this post.

Matt Thompson is a Project Cataloger currently working to describe a collection of approximately 14,000 photographs produced by the Army Signal Corps during WWII. He has a doctorate in anthropology from the University of North Carolina and a master’s in information science from the University of Tennessee.

8 thoughts on “Dataverse: an open source solution for data sharing”

  1. Excellent! Though, while this seems very promising, I am loath to explain this to my IRB. (On the other hand, that hasn’t stopped me before.)

  2. This is very interesting. Thank you, Matt!
    Some very random thoughts that have been running through my head for some time now: I work for an interdisciplinary research institute in Burkina Faso, and we will be uploading our data to the institute’s nascent database. This isn’t a problem (I don’t think) for the natural/physical scientists, but I balk. I doubt that my field notes will be uploaded — yes, messy and lots of personal info and opinions. But I also have interviews that are rife with names. Creating code names for the interviewees is not a problem, but what about all of the people that they name? Those names are important information for me, but probably not for others. So I keep one version of the transcriptions on my laptop (and Dropbox, and external hard drive) and another on the database? And does this mean that I don’t upload the recordings? Or at least they won’t ever be made available?
    Then I wonder about my colleagues who don’t have the same understanding of human research subjects that I do. (Yes, IRB, but also the warning in the Introduction [or Preface?] to Wilk’s Household Ecology.) What will my colleagues be uploading, and will they understand my reluctance to make all of my data available?

  3. What’s great about this idea/infrastructure is that it addresses a principal weakness of ethnography. As my adviser told me when I was doing my dissertation fieldwork, “90% of what you collect you’ll never use.” That ratio seems to hold up in all my projects–I’ve got far more data than I ever use in a direct sense, at least in terms of the final written projects. That’s highly inefficient and it would be great if others could work with it. I just saw an epigram from Marilyn Strathern, something to the effect that ‘the strength of ethnography is that it generates more data than the ethnographer recognizes at the time.’ That probably holds true after the fact–that even when we work through an analysis there’s probably much more there than we’ve realized. So, enticing idea!

  4. These sorts of initiatives are popping up in other fields. I am, at this very moment, preparing the advertising annual credits data I assembled for my research on the Japanese advertising industry for the data exchange network associated with Connections, the journal of the International Network for Social Network Analysis. The requirements for submission are as follows:

    The new DEN feature is to meet the goal of providing citable references for datasets and instruments. Submissions must include an electronic version of the network dataset and/or instrument and a short article (not to exceed 2,500 words) describing the data being submitted. These articles need not be as detailed as a full codebook, but should provide enough information that other researchers may appropriately use the data or measures. Additionally, the article should contain any information about the context from which the data were collected that may be relevant to others for appropriately using the data. All materials submitted for the DEN will be peer-reviewed to ensure the utility and usability of the data/instrument. Data should be submitted in the most generic format possible (preferably in Excel). Accepted DEN contributions must be described fully and clearly and any threats to validity should be made transparent.

    For ethnographic data, a codebook will not be appropriate. Still, the idea of a short article describing the data, where, when and how it was collected and topics it might be relevant to strikes me as a good idea.

  5. Dick: data archiving and open data are growing in the sciences and social sciences, so while this may seem like a hurdle to cultural anthros now, in fact it will be expected of us all in the near future.

    Karen: I wonder if open data becomes mainstream in cultural anthropology whether it will change the way we write fieldnotes in the first place! A kind of Hawthorne effect could set in. Remember the data creator sets the permissions for visibility, so only the files you want to be public will be released.

    John H.: Absolutely 90% of data goes unused! In the olden days if you were A REALLY BIG DEAL then when you retired your papers could be cataloged in some special collection somewhere so that future generations of grad students could pore over your work. Now everyone can archive everything. What if we did it Mark Twain style and released all our data X-number of years after our deaths?

    John McC.: Dataverse differs from DEN in that the materials are not peer reviewed. It is simply a repository. And yes, I think we would need to be in the practice of composing at least a README.txt that explains to our audience what goes where, kind of like a narrative table of contents.

  6. Matt … thanks for the reply. I don’t know if I will change the way I write fieldnotes, but maybe younger people, more used to having all their writings in public view on social media, will. And our nascent database will/does have metadata attached to each data file where context, etc., should be entered.
    Do you know of any guidelines that address ethical concerns about storing human subject data online?

  7. I am so glad you asked an ethics question. The best way to answer this is to return to the anthropologist’s role as creator of representations of others: we have multivalent obligations to those peoples and histories. This is (ideally) reflected in how we represent them in film, ethnography, government reports, etc. Each of these communications packages information in a particular form, or if you like, genre. The data repository is just another such form. Think of it as a type of publication.

    Assuming you have informed consent from the subject, they are aware that your research is going to result in publication. I think best practices should include making explicit the genre or form of those publications, including digital archives. In the case of old research it would be appropriate to make a concerted effort to find the subject in order to make clear that this publication form would be used. Failing that, I would advise the anthropologist to use their own best judgement in making public any files about someone who granted informed consent but is not aware of the digital archive.

    What about a case where you do not have informed consent, such as when making observations or engaging in informal conversations? Pseudonyms should be used except, I think, in the case of public officials. If the setting were in public then I think that would be acceptable. For observations and informal conversations made in private, I would think twice about making them public even if pseudonyms were used. Again, I would advise the anthropologist to use their own best judgement. If the representation was such that you could maintain the confidentiality of the people as if they had given consent, then I would consider it. But if it were the case that a local would see through your pseudonyms and identify the people you’re writing about, then there could be some unintended consequences, in which case I wouldn’t advise making it public. Keep in mind that in Dataverse the author has administrative control over who can view the files.

    In sum, I don’t believe that data repositories really challenge our professional ethics in any new way because it is just an extension of what we’ve been doing all along: creating representations that result in publications.
