Vulnerable Knowledge: DataRefuge and the Protection of Public Research

Judith Butler has written that “resistance is the mobilization of vulnerability,” arguing that precariousness animates action. This suggests that rather than a state of docile subjugation, vulnerability is a source of empowerment. A particularly revealing example of this relationship between power and vulnerability is evidenced in the current status of federal climate science data. This data is increasingly vulnerable, as it is now maintained by an administration that has openly disavowed its credibility. At the same time, its vulnerability is directly tied to the potential power it wields in upsetting the authority and legitimacy of this administration. The power and vulnerability of climate data are positively correlated.

On its first day in office, the incoming administration ordered all mention of climate change removed from the official White House website. This, and the new president’s vow to eliminate Obama-era environmental policies, suggest a broad mistrust of science (climate science particularly) among the executive branch and its supporters. Suspecting that this could endanger decades of accumulated scientific data and research, UPenn’s Environmental Humanities program and Penn Libraries have initiated the DataRefuge project (#DataRefuge, @DataRefuge), facilitating a series of DataRescue events around the country designed to ensure that federal climate and environmental data remain publicly available under the current administration – a clear illustration of resistance stemming from the mobilization of vulnerability.

The following is an email conversation with one of the initiative’s organizers, Patricia Kim (@lowerendtheory) – Ph.D. candidate in Art History and Program Coordinator and Graduate Fellow at the Penn Program in Environmental Humanities (PPEH).

Can you briefly explain the DataRefuge project?

DataRefuge is a public, collaborative project launched in late November 2016 that creates trustworthy, research-quality copies of federal climate and environmental data that are disappearing from their agency websites or becoming more difficult to access. As an advocacy project, it tells stories about why climate and environmental data are vital to individuals, institutions, and human and nonhuman communities alike. In addition, we are coordinating with our many collaborators to organize DataRescue events, where advocates and volunteers help retrieve copies of the most valuable and vulnerable data nominated by researchers.

The need for such an endeavor seems fairly evident. Were there any similar groups or projects that inspired the PPEH initiative, or with whom you are collaborating?

The End of Term Harvest Project (EOT), part of the Internet Archive, is a collaborative initiative that began in 2008, at the end of George W. Bush’s presidency, to archive federal websites. With each presidential transition, many agency websites change, regardless of party, depending on each administration’s priorities. Some administrations add information to websites like the Environmental Protection Agency’s, while others take down certain datasets. EOT has a web crawler that visits specific URLs and seeds them to the Internet Archive, ensuring public access to the data. In addition to EOT, the Environmental Data Governance Initiative (EDGI) is another collaborator: an international network of researchers who develop technical tools both to track changes to federal agency websites and to download data that are “hard to crawl” and seed them to the Internet Archive.
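At its simplest, “seeding” a URL means asking the Wayback Machine to capture a live page. A minimal sketch of that step, assuming the Internet Archive’s public Save Page Now endpoint (`https://web.archive.org/save/<url>`); the actual EOT and DataRefuge tooling is considerably more elaborate, and the user-agent string here is purely illustrative:

```python
# Minimal sketch of seeding one URL to the Internet Archive's
# Save Page Now service. Assumes the public endpoint
# https://web.archive.org/save/<url>; not the actual DataRefuge workflow.
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request_url(target_url: str) -> str:
    """Build the Save Page Now request URL for a target page."""
    return SAVE_ENDPOINT + target_url

def seed_to_archive(target_url: str, timeout: float = 30.0) -> int:
    """Ask the Wayback Machine to capture target_url; return the HTTP status."""
    req = urllib.request.Request(
        save_request_url(target_url),
        headers={"User-Agent": "datarescue-sketch/0.1"},  # illustrative name
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status

if __name__ == "__main__":
    # Print the request URL without making a network call.
    print(save_request_url("https://www.epa.gov/climatechange"))
```

A volunteer-facing tool would loop this over a nominated list of URLs and log which captures succeeded, leaving the “hard to crawl” remainder for custom scripts.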

Is data politically neutral, and only when it is employed as evidence for some argument does it become politically charged? Or is data born politicized, given the circumstances of its observation and collection?

I love the French word for data because I think it gets to the heart of your question: données, from the verb donner or “to give,” meaning “that which are given.” A datum is given by an actor or a stakeholder who has established a political, social, or economic need for particular kinds of knowledge. Data are thus concrete in the sense that they are anchored in constructed measurements of qualities and organized observations and trends.

This understanding of data raises questions, such as, what is it that we are giving and taking? For what purpose are we giving and taking certain kinds of observations? In this sense, data is never neutral—it is always politically, historically, and socially situated.

Just because data are born contingent does not mean that they are inherently false or untrustworthy. The interpretation, (de/re)contextualization, and, even worse, the disavowal of the validity of data are other modes of politicization that are particularly dangerous, since this knowledge informs policy and is, in many ways, a matter of life and death for vulnerable communities.

The information you’re protecting was created with public money, so ostensibly the data belongs to the public, not the executive branch. Given this, would it be illegal or unconstitutional for the federal government to destroy it or make it inaccessible? Is there congressional or judicial oversight to prevent such an action?

Yes, we the people own federal data, and any actual, material destruction of that data is illegal. While the administration is not “deleting” anything, it is clearly suppressing public information. Incoming administrations create new budgets to reprioritize funding needs—including which websites to maintain and what kinds of research to fund. The Internet needs upkeep and care, which requires labor; the digital has a material dimension that is as likely to degrade and rot as a houseplant. If you defund certain aspects or take down sites, then it is harder to access this information. You can submit a Freedom of Information Act (FOIA) request to the specific agency. Yet that is often a long and thorny process, delaying any research that would have taken place in the meantime.

I think the loss of access to vulnerable data goes hand in hand with active impediments to the further collection of valuable data. Defunding agencies and organizations that sponsor research, like the National Endowment for the Humanities, the National Endowment for the Arts, and the National Institutes of Health, puts research and future data collection at risk.

Is digital information in any sense more fragile than analog? Is there a digital analogy to the spectacle of burning books?

There is a materiality, fragility, and ecology to digital information. Without care and maintenance, digital information is liable to degrade and disappear. As my colleague Steve Dolph has analyzed, the language of digital archiving and preservation mirrors that of ecology and agriculture. The technical term for a URL that no longer resolves, for instance, is “link rot.” The digital rots, expires, and degrades.

Certainly there are comparisons to be drawn between the spectacle of burning books and limiting access to public data, in that both are examples of material loss. I think the main difference is that taking down websites and neglecting URLs is not spectacular but silent, almost imperceptible to the broader public outside the specialists and researchers whose work relies on access to this data. In some ways that makes this kind of digital loss more insidious. The advocacy work of DataRefuge has drawn attention to the ways in which limiting access and defunding knowledge production are as pernicious as burning books.

Are there particular types of data the DataRefuge project is focused on protecting? Climate data broadly? NOAA and NASA datasets? 

We are first focusing on datasets that experts, specialists, and researchers have nominated and identified as vulnerable and valuable through a survey we have circulated across various networks. At DataRescue Philly, we focused primarily on NOAA datasets and were able to seed 3,692 URLs to the Internet Archive and download ~1.5 terabytes of data from “uncrawlable” websites. At that event, all datasets from the National Centers for Environmental Information (NCEI), the National Environmental Satellite, Data, and Information Service (NESDIS), and the National Marine Fisheries Service (NMFS), as well as a significant portion of the Office of Oceanic and Atmospheric Research (OAR), were successfully seeded.

Other DataRescue events have worked on preserving data from the Departments of Energy and the Interior, NASA, NOAA, the EPA, USDA, OSHA, etc. What constitutes climate and environmental data is considerably broad because, of course, everything (transportation, public health, infrastructure, etc.) is impacted by global, anthropogenic climate change.

Any ballpark conception of how much federal climate data is out there? How much data have you been able to copy so far at DataRescue events?

The Internet is vast, and as for how much climate and environmental data is out there—A LOT! As for how much has been saved—this number is constantly growing because of the distributed efforts, but we ball-parked at the beginning of February that approximately 55,000 URLs had been seeded and about 2 terabytes harvested—though these numbers could be wrong and do not account for duplicated pages.

How does a DataRescue event work? Where is the rescued data being stored? Will the data you copy be publicly accessible? Are there plans to build query-able databases of rescued data?

DataRescue events require a lot of time, kindness, and collaboration among different kinds of experts. DataRefuge has created a workflow that is efficient and can be easily adapted and scaled to different locations and groups.

Each DataRescue event focuses on a particular agency or branches within several agencies. DataRefuge has a (growing) list of datasets that have been nominated by researchers that guides the goals of each event.

Individuals with a broad range of skill sets participate in these events—from people with basic tech skills who can use the web crawler to “seed” URLs to the Internet Archive, to coders and hackers who can write special scripts to retrieve data that are harder to access with the crawler alone.

Then there are individuals like me who have no tech skills and are interested in documentation and storytelling. Documentation entails publicizing the event through social media and/or writing mini-ethnographies during each DataRescue event. Storytelling is a DataRefuge initiative that seeks to develop different stories or use-cases for climate and environmental data, mapping and writing why this information is vital to various communities beyond the climate science circles.

Librarians are the true heroes of this story. Hosting a DataRescue event requires the participation of librarians and archivists, without whom the project and the retrieved data would be meaningless to researchers in the future. After the information is retrieved, data librarians and archivists do the tireless work of examining it and filling out its metadata before depositing it into servers and repositories throughout North America.
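Part of making a harvested copy “trustworthy” and “research-quality” is recording fixity information so future researchers can verify a file has not changed. A hypothetical sketch of that one step, with illustrative file names; the metadata librarians actually record is far richer than this:

```python
# Hypothetical sketch: a fixity record (file size + SHA-256 checksum) for
# one harvested file, the kind of metadata attached before deposit.
import hashlib
import json
import os

def fixity_record(path: str) -> dict:
    """Return a small metadata record for one harvested file."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte harvests don't exhaust memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return {
        "file": os.path.basename(path),
        "bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),
    }

if __name__ == "__main__":
    import tempfile
    # Illustrative sample file standing in for a harvested dataset.
    with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
        tmp.write(b"station,temp_c\nKPHL,21.4\n")
    print(json.dumps(fixity_record(tmp.name), indent=2))
```

Recomputing the checksum at any later repository and comparing it against the stored record is what lets a researcher trust that the copy matches what was originally retrieved.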

In addition to being held in the Internet Archive, harvested data are stored across multiple repositories and can be publicly accessed through the DataRefuge CKAN instance. Our partners at the Libraries Network, a coalition of research libraries across North America, are working to preserve born-digital government data as well.

What constitutes evidence today? Beliefs and belief systems are usually constructed out of some kind of evidence (from ancient texts to first-hand observations). Do you perceive any recent shifts in what popularly qualifies as reliable evidence? Are beliefs more resilient/resonant than evidence, or are beliefs only as strong as the evidence upon which they are based?

Data are that which are given by and for a group of stakeholders, and often form the foundation for evidence. In some cases, the data are the evidence. Evidence and data collection need a group of experts or a power-wielding agency to give the information and claims legitimacy. Belief systems and facts depend on a set of givens and indeed need evidence to work—but the ethics of their constitution and interpretation may vary.

Our president believes in national polls when they reflect well on him, but will disavow the same polls when they are negative. In other words, he will legitimize the evidence as true when it serves his needs. In this case, his belief in his own greatness is more resilient and resonant than the evidence and data. Unfortunately for him, the fact of manmade climate change, the evidence for global warming, and the data that demonstrate a wetter, hotter planet are more stubborn than his self-interests.

As you mention, when confronted with evidence that contradicts him, the new president’s reaction is to presume the evidence must be wrong. There’s no amount of evidence that can shake his certitude. Do you think this suggests a disrespect for the process of scientific knowledge that deals in levels of confidence and probabilities? Or is dismissiveness toward climate change data driven more by the fear that it undermines the legitimacy of the new administration’s policies – economic and environmental?

I don’t think it is scientific or probabilistic knowledge that the new administration disrespects. Wall Street, for instance, deals with a lot of knowledge and data that incorporate probabilities. This is, in some sense, the bread and butter of big banks and large corporations (Exxon, Carl’s Jr., Goldman Sachs).

I think they just don’t care about scientific data and knowledge that pose threats to their own economic interests. Or, even worse, they pick and choose which probabilities to accept or cast doubt on in order to serve their needs. Scott Pruitt, the new head of the EPA, exemplifies this. On manmade climate change, he has written, “scientists continue to disagree about the degree and extent of global warming and its connection to the actions of mankind…”

What is given is that the fossil fuel industry and hydraulic fracturing pose threats to public health and safety. The data that demonstrate this are seen as harmful because they threaten the ways in which these power-wielding agencies make money. Thus, those who profit from environmental degradation see the data itself as harmful, as opposed to the effects it reveals.

Science, as we know it today, didn’t develop to predict the future. It developed to connect effects to causes more accurately, giving humans a greater ability to know what outcomes to expect from certain actions. Do you think that resistance to the idea of anthropogenic climate change might be in some part because it is often depicted as climatologists being prophetic, divinatory, or reading the future? And that if the concept of climate change were treated less like a prediction and more like the effect of a cause, the issue might have more public traction?

I think we need to distinguish between people who are willing to connect climate change as the cause to any of its disastrous effects, and those who connect human activities to global warming. I think some U.S. power-holders resist the idea of anthropogenic climate change because it would force them to completely re-imagine what it means to be a society. It would force us to overhaul our energy systems, to change the way that we live as an ecosystem, and to re-imagine our material worlds and material relations.

I am not sure if climatologists are necessarily painted as prophets—particularly since the effects of climate change are already so pervasive in present day-to-day life. However, you do raise a good point about the temporality of climate change and global warming.

On the one hand, climate change and geological processes occur on a timescale that is longer and slower than humans seem able to bear. But we can see, feel, and emotionally respond to disasters like floods and hurricanes—many of which are consequences of climate change, and which, by the way, impact the poorest of us. Because we can perceive these discrete disasters and crises instantaneously and immediately, we are better able to respond—though of course there are examples where the response is inadequate. The challenge is how to register these multiple existing temporalities together, or to explain how these two incommensurate temporalities are part of the same phenomenon of change.

Most importantly, if we communicate and tell more stories about the effects of global warming in various communities, climate change would gain more public traction as a public health, economic, and social concern. It is not enough to make the data available—we must make it legible to the multiple publics it threatens.

Further Reading:

Butler, Judith, Zeynep Gambetti, and Leticia Sabsay. 2016. Vulnerability in resistance. Durham, NC: Duke University Press.

Castree, Noel. 2014. The anthropocene and the environmental humanities: Extending the conversation. Environmental Humanities 5: 233–60.

Malm, Andreas. 2016. Fossil capital: The rise of steam-power and the roots of global warming. Brooklyn: Verso.

Scott Schwartz

Scott is a Ph.D. candidate in Archaeology at the City University of New York Graduate Center. His work centers on the material culture of knowledge production, specifically the instruments and devices employed by capitalized populations to facilitate the belief in and practice of perpetual, accelerating, asymmetrical growth.