How to make downloading a PDF as easy as ripping an iTunes Song

We live in a wondrous era for electronic information. If you have a few thousand songs sitting on your hard drive there are all kinds of programs which will help you organize and catalog your information automatically. Software can identify the song and automatically add information about the song title, the artist, album, etc. There is even software that can automatically download the cover art for each album. When I take pictures with my digital camera it automatically saves extra information about the date I took the picture, what camera I used, and even the aperture and other settings. But there is one kind of information that remains in the dark ages: academic texts.

While songs and photos are rich in computer-readable metadata, most PDF files contain very little. You are lucky if you can even click inside the PDF file to copy and paste the article title. So, while there are many programs that will let you keep track of your PDF files in the same way that iTunes or iPhoto keeps track of music and photos (my favorite is Bookends), one still has to open up the PDF, read the information, and then manually type it into the database.

That you can open a PDF and read the data is a big difference between PDF files and other kinds of media. Not all songs have their title as the chorus. But precisely because of this, much less effort has gone into making it easy to automate the entering of such data into databases. Some academic websites will let you download citation data – but if the file is already sitting on your hard drive you can’t always figure out what database it came from. And this is another part of the problem with PDF metadata: the fact that there are so many different academic search engines, none of which is exhaustive.

So what is the solution?

If you download a recent issue of the Annual Review of Anthropology from the Annual Review web site you will see a DOI link at the top of the document. It will look like this:

10.1146/annurev.anthro.33.070203.143706

According to Wikipedia, a DOI is:

a standard for persistently identifying a piece of intellectual property on a digital network and associating it with related current data, the metadata, in a structured extensible way.

In other words, it is like a cross between a web address (URL) and an ISBN number. Like an ISBN number, it identifies a particular work. Like a URL, it can link a user directly to that work via various software tools. While your web browser doesn’t (yet) understand a DOI address, you can resolve the above DOI by going to this link, or going to this page and entering the text “10.1146/annurev.anthro.33.070203.143706” in the box. Some bibliographic software, such as EndNote, can even resolve DOIs for you.
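To make the "cross between a URL and an ISBN" idea concrete, here is a minimal sketch in Python of how a DOI can be turned into an ordinary web address. It assumes the public resolver at https://doi.org/ (the prefix and the function name `doi_to_url` are my own choices for illustration, not something any particular bibliographic program is known to use).

```python
from urllib.parse import quote

# Assumption: the public DOI resolver lives at this prefix.
RESOLVER_PREFIX = "https://doi.org/"


def doi_to_url(doi):
    """Return a web address that should redirect to the work named by `doi`."""
    # Trim stray whitespace and percent-encode anything unsafe in a URL,
    # leaving the "/" between the DOI prefix and suffix intact.
    return RESOLVER_PREFIX + quote(doi.strip(), safe="/")


if __name__ == "__main__":
    print(doi_to_url("10.1146/annurev.anthro.33.070203.143706"))
```

Pasting the resulting address into a browser is equivalent to typing the DOI into the resolver's box by hand.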

At present, when you get to the destination web page you still need to take the extra step of downloading the bibliographic metadata yourself and importing it into your bibliographic software. The software still won’t scan PDFs for you and automatically import the PDF and any associated metadata the way iTunes will when you rip a CD. But hopefully such a day is not far off.
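As a rough idea of what such scanning might involve, the sketch below looks for a DOI in text that has already been extracted from a PDF, then prepares (without sending) a request that asks the resolver for a BibTeX record. Several things here are assumptions on my part: the regular expression is a commonly used approximation rather than an official DOI grammar, the names `find_doi` and `build_metadata_request` are hypothetical, and whether a given DOI actually returns metadata for the `application/x-bibtex` request depends on the registration agency behind it.

```python
import re
import urllib.request

# Approximate pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is a heuristic and will miss some unusual DOIs.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in `text`, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;")


def build_metadata_request(doi):
    """Build (but do not send) a request asking the resolver for BibTeX."""
    return urllib.request.Request(
        "https://doi.org/" + doi,  # assumed public resolver address
        headers={"Accept": "application/x-bibtex"},
    )


if __name__ == "__main__":
    page_text = "doi: 10.1146/annurev.anthro.33.070203.143706\nAnnu. Rev. Anthropol."
    doi = find_doi(page_text)
    request = build_metadata_request(doi)
    print(doi, request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup. Getting the text out of the PDF in the first place is a separate problem that this sketch does not address.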

I’m not crazy about the DOI solution, however. It seems all wrapped up in the notion that metadata is itself a kind of intellectual property, not public data. If your school doesn’t subscribe to a particular database, it isn’t clear that you will even have access to the page at the other end of the link. This issue has already come up with iTunes, which is linked to the proprietary Gracenote database, prompting others to create alternatives, such as freedb. I’m not completely sure that the same issues apply with DOIs, which don’t really contain the metadata so much as direct you to the site where you (might) be able to access the metadata, but it seems to me that expectations about academic metadata need to be raised to at least the same level as what we’ve come to expect from our music.

One thought on “How to make downloading a PDF as easy as ripping an iTunes Song”

  1. Greg Restall has described how he makes his papers available through iTunes. Also note that the PDF format does provide support for simple metadata (title, author, a number of date fields, etc.) but most people do not make use of it. If the supported fields are not enough, XMP can be used to encode pretty much every piece of metadata imaginable for a range of formats, including PDF.

    I think that part of the problem with PDF metadata is that so much of academia uses software like Word, which embeds nonsense metadata (e.g., the filename and account name as ‘title’ and ‘author’), or generates its PDFs with methods that strip metadata (via PS or DVI, for example).
