It’s the standard file format for nearly every academic paper, political briefing and research note. But a new report by the World Bank suggests that the venerable pdf is keeping valuable information buried in servers, unread and unloved.
The working paper — released, naturally, as a pdf — examines which reports released by the organisation are widely read, or even read at all. Of the 1,611 reports the study looked at, only 25 were downloaded more than 1,000 times in the five-year period between 2008 and 2012. At the other end of the scale, over 31 per cent of the reports the group looked at — 517 separate research papers — were not downloaded a single time.
“It is, however, important to keep in mind that many policy reports were not intended to reach a large audience,” note the report’s authors, Doerte Doemeland and James Trevino, “but prepared to assess very specific technical questions or inform the design of lending operations.” As for which reports were actually read, the pair state that “more expensive, complex, multi-sector, core diagnostics reports on middle-income countries with larger populations tend to be downloaded more frequently.”
The portable document format, or “pdf”, was invented by Adobe in 1993 as a way of rendering documents with rich text formatting and inline images in a consistent way across multiple computing platforms and various software packages. A document saved as a pdf should always look the same, no matter where it is being viewed, a fact which has made it popular for the digital release of complex reports.
Blocks data analysis
But owing to the way such documents are rendered, pdfs often give up machine readability in favour of human readability. The basic format doesn’t include any requirement that text be selectable or searchable, while data presented as charts and tables is often impossible to export in any useable way.
That in turn makes it impossible to mine the documents for the data they contain, or to build new databases that pull together disparate sources. Despite efforts to create “pdf to html” converters, their output still needs human oversight to check for errors of interpretation.
Nathanial Manning, a fellow for the White House’s open data project, argued in The Guardian that it’s understandable that the format is used. “There are often numerous different documents used to make a single project report, including Excel models, GIS shapefiles, and Photoshop charts.
“The ease of taking screenshots and putting it all into a pdf report, and sending it along via e-mail is completely understandable. But this is like funding James Cameron to make Avatar, and then releasing it in a black and white flipbook. We are missing all the good stuff. This has to change.” — © Guardian Newspapers Limited, 2014