Tackling the US Government's PDF Mountain With Computer Vision


Adobe's PDF format has entrenched itself so deeply in US government document pipelines that the number of state-issued documents currently in existence is conservatively estimated to be in the hundreds of millions. Often opaque and lacking metadata, these PDFs, many of them created by automated systems, collectively tell no stories or sagas; if you don't know exactly what you're looking for, you'll probably never find a pertinent document. And if you did know, you probably didn't need the search.

However, a new project is using computer vision and other machine learning approaches to turn this almost unapproachable mountain of data into a valuable and explorable resource for researchers, historians, journalists and scholars.

When the US government discovered Adobe's Portable Document Format (PDF) in the 1990s, it decided that it liked it.