The Vatican Secret Archives comprise 600 collections of texts spanning 12 centuries, most of which are nearly impossible to access. The Atlantic reports that a team of scientists is hoping to change that with help from some high school students and artificial intelligence software. In Codice Ratio is a new research project dedicated to analyzing the vast majority of Vatican manuscripts that have never been digitized. When other libraries wish to make a digital archive of their inventory, they often use optical-character-recognition (OCR) software. Such programs can be trained to recognize the letters in a certain alphabet, pick them out of hard-copy manuscripts, and convert them to searchable text.
Not even the highest-ranking Roman Catholic Church archivists know what's hiding in the archives' endless volumes, which are carefully stored near the Sistine Chapel. Only a tiny fraction of these archives has been digitized; the rest is a vast ocean of inaccessible papers and parchments. Going through their tomes in search of something would be a task that not even the goddess Minerva herself would be able to accomplish. Some libraries have used technology to digitize their collections, such as optical character recognition (OCR) software that's trained to recognize fixed, separate individual letter shapes. However, OCR is useless when it comes to the wide variety of free-flowing cursive styles featured in many of the Vatican's tomes, which go all the way back to the eighth century.
In 1633, Galileo Galilei was charged with heresy for claiming that Earth orbits the sun. A transcript of his trial is safely tucked away in the Vatican Secret Archives, along with thousands of other documents dating back to the eighth century. But it's hard for scholars to search through them without reading every word. To remedy that, researchers figured out how to digitize handwritten Latin text into a computer-readable format, The Atlantic reports. Classics scholars and high school students helped train a machine learning program, and then the program took it from there, transcribing several pages from the archives, the researchers report in a preprint posted to arXiv.
Somewhere within the Vatican exists the Vatican Secret Archives, whose 53 miles of shelving contain more than 600 collections of account books, official acts, papal correspondence, and other historical documents. Though its holdings date back to the eighth century, it has in the past few weeks come to worldwide attention. This has brought about all manner of jokes about the plot of Dan Brown's next novel, but also important news about the technology of manuscript digitization. It seems a project to get the contents of the Vatican Secret Archives digitized and online has made great progress cracking a problem that once seemed impossibly difficult: turning handwriting into computer-searchable text. In Codice Ratio is "developing a full-fledged system to automatically transcribe the contents of the manuscripts" that uses not the standard method of optical character recognition (OCR), which looks for the spaces between letters, but a new way that can handle connected cursive and calligraphic letters.
But a new project could change all that. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time. If successful, the technology could also open up untold numbers of other documents at historical archives around the world. OCR has been used to scan books and other printed documents for years, but it's not well suited for the material in the Secret Archives. Traditional OCR breaks words down into a series of letter-images by looking for the spaces between letters.
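The gap-based splitting described above can be sketched in a few lines of Python. This is a toy illustration only, not the method of any particular OCR package or of In Codice Ratio; the function name `segment_by_gaps` and the tiny text "images" (where `#` marks ink) are invented for the example.

```python
from typing import List, Tuple


def segment_by_gaps(image: List[str], ink: str = "#") -> List[Tuple[int, int]]:
    """Split an image into letter candidates at columns that contain no ink.

    Returns (start, end) column ranges, with end exclusive.
    """
    if not image:
        return []
    width = max(len(row) for row in image)
    # A column counts as inked if any row has an ink mark at that position.
    has_ink = [
        any(col < len(row) and row[col] == ink for row in image)
        for col in range(width)
    ]
    segments = []
    start = None
    for col, inked in enumerate(has_ink):
        if inked and start is None:
            start = col                      # a new letter candidate begins
        elif not inked and start is not None:
            segments.append((start, col))    # a blank column ends the candidate
            start = None
    if start is not None:
        segments.append((start, width))      # ink ran to the right edge
    return segments


# Two separated shapes, as in print: a blank column sits between them.
printed = [
    "#.#.##",
    "#.#.#.",
    "###.##",
]

# The same shapes joined by a connecting stroke, as in cursive handwriting.
cursive = [
    "#.#.##",
    "#.#.#.",
    "######",
]

print(segment_by_gaps(printed))  # [(0, 3), (4, 6)]
print(segment_by_gaps(cursive))  # [(0, 6)]
```

In the printed sample the blank column yields two letter candidates, while the connecting stroke in the cursive sample leaves no gap, so the whole word comes back as a single block. That is the failure mode that makes this approach a poor fit for handwritten manuscripts.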