Nougat: Neural Optical Understanding for Academic Documents
Blecher, Lukas, Cucurull, Guillem, Scialom, Thomas, Stojnic, Robert
–arXiv.org Artificial Intelligence
The majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format.
arXiv.org Artificial Intelligence
Aug-25-2023
- Country:
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks > Deep Learning (1.00)
- Pattern Recognition (0.66)
- Natural Language > Text Processing (0.66)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Machine Learning
- Information Technology > Artificial Intelligence