Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic

Aug-25-2023–arXiv.org Artificial Intelligence

The majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of Common Crawl [1]. However, the information stored in these files is very difficult to extract into any other format. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notation such as fractions, exponents, and matrices, the relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information in millions of academic papers cannot be fully accessed because the papers are locked in an unreadable format.
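The point about superscripts and subscripts can be made concrete with a small toy sketch. It is not how Tesseract or Nougat works internally; the `Glyph` type, the `flatten_line` and `position_aware` functions, and the `threshold` value are all hypothetical names chosen for illustration. It only shows that reading glyphs strictly left to right discards the vertical offsets that give a mathematical expression its meaning.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float         # horizontal position on the line
    y_offset: float  # offset from the text baseline (positive = raised)


def flatten_line(glyphs):
    """Line-by-line reading: order glyphs left to right, ignore vertical offsets."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))


def position_aware(glyphs, threshold=0.3):
    """Toy layout-aware reading: raised glyphs become superscripts,
    lowered glyphs become subscripts (threshold is an assumed value)."""
    parts = []
    for g in sorted(glyphs, key=lambda g: g.x):
        if g.y_offset > threshold:
            parts.append("^{" + g.char + "}")
        elif g.y_offset < -threshold:
            parts.append("_{" + g.char + "}")
        else:
            parts.append(g.char)
    return "".join(parts)


# An "x" with a lowered "i" and a raised "2", as it might sit on a page.
glyphs = [Glyph("x", 0.0, 0.0), Glyph("i", 1.0, -0.5), Glyph("2", 1.1, 0.5)]
print(flatten_line(glyphs))    # xi2
print(position_aware(glyphs))  # x_{i}^{2}
```

The flattened output `xi2` cannot be distinguished from three ordinary characters, while the position-aware output keeps the index and the exponent apart.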

machine learning, natural language, pattern recognition

Country:
- South America > Brazil
  - Paraná > Curitiba (0.04)
- Oceania > Australia
  - Victoria > Melbourne (0.04)
  - New South Wales > Sydney (0.04)
- North America > United States
  - New York (0.04)
  - Pennsylvania > Philadelphia County
    - Philadelphia (0.04)
  - New Jersey > Essex County
    - Newark (0.04)
  - Michigan > Washtenaw County
    - Ann Arbor (0.04)
  - California > San Francisco County
    - San Francisco (0.14)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Text Processing (0.66)
  - Machine Learning
    - Neural Networks > Deep Learning (1.00)
    - Pattern Recognition (0.66)
