AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification

Abdallah, Abdelrahman, Abdalla, Mahmoud, Elkasaby, Mohamed, Elbendary, Yasser, Jatowt, Adam

Sep-18-2023–arXiv.org Artificial Intelligence

Key information extraction involves recognizing and extracting text from scanned receipts, retrieving the essential content, and organizing it into structured documents. This paper presents a novel multilingual dataset for receipt extraction, addressing key challenges in information extraction and item classification. The dataset comprises 47,720 samples, including annotations for item names, attributes such as price and brand, and classification into 44 product categories. We introduce the InstructLLaMA approach, achieving an F1 score of 0.76 and an accuracy of 0.68 for key information extraction and item classification. We provide code, datasets, and checkpoints at https://github.com/Update-For-Integrated-Business-AI/AMuRD.

Country:
- North America
  - Dominican Republic (0.04)
  - United States > New York
    - New York County > New York City (0.05)
- Europe > Austria
  - Tyrol > Innsbruck (0.04)
- Asia > Vietnam
  - Khánh Hòa Province > Nha Trang (0.04)
- Africa > Middle East
  - Egypt > Cairo Governorate > Cairo (0.04)

Genre:
- Research Report (1.00)
- Overview (0.68)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Text Mining (1.00)
  - Artificial Intelligence
    - Natural Language > Information Extraction (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.94)
