AITopics

2507.04878

Country:

Europe > Spain (0.88)
Asia > Middle East > UAE (0.28)

Genre: Research Report (0.82)

Industry:

Information Technology (0.68)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)

Enstad, Tita, Trosterud, Trond, Røsok, Marie Iversdatter, Beyer, Yngvil, Roald, Marie

Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

arXiv.org Artificial Intelligence Jan-13-2025

Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.

artificial intelligence, machine learning, proceedings

2501.073

Country:

Europe > Norway (0.71)
North America > United States (0.69)

Genre: Research Report > New Finding (0.54)

Industry: Energy > Oil & Gas (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Humphries, Mark, Leddy, Lianne C., Downton, Quinn, Legace, Meredith, McConnell, John, Murray, Isabella, Spence, Elizabeth

Unlocking the Archives: Using Large Language Models to Transcribe Handwritten Historical Documents

arXiv.org Artificial Intelligence Nov-1-2024

This study demonstrates that Large Language Models (LLMs) can transcribe historical handwritten documents with significantly higher accuracy than specialized Handwritten Text Recognition (HTR) software, while being faster and more cost-effective. We introduce an open-source software tool called Transcription Pearl that leverages these capabilities to automatically transcribe and correct batches of handwritten documents using commercially available multimodal LLMs from OpenAI, Anthropic, and Google. In tests on a diverse corpus of 18th/19th century English-language handwritten documents, LLMs achieved Character Error Rates (CER) of 5.7 to 7% and Word Error Rates (WER) of 8.9 to 15.9%, improvements of 14% and 32% respectively over specialized state-of-the-art HTR software like Transkribus. Most significantly, when LLMs were then used to correct those transcriptions as well as texts generated by conventional HTR software, they achieved near-human levels of accuracy, that is, CERs as low as 1.8% and WERs of 3.5%. The LLMs also completed these tasks 50 times faster and at approximately 1/50th the cost of proprietary HTR programs. These results demonstrate that when LLMs are incorporated into software tools like Transcription Pearl, they provide an accessible, fast, and highly accurate method for mass transcription of historical handwritten documents, significantly streamlining the digitization process.
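For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between a reference transcription and a system output, divided by the reference length in characters or words. The function names are hypothetical and the abstract does not say how the study normalized its texts, so this illustrates the general definition rather than the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because a single wrong character makes the whole word count as an error, WER is usually higher than CER for the same transcription, which matches the pattern in the figures reported above.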

large language model, machine learning, natural language

2411.0334

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States > New York (0.04)
North America > United States > North Carolina (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

arXiv.org Artificial Intelligence Dec-19-2023

REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname Curacao

Sang, Erik Tjong Kim

We describe the project REE-HDSC and outline our efforts to improve the quality of named entities extracted automatically from texts generated by handwritten text recognition (HTR) software. We describe a six-step processing pipeline and test it by processing 19th and 20th century death certificates from the civil registry of Curacao. We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low. Next we show how the precision of name extraction can be improved by retraining HTR models with names, by post-processing, and by identifying and removing incorrect names.

certificate, detection, transkribus

2401.02972

Country:

South America > Suriname (0.40)
Europe > Netherlands > Gelderland > Nijmegen (0.04)
North America > Curaçao > Willemstad (0.04)
Asia > India (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Couture, Beatrice, Verret, Farah, Gohier, Maxime, Deslandres, Dominique

The Challenges of HTR Model Training: Feedback from the Project Donner le gout de l'archive a l'ere numerique

arXiv.org Artificial Intelligence Nov-12-2023

The arrival of handwriting recognition technologies offers new possibilities for research in heritage studies. However, it is now necessary to reflect on the experiences and the practices developed by research teams. Our use of the Transkribus platform since 2018 has led us to search for the most significant ways to improve the performance of our handwritten text recognition (HTR) models which are made to transcribe French handwriting dating from the 17th century. This article therefore reports on the impacts of creating transcribing protocols, using the language model at full scale and determining the best way to use base models in order to help increase the performance of HTR models. Combining all of these elements can indeed increase the performance of a single model by more than 20% (reaching a Character Error Rate below 5%). This article also discusses some challenges regarding the collaborative nature of HTR platforms such as Transkribus and the way researchers can share their data generated in the process of creating or training handwritten text recognition models.

base model, htr model, transcription

2212.11146

Country:

North America > Canada > Quebec > Montreal (0.07)
Europe > France (0.05)
Europe > Austria > Vienna (0.04)

Genre: Research Report (0.50)

Industry: Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

arXiv.org Artificial Intelligence Jul-5-2023

Artificial Intelligence in archival and historical scholarship workflow: HTS and ChatGPT

Spina, Salvatore

This article examines the impact of Artificial Intelligence on the digitization of archival heritage, specifically the automatic transcription of manuscripts and their correction and normalization. It highlights how digitality has compelled scholars to redefine the fields of Archive and History and has made analogue sources more accessible through digitization and integration into big data. The study focuses on two AI systems, namely Transkribus and ChatGPT, which enable efficient analysis and transcription of digitized sources. The article presents a test of ChatGPT, which was used to normalize the text of 366 letters stored in the Correspondence section of the Biscari Archive (Catania). Although the AI exhibited some limitations that resulted in inaccuracies, the corrected texts met expectations. Overall, the article concludes that digitization and AI can significantly enhance archival and historical research by allowing the analysis of vast amounts of data and the application of computational linguistic tools.

large language model, machine learning, natural language

2308.02044

Country:

North America > United States > New York > Monroe County > Rochester (0.04)
Europe > Italy > Sicily (0.04)
Europe > France (0.04)

Genre:

Research Report (0.84)
Workflow (0.51)

Industry:

Education (0.65)
Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligence Mar-1-2021, 20:17:54 GMT

U of T researchers train AI to read difficult-to-decipher medieval texts

In a move that could transform manuscript studies, University of Toronto researchers have partnered with a team in the United Kingdom to develop a program that can read and transcribe the handwritten Latin found in 13th-century legal manuscripts. While scholars have been making digital images of these manuscripts for years, transcribing and comparing these texts is painstaking and tedious work that can take years or even decades to complete. That's because medieval handwriting can often look crabbed and unintelligible, with non-standardized spellings, hyphenations, abbreviations, calligraphic flourishes and any number of distinct "hands." But machine-reading software called Transkribus promises to change the field. Using artificial intelligence (AI), the software can theoretically be trained to read any type of handwriting, in any language – and Michael Gervers, a professor of medieval social and economic history at U of T Scarborough, says it could eventually be applied across medieval studies.

manuscript, read difficult-to-decipher medieval text, transkribus

Country:

North America > Canada > Ontario > Toronto (0.57)
Europe > United Kingdom (0.26)
North America > Mexico (0.06)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.32)

#artificialintelligence Dec-11-2020, 22:08:23 GMT

Machine learning and big data are unlocking Europe's archives

From wars to weddings, Europe's history is stored in billions of archival pages across the continent. While many archives try to make their documents public, finding information in them remains a low-tech affair. Simple page scans do not offer the metadata, such as dates, names and locations, that often interests researchers. Copying this information for later use is also time-consuming. These issues are well known in Amsterdam, which is trying to open up its entire archives.

algorithm, archive, transkribus

Country:

Europe > Netherlands > North Holland > Amsterdam (0.26)
Europe > Finland (0.06)
Europe > Italy (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.41)

#artificialintelligence Jul-6-2020, 19:01:34 GMT

Digital Cultural Heritage

The Gaelic Handwriting Recognition Project is converting 500k words of traditional narrative documents to digital text and training the first automatic handwriting recogniser for the Gaelic language, using the Transkribus platform (https://transkribus.eu/Transkribus/). This will provide the foundation for an ambitious future research programme, which will develop novel language technologies for Gaelic and innovative ways of researching traditional narrative through these technologies. Once finalised, the Gaelic handwriting recogniser will be made available worldwide through Transkribus.

artificial intelligence, digital cultural heritage, handwriting recogniser

Industry: Health & Medicine > Public Health (0.40)

Technology: Information Technology > Artificial Intelligence (0.79)

#artificialintelligence Jan-1-2020, 21:30:55 GMT

The small wonderful ways AI is changing our lives for the better

It's easy to get cynical about artificial intelligence (AI). China is using facial recognition against the Uighurs (NYT: 'One Month, 500,000 Face Scans: How China Is Using A.I. to Profile a Minority'). Google's participating in the development of autonomous weapons (The Intercept: 'Google Continues Investments in Military and Police AI Technology Through Venture Capital Arm'). And facial recognition programmes are still struggling to recognise black faces. But last year I also saw another side.

artificial intelligence, neuralink, website

Country:

Asia > China (0.45)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Netherlands > Zeeland (0.04)

Industry:

Government (0.96)
Information Technology (0.70)
Law > Statutes (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Face Recognition (0.46)