transkribus
Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite
Torterolo-Orta, Yanco Amor, Macicior-Mitxelena, Jaione, Miguez-Lamanuzzi, Marina, García-Serrano, Ana
This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Spain > Aragón > Zaragoza Province > Zaragoza (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Europe > Spain > Andalusia > Jaén Province > Jaén (0.04)
- Information Technology (0.68)
- Media (0.46)
Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
Enstad, Tita, Trosterud, Trond, Røsok, Marie Iversdatter, Beyer, Yngvil, Roald, Marie
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
- Europe > Norway (0.71)
- North America > United States (0.69)
Unlocking the Archives: Using Large Language Models to Transcribe Handwritten Historical Documents
Humphries, Mark, Leddy, Lianne C., Downton, Quinn, Legace, Meredith, McConnell, John, Murray, Isabella, Spence, Elizabeth
This study demonstrates that Large Language Models (LLMs) can transcribe historical handwritten documents with significantly higher accuracy than specialized Handwritten Text Recognition (HTR) software, while being faster and more cost-effective. We introduce an open-source software tool called Transcription Pearl that leverages these capabilities to automatically transcribe and correct batches of handwritten documents using commercially available multimodal LLMs from OpenAI, Anthropic, and Google. In tests on a diverse corpus of 18th/19th century English language handwritten documents, LLMs achieved Character Error Rates (CER) of 5.7 to 7% and Word Error Rates (WER) of 8.9 to 15.9%, improvements of 14% and 32% respectively over specialized state-of-the-art HTR software like Transkribus. Most significantly, when LLMs were then used to correct those transcriptions as well as texts generated by conventional HTR software, they achieved near-human levels of accuracy, that is CERs as low as 1.8% and WERs of 3.5%. The LLMs also completed these tasks 50 times faster and at approximately 1/50th the cost of proprietary HTR programs. These results demonstrate that when LLMs are incorporated into software tools like Transcription Pearl, they provide an accessible, fast, and highly accurate method for mass transcription of historical handwritten documents, significantly streamlining the digitization process.
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > New York (0.04)
- North America > United States > North Carolina (0.04)
- Research Report > New Finding (0.48)
- Research Report > Experimental Study (0.46)
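The Character Error Rate (CER) and Word Error Rate (WER) figures quoted in the abstract above are conventionally derived from the edit distance between a reference transcription and the system output. The following is a minimal sketch of that conventional definition, not the evaluation code used in the study; the function names `edit_distance`, `cer`, and `wer` are illustrative, and real evaluations may normalise case, punctuation, or whitespace differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from any of the corpora above.
    print(cer("kitten", "sitting"))
    print(wer("the quick brown fox", "the quick brown dog"))
```

Because both rates are normalised by the reference length, a hypothesis with many insertions can score above 1.0 (100%).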
REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname Curacao
We describe the project REE-HDSC and outline our efforts to improve the quality of named entities extracted automatically from texts generated by handwritten text recognition (HTR) software. We describe a six-step processing pipeline and test it by processing 19th and 20th century death certificates from the civil registry of Curaçao. We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low. Next we show how the precision of name extraction can be improved by retraining HTR models with names, by post-processing, and by identifying and removing incorrect names.
- South America > Suriname (0.40)
- Europe > Netherlands > Gelderland > Nijmegen (0.04)
- North America > Curaçao > Willemstad (0.04)
- Asia > India (0.04)
The Challenges of HTR Model Training: Feedback from the Project Donner le goût de l'archive à l'ère numérique
Couture, Beatrice, Verret, Farah, Gohier, Maxime, Deslandres, Dominique
The arrival of handwriting recognition technologies offers new possibilities for research in heritage studies. However, it is now necessary to reflect on the experiences and the practices developed by research teams. Our use of the Transkribus platform since 2018 has led us to search for the most significant ways to improve the performance of our handwritten text recognition (HTR) models which are made to transcribe French handwriting dating from the 17th century. This article therefore reports on the impacts of creating transcribing protocols, using the language model at full scale and determining the best way to use base models in order to help increase the performance of HTR models. Combining all of these elements can indeed increase the performance of a single model by more than 20% (reaching a Character Error Rate below 5%). This article also discusses some challenges regarding the collaborative nature of HTR platforms such as Transkribus and the way researchers can share their data generated in the process of creating or training handwritten text recognition models.
- North America > Canada > Quebec > Montreal (0.07)
- Europe > France (0.05)
- Europe > Austria > Vienna (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (0.75)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Artificial Intelligence in archival and historical scholarship workflow: HTS and ChatGPT
This article examines the impact of Artificial Intelligence on archival heritage digitization processes, specifically the automatic transcription of manuscripts and their correction and normalization. It highlights how digitality has compelled scholars to redefine the fields of Archive and History and has made analogue sources more accessible through digitization and integration into big data. The study focuses on two AI systems, namely Transkribus and ChatGPT, which enable efficient analysis and transcription of digitized sources. The article presents a test of ChatGPT, which was used to normalize the text of 366 letters stored in the Correspondence section of the Biscari Archive (Catania). Although the AI exhibited some limitations that resulted in inaccuracies, the corrected texts met expectations. Overall, the article concludes that digitization and AI can significantly enhance archival and historical research by allowing the analysis of vast amounts of data and the application of computational linguistic tools.
- North America > United States > New York > Monroe County > Rochester (0.04)
- Europe > Italy > Sicily (0.04)
- Europe > France (0.04)
- Research Report (0.84)
- Workflow (0.51)
- Education (0.65)
- Health & Medicine (0.47)
U of T researchers train AI to read difficult-to-decipher medieval texts
In a move that could transform manuscript studies, University of Toronto researchers have partnered with a team in the United Kingdom to develop a program that can read and transcribe the handwritten Latin found in 13th-century legal manuscripts. While scholars have been making digital images of these manuscripts for years, transcribing and comparing these texts is painstaking and tedious work that can take years or even decades to complete. That's because medieval handwriting can often look crabbed and unintelligible, with non-standardized spellings, hyphenations, abbreviations, calligraphic flourishes and any number of distinct "hands." But machine-reading software called Transkribus promises to change the field. Using artificial intelligence (AI), the software can theoretically be trained to read any type of handwriting, in any language – and Michael Gervers, a professor of medieval social and economic history at U of T Scarborough, says it could eventually be applied across medieval studies.
- North America > Canada > Ontario > Toronto (0.57)
- Europe > United Kingdom (0.26)
- North America > Mexico (0.06)
Machine learning and big data are unlocking Europe's archives
From wars to weddings, Europe's history is stored in billions of archival pages across the continent. While many archives try to make their documents public, finding information in them remains a low-tech affair. Simple page scans do not offer metadata such as dates, names, and locations that often interest researchers. Copying this information for later use is also time-consuming. These issues are well-known in Amsterdam, which is trying to disclose its entire archives.
Digital Cultural Heritage
The Gaelic Handwriting Recognition Project is converting 500k words of traditional narrative documents to digital text and training the first automatic handwriting recogniser for the Gaelic language, using the Transkribus platform (https://transkribus.eu/Transkribus/). This will provide the foundation for an ambitious future research programme, which will develop novel language technologies for Gaelic and innovative ways of researching traditional narrative through these technologies. Once finalised, the Gaelic handwriting recogniser will be made available worldwide through Transkribus.
The small wonderful ways AI is changing our lives for the better
It's easy to get cynical about artificial intelligence (AI). China is using facial recognition against the Uighurs (NYT: 'One Month, 500,000 Face Scans: How China Is Using A.I. to Profile a Minority'). Google's participating in the development of autonomous weapons (The Intercept: 'Google Continues Investments in Military and Police AI Technology Through Venture Capital Arm'). And facial recognition programmes are still struggling to recognise black faces. But last year I also saw another side.
- Asia > China (0.45)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Europe > Netherlands > Zeeland (0.04)
- Government (0.96)
- Information Technology (0.70)
- Law > Statutes (0.48)