lexicography
Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries
Jürviste, Madis, Jakobson, Joonatan
This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th- and 18th-century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When given sufficient context, Claude 3.7 Sonnet accurately supplied meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM used for text recognition and a second for merging the structured output. These findings demonstrate that, even for minor languages, LLMs have significant potential for saving time and financial resources.
- Europe > Portugal > Lisbon > Lisbon (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Estonia > Tartu County > Tartu (0.08)
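The overlapping tiling mentioned in the abstract above (for the Hupel 1780 scans) can be illustrated with a minimal sketch. The abstract does not give tile sizes, overlap widths, or any implementation details, so the function names, the pixel values, and the choice to make the last tile flush with the page edge are all assumptions made for illustration, not the authors' method. The sketch only computes crop boxes; recognition and merging by the two LLMs are out of scope here.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; a hypothetical convention for this sketch
Box = Tuple[int, int, int, int]


def tile_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis; neighbouring tiles share at least `overlap` pixels."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single tile covers the whole axis
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def overlapping_tiles(width: int, height: int,
                      tile_w: int, tile_h: int, overlap: int) -> List[Box]:
    """Crop boxes covering a width x height scan, row by row."""
    boxes: List[Box] = []
    for top in tile_starts(height, tile_h, overlap):
        for left in tile_starts(width, tile_w, overlap):
            boxes.append((left, top,
                          min(left + tile_w, width),
                          min(top + tile_h, height)))
    return boxes


if __name__ == "__main__":
    # Assumed example: a 1000 x 400 px strip cut into 400 px tiles with 100 px overlap
    for box in overlapping_tiles(1000, 400, 400, 400, 100):
        print(box)
```

Because each entry near a tile border appears in two neighbouring tiles, a later merging step has to deduplicate the overlapping output; how that is done in the article is not described in the abstract.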
Der Effizienz- und Intelligenzbegriff in der Lexikographie und künstlichen Intelligenz: kann ChatGPT die lexikographische Textsorte nachbilden?
Arias-Arias, Ivan, Vazquez, Maria Jose Dominguez, Riveiro, Carlos Valcarcel
By means of pilot experiments for the language pair German and Galician, this paper examines the concepts of efficiency and intelligence in lexicography and artificial intelligence (AI). The aim of the experiments is to gain empirically and statistically based insights into the lexicographical text type "dictionary article" in the responses of ChatGPT 3.5, as well as into the lexicographical data on which this chatbot was trained. Both quantitative and qualitative methods are used for this purpose. The analysis is based on the evaluation of the outputs of several sessions with the same prompt in ChatGPT 3.5. On the one hand, the algorithmic performance of intelligent systems is evaluated in comparison with data from lexicographical works. On the other hand, the data supplied by ChatGPT is analysed using specific text passages of the aforementioned lexicographical text type. The results of this study help not only to evaluate the efficiency of this chatbot in creating dictionary articles, but also to delve deeper into the concept of intelligence, the thought processes and the actions to be carried out in both disciplines.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Czechia > South Moravian Region > Brno (0.05)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)
Building another Spanish dictionary, this time with GPT-4
Ortega-Martín, Miguel, García-Sierra, Óscar, Ardoiz, Alfonso, Armenteros, Juan Carlos, Garrido, Ignacio, Álvarez, Jorge, Torrón, Camilo, Galdeano, Iñigo, Arranz, Ignacio, Vorontsov, Oleg, Alonso, Adrián
We present the "Spanish Built Factual Freectianary 2.0" (Spanish-BFF-2) as the second iteration of an AI-generated Spanish dictionary. Previously, we developed the inaugural version of this unique free dictionary employing GPT-3. In this study, we aim to improve the dictionary by using GPT-4-turbo instead. Furthermore, we explore improvements made to the initial version and compare the performance of both models.
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States (0.04)
Contribución de la semántica combinatoria al desarrollo de herramientas digitales multilingües
This paper describes how the field of Combinatorial Semantics has contributed to the design of three prototypes for the automatic generation of argument patterns in nominal phrases in Spanish, French and German (Xera, Combinatoria and CombiContext). It also shows the importance of knowing about the syntactic-semantic interface of arguments in a production situation in the context of foreign languages. After a descriptive section on the design, typology and information levels of the resources, there follows an explanation of the central role of combinatorial meaning (roles and ontological features). The study deals with different semantic filters applied in the selection, organization and expansion of the lexicon, which are key to generating grammatically correct and semantically acceptable mono- and biargumental nominal phrases.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
- Europe > Spain > Galicia > A Coruña Province > Santiago de Compostela (0.05)
- Europe > France (0.05)
"Definition Modeling: To model definitions." Generating Definitions With Little to No Semantics
Segonne, Vincent, Mickus, Timothee
Definition Modeling, the task of generating definitions, was first proposed as a means to evaluate the semantic quality of word embeddings: a coherent lexical semantic representation of a word in context should contain all the information necessary to generate its definition. The relative novelty of this task means that we do not know which factors a Definition Modeling system actually relies on. In this paper, we present evidence that the task may not involve as much semantics as one might expect: we show that an earlier model from the literature is both rather insensitive to semantic aspects such as explicit polysemy and reliant on formal similarities between headwords and words occurring in its glosses, casting doubt on the validity of the task as a means to evaluate embeddings.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Spanish Built Factual Freectianary (Spanish-BFF): the first AI-generated free dictionary
Ortega-Martín, Miguel, García-Sierra, Óscar, Ardoiz, Alfonso, Armenteros, Juan Carlos, Álvarez, Jorge, Alonso, Adrián
Dictionaries are one of the oldest and most used linguistic resources. Building them is a complex task that, to the best of our knowledge, has yet to be explored with generative Large Language Models (LLMs). We introduce the "Spanish Built Factual Freectianary" (Spanish-BFF) as the first Spanish AI-generated dictionary. This first-of-its-kind free dictionary uses GPT-3. We also define future steps we aim to follow to improve this initial contribution to the field, such as adding more languages.
- North America > United States (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Europe > Spain > Balearic Islands (0.04)
American English Is Now Reliant on Scrabble's Dictionary
In the mid-1970s, top players in an emerging tournament Scrabble scene persuaded the game's corporate owner to adopt a universal lexicon for competition. Players manually scraped five standard college dictionaries, recording every unique two- through eight-letter word (plus inflections) that met the game's rules. When the Official Scrabble Players Dictionary was published, in 1978, players rejoiced. "You can retire the boxing gloves and put up your swords," the Scrabble Players Newspaper wrote. "You now have an arbiter to settle all arguments."
- North America > United States > Indiana (0.05)
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > Connecticut > Fairfield County > Westport (0.05)
- Europe > United Kingdom > England > East Sussex > Brighton (0.05)
Lexicography from Α to Ω
ELEXIS from Α to Ω: Outcomes, Sustainability & Afterlife of a new European Lexicographic Infrastructure. ELEXIS Showcase Event 2022 invites representatives of institutions that have become observers, as well as people from the industry, operating in fields such as Language Technology, Machine Translation, language learning, Dictionary Publishing, etc.