Dictionary


Is the Dictionary Done For?

The New Yorker

The print edition of Merriam-Webster was once a touchstone of authority and stability. Then the internet brought about a revolution. Wars over words are inevitably culture wars, and debates over the dictionary have raged for as long as it has existed. Once, every middle-class home had a piano and a dictionary. The purpose of the piano was to let the family hear music at home before phonographs were available and affordable. Later on, it was to torture young persons by insisting that they learn to do something few people do well. The purpose of the dictionary was to settle intra-family disputes over the spelling of words like "camaraderie" and "sesquipedalian," or over the correct pronunciation of "puttee." This was the state of the world not that long ago. In the late nineteen-eighties, Merriam-Webster's Collegiate Dictionary was on the best-seller list for a hundred and fifty-five consecutive weeks. Fifty-seven million copies were sold, a number believed to be second only, in this country, to sales of the Bible. There was good money in the word business.


Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

Masoka, Happymore

arXiv.org Artificial Intelligence

Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.
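For readers who want to try the toolkit, a minimal usage sketch follows. The component name shona_morph and the example sentence are assumptions, since the abstract documents only the pip package; consult the linked README for the actual API.

```python
# Minimal sketch, assuming shona-spacy registers a spaCy pipeline component.
import spacy
import shona_spacy  # assumed to register the custom component on import

# blank multi-language pipeline; the package may also register Shona ("sn")
nlp = spacy.blank("xx")
nlp.add_pipe("shona_morph")  # component name assumed; see the project README

doc = nlp("Vana vanodzidza chiShona")
for token in doc:
    # token-level annotations the paper describes: lemma, POS, morph features
    print(token.text, token.lemma_, token.pos_, token.morph)
```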


Do you know your 2025 lingo? As 'parasocial' is named word of the year, take the test to see if you can keep up with this year's trending language

Daily Mail - Science & tech



InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

Wu, Yunze, Fu, Dayuan, Si, Weiye, Huang, Zhen, Jiang, Mohan, Li, Keyu, Xia, Shijie, Sun, Jie, Xu, Tianze, Hu, Xiangkun, Lu, Pengrui, Cai, Xiaojie, Ye, Lyumanshan, Zhu, Wenhong, Xiao, Yang, Liu, Pengfei

arXiv.org Artificial Intelligence

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, each requiring runnable artifacts that are assessed for correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning, using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, exhibiting impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and its potential to serve as the next generation of code-based research benchmarks.
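As a rough illustration of the agent design (not the paper's ResearchGym implementation; the llm and env interfaces below are assumptions), a ReAct-style loop alternates model-written thoughts with executed actions:

```python
# Illustrative ReAct loop: explicit reasoning interleaved with executable steps.
def react_loop(llm, env, task, max_steps=50):
    """llm: callable prompt -> text; env: object with execute(action) -> str."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # the model emits free text of the form "Thought: ... Action: ..."
        reply = llm("\n".join(history))
        _thought, _, action = reply.partition("Action:")
        history.append(reply)
        if action.strip().startswith("finish"):
            break
        # run the proposed command or code in the research environment
        observation = env.execute(action.strip())
        history.append(f"Observation: {observation}")
    return history
```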


Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries

Jürviste, Madis, Jakobson, Joonatan

arXiv.org Artificial Intelligence

This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM performing text recognition and a second merging the structured output. These findings demonstrate that even for minor languages, LLMs have significant potential for saving time and financial resources.
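A schematic of the tiling pipeline described above may help. Tile height, overlap, and the two LLM callables are assumptions, as the abstract does not specify them:

```python
# Sketch of an overlapping-tiling digitisation pipeline, under assumed parameters.
from PIL import Image

def tile_page(path, tile_h=1200, overlap=200):
    """Split a scanned page into vertically overlapping tiles."""
    page = Image.open(path)
    w, h = page.size
    tiles, top = [], 0
    while top < h:
        tiles.append(page.crop((0, top, w, min(top + tile_h, h))))
        top += tile_h - overlap  # overlap keeps headword entries from being cut in half
    return tiles

def digitise(path, recognise, merge):
    # `recognise`: vision-LLM call returning structured JSON for one tile
    # `merge`: second LLM call reconciling the overlapping partial outputs
    partial = [recognise(tile) for tile in tile_page(path)]
    return merge(partial)
```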


SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Goworek, Roksana, Karlcut, Harpal, Shezad, Muhammad, Darshana, Nijaguna, Mane, Abhishek, Bondada, Syam, Sikka, Raghav, Mammadov, Ulvi, Allahverdiyev, Rauf, Purighella, Sriram, Gupta, Paridhi, Ndegwa, Muhinyia, Dubossarsky, Haim

arXiv.org Artificial Intelligence

This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
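For context, a Word-in-Context instance pairs two sentences containing the same polysemous lemma with a binary same-sense label. The record below is invented for illustration and does not come from the released datasets:

```python
# Illustrative WiC-format instance (field names and example are assumptions).
wic_example = {
    "lemma": "bank",
    "sentence1": "She sat on the bank of the river.",
    "sentence2": "He deposited the cheque at the bank.",
    # binary label: do the two occurrences share the same sense?
    "label": False,
}
```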


Radiological and Biological Dictionary of Radiomics Features: Addressing Understandable AI Issues in Personalized Breast Cancer; Dictionary Version BM1.0

Gorji, Arman, Sanati, Nima, Pouria, Amir Hossein, Mehrnia, Somayeh Sadat, Hacihaliloglu, Ilker, Rahmim, Arman, Salmanpour, Mohammad R.

arXiv.org Artificial Intelligence

Radiomics-based AI models show promise for breast cancer diagnosis but often lack interpretability, limiting clinical adoption. This study addresses the gap between radiomic features (RF) and the standardized BI-RADS lexicon by proposing a dual-dictionary framework. First, a Clinically-Informed Feature Interpretation Dictionary (CIFID) was created by mapping 56 RFs to BI-RADS descriptors (shape, margin, internal enhancement) through literature and expert review. The framework was applied to classify triple-negative breast cancer (TNBC) versus non-TNBC using dynamic contrast-enhanced MRI from a multi-institutional cohort of 1,549 patients. We trained 27 machine learning classifiers with 27 feature selection methods. SHapley Additive exPlanations (SHAP) were used to interpret predictions and generate a complementary Data-Driven Feature Interpretation Dictionary (DDFID) for 52 additional RFs. The best model, combining Variance Inflation Factor (VIF) selection with Extra Trees Classifier, achieved an average cross-validation accuracy of 0.83. Key predictive RFs aligned with clinical knowledge: higher Sphericity (round/oval shape) and lower Busyness (more homogeneous enhancement) were associated with TNBC. The framework confirmed known imaging biomarkers and uncovered novel, interpretable associations. This dual-dictionary approach (BM1.0) enhances AI model transparency and supports the integration of RFs into routine breast cancer diagnosis and personalized care.
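A minimal sketch of the best-performing pipeline named above (VIF-based feature selection followed by an Extra Trees classifier) is given below. The VIF threshold, cross-validation settings, and data loading are assumptions, not the study's protocol:

```python
# Hedged sketch: VIF feature selection + Extra Trees, as named in the abstract.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def vif_select(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all fall below threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = X.drop(columns=[X.columns[worst]])
    return X

# X: one row of radiomic features per lesion; y: TNBC (1) vs. non-TNBC (0)
# X_sel = vif_select(X)
# clf = ExtraTreesClassifier(n_estimators=500, random_state=0)
# print(cross_val_score(clf, X_sel, y, cv=5).mean())  # study reports ~0.83
```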


From Dictionary to Tensor: A Scalable Multi-View Subspace Clustering Framework with Triple Information Enhancement

Neural Information Processing Systems

While Tensor-based Multi-view Subspace Clustering (TMSC) has garnered significant attention for its capacity to effectively capture high-order correlations among multiple views, three notable limitations in current TMSC methods necessitate consideration: 1) high computational complexity and reliance on dictionary completeness resulting from using observed data as the dictionary, 2) inaccurate subspace representation stemming from the oversight of local geometric information, and 3) under-penalization of noise-related singular values within tensor data caused by treating all singular values equally. To address these limitations, the proposed framework introduces three corresponding enhancements. First, an enhanced anchor dictionary learning mechanism is utilized to recover the low-rank anchor structure, resulting in reduced computational complexity and increased resilience, especially in scenarios with inadequate dictionaries. Additionally, we introduce an anchor hypergraph Laplacian regularizer to preserve the inherent geometry of the data within the subspace representation. Finally, an improved hyperbolic tangent function is employed as a precise approximation for tensor rank, effectively capturing the significant variations in singular values. Extensive experimentation on a variety of datasets demonstrates that our approach surpasses SOTA methods in both effectiveness and efficiency.
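The abstract does not give the exact form of the improved hyperbolic tangent surrogate; a standard tanh-type rank approximation of the kind it gestures at, written for singular values sigma_i of a tensor X with a scale parameter gamma, is:

```latex
\operatorname{rank}(\mathcal{X}) \;\approx\; \sum_{i} \tanh\!\left(\frac{\sigma_i(\mathcal{X})}{\gamma}\right), \qquad \gamma > 0.
```

Because tanh saturates at 1 for sigma much larger than gamma and is nearly linear near zero, large (signal) singular values are counted fully while small, noise-related ones contribute proportionally less, rather than all singular values being penalized equally as under the nuclear norm.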


Inaccuracy of an E-Dictionary and Its Influence on Chinese Language Users

Wang, Xi, Meng, Fanfei, Zhang, Shiyang, Li, Lan

arXiv.org Artificial Intelligence

Electronic dictionaries have largely replaced paper dictionaries and become central tools for L2 learners seeking to expand their vocabulary. Users often assume these resources are reliable and rarely question the validity of the definitions provided. The accuracy of major E-dictionaries is seldom scrutinized, and little attention has been paid to how their corpora are constructed. Research on dictionary use, particularly the limitations of electronic dictionaries, remains scarce. This study adopts a combined method of experimentation, user survey, and dictionary critique to examine Youdao, one of the most widely used E-dictionaries in China. The experiment involved a translation task paired with retrospective reflection. Participants were asked to translate sentences containing words that are insufficiently or inaccurately defined in Youdao. Their consultation behavior was recorded to analyze how faulty definitions influenced comprehension. Results show that incomplete or misleading definitions can cause serious misunderstandings. Additionally, students exhibited problematic consultation habits. The study further explores how such flawed definitions originate, highlighting issues in data processing and the integration of AI and machine learning technologies in dictionary construction. The findings suggest a need for better training in dictionary literacy for users, as well as improvements in the underlying AI models used to build E-dictionaries.


GEAR: A Simple GENERATE, EMBED, AVERAGE AND RANK Approach for Unsupervised Reverse Dictionary

Almeman, Fatemah, Espinosa-Anke, Luis

arXiv.org Artificial Intelligence

Reverse Dictionary (RD) is the task of obtaining the most relevant word or set of words given a textual description or dictionary definition. Effective RD methods have applications in accessibility, translation, and writing-support systems. Moreover, in NLP research, RD is used to benchmark text encoders at various granularities, as it often requires word, definition, and sentence embeddings. In this paper, we propose a simple approach to RD that leverages LLMs in combination with embedding models. Despite its simplicity, this approach outperforms supervised baselines on well-studied RD datasets while also showing less over-fitting. We also conduct a number of experiments on different dictionaries and analyze how different styles, registers, and target audiences impact the quality of RD systems. We conclude that, on average, untuned embeddings alone fare well below an LLM-only baseline (although they are competitive in highly technical dictionaries) but are crucial for boosting performance in combined methods.
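A rough sketch of the recipe the title names (Generate, Embed, Average, Rank) follows; the llm and embed interfaces are assumptions for illustration, not the paper's code:

```python
# Hedged sketch of a GENERATE-EMBED-AVERAGE-RANK reverse-dictionary pipeline.
import numpy as np

def gear(definition, llm, embed, vocabulary, k=10):
    """llm: prompt -> list[str]; embed: word -> np.ndarray; vocabulary: list[str]."""
    # GENERATE: ask an LLM for candidate words matching the definition
    candidates = llm(f"List words meaning: {definition}")
    # EMBED: embed each generated candidate with an embedding model
    vectors = np.stack([embed(w) for w in candidates])
    # AVERAGE: a single query vector summarising the candidates
    query = vectors.mean(axis=0)
    # RANK: score every vocabulary word by cosine similarity to the query
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sorted(vocabulary, key=lambda w: cos(embed(w), query), reverse=True)[:k]
```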