AITopics | multiword expression

multiword expression

Evaluating the Impact of Verbal Multiword Expressions on Machine Translation

Liu, Linfeng, Ghosh, Saptarshi, Jiang, Tianyu

arXiv.org Artificial Intelligence, Aug-26-2025

Verbal multiword expressions (VMWEs) present significant challenges for natural language processing due to their complex and often non-compositional nature. While machine translation models have seen significant improvement with the advent of language models in recent years, accurately translating these complex linguistic structures remains an open problem. In this study, we analyze the impact of three VMWE categories (verbal idioms, verb-particle constructions, and light verb constructions) on machine translation quality from English to multiple languages. Using both established multiword expression datasets and sentences containing these language phenomena extracted from machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality. We also propose an LLM-based paraphrasing approach that replaces these expressions with their literal counterparts, demonstrating significant improvement in translation quality for verbal idioms and verb-particle constructions.

machine learning, natural language, translation, (17 more...)

arXiv:2508.17458

Country:

Europe (1.00)
Asia (0.67)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs

Bonial, Claire, Bonn, Julia, Madabushi, Harish Tayyar

arXiv.org Artificial Intelligence, Aug-25-2025

In this chapter, we argue for the benefits of understanding multiword expressions from the perspective of usage-based, construction grammar approaches. We begin with a historical overview of how construction grammar was developed in order to account for idiomatic expressions using the same grammatical machinery as the non-idiomatic structures of language. We cover a comprehensive description of constructions, which are pairings of meaning with form of any size (morpheme, word, phrase), as well as how constructional approaches treat the acquisition and generalization of constructions. We describe a successful case study leveraging constructional templates for representing multiword expressions in English PropBank. Because constructions can be at any level or unit of form, we then illustrate the benefit of a constructional representation of multi-meaningful morphosyntactic unit constructions in Arapaho, a highly polysynthetic and agglutinating language. We include a second case study leveraging constructional templates for representing these multi-morphemic expressions in Uniform Meaning Representation. Finally, we demonstrate the similarities and differences between a usage-based explanation of a speaker learning a novel multiword expression, such as "dancing with deer," and that of a large language model. We present experiments showing that both models and speakers can generalize the meaning of novel multiword expressions based on a single exposure of usage. However, only speakers can reason over the combination of two such expressions, as this requires comparison of the novel forms to a speaker's lifetime of stored constructional exemplars, which are rich with cross-modal details.

artificial intelligence, large language model, natural language, (20 more...)

arXiv:2508.15977

Country:

North America > United States (1.00)
Europe > United Kingdom > England (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

Zaitova, Iuliia, Hirak, Vitalii, Abdullah, Badr M., Klakow, Dietrich, Möbius, Bernd, Avgustinova, Tania

arXiv.org Artificial Intelligence, May-12-2025

This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages: English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.

computational linguistic, machine learning, natural language, (18 more...)

arXiv:2505.06062

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts

De Leon, Frances Laureano, Madabushi, Harish Tayyar, Lee, Mark G.

arXiv.org Artificial Intelligence, Apr-30-2025

Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. While large language models have demonstrated strong performance across many tasks, their ability to handle such linguistic subtleties remains uncertain. Therefore, this study evaluates how state-of-the-art language models process the ambiguity of potentially idiomatic multiword expressions, particularly in contexts that are less frequent, where models are less likely to rely on memorisation. By evaluating models in Portuguese and Galician, in addition to English, and using a novel code-switched dataset and a novel task, we find that large language models, despite their strengths, struggle with nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel tasks we introduce, despite their similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those which are ambiguous, continue to be a challenge to models.

large language model, machine learning, natural language, (17 more...)

arXiv:2504.20051

Country:

Europe (0.93)
North America > Mexico (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Annotating Compositionality Scores for Irish Noun Compounds is Hard Work

Walsh, Abigail, Clifford, Teresa, Daly, Emma, Dunne, Jane, Davis, Brian, Cleircín, Gearóid Ó

arXiv.org Artificial Intelligence, Feb-14-2025

Noun compounds constitute a challenging construction for NLP applications, given their variability in idiomaticity and interpretation. In this paper, we present an analysis of compound nouns identified in Irish text of varied domains by expert annotators, focusing on compositionality as a key feature, but also on domain specificity, as well as the familiarity and confidence of the annotator giving the ratings. Our findings and the discussion that ensued contribute towards a greater understanding of how these constructions appear in the Irish language, and how they might be treated separately from English noun compounds.

artificial intelligence, construction, natural language, (17 more...)

arXiv:2502.10061

Country:

South America > Colombia > Meta Department > Villavicencio (0.05)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
North America > United States > Colorado > Denver County > Denver (0.04)
(5 more...)

Genre: Research Report > New Finding (0.34)

Industry: Education (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Overview of MWE history, challenges, and horizons: standing at the 20th anniversary of the MWE workshop series via MWE-UD2024

Han, Lifeng, Evang, Kilian, Bhatia, Archna, Bouma, Gosse, Doğruöz, A. Seza, Garcia, Marcos, Giouli, Voula, Nivre, Joakim, Rademacher, Alexandre

arXiv.org Artificial Intelligence, Dec-25-2024

The first MWE workshop was held with ACL in Sapporo, Japan, in 2003. This year, the joint MWE-UD workshop, co-located with the LREC-COLING 2024 conference, marked the 20th anniversary of MWE workshop events, which have now run for nearly two decades. Standing at this milestone, we look back on this workshop series and summarise the research topics and methodologies researchers have pursued over the years. We also discuss the current challenges that we are facing and the broader impacts and synergies of MWE research within the CL and NLP fields. Finally, we give future research perspectives. We hope this position paper can help researchers, students, and industrial practitioners interested in MWEs get a brief but accessible understanding of the field's history, current state, and possible future.

artificial intelligence, natural language, text processing, (13 more...)

arXiv:2412.18868

Country:

Asia > Japan > Hokkaidō > Hokkaidō Prefecture > Sapporo (0.25)
South America > Colombia > Meta Department > Villavicencio (0.06)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.05)
(24 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.71)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)

CoAM: Corpus of All-Type Multiword Expressions

Ide, Yusuke, Tanner, Joshua, Nohejl, Adam, Hoffman, Jacob, Vasselli, Justin, Kamigaito, Hidetaka, Watanabe, Taro

arXiv.org Artificial Intelligence, Dec-23-2024

Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation. Existing datasets for MWE identification are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences. To enhance data quality, it was constructed through a multi-step process consisting of human annotation, human review, and automated consistency checking. MWEs in CoAM are tagged with MWE types, such as Noun and Verb, to enable fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form, including discontinuous ones. Through experiments using CoAM, we find that a fine-tuned large language model outperforms the current state-of-the-art approach for MWE identification. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.

artificial intelligence, large language model, natural language, (18 more...)

arXiv:2412.18151

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(20 more...)

Genre: Research Report (0.84)

Industry: Transportation (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousal

Martínez, Gonzalo, Molero, Juan Diego, González, Sandra, Conde, Javier, Brysbaert, Marc, Reviriego, Pedro

arXiv.org Artificial Intelligence, Aug-16-2024

This study investigates the potential of large language models (LLMs) to provide accurate estimates of concreteness, valence and arousal for multi-word expressions. Unlike previous artificial intelligence (AI) methods, LLMs can capture the nuanced meanings of multi-word expressions. We systematically evaluated ChatGPT-4o's ability to predict concreteness, valence and arousal. In Study 1, ChatGPT-4o showed strong correlations with human concreteness ratings (r = .8) for multi-word expressions. In Study 2, these findings were repeated for valence and arousal ratings of individual words, matching or outperforming previous AI models. Study 3 extended the valence and arousal analysis to multi-word expressions and showed promising results despite the lack of large-scale human benchmarks. These findings highlight the potential of LLMs for generating valuable psycholinguistic data related to multi-word expressions. To help researchers with stimulus selection, we provide datasets with AI norms of concreteness, valence and arousal for 126,397 English single words and 63,680 multi-word expressions.
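
As a rough illustration of the kind of correlation reported in Study 1, the sketch below computes a Pearson coefficient between human and model ratings. It assumes that r denotes Pearson's correlation, which the abstract does not state explicitly. The function name `pearson_r` and the toy ratings are hypothetical and are not taken from the paper or its datasets.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical toy ratings on a 1-5 concreteness scale (invented, not from the paper).
human_ratings = [4.8, 1.9, 3.2, 4.1, 2.4]
model_ratings = [4.5, 2.2, 3.0, 4.4, 2.0]
r = pearson_r(human_ratings, model_ratings)
print(f"r = {r:.2f}")
```

A value near 1 means the model's ratings rise and fall with the human ratings; it says nothing about whether the two sets of ratings agree in absolute level.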

expression, human rating, multiword expression, (15 more...)

arXiv:2408.16012

Country:

Europe > Spain > Galicia > Madrid (0.05)
Africa > Kenya > Mandera County > Mandera (0.05)
North America > United States > South Carolina (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Law Enforcement & Public Safety > Terrorism (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Samadarshi, Prisha, Mustafa, Mariam, Kulkarni, Anushka, Rothkopf, Raven, Chakrabarty, Tuhin, Muresan, Smaranda

arXiv.org Artificial Intelligence, Jul-15-2024

The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.
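
To make the "fully solve" figure concrete, here is a minimal sketch of how such a rate could be computed. It assumes that a game counts as fully solved only when all four predicted groups match the gold groups exactly, regardless of order. The abstract does not spell out the paper's scoring procedure, so this is an illustration and not the authors' method. The function names and the example puzzle are hypothetical.

```python
def normalize(groups):
    """Turn a list of word groups into an order-insensitive set of frozensets."""
    return {frozenset(word.lower() for word in group) for group in groups}

def is_fully_solved(predicted, gold):
    """True only if every predicted group matches a gold group exactly."""
    return normalize(predicted) == normalize(gold)

def fully_solved_rate(games):
    """Fraction of (predicted, gold) pairs that are fully solved."""
    if not games:
        raise ValueError("no games to score")
    return sum(is_fully_solved(p, g) for p, g in games) / len(games)

# Hypothetical 16-word puzzle, invented for illustration (not from the dataset).
gold = [
    ["apple", "pear", "plum", "fig"],
    ["red", "blue", "green", "pink"],
    ["oak", "elm", "ash", "fir"],
    ["cat", "dog", "cow", "pig"],
]
# Same puzzle with "fig" and "fir" swapped between two groups.
near_miss = [
    ["apple", "pear", "plum", "fir"],
    ["red", "blue", "green", "pink"],
    ["oak", "elm", "ash", "fig"],
    ["cat", "dog", "cow", "pig"],
]
rate = fully_solved_rate([(gold, gold), (near_miss, gold)])
print(f"fully solved: {rate:.0%}")
```

Under this strict criterion a single misplaced word spoils two groups at once, which is one reason a fully-solved rate can be much lower than per-group accuracy.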

category, llm, reasoning, (15 more...)

arXiv:2406.11012

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Generating Continuations in Multilingual Idiomatic Contexts

Pokharel, Rhitabrat, Agrawal, Ameeta

arXiv.org Artificial Intelligence, Nov-4-2023

The ability to process idiomatic or literal multiword expressions is a crucial aspect of understanding and generating any language. The task of generating contextually relevant continuations for narratives containing idiomatic (or literal) expressions can allow us to test the ability of generative language models (LMs) in understanding nuanced language containing non-compositional figurative text. We conduct a series of experiments using datasets in two distinct languages (English and Portuguese) under three different training settings (zero-shot, few-shot, and fine-tuned). Our results suggest that the models are only slightly better at generating continuations for literal contexts than idiomatic contexts, with exceedingly small margins. Furthermore, the models studied in this work perform equally well across both languages, indicating the robustness of generative models in performing this task.

continuation, expression, portuguese, (13 more...)

arXiv:2310.20195

Country:

North America > United States (0.14)
Asia > Indonesia > Bali (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
