lexical gap
Crowdsourcing Lexical Diversity
Khalilia, Hadi, Otterbacher, Jahna, Bella, Gabor, Noortyani, Rusma, Darma, Shandy, Giunchiglia, Fausto
Lexical-semantic resources (LSRs), such as online lexicons or wordnets, are fundamental for natural language processing applications. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, but also, the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, the presence of foreign (Anglo-Saxon) concepts, as well as in the lack of an explicit indication of untranslatability, also known as cross-lingual \emph{lexical gaps}, when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing tool facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
- Europe > United Kingdom > UK North Sea (0.05)
- Atlantic Ocean > North Atlantic Ocean > North Sea > UK North Sea (0.05)
- Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)
- (30 more...)
Locally Measuring Cross-lingual Lexical Alignment: A Domain and Word Level Perspective
Karidi, Taelin, Grossman, Eitan, Abend, Omri
NLP research on aligning lexical representation spaces to one another has so far focused on aligning language spaces in their entirety. However, cognitive science has long focused on a local perspective, investigating whether translation equivalents truly share the same meaning or the extent that cultural and regional influences result in meaning variations. With recent technological advances and the increasing amounts of available data, the longstanding question of cross-lingual lexical alignment can now be approached in a more data-driven manner. However, developing metrics for the task requires some methodology for comparing metric efficacy. We address this gap and present a methodology for analyzing both synthetic validations and a novel naturalistic validation using lexical gaps in the kinship domain. We further propose new metrics, hitherto unexplored on this task, based on contextualized embeddings. Our analysis spans 16 diverse languages, demonstrating that there is substantial room for improvement with the use of newer language models. Our research paves the way for more accurate and nuanced cross-lingual lexical alignment methodologies and evaluation.
- Europe > Germany > Saxony > Leipzig (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > California (0.04)
- (6 more...)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
Advancing the Arabic WordNet: Elevating Content Quality
Freihat, Abed Alhakim, Khalilia, Hadi, Bella, Gábor, Giunchiglia, Fausto
High-quality WordNets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious issues of correctness and completeness with respect to the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and the semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on the Arabic language and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. As a result, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. In order to address issues of language diversity and untranslatability, we also extended the wordnet structure by new elements: phrasets and lexical gaps.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Asia > Middle East > Palestine (0.04)
- North America > United States > Pennsylvania (0.04)
- (10 more...)
- Research Report (0.50)
- Workflow (0.46)
Lexical Diversity in Kinship Across Languages and Dialects
Khalilia, Hadi, Bella, Gábor, Freihat, Abed Alhakim, Darma, Shandy, Giunchiglia, Fausto
Languages are known to describe the world in diverse ways. Across lexicons, diversity is pervasive, appearing through phenomena such as lexical gaps and untranslatability. However, in computational resources, such as multilingual lexical databases, diversity is hardly ever represented. In this paper, we introduce a method to enrich computational lexicons with content relating to linguistic diversity. The method is verified through two large-scale case studies on kinship terminology, a domain known to be diverse across languages and cultures: one case study deals with seven Arabic dialects, while the other one with three Indonesian languages. Our results, made available as browseable and downloadable computational resources, extend prior linguistics research on kinship terminology, and provide insight into the extent of diversity even within linguistically and culturally close communities.
- Europe > United Kingdom > UK North Sea (0.09)
- Atlantic Ocean > North Atlantic Ocean > North Sea > UK North Sea (0.09)
- Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)
- (16 more...)
Representing Interlingual Meaning in Lexical Databases
Giunchiglia, Fausto, Bella, Gabor, Nair, Nandu Chandran, Chi, Yang, Xu, Hao
In today's multilingual lexical databases, the majority of the world's languages are under-represented. Beyond a mere issue of resource incompleteness, we show that existing lexical databases have structural limitations that result in a reduced expressivity on culturally-specific words and in mapping them across languages. In particular, the lexical meaning space of dominant languages, such as English, is represented more accurately while linguistically or culturally diverse languages are mapped in an approximate manner. Our paper assesses state-of-the-art multilingual lexical databases and evaluates their strengths and limitations with respect to their expressivity on lexical phenomena of linguistic diversity.
- Europe > United Kingdom > UK North Sea (0.07)
- Atlantic Ocean > North Atlantic Ocean > North Sea > UK North Sea (0.07)
- Asia > India (0.05)
- (9 more...)
- Overview (0.68)
- Research Report (0.50)
Word Embedding based Correlation Model for Question/Answer Matching
Shen, Yikang, Rong, Wenge, Jiang, Nan, Peng, Baolin, Tang, Jie, Xiong, Zhang
With the development of community based question answering (Q&A) services, a large scale of Q&A archives have been accumulated and are an important information and knowledge resource on the web. Question and answer matching has been attached much importance to for its ability to reuse knowledge stored in these systems: it can be useful in enhancing user experience with recurrent questions. In this paper, we try to improve the matching accuracy by overcoming the lexical gap between question and answer pairs. A Word Embedding based Correlation (WEC) model is proposed by integrating advantages of both the translation model and word embedding, given a random pair of words, WEC can score their co-occurrence probability in Q&A pairs and it can also leverage the continuity and smoothness of continuous space word representation to deal with new pairs of words that are rare in the training parallel text. An experimental study on Yahoo! Answers dataset and Baidu Zhidao dataset shows this new method's promising potential.
- Europe > Netherlands > South Holland > Delft (0.04)
- Asia > China > Hong Kong (0.04)
- Africa > Burundi (0.04)
Convolutional Neural Tensor Network Architecture for Community-Based Question Answering
Qiu, Xipeng (Fudan University) | Huang, Xuanjing (Fudan University)
Retrieving similar questions is very important in community-based question answering. A major challenge is the lexical gap in sentence matching. In this paper, we propose a convolutional neural tensor network architecture to encode the sentences in semantic space and model their interactions with a tensor layer. Our model integrates sentence modeling and semantic matching into a single model, which can not only capture the useful information with convolutional and pooling layers, but also learn the matching metrics between the question and its answer. Besides, our model is a general architecture, with no need for the other knowledge such as lexical or syntactic analysis. The experimental results shows that our method outperforms the other methods on two matching tasks.
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)