AITopics | Lignos, Constantine

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

Palen-Michel, Chester; Pickering, Maxwell; Kruse, Maya; Sälevä, Jonne; Lignos, Constantine

arXiv.org Artificial Intelligence, Dec-12-2024

We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.

artificial intelligence, natural language, text processing, (14 more...)

arXiv.org Artificial Intelligence

2412.09587

Country:

Europe (1.00)
Asia (0.68)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English

Rueda, Andrew; Mellado, Elena Álvarez; Lignos, Constantine

arXiv.org Artificial Intelligence, May-20-2024

Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state-of-the-art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.

artificial intelligence, information retrieval, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.11865

Country:

North America > United States > Maryland (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Research Report (1.00)
Overview (0.66)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.90)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.56)

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Sälevä, Jonne; Lignos, Constantine

arXiv.org Artificial Intelligence, May-15-2024

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2405.09496

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre:

Research Report > New Finding (0.47)
Research Report > Experimental Study (0.46)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

QueryNER: Segmentation of E-commerce Queries

Palen-Michel, Chester; Liang, Lizzie; Wu, Zhe; Lignos, Constantine

arXiv.org Artificial Intelligence, May-15-2024

Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction, which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token and entity dropping for null and low recall query recovery. Challenging test sets are created using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2405.09507

Country:

North America > United States > New Mexico (0.14)
North America > United States > Minnesota (0.14)
North America > United States > Colorado (0.14)
Europe > United Kingdom > Scotland (0.14)

Genre: Research Report (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Leisure & Entertainment (0.93)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)

LR-Sum: Summarization for Less-Resourced Languages

Palen-Michel, Chester; Lignos, Constantine

arXiv.org Artificial Intelligence, Oct-26-2023

This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2212.09674

Country:

Europe (1.00)
Asia (1.00)
Africa (1.00)
North America > United States > Louisiana (0.14)

Genre: Research Report (0.40)

Industry:

Information Technology (0.66)
Government > Regional Government > North America Government > United States Government (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)

What changes when you randomly choose BPE merge operations? Not much

Sälevä, Jonne; Lignos, Constantine

arXiv.org Artificial Intelligence, May-4-2023

We introduce three simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into morphologically rich languages, hypothesizing that this task may show sensitivity to the method of choosing subwords. Analysis using a Bayesian linear model indicates that two of the variants perform nearly indistinguishably compared to standard BPE while the other degrades performance less than we anticipated. We conclude that although standard BPE is widely used, there exists an interesting universe of potential variations on it worth investigating. Our code is available at: https://github.com/bltlab/random-bpe.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2305.03029

Country: Europe (1.00)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Adelani, David Ifeoluwa; Neubig, Graham; Ruder, Sebastian; Rijhwani, Shruti; Beukman, Michael; Palen-Michel, Chester; Lignos, Constantine; Alabi, Jesujoba O.; Muhammad, Shamsuddeen H.; Nabende, Peter; Dione, Cheikh M. Bamba; Bukula, Andiswa; Mabuya, Rooweither; Dossou, Bonaventure F. P.; Sibanda, Blessing; Buzaaba, Happy; Mukiibi, Jonathan; Kalipe, Godson; Mbaye, Derguene; Taylor, Amelia; Kabore, Fatoumata; Emezue, Chris Chinenye; Aremu, Anuoluwapo; Ogayo, Perez; Gitau, Catherine; Munkoh-Buabeng, Edwin; Koagne, Victoire M.; Tapo, Allahsera Auguste; Macucwa, Tebogo; Marivate, Vukosi; Mboning, Elvis; Gwadabe, Tajuddeen; Adewumi, Tosin; Ahia, Orevaoghene; Nakatumba-Nabende, Joyce; Mokono, Neo L.; Ezeani, Ignatius; Chukwuneke, Chiamaka; Adeyemi, Mofetoluwa; Hacheme, Gilles Q.; Abdulmumin, Idris; Ogundepo, Odunayo; Yousuf, Oreen; Ngoli, Tatiana Moteu; Klakow, Dietrich

arXiv.org Artificial Intelligence, Nov-15-2022

African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.

computational linguistic, information retrieval, natural language, (19 more...)

arXiv.org Artificial Intelligence

2210.12391

Country:

Europe (1.00)
Asia (1.00)
Africa (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Macro-Average: Rare Types Are Important Too

Gowda, Thamme; You, Weiqiu; Lignos, Constantine; May, Jonathan

arXiv.org Artificial Intelligence, Apr-12-2021

While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric, MacroF1, and study its applicability to MT evaluation. We find that MacroF1 is competitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Further, we show that MacroF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods' outputs.
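To make the contrast between macro- and micro-averaging concrete, here is a minimal sketch of a type-based F1 computed over whitespace tokens. It is an illustration only, not the paper's exact formulation: the function names `f1` and `macro_micro_f1`, the whitespace tokenization, the clipped per-segment match counts, and the absence of smoothing are all assumptions made for this example.

```python
from collections import Counter


def f1(match, hyp_total, ref_total):
    """F1 from a match count and the hypothesis/reference totals."""
    if match == 0:
        return 0.0
    precision = match / hyp_total
    recall = match / ref_total
    return 2 * precision * recall / (precision + recall)


def macro_micro_f1(hypotheses, references):
    """Return (macro_f1, micro_f1) over word types for parallel segment lists.

    Assumption: a "type" is a whitespace-delimited token, and matches are
    clipped per segment (min of hypothesis and reference counts).
    """
    match, hyp_counts, ref_counts = Counter(), Counter(), Counter()
    for hyp, ref in zip(hypotheses, references):
        h, r = Counter(hyp.split()), Counter(ref.split())
        hyp_counts.update(h)
        ref_counts.update(r)
        match.update(h & r)  # Counter intersection keeps the minimum counts

    types = set(hyp_counts) | set(ref_counts)
    if not types:
        return 0.0, 0.0

    # Macro: every type contributes equally, however rare it is.
    macro = sum(f1(match[t], hyp_counts[t], ref_counts[t]) for t in types) / len(types)
    # Micro: pool all counts first, so frequent types dominate.
    micro = f1(sum(match.values()), sum(hyp_counts.values()), sum(ref_counts.values()))
    return macro, micro
```

For the single pair `"the cat sat"` versus `"the cat slept"`, the two mismatched rare types each score zero and count as much as the two matched types in the macro average, while the micro average pools the token counts.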

law enforcement, orchestra, us government, (18 more...)

arXiv.org Artificial Intelligence

2104.057

Country:

Europe (1.00)
Asia (1.00)
Africa (1.00)
(3 more...)

Genre: Research Report (1.00)

Industry:

Media (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Leisure & Entertainment > Sports (0.93)
(4 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Mining Wikidata for Name Resources for African Languages

Sälevä, Jonne; Lignos, Constantine

arXiv.org Artificial Intelligence, Apr-1-2021

This work supports further development of language technology for the languages of Africa by providing a Wikidata-derived resource of name lists corresponding to common entity types (person, location, and organization). While we are not the first to mine Wikidata for name lists, our approach emphasizes scalability and replicability and addresses data quality issues for languages that do not use Latin scripts. We produce lists containing approximately 1.9 million names across 28 African languages. We describe the data, the process used to produce it, and its limitations, and provide the software and data for public use. Finally, we discuss the ethical considerations of producing this resource and others of its kind.

artificial intelligence, natural language, wikidata, (15 more...)

arXiv.org Artificial Intelligence

2104.00558

Country:

Africa (0.35)
Asia (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science (0.86)

TMR: Evaluating NER Recall on Tough Mentions

Tu, Jingxuan; Lignos, Constantine

arXiv.org Artificial Intelligence, Mar-23-2021

We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of "tough" mentions: unseen mentions, those whose tokens or token/type combination were not observed in training, and type-confusable mentions, token sequences with multiple entity types in the test data. We demonstrate the usefulness of these metrics by evaluating corpora of English, Spanish, and Dutch using five recent neural architectures. We identify subtle differences between the performance of BERT and Flair on two English NER corpora and identify a weak spot in the performance of current models in Spanish. We conclude that the TMR metrics enable differentiation between otherwise similar-scoring systems and identification of patterns in performance that would go unnoticed from overall precision, recall, and F1.
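The subsets named in the abstract can be sketched in a few lines. The following is a hypothetical illustration based only on the definitions given above, not the authors' implementation: the mention representation `(position, tokens, type)`, the function names, and the exact-match criterion for a hit are assumptions.

```python
from collections import defaultdict


def recall(gold_subset, predicted):
    """Fraction of gold mentions found in predicted; None if the subset is empty."""
    if not gold_subset:
        return None
    hits = sum(1 for mention in gold_subset if mention in predicted)
    return hits / len(gold_subset)


def tough_mention_recall(train, gold, predicted):
    """Recall on "tough" subsets of the gold test mentions.

    train:     iterable of (tokens, type) pairs observed in training
    gold:      list of (position, tokens, type) test mentions
    predicted: iterable of (position, tokens, type) system mentions
    tokens is a tuple of strings; position is any hashable location id.
    """
    train_pairs = set(train)
    train_tokens = {tokens for tokens, _ in train_pairs}
    predicted = set(predicted)

    # Token sequences that carry more than one entity type in the test data.
    types_by_tokens = defaultdict(set)
    for _, tokens, etype in gold:
        types_by_tokens[tokens].add(etype)

    unseen_tokens = [m for m in gold if m[1] not in train_tokens]
    unseen_pairs = [m for m in gold if (m[1], m[2]) not in train_pairs]
    confusable = [m for m in gold if len(types_by_tokens[m[1]]) > 1]

    return {
        "all": recall(gold, predicted),
        "unseen_tokens": recall(unseen_tokens, predicted),
        "unseen_token_type": recall(unseen_pairs, predicted),
        "type_confusable": recall(confusable, predicted),
    }
```

In this sketch a system that mislabels a single ambiguous mention can keep a high overall recall while scoring much lower on the unseen token/type and type-confusable subsets, which is the kind of difference the abstract says overall scores hide.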

computational linguistics, deep learning, neural network, (22 more...)

arXiv.org Artificial Intelligence

2103.12312

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)
