AITopics | gold data

Collaborating Authors

gold data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

Anikina, Tatiana, Cegin, Jan, Simko, Jakub, Ostermann, Simon

arXiv.org Artificial IntelligenceSep-22-2025

Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.12158

Country:

Europe (1.00)
Asia (0.68)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality

Newman, Benjamin, Ravichander, Abhilasha, Jung, Jaehun, Xin, Rui, Ivison, Hamish, Kuznetsov, Yegor, Koh, Pang Wei, Choi, Yejin

arXiv.org Artificial IntelligenceJul-14-2025

Language models are prone to hallucination - generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models' own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models' judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a models' own beliefs can provide a powerful signal for factuality.

factuality, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.08371

Country:

North America > United States (1.00)
Europe > France (1.00)
Asia (1.00)

Genre: Research Report > New Finding (0.93)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area > Dermatology (1.00)
Government > Voting & Elections (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM

Fujii, Takuro, Katsumata, Satoru

arXiv.org Artificial IntelligenceDec-9-2024

Recently some studies have highlighted the potential of Large Language Models (LLMs) as effective generators of supervised training data, offering advantages such as enhanced inference efficiency and reduced costs associated with data collection. However, these studies have predominantly focused on English language tasks. In this paper, we address the fundamental research question: Can LLMs serve as proficient training data generators for other language tasks? Specifically, we leverage LLMs to synthesize supervised training data under few-shot and zero-shot learning scenarios across six diverse Japanese downstream tasks. Subsequently, we utilize this synthesized data to train compact models (e.g., BERT). This novel methodology is termed JAPAGEN. Our experimental findings underscore that JAPAGEN achieves robust performance in classification tasks that necessitate formal text inputs, demonstrating competitive results compared to conventional LLM prompting strategies.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2412.06738

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.31)
Health & Medicine > Therapeutic Area > Immunology (0.31)
Health & Medicine > Epidemiology (0.31)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Findings of the Third Shared Task on Multilingual Coreference Resolution

Novák, Michal, Dohnalová, Barbora, Konopík, Miloslav, Nedoluzhko, Anna, Popel, Martin, Pražák, Ondřej, Sido, Jakub, Straka, Milan, Žabokrtský, Zdeněk, Zeman, Daniel

arXiv.org Artificial IntelligenceNov-9-2024

The paper presents an overview of the third edition of the shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop. Similarly to the previous two editions, the participants were challenged to develop systems capable of identifying mentions and clustering them based on identity coreference. This year's edition took another step towards real-world application by not providing participants with gold slots for zero anaphora, increasing the task's complexity and realism. In addition, the shared task was expanded to include a more diverse set of languages, with a particular focus on historical languages. The training and evaluation data were drawn from version 1.2 of the multilingual collection of harmonized coreference resources CorefUD, encompassing 21 datasets across 15 languages. 6 systems competed in this shared task.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2410.15949

Country:

Europe > Hungary > Csongrád-Csanád County > Szeged (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > Germany > Brandenburg > Potsdam (0.04)
(16 more...)

Genre:

Overview (0.86)
Research Report (0.64)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.67)

Add feedback

Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms

Lee, Wonkee, Heo, Seong-Hwan, Lee, Jong-Hyeok

arXiv.org Artificial IntelligenceJun-3-2024

Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.

dataset, gold data, synthetic data, (13 more...)

arXiv.org Artificial Intelligence

2204.03896

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Germany > Berlin (0.04)
North America > United States > New York (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data

Li, Rumeng, Wang, Xun, Yu, Hong

arXiv.org Artificial IntelligenceDec-9-2023

Large language models (LLMs) can generate natural language texts for various domains and tasks, but their potential for clinical text mining, a domain with scarce, sensitive, and imbalanced medical data, is underexplored. We investigate whether LLMs can augment clinical data for detecting Alzheimer's Disease (AD)-related signs and symptoms from electronic health records (EHRs), a challenging task that requires high expertise. We create a novel pragmatic taxonomy for AD sign and symptom progression based on expert knowledge, which guides LLMs to generate synthetic data following two different directions: "data-to-label", which labels sentences from a public EHR collection with AD-related signs and symptoms; and "label-to-data", which generates sentences with AD-related signs and symptoms based on the label definition. We train a system to detect AD-related signs and symptoms from EHRs, using three datasets: (1) a gold dataset annotated by human experts on longitudinal EHRs of AD patients; (2) a silver dataset created by the data-to-label method; and (3) a bronze dataset created by the label-to-data method. We find that using the silver and bronze datasets improves the system performance, outperforming the system using only the gold dataset. This shows that LLMs can generate synthetic clinical data for a complex task by incorporating expert knowledge, and our label-to-data method can produce datasets that are free of sensitive information, while maintaining acceptable quality.

dataset, llm, symptom, (15 more...)

arXiv.org Artificial Intelligence

2401.06774

Country:

North America > United States > Wisconsin (0.04)
North America > United States > Massachusetts > Middlesex County > Lowell (0.04)
North America > United States > California (0.04)
(2 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

SilverAlign: MT-Based Silver Data Algorithm For Evaluating Word Alignment

Köksal, Abdullatif, Severini, Silvia, Schütze, Hinrich

arXiv.org Artificial IntelligenceMar-27-2023

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different domains and languages when gold data are not available. This addresses the important scenario of missing gold data alignments for low-resource languages.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2210.06207

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(10 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

3 Data Quality Stages for Preparing Machine Learning Data

#artificialintelligenceNov-24-2022, 13:15:44 GMT

This is part of Solutions Review's Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, dotData Founder and CEO Ryohei Fujimaki offers commentary on data quality strategies to get your data machine learning-ready. As the world embraces machine learning (ML) and Artificial Intelligence (AI), data leaders are adjusting and perfecting data quality management frameworks. Traditionally, there are two stages in data quality: raw unprofiled data and cleansed data, free of common errors and commonly used for business intelligence (BI). But, companies at the forefront of data-driven decision-making have realized that data quality needs to level up--and this is where ML-ready data comes in.

bronze, preparing machine learning data, silver layer, (12 more...)

#artificialintelligence

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.95)

Add feedback

Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings

Xu, Kan, Zhao, Xuanyi, Bastani, Hamsa, Bastani, Osbert

arXiv.org Machine LearningApr-18-2021

Sparse regression has recently been applied to enable transfer learning from very limited data. We study an extension of this approach to unsupervised learning -- in particular, learning word embeddings from unstructured text corpora using low-rank matrix factorization. Intuitively, when transferring word embeddings to a new domain, we expect that the embeddings change for only a small number of words -- e.g., the ones with novel meanings in that domain. We propose a novel group-sparse penalty that exploits this sparsity to perform transfer learning when there is very little text data available in the target domain -- e.g., a single article of text. We prove generalization bounds for our algorithm. Furthermore, we empirically evaluate its effectiveness, both in terms of prediction accuracy in downstream tasks as well as the interpretability of the results.

estimator, group-sparse matrix factorization, inequality, (11 more...)

arXiv.org Machine Learning

2104.08928

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

Ding, Bosheng, Liu, Linlin, Bing, Lidong, Kruengkrai, Canasai, Nguyen, Thien Hai, Joty, Shafiq, Si, Luo, Miao, Chunyan

arXiv.org Artificial IntelligenceNov-3-2020

Data augmentation techniques have been widely used to improve machine learning performance as they enhance the generalization capability of models. In this work, to generate high quality synthetic data for low-resource tagging tasks, we propose a novel augmentation method with language models trained on the linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings. For the supervised settings, we conduct extensive experiments on named entity recognition (NER), part of speech (POS) tagging and end-to-end target based sentiment analysis (E2E-TBSA) tasks. For the semi-supervised settings, we evaluate our method on the NER task under the conditions of given unlabeled data only and unlabeled data plus a knowledge base. The results show that our method can consistently outperform the baselines, particularly when the given gold training data are less.

machine learning, natural language, training data, (19 more...)

arXiv.org Artificial Intelligence

2011.01549

Country:

North America > United States > Alabama (0.04)
Asia > Singapore (0.04)
Oceania > Australia (0.04)
(12 more...)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback