AITopics | synthetic corpora

Collaborating Authors

synthetic corpora

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Ulm, Jannek, Du, Kevin, Snæbjarnarson, Vésteinn

arXiv.org Artificial IntelligenceOct-10-2025

Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.

computational linguistic, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2510.08245

Country:

Europe (0.93)
North America > United States (0.46)
North America > Mexico (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.85)

Add feedback

Data-Constrained Synthesis of Training Data for De-Identification

Vakili, Thomas, Henriksson, Aron, Dalianis, Hercules

arXiv.org Artificial IntelligenceFeb-21-2025

Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.

corpora, experiment, synthetic corpora, (15 more...)

arXiv.org Artificial Intelligence

2502.14677

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Montserrat (0.04)
North America > Canada > Ontario > Toronto (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Quality > Quantity: Synthetic Corpora from Foundation Models for Closed-Domain Extractive Question Answering

Sengupta, Saptarshi, Heaton, Connor, Ghosh, Shreya, Nakov, Preslav, Mitra, Prasenjit

arXiv.org Artificial IntelligenceOct-25-2023

Domain adaptation, the process of training a model in one domain and applying it to another, has been extensively explored in machine learning. While training a domain-specific foundation model (FM) from scratch is an option, recent methods have focused on adapting pre-trained FMs for domain-specific tasks. However, our experiments reveal that either approach does not consistently achieve state-of-the-art (SOTA) results in the target domain. In this work, we study extractive question answering within closed domains and introduce the concept of targeted pre-training. This involves determining and generating relevant data to further pre-train our models, as opposed to the conventional philosophy of utilizing domain-specific FMs trained on a wide range of data. Our proposed framework uses Galactica to generate synthetic, ``targeted'' corpora that align with specific writing styles and topics, such as research papers and radiology reports. This process can be viewed as a form of knowledge distillation. We apply our method to two biomedical extractive question answering datasets, COVID-QA and RadQA, achieving a new benchmark on the former and demonstrating overall improvements on the latter. Code available at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.

foundation model, quantity, synthetic corpora

arXiv.org Artificial Intelligence

2310.16995

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.87)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.80)

Add feedback

WeTS: A Benchmark for Translation Suggestion

Yang, Zhen, Meng, Fandong, Zhang, Yingxue, Li, Ernan, Zhou, Jie

arXiv.org Artificial IntelligenceOct-10-2022

Translation Suggestion (TS), which provides alternatives for specific words or phrases given the entire documents translated by machine translation (MT) \cite{lee2021intellicat}, has been proven to play a significant role in post editing (PE). However, there is still no publicly available data set to support in-depth research for this problem, and no reproducible experimental results can be followed by researchers in this community. To break this limitation, we create a benchmark data set for TS, called \emph{WeTS}, which contains golden corpus annotated by expert translators on four translation directions. Apart from the human-annotated golden corpus, we also propose several novel methods to generate synthetic corpus which can substantially improve the performance of TS. With the corpus we construct, we introduce the Transformer-based model for TS, and experimental results show that our model achieves State-Of-The-Art (SOTA) results on all four translation directions, including English-to-German, German-to-English, Chinese-to-English and English-to-Chinese. Codes and corpus can be found at https://github.com/ZhenYangIACAS/WeTS.git.

machine learning, natural language, translation, (19 more...)

arXiv.org Artificial Intelligence

2110.05151

Country:

Europe > France (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(7 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Semantic Relatedness and Taxonomic Word Embeddings

Kacmajor, Magdalena, Kelleher, John D., Klubicka, Filip, Maldonado, Alfredo

arXiv.org Artificial IntelligenceFeb-14-2020

This paper 1 connects a series of papers dealing with taxonomic word embeddings. It begins by noting that there are different types of semantic relatedness and that different lexical representations encode different forms of relatedness. A particularly important distinction within semantic relatedness is that of thematic versus taxonomic relatedness. Next, we present a number of experiments that analyse taxonomic embeddings that have been trained on a synthetic corpus that has been generated via a random walk over a taxonomy. These experiments demonstrate how the properties of the synthetic corpus, such as the percentage of rare words, are affected by the shape of the knowledge graph the corpus is generated from. Finally, we explore the interactions between the relative sizes of natural and synthetic corpora on the performance of embeddings when taxonomic and thematic embeddings are combined.

arXiv.org Artificial Intelligence

2002.06235

Country:

South America > Uruguay > Maldonado > Maldonado (0.06)
North America > United States > Colorado > Denver County > Denver (0.04)
Europe > Spain > Galicia > Madrid (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Bilingual Embeddings with Random Walks over Multilingual Wordnets

Goikoetxea, J., Soroa, A., Agirre, E.

arXiv.org Artificial IntelligenceApr-23-2018

Bilingual word embeddings represent words of two languages in the same space, and allow to transfer knowledge from one language to the other without machine translation. The main approach is to train monolingual embeddings first and then map them using bilingual dictionaries. In this work, we present a novel method to learn bilingual embeddings based on multilingual knowledge bases (KB) such as WordNet. Our method extracts bilingual information from multilingual wordnets via random walks and learns a joint embedding space in one go. We further reinforce cross-lingual equivalence adding bilingual con- straints in the loss function of the popular skipgram model. Our experiments involve twelve cross-lingual word similarity and relatedness datasets in six lan- guage pairs covering four languages, and show that: 1) random walks over mul- tilingual wordnets improve results over just using dictionaries; 2) multilingual wordnets on their own improve over text-based systems in similarity datasets; 3) the good results are consistent for large wordnets (e.g. English, Spanish), smaller wordnets (e.g. Basque) or loosely aligned wordnets (e.g. Italian); 4) the combination of wordnets and text yields the best results, above mapping-based approaches. Our method can be applied to richer KBs like DBpedia or Babel- Net, and can be easily extended to multilingual embeddings. All software and resources are open source.

artificial intelligence, natural language, text processing, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.knosys.2018.03.017

1804.08316

Country:

North America > United States (1.00)
Europe (0.67)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback