AITopics

2402.07255

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Africa > Middle East > Morocco (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Education > Curriculum > Subject-Specific Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

arXiv.org Artificial IntelligenceFeb-10-2024

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Li, Ruizhe, Zhang, Dong, Chen, Zhehuai, Chng, Eng Siong

Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

gentranslate, hypothesis, translation, (17 more...)

2402.06894

Country: Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

arXiv.org Artificial IntelligenceFeb-10-2024

A Tale of Tails: Model Collapse as a Change of Scaling Laws

Dohmatob, Elvis, Feng, Yunzhen, Yang, Pu, Charton, Francois, Kempe, Julia

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models, still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the ''un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.

model collapse, scaling law, synthesized data, (16 more...)

2402.07043

Country:

North America > United States > New York (0.04)
North America > Dominican Republic (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.45)

Gantt, William, Martin, Alexander, Kuchmiichuk, Pavlo, White, Aaron Steven

Event-Keyed Summarization

arXiv.org Artificial IntelligenceFeb-10-2024

We introduce event-keyed summarization (EKS), a novel task that marries traditional summarization and document-level event extraction, with the goal of generating a contextualized summary for a specific event, given a document and an extracted event structure. We introduce a dataset for this task, MUCSUM, consisting of summaries of all events in the classic MUC-4 dataset, along with a set of baselines that comprises both pretrained LM standards in the summarization literature, as well as larger frontier models. We show that ablations that reduce EKS to traditional summarization or structure-to-text yield inferior summaries of target events and that MUCSUM is a robust benchmark for this task. Lastly, we conduct a human evaluation of both reference and model summaries, and provide some detailed analysis of the results.

computational linguistic, proceedings, summarization, (16 more...)

2402.06973

Country:

North America > El Salvador (0.15)
North America > United States > Virginia > Fairfax County > McLean (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(14 more...)

Genre: Research Report (1.00)

Industry: Law Enforcement & Public Safety > Terrorism (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Amadeus, Marcellus, Castañeda, William Alberto Cruz

Evaluation Metrics for Text Data Augmentation in NLP

Recent surveys on data augmentation for natural language processing have reported different techniques and advancements in the field. Several frameworks, tools, and repositories promote the implementation of text data augmentation pipelines. However, a lack of evaluation criteria and standards for method comparison due to different tasks, metrics, datasets, architectures, and experimental settings makes comparisons meaningless. Also, a lack of methods unification exists and text data augmentation research would benefit from unified metrics to compare different augmentation methods. Thus, academics and the industry endeavor relevant evaluation metrics for text data augmentation techniques. The contribution of this work is to provide a taxonomy of evaluation metrics for text augmentation methods and serve as a direction for a unified benchmark. The proposed taxonomy organizes categories that include tools for implementation and metrics calculation. Finally, with this study, we intend to present opportunities to explore the unification and standardization of text data augmentation metrics.

augmentation, classification, data augmentation, (15 more...)

2402.06766

Country:

South America > Brazil (0.14)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > Washington > King County > Seattle (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Singh, Shivalika, Vargus, Freddie, Dsouza, Daniel, Karlsson, Börje F., Mahendiran, Abinaya, Ko, Wei-Yin, Shandilya, Herumb, Patel, Jay, Mataciunas, Deividas, OMahony, Laura, Zhang, Mike, Hettiarachchi, Ramith, Wilson, Joseph, Machado, Marina, Moura, Luisa Souza, Krzemiński, Dominik, Fadaei, Hakimeh, Ergün, Irem, Okoh, Ifeoma, Alaagib, Aisha, Mudannayake, Oshan, Alyafeai, Zaid, Chien, Vu Minh, Ruder, Sebastian, Guthikonda, Surya, Alghamdi, Emad A., Gehrmann, Sebastian, Muennighoff, Niklas, Bartolo, Max, Kreutzer, Julia, Üstün, Ahmet, Fadaee, Marzieh, Hooker, Sara

average prompt and completion length, information processing system, international joint conference, (15 more...)

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

2402.06619

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Washington > King County > Seattle (0.13)
Europe > Finland > Southwest Finland > Turku (0.04)
(64 more...)

Genre: Overview (1.00)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Information Technology (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Graham, Yvette, Qureshi, Mohammed Rameez, Khalid, Haider, Lampouras, Gerasimos, Iacobacci, Ignacio, Liu, Qun

Findings of the First Workshop on Simulating Conversational Intelligence in Chat

The aim of this workshop is to bring together experts working on open-domain dialogue research. In this speedily advancing research area many challenges still exist, such as learning information from conversations, engaging in realistic and convincing simulation of human intelligence and reasoning. SCI-CHAT follows previous workshops on open domain dialogue but with a focus on the simulation of intelligent conversation as judged in a live human evaluation. Models aim to include the ability to follow a challenging topic over a multi-turn conversation, while positing, refuting and reasoning over arguments. The workshop included both a research track and shared task. The main goal of this paper is to provide an overview of the shared task and a link to an additional paper that will include an in depth analysis of the shared task results following presentation at the workshop.

computational linguistic, evaluation, yvette graham, (11 more...)

2402.0642

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.16)
Europe > United Kingdom > England > Greater London > London (0.05)
Asia > China > Hong Kong (0.05)
(8 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.35)

Gete, Harritxu, Etchegoyhen, Thierry

Promoting Target Data in Context-aware Neural Machine Translation

Standard context-aware neural machine translation (NMT) typically relies on parallel document-level data, exploiting both source and target contexts. Concatenation-based approaches in particular, still a strong baseline for document-level NMT, prepend source and/or target context sentences to the sentences to be translated, with model variants that exploit equal amounts of source and target data on each side achieving state-of-the-art results. In this work, we investigate whether target data should be further promoted within standard concatenation-based approaches, as most document-level phenomena rely on information that is present on the target language side. We evaluate novel concatenation-based variants where the target context is prepended to the source language, either in isolation or in combination with the source context. Experimental results in English-Russian and Basque-Spanish show that including target context in the source leads to large improvements on target language phenomena. On source-dependent phenomena, using only target language context in the source achieves parity with state-of-the-art concatenation approaches, or slightly underperforms, whereas combining source and target context on the source side leads to significant gains across the board.

computational linguistic, machine translation, translation, (13 more...)

2402.06342

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(20 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Savoldi, Beatrice, Piergentili, Andrea, Fucci, Dennis, Negri, Matteo, Bentivogli, Luisa

A Prompt Response to the Demand for Automatic Gender-Neutral Translation

arXiv.org Artificial IntelligenceFeb-8-2024

Gender-neutral translation (GNT) that avoids biased and undue binary assumptions is a pivotal challenge for the creation of more inclusive translation technologies. Advancements for this task in Machine Translation (MT), however, are hindered by the lack of dedicated parallel data, which are necessary to adapt MT systems to satisfy neutral constraints. For such a scenario, large language models offer hitherto unforeseen possibilities, as they come with the distinct advantage of being versatile in various (sub)tasks when provided with explicit instructions. In this paper, we explore this potential to automate GNT by comparing MT with the popular GPT-4 model. Through extensive manual analyses, our study empirically reveals the inherent limitations of current MT systems in generating GNTs and provides valuable insights into the potential and challenges associated with prompting for neutrality.

computational linguistic, proceedings, translation, (14 more...)

2402.06041

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > Canada > Ontario > Toronto (0.05)
North America > United States > Washington > King County > Seattle (0.04)
(10 more...)

Genre: Research Report (0.64)

Industry:

Media (0.46)
Commercial Services & Supplies > Security & Alarm Services (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceFeb-7-2024

Towards Boosting Many-to-Many Multilingual Machine Translation with Large Language Models

Gao, Pengzhi, He, Zhongjun, Wu, Hua, Wang, Haifeng

The training paradigm for machine translation has gradually shifted, from learning neural machine translation (NMT) models with extensive parallel corpora to instruction finetuning on multilingual large language models (LLMs) with high-quality translation pairs. In this paper, we focus on boosting many-to-many multilingual translation of LLMs with an emphasis on zero-shot translation directions. We demonstrate that prompt strategies adopted during finetuning are crucial to zero-shot translation and introduce a cross-lingual consistency regularization, XConST, to bridge the representation gap among different languages and improve zero-shot translation performance. XConST is not a new method, but a version of CrossConST (Gao et al., 2023a) adapted for translation instruction finetuning with LLMs. Experimental results on ALMA (Xu et al., 2023), Tower (Team, 2024), and LLaMA-2 (Touvron et al., 2023) show that our approach consistently improves translation performance. Our implementations are available at https://github.com/gpengzhi/CrossConST-LLM.

ru zh, translation, translation performance, (11 more...)

2401.05861

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > Canada > Quebec > Montreal (0.04)
(11 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)