AITopics

2407.05189

Country:

Asia > Azerbaijan (0.05)
Asia > Middle East > Iran > Ardabil Province > Ardabil (0.05)
North America > United States > Michigan (0.04)
(8 more...)

Genre: Research Report > New Finding (0.88)

Industry: Education > Curriculum > Subject-Specific Education (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Bachmann, Gregor, Nagarajan, Vaishnavh

The pitfalls of next-token prediction

arXiv.org Artificial Intelligence, Jul-5-2024

Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern, correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism by which teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using a simple modification that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available at https://github.com/gregorbachmann/Next-Token-Failures
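
The distinction between the two phases can be illustrated with a toy sketch. This is not the paper's planning task or its failure mechanism; the lookup-table "model" and the helper names `teacher_forced_pairs` and `autoregressive_decode` are hypothetical, chosen only to show which prefixes each phase conditions on.

```python
def teacher_forced_pairs(target):
    """Training view: every input prefix is taken from the ground truth."""
    return [(tuple(target[:i]), target[i]) for i in range(len(target))]


def autoregressive_decode(next_token, prompt, steps):
    """Inference view: each prediction is appended and fed back as input."""
    seq = list(prompt)
    for _ in range(steps):
        seq.append(next_token(tuple(seq)))
    return seq


target = ["a", "b", "c", "d"]
pairs = teacher_forced_pairs(target)

# Hypothetical stand-in for a trained model: it memorises the training pairs
# and emits "<unk>" for any prefix it never saw during training.
table = dict(pairs)


def model(prefix):
    return table.get(prefix, "<unk>")


print(pairs[2])                                # (('a', 'b'), 'c')
print(autoregressive_decode(model, [], 4))     # ['a', 'b', 'c', 'd']
print(autoregressive_decode(model, ["x"], 3))  # ['x', '<unk>', '<unk>', '<unk>']
```

During training the model only ever sees ground-truth prefixes, while at inference it conditions on its own earlier outputs, so an unfamiliar prefix propagates through every later step.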

language model, next-token prediction, supervision, (13 more...)

2403.06963

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(12 more...)

Genre: Research Report (0.50)

Industry: Education > Curriculum > Subject-Specific Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
(2 more...)

arXiv.org Artificial Intelligence, Jul-5-2024

NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task

Abdul-Mageed, Muhammad, Keleg, Amr, Elmadany, AbdelRahim, Zhang, Chiyu, Hamed, Injy, Magdy, Walid, Bouamor, Houda, Habash, Nizar

We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI's objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on pre-specified tasks. NADI 2024 targeted dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of which 12 participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE on Subtask 2, and 20.44 BLEU on Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

annotator, dataset, dialect, (13 more...)

2407.0491

Country:

Africa > Middle East > Somalia (0.14)
Africa > Middle East > Djibouti (0.14)
Africa > Middle East > Algeria (0.05)
(35 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Communications > Social Media (0.93)

Sildam, Tiia, Velve, Andra, Alumäe, Tanel

Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation

arXiv.org Artificial Intelligence, Jul-4-2024

This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.

speech, training data, translation, (15 more...)

2407.03809

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Estonia > Tartu County > Tartu (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
(10 more...)

Genre: Research Report > New Finding (0.48)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence, Jul-4-2024

A Survey of Data Synthesis Approaches

Chang, Hsin-Yu, Chen, Pei-Yu, Chou, Tun-Hsiang, Kao, Chang-Sheng, Yu, Hsuan-Yun, Lin, Yen-Ting, Chen, Yun-Nung

This paper provides a detailed survey of synthetic data techniques. We first discuss the expected goals of using synthetic data in data augmentation, which can be divided into four parts: 1) Improving Diversity, 2) Data Balancing, 3) Addressing Domain Shift, and 4) Resolving Edge Cases. Data synthesis is closely related to the prevailing machine learning techniques of the time; therefore, we summarize the domain of synthetic data techniques into four categories: 1) Expert-knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Next, we categorize the goals of synthetic data filtering into the following types for discussion: 1) Basic Quality, 2) Label Consistency, and 3) Data Distribution. In section 5 of this paper, we also discuss the future directions of synthetic data and state three directions that we believe are important: 1) focus more on quality, 2) the evaluation of synthetic data, and 3) multi-model data augmentation.

augmentation, data augmentation, dataset, (14 more...)

2407.03672

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(7 more...)

Genre: Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

The Japan Times, Jul-3-2024, 00:24:00 GMT

Clear-screen translation system is being tested at Tokyo's Haneda

Japan Airlines and Toppan said Tuesday that they have begun a demonstration test at Tokyo's Haneda Airport of the clear-screen translation system developed by the printing company. The system can automatically translate spoken words into 13 languages, including English and Korean, and quickly display the translated words and sentences on its transparent screen. It also shows words entered with a keyboard. With its clear screen, the system enables speakers to talk while seeing each other's faces. It is designed to rapidly provide information to foreign travelers and people with hearing difficulties. The trial will be conducted at counters at Haneda Airport's Terminal 1 until Monday and at Osaka International Airport, also known as Itami Airport, in August.

airport, clear-screen translation system, tokyo, (1 more...)

The Japan Times

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.67)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.31)

Industry:

Transportation > Infrastructure & Services > Airport (1.00)
Transportation > Air (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

Cavalin, Paulo, Domingues, Pedro Henrique, Pinhanez, Claudio

Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation

In this paper we show that corpus-level aggregation considerably hinders the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more like neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably owing to the classical average-of-ratios versus ratio-of-averages mathematical problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently only cover a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.
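
The gap between an average of ratios and a ratio of averages can be seen with a small numeric sketch. The per-segment counts below are made up, and the score is a plain matched/total ratio rather than BLEU or chrF; the function names `sentence_level` and `corpus_level` are illustrative assumptions, not the paper's code.

```python
# Hypothetical per-segment counts: (matched n-grams, total candidate n-grams).
segments = [(1, 2), (9, 10), (45, 100)]


def sentence_level(segs):
    """Average of ratios: score each segment, then take the mean."""
    return sum(m / t for m, t in segs) / len(segs)


def corpus_level(segs):
    """Ratio of averages: pool the counts first, then take a single ratio."""
    return sum(m for m, _ in segs) / sum(t for _, t in segs)


print(f"sentence-level: {sentence_level(segments):.3f}")  # 0.617
print(f"corpus-level:   {corpus_level(segments):.3f}")    # 0.491
```

With pooled counts the longest segment dominates the result, whereas the segment-level mean weights every segment equally, which is why the two aggregates can diverge on the same outputs.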

computational linguistic, correlation, metric, (14 more...)

2407.12832

Country:

Asia > Singapore (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > Canada > Ontario > Toronto (0.04)
(12 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Universal Gloss-level Representation for Gloss-free Sign Language Translation and Production

Hwang, Eui Jun, Cho, Sukmin, Lee, Huije, Yoon, Youngwoo, Park, Jong C.

Sign language, essential for the deaf and hard-of-hearing, presents unique challenges in translation and production due to its multimodal nature and the inherent ambiguity in mapping sign language motion to spoken language words. Previous methods often rely on gloss annotations, requiring time-intensive labor and specialized expertise in sign language. Gloss-free methods have emerged to address these limitations, but they often depend on external sign language data or dictionaries, failing to completely eliminate the need for gloss annotations. There is a clear demand for a comprehensive approach that can supplant gloss annotations and be utilized for both Sign Language Translation (SLT) and Sign Language Production (SLP). We introduce Universal Gloss-level Representation (UniGloR), a unified and self-supervised solution for both SLT and SLP, trained on multiple datasets including PHOENIX14T, How2Sign, and NIASL2021. Our results demonstrate UniGloR's effectiveness in the translation and production tasks. We further report an encouraging result for Sign Language Recognition (SLR) on previously unseen data. Our study suggests that self-supervised learning can be performed in a unified manner, paving the way for innovative and practical applications in future research.

dataset, translation, uniglor, (10 more...)

2407.02854

Country:

North America > United States (0.04)
Europe > Russia (0.04)
Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Appicharla, Ramakrishna, Gain, Baban, Pal, Santanu, Ekbal, Asif, Bhattacharyya, Pushpak

A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning

In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences. Recent studies (Li et al., 2020) have shown that the context encoder generates noise and makes the model robust to the choice of context. This paper further investigates this observation by explicitly modelling context encoding through multi-task learning (MTL) to make the model sensitive to the choice of context. We conduct experiments on a cascade MTL architecture, which consists of one encoder and two decoders. Generation of the source from the context is considered an auxiliary task, and generation of the target from the source is the main task. We experimented with German-English language pairs on News, TED, and Europarl corpora. Evaluation results show that the proposed MTL approach performs better than concatenation-based and multi-encoder DocNMT models in low-resource settings and is sensitive to the choice of context. However, we observe that the MTL models fail to generate the source from the context. These observations align with previous studies, and this might suggest that the available document-level parallel corpora are not context-aware, and that a robust sentence-level model can outperform the context-aware models.

computational linguistic, source sentence, translation, (13 more...)

2407.03076

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(12 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Wu, Guojun, Cohen, Shay B., Sennrich, Rico

Evaluating Automatic Metrics with Incremental Machine Translation Systems

We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations. Our study confirms several previous findings in MT metrics research and demonstrates the dataset's value as a testbed for metric evaluation. We release our code at https://github.com/gjwubyron/Evo
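
One way to picture a recency-based check is a pairwise agreement rate: for each pair of translations of the same source collected at different times, count how often the metric scores the newer one higher. The abstract does not spell out the exact protocol, so the measure below, the function name `recency_agreement`, and the weekly scores are all assumptions made for illustration.

```python
from itertools import combinations


def recency_agreement(scores_by_week):
    """Fraction of (older, newer) pairs where the newer output scores higher."""
    ordered = [score for _, score in sorted(scores_by_week.items())]
    pairs = list(combinations(ordered, 2))  # each pair is (older, newer)
    return sum(newer > older for older, newer in pairs) / len(pairs)


# Hypothetical metric scores for one source sentence, keyed by collection week.
scores = {1: 0.61, 2: 0.64, 3: 0.63, 4: 0.70}

print(f"{recency_agreement(scores):.3f}")  # 0.833
```

Under the assumption that commercial systems improve over time, a metric with a higher agreement rate would be the one that more often prefers the more recent translation.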

computational linguistic, language pair, proceedings, (12 more...)

2407.03277

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
North America > United States > Pennsylvania (0.04)
(10 more...)

Genre: Research Report > New Finding (0.94)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)