english sentence
Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models
Browder, Michael, Duh, Kevin, Harris, J. David, Lyzinski, Vince, McNamee, Paul, Park, Youngser, Priebe, Carey E., Viechnicki, Peter
Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (9 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.48)
SETUP: Sentence-level English-To-Uniform Meaning Representation Parser
Markle, Emma, Bach, Javier Gutierrez, Wein, Shira
Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other, which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.
- Europe > Austria > Vienna (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (9 more...)
- Research Report (0.64)
- Overview (0.46)
SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
Thomas, Marshall, Fish, Edward, Bowden, Richard
Despite progress in gloss-free Sign Language Translation (SLT), traditional single modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has side stepped this challenge, forcing a single network to learn these simultaneously resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets with a BLEU-4 score of 22.1, 73.2% letter accuracy and BLEU-4 score of 6.8 respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
- Europe > United Kingdom > England > Surrey > Guildford (0.40)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Ireland (0.04)
Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic
Pan, Muyu, Kodakandla, Dheeraj, Farooque, Mahfuza
Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations-incorrect outputs generated by LLMs are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
IASC: Interactive Agentic System for ConLangs
Taguchi, Chihiro, Sproat, Richard
We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France (0.04)
- (15 more...)
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Yan, Brian, Hamed, Injy, Shimizu, Shuichiro, Lodagala, Vasista, Chen, William, Iakovenko, Olga, Talafha, Bashar, Hussein, Amir, Polok, Alexander, Chang, Kalvin, Klement, Dominik, Althubaiti, Sara, Peng, Puyuan, Wiesner, Matthew, Solorio, Thamar, Ali, Ahmed, Khudanpur, Sanjeev, Watanabe, Shinji, Chen, Chih-Chen, Wu, Zhen, Benharrak, Karim, Diwan, Anuj, Cornell, Samuele, Yeo, Eunjung, Choi, Kwanghee, Carvalho, Carlos, Rosero, Karen
CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with the generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > Canada > British Columbia (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (6 more...)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)
Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
Mohamed, Amr, Zhang, Yang, Vazirgiannis, Michalis, Shang, Guokan
Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text$\unicode{x2013}$even under linguistic constraints$\unicode{x2013}$embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Costa Rica > Heredia Province > Heredia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (13 more...)
Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements
Oguz, Metehan, Bakman, Yavuz, Yaldiz, Duygu Nur
Large Language Models (LLMs) have demonstrated impressive performances in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexical like I, you, here and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit an impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g. quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: https://github.com/metehanoguzz/LLMs-Indexicals-English.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.10)
- North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.05)
- (4 more...)