 polyglot


Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs

Dai, Hankun, Wang, Maoquan, Qi, Mengnan, Zhang, Yikai, Jin, Zijian, Yao, Yongqiang, Huang, Yufan, Fu, Shengyu, Nallipogu, Elsie

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly being applied to programming tasks, ranging from single-turn code completion to autonomous agents. Current code agent designs frequently depend on complex, hand-crafted workflows and tool sets. However, this reliance on elaborate scaffolding presents several challenges: agent performance becomes overly dependent on prompt tuning and custom design choices, heavy human intervention obscures a model's true underlying capabilities, and intricate pipelines are costly to build and maintain. Furthermore, optimizing complex task prompts increases the risk of data leakage. Currently, when introducing new models, LLM providers like OpenAI and Anthropic often publish benchmark scores to demonstrate their models' coding proficiency, but keep their proprietary evaluation frameworks confidential. To address these limitations, we introduce Lita (Lite Agent), which operationalizes liteness, a principle of minimizing manual design while retaining the essential elements of a fully autonomous agent. Lita enables a more faithful and unified evaluation without elaborate scaffolding. Experiments on the Aider Polyglot and SWE-Bench with frontier models demonstrate that Lita achieves competitive or superior performance compared to workflow-based and agentic baselines. Crucially, Lita also consumes fewer tokens and requires significantly less design effort. Our results suggest that Lita is sufficient to reveal the underlying coding competence of modern LLMs. Finally, we propose the Agent Complexity Law: the performance gap between agents of varying complexity, from simple to sophisticated designs, will shrink as the core model improves, ultimately converging to a negligible difference.


Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Zhang, Jenny, Hu, Shengran, Lu, Cong, Lange, Robert, Clune, Jeff

arXiv.org Artificial Intelligence

Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
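The archive-growing loop described in this abstract can be pictured with a toy sketch. This is not the DGM implementation: agents here are plain integers, `toy_modify` and `toy_evaluate` are hypothetical stand-ins for the foundation-model rewrite and the coding-benchmark evaluation, and uniform parent sampling and keeping every child are assumptions made for brevity.

```python
import random

def open_ended_search(initial_agent, modify, evaluate, steps, seed=0):
    """Toy archive-based search: sample a parent, derive a child, score it, keep it."""
    rng = random.Random(seed)
    archive = [(initial_agent, evaluate(initial_agent))]
    for _ in range(steps):
        # Assumption: parents are sampled uniformly from the archive.
        parent, _ = rng.choice(archive)
        child = modify(parent, rng)
        # Each child is scored empirically and added to the archive,
        # so weaker agents can still serve as stepping stones later.
        archive.append((child, evaluate(child)))
    return archive

# Hypothetical stand-ins: an "agent" is an integer, and its score
# is higher the closer it is to 5.
def toy_modify(agent, rng):
    return agent + rng.choice([-1, 1])

def toy_evaluate(agent):
    return -abs(agent - 5)

archive = open_ended_search(0, toy_modify, toy_evaluate, steps=50)
best_agent, best_score = max(archive, key=lambda pair: pair[1])
```

Because nothing is discarded, the archive forms a tree rooted at the initial agent, which is the property the abstract credits for exploring many paths in parallel.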


Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMs

Khelladi, Djamel Eddine, Reux, Charly, Acher, Mathieu

arXiv.org Artificial Intelligence

Large language model (LLM)-based test generation has gained attention in software engineering, yet most studies evaluate LLMs' ability to generate unit tests in a single attempt for a given language, missing the opportunity to leverage LLM diversity for more robust testing. This paper introduces PolyTest, a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest systematically leverages these properties in two complementary ways: (1) Cross-lingual test generation, where tests are generated in multiple languages at zero temperature and then unified; (2) Diverse test sampling, where multiple test sets are generated within the same language at a higher temperature before unification. A key insight is that LLMs can generate diverse yet contradicting tests -- same input, different expected outputs -- across languages and generations. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality. Unlike single-language or single-attempt approaches, PolyTest enhances testing without requiring on-the-fly execution, making it particularly beneficial for weaker-performing languages. We evaluate PolyTest on Llama3-70B, GPT-4o, and GPT-3.5 using EvalPlus, generating tests in five languages (Java, C, Python, JavaScript, and a CSV-based format) at temperature 0 and sampling multiple sets at temperature 1. We observe that LLMs frequently generate contradicting tests across settings, and that PolyTest significantly improves test quality across all considered metrics -- number of tests, passing rate, statement/branch coverage (up to +9.01%), and mutation score (up to +11.23%). Finally, PolyTest outperforms Pynguin in test generation, passing rate, and mutation score.
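One way to picture the unification of contradicting tests is a majority vote over expected outputs for each input. The abstract does not specify the unification rule, so the voting scheme below, the tie handling, and the name `unify_tests` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def unify_tests(test_sets):
    """Merge several test sets of (input, expected_output) pairs.

    Returns the unified tests and the inputs on which the sets disagreed.
    Assumption: the most frequent expected output wins; ties are dropped.
    """
    votes = defaultdict(Counter)
    for test_set in test_sets:
        for test_input, expected in test_set:
            votes[test_input][expected] += 1

    unified, conflicts = {}, []
    for test_input, counter in votes.items():
        ranked = counter.most_common()
        if len(ranked) > 1:
            conflicts.append(test_input)  # same input, different expected outputs
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            unified[test_input] = ranked[0][0]
    return unified, conflicts

# Three hypothetical test sets for an add(a, b) function, e.g. one per language.
test_sets = [
    [((1, 2), 3), ((0, 0), 0)],
    [((1, 2), 3), ((0, 0), 1)],
    [((1, 2), 4), ((0, 0), 0), ((2, 2), 4)],
]
unified, conflicts = unify_tests(test_sets)
```

Note that this sketch never executes the code under test, which matches the abstract's point that unification works without on-the-fly execution.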


On the Abuse and Detection of Polyglot Files

Koch, Luke, Oesch, Sean, Chaulagain, Amul, Dixon, Jared, Dixon, Matthew, Huettal, Mike, Sadovnik, Amir, Watson, Cory, Weber, Brian, Hartman, Jacob, Patulski, Richard

arXiv.org Artificial Intelligence

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
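To make the definition concrete, here is a minimal sketch of a naive signature check that flags a byte string matching more than one format. It is not PolyConv or ImSan, and the function name `matching_formats` is hypothetical. It relies on well-known markers: JPEG and PNG signatures sit at the start of the file, ZIP readers locate the end-of-central-directory record near the end, and the 1024-byte window for the PDF header is an assumption reflecting how lenient some readers are. A signature match only suggests a format; it does not prove the file parses as valid.

```python
def matching_formats(data: bytes) -> set:
    """Return the formats whose signature heuristics match the given bytes."""
    formats = set()
    if data.startswith(b"\xff\xd8\xff"):
        formats.add("jpeg")
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        formats.add("png")
    # Assumption: accept a PDF header anywhere in the first 1024 bytes.
    if b"%PDF-" in data[:1024]:
        formats.add("pdf")
    # The ZIP end-of-central-directory record is 22 bytes plus a comment
    # of at most 65535 bytes, so it lies within the last 65557 bytes.
    if b"PK\x05\x06" in data[-65557:]:
        formats.add("zip")
    return formats

# Hypothetical sample: a JPEG header followed by an empty ZIP archive record.
jpeg_zip = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"PK\x05\x06" + b"\x00" * 18
found = matching_formats(jpeg_zip)
is_polyglot_candidate = len(found) >= 2
```

A tool that routes files by the first matching signature would treat `jpeg_zip` as an image only, which is the routing weakness the abstract describes.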


Attending Form and Context to Generate Specialized Out-of-Vocabulary Words Representations

Garneau, Nicolas, Leboeuf, Jean-Samuel, Pinter, Yuval, Lamontagne, Luc

arXiv.org Machine Learning

We propose a new contextual-compositional neural network layer that handles out-of-vocabulary (OOV) words in natural language processing (NLP) tagging tasks. This layer consists of a model that attends to both the character sequence and the context in which the OOV words appear. We show that our model learns to generate task-specific and sentence-dependent OOV word representations without the need for pre-training on an embedding table, unlike previous attempts. We insert our layer in the state-of-the-art tagging model of Plank et al. (2016) and thoroughly evaluate its contribution on 23 different languages on the task of jointly tagging part-of-speech and morphosyntactic attributes. Our OOV handling method successfully improves performances of this model on every language but one to achieve a new state-of-the-art on the Universal Dependencies Dataset 1.4.


Polyglot!

Communications of the ACM

Google speaks 106 languages--or at least can understand queries in written form if not also oral form. When I watch someone interacting verbally with Google Assistant in languages other than English (my native tongue), I realize Google's language ability vastly exceeds my own. I have a modest ability to speak and understand German. I know a few phrases in Russian and French. But it suddenly strikes me that Google is usefully dealing with over 100 languages in written and oral form.


AI Weekly: The AI research agenda for the next 20 years is being made now

#artificialintelligence

It's mind-blowing how much world-shaping work gets done in hotel ballrooms. Machine learning experts regularly gather at conferences around the world to discuss noteworthy work and how to move the industry forward. Few are fortunate enough to attend in person, but you can sometimes find video online. The most recent example: The Association for the Advancement of Artificial Intelligence (AAAI) met in Hawaii last week, and among the topics discussed was the roadmap for AI research in the United States for the next 20 years. The process to create a plan for the next two decades started in November with private workshops attended by academics and people from industry.


Facebook's 'polyglot' AI speaks English, German, and Spanish

#artificialintelligence

But they hold particular promise in the text-to-speech (TTS) realm, as evidenced by systems like Google's WaveNet, Baidu's DeepVoice, and WaveLoop. Another case in point: an artificially intelligent (AI) 'polyglot' system created by researchers at Facebook that's able to, given voice data, produce new speech samples in multiple languages. The team describes their work in a paper ("Unsupervised Polyglot Text-to-Speech") published on the preprint server Arxiv.org. "The … [AI] is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages," they wrote. "[It can] take a sample of a speaker talking in one language and have [them] … speak as a native speaker in another language."