NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

Saadi, Hossain Shaikh, Alam, Faria, Sanz-Guerrero, Mario, Bui, Minh Duc, Mager, Manuel, von der Wense, Katharina

arXiv.org Artificial Intelligence

This paper presents JGU Mainz's winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioned on the error messages, the current program, and the relevant test cases, generates a revised solution. With this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. Our code is publicly available.
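The generate-test-debug loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for any chat-completion client, and the agent prompts are invented for clarity.

```python
import subprocess

def call_llm(prompt: str) -> str:
    # Hypothetical LLM client; plug in a real chat-completion API here.
    raise NotImplementedError

def run_tests(program: str, tests: list[str]) -> list[tuple[str, str]]:
    """Run each assert-based test against the program in a subprocess;
    return failures as (test, error trace) pairs."""
    failures = []
    for test in tests:
        script = program + "\n" + test
        proc = subprocess.run(
            ["python", "-c", script], capture_output=True, text=True, timeout=10
        )
        if proc.returncode != 0:
            failures.append((test, proc.stderr))
    return failures

def solve(instruction: str, tests: list[str], max_rounds: int = 3) -> str:
    # Code-generation agent produces an initial candidate program.
    program = call_llm(f"Write a Python solution for:\n{instruction}")
    for _ in range(max_rounds):
        failures = run_tests(program, tests)
        if not failures:
            break  # all unit tests pass
        # Only the failing cases, their error traces, and the current
        # program are forwarded to the debugger agent.
        report = "\n".join(f"{t}\n{err}" for t, err in failures)
        program = call_llm(
            "Revise this program so the failing tests pass.\n"
            f"Program:\n{program}\nFailures:\n{report}"
        )
    return program
```

Forwarding only the failing tests keeps the debugger's context focused on the actual errors rather than the full test suite.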




Anchor Function

Neural Information Processing Systems

[Appendix excerpt] Figure 7 gives an actual example of how an anchor function impacts the generated solution. The appendix provides additional experimental details and results for Section 3, including anchoring (Appendix A.1), the availability heuristic (Appendix A.3), and filtering prompts for longer canonical solutions; analogous add-var results appear in Figure 10, with full numerical results in Table 7, augmenting Sections 3.3.2 and 3.3.3.


Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning

Ersoy, Asim, Altinisik, Enes, Sencar, Husrev Taha, Darwish, Kareem

arXiv.org Artificial Intelligence

Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.
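In-language tool-calling data of the kind the paper translates and adapts typically pairs a tool schema with an Arabic user turn and the expected function call. The sketch below is illustrative only: the tool name, JSON schema, and Arabic instruction are hypothetical examples in a common function-calling message format, not drawn from the paper's datasets.

```python
import json

# Hypothetical tool schema in a JSON-Schema-style function-calling format.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# One Arabic training example: the user turn means
# "What is the weather in Doha today?", and the target output
# is a structured call to the tool with the city argument in Arabic.
example = {
    "tools": [weather_tool],
    "messages": [
        {"role": "user", "content": "ما حالة الطقس في الدوحة اليوم؟"},
        {
            "role": "assistant",
            "tool_calls": [
                {"name": "get_weather",
                 "arguments": json.dumps({"city": "الدوحة"}, ensure_ascii=False)}
            ],
        },
    ],
}
```

Keeping both the instruction and the argument values in Arabic is exactly what distinguishes in-language data from relying on cross-lingual transfer from English examples.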


Cyber-Zero: Training Cybersecurity Agents without Runtime

Zhuo, Terry Yue, Wang, Dingmin, Ding, Hantian, Kumar, Varun, Wang, Zijian

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems such as DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness. These results demonstrate that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.