AITopics | Lieret, Kilian

Lieret, Kilian

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Yang, John, Jimenez, Carlos E., Zhang, Alex L., Lieret, Kilian, Yang, Joyce, Wu, Xindi, Press, Ori, Muennighoff, Niklas, Synnaeve, Gabriel, Narasimhan, Karthik R., Yang, Diyi, Wang, Sida I., Press, Ofir

arXiv.org Artificial Intelligence, Oct-4-2024

Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M), to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Each SWE-bench M task instance contains at least one image in its problem statement or unit tests. Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization. Lastly, we show that SWE-agent's flexible language-agnostic features enable it to substantially outperform alternatives on SWE-bench M, resolving 12% of task instances compared to 6% for the next best system.

large language model, machine learning, programming language, (23 more...)

arXiv.org Artificial Intelligence

2410.03859

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Software (0.66)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Software Engineering (1.00)
Information Technology > Human Computer Interaction > Interfaces (1.00)
(5 more...)

EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

Abramovich, Talor, Udeshi, Meet, Shao, Minghao, Lieret, Kilian, Xi, Haoran, Milner, Kimberly, Jancheska, Sofija, Yang, John, Jimenez, Carlos E., Khorrami, Farshad, Krishnamurthy, Prashanth, Dolan-Gavitt, Brendan, Shafique, Muhammad, Narasimhan, Karthik, Karri, Ramesh, Press, Ofir

arXiv.org Artificial Intelligence, Sep-24-2024

Although language model (LM) agents are demonstrating growing potential in many domains, their success in cybersecurity has been limited due to simplistic design and the lack of fundamental features for this domain. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. EnIGMA introduces new Agent-Computer Interfaces (ACIs) to improve the success rate on CTF challenges. We establish the novel Interactive Agent Tool concept, which enables LM agents to run interactive command-line utilities essential for these challenges. Empirical analysis of EnIGMA on over 350 CTF challenges from three different benchmarks indicates that providing a robust set of new tools with demonstration of their usage helps the LM solve complex problems and achieves state-of-the-art results on the NYU CTF and Intercode-CTF benchmarks. Finally, we discuss insights on ACI design and agent behavior on cybersecurity tasks that highlight the need to adapt real-world tools for LM agents.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2409.16165

Country: North America > United States (0.67)

Genre:

Research Report (0.82)
Workflow (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education (1.00)
Government > Military > Cyberwarfare (0.70)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

SciCode: A Research Coding Benchmark Curated by Scientists

Tian, Minyang, Gao, Luyu, Zhang, Shizhuo Dylan, Chen, Xinan, Fan, Cunwei, Guo, Xuefei, Haas, Roland, Ji, Pan, Krongchon, Kittithat, Li, Yao, Liu, Shengyan, Luo, Di, Ma, Yutao, Tong, Hao, Trinh, Kha, Tian, Chenyu, Wang, Zihan, Wu, Bohao, Xiong, Yanyu, Yin, Shengzhu, Zhu, Minhui, Lieret, Kilian, Lu, Yanxin, Liu, Genglin, Du, Yufeng, Tao, Tianhua, Press, Ofir, Callan, Jamie, Huerta, Eliu, Peng, Hao

arXiv.org Artificial Intelligence, Jul-18-2024

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information, as well as scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.

large language model, machine learning, main problem, (20 more...)

arXiv.org Artificial Intelligence

2407.13168

Country:

North America > United States > Illinois (0.28)
North America > United States > Texas (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Yang, John, Jimenez, Carlos E., Wettig, Alexander, Lieret, Kilian, Yao, Shunyu, Narasimhan, Karthik, Press, Ofir

arXiv.org Artificial Intelligence, May-30-2024

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that enables LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with pass@1 rates of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2405.15793

Country:

North America > United States > Texas (0.13)
Asia > Middle East > Israel > Mediterranean Sea (0.13)

Genre:

Research Report (1.00)
Workflow (0.92)

Industry: Information Technology > Security & Privacy (0.92)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(3 more...)

High Pileup Particle Tracking with Object Condensation

Lieret, Kilian, DeZoort, Gage, Chatterjee, Devdoot, Park, Jian, Miao, Siqi, Li, Pan

arXiv.org Artificial Intelligence, Dec-6-2023

Traditional charged particle tracking algorithms at the Large Hadron Collider (LHC) are based on the combinatorial Kalman filter. However, this class of algorithms exhibits sub-optimal scaling with respect to pileup, rendering tracking a bottleneck for future experiments such as the High Luminosity LHC (HL-LHC) [1]. This has prompted research into tracking algorithms leveraging graph neural networks (GNNs) or similar machine learning (ML) architectures demonstrating improved computational scaling. Recent results have confirmed that GNN-based algorithms can indeed achieve linear scaling with pileup [2, 3]. The majority of GNN approaches adopt an edge classification (EC) approach to tackle the tracking problem. Given an initial graph that connects all hits that potentially belong to the same particle, a GNN is trained to remove edges that connect hits belonging to different particles.
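
The edge classification idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-edge scores below stand in for the output of a trained GNN, the 0.5 threshold is arbitrary, and grouping the surviving edges into connected components (via a small union-find) is an assumed post-processing step that the abstract does not spell out. The function name `build_track_candidates` is hypothetical.

```python
from collections import defaultdict


def find(parent, i):
    """Return the root of hit i, shortening the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def build_track_candidates(num_hits, edges, edge_scores, threshold=0.5):
    """Drop low-scoring edges, then group hits by connected component.

    edges       -- list of (hit_a, hit_b) pairs from the initial graph
    edge_scores -- one score per edge; assumed to come from a trained
                   edge classifier (higher = more likely same particle)
    threshold   -- hypothetical cut; edges scoring below it are removed
    """
    parent = list(range(num_hits))
    for (a, b), score in zip(edges, edge_scores):
        if score >= threshold:  # keep the edge: merge both components
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_b] = root_a
    groups = defaultdict(list)
    for hit in range(num_hits):
        groups[find(parent, hit)].append(hit)
    return sorted(groups.values())


# Five hits in a chain; the edge (2, 3) is classified as a false connection.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
scores = [0.9, 0.8, 0.1, 0.95]
print(build_track_candidates(5, edges, scores))  # [[0, 1, 2], [3, 4]]
```

Each returned list is one track candidate. A single wrongly kept edge merges two candidates into one, which is a known weakness of purely edge-based reconstruction.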

artificial intelligence, machine learning, particle, (15 more...)

arXiv.org Artificial Intelligence

2312.03823

Country: North America > United States > New York (0.14)

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
