Callot, Laurent
Large Language Model Critics for Execution-Free Evaluation of Code Changes
Yadavally, Aashish, Nguyen, Hoan, Callot, Laurent, Guinet, Gauthier
Large language models (LLMs) offer a promising way forward for automating software engineering tasks, such as bug fixes and feature additions, via multi-step LLM-based agentic workflows. However, existing metrics for evaluating such workflows, mainly build status and occasionally log analysis, are too sparse and limited to provide the information needed to assess the quality of the changes made. In this work, we design LLM-based critics to derive well-structured and rigorous intermediate, step-level, execution-free evaluation proxies for repository-level code changes. Importantly, we assume access to the gold test patch for the problem (i.e., the critics are reference-aware), which allows us to assess both the semantics and the executability of generated patches. With the gold test patch as a reference, we predict the executability of all editing locations with an F1 score of 91.6%, and by aggregating these predictions we can predict the build status in 84.8% of the instances in SWE-bench. In particular, this execution-focused LLM critic outperforms other reference-free and reference-aware LLM critics by 38.9% to 72.5%. Moreover, we demonstrate the usefulness of such a reference-aware framework in comparing patches generated by different agentic workflows. Finally, we open-source the library developed for this project, which can be applied to other agentic workflows and other benchmarks. The source code is available at https://github.com/amazon-science/code-agent-eval.
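The aggregation step described in this abstract can be pictured with a minimal Python sketch. Everything below is illustrative: the class, the function names, and the all-locations-must-be-executable rule are assumptions for exposition, not the API of the released library.

from dataclasses import dataclass
from typing import List

@dataclass
class EditLocation:
    file_path: str
    hunk: str        # diff hunk applied at this location
    gold_test: str   # gold test patch provided as reference context

def critic_is_executable(loc: EditLocation) -> bool:
    # Placeholder for an LLM-critic call that judges whether the code
    # edited at `loc` would execute under the gold test patch.
    raise NotImplementedError("call your LLM critic here")

def predict_build_status(locations: List[EditLocation]) -> bool:
    # One hypothetical aggregation rule: a patch is predicted to build only
    # if every edited location is judged executable.
    return all(critic_is_executable(loc) for loc in locations)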
REDO: Execution-Free Runtime Error Detection for COding Agents
Li, Shou, Kan, Andrey, Callot, Laurent, Bhasker, Bhavana, Rashid, Muhammad Shihab, Esler, Timothy B
As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods, achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score, and we provide insights into the advantages of incorporating LLMs for error detection.

Large language models (LLMs) and LLM-based agents have exhibited significant potential in code generation, code editing, and code evaluation. This progress has culminated in the development of advanced LLM-based agents (hereafter referred to as coding agents) designed to address increasingly complex tasks. For example, SWE-Bench (Jimenez et al., 2024a) presents a demanding benchmark comprising repository-level coding challenges. This benchmark requires coding agents to generate a modification patch that solves a given problem within a GitHub repository, based on a problem statement expressed in natural language. To effectively navigate complex tasks such as those posed by SWE-Bench, coding agents must demonstrate proficiency in the following core competencies: 1) comprehension of the problem statement and retrieval of relevant code, 2) reasoning towards a functionally correct solution, and 3) generation of programs free from runtime errors such as SyntaxError, AttributeError, or TypeError. While the majority of coding agents across different tasks focus on enhancing comprehension, retrieval, and reasoning capabilities, the systematic detection of runtime errors has received comparatively limited attention. However, ensuring that generated code is free from runtime errors is as critical as the aforementioned capabilities. For example, an AttributeError can cause the modified code to fail, irrespective of the agent's comprehension and reasoning processes.
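A minimal sketch of the general static-plus-LLM pattern described above, assuming Python's built-in compiler as a stand-in for a static analysis tool and a placeholder for the LLM call; this is not the REDO implementation.

from typing import Optional

def static_check(source: str) -> Optional[str]:
    # Cheap static pass: Python's compiler catches SyntaxError without execution.
    try:
        compile(source, "<patched_file>", "exec")
        return None
    except SyntaxError as err:
        return f"SyntaxError: {err.msg} (line {err.lineno})"

def llm_runtime_check(source: str, context: str) -> Optional[str]:
    # Placeholder for an LLM prompt asking whether the modified code would
    # raise a runtime error (e.g., AttributeError, TypeError) given the
    # surrounding repository context.
    raise NotImplementedError("call your LLM here")

def detect_runtime_error(source: str, context: str) -> Optional[str]:
    # Prefer the precise static verdict; fall back to the LLM for error
    # classes that static analysis cannot reliably detect.
    return static_check(source) or llm_runtime_check(source, context)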
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Guinet, Gauthier, Omidvar-Tehrani, Behrooz, Deoras, Anoop, Callot, Laurent
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG). Evaluation is performed by scoring the RAG system on an automatically generated synthetic exam composed of multiple-choice questions based on the corpus of documents associated with the task. Our method is an automated, cost-efficient, interpretable, and robust strategy for selecting the optimal components of a RAG system. We leverage Item Response Theory (IRT) to estimate the quality of an exam and its informativeness about task-specific accuracy. IRT also provides a natural way to iteratively improve the exam by eliminating the exam questions that are not sufficiently informative about a model's ability. We demonstrate our approach on four new open-ended question-answering tasks based on Arxiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In addition, our experiments reveal more general insights into factors impacting RAG performance, such as size, retrieval mechanism, prompting, and fine-tuning. Most notably, our findings show that choosing the right retrieval algorithms often leads to bigger performance gains than simply using a larger language model.
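For readers unfamiliar with Item Response Theory, a standard two-parameter logistic (2PL) item model and its information function are shown below; the abstract does not state which IRT variant the paper uses, so this is background rather than the paper's exact formulation.

P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}, \qquad I_i(\theta) = a_i^2 \, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)

Here \theta is the RAG system's latent ability, a_i and b_i are the discrimination and difficulty of question i, and I_i(\theta) is the Fisher information the question carries about \theta; questions with low information over the relevant ability range are the natural candidates for elimination during iterative exam refinement.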
MELODY: Robust Semi-Supervised Hybrid Model for Entity-Level Online Anomaly Detection with Multivariate Time Series
Ni, Jingchao, Guinet, Gauthier, Jiang, Peihong, Callot, Laurent, Kan, Andrey
In large IT systems, software deployment is a crucial process in online services, as their code is regularly updated. However, a faulty code change may degrade the target service's performance and cause cascading outages in downstream services. Thus, software deployments should be comprehensively monitored, and their anomalies should be detected in a timely manner. In this paper, we study the problem of anomaly detection for deployments. We begin by identifying the challenges unique to this anomaly detection problem, which operates at the entity level (e.g., a deployment), in contrast to the more typical problem of anomaly detection in multivariate time series (MTS). The unique challenges include the heterogeneity of deployments, the low latency tolerance, the ambiguous anomaly definition, and the limited supervision. To address them, we propose a novel framework, the semi-supervised hybrid Model for Entity-Level Online Detection of anomalY (MELODY). MELODY first transforms the MTS of different entities into a common feature space with an online feature extractor, and then uses a newly proposed semi-supervised deep one-class model to detect anomalous entities. We evaluated MELODY on real data from cloud services comprising 1.2M+ time series. The relative F1 score improvement of MELODY over state-of-the-art methods ranges from 7.6% to 56.5%. A user evaluation suggests MELODY is suitable for monitoring deployments in large online systems.
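As background on the one-class component, a common semi-supervised deep one-class objective (in the spirit of Deep SAD; the abstract does not say that MELODY uses this exact loss) has the form

\min_{\mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \lVert \phi(x_i;\mathcal{W}) - c \rVert^2 \;+\; \frac{\eta}{m}\sum_{j=1}^{m} \bigl(\lVert \phi(\tilde{x}_j;\mathcal{W}) - c \rVert^2\bigr)^{\tilde{y}_j} \;+\; \lambda \lVert \mathcal{W} \rVert_F^2,

where \phi is the learned feature map, c a fixed center, the first sum runs over unlabeled entities, and the second over the few labeled ones with \tilde{y}_j = +1 for normal and \tilde{y}_j = -1 for anomalous examples, so labeled anomalies are pushed away from the center.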
Unsupervised Model Selection for Time-series Anomaly Detection
Goswami, Mononito, Challu, Cristian, Callot, Laurent, Minorics, Lenon, Kan, Andrey
Anomaly detection in time series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers the following question: given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide a theoretical justification for the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
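To make the rank-aggregation idea concrete, the sketch below combines the rankings induced by several imperfect surrogate metrics with a plain Borda count. The paper formulates a robust aggregation problem, so this conveys only the basic mechanism, not the proposed method.

import numpy as np

def borda_aggregate(metric_scores: np.ndarray) -> np.ndarray:
    # metric_scores: shape (n_metrics, n_models), higher = better under each
    # surrogate metric. Returns an aggregate Borda score per model.
    # Rank models under each metric (0 = worst, n_models - 1 = best),
    # then sum the ranks across metrics.
    ranks = metric_scores.argsort(axis=1).argsort(axis=1)
    return ranks.sum(axis=0)

scores = np.array([
    [0.8, 0.3, 0.5],   # e.g., (negated) prediction error
    [0.6, 0.4, 0.7],   # e.g., model centrality
    [0.9, 0.2, 0.6],   # e.g., F1 on injected synthetic anomalies
])
best_model = int(borda_aggregate(scores).argmax())  # selects model 0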
Criteria for Classifying Forecasting Methods
Januschowski, Tim, Gasthaus, Jan, Wang, Yuyang, Salinas, David, Flunkert, Valentin, Bohlke-Schneider, Michael, Callot, Laurent
Classifying forecasting methods as being either of a "machine learning" or "statistical" nature has become commonplace in parts of the forecasting literature and community, as exemplified by the M4 competition and the conclusions drawn by its organizers. We argue that this distinction does not stem from fundamental differences in the methods assigned to either class. Instead, the distinction is probably of a tribal nature, and it limits the insights into the appropriateness and effectiveness of different forecasting methods. We provide alternative characteristics of forecasting methods which, in our view, allow meaningful conclusions to be drawn. Further, we discuss areas of forecasting that could benefit most from cross-pollination between the machine learning and statistics communities.
Deep Learning for Time Series Forecasting: Tutorial and Literature Survey
Benidis, Konstantinos, Rangapuram, Syama Sundar, Flunkert, Valentin, Wang, Yuyang, Maddix, Danielle, Turkmen, Caner, Gasthaus, Jan, Bohlke-Schneider, Michael, Salinas, David, Stella, Lorenzo, Aubet, Francois-Xavier, Callot, Laurent, Januschowski, Tim
Deep-learning-based forecasting methods have become the methods of choice in many time series forecasting applications, often outperforming other approaches. Consequently, over the last few years, these methods have become ubiquitous in large-scale industrial forecasting applications and have consistently ranked among the best entries in forecasting competitions (e.g., M4 and M5). This practical success has further increased the academic interest in understanding and improving deep forecasting methods. In this article, we provide an introduction to and overview of the field: we present important building blocks of deep forecasting in some depth and, using these building blocks, we survey the breadth of the recent deep forecasting literature.
Online Time Series Anomaly Detection with State Space Gaussian Processes
Bock, Christian, Aubet, François-Xavier, Gasthaus, Jan, Kan, Andrey, Chen, Ming, Callot, Laurent
We propose r-ssGPFA, an unsupervised online anomaly detection model for uni- and multivariate time series, building on the efficient state space formulation of Gaussian processes. For high-dimensional time series, we propose an extension of Gaussian process factor analysis to identify the common latent processes of the time series, allowing us to detect anomalies efficiently and in an interpretable manner. We gain explainability while speeding up computations by imposing an orthogonality constraint on the mapping from the latent to the observed space. Our model's robustness is improved by using a simple heuristic to skip Kalman updates when encountering anomalous observations. We investigate the behaviour of our model on synthetic data and show on standard benchmark datasets that our method is competitive with state-of-the-art methods while being computationally cheaper.
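The robustness heuristic can be sketched as follows; the function signatures, the scoring rule, and the threshold are illustrative assumptions rather than the r-ssGPFA code.

def filter_step(state, obs, predict, update, score, threshold):
    # Time update: propagate the state belief one step ahead.
    state_pred = predict(state)
    # Score the new observation, e.g., by its negative predictive log-likelihood.
    anomaly_score = score(state_pred, obs)
    if anomaly_score > threshold:
        # Skip the Kalman measurement update so the anomalous observation
        # does not contaminate the latent state.
        return state_pred, True
    # Otherwise apply the usual measurement update.
    return update(state_pred, obs), False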
Spliced Binned-Pareto Distribution for Robust Modeling of Heavy-tailed Time Series
Ehrlich, Elena, Callot, Laurent, Aubet, François-Xavier
This work proposes a novel method to robustly and accurately model time series with heavy-tailed noise in non-stationary scenarios. In many practical applications, time series have heavy-tailed noise that significantly impacts the performance of classical forecasting models; in particular, accurately modeling a distribution over extreme events is crucial to performing accurate time series anomaly detection. We propose a Spliced Binned-Pareto distribution, which is both robust to extreme observations and allows accurate modeling of the full distribution. Our method captures time dependencies in the higher-order moments of the distribution, such as the tail heaviness. We compare the robustness and the accuracy of the tail estimation of our method to those of other state-of-the-art methods on time series of Twitter mention counts.
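One way to write a spliced density of this kind, with a piecewise-constant (binned) body and generalized Pareto tails, is shown below; the exact parameterization in the paper may differ.

f(z) = \begin{cases} \alpha \, g_{\ell}(q_{\alpha} - z), & z < q_{\alpha} \\ (1 - 2\alpha)\, b(z), & q_{\alpha} \le z \le q_{1-\alpha} \\ \alpha \, g_{u}(z - q_{1-\alpha}), & z > q_{1-\alpha} \end{cases}

Here b is a binned (piecewise-constant) density on the body, g_\ell and g_u are generalized Pareto densities modeling the lower and upper tails, and q_\alpha, q_{1-\alpha} are the splicing quantiles, so the three pieces carry probability mass \alpha, 1 - 2\alpha, and \alpha respectively.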
Improve black-box sequential anomaly detector relevancy with limited user feedback
Kong, Luyang, Chen, Lifan, Chen, Ming, Bhatia, Parminder, Callot, Laurent
Anomaly detectors are often designed to catch statistical anomalies. End-users, however, are typically not interested in all of the detected outliers, but only in those relevant to their application. Given an existing black-box sequential anomaly detector, this paper proposes a method to improve its user relevancy using a small amount of human feedback. As our first contribution, the method is agnostic to the detector: it only assumes access to its anomaly scores and requires no additional information about the detector's internals. Inspired by the fact that anomalies are of different types, our approach identifies these types and uses user feedback to assign a relevancy to each type. This relevancy score, as our second contribution, is then used to adjust the subsequent anomaly selection process. Empirical results on synthetic and real-world datasets show that our approach yields significant improvements in precision and recall over a range of anomaly detectors.
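The type-based relevancy idea can be sketched as follows: detected anomalies are grouped into types (here via k-means on some feature representation), user feedback on a few examples sets a per-type relevancy, and subsequent anomalies are re-ranked by score times relevancy. The feature choice, the clustering, and the class below are assumptions for illustration, not the paper's exact procedure.

import numpy as np
from sklearn.cluster import KMeans

class RelevancyAdjuster:
    def __init__(self, n_types: int = 5):
        self.kmeans = KMeans(n_clusters=n_types, n_init=10)
        self.relevancy = np.ones(n_types)  # prior: every type is relevant

    def fit_types(self, anomaly_features: np.ndarray) -> None:
        # Group previously detected anomalies into types.
        self.kmeans.fit(anomaly_features)

    def add_feedback(self, features: np.ndarray, relevant: np.ndarray) -> None:
        # Average user feedback (1 = relevant, 0 = not) per anomaly type.
        types = self.kmeans.predict(features)
        for t in np.unique(types):
            self.relevancy[t] = relevant[types == t].mean()

    def adjusted_scores(self, features: np.ndarray, scores: np.ndarray) -> np.ndarray:
        # Down-weight the black-box detector's scores for irrelevant types.
        return scores * self.relevancy[self.kmeans.predict(features)]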