Collaborating Authors

 Nikšić


DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

Rubinstein, Alexander, Raible, Benjamin, Gubri, Martin, Oh, Seong Joon

arXiv.org Artificial Intelligence

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach to reducing this cost follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering, which makes the approach conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. DISCO shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
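
The selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that): the disagreement score used here, the entropy of the label votes across models, and the names `vote_entropy` and `select_top_k` are assumptions made for illustration only.

```python
import math
from collections import Counter


def vote_entropy(labels):
    """Entropy of the label votes that a set of models cast for one sample.

    Assumption: this is one possible disagreement score, chosen for
    illustration. It is 0 when all models agree and grows as votes spread out.
    """
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_top_k(predictions, k):
    """Greedy, sample-wise selection of the k most disputed samples.

    predictions[m][i] is the label that model m predicts for sample i.
    Returns (indices of the selected samples, per-sample scores).
    """
    n_samples = len(predictions[0])
    scores = [
        vote_entropy([model[i] for model in predictions])
        for i in range(n_samples)
    ]
    # Sort by descending score; Python's sort is stable, so ties keep index order.
    ranked = sorted(range(n_samples), key=lambda i: -scores[i])
    return ranked[:k], scores


# Three hypothetical models, three samples.
predictions = [
    [0, 1, 2],
    [0, 1, 0],
    [0, 0, 1],
]
selected, scores = select_top_k(predictions, k=2)
```

In this toy input all models agree on sample 0, two of three agree on sample 1, and all three disagree on sample 2, so samples 2 and 1 are selected. Each score depends only on one sample's responses, so no global clustering is needed.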


Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers

Viswanathan, Janaki, Blatt, Alexander, Hagemann, Konrad, Klakow, Dietrich

arXiv.org Artificial Intelligence

Air traffic control (ATC) demands multi-tasking under time pressure, where errors carry severe consequences. This can induce stress. Detecting stress is key to maintaining the high safety standards of ATC. However, processing ATC voice data is subject to privacy restrictions, e.g. under the General Data Protection Regulation (GDPR). Anonymizing the ATC voice data is one way to comply with these restrictions. In this paper, different architectures for stress detection on anonymized air traffic controller (ATCO) speech are evaluated. Our best networks reach a stress detection accuracy of 93.6% on an anonymized version of the Speech Under Simulated and Actual Stress (SUSAS) dataset and an accuracy of 80.1% on our anonymized ATC simulation dataset. This shows that privacy does not have to be an impediment in building well-performing deep-learning-based models.


AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

Chen, Yinfang, Shetty, Manish, Somashekar, Gagan, Ma, Minghua, Simmhan, Yogesh, Mace, Jonathan, Bansal, Chetan, Wang, Rujia, Rajmohan, Saravan

arXiv.org Artificial Intelligence

AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.


Active Testing of Large Language Model via Multi-Stage Sampling

Huang, Yuheng, Song, Jiayang, Hu, Qiang, Juefei-Xu, Felix, Ma, Lei

arXiv.org Artificial Intelligence

Performance evaluation plays a crucial role in the development life cycle of large language models (LLMs). It estimates the model's capability, elucidates behavior characteristics, and facilitates the identification of potential issues and limitations, thereby guiding further improvement. Given that LLMs' diverse task-handling abilities stem from large volumes of training data, a comprehensive evaluation also necessitates abundant, well-annotated, and representative test data to assess LLM performance across various downstream tasks. However, the demand for high-quality test data often entails substantial time, computational resources, and manual efforts, sometimes causing the evaluation to be inefficient or impractical. To address these challenges, researchers propose active testing, which estimates the overall performance by selecting a subset of test data. Nevertheless, the existing active testing methods tend to be inefficient, even inapplicable, given the unique new challenges of LLMs (e.g., diverse task types, increased model complexity, and unavailability of training data). To mitigate such limitations and expedite the development cycle of LLMs, in this work, we introduce AcTracer, an active testing framework tailored for LLMs that strategically selects a small subset of test data to achieve a nearly optimal performance estimation for LLMs. AcTracer utilizes both internal and external information from LLMs to guide the test sampling process, reducing variance through a multi-stage pool-based active selection. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods across various tasks, with up to 38.83% improvement over previous SOTA.
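
The general idea of estimating overall performance from a small labeled subset can be sketched with plain stratified sampling. This is a deliberately simpler, swapped-in technique and not AcTracer's multi-stage procedure; the function name, the `strata` partition, and the `is_correct` callback are hypothetical.

```python
import random


def stratified_accuracy_estimate(strata, is_correct, budget, seed=0):
    """Estimate accuracy over a test pool from roughly `budget` labeled items.

    strata:     list of non-empty lists of test-item ids (a partition of the pool)
    is_correct: callable id -> bool, standing in for the expensive step of
                running the model and checking its answer
    """
    rng = random.Random(seed)
    total = sum(len(items) for items in strata)
    estimate = 0.0
    for items in strata:
        # Proportional allocation, with at least one sample per stratum.
        n = max(1, round(budget * len(items) / total))
        n = min(n, len(items))
        sample = rng.sample(items, n)
        stratum_accuracy = sum(is_correct(i) for i in sample) / n
        # Weight each stratum by its share of the full pool.
        estimate += (len(items) / total) * stratum_accuracy
    return estimate


# Toy pool: the model is right on items 0..59 and wrong on items 60..99.
strata = [list(range(60)), list(range(60, 100))]
estimate = stratified_accuracy_estimate(strata, lambda i: i < 60, budget=10)
```

When items within a stratum behave alike, as in this toy pool, the estimate has low variance even with few labels. That is the effect variance-reducing sampling schemes aim for.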


Building AI Agents for Autonomous Clouds: Challenges and Design Principles

Shetty, Manish, Chen, Yinfang, Somashekar, Gagan, Ma, Minghua, Simmhan, Yogesh, Zhang, Xuchao, Mace, Jonathan, Vandevoorde, Dax, Las-Casas, Pedro, Gupta, Shachee Mishra, Nath, Suman, Bansal, Chetan, Rajmohan, Saravan

arXiv.org Artificial Intelligence

The rapid growth in the use of Large Language Models (LLMs) and AI Agents as part of software development and deployment is revolutionizing the information technology landscape. While code generation receives significant attention, a higher-impact application lies in using AI agents for operational resilience of cloud services, which currently require significant human effort and domain knowledge. There is a growing interest in AI for IT Operations (AIOps), which aims to automate complex operational tasks, like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents. This vision paper lays the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. We also propose AIOpsLab, a prototype implementation leveraging an agent-cloud interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. We report promising results and lay the groundwork to build a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.


Model orthogonalization and Bayesian forecast mixing via Principal Component Analysis

Giuliani, Pablo, Godbey, Kyle, Kejzlar, Vojtech, Nazarewicz, Witold

arXiv.org Machine Learning

One can improve predictability in the unknown domain by combining forecasts of imperfect complex computational models using a Bayesian statistical machine learning framework. In many cases, however, the models used in the mixing process are similar. In addition to contaminating the model space, the existence of such similar, or even redundant, models during the multimodeling process can result in misinterpretation of results and deterioration of predictive performance. In this work we describe a method based on the Principal Component Analysis that eliminates model redundancy. We show that by adding model orthogonalization to the proposed Bayesian Model Combination framework, one can arrive at better prediction accuracy and reach excellent uncertainty quantification performance.
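
Why redundant models are a problem, and how Principal Component Analysis exposes them, can be seen in a small sketch. It covers only the PCA step and not the Bayesian Model Combination framework. The helper names are hypothetical, and power iteration in pure Python is used only to keep the example self-contained; a library routine such as `numpy.linalg.svd` would be the usual choice.

```python
import math


def covariance(rows):
    """Sample covariance between model columns; rows[i][j] = forecast of model j at point i."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]


def leading_component(cov, iters=200):
    """Leading eigenpair of a covariance matrix by power iteration.

    Assumption: the uniform start vector is not orthogonal to the leading
    component, which holds for the positively correlated forecasts used here.
    """
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)
    )
    return eigenvalue, v


# Two hypothetical models that give identical forecasts at four points.
forecasts = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
cov = covariance(forecasts)
eigenvalue, direction = leading_component(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of total variance
```

Here a single component carries all of the variance, which signals that the second model adds no independent information to the mix.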


Comparison of two data fusion approaches for land use classification

Cubaud, Martin, Bris, Arnaud Le, Jolivet, Laurence, Olteanu-Raimond, Ana-Maria

arXiv.org Artificial Intelligence

Accurate land use maps, describing the territory from an anthropic utilisation point of view, are useful tools for land management and planning. To produce them, the use of optical images alone remains limited. It is therefore necessary to make use of several heterogeneous sources, each carrying complementary or contradictory information due to their imperfections or their different specifications. This study compares two different approaches, i.e. a pre-classification and a post-classification fusion approach, for combining several sources of spatial data in the context of land use classification. The approaches are applied on authoritative land use data located in the Gers department in the south-west of France. Pre-classification fusion, while not explicitly modeling imperfections, has the best final results, reaching an overall accuracy of 97% and a macro-mean F1 score of 88%. 1. INTRODUCTION Land Use (LU) describes the socio-economic human activity of an area (e.g. ... Land Use and Land Cover (LULC) maps are very useful for understanding, monitoring, planning and predicting the evolution of ... At the feature level, Fonte et al. (2018) identified building functions using rule-based classifications of OpenStreetMap (OSM), Facebook ... al. (2022) identified building functions from images, POI and building footprint from Gaode map (authoritative database) and distance to OSM roads using an XGBoost classifier.


Efficient Linearizability Checking for Actor-based Systems

Al-Mahfoudh, Mohammed S., Stutsman, Ryan, Gopalakrishnan, Ganesh

arXiv.org Artificial Intelligence

Recent demand for distributed software has led to a surge in the popularity of actor-based frameworks. However, even with the stylized message passing model of actors, writing correct distributed software is still difficult. We present our work on linearizability checking in DS2, an integrated framework for specifying, synthesizing, and testing distributed actor systems. The key insight of our approach is that subcomponents of distributed actor systems often represent common algorithms or data structures (e.g. a distributed hash table or tree) that can be validated against a simple sequential model of the system. This makes it easy for developers to validate their concurrent actor systems without complex specifications. DS2 automatically explores the concurrent schedules that the system could arrive at, and it checks that the observed output of the system is equivalent to what the sequential implementation could have produced. We describe DS2's linearizability checking and test it on several concurrent replication algorithms from the literature. We explore in detail how different algorithms for enumerating the model schedule space fare in finding bugs in actor systems, and we present our own refinements on algorithms for exploring actor system schedules that we show are effective in finding bugs.
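
The idea of validating a concurrent history against a simple sequential model can be sketched with a brute-force check for a single read/write register. This is not DS2's algorithm or API: the `Op` record and all function names are hypothetical, and trying every permutation is only practical for very short histories.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    start: int             # invocation time
    end: int               # response time
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned


def respects_real_time(order):
    """If one operation finished before another started, it must come first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def matches_sequential_register(order, initial=None):
    """Replay the operations on a sequential register and compare read results."""
    state = initial
    for op in order:
        if op.kind == "write":
            state = op.value
        elif op.value != state:
            return False
    return True


def is_linearizable(history):
    """True if some real-time-respecting order is explained by the sequential model."""
    return any(
        respects_real_time(order) and matches_sequential_register(order)
        for order in permutations(history)
    )


# The read overlaps the write, so it may legally observe the written value.
overlapping = [Op(0, 2, "write", 1), Op(1, 3, "read", 1)]
# The read starts after both writes completed but returns the older value.
stale_read = [Op(0, 1, "write", 1), Op(2, 3, "write", 2), Op(4, 5, "read", 1)]

ok = is_linearizable(overlapping)
bad = is_linearizable(stale_read)
```

The first history is accepted and the second is rejected, because the only order consistent with real time leaves the register at 2 when the read runs. The factorial cost of this naive search illustrates why the way schedules are enumerated matters.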


Cluster-CAM: Cluster-Weighted Visual Interpretation of CNNs' Decision in Image Classification

Feng, Zhenpeng, Ji, Hongbing, Dakovic, Milos, Cui, Xiyang, Zhu, Mingzhe, Stankovic, Ljubisa

arXiv.org Artificial Intelligence

Despite the tremendous success of convolutional neural networks (CNNs) in computer vision, the mechanism of CNNs still lacks clear interpretation. Currently, class activation mapping (CAM), a well-known visualization technique for interpreting a CNN's decision, has drawn increasing attention. Gradient-based CAMs are efficient, but their performance is heavily affected by vanishing and exploding gradients. In contrast, gradient-free CAMs avoid computing gradients and produce more understandable results. However, existing gradient-free CAMs are quite time-consuming because hundreds of forward inferences per image are required. In this paper, we propose Cluster-CAM, an effective and efficient gradient-free CNN interpretation algorithm. Cluster-CAM can significantly reduce the number of forward propagations by splitting the feature maps into clusters in an unsupervised manner. Furthermore, we propose an artful strategy to forge a cognition-base map and cognition-scissors from clustered feature maps. The final salience heatmap is computed by merging the above cognition maps. Qualitative results show that Cluster-CAM can produce heatmaps where the highlighted regions match human cognition more precisely than existing CAMs. The quantitative evaluation further demonstrates the superiority of Cluster-CAM in both effectiveness and efficiency.


Computing Parameterized Invariants of Parameterized Petri Nets

Esparza, Javier, Raskin, Mikhail, Welzel, Christoph

arXiv.org Artificial Intelligence

A fundamental advantage of Petri net models is the possibility to automatically compute useful system invariants from the syntax of the net. Classical techniques used for this are place invariants, P-components, siphons or traps. Recently, Bozga et al. have presented a novel technique for the parameterized verification of safety properties of systems with a ring or array architecture. They show that the statement "for every instance of the parameterized Petri net, all markings satisfying the linear invariants associated to all the P-components, siphons and traps of the instance are safe" can be encoded in WS1S and checked using tools like MONA. However, while the technique certifies that this infinite set of linear invariants extracted from P-components, siphons or traps is strong enough to prove safety, it does not return an explanation of this fact understandable by humans. We present a CEGAR loop that constructs a finite set of parameterized P-components, siphons or traps, whose infinitely many instances are strong enough to prove safety. For this we design parameterization procedures for different architectures.
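
Siphons and traps are purely syntactic properties of a net, which a short sketch for a single, non-parameterized net can make concrete. It uses the standard definitions: a set of places is a trap if every transition that consumes from the set also produces into it, and a siphon if every transition that produces into the set also consumes from it. The encoding of transitions as pairs of place sets and the function names are assumptions for illustration, not the paper's tooling.

```python
from typing import FrozenSet, List, Tuple

# A transition is encoded as (input places, output places).
Transition = Tuple[FrozenSet[str], FrozenSet[str]]


def is_trap(places: FrozenSet[str], transitions: List[Transition]) -> bool:
    """Every transition taking a token from `places` puts one back into `places`."""
    return all(bool(post & places) for pre, post in transitions if pre & places)


def is_siphon(places: FrozenSet[str], transitions: List[Transition]) -> bool:
    """Every transition putting a token into `places` takes one from `places`."""
    return all(bool(pre & places) for pre, post in transitions if post & places)


# Hypothetical two-place ring: a token moves from p1 to p2 and back.
ring = [
    (frozenset({"p1"}), frozenset({"p2"})),
    (frozenset({"p2"}), frozenset({"p1"})),
]

whole_ring = frozenset({"p1", "p2"})
single_place = frozenset({"p1"})

ring_is_trap = is_trap(whole_ring, ring)
ring_is_siphon = is_siphon(whole_ring, ring)
single_is_trap = is_trap(single_place, ring)
```

A trap that is marked initially stays marked forever, and a siphon that is empty stays empty. Both checks read only the structure of the net, and facts of this kind are what yield the linear invariants mentioned in the abstract.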