Goto

Collaborating Authors

 test condition


SAFLITE: Fuzzing Autonomous Systems via Large Language Models

Zhu, Taohong, Skapars, Adrians, Mackenzie, Fardeen, Kehoe, Declan, Newton, William, Embury, Suzanne, Sun, Youcheng

arXiv.org Artificial Intelligence

Fuzz testing effectively uncovers software vulnerabilities; however, it faces challenges with Autonomous Systems (AS) due to their vast search spaces and complex state spaces, which reflect the unpredictability and complexity of real-world environments. This paper presents a universal framework aimed at improving the efficiency of fuzz testing for AS. At its core is SaFliTe, a predictive component that evaluates whether a test case meets predefined safety criteria. By leveraging the large language model (LLM) with information about the test objective and the AS state, SaFliTe assesses the relevance of each test case. We evaluated SaFliTe by instantiating it with various LLMs, including GPT-3.5, Mistral-7B, and Llama2-7B, and integrating it into four fuzz testing tools: PGFuzz, DeepHyperion-UAV, CAMBA, and TUMB. These tools are designed specifically for testing autonomous drone control systems, such as ArduPilot, PX4, and PX4-Avoidance. The experimental results demonstrate that, compared to PGFuzz, SaFliTe increased the likelihood of selecting operations that triggered bug occurrences in each fuzzing iteration by an average of 93.1\%. Additionally, after integrating SaFliTe, the ability of DeepHyperion-UAV, CAMBA, and TUMB to generate test cases that caused system violations increased by 234.5\%, 33.3\%, and 17.8\%, respectively. The benchmark for this evaluation was sourced from a UAV Testing Competition.


Investigating learning-independent abstract reasoning in artificial neural networks

Barak, Tomer, Loewenstein, Yonatan

arXiv.org Artificial Intelligence

Humans are capable of solving complex abstract reasoning tests. Whether this ability reflects a learning-independent inference mechanism applicable to any novel unlearned problem or whether it is a manifestation of extensive training throughout life is an open question. Addressing this question in humans is challenging because it is impossible to control their prior training. However, assuming a similarity between the cognitive processing of Artificial Neural Networks (ANNs) and humans, the extent to which training is required for ANNs' abstract reasoning is informative about this question in humans. Previous studies demonstrated that ANNs can solve abstract reasoning tests. However, this success required extensive training. In this study, we examined the learning-independent abstract reasoning of ANNs. Specifically, we evaluated their performance without any pretraining, with the ANNs' weights being randomly-initialized, and only change in the process of problem solving. We found that naive ANN models can solve non-trivial visual reasoning tests, similar to those used to evaluate human learning-independent reasoning. We further studied the mechanisms that support this ability. Our results suggest the possibility of learning-independent abstract reasoning that does not require extensive training.


The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Xie, Yuankun, Lu, Yi, Fu, Ruibo, Wen, Zhengqi, Wang, Zhiyong, Tao, Jianhua, Qi, Xin, Wang, Xiaopeng, Liu, Yukun, Cheng, Haonan, Ye, Long, Sun, Yi

arXiv.org Artificial Intelligence

With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits widespread, high deception, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including 2 languages, over 1M audio samples, and various test conditions, focus on ALM-based audio detection. As countermeasure, to achieve universal detection of deepfake audio and tackle domain ascent bias issue of original SAM, we propose the CSAM strategy to learn a domain balanced and generalized minima. In our experiments, we first demonstrate that ADD model training with the Codecfake dataset can effectively detects ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.


Adversarial Data Augmentation for Robust Speaker Verification

Zhou, Zhenyu, Chen, Junhui, Wang, Namin, Li, Lantian, Wang, Dong

arXiv.org Artificial Intelligence

Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a potential issue with the vanilla DA is augmentation residual, i.e., unwanted distortion caused by different types of augmentation. To address this problem, this paper proposes a novel approach called adversarial data augmentation (A-DA) which combines DA with adversarial learning. Specifically, it involves an additional augmentation classifier to categorize various augmentation types used in data augmentation. This adversarial learning empowers the network to generate speaker embeddings that can deceive the augmentation classifier, making the learned speaker embeddings more robust in the face of augmentation variations. Experiments conducted on VoxCeleb and CN-Celeb datasets demonstrate that our proposed A-DA outperforms standard DA in both augmentation matched and mismatched test conditions, showcasing its superior robustness and generalization against acoustic variations.


Test Case Generation and Test Oracle Support for Testing CPSs using Hybrid Models

Sadri-Moshkenani, Zahra, Bradley, Justin, Rothermel, Gregg

arXiv.org Artificial Intelligence

Cyber-Physical Systems (CPSs) play a central role in the behavior of a wide range of autonomous physical systems such as medical devices, autonomous vehicles, and smart homes, many of which are safety-critical. CPSs are often specified iteratively as a sequence of models at different levels that can be tested via simulation systems at early stages of their development cycle. One such model is a hybrid automaton; these are used frequently for CPS applications and have the advantage of encapsulating both continuous and discrete CPS behaviors. When testing CPSs, engineers can take advantage of these models to generate test cases that target both types of these behaviors. Moreover, since these models are constructed early in the development process for CPSs, they allow test cases to be generated early in that process for those CPSs, even before simulation models of the CPSs have been designed. One challenge when testing CPSs is that these systems may operate differently even under an identically applied test scenario. In such cases, we cannot employ test oracles that use predetermined deterministic behaviors; instead, test oracles should consider sets of desired behaviors in order to determine whether the CPS has behaved appropriately. In this paper we present a test case generation technique, HYTEST, that generates test cases based on hybrid models, accompanied by appropriate test oracles, for use in testing CPSs early in their development cycle. To evaluate the effectiveness and efficiency of HYTEST, we conducted an empirical study in which we applied the technique to several CPSs and measured its ability to detect faults in those CPSs and the amount of time required to perform the testing process. The results of the study show that HYTEST was able to detect faults more effectively and efficiently than the baseline techniques we compare it to.


Critical Evaluation of Artificial Intelligence as Digital Twin of Pathologist for Prostate Cancer Pathology

Eminaga, Okyaz, Abbas, Mahmoud, Kunder, Christian, Tolkach, Yuri, Han, Ryan, Brooks, James D., Nolley, Rosalie, Semjonow, Axel, Boegemann, Martin, West, Robert, Long, Jin, Fan, Richard, Bettendorf, Olaf

arXiv.org Artificial Intelligence

Prostate cancer pathology plays a crucial role in clinical management but is time-consuming. Artificial intelligence (AI) shows promise in detecting prostate cancer and grading patterns. We tested an AI-based digital twin of a pathologist, vPatho, on 2,603 histology images of prostate tissue stained with hematoxylin and eosin. We analyzed various factors influencing tumor-grade disagreement between vPatho and six human pathologists. Our results demonstrated that vPatho achieved comparable performance in prostate cancer detection and tumor volume estimation, as reported in the literature. Concordance levels between vPatho and human pathologists were examined. Notably, moderate to substantial agreement was observed in identifying complementary histological features such as ductal, cribriform, nerve, blood vessels, and lymph cell infiltrations. However, concordance in tumor grading showed a decline when applied to prostatectomy specimens (kappa = 0.44) compared to biopsy cores (kappa = 0.70). Adjusting the decision threshold for the secondary Gleason pattern from 5% to 10% improved the concordance level between pathologists and vPatho for tumor grading on prostatectomy specimens (kappa from 0.44 to 0.64). Potential causes of grade discordance included the vertical extent of tumors toward the prostate boundary and the proportions of slides with prostate cancer. Gleason pattern 4 was particularly associated with discordance. Notably, grade discordance with vPatho was not specific to any of the six pathologists involved in routine clinical grading. In conclusion, our study highlights the potential utility of AI in developing a digital twin of a pathologist. This approach can help uncover limitations in AI adoption and the current grading system for prostate cancer pathology.


On the Impact of Interruptions During Multi-Robot Supervision Tasks

Dahiya, Abhinav, Cai, Yifan, Schneider, Oliver, Smith, Stephen L.

arXiv.org Artificial Intelligence

Human supervisors in multi-robot systems are primarily responsible for monitoring robots, but can also be assigned with secondary tasks. These tasks can act as interruptions and can be categorized as either intrinsic, i.e., being directly related to the monitoring task, or extrinsic, i.e., being unrelated. In this paper, we investigate the impact of these two types of interruptions through a user study ($N=39$), where participants monitor a number of remote mobile robots while intermittently being interrupted by either a robot fault correction task (intrinsic) or a messaging task (extrinsic). We find that task performance of participants does not change significantly with the interruptions but depends greatly on the number of robots. However, interruptions result in an increase in perceived workload, and extrinsic interruptions have a more negative effect on workload across all NASA-TLX scales. Participants also reported switching between extrinsic interruptions and the primary task to be more difficult compared to the intrinsic interruption case. Statistical significance of these results is confirmed using ANOVA and one-sample t-test. These findings suggest that when deciding task assignment in such supervision systems, one should limit interruptions from secondary tasks, especially extrinsic ones, in order to limit user workload.


Non-iterative generation of an optimal mesh for a blade passage using deep reinforcement learning

Kim, Innyoung, Kim, Sejin, You, Donghyun

arXiv.org Artificial Intelligence

A method using deep reinforcement learning (DRL) to non-iteratively generate an optimal mesh for an arbitrary blade passage is developed. Despite automation in mesh generation using either an empirical approach or an optimization algorithm, repeated tuning of meshing parameters is still required for a new geometry. The method developed herein employs a DRL-based multi-condition optimization technique to define optimal meshing parameters as a function of the blade geometry, attaining automation, minimization of human intervention, and computational efficiency. The meshing parameters are optimized by training an elliptic mesh generator which generates a structured mesh for a blade passage with an arbitrary blade geometry. During each episode of the DRL process, the mesh generator is trained to produce an optimal mesh for a randomly selected blade passage by updating the meshing parameters until the mesh quality, as measured by the ratio of determinants of the Jacobian matrices and the skewness, reaches the highest level. Once the training is completed, the mesh generator create an optimal mesh for a new arbitrary blade passage in a single try without an repetitive process for the parameter tuning for mesh generation from the scratch. The effectiveness and robustness of the proposed method are demonstrated through the generation of meshes for various blade passages.


Zarrieß

AAAI Conferences

A knowledge-based program defines the behavior of an agent by combining primitive actions, programming constructs and test conditions that make explicit reference to the agent's knowledge. In this paper we consider a setting where an agent is equipped with a Description Logic (DL) knowledge base providing general domain knowledge and an incomplete description of the initial situation. We introduce a corresponding new DL-based action language that allows for representing both physical and sensing actions, and that we then use to build knowledge-based programs with test conditions expressed in the epistemic DL. After proving undecidability for the general case, we then discuss a restricted fragment where verification becomes decidable. The provided proof is constructive and comes with an upper bound on the procedure's complexity.


WAD: A Deep Reinforcement Learning Agent for Urban Autonomous Driving

Sharma, Arjit, Sharma, Sahil

arXiv.org Artificial Intelligence

Urban autonomous driving is an open and challenging problem to solve as the decision-making system has to account for several dynamic factors like multi-agent interactions, diverse scene perceptions, complex road geometries, and other rarely occurring real-world events. On the other side, with deep reinforcement learning (DRL) techniques, agents have learned many complex policies. They have even achieved super-human-level performances in various Atari Games and Deepmind's AlphaGo. However, current DRL techniques do not generalize well on complex urban driving scenarios. This paper introduces the DRL driven Watch and Drive (WAD) agent for end-to-end urban autonomous driving. Motivated by recent advancements, the study aims to detect important objects/states in high dimensional spaces of CARLA and extract the latent state from them. Further, passing on the latent state information to WAD agents based on TD3 and SAC methods to learn the optimal driving policy. Our novel approach utilizing fewer resources, step-by-step learning of different driving tasks, hard episode termination policy, and reward mechanism has led our agents to achieve a 100% success rate on all driving tasks in the original CARLA benchmark and set a new record of 82% on further complex NoCrash benchmark, outperforming the state-of-the-art model by more than +30% on NoCrash benchmark.