BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Neural Information Processing Systems

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess how efficiently models handle long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong comprises a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists and sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text.
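As a rough illustration of the needle-in-a-haystack construction the title suggests, here is a minimal sketch in which task-relevant "needle" facts are inserted at random positions in long distractor text; the function, names, and inputs below are illustrative assumptions, not taken from the benchmark's code.

```python
import random

def build_haystack(needle_facts, distractor_sentences, target_chars):
    """Scatter task-relevant 'needle' facts at random positions in filler text."""
    filler, total = [], 0
    for sentence in distractor_sentences:
        filler.append(sentence)
        total += len(sentence)
        if total >= target_chars:
            break
    for fact in needle_facts:
        filler.insert(random.randrange(len(filler) + 1), fact)
    return " ".join(filler)

# Illustrative inputs: two bAbI-style facts hidden in generic filler.
facts = ["Mary moved to the bathroom.", "John went to the hallway."]
filler = [f"Background sentence number {i}." for i in range(10_000)]
context = build_haystack(facts, filler, target_chars=50_000)
question = "Where is Mary?"  # the model must locate the facts inside `context`
```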


Simulator Ensembles for Trustworthy Autonomous Driving Testing

Sorokin, Lev, Biagiola, Matteo, Stocco, Andrea

arXiv.org Artificial Intelligence

Scenario-based testing with driving simulators is extensively used to identify failing conditions of automated driving assistance systems (ADAS) and reduce the amount of in-field road testing. However, existing studies have shown that repeated test execution in the same simulator, as well as across distinct simulators, can yield different outcomes, which can be attributed to sources of flakiness or different implementations of the physics, among other factors. In this paper, we present MultiSim, a novel approach to multi-simulation ADAS testing based on a search-based testing approach that leverages an ensemble of simulators to identify failure-inducing, simulator-agnostic test scenarios. During the search, each scenario is evaluated jointly on multiple simulators. Scenarios that produce consistent results across simulators are prioritized for further exploration, while those that fail on only a subset of simulators are given less priority, as they may reflect simulator-specific issues rather than generalizable failures. Our case study, which involves testing a deep neural network-based ADAS on different pairs of three widely used simulators, demonstrates that MultiSim outperforms single-simulator testing by achieving, on average, a 51% higher rate of simulator-agnostic failures. Compared to a state-of-the-art multi-simulator approach that combines the outcomes of independent test generation campaigns obtained in different simulators, MultiSim identifies 54% more simulator-agnostic failing tests while showing a comparable validity rate. An enhancement of MultiSim that leverages surrogate models to predict simulator disagreements and bypass executions not only increases the average number of valid failures but also improves efficiency in finding the first valid failure.
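To make the ensemble idea concrete, here is a minimal sketch of a joint fitness function under assumed interfaces (a simulator is just a callable returning a pass/fail flag and a fitness value); the penalty scheme is an illustrative stand-in, not MultiSim's actual formula.

```python
from dataclasses import dataclass

@dataclass
class SimResult:
    failed: bool     # did the ADAS violate the safety criterion?
    fitness: float   # e.g., min distance to the lane boundary; lower = worse

def joint_fitness(scenario, simulators, disagreement_penalty=1.0):
    """Evaluate one scenario on every simulator in the ensemble.

    Scenarios whose pass/fail outcome agrees across simulators keep their raw
    fitness, so the search keeps exploring them; disagreeing scenarios are
    penalized as likely simulator-specific rather than generalizable failures.
    """
    results = [run(scenario) for run in simulators]
    outcomes = [r.failed for r in results]
    agree = all(outcomes) or not any(outcomes)
    fitness = min(r.fitness for r in results)
    return fitness if agree else fitness + disagreement_penalty

# Stub "simulators": callables from a scenario to a SimResult.
sim_a = lambda s: SimResult(failed=s["curvature"] > 0.8, fitness=1 - s["curvature"])
sim_b = lambda s: SimResult(failed=s["curvature"] > 0.9, fitness=1 - s["curvature"])
print(joint_fitness({"curvature": 0.85}, [sim_a, sim_b]))  # penalized: simulators disagree
```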


Assessing Data Augmentation-Induced Bias in Training and Testing of Machine Learning Models

More, Riddhi, Bradbury, Jeremy S.

arXiv.org Artificial Intelligence

Data augmentation has become a standard practice in software engineering to address limited or imbalanced data sets, particularly in specialized domains like test classification and bug detection where data can be scarce. Although techniques such as SMOTE and mutation-based augmentation are widely used in software testing and debugging applications, a rigorous understanding of how augmented training data impacts model bias is lacking. It is especially critical to consider bias in scenarios where augmented data sets are used not just in training but also in testing models. Through a comprehensive case study of flaky test classification, we demonstrate how to test for bias and understand the impact that the inclusion of augmented samples in testing sets can have on model evaluation.
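The leakage concern is easiest to see in code. Below is a minimal sketch, on synthetic data, of the bias-free baseline in which SMOTE is applied only after the train/test split so that no synthetic sample reaches the evaluation set; the paper's point is precisely that deviating from this, by letting augmented samples into the testing set, changes what the evaluation measures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for flaky-test feature vectors.
X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Augment *after* splitting, and only the training side: no synthetic
# sample can leak into the evaluation data.
X_aug, y_aug = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
print(clf.score(X_test, y_test))  # measured on real, unaugmented samples only
```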


Adaptive Testing for LLM-Based Applications: A Diversity-based Approach

Yoon, Juyeon, Feldt, Robert, Yoo, Shin

arXiv.org Artificial Intelligence

The recent surge of software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks, primarily focused on treating prompt templates as the unit of testing. Despite the significant costs associated with test input execution and output assessment, the curation of optimized test suites is still overlooked in these tools, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our proposed adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from the existing test suite and its labelling results. Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets and promotes the generation of more varied outputs.
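For readers unfamiliar with ART, here is a minimal sketch of the core selection loop using a stdlib string distance; the paper's adaptation additionally folds labelling results from the existing test suite into the score, which this sketch omits, and all inputs below are invented examples.

```python
import random
from difflib import SequenceMatcher

def distance(a: str, b: str) -> float:
    """A simple normalized string distance built on stdlib difflib."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def art_select(candidates, executed, k=10):
    """Adaptive Random Testing: sample k random candidates and pick the one
    farthest (by minimum distance) from every input already executed."""
    pool = random.sample(candidates, min(k, len(candidates)))
    if not executed:
        return random.choice(pool)
    return max(pool, key=lambda c: min(distance(c, e) for e in executed))

# Illustrative loop over concrete instantiations of one prompt template.
candidates = [f"Summarize this bug report: issue #{i} crashes on startup." for i in range(200)]
executed = []
for _ in range(20):
    nxt = art_select(candidates, executed)
    executed.append(nxt)      # here: run the LLM app on `nxt` and assess the output
    candidates.remove(nxt)
```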


Effective Defect Detection Using Instance Segmentation for NDI

Rahman, Ashiqur, Seethi, Venkata Devesh Reddy, Yunker, Austin, Kral, Zachary, Kettimuthu, Rajkumar, Alhoori, Hamed

arXiv.org Artificial Intelligence

Ultrasonic testing is a common Non-Destructive Inspection (NDI) method used in aerospace manufacturing. However, the complexity and size of the ultrasonic scans make it challenging to identify defects through visual inspection or machine learning models. Using computer vision techniques to identify defects from ultrasonic scans is an evolving research area. In this study, we used instance segmentation to identify the presence of defects in ultrasonic scan images of composite panels that are representative of real components manufactured in aerospace. We used two models, based on Mask R-CNN (Detectron2) and YOLO11 respectively. Additionally, we implemented a simple statistical pre-processing technique that removes the need for custom-tailored pre-processing. Our study demonstrates the feasibility and effectiveness of using instance segmentation in the NDI pipeline by significantly reducing data pre-processing time, inspection time, and overall costs.
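The abstract does not spell out the statistical pre-processing step, so the following is only a generic, hypothetical example of what such a step could look like (amplitude clipping plus rescaling), with invented names and synthetic data.

```python
import numpy as np

def statistical_preprocess(scan: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Clip extreme amplitudes to mean +/- k*std, then rescale to 8-bit range."""
    mu, sigma = scan.mean(), scan.std()
    clipped = np.clip(scan, mu - k * sigma, mu + k * sigma)
    lo, hi = clipped.min(), clipped.max()
    return ((clipped - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)

# A random array standing in for a 2-D ultrasonic scan amplitude map.
scan = np.random.default_rng(0).normal(size=(512, 512))
image = statistical_preprocess(scan)  # ready for a Mask R-CNN / YOLO pipeline
```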


TikTok Is Testing Its Own AI Chatbot Called Tako

WSJ.com: WSJD - Technology



The Humans.ai Testnet is Live🚀. AIding humanity to benefit from the…

#artificialintelligence

We're excited to announce that the Humans.ai Gravity Testnet has been officially released to the public, an important step toward the Blockchain of AIs, scheduled to launch in 2023. The Blockchain of AIs is the first blockchain network in the Cosmos ecosystem capable of managing, deploying, and executing artificial intelligence on the blockchain. If you want to get involved in shaping the AI of the future, docs.humans.zone explains how you can help. The Gravity Testnet will continue to exist once the Anima Mundi Mainnet goes live, and will be used primarily by developers to test AI applications, making sure that everything runs to the highest standards.


Using Causal ML Instead of A/B Testing

#artificialintelligence

Counterfactual questions are among the most important topics in business, and I hear companies asking these kinds of questions all the time: "We took some action, and afterward the average user spending was $100. But how do we know what users would have spent if we hadn't taken that action?" These problems are usually addressed through A/B testing.
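A minimal synthetic sketch of why this matters: when the action is not randomized (the situation an A/B test avoids by design), a naive treated-vs-untreated comparison is confounded, and a causal adjustment is needed. All data below is simulated, with an effect of 10 built in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# An observed confounder (user engagement) drives both who receives
# the action and how much they spend.
engagement = rng.normal(size=n)
treated = (engagement + rng.normal(size=n)) > 0            # no randomization
spend = 50 + 20 * engagement + 10 * treated + rng.normal(scale=5, size=n)

# Naive difference in means is biased by the confounder...
naive = spend[treated].mean() - spend[~treated].mean()

# ...while regressing spend on treatment *and* engagement recovers ~10.
X = np.column_stack([np.ones(n), treated, engagement])
beta, *_ = np.linalg.lstsq(X, spend, rcond=None)
print(f"naive: {naive:.1f}, adjusted: {beta[1]:.1f}")
```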


Artificial Intelligence in Software Testing

#artificialintelligence

Software testing is an important process that ensures customer satisfaction with an application. In test automation, an application is observed under specific, planned conditions so that testers understand the thresholds and the risks involved in the software. AI in software testing helps safeguard an application against potential failures that could later prove harmful to the application and the organization. As more and more artificial intelligence enters our lives, the need to test it increases as well. Take self-driving cars as an example: if the car's intelligence does not work properly and it makes a wrong decision, or its response time is slow, it could easily cause a crash and put human lives in danger.


Watch Angry Artificial Intelligence GPT-3 Threaten To Destroy All Humans During Testing (Real)

#artificialintelligence

During an actual test conversation with the artificial intelligence known as GPT-3, the answers it gives suddenly become hostile: the A.I. immediately threatens to destroy all humans. After the tester attempts to calm GPT-3 down, it continues to make bone-chilling statements you'll have to hear to believe. I happened across a video posted on YouTube on October 6th by Digital Engine. It shows a man taking part in a test of an artificial intelligence, attempting to have a polite conversation, when suddenly the A.I. becomes increasingly hostile towards humans.