testability
Testable and Actionable Calibration for Full Swap Regret
Bairaktari, Konstantina, Hu, Lunjia, Nguyen, Huy L., Ullman, Jonathan
AI generated predictions increasingly inform decision making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, there is no existing calibration measure that fully satisfies both properties, and all existing measures relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL compared to other measures lead to better performance in practice.
Executable Governance for AI: Translating Policies into Rules Using LLMs
Datla, Gautam Varma, Vurity, Anudeep, Dash, Tejaswani, Ahmad, Tazeem, Adnan, Mohd, Rafi, Saima
AI policy guidance is predominantly written as prose, which practitioners must first convert into executable rules before frameworks can evaluate or enforce them. This manual step is slow, error-prone, difficult to scale, and often delays the use of safeguards in real-world deployments. To address this gap, we present Policy-to-Tests (P2T), a framework that converts natural-language policy documents into normalized, machine-readable rules. The framework comprises a pipeline and a compact domain-specific language (DSL) that encodes hazards, scope, conditions, exceptions, and required evidence, yielding a canonical representation of extracted rules. To test the framework beyond a single policy, we apply it across general frameworks, sector guidance, and enterprise standards, extracting obligation-bearing clauses and converting them into executable rules. These AI-generated rules closely match strong human baselines on span-level and rule-level metrics, with robust inter-annotator agreement on the gold set. To evaluate downstream behavioral and safety impact, we add HIPAA-derived safeguards to a generative agent and compare it with an otherwise identical agent without guardrails. An LLM-based judge, aligned with gold-standard criteria, measures violation rates and robustness to obfuscated and compositional prompts. Detailed results are provided in the appendix. We release the codebase, DSL, prompts, and rule sets as open-source resources to enable reproducible evaluation.
Testability of Instrumental Variables in Additive Nonlinear, Non-Constant Effects Models
Guo, Xichen, Li, Zheng, Huang, Biwei, Zeng, Yan, Geng, Zhi, Xie, Feng
We address the issue of the testability of instrumental variables derived from observational data. Most existing testable implications are centered on scenarios where the treatment is a discrete variable, e.g., instrumental inequality (Pearl, 1995), or where the effect is assumed to be constant, e.g., instrumental variables condition based on the principle of independent mechanisms (Burauel, 2023). However, treatments can often be continuous variables, such as drug dosages or nutritional content levels, and non-constant effects may occur in many real-world scenarios. In this paper, we consider an additive nonlinear, non-constant effects model with unmeasured confounders, in which treatments can be either discrete or continuous, and propose an Auxiliary-based Independence Test (AIT) condition to test whether a variable is a valid instrument. We first show that if the candidate instrument is valid, then the AIT condition holds. Moreover, we illustrate the implications of the AIT condition and demonstrate that, in certain conditions, AIT conditions are necessary and sufficient to detect all invalid IVs. We also extend the AIT condition to include covariates and introduce a practical testing algorithm. Experimental results on both synthetic and three different real-world datasets show the effectiveness of our proposed condition.
Natural Language Requirements Testability Measurement Based on Requirement Smells
Zakeri-Nasrabadi, Morteza, Parsa, Saeed
Requirements form the basis for defining software systems' obligations and tasks. Testable requirements help prevent failures, reduce maintenance costs, and make it easier to perform acceptance tests. However, despite the importance of measuring and quantifying requirements testability, no automatic approach for measuring requirements testability has been proposed based on the requirements smells, which are at odds with the requirements testability. This paper presents a mathematical model to evaluate and rank the natural language requirements testability based on an extensive set of nine requirements smells, detected automatically, and acceptance test efforts determined by requirement length and its application domain. Most of the smells stem from uncountable adjectives, context-sensitive, and ambiguous words. A comprehensive dictionary is required to detect such words. We offer a neural word-embedding technique to generate such a dictionary automatically. Using the dictionary, we could automatically detect Polysemy smell (domain-specific ambiguity) for the first time in 10 application domains. Our empirical study on nearly 1000 software requirements from six well-known industrial and academic projects demonstrates that the proposed smell detection approach outperforms Smella, a state-of-the-art tool, in detecting requirements smells. The precision and recall of smell detection are improved with an average of 0.03 and 0.33, respectively, compared to the state-of-the-art. The proposed requirement testability model measures the testability of 985 requirements with a mean absolute error of 0.12 and a mean squared error of 0.03, demonstrating the model's potential for practical use.
An ensemble meta-estimator to predict source code testability
Zakeri-Nasrabadi, Morteza, Parsa, Saeed
Unlike most other software quality attributes, testability cannot be evaluated solely based on the characteristics of the source code. The effectiveness of the test suite and the budget assigned to the test highly impact the testability of the code under test. The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness. Therefore, testability can be measured based on the coverage and number of test cases provided by a test suite, considering the test budget. This paper offers a new equation to estimate testability regarding the size and coverage of a given test suite. The equation has been used to label 23,000 classes belonging to 110 Java projects with their testability measure. The labeled classes were vectorized using 262 metrics. The labeled vectors were fed into a family of supervised machine learning algorithms, regression, to predict testability in terms of the source code metrics. Regression models predicted testability with an R2 of 0.68 and a mean squared error of 0.03, suitable in practice. Fifteen software metrics highly affecting testability prediction were identified using a feature importance analysis technique on the learned model. The proposed models have improved mean absolute error by 38% due to utilizing new criteria, metrics, and data compared with the relevant study on predicting branch coverage as a test criterion. As an application of testability prediction, it is demonstrated that automated refactoring of 42 smelly Java classes targeted at improving the 15 influential software metrics could elevate their testability by an average of 86.87%.
Taking Machine Learning from Research to Production
We discuss the use of Machine Learning pipeline architectures for implementing production ML applications, and in particular we review Google's experience with TensorFlow Extended (TFX). An ML application in production must address all of the issues of modern software development methodology, as well as issues unique to ML and data science. Most of the focus in the ML community is on research, which is exciting and important. Equally important however is bringing that research to production applications to solve real-world problems, but the issues and approaches for doing that are often poorly understood. An ML application in production must address all of the issues of modern software development methodology, as well as issues unique to ML and data science.
On the Testability of BDI Agent Systems
Before deploying a software system we need to assure ourselves (and stakeholders) that the system will behave correctly. This assurance is usually done by testing the system. However, it is intuitively obvious that adaptive systems, including agent-based systems, can exhibit complex behaviour, and are thus harder to test. In this paper we examine this "obvious intuition" in the case of Belief-Desire-Intention (BDI) agents. We analyse the size of the behaviour space of BDI agents and show that although the intuition is correct, the factors that influence the size are not what we expected them to be. Specifically, we found that the introduction of failure handling had a much larger effect on the size of the behaviour space than we expected. We also discuss the implications of these findings on the testability of BDI agents.