stress test
Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability
Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
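The acceptance rule in this abstract can be pictured with a small sketch. It is only an illustration of the three stated conditions, not the authors' implementation: the `EnvEffect` record, the `triangulate` function, and the `min_effect` threshold are hypothetical names and values introduced here, and the abstract does not say how effects are measured or thresholded.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EnvEffect:
    """Effects measured in one member of a reference family (hypothetical record)."""
    name: str
    ablation_drop: float  # loss in the target metric when the circuit is ablated
    patch_gain: float     # gain in the target metric when activations are patched in


def triangulate(effects: Sequence[EnvEffect], min_effect: float = 0.1) -> bool:
    """Accept a candidate circuit only if it passes in every environment.

    Necessity:   ablation_drop >= min_effect
    Sufficiency: patch_gain    >= min_effect
    Invariance:  both conditions hold for all members of the family, so the
                 effects share a sign and clear the magnitude threshold.
    The threshold value is an assumption made for this sketch.
    """
    if not effects:
        return False  # nothing measured, nothing to accept
    return all(
        e.ablation_drop >= min_effect and e.patch_gain >= min_effect
        for e in effects
    )


family = [EnvEffect("en", 0.40, 0.30), EnvEffect("hi", 0.35, 0.25)]
print(triangulate(family))
print(triangulate(family + [EnvEffect("ar", 0.02, 0.30)]))
```

A circuit that looks necessary and sufficient in one language but shows a negligible ablation effect in another is rejected, which is the single-environment failure the abstract says the rule is meant to filter.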
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests
Veitch, Victor, D'Amour, Alexander, Yadlowsky, Steve, Eisenstein, Jacob
Informally, a 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter. In machine learning, these have a know-it-when-you-see-it character; e.g., changing the gender of a sentence's subject changes a sentiment predictor's output. To check for spurious correlations, we can 'stress test' models by perturbing irrelevant parts of input data and seeing if model predictions change. In this paper, we study stress testing using the tools of causal inference. We introduce counterfactual invariance as a formalization of the requirement that changing irrelevant parts of the input shouldn't change model predictions.
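The stress test described here (perturb an irrelevant part of the input, then check whether the prediction changes) can be sketched as a simple flip-rate measurement. This is an empirical check only, not the paper's formal definition of counterfactual invariance; `invariance_violation_rate` and the toy predictor are hypothetical and exist purely for illustration.

```python
from typing import Callable, Iterable, Tuple


def invariance_violation_rate(
    predict: Callable[[str], int],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (original, perturbed) pairs on which the prediction flips."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one input pair")
    flips = sum(predict(a) != predict(b) for a, b in pairs)
    return flips / len(pairs)


def toy_predict(text: str) -> int:
    """Toy sentiment rule with a deliberate spurious dependence on 'she'."""
    words = text.lower().split()
    score = int("great" in words) - int("awful" in words) + int("she" in words)
    return int(score > 0)


# Each pair differs only in the subject's gender, which should not matter.
pairs = [
    ("he was great", "she was great"),
    ("he was fine", "she was fine"),
]
print(invariance_violation_rate(toy_predict, pairs))
```

A rate of zero on such pairs is evidence of invariance for the tested perturbation only; it says nothing about perturbations that were not tried.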
- Leisure & Entertainment (0.93)
- Media > Film (0.68)
LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
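The headline quantity in this abstract is the accuracy gap between original and paraphrased items. A minimal sketch of that computation follows, assuming predictions and gold answers are already available as aligned lists of option letters; the function names and the tiny example data are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Share of items where the predicted option matches the gold option."""
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def paraphrase_gap(
    orig_preds: Sequence[str],
    para_preds: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Accuracy on original items minus accuracy on their paraphrases."""
    return accuracy(orig_preds, gold) - accuracy(para_preds, gold)


# Illustrative data only: four multiple-choice items.
gold = ["A", "B", "C", "D"]
orig = ["A", "B", "C", "A"]  # 3 of 4 correct on the original wording
para = ["A", "B", "D", "A"]  # 2 of 4 correct on the paraphrased wording
print(paraphrase_gap(orig, para, gold))
```

A positive gap means the model does worse once the surface form changes, which is the pattern the abstract reports.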
Electromechanical Dynamics of the Heart: A Study of Cardiac Hysteresis During Physical Stress Test
Karimi, Sajjad, Karimi, Shirin, Shah, Amit J., Clifford, Gari D., Sameni, Reza
Cardiovascular diseases are best diagnosed using multiple modalities that assess both the heart's electrical and mechanical functions. While effective, imaging techniques like echocardiography and nuclear imaging are costly and not widely accessible. More affordable technologies, such as simultaneous electrocardiography (ECG) and phonocardiography (PCG), may provide valuable insights into electromechanical coupling and could be useful for prescreening in low-resource settings. Using physical stress test data from the EPHNOGRAM ECG-PCG dataset, collected from 23 healthy male subjects (age: 25.4 ± 1.9 years), we investigated electromechanical intervals (RR, QT, systolic, and diastolic) and their interactions during exercise, along with hysteresis between cardiac electrical activity and mechanical responses. Time delay analysis revealed distinct temporal relationships between QT, systolic, and diastolic intervals, with RR as the primary driver. The diastolic interval showed near-synchrony with RR, while QT responded to RR interval changes with an average delay of 10.5 s, and the systolic interval responded more slowly, with an average delay of 28.3 s. We examined QT-RR, systolic-RR, and diastolic-RR hysteresis, finding narrower loops for diastolic-RR and wider loops for systolic-RR. Significant correlations (average: 0.75) were found between heart rate changes and hysteresis loop areas, suggesting the equivalent circular area diameter as a promising biomarker for cardiac function under exercise stress. Deep learning models, including Long Short-Term Memory and Convolutional Neural Networks, estimated the QT, systolic, and diastolic intervals from RR data, confirming the nonlinear relationship between RR and other intervals. Findings highlight a significant cardiac memory effect, linking ECG and PCG morphology and timing to heart rate history.
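The abstract does not say how its time delay analysis was carried out. One common way to estimate the delay of a response series behind a driver series is to pick the lag that maximizes their correlation. The sketch below uses that substitute technique on made-up data; `pearson` and `best_lag` are hypothetical helpers, and lags are in samples rather than seconds.

```python
from statistics import mean
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def best_lag(driver: Sequence[float], response: Sequence[float], max_lag: int) -> int:
    """Lag (in samples) at which response[t + lag] best tracks driver[t]."""
    best, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        a = driver[: len(driver) - lag]
        b = response[lag:]
        if len(a) < 3:
            break  # too few overlapping samples to correlate
        r = pearson(a, b)
        if r > best_r:
            best, best_r = lag, r
    return best


# Made-up series: the response is the driver delayed by two samples.
driver = [0, 0, 1, 3, 2, 0, 0, 1, 4, 1, 0, 0]
response = [0, 0] + driver[:-2]
print(best_lag(driver, response, max_lag=5))
```

Multiplying the returned lag by the sampling interval would convert it to seconds, the unit in which the abstract reports its delays.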
- Europe > Portugal > Coimbra > Coimbra (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics
Cosma, Adrian, Ruseti, Stefan, Dascalu, Mihai, Caragea, Cornelia
Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate actual model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures, with examples labeled as having the highest difficulty showing markedly decreased performance and encompassing more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (10 more...)
InstaGrasp: An Entirely 3D Printed Adaptive Gripper with TPU Soft Elements and Minimal Assembly Time
Fabricating existing and popular open-source adaptive robotic grippers commonly involves using multiple professional machines, purchasing a wide range of parts, and tedious, time-consuming assembly processes. This poses a significant barrier to entry for some robotics researchers and drives others to opt for expensive commercial alternatives. To provide both parties with an easier and cheaper (under 100 GBP) solution, we propose a novel adaptive gripper design where every component (with the exception of actuators and the screws that come packaged with them) can be fabricated on a hobby-grade 3D printer, via a combination of inexpensive and readily available PLA and TPU filaments. This approach means that the gripper's tendons, flexure joints and finger pads are now printed, as a replacement for traditional string-tendons and molded urethane flexures and pads. A push-fit system results in an assembly time of under 10 minutes. The gripper design is also highly modular and requires only a few minutes to replace any part, leading to extremely user-friendly maintenance and part modifications. An extensive stress test has shown a level of durability more than suitable for research, whilst grasping experiments (with perturbations) using items from the YCB object set have also proven its mechanical adaptability to be highly satisfactory.
- North America > United States (0.05)
- Europe > United Kingdom > England > Greater London > London (0.04)
Should Bank Stress Tests Be Fair?
Regulatory stress tests have become one of the main tools for setting capital requirements at the largest U.S. banks. The Federal Reserve uses confidential models to evaluate bank-specific outcomes for bank-specific portfolios in shared stress scenarios. As a matter of policy, the same models are used for all banks, despite considerable heterogeneity across institutions; individual banks have contended that some models are not suited to their businesses. Motivated by this debate, we ask, what is a fair aggregation of individually tailored models into a common model? We argue that simply pooling data across banks treats banks equally but is subject to two deficiencies: it may distort the impact of legitimate portfolio features, and it is vulnerable to implicit misdirection of legitimate information to infer bank identity. We compare various notions of regression fairness to address these deficiencies, considering both forecast accuracy and equal treatment. In the setting of linear models, we argue for estimating and then discarding centered bank fixed effects as preferable to simply ignoring differences across banks. We present evidence that the overall impact can be material. We also discuss extensions to nonlinear models.
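For the linear setting, the proposal to estimate and then discard centered bank fixed effects can be illustrated with a one-feature sketch. It rests on assumptions that are not in the abstract: a single regressor, a within-bank (demeaned) slope estimate, and centering by the unweighted mean of the bank intercepts. The function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def fit_common_model(
    rows: Iterable[Tuple[str, float, float]],
) -> Tuple[float, float, Dict[str, float]]:
    """Fit y = a_bank + slope * x, then return the shared model.

    Returns (intercept, slope, centered_effects). The centered effects are
    what a common model would discard; they are returned only for inspection.
    """
    by_bank = defaultdict(list)
    for bank, x, y in rows:
        by_bank[bank].append((x, y))

    # Per-bank means, used to remove each bank's level before estimating the slope.
    means = {
        b: (mean([x for x, _ in obs]), mean([y for _, y in obs]))
        for b, obs in by_bank.items()
    }
    num = sum((x - means[b][0]) * (y - means[b][1])
              for b, obs in by_bank.items() for x, y in obs)
    den = sum((x - means[b][0]) ** 2
              for b, obs in by_bank.items() for x, _ in obs)
    if den == 0:
        raise ValueError("no within-bank variation in x")
    slope = num / den

    # Bank fixed effects, then center them (unweighted mean: an assumption).
    effects = {b: my - slope * mx for b, (mx, my) in means.items()}
    intercept = mean(list(effects.values()))
    centered = {b: a - intercept for b, a in effects.items()}
    return intercept, slope, centered


# Made-up data: two banks share a slope of 2 but differ in level.
rows = [("A", 0, 1), ("A", 1, 3), ("A", 2, 5),
        ("B", 0, 3), ("B", 2, 7), ("B", 4, 11)]
intercept, slope, centered = fit_common_model(rows)
print(intercept, slope, centered)
```

Unlike pooling the data and ignoring bank identity, this keeps level differences between banks from leaking into the slope, and the common model retains only the shared intercept and slope.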
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (6 more...)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
- Government > Regional Government > North America Government > United States Government (1.00)
- Banking & Finance > Trading (1.00)
- Banking & Finance > Loans (1.00)
- Banking & Finance > Economy (1.00)
DeScoD-ECG: Deep Score-Based Diffusion Model for ECG Baseline Wander and Noise Removal
Li, Huayu, Ditzler, Gregory, Roveda, Janet, Li, Ao
Objective: Electrocardiogram (ECG) signals commonly suffer from noise interference, such as baseline wander. High-quality and high-fidelity reconstruction of the ECG signals is of great significance to diagnosing cardiovascular diseases. Therefore, this paper proposes a novel ECG baseline wander and noise removal technology. Methods: We extended the diffusion model in a conditional manner that was specific to the ECG signals, namely the Deep Score-Based Diffusion model for Electrocardiogram baseline wander and noise removal (DeScoD-ECG). Moreover, we deployed a multi-shots averaging strategy that improved signal reconstructions. We conducted the experiments on the QT Database and the MIT-BIH Noise Stress Test Database to verify the feasibility of the proposed method. Baseline methods are adopted for comparison, including traditional digital filter-based and deep learning-based methods. Results: The quantitative evaluation results show that the proposed method obtained outstanding performance on four distance-based similarity metrics with at least 20% overall improvement compared with the best baseline method. Conclusion: This paper demonstrates the state-of-the-art performance of the DeScoD-ECG for ECG baseline wander and noise removal, which has better approximations of the true data distribution and higher stability under extreme noise corruptions. Significance: This study is one of the first to extend the conditional diffusion-based generative model for ECG noise removal, and the DeScoD-ECG has the potential to be widely used in biomedical applications.
- North America > United States > Arizona > Pima County > Tucson (0.14)
- North America > United States > New Jersey > Gloucester County > Glassboro (0.04)
- North America > United States > Massachusetts (0.04)
- (2 more...)
Using machine learning to forecast amine emissions
Global warming is partly due to the vast amount of carbon dioxide that we release, mostly from power generation and industrial processes, such as making steel and cement. For a while now, chemical engineers have been exploring carbon capture, a process that can separate carbon dioxide and store it in ways that keep it out of the atmosphere. This is done in dedicated carbon-capture plants, whose chemical process involves amines, compounds that are already used to capture carbon dioxide from natural gas processing and refining plants. Amines are also used in certain pharmaceuticals, epoxy resins, and dyes. The problem is that amines are potentially harmful to the environment and can pose a health hazard, making it essential to mitigate their impact.
- Materials > Chemicals > Commodity Chemicals > Petrochemicals (1.00)
- Energy > Oil & Gas > Downstream (1.00)