validity
- North America > United States > Illinois (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
Discrete Object Generation with Reversible Inductive Construction
The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.
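The alternating corruption-and-reconstruction chain described above can be sketched in a few lines. The callables `corrupt`, `reconstruct`, and `is_valid` below are hypothetical placeholders for the validity-preserving local operations, the learned reconstruction distribution, and the domain constraints; this is an illustration of the sampling loop, not the authors' implementation.

```python
def sample_chain(x0, corrupt, reconstruct, is_valid, n_steps=100):
    """Sketch of the alternating Markov chain described in the abstract.

    corrupt      -- applies a validity-preserving local edit (placeholder)
    reconstruct  -- samples a repair from the learned reconstruction
                    distribution (placeholder)
    is_valid     -- checks domain constraints, e.g. molecular valence rules
                    or Laman graph conditions (placeholder)
    """
    x = x0
    for _ in range(n_steps):
        x_tilde = corrupt(x)           # valid, but pushed off the data distribution
        x_next = reconstruct(x_tilde)  # learned attempt to undo the corruption
        if is_valid(x_next):           # keep the chain inside the valid set
            x = x_next
    return x
```

Because both the corruption and the reconstruction act through local, validity-preserving edits, the learner never has to marginalize over full construction histories; it only has to model small repairs.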
Falsifying Predictive Algorithms
Empirical investigations into unintended model behavior often show that the algorithm is predicting an outcome other than the one intended. These exposés highlight the need to identify when algorithms predict unintended quantities - ideally before deploying them into consequential settings. We propose a falsification framework that provides a principled statistical test for discriminant validity: the requirement that an algorithm predict intended outcomes better than impermissible ones. Drawing on falsification practices from causal inference, econometrics, and psychometrics, our framework compares calibrated prediction losses across outcomes to assess whether the algorithm exhibits discriminant validity with respect to a specified impermissible proxy. In settings where the target outcome is difficult to observe, multiple permissible proxy outcomes may be available; our framework accommodates both this setting and the case with a single permissible proxy. Throughout, we use nonparametric hypothesis testing methods that make minimal assumptions on the data-generating process. We illustrate the method in an admissions setting, where the framework establishes discriminant validity with respect to gender but fails to establish discriminant validity with respect to race. This demonstrates how falsification can serve as an early validity check, prior to fairness or robustness analyses. We also provide an analysis in a criminal justice setting, where we highlight the limitations of our framework and emphasize the need for complementary approaches to assess other aspects of construct validity and external validity.
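A minimal sketch of the loss-comparison idea: compare a calibrated score's loss against the intended outcome with its loss against the impermissible proxy, and test whether the former is systematically smaller. The squared loss and the sign-flip permutation test below are illustrative assumptions, not the paper's exact nonparametric procedure.

```python
import numpy as np

def discriminant_validity_test(scores, intended, impermissible, n_perm=10_000, seed=0):
    """Paired permutation test of whether the algorithm's scores track the
    intended outcome better than the impermissible proxy.  Squared loss and
    the sign-flip null are illustrative choices, not the paper's test."""
    scores, intended, impermissible = map(np.asarray, (scores, intended, impermissible))
    d = (scores - intended) ** 2 - (scores - impermissible) ** 2  # per-unit loss gap
    observed = d.mean()                       # negative => better on the intended outcome
    rng = np.random.default_rng(seed)
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (flips * d).mean(axis=1)           # null: no systematic loss difference
    p_value = float(np.mean(null <= observed))
    return float(observed), p_value
```

A small p-value would be evidence for discriminant validity with respect to the chosen proxy; a large one means discriminant validity is not established, as in the race comparison mentioned above.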
- North America > United States > California > Alameda County > Berkeley (0.14)
- Europe > United Kingdom (0.14)
- North America > United States > Pennsylvania (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Health & Medicine > Consumer Health (0.67)
- Law > Criminal Law (0.66)
Task-Agnostic Machine-Learning-Assisted Inference
Machine learning (ML) is playing an increasingly important role in scientific research. In conjunction with classical statistical approaches, ML-assisted analytical strategies have shown great promise in accelerating research findings. This has also opened a whole field of methodological research focusing on integrative approaches that leverage both ML and statistics to tackle data science challenges. One type of study that has quickly gained popularity employs ML to predict unobserved outcomes in massive samples, and then uses predicted outcomes in downstream statistical inference. However, existing methods designed to ensure the validity of this type of post-prediction inference are limited to very basic tasks such as linear regression analysis.
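As an illustration of this post-prediction setting (not the task-agnostic method the abstract refers to), the sketch below shows a rectified mean estimate in the spirit of prediction-powered inference: ML predictions on the massive unlabeled sample are debiased using the prediction error observed on a small labeled sample.

```python
import numpy as np

def rectified_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Illustrative post-prediction estimate of a population mean.

    y_labeled       -- observed outcomes on the small labeled sample
    preds_labeled   -- ML predictions on that same labeled sample
    preds_unlabeled -- ML predictions on the massive unlabeled sample

    The average prediction error measured on the labeled sample is used to
    debias the mean of the predictions on the unlabeled sample.  This mirrors
    prediction-powered inference and is only a sketch of the setting, not the
    method proposed in the abstract above.
    """
    bias = np.mean(np.asarray(preds_labeled) - np.asarray(y_labeled))
    return float(np.mean(preds_unlabeled) - bias)
```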
How to Scale Your EMA
Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; in stochastic gradient descent, for example, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.
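The abstract does not state the rule's form; the sketch below assumes the exponential adjustment `rho ** kappa` for the EMA momentum when the batch size is scaled by a factor `kappa`, alongside the standard linear learning-rate scaling for SGD quoted above. Treat it as an illustration rather than a verbatim statement of the paper's rule.

```python
def scale_hyperparams(lr, ema_momentum, kappa):
    """Adjust hyperparameters when the batch size is multiplied by kappa.

    Linear learning-rate scaling is the standard SGD rule mentioned in the
    abstract; raising the EMA momentum to the power kappa is the exponential
    form assumed here for the EMA scaling rule.
    """
    return lr * kappa, ema_momentum ** kappa

def ema_update(ema_params, target_params, momentum):
    """One model-EMA step: move each EMA parameter toward its target."""
    return [momentum * e + (1.0 - momentum) * t
            for e, t in zip(ema_params, target_params)]

# Example: scaling batch size 8x from a (lr=0.1, momentum=0.999) baseline.
new_lr, new_momentum = scale_hyperparams(0.1, 0.999, kappa=8)
```

With a larger batch there are fewer optimizer steps per epoch, so the momentum must shrink (0.999 ** 8 ≈ 0.992) for the EMA to trace the same trajectory per epoch.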
Circuits, Features, and Heuristics in Molecular Transformers
Varadi, Kristof, Marosi, Mark, Antal, Peter
Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules to reveal the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that mechanistic insights can translate to predictive performance in various practical settings.
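A minimal sparse autoencoder of the kind used to extract feature dictionaries from transformer activations is sketched below; the dictionary size, L1 penalty weight, and plain untied linear encoder/decoder are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over transformer activations.  An overcomplete ReLU code
    is trained to reconstruct the activations under an L1 sparsity penalty;
    the decoder columns then act as a feature dictionary."""

    def __init__(self, d_model=512, d_dict=4096, l1_coef=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        self.l1_coef = l1_coef

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature code
        recon = self.decoder(features)
        loss = ((recon - activations) ** 2).mean() + self.l1_coef * features.abs().mean()
        return features, recon, loss
```

Features whose activations line up with syntactic or chemical events in the molecular string would then be candidates for the chemically relevant patterns the abstract describes.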
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Materials > Chemicals > Commodity Chemicals (0.46)
Establishing Validity for Distance Functions and Internal Clustering Validity Indices in Correlation Space
Degen, Isabella, Abdallah, Zahraa S, Brown, Kate Robson, Reeve, Henry W J
Internal clustering validity indices (ICVIs) assess clustering quality without ground truth labels. Comparative studies consistently find that no single ICVI outperforms others across datasets, leaving practitioners without principled ICVI selection. We argue that inconsistent ICVI performance arises because studies evaluate ICVIs on how well they match human labels rather than on the quality of the structure discovered in the data, and because they use datasets whose structure type and quality are not formally quantified. Structure type refers to the mathematical organisation in data that clustering aims to discover. Validity theory requires a theoretical definition of clustering quality, which depends on structure type. We demonstrate this through the first validity assessment of clustering quality measures for correlation patterns, a structure type that arises from clustering time series by correlation relationships. We formalise 23 canonical correlation patterns as the theoretical optimal clustering and use synthetic data modelling this structure with controlled perturbations to evaluate validity across content, criterion, construct, and external validity. Our findings show that the Silhouette Width Criterion (SWC) and Davies-Bouldin Index (DBI) are valid for correlation patterns, whilst the Calinski-Harabasz (VRC) and Pakhira-Bandyopadhyay-Maulik (PBM) indices fail. Simple Lp norm distances achieve validity, whilst correlation-specific functions fail structural, criterion, and external validity. These results differ from previous studies where VRC and PBM performed well, demonstrating that validity depends on structure type. Our structure-type-specific validation method provides both practical guidance (quality thresholds SWC > 0.9, DBI < 0.15) and a methodological template for establishing validity for other structure types.
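A small sketch of how the two indices the study finds valid might be computed with scikit-learn is shown below; the random data merely stands in for the synthetic correlation-pattern segments and illustrates the API, while the thresholds in the comments are those quoted in the abstract.

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Illustrative stand-in for the evaluation loop: score a candidate clustering
# with the two ICVIs found valid for correlation patterns.  The random data
# below is NOT the paper's 23 canonical patterns or perturbation scheme.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))         # e.g. per-segment correlation vectors
labels = rng.integers(0, 3, size=300)  # candidate clustering to be assessed

swc = silhouette_score(X, labels)      # Silhouette Width Criterion; threshold > 0.9
dbi = davies_bouldin_score(X, labels)  # Davies-Bouldin Index; threshold < 0.15
print(f"SWC = {swc:.3f} (want > 0.9), DBI = {dbi:.3f} (want < 0.15)")
```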
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > California > Orange County > Irvine (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.67)
- Information Technology > Data Science > Data Mining (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.45)
Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code
Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present Flowchart2Mermaid, a lightweight web system that uses vision-language models, guided by a detailed system prompt, to convert flowchart images into editable Mermaid.js code, a markup language for visual workflows. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant. Unlike prior image-to-diagram tools, our approach produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. We further introduce evaluation metrics to assess structural accuracy, flow correctness, syntax validity, and completeness across multiple models.
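The abstract does not define its syntax-validity metric; the sketch below shows one rough way such a check could look for Mermaid flowchart output. Both the embedded Mermaid string and the two regular expressions are illustrative assumptions, not the paper's metric.

```python
import re

MERMAID_EXAMPLE = """flowchart TD
    A[Start] --> B{Valid input?}
    B -->|yes| C[Process]
    B -->|no| D[Show error]
    C --> E[End]
"""

def rough_syntax_validity(code: str) -> bool:
    """Very rough Mermaid flowchart check: a diagram header plus at least
    one well-formed edge.  Purely illustrative of the kind of automated
    syntax-validity signal an evaluation might use."""
    has_header = bool(re.match(r"\s*(flowchart|graph)\s+(TD|TB|LR|RL|BT)", code))
    has_edge = bool(re.search(r"\w+.*-->.*\w+", code))
    return has_header and has_edge

print(rough_syntax_validity(MERMAID_EXAMPLE))  # True for the sample above
```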
- Europe > United Kingdom > Northern Ireland (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.76)
Eval Factsheets: A Structured Framework for Documenting AI Evaluations
Bordes, Florian, Ross, Candace, Kao, Justine T, Spiliopoulou, Evangelia, Williams, Adina
The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models -- which benefit from structured documentation frameworks like Datasheets and Model Cards -- evaluation methodologies lack systematic documentation standards. We introduce Eval Factsheets, a structured, descriptive framework for documenting AI system evaluations through a comprehensive taxonomy and questionnaire-based approach. Our framework organizes evaluation characteristics across five fundamental dimensions: Context (Who made the evaluation and when?), Scope (What does it evaluate?), Structure (With what is the evaluation built?), Method (How does it work?), and Alignment (In what ways is it reliable/valid/robust?). We implement this taxonomy as a practical questionnaire spanning five sections with mandatory and recommended documentation elements. Through case studies on multiple benchmarks, we demonstrate that Eval Factsheets effectively captures diverse evaluation paradigms -- from traditional benchmarks to LLM-as-judge methodologies -- while maintaining consistency and comparability. We hope Eval Factsheets will be incorporated into both existing and newly released evaluation frameworks and will lead to greater transparency and reproducibility.
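One way such a factsheet might be captured programmatically is sketched below, with one field per dimension named in the abstract; the concrete fields and example values are assumptions, not the framework's actual questionnaire items.

```python
from dataclasses import dataclass, field

@dataclass
class EvalFactsheet:
    """Illustrative container mirroring the five dimensions in the abstract;
    the keys used inside each dict are made up for the example."""
    context: dict = field(default_factory=dict)    # Who made the evaluation and when?
    scope: dict = field(default_factory=dict)      # What does it evaluate?
    structure: dict = field(default_factory=dict)  # With what is the evaluation built?
    method: dict = field(default_factory=dict)     # How does it work?
    alignment: dict = field(default_factory=dict)  # In what ways is it reliable/valid/robust?

sheet = EvalFactsheet(
    context={"authors": "hypothetical team", "release_date": "2025"},
    scope={"capability": "instruction following"},
    method={"protocol": "LLM-as-judge", "metric": "win rate"},
)
```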
- Health & Medicine (0.46)
- Law (0.46)
- Information Technology > Security & Privacy (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Platform-independent experiments on social media
Social media is an important source of political information, yet there is little external oversight of platforms' ever-changing algorithms and policies. This opacity presents a major problem: Conducting a real-world experiment on the causal effects of platform features generally requires the collaboration of the platform being studied, which rarely happens, and even when it does, future platform changes may invalidate prior findings. The authors introduce a methodological paradigm for testing the effect of social media on partisan animosity without platform collaboration by reranking users' existing feeds using large language models (LLMs) and a browser extension. They find that changing the visibility of polarizing content can influence people's feelings about opposing partisans. Social media is in a period of upheaval.
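The reranking idea described in the summary can be illustrated in a few lines: score each post in the user's existing feed for polarizing content and reorder accordingly. The `score_polarization` callable is a hypothetical stand-in for the LLM call, and the browser-extension plumbing is omitted; this is not the authors' implementation.

```python
def rerank_feed(posts, score_polarization, demote=True):
    """Reorder a user's existing feed by an LLM-derived polarization score.

    posts              -- list of post texts already present in the feed
    score_polarization -- hypothetical callable wrapping the LLM judgment
    demote             -- if True, the most polarizing posts sink to the bottom
    """
    scored = [(score_polarization(p), p) for p in posts]
    scored.sort(key=lambda sp: sp[0], reverse=not demote)
    return [post for _, post in scored]
```

Because only the ordering of content the user would already see is changed, the experiment needs no cooperation from the platform itself, which is the point of the paradigm described above.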
- Information Technology > Services (0.48)
- Government > Voting & Elections (0.30)