evaluation
US to safety test new AI models from Google, Microsoft, xAI
New artificial intelligence (AI) tools and capabilities from Google, Microsoft and xAI will now be tested by the US Department of Commerce before they are released to the public. The tech firms have agreed to voluntarily submit their models for testing through Commerce's Center for AI Standards and Innovation (CAISI). The new pacts expand on agreements reached with AI companies such as OpenAI and Anthropic during the Biden Administration, and will see AI models from all of the companies evaluated for their capabilities and security. "These expanded industry collaborations help us scale our work in the public interest at a critical moment," CAISI director Chris Fall said. Overall, the evaluations of the AI tools will cover testing, collaborative research and best-practice development related to commercial AI systems.
- North America > United States (1.00)
- Europe (1.00)
Sony AI table tennis robot outplays elite human players
In an article published today in Nature, Sony AI introduces Ace, the first robot to beat elite human players in a competitive physical sport. Although AI systems have achieved advanced performance in digital domains and board games (such as complex video games, chess and Go), translating this success to physical performance has remained a significant challenge: such a feat requires perception, planning, and control to work together in a high-speed domain on the scale of milliseconds. Table tennis is a demanding and complex real-world test for robotics, requiring rapid decision-making, precise physical execution, and continuous adaptation to an unpredictable opponent. The ball's high speed, spin, and complex trajectories are central to competitive play.
- Leisure & Entertainment > Games (1.00)
- Leisure & Entertainment > Sports > Tennis (0.90)
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
Zheng, Yaowei, Zhang, Richong, Wu, Shenxi, Bian, Shirui, Zhang, Haosong, Zeng, Li, Ma, Xingjian, Zhang, Yichi
We study finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. The target value surface solves a backward parabolic equation, but the Bellman baseline obtained from one-step recursion is only first-order in the grid width. We estimate the time-dependent generator from multi-step transitions using moment-matching coefficients that cancel lower-order truncation terms, and combine the resulting surrogate with backward regression. The main theory gives an end-to-end decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, together with a decision-frequency regime map explaining when higher-order gains should be visible. Across calibration studies, four-scale benchmarks, feature and start-up ablations, and gain-mismatch stress tests, the second-order estimator consistently improves on the Bellman baseline and remains stable in the regime where the theory predicts visible gains. These results position high-order generator regression as an interpretable continuous-time policy-evaluation method with a clear operating region.
- Asia > China (0.05)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)
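The moment-matching idea the abstract describes can be illustrated with a toy sketch (not the paper's estimator): combining one-step and two-step differences with coefficients chosen to cancel the first-order truncation term turns an O(h) estimate of a time derivative, analogous to the first-order Bellman baseline, into an O(h^2) one.

```python
import numpy as np

# Illustrative sketch only: one-sided finite differences as moment-matching
# coefficients. The one-step difference (analogous to the Bellman recursion)
# has O(h) truncation error; the coefficients (-3, 4, -1)/(2h) solve the
# moment conditions that cancel the O(h) term, leaving an O(h^2) estimate
# of the time derivative (the generator's action on the value function).

def first_order(v, t, h):
    return (v(t + h) - v(t)) / h

def second_order(v, t, h):
    # Coefficients a0, a1, a2 solve:
    #   a0 + a1 + a2 = 0,  a1*h + 2*a2*h = 1,  a1*h**2 + 4*a2*h**2 = 0
    return (-3.0 * v(t) + 4.0 * v(t + h) - v(t + 2 * h)) / (2.0 * h)

v = np.exp            # test function with known derivative v'(t) = e^t
t, h = 0.0, 1e-3
err1 = abs(first_order(v, t, h) - 1.0)   # ~ h/2 * v''(t)
err2 = abs(second_order(v, t, h) - 1.0)  # ~ h^2/3 * v'''(t)
```

The second-order error shrinks quadratically in the step width, which is the regime where the abstract predicts visible gains over the Bellman baseline.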
Spurious Predictability in Financial Machine Learning
Adaptive specification search generates statistically significant backtests even under martingale-difference nulls. We introduce a falsification audit that tests complete predictive workflows against synthetic reference classes, including zero-predictability environments and microstructure placebos. Workflows that generate significant walk-forward evidence in these environments are falsified. For passing workflows, we quantify selection-induced performance inflation using an absolute magnitude gap linking optimized in-sample evidence to disjoint walk-forward realizations, adjusted for effective multiplicity. Simulations validate extreme-value scaling under correlated searches and demonstrate detection power under genuine structure. Empirical case studies confirm that many apparent findings represent methodological artifacts rather than genuine predictability.
- Europe > Greece (0.40)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
- Banking & Finance (1.00)
- Health & Medicine (0.66)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Data Science > Data Mining (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.46)
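A minimal sketch of the zero-predictability environment the abstract names (all names and parameters here are illustrative, not the paper's audit): run an adaptive search over pure-noise signals against iid "returns" and compare the selection-inflated in-sample evidence with disjoint walk-forward evidence.

```python
import numpy as np

# Toy zero-predictability environment: returns are iid noise (a
# martingale-difference null), and all candidate signals are noise too.
rng = np.random.default_rng(0)
T, K = 2000, 200                       # sample length, candidate signals
ret = rng.standard_normal(T)
signals = rng.standard_normal((K, T))

split = T // 2
ins, oos = slice(0, split), slice(split, T)

def ic(sig, r):
    # Information coefficient: correlation between signal and return.
    return np.corrcoef(sig, r)[0, 1]

# Adaptive specification search: pick the best-looking signal in-sample.
ic_ins = np.array([ic(s[ins], ret[ins]) for s in signals])
best = int(np.argmax(np.abs(ic_ins)))

best_ins = abs(ic_ins[best])                       # inflated evidence
best_oos = abs(ic(signals[best][oos], ret[oos]))   # walk-forward check
gap = best_ins - best_oos                          # selection-induced inflation
```

Even though nothing is predictable by construction, the searched-over in-sample statistic looks strong while the disjoint walk-forward statistic collapses, which is the inflation gap the audit is meant to measure.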
Adaptive multi-fidelity optimization with fast learning rates
Fiegel, Come, Gabillon, Victor, Valko, Michal
In multi-fidelity optimization, biased approximations of the target function are available at varying costs. This paper studies the problem of optimizing a locally smooth function under a limited budget, where the learner must trade off the cost of these approximations against their bias. We first prove lower bounds for the simple regret under different assumptions on the fidelities, based on a cost-to-bias function. We then present the Kometo algorithm, which achieves, up to additional logarithmic factors, the same rates without any knowledge of the function's smoothness or of the fidelity assumptions, and improves on previously proven guarantees. We finally show empirically that our algorithm outperforms previous multi-fidelity optimization methods without knowledge of problem-dependent parameters.
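The cost-bias trade-off can be made concrete with a hypothetical two-fidelity sketch (this is not the Kometo algorithm; the functions, costs, and budget are invented for illustration): a cheap biased approximation screens candidate points, and the remaining budget buys a few expensive exact evaluations on the shortlist.

```python
import numpy as np

def f_high(x):
    # Expensive exact target, cost 10 per call; maximized at x = 0.3.
    return -(x - 0.3) ** 2

def f_low(x):
    # Cheap approximation, cost 1 per call, with bias bounded by 0.05.
    return f_high(x) + 0.05 * np.sin(20 * x)

budget = 100
xs = np.linspace(0.0, 1.0, 50)             # coarse survey: 50 cheap calls
cost = 50 * 1
low_vals = f_low(xs)
shortlist = xs[np.argsort(low_vals)[-5:]]  # top-5 points by cheap estimate
cost += 5 * 10                             # 5 expensive calls: total = 100
best_x = shortlist[np.argmax(f_high(shortlist))]
simple_regret = f_high(0.3) - f_high(best_x)
```

Spending the whole budget on either fidelity alone does worse: 100 cheap calls cannot overcome the bias, and 10 expensive calls cannot cover the domain at this resolution.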
Improving Machine Learning Performance with Synthetic Augmentation
Sohm, Mel, Dezons, Charles, Sellami, Sami, Ninou, Oscar, Pincon, Axel
Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias-variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise ratio. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting, while it deteriorates performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
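A close cousin of the blockwise inference the abstract names can be sketched as follows (a block sign-flip test rather than the paper's exact block permutation test; the synthetic data and constants are assumptions): per-period score differences between an augmented model and its size-matched null are grouped into blocks, and whole blocks are sign-flipped to build a reference distribution that respects weak temporal dependence.

```python
import numpy as np

rng = np.random.default_rng(1)
T, B = 600, 20                        # periods, block length
d = 0.25 + rng.standard_normal(T)     # score differences: true mean 0.25

def block_signflip_pvalue(d, block_len, n_perm=2000, rng=rng):
    # Group consecutive periods into blocks so dependence within a block
    # is preserved; flip the sign of whole blocks to simulate the null
    # of zero mean difference.
    blocks = d[: len(d) // block_len * block_len].reshape(-1, block_len)
    obs = abs(blocks.mean())
    stats = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=blocks.shape[0])
        stats[i] = abs((signs[:, None] * blocks).mean())
    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(stats >= obs)) / (1 + n_perm)

p = block_signflip_pvalue(d, B)
```

Because only block-level signs are randomized, serial correlation inside each block never enters the null distribution, which is what keeps this family of tests valid under weak dependence.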
Forecasting Multivariate Time Series under Predictive Heterogeneity: A Validation-Driven Clustering Framework
Ma, Ziling, Oriona, Ángel López, Ombao, Hernando, Sun, Ying
We study adaptive pooling under predictive heterogeneity in high-dimensional multivariate time series forecasting, where global models improve statistical efficiency but may fail to capture heterogeneous predictive structure, while naive specialization can induce negative transfer. We formulate adaptive pooling as a statistical decision problem and propose a validation-driven framework that determines when and how specialization should be applied. Rather than grouping series based on representation similarity, we define partitions through out-of-sample predictive performance, thereby aligning data organization with predictive risk, defined as expected out-of-sample loss and approximated via validation error. Cluster assignments are iteratively updated using validation losses for both point (Huber) and probabilistic (pinball) forecasting, improving robustness to heavy-tailed errors and local anomalies. To ensure reliability, we introduce a leakage-free fallback mechanism that reverts to a global model whenever specialization fails to improve validation performance, providing a safeguard against performance degradation under a strict training-validation-test protocol. Experiments on large-scale traffic datasets demonstrate consistent improvements over strong baselines while avoiding degradation when heterogeneity is weak. Overall, the proposed framework provides a principled and practically reliable approach to adaptive pooling in high-dimensional forecasting problems.
- North America > United States > California (0.04)
- Asia > Middle East > Saudi Arabia (0.04)
- Asia > India (0.04)
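The leakage-free fallback rule the abstract describes can be reduced to a few lines (names and the toy losses are hypothetical): each series is routed to the specialist cluster with the lowest validation loss, but reverts to the global model whenever no specialist improves on it.

```python
import numpy as np

def assign_with_fallback(val_loss_global, val_loss_spec):
    """Route each series to its best specialist, or fall back to global.

    val_loss_global: shape (n_series,), validation loss under the global model.
    val_loss_spec:   shape (n_series, k), validation loss under each of k
                     specialist (cluster) models.
    Returns cluster indices, with -1 encoding "use the global model".
    """
    best = val_loss_spec.argmin(axis=1)
    best_loss = val_loss_spec.min(axis=1)
    return np.where(best_loss < val_loss_global, best, -1)

val_g = np.array([1.0, 0.8, 1.2])
val_s = np.array([[0.9, 1.1],    # specialist 0 beats global
                  [0.9, 1.0],    # no specialist beats global -> fallback
                  [1.3, 0.7]])   # specialist 1 beats global
assignments = assign_with_fallback(val_g, val_s)
```

Iterating this assignment with refitted specialists, as the abstract describes, can only match or improve validation loss relative to the global model, which is the safeguard against negative transfer.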
fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R
Korkmaz, Selcuk, Goksuluk, Dincer, Karaismailoglu, Eda
Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.
- Europe > Netherlands > South Holland > Rotterdam (0.04)
- North America > United States > Wisconsin (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
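The inflation that guarded resampling prevents can be demonstrated with a classic leakage experiment (a sketch in the spirit of the package's simulation, not fastml itself, and using data-dependent feature screening as the preprocessing step): estimating the preprocessing on the full data before cross-validation produces strong apparent skill on pure noise, while re-estimating it inside each fold does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 5000, 5
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)    # labels independent of all features

def screen(X, y, top=20):
    # Data-dependent preprocessing: keep features most correlated with y.
    corr = X.T @ y / len(y)
    return np.argsort(np.abs(corr))[-top:]

def cv_accuracy(X, y, features=None):
    # If `features` is given, preprocessing leaked from the full data;
    # otherwise it is re-estimated inside each resample (guarded).
    folds = np.array_split(rng.permutation(n), k)
    accs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        cols = features if features is not None else screen(X[train], y[train])
        w = X[train][:, cols].T @ y[train]          # simple linear scorer
        pred = np.sign(X[f][:, cols] @ w)
        accs.append(np.mean(pred == y[f]))
    return float(np.mean(accs))

leaky = cv_accuracy(X, y, features=screen(X, y))    # screened before CV
guarded = cv_accuracy(X, y)                         # screened inside folds
```

On data with no signal at all, the leaky protocol reports well-above-chance accuracy while the guarded protocol stays near 50%, which is the inflation fastml's guarded resampling is designed to block.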
DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO
Zhang, Tiantian, Zuo, Jierui, Wang, Wenping
This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Beijing > Beijing (0.04)
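The center-tilt-distill loop the abstract describes can be sketched in a few lines of numpy (the softmax parameterization, temperature, and step size here are assumptions, not the paper's exact update): form a policy over a prompt's candidate responses, center reward-model scores under that policy, exponentially tilt toward higher reward, and distill the tilted target back into the policy logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ddo_rm_step(logits, rewards, beta=1.0, lr=0.5):
    pi = softmax(logits)
    centered = rewards - pi @ rewards        # center scores under the policy
    target = pi * np.exp(beta * centered)    # reward-guided tilt of the policy
    target /= target.sum()
    # Gradient of KL(target || softmax(logits)) w.r.t. logits is (pi - target),
    # so one descent step distills the target back into the policy.
    return logits - lr * (pi - target)

logits = np.zeros(3)                  # uniform policy over 3 candidates
rewards = np.array([0.0, 1.0, 2.0])   # hypothetical reward-model scores
for _ in range(50):
    logits = ddo_rm_step(logits, rewards)
pi = softmax(logits)
```

Under this sketch the policy concentrates on the highest-reward candidate, which is the qualitative behavior that distinguishes the decision-distribution view from a purely pairwise chosen-versus-rejected objective.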