AITopics | bernstein

Collaborating Authors

bernstein

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Low Rank for Rank: Uncertainty-Aware Task-Specific LLM Ranking under Sparse Pairwise Comparisons

Li, Jiachun, Simchi-Levi, David, Sun, Will Wei

arXiv.org Machine LearningMay-29-2026

Pairwise human-preference platforms such as Chatbot Arena have become central to large language model (LLM) evaluation, yet reliable task-specific ranking remains challenging. Global leaderboards mask task heterogeneity, while ranking each fine-grained task independently is unstable under sparse, imbalanced comparisons. We propose a low-rank framework for task-specific LLM ranking from sparse pairwise comparisons, modeling the task-by-model ability matrix $Θ^\star \in \mathbb{R}^{d_t \times d_m}$ as low rank so that information is shared across related tasks while task-specific differences are preserved. We first develop a max-norm ($\ell_\infty$) accurate estimator for the latent scores, combining a convex initializer with alternating-minimization refinement, and prove task-wise top-$K$ recovery guarantees under sparse sampling. Our main contribution is an uncertainty quantification framework for task-specific ranking. We construct cross-fitted one-step debiased estimators for fixed score contrasts -- such as the task-specific ability gap between two models -- yielding asymptotically valid confidence intervals that attain the semiparametric efficiency bound. We then extend the inference to the high-dimensional ranking regime, where per-task ranks and top-$K$ membership are determined by many dependent score-gap hypotheses. Using Gaussian and multiplier-bootstrap calibration, we obtain simultaneous confidence sets for per-task ranks and valid top-$K$ membership tests across many tasks and models. Experiments on synthetic data and Chatbot Arena show that low-rank sharing improves sample efficiency over independent task-wise Bradley-Terry estimation and produces tighter, better-calibrated ranking certificates, with the largest gains in the sparse regime typical of real LLM benchmarks.

artificial intelligence, large language model, natural language, (18 more...)

arXiv.org Machine Learning

2605.29395

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Unleashing the Power of Randomization in Auditing Differentially Private ML

Neural Information Processing SystemsApr-29-2026, 20:34:39 GMT

We present a rigorous methodology for auditing differentially private machine learning algorithms by adding multiple carefully designed examples called canaries. We take a first principles approach based on three key components. First, we introduce Lifted Differential Privacy (LiDP) which expands the definition of differential privacy to handle randomized datasets. This gives us the freedom to design randomized canaries. Second, we audit LiDP by trying to distinguish between the model trained with K canaries versus K 1 canaries in the dataset, leaving one canary out. By drawing the canaries i.i.d., LiDP can leverage the symmetry in the design and reuse each privately trained model to run multiple statistical tests, one for each canary. Third, we introduce novel confidence intervals that take advantage of the multiple test statistics by adapting to the empirical higher-order correlations. Together, this new recipe demonstrates significant improvements in sample complexity, both theoretically and empirically, using synthetic and real data. Further, recent advances in designing stronger canaries can be readily incorporated into the new framework.

artificial intelligence, confidence interval, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

536eecee295b92db6b32194e269541f8-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 22:45:05 GMT

artificial intelligence, machine learning, probability, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Conformal Risk Control under Non-Monotone Losses: Theory and Finite-Sample Guarantees

Aldirawi, Tareq, Li, Yun, Guo, Wenge

arXiv.org Machine LearningApr-21-2026

Conformal risk control (CRC) provides distribution-free guarantees for controlling the expected loss at a user-specified level. Existing theory typically assumes that the loss decreases monotonically with a tuning parameter that governs the size of the prediction set. However, this assumption is often violated in practice, where losses may behave non-monotonically due to competing objectives such as coverage and efficiency. In this paper, we study CRC under non-monotone loss functions when the tuning parameter is selected from a finite grid, a setting commonly arising in thresholding and discretized decision rules. Revisiting a known counterexample, we show that the validity of CRC without monotonicity depends critically on the relationship between the calibration sample size and the grid resolution. In particular, reliable risk control can still be achieved when the calibration sample is sufficiently large relative to the grid size. We establish a finite-sample guarantee for bounded losses over a grid of size $m$, showing that the excess risk above the target level $α$ scales on the order of $\sqrt{\log(m)/n}$, where $n$ is the calibration sample size. A matching lower bound demonstrates that this rate is minimax optimal. We also derive refined guarantees under additional structural conditions, including Lipschitz continuity and monotonicity, and extend the analysis to settings with distribution shift via importance weighting. Numerical experiments on synthetic multilabel classification and real object detection data illustrate the practical implications of non-monotonicity. Methods that explicitly account for finite-sample uncertainty achieve more stable risk control than approaches based on monotonicity transformations, while maintaining competitive prediction set sizes.

artificial intelligence, machine learning, risk control, (19 more...)

arXiv.org Machine Learning

2604.01502

Country: