Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

arXiv.org Machine Learning

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g., a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
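
The two goals named in the abstract, limiting the number of facts and flattening their frequency distribution, can be illustrated with a toy sketch. This is not the paper's method: the paper selects data from the training loss alone, whereas this sketch assumes each training sample already carries a known fact identifier. The function name and the parameters `max_facts` and `cap_per_fact` are hypothetical.

```python
from collections import Counter


def flatten_fact_distribution(fact_ids, max_facts, cap_per_fact):
    """Return indices of samples to keep.

    Assumption (not from the paper): every sample is labelled with the fact
    it expresses. We keep the `max_facts` most frequent facts and at most
    `cap_per_fact` samples of each, which flattens a skewed distribution.
    """
    counts = Counter(fact_ids)
    # Facts that survive the cut: the most frequent ones.
    kept = {fact for fact, _ in counts.most_common(max_facts)}

    seen = Counter()
    selected = []
    for index, fact in enumerate(fact_ids):
        # Keep a sample only while its fact is still under the cap.
        if fact in kept and seen[fact] < cap_per_fact:
            seen[fact] += 1
            selected.append(index)
    return selected


# Skewed toy corpus: fact "a" appears 8 times, "d" only once.
corpus = ["a"] * 8 + ["b"] * 4 + ["c"] * 2 + ["d"]
kept_indices = flatten_fact_distribution(corpus, max_facts=3, cap_per_fact=2)
```

After selection, each retained fact appears the same number of times, so no single fact dominates the training budget.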


A unifying view of contrastive learning, importance sampling, and bridge sampling for energy-based models

arXiv.org Machine Learning

In the last decades, energy-based models (EBMs) have become an important class of probabilistic models in which a component of the likelihood is intractable and therefore cannot be evaluated explicitly. Consequently, parameter estimation in EBMs is challenging for conventional inference methods. In this work, we provide a unified framework that connects noise contrastive estimation (NCE), reverse logistic regression (RLR), multiple importance sampling (MIS), and bridge sampling within the context of EBMs. We further show that these methods are equivalent under specific conditions. This unified perspective clarifies relationships among existing methods and enables the development of new estimators, with the potential to improve statistical and computational efficiency. Furthermore, this study helps elucidate the success of NCE in terms of its flexibility and robustness, while also identifying scenarios in which its performance can be further improved. Hence, rather than being a purely descriptive review, this work offers a unifying perspective and additional methodological contributions. The MATLAB code used in the numerical experiments is also made freely available to support the reproducibility of the results.
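
As a rough illustration of one of the methods the abstract connects, the sketch below writes out the standard noise contrastive estimation objective: a logistic classifier separates data from noise samples, and the classifier's logit is the log-ratio of the unnormalized model density to the scaled noise density. The function names are hypothetical, and this is plain Python rather than the authors' MATLAB code.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def nce_loss(data, noise, log_model, log_noise):
    """Negative NCE objective, averaged over the data samples.

    `log_model` is the log of the (possibly unnormalized) model density,
    including any learnable log-normalizer; `log_noise` is the log-density
    of the known noise distribution.
    """
    nu = len(noise) / len(data)  # noise-to-data sample ratio

    def logit(u):
        return log_model(u) - math.log(nu) - log_noise(u)

    total = sum(log_sigmoid(logit(x)) for x in data)    # data labelled 1
    total += sum(log_sigmoid(-logit(y)) for y in noise)  # noise labelled 0
    return -total / len(data)


# Sanity check: when the model equals the noise density and nu = 1,
# the classifier cannot do better than chance, so the loss is 2 * log(2).
same_density = lambda u: -0.5 * u * u
loss = nce_loss([0.0, 1.0], [0.5, -0.5], same_density, same_density)
```

Minimizing this loss over the parameters of `log_model` (for example with a gradient-based optimizer) yields the NCE estimate.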


Sparse $\varepsilon$ insensitive zone bounded asymmetric elastic net support vector machines for pattern classification

arXiv.org Machine Learning

Existing support vector machine (SVM) models are sensitive to noise and lack sparsity, which limits their performance. To address these issues, we combine the elastic net loss with a robust loss framework to construct a sparse $\varepsilon$-insensitive bounded asymmetric elastic net loss, and integrate it with SVM to build the $\varepsilon$ Insensitive Zone Bounded Asymmetric Elastic Net Loss-based SVM ($\varepsilon$-BAEN-SVM). $\varepsilon$-BAEN-SVM is both sparse and robust. Sparsity is proven by showing that samples inside the $\varepsilon$-insensitive band are not support vectors. Robustness is theoretically guaranteed because the influence function is bounded. To solve the non-convex optimization problem, we design a half-quadratic algorithm based on clipping dual coordinate descent. It transforms the problem into a series of weighted subproblems, improving computational efficiency via the $\varepsilon$ parameter. Experiments on simulated and real datasets show that $\varepsilon$-BAEN-SVM outperforms traditional and existing robust SVMs. It balances sparsity and robustness well in noisy environments. Statistical tests confirm its superiority. Under the Gaussian kernel, it achieves better accuracy and noise insensitivity, validating its effectiveness and practical value.


Claude Mythos Is Everyone's Problem

The Atlantic - Technology

What happens when AI can hack everything? For the past several weeks, Anthropic says it secretly possessed a tool potentially capable of commandeering most computer servers in the world. This is a bot that, if unleashed, might be able to hack into banks, exfiltrate state secrets, and fry crucial infrastructure. Already, according to the company, this AI model has identified thousands of major cybersecurity vulnerabilities--including exploits in every single major operating system and browser. This level of cyberattack is typically available only to elite, state-sponsored hacking cells in a very small number of countries including China, Russia, and the United States.


AI-pocalypse: Anthropic sparks fears after developing a bot that's 'too dangerous to release to the public'

Daily Mail - Science & tech

Anthropic has sparked fears after revealing that it has developed an AI bot deemed too dangerous to release to the public.


Sci-fi show The Miniature Wife underwhelms – despite the big names

New Scientist

Miniature people have been a staple of science fiction and fantasy going all the way back to Jonathan Swift's Gulliver's Travels, and shrunken characters have taken the spotlight in everything from classic Hollywood movies to family-friendly blockbusters. References to such movies are strewn throughout the new Peacock limited series, but the drawn-out, 10-episode show isn't a particularly worthwhile addition to the sci-fi shrinking canon. Taking only the title and basic premise from Manuel Gonzales's 2014 short story, The Miniature Wife stars Elizabeth Banks as Lindy Littlejohn, a once-prominent author who now works as a university professor and has been overshadowed by her scientist husband Les (Matthew Macfadyen). Lindy, you see, feels metaphorically small in both her personal and professional lives, and is about to become literally small following an accident - or is it? The most pressing problem for Lindy is that Les has yet to develop a stable antidote to his formula, and everything that he has attempted to return to its original size thus far has almost immediately exploded.


Tight Convergence Rates for Online Distributed Linear Estimation with Adversarial Measurements

arXiv.org Machine Learning

We study mean estimation of a random vector $X$ in a distributed parameter-server-worker setup. Worker $i$ observes samples of $a_i^\top X$, where $a_i^\top$ is the $i$th row of a known sensing matrix $A$. The key challenges are adversarial measurements and asynchrony: a fixed subset of workers may transmit corrupted measurements, and workers are activated asynchronously--only one is active at any time. In our previous work, we proposed a two-timescale $\ell_1$-minimization algorithm and established asymptotic recovery under a null-space-property-like condition on $A$. In this work, we establish tight non-asymptotic convergence rates under the same null-space-property-like condition. We also identify relaxed conditions on $A$ under which exact recovery may fail but recovery of a projected component of $\mathbb{E}[X]$ remains possible. Overall, our results provide a unified finite-time characterization of robustness, identifiability, and statistical efficiency in distributed linear estimation with adversarial workers, with implications for network tomography and related distributed sensing problems.


Bridging Theory and Practice in Crafting Robust Spiking Reservoirs

arXiv.org Machine Learning

Spiking reservoir computing provides an energy-efficient approach to temporal processing, but reliably tuning reservoirs to operate at the edge-of-chaos is challenging due to experimental uncertainty. This work bridges abstract notions of criticality and practical stability by introducing and exploiting the robustness interval, an operational measure of the hyperparameter range over which a reservoir maintains performance above task-dependent thresholds. Through systematic evaluations of Leaky Integrate-and-Fire (LIF) architectures on both static (MNIST) and temporal (synthetic Ball Trajectories) tasks, we identify consistent monotonic trends in the robustness interval across a broad spectrum of network configurations: the robustness-interval width decreases with presynaptic connection density $\beta$ (i.e., directly with sparsity) and directly with the firing threshold $\theta$. We further identify specific $(\beta, \theta)$ pairs that preserve the analytical mean-field critical point $w_{\text{crit}}$, revealing iso-performance manifolds in the hyperparameter space. Control experiments on Erdős-Rényi graphs show the phenomena persist beyond small-world topologies. Finally, our results show that $w_{\text{crit}}$ consistently falls within empirical high-performance regions, validating $w_{\text{crit}}$ as a robust starting coordinate for parameter search and fine-tuning. To ensure reproducibility, the full Python code is publicly available.


Conformal Prediction with Time-Series Data via Sequential Conformalized Density Regions

arXiv.org Machine Learning

We propose a new conformal prediction method for time-series data with a guaranteed asymptotic conditional coverage rate, Sequential Conformalized Density Regions (SCDR), which is flexible enough to produce both prediction intervals and disconnected prediction sets, signifying the emergence of bifurcations. Our approach uses existing estimated conditional highest density predictive regions to form initial predictive regions. We then use a quantile random forest conformal adjustment to provide guaranteed coverage while adaptively changing to take the non-exchangeable nature of time-series data into account. We show that the proposed method achieves the guaranteed coverage rate asymptotically under certain regularity conditions. In particular, the method is doubly robust -- it works if the predictive density model is correctly specified and/or if the scores follow a nonlinear autoregressive model with the correct order specified. Simulations reveal that the proposed method outperforms existing methods in terms of empirical coverage rates and set sizes. We illustrate the method using two real datasets, the Old Faithful geyser dataset and the Australian electricity usage dataset. Prediction sets formed using SCDR for the geyser eruption durations include both single intervals and unions of two intervals, whereas existing methods produce wider, less informative, single-interval prediction sets.


CRPS-Optimal Binning for Univariate Conformal Regression

arXiv.org Machine Learning

We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimisation of within-sample LOO-CRPS turns out to be inappropriate for selecting $K$ as it results in in-sample optimism. We instead select $K$ by $K$-fold cross-validation of test CRPS, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. The conformal prediction is transductive and data-efficient, as all observations are used for both partitioning and p-value calculation, with no need to reserve a hold-out set. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, CQR-QRF, and conformalized isotonic distributional regression), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
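
The binning step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the within-bin cost is recomputed naively from the standard CRPS formula for an empirical distribution instead of using the closed-form cost with $O(n^2 \log n)$ precomputation, and the cross-validated choice of $K$ and the conformal step are omitted. All function names and the `min_size` parameter are hypothetical.

```python
def crps_empirical(sample, y):
    """CRPS of the empirical CDF of `sample` at observation `y`.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|.
    """
    m = len(sample)
    spread_to_y = sum(abs(x - y) for x in sample) / m
    internal_spread = sum(abs(a - b) for a in sample for b in sample) / (2 * m * m)
    return spread_to_y - internal_spread


def loo_crps_cost(ys):
    """Total leave-one-out CRPS inside one bin (needs at least 2 points)."""
    return sum(crps_empirical(ys[:i] + ys[i + 1:], ys[i]) for i in range(len(ys)))


def optimal_partition(ys, k, min_size=2):
    """Split `ys` (responses sorted by covariate) into k contiguous bins
    that minimise the summed within-bin LOO-CRPS, via dynamic programming."""
    n = len(ys)
    inf = float("inf")
    # cost[i][j] is the cost of a single bin holding ys[i:j].
    cost = [[loo_crps_cost(ys[i:j]) if j - i >= min_size else inf
             for j in range(n + 1)] for i in range(n + 1)]
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for b in range(1, k + 1):          # number of bins used so far
        for j in range(1, n + 1):      # right edge of the last bin
            for i in range(j):         # left edge of the last bin
                candidate = best[b - 1][i] + cost[i][j]
                if candidate < best[b][j]:
                    best[b][j], back[b][j] = candidate, i
    # Walk the back-pointers to recover the bin boundaries.
    bounds, j = [n], n
    for b in range(k, 0, -1):
        j = back[b][j]
        bounds.append(j)
    return best[k][n], bounds[::-1]


# Toy data with an obvious change point after the fourth observation.
responses = [0.0, 0.1, 0.0, 0.1, 5.0, 5.1, 5.0, 5.1]
total_cost, boundaries = optimal_partition(responses, k=2)
```

The triple loop mirrors the $O(n^2 K)$ dynamic programme mentioned in the abstract; only the cost table is computed less efficiently here.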