Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets
Kunz, M. Ross, Merickel, John, Wilson, Keith
Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). Furthermore, a penalized formulation of CCA is applied to recover sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy is optionally applied to the descriptor set prior to embedding, supporting deployment in sensitive data contexts without requiring access to raw observations at time of comparison. The methodology is evaluated across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results demonstrate a total P@1 score of 0.9, with known nearest-neighbor retrieval and cluster structure remaining robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.
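The P@1 figure above can be made concrete with a small sketch of nearest-neighbor retrieval scoring. This is an illustration under stated assumptions, not the authors' pipeline: it scores neighbors with plain cosine similarity rather than the CCA-based similarity the abstract describes, and the function names `cosine` and `precision_at_1`, the toy embeddings, and the group labels are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def precision_at_1(embeddings, labels):
    """Fraction of datasets whose single nearest neighbor shares their label.

    Assumption: cosine similarity stands in for the paper's CCA-based score.
    """
    hits = 0
    n = len(embeddings)
    for i, query in enumerate(embeddings):
        # Nearest neighbor among all other datasets (the query itself is excluded).
        best = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += labels[best] == labels[i]
    return hits / n

# Toy data: two hypothetical dataset families with 2-d embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["materials", "materials", "graphite", "graphite"]
score = precision_at_1(toy_embeddings, toy_labels)
```

On real data, each embedding would come from the descriptor-embedding step and each label from a known dataset grouping, so a P@1 of 0.9 would mean nine in ten datasets retrieve a same-group dataset as their top match.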
Leave a Window Out: Modifying the Jackknife for Predictive Inference in Time Series
Jiang, Hanyang, Barber, Rina Foygel, Pananjady, Ashwin, Xie, Yao
Conformal prediction methods enjoy strong theoretical and empirical predictive inference performance, provided the data is exchangeable, and predictors are trained in a memoryless fashion. However, these assumptions and constraints are impractical in many real-data settings, such as time series (where temporal dependence violates exchangeability, and where memoryless predictors will inevitably have poor predictive accuracy). Recent work shows that the split conformal prediction method is robust to these issues of memory-based predictors and deviations from exchangeability that are common features of time-series data. However, since using sample splitting can lead to lower accuracy, this motivates asking whether other predictive inference methods (that do not rely on data splitting) could also be reliably used in the time series setting. In this work, we show that the vanilla leave-one-out jackknife can suffer an arbitrary loss of coverage even in canonical time series models with mild temporal dependence. As a remedy, we propose a careful modification tailored to such settings, which we term the \emph{leave-a-window-out} (LWO) method, and show that it can achieve valid coverage provided that the model-fitting procedure satisfies mild stability properties. Our proofs are based on quantifying the degree to which the data departs from \emph{cyclic exchangeability}, and we introduce new coefficients to measure the extent of this departure. Experiments on time series data demonstrate that our LWO method often enjoys valid coverage when the vanilla jackknife fails to cover, while producing much narrower intervals than split conformal prediction.
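The contrast between leave-one-out and leave-a-window-out residuals can be sketched as follows. The abstract does not give the exact LWO construction, so everything here is an assumption for illustration: the window is taken to be all indices within `window` of the held-out point, the model is a toy mean predictor, and the interval uses a simple conformal-style quantile of the absolute residuals. The names `fit_mean`, `lwo_residuals`, and `lwo_interval` are hypothetical.

```python
import math

def fit_mean(train):
    """Toy model (assumption): always predict the mean of the training values."""
    mean = sum(train) / len(train)
    return lambda: mean

def lwo_residuals(ys, window, fit=fit_mean):
    """Absolute residuals where point i is predicted by a model that never
    saw any index j with |j - i| <= window. window=0 is plain leave-one-out."""
    n = len(ys)
    residuals = []
    for i in range(n):
        train = [ys[j] for j in range(n) if abs(j - i) > window]
        if not train:
            raise ValueError("window too large for the series length")
        predict = fit(train)
        residuals.append(abs(ys[i] - predict()))
    return residuals

def lwo_interval(ys, window, alpha=0.1, fit=fit_mean):
    """Symmetric interval around the full-data prediction, with half-width
    set by a conformal-style quantile of the held-out residuals."""
    res = sorted(lwo_residuals(ys, window, fit))
    n = len(res)
    k = min(n, math.ceil((1 - alpha) * (n + 1)))  # 1-based rank, capped at n
    half_width = res[k - 1]
    center = fit(ys)()
    return center - half_width, center + half_width

series = [1, 2, 3, 4, 5]
loo = lwo_residuals(series, window=0)
lwo = lwo_residuals(series, window=1)
```

On this trending toy series, removing the neighbors of each held-out point makes the residuals larger, which is the intuition behind widening the held-out set when nearby observations are dependent.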
On Language Generation in the Limit with Bounded Memory
Kleinberg, Jon, Mehrotra, Anay, Saberi, Amin, Velegkas, Grigoris
We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an ``approximate'' version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
Reasoning with Sampling: Cutting at Decision Points
Zhou, Felix, Mehrotra, Anay, Liu, Quanquan C.
Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
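The idea of preferring high-entropy positions as cut points can be sketched in a few lines. This is a minimal illustration, not the authors' Entropy-Cut Metropolis-Hastings algorithm: it assumes the base model's next-token distribution at each position is already available as a list of probabilities, it samples a cut with probability proportional to entropy (one possible rule, chosen here for simplicity), and it omits the suffix resampling and the Metropolis-Hastings acceptance step entirely. The names `entropy`, `cut_weights`, and `sample_cut` are hypothetical.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def cut_weights(next_token_dists):
    """Turn per-position entropies into a distribution over cut positions.

    Assumption: cut probability is proportional to next-token entropy.
    """
    ents = [entropy(d) for d in next_token_dists]
    total = sum(ents)
    if total == 0.0:
        # Every position is deterministic: fall back to a uniform cut.
        return [1.0 / len(ents)] * len(ents)
    return [e / total for e in ents]

def sample_cut(next_token_dists, rng):
    """Sample one cut position according to the entropy-based weights."""
    weights = cut_weights(next_token_dists)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy trace: only position 1 is a genuine "decision" (a 50/50 choice).
toy_dists = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
weights = cut_weights(toy_dists)
cut = sample_cut(toy_dists, random.Random(0))
```

A uniform sampler would pick each of the three positions with probability one third, whereas the entropy-weighted rule puts all of its mass on the single uncertain position in this toy trace.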
The GOP's Attacks on James Talarico Are Straight Out of the Incel Handbook
Claims about low testosterone and false accusations of veganism might play well to the online far right, but will they win an election? On Tuesday, with Donald Trump's endorsement and the backing of the MAGA faithful, scandal-ridden Texas attorney general Ken Paxton defeated incumbent US senator John Cornyn in a runoff primary to claim the Republican nomination for that seat. He then quickly set about painting his general-election opponent, Democratic Texas state representative James Talarico, as insufficiently masculine. "My opponent is the most extreme radical that Democrats have ever nominated," Paxton said in his victory speech.
AI facial recognition to check age of asylum seekers from next year
An AI facial recognition tool that aims to detect adult migrants posing as children will be deployed at the UK's borders next year. A software company has been awarded a contract to develop and test the technology, which will estimate a person's age by analysing photographs of them taken at the border. The Home Office says the technology will make it easier to identify adult migrants attempting to game the system, after initial testing indicated promising performance and accuracy. But Human Rights Watch urged the government to scrap the scheme, describing it as unproven technology that will undermine the protections vulnerable children are entitled to. Unaccompanied child migrants are processed through the care system rather than the asylum system, which can make it easier to stay in the country.
'Supergirl' pre-release tracking looks disastrously bad for Hollywood after lead actress' bizarre comments
Star Milly Alcock's divisive remarks and underwhelming trailers have tracking estimates far below studio hopes. Hollywood can't get out of its own way. For most of the last decade, the entertainment industry has worked extremely hard to alienate large numbers of potential customers.
The NBA, NBC and fanboys continue to tout deeply misleading ratings data
Bobby Burack
While OutKick is trying to enjoy the NBA conference finals, though all the blowouts make that difficult, the fanboys keep demanding we comment on the ratings. Every other day, it seems, NBC or the NBA releases another celebratory graphic touting viewership. The Western Conference Finals are averaging 9.4 million viewers across NBC and Peacock, making it the most-watched Western Conference Finals on record through three games, NBC posted on X on Thursday. The network also said that Thunder-Spurs Game 4 on Sunday delivered a total audience of 10.3 million viewers, making it the most-watched Western Conference Finals Game 4 since 1999.
The Internet Is Somehow Obsessed With the Pope's First Major Letter. I Read It--and Totally See Why.
I Read the Pope's Encyclical on A.I. I'm Astounded By What He Wrote. It's an urgent warning--and a celebration of humanity and what we can do at our best.