Goto

Collaborating Authors

 Education


Don't throw the baby out with the bathwater: How and why deep learning for ARC

arXiv.org Artificial Intelligence

The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.


AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.


Token Distillation: Attention-aware Input Embeddings For New Tokens

arXiv.org Artificial Intelligence

New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. This excessive tokenization not only leads to reduced performance on downstream tasks (Rust et al., 2021; Ali et al., 2024) but also increases the computational Although adding new tokens to a model's vocabulary can reduce over-tokenization, it Whenever we wish to add a new token to a pretrained model's vocabulary, this new token may The semantics of a word composed of multiple subtokens will largely not be stored in their raw input embeddings at all - but rather constructed by the Transformer's attention/feed-forward layer stack during contextualization (Elhage et al., 2022; Lad et al., 2024; We demonstrate the efficacy of our method, dubbed "Token Distillation", in Section 5. We illustrate Our experimental setup is detailed in Section 4. In summary, our contributions are as follows. We motivate our proposed method by describing the fundamental limitations of current embedding initialization methods and empirically verify our claims. Most state-of-the-art Large Language Models (LLMs) are trained using a static tokenizer, usually derived by a byte-pair encoding scheme before model training (Sennrich et al., 2016). Furthermore, Lesci et al. (2025) show that in practice, words which are not a single A solution to this problem is to modify the existing vocabulary to suit the specific needs.


In 1925, seven students went 60 hours without sleep--for science

Popular Science

Scientists were out to prove sleep was just a waste of time. Among the students who participated in the sleep deprivation study was the future head of the psychology department at George Washington University. Breakthroughs, discoveries, and DIY tips sent every weekday. The grueling Medical College Admission Test, or MCAT, was first devised in the 1920s by George Washington University professor Frederick August Moss. Originally called the Scholastic Aptitude Test for Medical Students, Moss developed the readiness test as a way to curb high dropout rates in medical schools.


Zohran Annoyed a Lot of New York Public School Parents With This One. But He's Got a Point.

Slate

The many ways we've tried to identify gifted 4-year-olds, and how they've failed. When I was a kindergartner in the 1980s, the "gifted" programming for my class could be found inside of a chest. I don't know what toys and learning materials lived there, since I wasn't one of the handful of presumably more academically advanced kiddos that my kindergarten teacher invited to open the chest. My distinct impression at the time was that my teacher didn't think I was worthy of the enrichment because I frequently spilled my chocolate milk at lunch and I had also once forgotten to hang a sheet of paper on the class easel--instead painting an elaborate and detailed picture on the stand itself. The withering look on my teacher's face after seeing the easel assured me that gifted I was not.


Robot Talk Episode 131 – Empowering game-changing robotics research, with Edith-Clare Hall

Robohub

Edith-Clare Hall is a PhD student at the University of Bristol, Frontier Specialist at ARIA, and leader of Women in Robotics UK. She focuses on the critical interfaces where interconnected systems meet, working to close the gap between academic research and real-world deployment to unlock cyber-physical autonomy. At ARIA, she works as a technical generalist, accelerating breakthroughs across emerging and future programmes. Her PhD research focussed on creating bespoke robotic systems that deliver support for people with progressive conditions such as motor neurone disease (MND). Robot Talk is a weekly podcast that explores the exciting world of robotics, artificial intelligence and autonomous machines.


Budgeted Multiple-Expert Deferral

arXiv.org Machine Learning

Learning to defer uncertain predictions to costly experts offers a powerful strategy for improving the accuracy and efficiency of machine learning systems. However, standard training procedures for deferral algorithms typically require querying all experts for every training instance, an approach that becomes prohibitively expensive when expert queries incur significant computational or resource costs. This undermines the core goal of deferral: to limit unnecessary expert usage. To overcome this challenge, we introduce the budgeted deferral framework, which aims to train effective deferral algorithms while minimizing expert query costs during training. We propose new algorithms for both two-stage and single-stage multiple-expert deferral settings that selectively query only a subset of experts per training example. While inspired by active learning, our setting is fundamentally different: labels are already known, and the core challenge is to decide which experts to query in order to balance cost and predictive performance. We establish theoretical guarantees for both of our algorithms, including generalization bounds and label complexity analyses. Empirical results across several domains show that our algorithms substantially reduce training costs without sacrificing prediction accuracy, demonstrating the practical value of our budget-aware deferral algorithms.


Statistical Inference for Matching Decisions via Matrix Completion under Dependent Missingness

arXiv.org Machine Learning

In contrast to the independent sampling assumed in classical matrix completion literature, the observed entries, which arise from past matching data, are constrained by matching capacity. This matching-induced dependence poses new challenges for both estimation and inference in the matrix completion framework. We propose a non-convex algorithm based on Grassmannian gradient descent and establish near-optimal entrywise convergence rates for three canonical mechanisms, i.e., one-to-one matching, one-to-many matching with one-sided random arrival, and two-sided random arrival. To facilitate valid uncertainty quantification and hypothesis testing on matching decisions, we further develop a general debiasing and projection framework for arbitrary linear forms of the reward matrix, deriving asymptotic normality with finite-sample guarantees under matching-induced dependent sampling. Our empirical experiments demonstrate that the proposed approach provides accurate estimation, valid confidence intervals, and efficient evaluation of matching policies.


Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality

arXiv.org Artificial Intelligence

Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We release these 1,000+ SFT models and benchmark results to accelerate further research. All resources are available at https://github.com/llm-jp/massive-sft.


When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

arXiv.org Artificial Intelligence

State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.