fine-tuning
Towards foundational LiDAR world models with efficient latent flow matching
LiDAR-based world models offer more structured and geometry-aware representations than their image-based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. This raises a critical question: can we develop LiDAR world models that exhibit strong transferability across multiple domains? To answer this, we conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor to indoor generalization, (ii) sparse-beam to dense-beam adaptation, and (iii) non-semantic to semantic transfer. Given different amounts of fine-tuning data, our experiments show that a single pretrained model can achieve up to 11% absolute improvement (83% relative) over training from scratch and outperforms training from scratch in 30/36 of our comparisons. This transferability significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds previous baselines with only 5% of the labeled training data of prior work. We also observe inefficiencies in current generative-model-based LiDAR world models, mainly through their under-compression of LiDAR data and inefficient training objectives. To address these issues, we propose a latent conditional flow matching (CFM)-based framework that achieves state-of-the-art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model also achieves SOTA performance on semantic occupancy forecasting while being 1.98x-23x more computationally efficient (a 1.1x-3.9x
Pay Attention to Small Weights
Finetuning large pretrained neural networks is known to be resource-intensive, both in terms of memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NANOADAM, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages: first, the criterion is gradient-free: the parameter subset can be determined without gradient computation; second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting; third, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
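The selection rule described above can be illustrated with a minimal sketch. It is not the authors' NANOADAM implementation: the function names, the `fraction` parameter, and the toy values are assumptions made for illustration, and the mask is computed once here, whereas the abstract describes the subset as updated dynamically.

```python
import math

def small_weight_mask(weights, fraction):
    """Select the `fraction` of weights with the smallest |w| (no gradients needed)."""
    k = max(1, int(len(weights) * fraction))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    chosen = set(order[:k])
    return [i in chosen for i in range(len(weights))]

def masked_adam_step(weights, grads, m, v, step, mask,
                     lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step applied only where mask is True; other weights stay frozen."""
    new_w = list(weights)
    for i, selected in enumerate(mask):
        if not selected:
            continue
        # Standard Adam moment estimates with bias correction.
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        new_w[i] = weights[i] - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_w

# Toy example (hypothetical values): six weights, update the smallest half.
weights = [2.0, -0.01, 0.5, 0.02, -1.5, 0.003]
grads = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
mask = small_weight_mask(weights, fraction=0.5)
m = [0.0] * len(weights)
v = [0.0] * len(weights)
updated = masked_adam_step(weights, grads, m, v, step=1, mask=mask)
```

In this sketch the large-magnitude weights (2.0, 0.5, -1.5) are never touched, and optimizer state is only advanced for the selected entries. A real implementation would store state for the selected subset only and refresh the mask periodically.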
X-Mahalanobis: Transformer Feature Mixing for Reliable OOD Detection
Recognizing out-of-distribution (OOD) samples is essential for deploying robust machine learning systems in open-world environments. While conventional OOD detection approaches rely on feature representations from the penultimate layer of neural networks, they often overlook informative signals embedded in intermediate layers. In this paper, we present a straightforward feature mixing approach for pretrained Transformers, which combines multi-layer representations via calculated importance weights, and identifies OOD samples using Mahalanobis distance in the blended feature space. When in-distribution samples are accessible, we show that parameter-efficient fine-tuning strategies effectively balance classification accuracy and OOD detection performance. We conduct extensive empirical analyses to validate the superiority of our proposed method under zero-shot and fine-tuning settings using both class-balanced and long-tailed datasets. The source code is available at https://github.com/SEUML/X-Maha.
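The scoring idea can be sketched as follows, under strong simplifying assumptions: features are two-dimensional, the layer importance weights are given rather than calculated, and a single Gaussian is fitted to all in-distribution samples. All function names and values are hypothetical and are not taken from the X-Maha repository.

```python
def mix_layers(layer_feats, weights):
    """Blend per-layer feature vectors using normalized importance weights."""
    total = sum(weights)
    dim = len(layer_feats[0])
    return [sum(w / total * f[d] for w, f in zip(weights, layer_feats))
            for d in range(dim)]

def fit_gaussian_2d(samples):
    """Mean and (biased) covariance of a list of 2-D points."""
    n = len(samples)
    mean = [sum(s[d] for s in samples) / n for d in range(2)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / n
            for b in range(2)] for a in range(2)]
    return mean, cov

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance using an explicit 2x2 inverse of the covariance."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mean[0], x[1] - mean[1]]
    return sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2)) ** 0.5

# Hypothetical importance weights for two layers (assumed, not computed here).
layer_weights = [1.0, 3.0]
# Each in-distribution sample: [layer-1 features, layer-2 features].
in_dist = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
    [[0.0, -2.0], [0.0, 2.0]],
    [[0.0, 2.0], [0.0, -2.0]],
]
mixed = [mix_layers(s, layer_weights) for s in in_dist]
mean, cov = fit_gaussian_2d(mixed)

near = mahalanobis_2d(mix_layers([[0.5, 0.0], [0.5, 0.0]], layer_weights), mean, cov)
far = mahalanobis_2d(mix_layers([[3.0, 3.0], [3.0, 3.0]], layer_weights), mean, cov)
```

A sample is flagged as OOD when its distance exceeds a threshold chosen on held-out data; here `far` is much larger than `near`. Practical detectors typically use class-conditional means with a shared covariance and a numerically stable inverse.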
Omni-DNA: A Genomic Model Supporting Sequence Understanding, Long-context, and Textual Annotation
The interpretation of genomic sequences is crucial for understanding biological processes. To handle the growing volume of DNA sequence data, Genomic Foundation Models (GFMs) have been developed by adapting architectures and training paradigms from Large Language Models (LLMs). Despite their remarkable performance in DNA sequence classification tasks, there remains a lack of systematic understanding regarding the pre-training and task-adaptation processes of GFMs. Moreover, existing GFMs cannot achieve state-of-the-art performance on both short and long-context tasks and lack multimodal abilities. By revisiting pre-training architectures and post-training techniques, we propose OMNI-DNA, a family of models spanning 20M to 1.1B parameters that supports sequence understanding, long-context genomic reasoning, and natural-language annotation. Omni-DNA establishes new state-of-the-art results on 18 of 26 evaluations drawn from Nucleotide Transformer and Genomic Benchmarks. When jointly finetuning on biologically related tasks, Omni-DNA consistently outperforms existing models and demonstrates multi-tasking abilities. Furthermore, we introduce SEQPACK, an adaptive compression mechanism that enables efficient long-context modeling by summarizing historical tokens through position-aware learnable sampling. This allows transformer-based models to process ultra-long genomic sequences with minimal memory and computational overhead.
Linearization Explains Fine-Tuning in Large Language Models
Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using an ℓ2 distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully linear and the linearized finetuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning.
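For orientation, the standard objects behind this argument can be written out as below. This is a generic sketch rather than the paper's exact formulation. The notation is assumed here: $f$ is a scalar-output model, $\theta_0$ the pretrained parameters, $\lambda$ the regularization strength, $\ell$ the loss, and $K$ the kernel.

```latex
\begin{align}
f_{\mathrm{lin}}(x;\theta) &= f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0), \\
\min_{\theta}\ & \sum_{i=1}^{n} \ell\bigl(f_{\mathrm{lin}}(x_i;\theta),\, y_i\bigr)
  + \frac{\lambda}{2}\,\lVert \theta-\theta_0 \rVert_2^2, \\
K(x,x') &= \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0).
\end{align}
```

With a squared loss, the second line is kernel ridge regression on the residuals $y_i - f(x_i;\theta_0)$ with kernel $K$, where the ridge strength is set by $\lambda$. This is the sense in which an explicit ℓ2 penalty ties fine-tuning to learning with the NTK.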
Scalable Fingerprinting of Large Language Models
Model fingerprinting has emerged as a powerful tool for model owners to identify their shared model given API access. In order to lower the false discovery rate, fight fingerprint leakage, and defend against coalitions of model users attempting to bypass detection, we argue that scaling up the number of fingerprints one can embed into a model, i.e., scalability of fingerprints, is critical. Hence, we pose scalability as a crucial requirement for fingerprinting schemes. We experiment with fingerprint design at a scale significantly larger than previously considered, and introduce a new method, dubbed Perinucleus sampling, to generate scalable, persistent, and harmless fingerprints. We demonstrate that this scheme can add 24,576 fingerprints to a Llama-3.1-8B
UniMotion: A Unified Motion Framework for Simulation, Prediction and Planning
Motion simulation, prediction and planning are foundational tasks in autonomous driving, each essential for modeling and reasoning about dynamic traffic scenarios. While often addressed in isolation due to their differing objectives, such as generating diverse motion states or estimating optimal trajectories, these tasks inherently depend on shared capabilities: understanding multi-agent interactions, modeling motion behaviors, and reasoning over temporal and spatial dynamics. Despite this underlying commonality, existing approaches typically adopt specialized model designs, which hinders cross-task generalization and system scalability. More critically, this separation overlooks the potential mutual benefits among tasks. Motivated by these observations, we propose UniMotion, a unified motion framework that captures shared structures across motion tasks while accommodating their individual requirements. Built on a decoder-only Transformer architecture, UniMotion employs dedicated interaction modes and tailored training strategies to simultaneously support these motion tasks. This unified design not only enables joint optimization and representation sharing but also allows for targeted fine-tuning to specialize in individual tasks when needed. Extensive experiments on the Waymo Open Motion Dataset demonstrate that joint training leads to robust generalization and effective task integration. With further fine-tuning, UniMotion achieves state-of-the-art performance across a range of motion tasks, establishing it as a versatile and scalable solution for autonomous driving.
HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs
Quantized training of Large Language Models (LLMs) remains an open challenge, as maintaining accuracy while performing all matrix multiplications in low precision has proven difficult. This is particularly the case when fine-tuning pre-trained models, which can have large weight, activation, and error (output gradient) outlier values that make lower-precision optimization difficult. To address this, we present HALO, a new quantization-aware training approach for Transformers that enables accurate and efficient low-precision training by combining 1) strategic placement of Hadamard rotations in both forward and backward passes, which mitigate outliers, 2) high-performance kernel support, and 3) FSDP integration for low-precision communication. Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision.
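Why a Hadamard rotation helps with outliers can be shown on a toy vector. The sketch below is not HALO's kernels or its placement of rotations in the forward and backward passes. It only illustrates that an orthonormal Hadamard transform spreads a single large value across all coordinates, which shrinks the absmax scale used by a simple symmetric quantizer. The function names, the 4-bit setting, and the input values are assumptions.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [val * scale for val in y]

def quantize_dequantize(x, bits=4):
    """Symmetric absmax quantization to `bits` bits, followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(val) for val in x) / qmax
    if scale == 0:
        return list(x)  # all-zero input: nothing to quantize
    return [round(val / scale) * scale for val in x]

# Toy vector (hypothetical) with one large outlier.
x = [10.0, 0.3, -0.2, 0.1]
rotated = hadamard_rotate(x)

# The normalized transform is its own inverse, so applying it again undoes it.
round_trip = hadamard_rotate(rotated)

# Quantize in the rotated space, then rotate back to the original basis.
restored = hadamard_rotate(quantize_dequantize(rotated))
```

After rotation the largest magnitude drops from 10.0 to roughly 5.2, so the quantization grid is finer and the small entries are no longer all rounded to zero by an outlier-dominated scale.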
LLM Meeting Decision Trees on Tabular Data
Tabular data play a vital role in diverse real-world fields, including healthcare and finance. With the recent success of Large Language Models (LLMs), early efforts have begun to extend LLMs to tabular data. Most of these LLM-based methods first serialize tabular data into natural language descriptions, and then tune LLMs or run inference directly on the serialized data. However, these methods suffer from two key inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure, and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and in-context learning scalability is bottlenecked by input length constraints (suitable only for few-shot learning). This work explores a novel direction of integrating LLMs into tabular data through logical decision tree rules as intermediaries, proposing a decision tree enhancer with LLM-derived rules for tabular prediction, DeLTa. DeLTa avoids tabular data serialization, and can be applied to the full-data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule given a set of decision tree rules. Furthermore, we provide a calibration method for the original decision trees via the new LLM-generated rule, which approximates the error correction vector to steer the original decision tree predictions in the direction that reduces errors. Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.