Technology
OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis
Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pretrained speech model undergoes further training on image-text tasks, enabling (near) zero-shot generalization from vision to speech, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, which enables real-time emotional speech synthesis with high fidelity. Extensive experiments demonstrate that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5$\times$ fewer training examples and a smaller model size (7B vs. 7$\times$8B). Besides, OpenOmni achieves real-time speech generation with less than 1 second latency at non-autoregressive mode, reducing inference time by 5$\times$ compared to autoregressive methods, and improves emotion classification accuracy by 7.7\%.
Neurons as Detectors of Coherent Sets in Sensory Dynamics
From prior experience, neurons learn {\it coherent sets}--regions of stimulus state space whose trajectories evolve cohesively over finite times--and assign membership indices to new stimuli. Coherent sets are identified via spectral clustering of the {\it stochastic Koopman operator (SKO)}, where the sign pattern of a subdominant singular function partitions the state space into minimally coupled regions. For multivariate Ornstein-Uhlenbeck processes, this singular function reduces to a linear projection onto the dominant singular vector of the whitened state-transition matrix. Encoding this singular vector as a receptive field enables neurons to compute membership indices via the projection sign in a biologically plausible manner. Each neuron detects either a {\it predictive} coherent set (stimuli with common futures) or a {\it retrospective} coherent set (stimuli with common pasts), suggesting a functional dichotomy among neurons. Since neurons lack access to explicit dynamical equations, the requisite singular vectors must be estimated directly from data, for example, via past-future canonical correlation analysis on lag-vector representations--an approach that naturally extends to nonlinear dynamics. This framework provides a novel account of neuronal temporal filtering, the ubiquity of rectification in neural responses, and known functional dichotomies. Coherent-set clustering thus emerges as a fundamental computation underlying sensory processing and transferable to bio-inspired artificial systems.
EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization
Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. We observe that such inefficiency stems from the sequential execution of layers, which is seemingly natural but actually unnecessary. Therefore, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization.
MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation
We present MetaFind, a scene-aware multi-modal retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm specifically tailored for 3D asset retrieval, as existing approaches predominantly rely on general-purpose 3D shape representation models. Our key innovation is a retrieval mechanism that enhances both spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structures. Methodologically, MetaFind introduces a plug-and-play layout encoder that captures both spatial relationships and object appearance features, ensuring retrieved 3D assets are contextually and stylistically coherent with the existing scene. The framework supports iterative scene construction by continuously adapting retrieval results to current scene updates. Empirical evaluations demonstrate the improved spatial and stylistic consistency of MetaFind in various retrieval tasks compared to baseline methods.
Predicting the Performance of Black-box Language Models with Follow-up Queries
Reliably predicting the behavior of language models---such as whether their outputs are correct or have been adversarially manipulated---is a fundamentally challenging task. This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and taking the probabilities of responses representations to train reliable predictors. We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code. Finally, we show that they can also be used to differentiate between black-box LLMs, enabling the detection of misrepresented models provided through an API. Overall, our work shows promise in monitoring black-box language model behavior, supporting their deployment in larger, autonomous systems.
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in N independent samples. We show, surprisingly, that training with cross-entropy (CE) can be with pass@N in that pass@N accuracy with longer CE training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions both with and without Chain-of-Thought reasoning traces; and (2) proving theorems by searching over proof trees of varying shapes. Overall our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference ($\texttt{PPI}$) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose $\texttt{R-AutoEval+}$, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of $\texttt{R-AutoEval+}$ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, for prompt design in LLMs, and for test-time reasoning budget allocation in LLMs confirm the reliability and efficiency of $\texttt{R-AutoEval+}$.
From Indicators to Insights: Diversity-Optimized for Medical Series-Text Decoding via LLMs
Medical time-series analysis differs fundamentally from general ones by requiring specialized domain knowledge to interpret complex signals and clinical context. Large language models (LLMs) hold great promise for augmenting medical time-series analysis by complementing raw series with rich contextual knowledge drawn from biomedical literature and clinical guidelines. However, realizing this potential depends on precise and meaningful prompts that guide the LLM to key information. Yet, determining what constitutes effective prompt content remains non-trivial--especially in medical settings where signal interpretation often hinges on subtle, expert-defined decision-making indicators. To this end, we propose InDiGO, a knowledge-aware evolutionary learning framework that integrates clinical signals and decision-making indicators through iterative optimization. Across four medical benchmarks, InDiGO consistently outperforms prior methods.
Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, as well as additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for nonconvex constraint set $\Sigma \subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously as each method only specialized in one case. In addition, rigorous and quantitative convergence guarantee is often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework that can design overdamped Langevin dynamics accommodating both equality and inequality constraints. The proposed dynamics also deterministically corrects trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $\Sigma$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $\rho_\Sigma(x) \propto \exp(-f(x))d\sigma_\Sigma$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack variable variants, highlighting its favorable computational cost and reasonable empirical mixing.