Goto

Collaborating Authors

 sfm



Categorical Flow Matching on Statistical Manifolds

Neural Information Processing Systems

We introduce Statistical Flow Matching (SFM), a novel and mathematically rigorous flow-matching framework on the manifold of parameterized probability measures inspired by the results from information geometry. We demonstrate the effectiveness of our method on the discrete generation problem by instantiat-ing SFM on the manifold of categorical distributions whose geometric properties remain unexplored in previous discrete generative models. Utilizing the Fisher information metric, we equip the manifold with a Riemannian structure whose intrinsic geometries are effectively leveraged by following the shortest paths of geodesics. We develop an efficient training and sampling algorithm that overcomes numerical stability issues with a diffeomorphism between manifolds. Our distinctive geometric perspective of statistical manifolds allows us to apply optimal transport during training and interpret SFM as following the steepest direction of the natural gradient. Unlike previous models that rely on variational bounds for likelihood estimation, SFM enjoys the exact likelihood calculation for arbitrary probability measures. We manifest that SFM can learn more complex patterns on the statistical manifold where existing models often fail due to strong prior assumptions. Comprehensive experiments on real-world generative tasks ranging from image, text to biological domains further demonstrate that SFM achieves higher sampling quality and likelihood than other discrete diffusion or flow-based models. Our code is available at https://github.com/ccr-cheng/


Categorical Flow Matching on Statistical Manifolds

Neural Information Processing Systems

We introduce Statistical Flow Matching (SFM), a novel and mathematically rigorous flow-matching framework on the manifold of parameterized probability measures inspired by the results from information geometry. We demonstrate the effectiveness of our method on the discrete generation problem by instantiating SFM on the manifold of categorical distributions whose geometric properties remain unexplored in previous discrete generative models. Utilizing the Fisher information metric, we equip the manifold with a Riemannian structure whose intrinsic geometries are effectively leveraged by following the shortest paths of geodesics. We develop an efficient training and sampling algorithm that overcomes numerical stability issues with a diffeomorphism between manifolds. Our distinctive geometric perspective of statistical manifolds allows us to apply optimal transport during training and interpret SFM as following the steepest direction of the natural gradient. Unlike previous models that rely on variational bounds for likelihood estimation, SFM enjoys the exact likelihood calculation for arbitrary probability measures. We manifest that SFM can learn more complex patterns on the statistical manifold where existing models often fail due to strong prior assumptions. Comprehensive experiments on real-world generative tasks ranging from image, text to biological domains further demonstrate that SFM achieves higher sampling quality and likelihood than other discrete diffusion or flow-based models.


Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations

Ariyanti, Whenty, Chen, Kuan-Yu, Siniscalchi, Sabato Marco, Wang, Hsin-Min, Tsao, Yu

arXiv.org Artificial Intelligence

Abstract-- Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions. This study introduces the Voice Quality Assessment Network (VOQANet), a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, we propose VO-QANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors--namely jitter, shimmer, and harmonics-to-noise ratio (HNR). Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), our models are evaluated on both vowel-level and sentence-level speech (PVQD-S) to assess general-izability. Experimental results demonstrate that sentence-based inputs yield higher accuracy, particularly at the patient level. Overall, VOQANet consistently outperforms baseline models in terms of root mean squared error (RMSE) and Pearson correlation coefficient across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even greater performance gains. Additionally, VOQANet+ maintains consistent performance under noisy conditions, suggesting enhanced robustness for real-world and telehealth applications. This work highlights the value of combining SFM embeddings with low-level features for accurate and robust pathological voice assessment. This work was supported in part by the National Science and T ech-nology Council and Academia Sinica. Whenty Ariyanti is with the Department of Computer Science and Information Engineering, National T aiwan University of Science and T echnology, T aipei 106, T aiwan, and also with the Research Center for Information T echnology Innovation, Academia Sinica, T aipei 11529, T aiwan (e-mail: d11115805@mail.ntust.edu.tw). Kuan-Y u Chen is with the Department of Computer Science and Information Engineering, National T aiwan University of Science and T echnology, T aipei 106, T aiwan (e-mail: kychen@mail.ntust.edu.tw). Sabato Marco Siniscalchi is the University of Palermo, Palermo, Italy (e-mail: sabatomarco.siniscalchi@unipa.it).


AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

Jiang, Yuhua, Cheng, Shuang, Ding, Yan, Gao, Feifei, Qi, Biqing

arXiv.org Artificial Intelligence

Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations due to its asynchronous generation in AFM. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.


SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

Donepudi, Dharma Teja

arXiv.org Artificial Intelligence

Intra - sentence multilingual speech synthesis (code - switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed - language contexts. We introduce Script - First Multilingual Synthesis with Adaptive Locale Resolution (SFMS - ALR) an engine - agnostic framework for fluent, real - time code - switched speech generation. SFMS - ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment - aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate or spans and synthesizes the utterance in a single TTS request. Unlike end - to - end multilingual models, SFMS - ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data - driven pipelines such as Unicom and Mask LID demonstrates SFMS - ALR's flexibility, interpretability, and immediate deployability . The framework establishes a modular baseline for high - quality, engine - independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.


Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

Yang, Dong, Cai, Yiyi, Saito, Yuki, Wang, Lixu, Saruwatari, Hiroshi

arXiv.org Artificial Intelligence

We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM constructs intermediate states along the FM paths from these representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise, thereby focusing computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/.




Smooth Flow Matching

Tan, Jianbin, Zhang, Anru R.

arXiv.org Machine Learning

Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.