
Collaborating Authors: Yun, Se-Young


When Debate Fails: Bias Reinforcement in Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) solve complex problems using training-free methods such as prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce MetaNIM Arena, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD's limitations, we propose DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt), a novel framework that (1) refines the LLM's strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that DReaMAD significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.
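To make the prompt-diversification idea concrete, the sketch below runs a debate among several personas instantiated from a single model; `call_llm`, the persona texts, and the majority-vote aggregation are illustrative assumptions rather than the DReaMAD templates.

```python
# Minimal sketch of perspective-diversified multi-agent debate with one LLM.
# `call_llm` is a hypothetical stub for any chat-completion client; the
# persona prompts below are illustrative, not the paper's refined prompts.
from collections import Counter

PERSPECTIVES = [
    "You are a cautious game theorist who verifies every move against the winning condition.",
    "You are an aggressive strategist who looks for immediate winning moves first.",
    "You are a skeptic whose job is to find flaws in the previous agent's reasoning.",
]

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here.")

def debate(question: str, rounds: int = 2) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        answers = []
        for persona in PERSPECTIVES:
            context = "\n".join(transcript) or "No prior arguments."
            prompt = (f"Question: {question}\nPrior arguments:\n{context}\n"
                      "Give your move and a one-line justification.")
            answers.append(call_llm(persona, prompt))
        transcript.extend(answers)
    # Aggregate by majority vote over the final round's answers.
    final_round = transcript[-len(PERSPECTIVES):]
    return Counter(final_round).most_common(1)[0][0]
```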


Probability-Flow ODE in Infinite-Dimensional Function Spaces

arXiv.org Machine Learning

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021b; Kingma et al., 2021) are a class of generative models that add noise to real data to train a score network and sequentially approximate the time-reversed process (Föllmer and Wakolbinger, 1986; Anderson, 1982) to generate samples from the true data distribution. These models have shown remarkable empirical success in numerous domains such as image generation (Song et al., 2021b,a), video generation (Luo et al., 2023), medical data processing (Song et al., 2022; Chung and Ye, 2022; Akrout et al., 2023), and audio generation (Kong et al., 2020). However, "classical" diffusion models formulated on finite-dimensional Euclidean spaces limit their applicability to function generation problems, as they can only generate function values realized on a fixed discretization of the function's domain (Li et al., 2020) and cannot capture functional properties of the data such as integrability or smoothness (Kerrigan et al., 2023).

Motivated by this limitation of finite-dimensional models, a line of work has extended the finite-dimensional diffusion model to infinite-dimensional Hilbert spaces; for instance, Hagemann et al. (2023); Kerrigan et al. (2023); Lim et al. (2023a,b); Pidstrigach et al. (2023); Phillips et al. (2022); Baldassari et al. (2023). Kerrigan et al. (2023) proposes a discrete-time model that serves as an analog of Ho et al. (2020) in infinite-dimensional space, and Hagemann et al. (2023) introduces a finite-dimensional approximation of infinite-dimensional SDEs and utilizes the time-reversal formula in finite-dimensional spaces. Lim et al. (2023a); Franzese et al. (2023); Pidstrigach et al. (2023) propose continuous-time models by extending the SDE framework of Song et al. (2021b) to infinite dimensions based on semigroup theory (cf. Da Prato and Zabczyk, 2014); however, their analysis is limited to a relatively simple class of SDEs, such as Langevin-type SDEs or SDEs with constant-time diffusion coefficients. Later, Lim et al. (2023b) proved a general form of the time-reversal formula that encompasses various choices of SDEs, such as VPSDE, VESDE, sub-VPSDE (Song et al., 2021b), and variance scheduling (Nichol and Dhariwal, 2021).
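For reference, the finite-dimensional construction that this line of work generalizes is the forward SDE and its probability-flow ODE from Song et al. (2021b); both induce the same time marginals, and the contribution here is to lift this picture to infinite-dimensional function spaces.

```latex
% Forward SDE and the associated probability-flow ODE (Song et al., 2021b),
% stated in finite dimensions; both share the same marginal densities p_t.
\begin{align}
  \mathrm{d}x_t &= f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t, \\
  \mathrm{d}x_t &= \Big[\, f(x_t, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_{x}\log p_t(x_t) \,\Big]\,\mathrm{d}t .
\end{align}
```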


MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

arXiv.org Artificial Intelligence

Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements.
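As background, a standard conditional flow-matching objective takes the form below; the exact probability path and conditioning used by the proposed renderer may differ, and c here stands for the combined audio (x-vector) and visual guidance.

```latex
% A standard conditional flow-matching objective with a linear interpolation
% path; x_1 is a data sample, x_0 ~ N(0, I), x_t = (1 - t) x_0 + t x_1, and
% c denotes the conditioning (audio and visual guidance in this setting).
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x_0,\, x_1}
    \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2 .
```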


What is the Alignment Objective of GRPO?

arXiv.org Artificial Intelligence

In this note, we examine the aggregation of preferences achieved by the Group Relative Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function, which we show to essentially correspond to the reverse Kullback-Leibler (KL) divergence between the aggregation policy and the reference policy. Interestingly, we demonstrate that for groups of size two, the reward preference model corresponds to pairwise comparison preferences, similar to those in other alignment methods based on pairwise comparison feedback. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size. This provides insights into the dependence of the aggregate preference on parameters such as the regularisation constant and the confidence margin of question answers. Finally, we discuss the aggregation of preferences obtained by modifying the GRPO algorithm to use direct KL divergence as the penalty or to use rewards without scale normalisation.
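The shift-and-scale normalisation analysed here is the group-relative advantage used by GRPO; a minimal sketch, omitting the clipping and reference-policy KL-penalty terms of the full objective, is:

```python
# Minimal sketch of GRPO's shift-and-scale reward normalisation for a single
# context: sample a group of outputs, observe rewards, normalise within the
# group. Clipping and the KL penalty to the reference policy are omitted.
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shift-and-scale normalisation over one group of sampled outputs."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0])  # e.g. correctness of 4 sampled answers
print(group_advantages(rewards))          # -> [ 1. -1. -1.  1.]
```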


DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

arXiv.org Artificial Intelligence

Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
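A highly simplified version of the contrastive idea (a sketch, not the exact DistiLLM-2 objective) is to raise the student's likelihood on teacher-generated responses while lowering it on its own responses:

```python
# Schematic contrastive distillation loss, for illustration only: minimise the
# student's NLL on a teacher-generated response and (down-weighted) maximise
# it on the student's own response for the same prompt.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_logits_on_teacher: torch.Tensor,
                             teacher_ids: torch.Tensor,
                             student_logits_on_student: torch.Tensor,
                             student_ids: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    # Logits: (batch, seq, vocab); targets: (batch, seq) token ids.
    nll_teacher = F.cross_entropy(student_logits_on_teacher.transpose(1, 2), teacher_ids)
    nll_student = F.cross_entropy(student_logits_on_student.transpose(1, 2), student_ids)
    return nll_teacher - beta * nll_student
```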


Self-Training Elicits Concise Reasoning in Large Language Models

arXiv.org Artificial Intelligence

Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence of their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning
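A minimal sketch of the best-of-N selection step, assuming sampled candidates and a correctness checker are already available (the released code at the URL above is the authoritative implementation):

```python
# Sketch of best-of-N selection for concise reasoning: among N sampled
# reasoning paths, keep only the correct ones and retain the shortest as the
# fine-tuning target. Length is approximated by whitespace tokens here.
def shortest_correct(candidates: list[str], is_correct) -> str | None:
    correct = [c for c in candidates if is_correct(c)]
    if not correct:
        return None                                   # example is discarded
    return min(correct, key=lambda c: len(c.split()))

# Toy usage: the shorter of two correct traces is kept.
paths = ["Step 1 ... Step 7 ... so the answer is 42", "2 * 21 = 42, so the answer is 42"]
print(shortest_correct(paths, lambda c: c.strip().endswith("42")))
```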


MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

arXiv.org Artificial Intelligence

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
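A schematic, dense version of two-level gating is shown below; the actual MoHAVE routing, expert structure, and sparsity are more involved, so treat this only as an illustration of group-then-expert gating.

```python
# Schematic hierarchical gating: a group gate weights modality-specific expert
# groups, and a per-group gate routes among experts within each group.
# Dense softmax routing is used here for simplicity (no top-k sparsity).
import torch
import torch.nn as nn

class HierarchicalMoE(nn.Module):
    def __init__(self, dim: int, groups: int = 2, experts_per_group: int = 4):
        super().__init__()
        self.group_gate = nn.Linear(dim, groups)
        self.expert_gates = nn.ModuleList(
            nn.Linear(dim, experts_per_group) for _ in range(groups))
        self.experts = nn.ModuleList(
            nn.ModuleList(nn.Linear(dim, dim) for _ in range(experts_per_group))
            for _ in range(groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, dim)
        g = torch.softmax(self.group_gate(x), dim=-1)       # (batch, groups)
        out = torch.zeros_like(x)
        for gi, (gate, experts) in enumerate(zip(self.expert_gates, self.experts)):
            w = torch.softmax(gate(x), dim=-1)               # (batch, experts)
            group_out = sum(w[:, ei:ei + 1] * expert(x)
                            for ei, expert in enumerate(experts))
            out = out + g[:, gi:gi + 1] * group_out
        return out
```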


FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory -- a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating in-domain conditions when running inference out-of-domain. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-à-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.
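A minimal sketch of the stochastic observation-dropout idea; the per-entity layout, drop probability, and zero-masking below are illustrative assumptions, not the released implementation.

```python
# Randomly mask out per-entity observation slices during training so the
# policy also experiences entity counts it will only meet at inference time.
import numpy as np

def flicker_dropout(obs: np.ndarray, n_entities: int, p_drop: float = 0.2,
                    rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    per_entity = obs.copy().reshape(n_entities, -1)   # one row per entity
    keep = rng.random(n_entities) > p_drop
    keep[0] = True                                    # never drop the ego agent
    per_entity[~keep] = 0.0                           # zero out dropped entities
    return per_entity.reshape(-1)

print(flicker_dropout(np.arange(12.0), n_entities=4))
```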


Conditional Synthesis of 3D Molecules with Time Correction Sampler

arXiv.org Artificial Intelligence

Diffusion models have demonstrated remarkable success in various domains, including molecular generation. However, conditional molecular generation remains a fundamental challenge due to an intrinsic trade-off between targeting specific chemical properties and generating meaningful samples from the data distribution. In this work, we present Time-Aware Conditional Synthesis (TACS), a novel approach to conditional generation with diffusion models. It integrates adaptively controlled plug-and-play "online" guidance into a diffusion model, driving samples toward the desired properties while maintaining validity and stability. A key component of our algorithm is a new type of diffusion sampler, the Time Correction Sampler (TCS), which simultaneously controls the guidance and ensures that the generated molecules remain on the correct manifold at each reverse step of the diffusion process. Our proposed method achieves strong performance in conditional 3D molecular generation and offers a promising approach towards inverse molecular design, potentially facilitating advancements in drug discovery, materials science, and other related fields.
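A generic plug-and-play guidance step looks like the sketch below; the Time Correction Sampler additionally corrects which diffusion time the guided sample effectively corresponds to, which is not modelled here, and `denoise_step` / `property_score` are hypothetical callables.

```python
# Generic gradient-based "online" guidance at one reverse diffusion step:
# nudge the unconditional update in the direction that increases a property
# score. This is a simplification, not the paper's Time Correction Sampler.
import torch

def guided_reverse_step(x_t: torch.Tensor, t: int, denoise_step, property_score,
                        scale: float = 1.0) -> torch.Tensor:
    x_t = x_t.detach().requires_grad_(True)
    score = property_score(x_t, t).sum()        # closeness to the target property
    grad = torch.autograd.grad(score, x_t)[0]
    x_prev = denoise_step(x_t.detach(), t)      # unconditional reverse update
    return x_prev + scale * grad                # push toward the desired property
```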


Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models

arXiv.org Artificial Intelligence

Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions. However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets. In this work, we propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets with direct preference optimization (DPO). Specifically, our approach selects data by solving an optimization problem to maximize three components: preference margin, text quality, and text diversity. The preference margin is used to identify samples with high informational value, addressing the noisy nature of the feedback dataset, and is calculated using a proxy reward model. Additionally, we incorporate text quality, assessed by large language models to prevent harmful content, and consider text diversity through a k-nearest neighbor entropy estimator to improve generalization. Finally, we integrate all these components into an optimization process, approximating the solution by assigning an importance score to each data pair and selecting the most important ones. As a result, our method efficiently filters data automatically, without the need for manual intervention, and can be applied to any large-scale dataset. Experimental results show that FiFA significantly enhances training stability and achieves better performance, being preferred by humans 17% more, while using less than 0.5% of the full data and thus 1% of the GPU hours compared to utilizing full human feedback datasets. Warning: This paper contains offensive content that may be upsetting.

Large-scale models trained on extensive web-scale datasets using diffusion techniques (Ho et al., 2020; Song et al., 2020), such as Stable Diffusion (Rombach et al., 2022), Dall-E (Ramesh et al., 2022), and Imagen (Saharia et al., 2022), have enabled the generation of high-fidelity images from diverse text prompts. However, several failure cases remain, such as difficulties in illustrating text content, incorrect counting, or insufficient aesthetics for certain text prompts (Lee et al., 2023; Fan et al., 2024; Black et al., 2023). Fine-tuning text-to-image diffusion models using human feedback has recently emerged as a powerful approach to address this issue (Black et al., 2023; Fan et al., 2024; Prabhudesai et al., 2023; Clark et al., 2023). Unlike the conventional optimization strategy of likelihood maximization, this framework first trains reward models using human feedback (Kirstain et al., 2024; Wu et al., 2023; Xu et al., 2024) and then fine-tunes the diffusion models to maximize reward scores through policy gradient (Fan et al., 2024; Black et al., 2023) or reward-gradient based techniques (Prabhudesai et al., 2023; Clark et al., 2023).
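A schematic version of the scoring-and-selection step is sketched below; the weighting, the k-NN distance used as a diversity proxy, and the keep fraction are illustrative assumptions rather than the paper's exact optimisation.

```python
# Combine a proxy-reward preference margin, an LLM-judged text-quality score,
# and a k-NN-distance diversity proxy into one importance score per data pair,
# then keep the highest-scoring fraction of the dataset.
import numpy as np

def select_pairs(margin: np.ndarray, quality: np.ndarray, embeddings: np.ndarray,
                 keep_frac: float = 0.005, k: int = 5,
                 weights: tuple = (1.0, 1.0, 1.0)) -> np.ndarray:
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    knn_dist = np.sort(dists, axis=1)[:, 1:k + 1].mean(axis=1)   # diversity proxy
    score = weights[0] * margin + weights[1] * quality + weights[2] * knn_dist
    n_keep = max(1, int(keep_frac * len(score)))
    return np.argsort(score)[::-1][:n_keep]                      # selected indices

rng = np.random.default_rng(0)
idx = select_pairs(rng.random(400), rng.random(400), rng.random((400, 16)))
print(len(idx), idx)
```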