He, Jiajun
Towards Training One-Step Diffusion Models Without Distillation
Zhang, Mingtian, He, Jiajun, Chen, Wenlin, Ou, Zijing, Hernández-Lobato, José Miguel, Schölkopf, Bernhard, Barber, David
Recent advances in one-step generative models typically follow a two-stage process: first training a teacher diffusion model and then distilling it into a one-step student model. This distillation process traditionally relies on both the teacher model's score function to compute the distillation loss and its weights for student initialization. In this paper, we explore whether one-step generative models can be trained directly without this distillation process. First, we show that the teacher's score function is not essential and propose a family of distillation methods that achieve competitive results without relying on score estimation. Next, we demonstrate that initialization from teacher weights is indispensable for successful training. Surprisingly, we find that this benefit is not due to an improved ``input-output'' mapping but rather to the learned feature representations, which dominate distillation quality. Our findings provide a better understanding of the role of initialization in one-step model training and its impact on distillation quality.
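For context, a generic score-distillation gradient of the kind the two-stage pipeline relies on can be written as follows; this is only meant to show where the teacher's score enters, and the notation is an illustrative assumption rather than the paper's own objective (the paper's score-free methods replace exactly this ingredient):

$$
\nabla_\theta \mathcal{L}(\theta) \;=\; \mathbb{E}_{t,\, z,\, x_t \sim q_t(\cdot \mid g_\theta(z))}
\Big[\, w(t)\,\big(s_\psi(x_t, t) - s_{\text{teacher}}(x_t, t)\big)\, \frac{\partial x_t}{\partial \theta} \,\Big],
$$

where $g_\theta$ is the one-step student, $s_{\text{teacher}}$ is the teacher diffusion model's score, $s_\psi$ estimates the score of the student's own noised sample distribution, and $w(t)$ is a time weighting.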
No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers
He, Jiajun, Du, Yuanqi, Vargas, Francisco, Zhang, Dinghuai, Padhy, Shreyas, OuYang, RuiKang, Gomes, Carla, Hernández-Lobato, José Miguel
We consider the sampling problem, where the aim is to draw samples from a distribution whose density is known only up to a normalization constant. Recent breakthroughs in generative modeling for approximating high-dimensional data distributions have sparked significant interest in developing neural network-based methods for this challenging problem. However, neural samplers typically incur heavy computational overhead due to simulating trajectories during training. This motivates the pursuit of simulation-free training procedures for neural samplers. In this work, we propose an elegant modification to previous methods, which allows simulation-free training with the help of a time-dependent normalizing flow. However, this approach ultimately suffers from severe mode collapse. On closer inspection, we find that nearly all successful neural samplers rely on Langevin preconditioning to avoid mode collapse. We systematically analyze several popular methods with various objective functions and demonstrate that, in the absence of Langevin preconditioning, most of them fail to adequately cover even a simple target. Finally, we draw attention to a strong baseline that combines the state-of-the-art MCMC method, Parallel Tempering (PT), with an additional generative model, to shed light on future explorations of neural samplers.
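For reference, the Parallel Tempering baseline mentioned above runs MCMC chains at several inverse temperatures targeting $\pi_\beta(x) \propto \exp(-\beta E(x))$ and periodically proposes exchanging states between chains; the standard swap acceptance probability (a textbook formulation, not specific to this paper) for chains $i$ and $j$ is

$$
\alpha_{\text{swap}} \;=\; \min\Big\{1,\; \exp\big[(\beta_i - \beta_j)\,\big(E(x_i) - E(x_j)\big)\big]\Big\}.
$$

These swaps let samples trapped in one mode at low temperature escape via the flatter high-temperature chains, which is what makes PT a strong reference point for mode coverage.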
Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions
Mi, Jinyi, Shi, Xiaohan, Ma, Ding, He, Jiajun, Fujimura, Takuya, Toda, Tomoki
Developing a robust speech emotion recognition (SER) system in noisy conditions faces challenges posed by different noise properties. Most previous studies have not considered the impact of human speech noise, thus limiting the application scope of SER. In this paper, we propose a novel two-stage framework for the problem by cascading target speaker extraction (TSE) method and SER. We first train a TSE model to extract the speech of target speaker from a mixture. Then, in the second stage, we utilize the extracted speech for SER training. Additionally, we explore a joint training of TSE and SER models in the second stage. Our developed system achieves a 14.33% improvement in unweighted accuracy (UA) compared to a baseline without using TSE method, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise. Moreover, we conduct experiments considering speaker gender, showing that our framework performs particularly well in different-gender mixture.
Hybrid Data-Driven SSM for Interpretable and Label-Free mmWave Channel Prediction
Sun, Yiyong, He, Jiajun, Lin, Zhidi, Pu, Wenqiang, Yin, Feng, So, Hing Cheung
Accurate prediction of mmWave time-varying channels is essential for mitigating the issue of channel aging in complex scenarios owing to high user mobility. Existing channel prediction methods have limitations: classical model-based methods often struggle to track highly nonlinear channel dynamics due to limited expert knowledge, while emerging data-driven methods typically require substantial labeled data for effective training and often lack interpretability. To address these issues, this paper proposes a novel hybrid method that integrates a data-driven neural network into a conventional model-based workflow based on a state-space model (SSM), implicitly tracking complex channel dynamics from data without requiring precise expert knowledge. Additionally, a novel unsupervised learning strategy is developed to train the embedded neural network solely with unlabeled data. Theoretical analyses and ablation studies are conducted to interpret the enhanced benefits gained from the hybrid integration. Numerical simulations based on the 3GPP mmWave channel model corroborate the superior prediction accuracy of the proposed method, compared to state-of-the-art methods that are either purely model-based or data-driven. Furthermore, extensive experiments validate its robustness against various challenging factors, including, among others, severe channel variations and high noise levels.
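A generic sketch of such a hybrid (illustrative only; the paper's exact state, observation model, and training objective may differ) embeds a neural network in the transition of a state-space model:

$$
\mathbf{h}_{t+1} = f_{\theta}(\mathbf{h}_t) + \mathbf{w}_t, \qquad \mathbf{y}_t = g(\mathbf{h}_t) + \mathbf{v}_t,
$$

where $\mathbf{h}_t$ is the latent channel state, $\mathbf{y}_t$ the received observation, $f_\theta$ a learned transition replacing a hand-crafted dynamic model, and $\mathbf{w}_t, \mathbf{v}_t$ process and measurement noise; channel prediction then reduces to filtering and forecasting $\mathbf{h}_t$ from past observations.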
Training Neural Samplers with Reverse Diffusive KL Divergence
He, Jiajun, Chen, Wenlin, Zhang, Mingtian, Barber, David, Hernández-Lobato, José Miguel
Training generative models to sample from unnormalized density functions is an important and challenging task in machine learning. Traditional training methods often rely on the reverse Kullback-Leibler (KL) divergence due to its tractability. However, the mode-seeking behavior of reverse KL hinders effective approximation of multi-modal target distributions. To address this, we propose to minimize the reverse KL along diffusion trajectories of both model and target densities. We refer to this objective as the reverse diffusive KL divergence, which allows the model to capture multiple modes. Leveraging this objective, we train neural samplers that can efficiently generate samples from the target distribution in one step. We demonstrate that our method enhances sampling performance across various Boltzmann distributions, including both synthetic multi-modal densities and n-body particle systems.
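One natural way to write the diffusive objective described above is as a weighted integral of reverse KL divergences between the noised model and target densities; the weighting $w(t)$ and the Gaussian kernel below are generic choices rather than the paper's exact definitions:

$$
\mathcal{D}_{\text{rDKL}}\big(q_\theta \,\|\, p\big) \;=\; \int_0^T w(t)\, D_{\text{KL}}\big(q_{\theta,t} \,\|\, p_t\big)\, \mathrm{d}t,
\qquad q_{\theta,t} = q_\theta * \mathcal{N}(0, \sigma_t^2 I),\;\; p_t = p * \mathcal{N}(0, \sigma_t^2 I),
$$

where $*$ denotes convolution with the diffusion kernel at time $t$. Because both densities are smoothed by the same noise, distant modes of the target overlap with the model at large $t$, which counteracts the mode-seeking behavior of the plain reverse KL at $t = 0$.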
Getting Free Bits Back from Rotational Symmetries in LLMs
He, Jiajun, Flamich, Gergely, Hernández-Lobato, José Miguel
Current methods for compressing neural network weights, such as decomposition, pruning, quantization, and channel simulation, often overlook the inherent symmetries within these networks and thus waste bits on encoding redundant information. In this paper, we propose a format based on bits-back coding for storing rotationally symmetric Transformer weights more efficiently than the usual array layout at the same floating-point precision. We evaluate our method on Large Language Models (LLMs) pruned by SliceGPT (Ashkboos et al., 2024) and achieve a 3-5% reduction in total bit usage for free across different model sizes and architectures, without impacting model performance within a certain numerical precision. Modern neural networks, particularly Large Language Models (LLMs), typically contain billions of parameters, so encoding and transmitting these models efficiently is gaining widespread interest. However, existing compression techniques ignore the fact that neural networks typically exhibit symmetries in their weight space. For example, in feedforward networks, applying a random permutation to the neurons in one layer and its inverse to the weights in the subsequent layer leaves the output unchanged. Encoding weights without accounting for these symmetries leads to a suboptimal codelength.
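As a quick illustration of the permutation symmetry mentioned above, here is a minimal numpy sketch (the paper itself exploits rotational symmetries introduced by SliceGPT, which this toy example does not reproduce):

```python
import numpy as np

# Toy two-layer feedforward network y = W2 * relu(W1 * x).
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0.0)
y = W2 @ relu(W1 @ x)

# Permute the hidden neurons (rows of W1) and reorder the columns of W2 to
# match; this undoes the permutation, so the network output is unchanged.
perm = rng.permutation(d_hidden)
y_perm = W2[:, perm] @ relu(W1[perm, :] @ x)

assert np.allclose(y, y_perm)  # identical function, different weight encodings
```

Because many distinct weight arrays encode the same function, a coder that either picks a canonical representative or, as in bits-back coding, gets the freedom of that choice refunded can spend fewer bits overall.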
Accelerating Relative Entropy Coding with Space Partitioning
He, Jiajun, Flamich, Gergely, Hernández-Lobato, José Miguel
Relative entropy coding (REC) algorithms encode a random sample following a target distribution $Q$, using a coding distribution $P$ shared between the sender and receiver. Sadly, general REC algorithms suffer from prohibitive encoding times, at least on the order of $2^{D_{\text{KL}}[Q||P]}$, and faster algorithms are limited to very specific settings. This work addresses this issue by introducing a REC scheme utilizing space partitioning to reduce runtime in practical scenarios. We provide theoretical analyses of our method and demonstrate its effectiveness with both toy examples and practical applications. Notably, our method successfully handles REC tasks with $D_{\text{KL}}[Q||P]$ about three times greater than what previous methods can manage, and reduces the bitrate by approximately 5-15% in VAE-based lossless compression on MNIST and INR-based lossy compression on CIFAR-10, compared to previous methods, significantly improving the practicality of REC for neural compression.
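To see why the exponential runtime is the practical bottleneck, consider purely illustrative numbers (not taken from the paper):

$$
D_{\text{KL}}[Q\|P] = 20 \text{ bits} \;\Rightarrow\; 2^{20} \approx 10^{6} \text{ steps}, \qquad
D_{\text{KL}}[Q\|P] = 60 \text{ bits} \;\Rightarrow\; 2^{60} \approx 10^{18} \text{ steps},
$$

so even a modest increase in the KL divergence pushes general-purpose REC algorithms from feasible to hopeless, which is the regime the space-partitioning scheme targets.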
Bidirectional Consistency Models
Li, Liangchen, He, Jiajun
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, largely reducing the number of iterations. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. Notably, our proposed method enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. Furthermore, by leveraging our model's bidirectional consistency, we introduce a sampling strategy that can improve FID while preserving the generated image content. We further showcase our model's capabilities in several downstream tasks, such as interpolation and inpainting, and present demonstrations of potential applications, including blind restoration of compressed images and defending against black-box adversarial attacks.
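To make the forward/backward traversal concrete: a standard consistency model learns a map $f_\theta(\mathbf{x}_t, t) \approx \mathbf{x}_0$ that sends any point on a PF ODE trajectory to the trajectory's data endpoint. One natural way to formalize the bidirectional version described above (not necessarily the paper's exact parameterization) is a map between arbitrary times on the same trajectory,

$$
f_\theta(\mathbf{x}_t, t, t') \;\approx\; \mathbf{x}_{t'}, \qquad \text{with } t' < t \text{ for generation and } t' > t \text{ for inversion,}
$$

so that a single network covers both directions and can be applied once for one-step results or several times for refinement.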
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
He, Jiajun, Shi, Xiaohan, Li, Xingfeng, Toda, Tomoki
The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue with this approach is that ASR errors in the text modality can worsen the performance of SER. Previous studies have proposed using an auxiliary ASR error detection task to adaptively assign a weight to each word in the ASR hypotheses. However, this approach has limited improvement potential because it does not address the coherence of semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making their fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1\%.
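One standard way to train with the two auxiliary tasks described above is a weighted multi-task objective; the weights below are generic hyperparameters, and the exact combination used by MF-AED-AEC may differ:

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{SER}} \;+\; \lambda_{\text{AED}}\, \mathcal{L}_{\text{AED}} \;+\; \lambda_{\text{AEC}}\, \mathcal{L}_{\text{AEC}},
$$

where the AED and AEC losses supervise the shared text representation so that the fused features fed to the SER head are less sensitive to recognition errors.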
On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition
Shi, Xiaohan, He, Jiajun, Li, Xingfeng, Toda, Tomoki
This paper proposes an efficient attempt at noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream SER task.

Typically, three common approaches are used to address the issue of noisy speech emotion recognition (NSER): the signal level, the feature level, and the model level, as outlined by Tiwari et al. [2]. For instance, Pandharipande et al. [3] used a voice activity detector to reduce noise at the signal level. Lachiri et al. [4] introduced a novel approach involving MFCC shifted-delta-cepstral coefficients at the feature level. Tiwari et al. [2] devised a generative noise model at the model level. The previously mentioned studies have proven effective in mitigating the impact of common noise sources like white Gaussian noise on speech-related tasks. However, in real-world settings, a distinct category of noise sounds, such as high-heeled shoes and door knocking, presents a substantial challenge.
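A minimal sketch of the feature-extraction idea described above, assuming a wav2vec 2.0 CTC model from HuggingFace Transformers as the ASR backbone; the specific checkpoint, layer index, pooling, and classifier head are illustrative assumptions rather than the paper's exact setup:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Pretrained ASR model used purely as a frozen, noise-robust feature extractor.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
asr_model.eval()

def extract_features(waveform_16khz, layer=9):
    """Mean-pool one intermediate ASR layer into a fixed-size utterance vector."""
    inputs = processor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = asr_model(inputs.input_values, output_hidden_states=True)
    hidden = outputs.hidden_states[layer]   # (1, frames, hidden_dim)
    return hidden.mean(dim=1)               # (1, hidden_dim)

# Downstream SER head trained on top of the frozen ASR features.
num_emotions = 4
ser_head = nn.Linear(asr_model.config.hidden_size, num_emotions)

waveform = torch.randn(16000)               # 1 s of dummy 16 kHz audio
logits = ser_head(extract_features(waveform.numpy()))
```

The intuition matching the abstract is that the ASR encoder is trained to preserve linguistic and vocal content while discarding background events, so its intermediate representations carry less of the non-vocal, real-world noise into the emotion classifier.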