He, Ting
Qwen2.5-Omni Technical Report
Xu, Jin, Guo, Zhifang, He, Jinzheng, Hu, Hangrui, He, Ting, Bai, Shuai, Chen, Keqin, Wang, Jialin, Fan, Yang, Dang, Kai, Zhang, Bin, Wang, Xiong, Chu, Yunfei, Lin, Junyang
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
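To make the time-aligned interleaving concrete, here is a minimal sketch (not the official implementation) of ordering audio and video tokens by timestamp within fixed-length chunks, in the spirit of TMRoPE's interleaved organization; the chunk length, token rates, and per-chunk ordering below are illustrative assumptions.

def interleave_by_time(audio_tokens, video_tokens, chunk_sec=2.0):
    """audio_tokens / video_tokens: lists of (timestamp_sec, token)."""
    merged = []
    t = 0.0
    max_t = max(
        audio_tokens[-1][0] if audio_tokens else 0.0,
        video_tokens[-1][0] if video_tokens else 0.0,
    )
    while t <= max_t:
        # gather all tokens whose timestamp falls inside the current chunk
        a = [tok for ts, tok in audio_tokens if t <= ts < t + chunk_sec]
        v = [tok for ts, tok in video_tokens if t <= ts < t + chunk_sec]
        merged.extend(v + a)   # video tokens first, then audio, per time chunk
        t += chunk_sec
    return merged

# toy usage: 2 video tokens and 4 audio tokens per second
video = [(i * 0.5, f"v{i}") for i in range(8)]
audio = [(i * 0.25, f"a{i}") for i in range(16)]
print(interleave_by_time(audio, video))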
AI-assisted Knowledge Discovery in Biomedical Literature to Support Decision-making in Precision Oncology
He, Ting, Kreimeyer, Kory, Najjar, Mimi, Spiker, Jonathan, Fatteh, Maria, Anagnostou, Valsamo, Botsis, Taxiarchis
The delivery of appropriate targeted therapies to cancer patients requires the complete analysis of the molecular profiling of tumors and the patient's clinical characteristics in the context of existing knowledge and recent findings described in the biomedical literature and several other sources. We evaluated the potential contributions of specific natural language processing solutions to support knowledge discovery from biomedical literature. Two models from the Bidirectional Encoder Representations from Transformers (BERT) family, two Large Language Models, and PubTator 3.0 were tested for their ability to support the named entity recognition (NER) and relation extraction (RE) tasks. PubTator 3.0 and the BioBERT model performed best in the NER task (best F1-scores of 0.93 and 0.89, respectively), while BioBERT outperformed all other solutions in the RE task (best F1-score of 0.79) and in a specific use case it was applied to, recognizing nearly all entity mentions and most of the relations.
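As a hedged illustration of the NER setup described above, the sketch below runs a BERT-family token-classification model through the Hugging Face pipeline API; the checkpoint path is a placeholder for a BioBERT model fine-tuned for biomedical NER and is an assumption, not the authors' exact configuration.

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/biobert-ner-checkpoint",   # placeholder, not a real model id
    aggregation_strategy="simple",            # merge word pieces into entity spans
)

text = "Patients with EGFR L858R mutations often respond to osimertinib."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))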
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
An, Keyu, Chen, Qian, Deng, Chong, Du, Zhihao, Gao, Changfeng, Gao, Zhifu, Gu, Yue, He, Ting, Hu, Hangrui, Hu, Kai, Ji, Shengpeng, Li, Yabin, Li, Zerui, Lu, Heng, Luo, Haoneng, Lv, Xiang, Ma, Bin, Ma, Ziyang, Ni, Chongjia, Song, Changhe, Shi, Jiaqi, Shi, Xian, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xiao, Zhangyu, Yan, Zhijie, Yang, Yexin, Zhang, Bin, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, Zheng, Siqi
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning code released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
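The speech-to-speech applications mentioned above follow a recognize-reason-synthesize composition; the sketch below shows that flow with hypothetical transcribe(), chat(), and synthesize() stand-ins for SenseVoice, an LLM, and CosyVoice, and is not the project's actual API.

def speech_to_speech(audio_in, transcribe, chat, synthesize):
    """Turn a user utterance (waveform) into a spoken reply (waveform)."""
    result = transcribe(audio_in)                      # SenseVoice-style ASR plus
    text, emotion = result["text"], result["emotion"]  # emotion/event tags
    reply = chat(text, style_hint=emotion)             # LLM produces the reply text
    return synthesize(reply, style=emotion)            # CosyVoice-style expressive TTS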
Energy-efficient Decentralized Learning via Graph Sparsification
Zhang, Xusheng, Chiu, Cho-Chun, He, Ting
This work aims at improving the energy efficiency of decentralized learning by optimizing the mixing matrix, which controls the communication demands during the learning process. Through rigorous analysis based on a state-of-the-art decentralized learning algorithm, the problem is formulated as a bi-level optimization, with the lower level solved by graph sparsification. A solution with guaranteed performance is proposed for the special case of a fully-connected base topology, and a greedy heuristic is proposed for the general case. Simulations based on a real topology and dataset show that the proposed solution can lower the energy consumption at the busiest node by 54%-76% while maintaining the quality of the trained model.
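For intuition about the sparsification step, here is a generic greedy heuristic, sketched under assumed (made-up) per-link energy costs, that trims communication links while preserving connectivity; it illustrates the flavor of the approach rather than reproducing the paper's algorithm.

import networkx as nx

def greedy_sparsify(G, energy, budget):
    """Remove up to `budget` edges, costliest first, keeping G connected.
    G: networkx Graph; energy: dict mapping edge (u, v) -> energy cost."""
    H = G.copy()
    removed = 0
    for u, v in sorted(H.edges, key=lambda e: energy[e], reverse=True):
        if removed == budget:
            break
        H.remove_edge(u, v)
        if nx.is_connected(H):
            removed += 1
        else:
            H.add_edge(u, v)   # undo: removing this link would disconnect the graph
    return H

# toy usage on a fully-connected 5-node base topology with placeholder costs
G = nx.complete_graph(5)
energy = {e: e[0] + e[1] + 1 for e in G.edges}
print(greedy_sparsify(G, energy, budget=4).edges)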
Sharing Models or Coresets: A Study based on Membership Inference Attack
Lu, Hanlin, Liu, Changchang, He, Ting, Wang, Shiqiang, Chan, Kevin S.
We present the first comparison between the federated learning approach and the coreset-based approach to distributed machine learning in terms of target model accuracy, communication cost, and data privacy. While each approach preserves data privacy to some extent thanks to not sharing the raw data, the exact extent of protection is unclear under sophisticated attacks that try to infer the raw data from the shared information; in particular, it has been shown in (Shokri et al., 2017) that models derived from a dataset can be used to infer the membership of the dataset (i.e., whether or not a given data record is contained in the dataset), known as the membership inference attack (MIA). We therefore measure privacy by the accuracy of this state-of-the-art attack strategy. Our experiments quantify the accuracy-privacy-cost tradeoff of each approach, and reveal a nontrivial comparison that can be used to guide the design of model training processes.
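To ground the privacy metric, the following is a minimal confidence-threshold membership inference sketch on synthetic data; it is a generic baseline for illustration only, not the attack of Shokri et al. (2017) used in the paper, and the threshold is an assumption.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)
target = LogisticRegression(max_iter=1000).fit(X_in, y_in)   # "members" = X_in

def confidence(model, X, y):
    # probability the model assigns to the true label of each example
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

thr = 0.5   # assumed threshold: members tend to receive higher confidence
guess_in = confidence(target, X_in, y_in) > thr      # ideally mostly True
guess_out = confidence(target, X_out, y_out) > thr   # ideally more often False
mia_accuracy = 0.5 * (guess_in.mean() + (~guess_out).mean())
print(f"membership inference accuracy: {mia_accuracy:.2f}")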
Online Learning of Facility Locations
Pasteris, Stephen, He, Ting, Vitale, Fabio, Wang, Shiqiang, Herbster, Mark
In this paper we consider an online learning version of the facility location problem, where users need to be served one at a time in a sequence of trials. The goal is to select, at each trial, a subset of a given set of sites, and then pay a loss equal to their total "opening cost" plus the minimum "connection cost" for connecting the user to one of the sites in the subset. More precisely, we are given a set of N sites. At the beginning of each trial, an opening cost and a connection cost for the arriving user are associated with each site and are unknown to the learner. At each trial, the learner has to select a subset of sites and incurs a loss given by the minimum connection cost over the selected sites plus the sum of the opening costs of all selected sites. After each subset selection, the opening and connection costs of all sites are revealed. To solve this problem, we design and rigorously analyse an algorithm belonging to the class of online learning algorithms that make use of the exponentiated gradient method [15]. We measure the performance of our method by comparing its cumulative loss with that of any fixed subset of sites.
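As a rough illustration of the per-trial loss structure and a multiplicative-weights (exponentiated-gradient flavored) update over the N sites, consider the toy sketch below; the subset rule, learning rate, and cost distributions are assumptions, and the paper's actual selection and update rules are more involved.

import numpy as np

rng = np.random.default_rng(0)
N, T, eta = 6, 100, 0.1
weights = np.ones(N) / N                        # one weight per site

for t in range(T):
    open_cost = rng.uniform(0.0, 0.2, size=N)   # revealed only after the choice
    conn_cost = rng.uniform(0.0, 1.0, size=N)

    chosen = weights >= weights.mean()          # crude subset rule (assumption)
    loss = open_cost[chosen].sum() + conn_cost[chosen].min()

    # multiplicative update: sites that look cheap in hindsight gain weight
    weights *= np.exp(-eta * (open_cost + conn_cost))
    weights /= weights.sum()

print("final site weights:", np.round(weights, 3))
print("last trial loss:", round(float(loss), 3))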
Robust Coreset Construction for Distributed Machine Learning
Lu, Hanlin, Li, Ming-Ju, He, Ting, Wang, Shiqiang, Narayanan, Vijay, Chan, Kevin S.
Motivated by the need to solve machine learning problems over distributed datasets, we explore the use of coresets to reduce the communication overhead. A coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, a coreset has the advantage that it can be used as a proxy for the original dataset, potentially for different applications. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. Thus, to solve different machine learning problems, one has to collect coresets of different types, defeating the purpose of saving communication overhead. We resolve this dilemma by developing coreset construction algorithms based on k-means/median clustering that give a provably good approximation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the robust performance of the proposed algorithms.
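The core idea can be sketched as follows: summarize the dataset by k-means centers weighted by cluster sizes, and use the weighted summary as a stand-in for the full data in downstream learning; the choice of k and the approximation guarantees are the paper's contribution and are not captured by this toy code.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_coreset(X, k, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    points = km.cluster_centers_                     # coreset points
    weights = np.bincount(km.labels_, minlength=k)   # how many points each represents
    return points, weights

X = np.random.default_rng(0).normal(size=(1000, 5))
points, weights = kmeans_coreset(X, k=20)
print(points.shape, int(weights.sum()))   # (20, 5) and 1000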
When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning
Wang, Shiqiang, Tuor, Tiffany, Salonidis, Theodoros, Leung, Kin K., Makaya, Christian, He, Ting, Chan, Kevin
Emerging technologies and applications including the Internet of Things (IoT), social networking, and crowd-sourcing generate large amounts of data at the network edge. Machine learning models are often built from the collected data to enable the detection, classification, and prediction of future events. Due to bandwidth, storage, and privacy concerns, it is often impractical to send all the data to a centralized location. In this paper, we consider the problem of learning model parameters from data distributed across multiple edge nodes, without sending raw data to a centralized place. Our focus is on a generic class of machine learning models that are trained using gradient-descent based approaches. We analyze the convergence rate of distributed gradient descent from a theoretical point of view, based on which we propose a control algorithm that determines the best trade-off between local updates and global parameter aggregation to minimize the loss function under a given resource budget. The performance of the proposed algorithm is evaluated via extensive experiments with real datasets, both on a networked prototype system and in a larger-scale simulated environment. The experimental results show that our proposed approach performs close to the optimum with various machine learning models and different data distributions.
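For concreteness, here is a minimal sketch of gradient descent with periodic parameter aggregation: each node runs tau local steps on its own data, after which the parameters are averaged. The adaptive choice of tau under a resource budget is the paper's contribution and is not reproduced here; tau, the model, and the data are illustrative assumptions.

import numpy as np

def local_step(w, X, y, lr=0.1):
    # one gradient step of least-squares regression on a node's local data
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
nodes = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
w_global = np.zeros(3)
tau, rounds = 5, 20                           # local steps per aggregation round

for _ in range(rounds):
    local_ws = []
    for X, y in nodes:
        w = w_global.copy()
        for _ in range(tau):                  # tau local updates
            w = local_step(w, X, y)
        local_ws.append(w)
    w_global = np.mean(local_ws, axis=0)      # global aggregation (simple average)

print("aggregated model:", np.round(w_global, 3))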