Yu, Hao
CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning
Yu, Hao, Zhao, Zhuokai, Yan, Shen, Korycki, Lukasz, Wang, Jianyu, He, Baosheng, Liu, Jiayi, Zhang, Lizhu, Fan, Xiangjun, Yu, Hanchao
The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed fine-tuning LVLMs for representation learning, but the fine-tuned model often loses its generative capabilities because of the representation learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
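As a rough illustration of how a contrastive objective can be combined with autoregressive language modeling, the sketch below adds a symmetric InfoNCE loss over paired image and text embeddings to a next-token cross-entropy term. The function names, the temperature value, and the weighting factor `alpha` are assumptions made for illustration; the abstract does not specify how CAFe weights or implements the two terms.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / temperature          # (B, B) cosine similarities
    idx = np.arange(sim.shape[0])            # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def next_token_loss(logits, targets):
    """Mean cross-entropy; logits has shape (T, V), targets has shape (T,)."""
    logp = log_softmax(logits, axis=-1)
    return -logp[np.arange(len(targets)), targets].mean()


def combined_loss(img_emb, txt_emb, logits, targets, alpha=0.5):
    # Hypothetical convex weighting of the two objectives (an assumption).
    contrastive = info_nce(img_emb, txt_emb)
    generative = next_token_loss(logits, targets)
    return alpha * contrastive + (1.0 - alpha) * generative
```

Setting `alpha` to 1 would recover a purely contrastive fine-tuning objective and setting it to 0 a purely generative one, which is the trade-off the abstract describes.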
A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff
Yu, Hao, Ji, Xiangyang
We propose a first near-complete (in a sense made explicit in the main text) nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions (under some very mild conditions). In particular, it does not require the loss function to be bounded, as is commonly assumed in the literature. Our theory goes beyond the bias-variance tradeoff, in line with phenomena typically encountered in deep learning. It is therefore sharply different from existing nonasymptotic generalization error bounds for neural networks. More explicitly, we propose an explicit generalization error upper bound for multilayer neural networks with arbitrary Lipschitz activations $\sigma$ with $\sigma(0)=0$ and sufficiently broad Lipschitz loss functions. The bound does not require the width, depth, or other hyperparameters of the network to approach infinity, nor does it require a specific network architecture (e.g., sparsity or boundedness of certain norms), a particular activation function, a particular optimization algorithm, or boundedness of the loss function, and it takes the approximation error into account. General Lipschitz activations can also be accommodated in our framework. Furthermore, we show the near-minimax optimality of our theory for multilayer ReLU networks in regression problems. Notably, our upper bound exhibits the famous double descent phenomenon for such networks, which is its most distinguishing characteristic compared with existing results. This work emphasizes the view that many classical results should be improved to embrace the unintuitive characteristics of deep learning in order to understand it better.
LitLinker: Supporting the Ideation of Interdisciplinary Contexts with Large Language Models for Teaching Literature in Elementary Schools
Fan, Haoxiang, Zhou, Changshuang, Yu, Hao, Wu, Xueyang, Gu, Jiangyu, Peng, Zhenhui
Teaching literature under interdisciplinary contexts (e.g., science, art) that connect reading materials has become popular in elementary schools. However, constructing such contexts is challenging as it requires teachers to explore substantial amounts of interdisciplinary content and link it to the reading materials. In this paper, we develop LitLinker via an iterative design process involving 13 teachers to facilitate the ideation of interdisciplinary contexts for teaching literature. Powered by a large language model (LLM), LitLinker can recommend interdisciplinary topics and contextualize them with the literary elements (e.g., paragraphs, viewpoints) in the reading materials. A within-subjects study (N=16) shows that compared to an LLM chatbot, LitLinker can improve the integration depth of different subjects and reduce workload in this ideation task. Expert interviews (N=9) also demonstrate LitLinker's usefulness for supporting the ideation of interdisciplinary contexts for teaching literature. We conclude with concerns and design considerations for supporting interdisciplinary teaching with LLMs.
INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages
Yu, Hao, Alabi, Jesujoba O., Bukula, Andiswa, Zhuang, Jian Yun, Lee, En-Shiun Annie, Guge, Tadesse Kebede, Azime, Israel Abebe, Buzaaba, Happy, Sibanda, Blessing Kudzaishe, Kalipe, Godson K., Mukiibi, Jonathan, Kabenamualu, Salomon Kabongo, Setaka, Mmasibidi, Ndolela, Lolwethu, Odu, Nkiruka, Mabuya, Rooweither, Muhammad, Shamsuddeen Hassan, Osei, Salomey, Samb, Sokhar, Murage, Juliet W., Klakow, Dietrich, Adelani, David Ifeoluwa
Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo, a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark fine-tuned multilingual transformer models and prompted large language models (LLMs), and show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer from English. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average F1-score of 26. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls behind the fine-tuning baselines. On English, by comparison, GPT-4o and the fine-tuning baselines perform similarly on intent detection, achieving an accuracy of approximately 81%. Our findings suggest that LLM performance still lags for many low-resource African languages, and more work is needed to improve their downstream performance.
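For readers unfamiliar with the two metrics reported above, the following sketch shows one common way to score them: micro-averaged exact-match F1 over slot spans, and plain accuracy over intent labels. The span representation `(slot_type, start, end)` and both function names are assumptions for illustration; the benchmark's own scorer may differ in detail.

```python
def slot_f1(gold, pred):
    """Micro-averaged F1 over exact-match (slot_type, start, end) spans."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        g, p = set(gold_spans), set(pred_spans)
        tp += len(g & p)   # spans predicted with the correct type and boundaries
        fp += len(p - g)   # predicted spans with no exact gold match
        fn += len(g - p)   # gold spans the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def intent_accuracy(gold, pred):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Under exact matching, a span with the right slot type but a boundary that is off by one token counts as both a false positive and a false negative, which is one reason slot-filling scores tend to sit well below intent accuracy.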
Ten Challenging Problems in Federated Foundation Models
Fan, Tao, Gu, Hanlin, Cao, Xuemei, Chan, Chee Seng, Chen, Qian, Chen, Yiqiang, Feng, Yihui, Gu, Yang, Geng, Jiaxiang, Luo, Bing, Liu, Shuoling, Ong, Win Kent, Ren, Chao, Shao, Jiaqi, Sun, Chuan, Tang, Xiaoli, Tae, Hong Xi, Tong, Yongxin, Wei, Shuyue, Wu, Fan, Xi, Wei, Xu, Mingcong, Yang, He, Yang, Xin, Yan, Jiangpeng, Yu, Hao, Yu, Han, Zhang, Teng, Zhang, Yifei, Zhang, Xiaojin, Zheng, Zhenzhe, Fan, Lixin, Yang, Qiang
Federated Foundation Models (FedFMs) represent a distributed learning paradigm that fuses the general competences of foundation models with the privacy-preserving capabilities of federated learning. This combination allows the large foundation models and the small local domain models at the remote clients to learn from each other in a teacher-student learning setting. This paper provides a comprehensive summary of the ten challenging problems inherent in FedFMs, encompassing foundational theory, utilization of private data, continual learning, unlearning, Non-IID and graph data, bidirectional knowledge transfer, incentive mechanism design, game mechanism design, model watermarking, and efficiency. The ten challenging problems manifest in five pivotal aspects: "Foundational Theory," which aims to establish a coherent and unifying theoretical framework for FedFMs; "Data," addressing the difficulties in leveraging domain-specific knowledge from private data while maintaining privacy; "Heterogeneity," examining variations in data, model, and computational resources across clients; "Security and Privacy," focusing on defenses against malicious attacks and model theft; and "Efficiency," highlighting the need for improvements in training, communication, and parameter efficiency. For each problem, we offer a clear mathematical definition of the objective function, analyze existing methods, and discuss the key challenges and potential solutions. This in-depth exploration aims to advance the theoretical foundations of FedFMs, guide practical implementations, and inspire future research to overcome these obstacles, thereby enabling robust, efficient, and privacy-preserving FedFMs in various real-world applications.
A New Perspective on Privacy Protection in Federated Learning with Granular-Ball Computing
Lai, Guannan, Feng, Yihui, Yang, Xin, Deng, Xiaoyu, Yu, Hao, Xia, Shuyin, Wang, Guoyin, Li, Tianrui
Federated Learning (FL) facilitates collaborative model training while prioritizing privacy by avoiding direct data sharing. However, most existing work attempts to address challenges within the model's internal parameters and corresponding outputs, while neglecting to solve them at the input level. To address this gap, we propose a novel framework called Granular-Ball Federated Learning (GrBFL) for image classification. GrBFL diverges from traditional methods that rely on the finest-grained input data. Instead, it segments images into multiple regions with optimal coarse granularity, which are then reconstructed into a graph structure. We design a two-dimensional binary search segmentation algorithm based on variance constraints for GrBFL, which effectively removes redundant information while preserving key representative features. Extensive theoretical analysis and experiments demonstrate that GrBFL not only safeguards privacy and enhances efficiency but also maintains robust utility, consistently outperforming other state-of-the-art FL methods. The code is available at https://github.com/AIGNLAI/GrBFL.
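To give a concrete feel for variance-constrained coarse segmentation, the sketch below recursively halves a grayscale image along its longer side until each region's pixel variance falls under a threshold. This is a generic variance-threshold split written for illustration, not the authors' algorithm (which is in the linked repository); the function name, the threshold `max_var`, and the stopping size `min_size` are all assumptions.

```python
import numpy as np


def split_regions(image, max_var=25.0, min_size=2):
    """Tile a 2-D array with boxes (row0, row1, col0, col1) of low variance."""
    regions = []
    stack = [(0, image.shape[0], 0, image.shape[1])]
    while stack:
        r0, r1, c0, c1 = stack.pop()
        block = image[r0:r1, c0:c1]
        h, w = r1 - r0, c1 - c0
        if block.var() <= max_var or max(h, w) <= min_size:
            # Homogeneous enough (or too small to split further): keep as one region.
            regions.append((r0, r1, c0, c1))
        elif h >= w:
            mid = (r0 + r1) // 2             # halve along rows
            stack += [(r0, mid, c0, c1), (mid, r1, c0, c1)]
        else:
            mid = (c0 + c1) // 2             # halve along columns
            stack += [(r0, r1, c0, mid), (r0, r1, mid, c1)]
    return regions
```

Each returned box could then be summarized (for example, by its mean intensity and position) and treated as a graph node, which is in the spirit of the region-to-graph reconstruction the abstract mentions.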
Addressing Spatial-Temporal Data Heterogeneity in Federated Continual Learning via Tail Anchor
Yu, Hao, Yang, Xin, Zhang, Le, Gu, Hanlin, Li, Tianrui, Fan, Lixin, Yang, Qiang
Federated continual learning (FCL) allows each client to continually update its knowledge from task streams, enhancing the applicability of federated learning in real-world scenarios. However, FCL needs to address not only spatial data heterogeneity between clients but also temporal data heterogeneity between tasks. In this paper, empirical experiments demonstrate that such input-level heterogeneity significantly affects the model's internal parameters and outputs, leading to severe spatial-temporal catastrophic forgetting of local and previous knowledge. To this end, we propose Federated Tail Anchor (FedTA) to mix trainable Tail Anchor with the frozen output features to adjust their position in the feature space, thereby overcoming parameter-forgetting and output-forgetting. Moreover, three novel components are also included in FedTA: Input Enhancement for improving the performance of pre-trained models on downstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous local knowledge on the server side; and Best Global Prototype Selection for finding the best anchor point for each class in the feature space. Extensive experiments demonstrate that FedTA not only outperforms existing FCL methods but also effectively preserves the relative positions of features, remaining unaffected by spatial and temporal changes.
SpasticMyoElbow: Physical Human-Robot Interaction Simulation Framework for Modelling Elbow Spasticity
Yu, Hao, Huang, Zebin, Li, Yutong, Guo, Xinliang, Crocher, Vincent, Carlucho, Ignacio, Erden, Mustafa Suphi
Robotic devices hold great potential for efficient and reliable assessment of neuromotor abnormalities in post-stroke patients. However, spasticity caused by stroke is still assessed manually in clinical settings. The limited and variable nature of data collected from patients has long posed a major barrier to quantitatively modelling spasticity with robotic measurements and fully validating robotic assessment techniques. This paper presents a simulation framework developed to support the design and validation of elbow spasticity models and mitigate data problems. The framework consists of a simulation environment of robot-assisted spasticity assessment, two motion controllers for the robot and human models, and a stretch reflex controller. Our framework allows simulation based on synthetic data without experimental data from human subjects. Using this framework, we replicated the constant-velocity stretch experiment typically used in robot-assisted spasticity assessment and evaluated four types of spasticity models. Our results show that a spasticity reflex model incorporating feedback on both muscle fibre velocity and length more accurately captures joint resistance characteristics during passive elbow stretching in spastic patients than a force-dependent model. When integrated with an appropriate spasticity model, this simulation framework has the potential to generate extensive datasets of virtual patients for future research on spasticity assessment.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Xu, Yifan, Liu, Xiao, Sun, Xueqiao, Cheng, Siyi, Yu, Hao, Lai, Hanyu, Zhang, Shudan, Zhang, Dan, Tang, Jie, Dong, Yuxiao
Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently become a frequently mentioned interaction method. However, existing studies on training and evaluating Android agents lack systematic research covering both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, an action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document
Dong, Vicky, Yu, Hao, Chen, Yao
This study introduces a novel approach to sentence-level relation extraction (RE) that integrates Graph Neural Networks (GNNs) with Large Language Models (LLMs) to generate contextually enriched support documents. By harnessing the power of LLMs to generate auxiliary information, our approach crafts an intricate graph representation of textual data. This graph is subsequently processed by a GNN to refine and enrich the embeddings associated with each entity, ensuring a more nuanced and interconnected understanding of the data. This methodology addresses the limitations of traditional sentence-level RE models by incorporating broader contexts and leveraging inter-entity interactions, thereby improving the model's ability to capture complex relationships across sentences. Our experiments, conducted on the CrossRE dataset, demonstrate the effectiveness of our approach, with notable improvements in performance across various domains. The results underscore the potential of combining GNNs with LLM-generated context to advance the field of relation extraction.