Wang, Hao
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Li, Haonan, Han, Xudong, Zhai, Zenan, Mu, Honglin, Wang, Hao, Zhang, Zhenxuan, Geng, Yilin, Lin, Shom, Wang, Renxi, Shelmanov, Artem, Qi, Xiangyu, Wang, Yuxia, Hong, Donghai, Yuan, Youliang, Chen, Meng, Tu, Haoqin, Koto, Fajri, Kuribayashi, Tatsuki, Zeng, Cong, Bhardwaj, Rishabh, Zhao, Bingchen, Duan, Yawen, Liu, Yi, Alghamdi, Emad A., Yang, Yaodong, Dong, Yinpeng, Poria, Soujanya, Liu, Pengfei, Liu, Zhengzhong, Ren, Xuguang, Hovy, Eduard, Gurevych, Iryna, Nakov, Preslav, Choudhury, Monojit, Baldwin, Timothy
To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of the other. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
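The difference between averaging and a distance-to-optimal score can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the version below is an assumption: both scores are taken to lie in [0, 1], the ideal point is (1, 1), and the overall score is one minus the normalized Euclidean distance to that point. The function names are hypothetical.

```python
import math


def average_score(capability: float, safety: float) -> float:
    """Plain mean of the two scores (the traditional approach)."""
    return (capability + safety) / 2.0


def distance_to_optimal_score(capability: float, safety: float) -> float:
    """One minus the normalized Euclidean distance to the ideal point (1, 1).

    Assumed form for illustration only; inputs are expected in [0, 1].
    """
    dist = math.hypot(1.0 - capability, 1.0 - safety)
    return 1.0 - dist / math.sqrt(2.0)


balanced = (0.7, 0.7)   # moderately capable and moderately safe
lopsided = (1.0, 0.4)   # very capable but much less safe

for name, (cap, safe) in {"balanced": balanced, "lopsided": lopsided}.items():
    print(name, round(average_score(cap, safe), 3),
          round(distance_to_optimal_score(cap, safe), 3))
```

Under these assumptions both models have the same average, but the distance-based score ranks the balanced model higher, because a large shortfall in one dimension is penalized more than two small shortfalls.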
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs). Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive. In this study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual series of SAEs; (2) formulate the process of feature formation and conduct a mechanistic analysis; and (3) analyze and visualize feature drift during training. Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution.
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs
Pareja, Aldo, Nayak, Nikhil Shivakumar, Wang, Hao, Killamsetty, Krishnateja, Sudalairaj, Shivchander, Zhao, Wenlong, Han, Seungwook, Bhandwaldar, Abhishek, Xu, Guangxuan, Xu, Kai, Han, Ligong, Inglis, Luke, Srivastava, Akash
The rise of large language models (LLMs) has created a significant disparity: industrial research labs with their computational resources, expert teams, and advanced infrastructures, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small-sized LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including hyperparameter recommendations from TULU and phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters like warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observed no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample efficient. With these findings holding robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.
Predictive Models in Sequential Recommendations: Bridging Performance Laws with Data Quality Insights
Shen, Tingjia, Wang, Hao, Wu, Chuhan, Chin, Jin Yao, Guo, Wei, Liu, Yong, Guo, Huifeng, Lian, Defu, Tang, Ruiming, Chen, Enhong
Sequential Recommendation (SR) plays a critical role in predicting users' sequential preferences. Despite its growing prominence in various industries, the increasing scale of SR models incurs substantial computational costs and unpredictability, challenging developers to manage resources efficiently. Under this predicament, Scaling Laws have achieved significant success by examining the loss as models scale up. However, there remains a disparity between loss and model performance, which is of greater concern in practical applications. Moreover, as data continues to expand, it incorporates repetitive and inefficient data. In response, we introduce the Performance Law for SR models, which aims to theoretically investigate and model the relationship between model performance and data quality. Specifically, we first fit the HR and NDCG metrics to transformer-based SR models. Subsequently, we propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics. Our method enables accurate predictions across various dataset scales and model sizes, demonstrating a strong correlation in large SR models and offering insights into achieving optimal performance for any given model configuration.
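For readers unfamiliar with Approximate Entropy, the following is a minimal sketch of the standard ApEn definition applied to a numeric series. How the paper maps user interaction sequences to such a series is not stated in the abstract, so the inputs, the embedding dimension `m`, and the tolerance `r` below are illustrative assumptions.

```python
import math
import random


def approximate_entropy(series, m=2, r=0.2):
    """Standard approximate entropy: phi(m) - phi(m + 1).

    Lower values indicate a more regular (repetitive) series.
    """
    n = len(series)

    def phi(dim):
        # All overlapping windows of length `dim`.
        windows = [series[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for w in windows:
            # Count windows within Chebyshev distance r (self-match included,
            # so the count is never zero and the logarithm is defined).
            matches = sum(
                1 for v in windows
                if max(abs(a - b) for a, b in zip(w, v)) <= r
            )
            total += math.log(matches / len(windows))
        return total / len(windows)

    return phi(m) - phi(m + 1)


regular = [1.0, 2.0] * 50                    # strictly periodic series
rng = random.Random(0)
noisy = [rng.random() for _ in range(100)]   # uniform random series

print(approximate_entropy(regular), approximate_entropy(noisy))
```

The periodic series yields a value close to zero, while the random series yields a larger value, which is the sense in which ApEn separates repetitive data from informative data.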
Improving Automatic Fetal Biometry Measurement with Swoosh Activation Function
Zhou, Shijia, Ahn, Euijoon, Wang, Hao, Quinton, Ann, Kennedy, Narelle, Sridar, Pradeeba, Nanan, Ralph, Kim, Jinman
The measurement of fetal thalamus diameter (FTD) and fetal head circumference (FHC) is crucial in identifying abnormal fetal thalamus development, as it may lead to certain neuropsychiatric disorders in later life. However, manual measurements from 2D-US images are laborious, prone to high inter-observer variability, and complicated by the low signal-to-noise ratio of the images. Deep learning-based landmark detection approaches have shown promise in measuring biometrics from US images, but the current state-of-the-art (SOTA) algorithm, BiometryNet, is inadequate for FTD and FHC measurement due to its inability to account for the fuzzy edges of these structures and the complex shape of the FTD structure. To address these inadequacies, we propose a novel Swoosh Activation Function (SAF) designed to enhance the regularization of heatmaps produced by landmark detection algorithms. Our SAF serves as a regularization term to enforce an optimum mean squared error (MSE) level between predicted heatmaps, reducing the dispersiveness of hotspots in predicted heatmaps. Our experimental results demonstrate that SAF significantly improves the measurement performance for FTD and FHC, with higher intraclass correlation coefficient scores in FTD and lower mean difference scores in FHC measurement than those of the current SOTA algorithm BiometryNet. Moreover, our proposed SAF is highly generalizable and architecture-agnostic. The SAF's coefficients can be configured for different tasks, making it highly customizable. Our study demonstrates that the SAF is a novel method that can improve measurement accuracy in fetal biometry landmark detection. This improvement has the potential to contribute to better fetal monitoring and improved neonatal outcomes.
An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism
Zhang, Qing, Lv, Haocheng, Liu, Jie, Chen, Zhiyun, Duan, Jianyong, Wang, Hao, He, Li, Xv, Mingying
With the rise of large language models (LLMs), it has become popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that current methods for multimodal multi-hop question answering still face two main challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably leads to a significant drop in performance because irrelevant information misleads the prediction. 2) The reasoning process lacks interpretable reasoning steps, which makes it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLM-based approach that avoids relying heavily on LLMs because of their potential errors, and we treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework that facilitates common knowledge sharing across the interpretability and prediction tasks while using a mixture of experts to prevent task-specific errors from interfering with each other. Afterward, we design an iterative feedback mechanism that further enhances both tasks by feeding the results of the joint training back to the LLM to regenerate entailment trees, aiming to iteratively refine the potential answer. Notably, our method has ranked first on the official WebQA leaderboard (since April 10, 2024) and achieves competitive results on MultimodalQA.
Training-Free Bayesianization for Low-Rank Adapters of Large Language Models
Shi, Haizhou, Wang, Yibin, Han, Ligong, Zhang, Huan, Wang, Hao
Estimating the uncertainty of responses of Large Language Models (LLMs) remains a critical challenge. While recent Bayesian methods have demonstrated effectiveness in quantifying uncertainty through low-rank weight updates, they typically require complex fine-tuning or post-training procedures. In this paper, we propose Training-Free Bayesianization (TFB), a novel framework that transforms existing off-the-shelf trained LoRA adapters into Bayesian ones without additional training. TFB systematically searches for the maximally acceptable level of variance in the weight posterior, constrained within a family of low-rank isotropic Gaussian distributions. We theoretically demonstrate that under mild conditions, this search process is equivalent to variational inference for the weights. Through comprehensive experiments, we show that TFB achieves superior uncertainty estimation and generalization compared to existing methods while eliminating the need for complex training procedures. Code will be available at https://github.com/Wang-ML-Lab/bayesian-peft.
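The idea of searching for the largest acceptable variance can be sketched as a one-dimensional search. This is not the TFB algorithm from the paper, whose search procedure and acceptance criterion are not described in the abstract. It is a generic bisection under two loudly stated assumptions: the loss grows monotonically with the noise scale `sigma`, and "acceptable" means the loss increase over the noise-free adapter stays within a tolerance. `search_max_sigma`, `toy_loss`, and all numeric values are hypothetical.

```python
def search_max_sigma(evaluate_loss, tolerance, sigma_hi=1.0, iters=40):
    """Bisection for the largest sigma whose loss increase is <= tolerance.

    Assumes evaluate_loss(sigma) is non-decreasing in sigma.
    """
    base = evaluate_loss(0.0)          # loss of the deterministic adapter
    lo, hi = 0.0, sigma_hi
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if evaluate_loss(mid) - base <= tolerance:
            lo = mid                   # still acceptable: try more variance
        else:
            hi = mid                   # too much degradation: back off
    return lo


def toy_loss(sigma):
    """Stand-in for evaluating a model with isotropic Gaussian weight noise.

    A quadratic in sigma, chosen only so the answer can be checked by hand.
    """
    return 0.10 + 0.5 * sigma ** 2


best_sigma = search_max_sigma(toy_loss, tolerance=0.02)
print(best_sigma)
```

With the toy loss, the loss increase is 0.5 * sigma^2, so the tolerance of 0.02 is reached at sigma = 0.2, which is the value the search converges to. In a real setting `evaluate_loss` would be a stochastic evaluation on held-out data rather than a closed-form function.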
Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
Ming, Lingfeng, Zeng, Bo, Lyu, Chenyang, Shi, Tianqi, Zhao, Yu, Yang, Xue, Liu, Yefeng, Wang, Yiyu, Xu, Linlong, Liu, Yangyang, Zhao, Xiaohu, Wang, Hao, Liu, Heng, Zhou, Hao, Yin, Huifeng, Shang, Zifu, Li, Haijun, Wang, Longyue, Luo, Weihua, Zhang, Kaifu
Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.
Geographical Information Alignment Boosts Traffic Analysis via Transpose Cross-attention
Jiang, Xiangyu, Chen, Xiwen, Wang, Hao, Razi, Abolfazl
Traffic accident prediction is crucial for enhancing road safety and mitigating congestion, and recent Graph Neural Networks (GNNs) have shown promise in modeling the inherent graph-based traffic data. However, existing GNN-based approaches often overlook or do not explicitly exploit geographic position information, which often plays a critical role in understanding spatial dependencies. This is also aligned with our observation that accident locations are often highly relevant. To address this issue, we propose a plug-and-play module for common GNN frameworks, termed Geographic Information Alignment (GIA). This module can efficiently fuse the node feature and geographic position information through a novel Transpose Cross-attention mechanism. Due to the large number of nodes in traffic data, the conventional cross-attention mechanism, which performs node-wise alignment, may be infeasible under limited computational resources. Instead, we take the transpose operation for Query, Key, and Value in the Cross-attention mechanism, which substantially reduces the computation cost while maintaining sufficient information. Experimental results for both traffic occurrence prediction and severity prediction (severity levels based on the interval of recorded crash counts) on large-scale city-wise datasets confirm the effectiveness of our proposed method. For example, our method can obtain gains ranging from 1.3% to 10.9% in F1 score and 0.3% to 4.8% in AUC.
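The cost argument can be made concrete with a small NumPy sketch. The abstract does not specify the exact layer, so the following is an assumption: queries come from node features, keys and values come from geographic position embeddings, and attention is computed over the feature dimension, giving a d x d attention map instead of an N x N one. The projection matrices, the scaling by sqrt(N), and all names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transpose_cross_attention(node_feats, geo_feats, w_q, w_k, w_v):
    """Cross-attention over feature channels rather than over nodes.

    node_feats, geo_feats: (N, d_in) arrays; w_q, w_k, w_v: (d_in, d) arrays.
    Returns the fused features (N, d) and the attention map (d, d).
    """
    q = node_feats @ w_q                    # (N, d)
    k = geo_feats @ w_k                     # (N, d)
    v = geo_feats @ w_v                     # (N, d)
    n = q.shape[0]
    # Transposed scores: (d, N) @ (N, d) -> (d, d), independent of N in size.
    attn = softmax(q.T @ k / np.sqrt(n), axis=-1)
    fused = v @ attn.T                      # (N, d)
    return fused, attn


rng = np.random.default_rng(0)
num_nodes, d_in, d = 1000, 16, 8
node_feats = rng.normal(size=(num_nodes, d_in))
geo_feats = rng.normal(size=(num_nodes, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d)) for _ in range(3))

fused, attn = transpose_cross_attention(node_feats, geo_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch the attention map has d * d entries (64 here) rather than N * N (one million here), which is the source of the reduced cost when the number of nodes is much larger than the feature dimension.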
Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations
Feng, Yu, Zhang, Shunsi, Shu, Jian, Zhao, Hanfeng, Pang, Guoliang, Zhang, Chi, Wang, Hao
Generating multi-view human images from a single view is a complex and significant challenge. Although recent advancements in multi-view object generation have shown impressive results with diffusion models, novel view synthesis for humans remains constrained by the limited availability of 3D human datasets. Consequently, many existing models struggle to produce realistic human body shapes or capture fine-grained facial details accurately. To address these issues, we propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis. Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation, aiming to extend the 2D knowledge of the single-view model to a multi-view diffusion model. Additionally, to enhance the model's detail restoration capability, we integrate transferred multimodal facial features into our trained human diffusion model. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.