Ren, Yiming
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Zhang, Zefeng, Tang, Hengzhu, Sheng, Jiawei, Zhang, Zhenyu, Ren, Yiming, Li, Zhenyang, Yin, Dawei, Ma, Duohe, Liu, Tingwen
Multimodal Large Language Models excel in various tasks, yet often struggle with modality bias, where the model tends to rely heavily on a single modality and overlook critical information in other modalities, which leads to incorrect focus and generating irrelevant responses. In this paper, we propose using the paradigm of preference optimization to solve the modality bias problem, including RLAIFVBias, a debiased preference optimization dataset, and a Noise Aware Preference Optimization algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, compelling the model to rely on a specific modality when generating negative responses. To address the inevitable noise in automatically constructed data, we combine the noise robust Mean Absolute Error with the Binary Cross Entropy in Direct Preference Optimization by a negative Box Cox transformation, and dynamically adjust the algorithm noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
SymbioSim: Human-in-the-loop Simulation Platform for Bidirectional Continuing Learning in Human-Robot Interaction
Chen, Haoran, Xu, Yiteng, Ren, Yiming, Ye, Yaoqin, Li, Xinran, Ding, Ning, Cong, Peishan, Wang, Ziyi, Liu, Bushi, Chen, Yuhan, Dou, Zhiyang, Leng, Xiaokun, Li, Manyi, Ma, Yuexin, Tu, Changhe
The development of intelligent robots seeks to seamlessly integrate them into the human world, providing assistance and companionship in daily life and work, with the ultimate goal of achieving human-robot symbiosis. To realize this vision, robots must continuously learn and evolve through consistent interaction and collaboration with humans, while humans need to gradually develop an understanding of and trust in robots through shared experiences. However, training and testing algorithms directly on physical robots involve substantial costs and safety risks. Moreover, current robotic simulators fail to support real human participation, limiting their ability to provide authentic interaction experiences and gather valuable human feedback. In this paper, we introduce SymbioSim, a novel human-in-the-loop robotic simulation platform designed to enable the safe and efficient development, evaluation, and optimization of human-robot interactions. By leveraging a carefully designed system architecture and modules, SymbioSim delivers a natural and realistic interaction experience, facilitating bidirectional continuous learning and adaptation for both humans and robots. Extensive experiments and user studies demonstrate the platform's promising performance and highlight its potential to significantly advance research on human-robot symbiosis.
NPC: Neural Predictive Control for Fuel-Efficient Autonomous Trucks
Ren, Jiaping, Xiang, Jiahao, Gao, Hongfei, Zhang, Jinchuan, Ren, Yiming, Ma, Yuexin, Wu, Yi, Yang, Ruigang, Li, Wei
Fuel efficiency is a crucial aspect of long-distance cargo transportation by oil-powered trucks that economize on costs and decrease carbon emissions. Current predictive control methods depend on an accurate model of vehicle dynamics and engine, including weight, drag coefficient, and the Brake-specific Fuel Consumption (BSFC) map of the engine. We propose a pure data-driven method, Neural Predictive Control (NPC), which does not use any physical model for the vehicle. After training with over 20,000 km of historical data, the novel proposed NVFormer implicitly models the relationship between vehicle dynamics, road slope, fuel consumption, and control commands using the attention mechanism. Based on the online sampled primitives from the past of the current freight trip and anchor-based future data synthesis, the NVFormer can infer optimal control command for reasonable fuel consumption. The physical model-free NPC outperforms the base PCC method with 2.41% and 3.45% more significant fuel saving in simulation and open-road highway testing, respectively.
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
Zhong, Tianyang, Liu, Zhengliang, Pan, Yi, Zhang, Yutong, Zhou, Yifan, Liang, Shizhe, Wu, Zihao, Lyu, Yanjun, Shu, Peng, Yu, Xiaowei, Cao, Chao, Jiang, Hanqi, Chen, Hanxu, Li, Yiwei, Chen, Junhao, Hu, Huawen, Liu, Yihen, Zhao, Huaqin, Xu, Shaochen, Dai, Haixing, Zhao, Lin, Zhang, Ruidong, Zhao, Wei, Yang, Zhenyuan, Chen, Jingyuan, Wang, Peilong, Ruan, Wei, Wang, Hui, Zhao, Huan, Zhang, Jing, Ren, Yiming, Qin, Shihuan, Chen, Tong, Li, Jiaxi, Zidan, Arif Hassan, Jahin, Afrar, Chen, Minheng, Xia, Sichen, Holmes, Jason, Zhuang, Yan, Wang, Jiaqi, Xu, Bochen, Xia, Weiran, Yu, Jichao, Tang, Kaibo, Yang, Yaxuan, Sun, Bolun, Yang, Tao, Lu, Guoyu, Wang, Xianqiao, Chai, Lilong, Li, He, Lu, Jin, Sun, Lichao, Zhang, Xin, Ge, Bao, Hu, Xintao, Zhang, Lian, Zhou, Hua, Zhang, Lu, Zhang, Shu, Liu, Ninghao, Jiang, Bei, Kong, Linglong, Xiang, Zhen, Ren, Yudan, Liu, Jun, Jiang, Xi, Bao, Yu, Zhang, Wei, Li, Xiang, Li, Gang, Liu, Wei, Shen, Dinggang, Sikora, Andrea, Zhai, Xiaoming, Zhu, Dajiang, Liu, Tianming
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
Needle In A Multimodal Haystack
Wang, Weiyun, Zhang, Shuibo, Ren, Yiming, Duan, Yuchen, Li, Tiantong, Liu, Shuo, Hu, Mengkang, Chen, Zhe, Zhang, Kaipeng, Lu, Lewei, Zhu, Xizhou, Luo, Ping, Qiao, Yu, Dai, Jifeng, Shao, Wenqi, Wang, Wenhai
With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model is required to answer the questions according to different key information scattered throughout the given multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs.
GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
Liu, Pengfei, Ren, Yiming, Tao, Jun, Ren, Zhixiang
Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information with complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.
3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information
Kuang, Taojie, Ren, Yiming, Ren, Zhixiang
Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. While deep learning-based methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled data, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate our outstanding performance.
Predicting Football Match Outcomes with eXplainable Machine Learning and the Kelly Index
Ren, Yiming, Susnjak, Teo
In this work, a machine learning approach is developed for predicting the outcomes of football matches. The novelty of this research lies in the utilisation of the Kelly Index to first classify matches into categories where each one denotes the different levels of predictive difficulty. Classification models using a wide suite of algorithms were developed for each category of matches in order to determine the efficacy of the approach. In conjunction to this, a set of previously unexplored features were engineering including Elo-based variables. The dataset originated from the Premier League match data covering the 2019-2021 seasons. The findings indicate that the process of decomposing the predictive problem into sub-tasks was effective and produced competitive results with prior works, while the ensemble-based methods were the most effective. The paper also devised an investment strategy in order to evaluate its effectiveness by benchmarking against bookmaker odds. An approach was developed that minimises risk by combining the Kelly Index with the predefined confidence thresholds of the predictive models. The experiments found that the proposed strategy can return a profit when following a conservative approach that focuses primarily on easy-to-predict matches where the predictive models display a high confidence level.