Goto

Collaborating Authors

 Ma, Liang


Unbiased Evaluation of Large Language Models from a Causal Perspective

arXiv.org Artificial Intelligence

Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two type of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs.Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.


EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) have achieved remarkable progress on various visual question answering and reasoning tasks leveraging instruction fine-tuning specific datasets. They can also learn from preference data annotated by human to enhance their reasoning ability and mitigate hallucinations. Most of preference data is generated from the model itself. However, existing methods require high-quality critical labels, which are costly and rely on human or proprietary models like GPT-4V. In this work, we propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which aligns MLLMs by self-generated preference data using only 5k images economically. Our approach begins with collecting and refining a Scoring Evaluation Instruction-tuning dataset to train a critical evaluation model, termed the Critic. This Critic observes model responses across multiple dimensions, selecting preferred and non-preferred outputs for refined Direct Preference Optimization (DPO) tuning. To further enhance model performance, we employ an additional supervised fine-tuning stage after preference tuning. EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also shows the potential critical ability in open-source MLLMs, demonstrating that EACO is a viable path to boost the competence of MLLMs.


RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

arXiv.org Artificial Intelligence

Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.


An Outline of Prognostics and Health Management Large Model: Concepts, Paradigms, and Challenges

arXiv.org Artificial Intelligence

Prognosis and Health Management (PHM), critical for ensuring task completion by complex systems and preventing unexpected failures, is widely adopted in aerospace, manufacturing, maritime, rail, energy, etc. However, PHM's development is constrained by bottlenecks like generalization, interpretation and verification abilities. Presently, generative artificial intelligence (AI), represented by Large Model, heralds a technological revolution with the potential to fundamentally reshape traditional technological fields and human production methods. Its capabilities, including strong generalization, reasoning, and generative attributes, present opportunities to address PHM's bottlenecks. To this end, based on a systematic analysis of the current challenges and bottlenecks in PHM, as well as the research status and advantages of Large Model, we propose a novel concept and three progressive paradigms of Prognosis and Health Management Large Model (PHM-LM) through the integration of the Large Model with PHM. Subsequently, we provide feasible technical approaches for PHM-LM to bolster PHM's core capabilities within the framework of the three paradigms. Moreover, to address core issues confronting PHM, we discuss a series of technical challenges of PHM-LM throughout the entire process of construction and application. This comprehensive effort offers a holistic PHM-LM technical framework, and provides avenues for new PHM technologies, methodologies, tools, platforms and applications, which also potentially innovates design, research & development, verification and application mode of PHM. And furthermore, a new generation of PHM with AI will also capably be realized, i.e., from custom to generalized, from discriminative to generative, and from theoretical conditions to practical applications.


GarchingSim: An Autonomous Driving Simulator with Photorealistic Scenes and Minimalist Workflow

arXiv.org Artificial Intelligence

Conducting real road testing for autonomous driving algorithms can be expensive and sometimes impractical, particularly for small startups and research institutes. Thus, simulation becomes an important method for evaluating these algorithms. However, the availability of free and open-source simulators is limited, and the installation and configuration process can be daunting for beginners and interdisciplinary researchers. We introduce an autonomous driving simulator with photorealistic scenes, meanwhile keeping a user-friendly workflow. The simulator is able to communicate with external algorithms through ROS2 or Socket.IO, making it compatible with existing software stacks. Furthermore, we implement a highly accurate vehicle dynamics model within the simulator to enhance the realism of the vehicle's physical effects. The simulator is able to serve various functions, including generating synthetic data and driving with machine learning-based algorithms. Moreover, we prioritize simplicity in the deployment process, ensuring that beginners find it approachable and user-friendly.


MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation

arXiv.org Artificial Intelligence

Given a natural language, a general robot has to comprehend the instruction and find the target object or location based on visual observations even in unexplored environments. Most agents rely on massive diverse training data to achieve better generalization, which requires expensive labor. These agents often focus on common objects and fewer tasks, thus are not intelligent enough to handle different types of instructions. To facilitate research in open-set vision-and-language navigation, we propose a benchmark named MO-VLN, aiming at testing the effectiveness and generalization of the agent in the multi-task setting. First, we develop a 3D simulator rendered by realistic scenarios using Unreal Engine 5, containing more realistic lights and details. The simulator contains three scenes, i.e., cafe, restaurant, and nursing house, of high value in the industry. Besides, our simulator involves multiple uncommon objects, such as takeaway cup and medical adhesive tape, which are more complicated compared with existing environments. Inspired by the recent success of large language models (e.g., ChatGPT, Vicuna), we construct diverse high-quality data of instruction type without human annotation. Our benchmark MO-VLN provides four tasks: 1) goal-conditioned navigation given a specific object category (e.g., "fork"); 2) goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball"); 3) step-by-step instruction following; 4) finding abstract object based on high-level instruction (e.g., "I am thirsty").


BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics

arXiv.org Artificial Intelligence

The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) unlike datasets containing generated summaries with multiple errors, BUMP enables the measurement of metrics' performance on individual error types.


Deep Transfer Learning with Graph Neural Network for Sensor-Based Human Activity Recognition

arXiv.org Artificial Intelligence

Abstract--The sensor-based human activity recognition (HAR) in mobile application scenarios is often confronted with sensor modalities variation and annotated data deficiency. Given this observation, we devised a graph-inspired deep learning approach toward the sensor-based HAR tasks, which was further used to build a deep transfer learning model toward giving a tentative solution for these two challenging problems. Specifically, we present a multi-layer residual structure involved graph convolutional neural network (ResGCNN) toward the sensor-based HAR tasks, namely the HAR-ResGCNN approach. Experimental results on the PAMAP2 and mHealth data sets demonstrate that our ResGCNN is effective at capturing the characteristics of actions with comparable results compared to other sensor-based HAR models (with an average accuracy of 98.18% and 99.07%, respectively). The graph-based framework shows good meta-learning ability and is supposed to be a promising solution in sensor-based HAR tasks. However, this method machines the ability to track the human activity state. HAR is restricted with the impact brought with the complex systems tracks the activity states through processing and scenarios, the uncertainty of the action, and needs to consider learning information from some carriers that can record the privacy problems caused by the camera, and is human actions (such as cameras [1], sensors [2], radars only suitable for some specific scenes.


Query Answering via Decentralized Search

arXiv.org Artificial Intelligence

Expert networks are formed by a group of expert-professionals with different specialties to collaboratively resolve specific queries posted to the network. In such networks, when a query reaches an expert who does not have sufficient expertise, this query needs to be routed to other experts for further processing until it is completely solved; therefore, query answering efficiency is sensitive to the underlying query routing mechanism being used. Among all possible query routing mechanisms, decentralized search, operating purely on each expert's local information without any knowledge of network global structure, represents the most basic and scalable routing mechanism, which is applicable to any network scenarios even in dynamic networks. However, there is still a lack of fundamental understanding of the efficiency of decentralized search in expert networks. In this regard, we investigate decentralized search by quantifying its performance under a variety of network settings. Our key findings reveal the existence of network conditions, under which decentralized search can achieve significantly short query routing paths (i.e., between $O(\log n)$ and $O(\log^2 n)$ hops, $n$: total number of experts in the network). Based on such theoretical foundation, we further study how the unique properties of decentralized search in expert networks is related to the anecdotal small-world phenomenon. In addition, we demonstrate that decentralized search is robust against estimation errors introduced by misinterpreting the required expertise levels. To the best of our knowledge, this is the first work studying fundamental behaviors of decentralized search in expert networks. The developed performance bounds, confirmed by real datasets, are able to assist in predicting network performance and designing complex expert networks.


Influence Maximization Under Generic Threshold-based Non-submodular Model

arXiv.org Artificial Intelligence

As a widely observable social effect, influence diffusion refers to a process where innovations, trends, awareness, etc. spread across the network via the social impact among individuals. Motivated by such social effect, the concept of influence maximization is coined, where the goal is to select a bounded number of the most influential nodes (seed nodes) from a social network so that they can jointly trigger the maximal influence diffusion. A rich body of research in this area is performed under statistical diffusion models with provable submodularity, which essentially simplifies the problem as the optimal result can be approximated by the simple greedy search. When the diffusion models are non-submodular, however, the research community mostly focuses on how to bound/approximate them by tractable submodular functions so as to estimate the optimal result. In other words, there is still a lack of efficient methods that can directly resolve non-submodular influence maximization problems. In this regard, we fill the gap by proposing seed selection strategies using network graphical properties in a generalized threshold-based model, called influence barricade model, which is non-submodular. Specifically, under this model, we first establish theories to reveal graphical conditions that ensure the network generated by node removals has the same optimal seed set as that in the original network. We then exploit these theoretical conditions to develop efficient algorithms by strategically removing less-important nodes and selecting seeds only in the remaining network. To the best of our knowledge, this is the first graph-based approach that directly tackles non-submodular influence maximization.