Goto

Collaborating Authors

 Wei, Lai


Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration

arXiv.org Artificial Intelligence

In dynamic environments, robots often encounter constrained movement trajectories when manipulating objects with specific properties, such as doors. Therefore, applying the appropriate force is crucial to prevent damage to both the robots and the objects. However, current vision-guided robot state generation methods often falter in this regard, as they lack the integration of tactile perception. To tackle this issue, this paper introduces a novel state diffusion framework termed SafeDiff. It generates a prospective state sequence from the current robot state and visual context observation while incorporating real-time tactile feedback to refine the sequence. As far as we know, this is the first study specifically focused on ensuring force safety in robotic manipulation. It significantly enhances the rationality of state planning, and the safe action trajectory is derived from inverse dynamics based on this refined planning. In practice, unlike previous approaches that concatenate visual and tactile data to generate future robot state sequences, our method employs tactile data as a calibration signal to adjust the robot's state within the state space implicitly. Additionally, we've developed a large-scale simulation dataset called SafeDoorManip50k, offering extensive multimodal data to train and evaluate the proposed method. Extensive experiments show that our visual-tactile model substantially mitigates the risk of harmful forces in the door opening, across both simulated and real-world settings.


Contextual Bandits with Arm Request Costs and Delays

arXiv.org Machine Learning

We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with stochastic time delays and associated costs. In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time. The problem is framed as a special case of semi-Markov decision processes (SMDPs). The arm contexts, request times, and costs are assumed to follow an unknown distribution. We consider the regret of an online learning algorithm with respect to the optimal policy that achieves the maximum average reward. By leveraging the Bellman optimality equation, we design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing their regret. Under the realizability assumption, we analyze the proposed algorithms and demonstrate that their regret upper bounds align with established results in the contextual bandit literature. We validate the algorithms through experiments on simulated data and a movie recommendation dataset, showing that their performance is consistent with theoretical analyses.


An adapted large language model facilitates multiple medical tasks in diabetes care

arXiv.org Artificial Intelligence

Diabetes is a chronic disease that poses a significant global health burden, and optimizing diabetes management requires multi-stakeholder collaboration. Large language models (LLMs) have shown promise in various healthcare scenarios, but their effectiveness across a diverse range of diabetes tasks remains unproven. In this study, we introduced a framework to train and validate diabetes-specific LLMs. We first developed a comprehensive data processing pipeline that includes data collection, filtering, augmentation and refinement. This approach contributes to creating a high-quality, diabetes-specific dataset, and several evaluation benchmarks entirely from scratch. Utilizing the collected training dataset, we fine-tuned a diabetes-specific LLM family that demonstrated state-of-the-art proficiency in understanding and processing various diabetes tasks compared to other LLMs. Furthermore, clinical studies showed the potential applications of our models in diabetes care, including providing personalized healthcare, assisting medical education, and streamlining clinical tasks. In conclusion, our study introduced a framework to develop and evaluate a diabetes-specific LLM family, and highlighted its potential to enhance clinical practice and provide personalized, data-driven support for diabetes support when facing different end users.


Adaptive Online Experimental Design for Causal Discovery

arXiv.org Artificial Intelligence

Causal discovery aims to uncover cause-and-effect relationships encoded in causal graphs by leveraging observational, interventional data, or their combination. The majority of existing causal discovery methods are developed assuming infinite interventional data. We focus on data interventional efficiency and formalize causal discovery from the perspective of online learning, inspired by pure exploration in bandit problems. A graph separating system, consisting of interventions that cut every edge of the graph at least once, is sufficient for learning causal graphs when infinite interventional data is available, even in the worst case. We propose a track-and-stop causal discovery algorithm that adaptively selects interventions from the graph separating system via allocation matching and learns the causal graph based on sampling history. Given any desired confidence value, the algorithm determines a termination condition and runs until it is met. We analyze the algorithm to establish a problem-dependent upper bound on the expected number of required interventional samples. Our proposed algorithm outperforms existing methods in simulations across various randomly generated causal graphs. It achieves higher accuracy, measured by the structural hamming distance (SHD) between the learned causal graph and the ground truth, with significantly fewer samples.


Can I understand what I create? Self-Knowledge Evaluation of Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved remarkable progress in linguistic tasks, necessitating robust evaluation frameworks to understand their capabilities and limitations. Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework that is easy to implement, evaluating models on their ability to comprehend and respond to self-generated questions. Our findings, based on testing multiple models across diverse tasks, reveal significant gaps in the model's self-knowledge ability. Further analysis indicates these gaps may be due to misalignment with human attention mechanisms. Additionally, fine-tuning on self-generated math task may enhance the model's math performance, highlighting the potential of the framework for efficient and insightful model evaluation and may also contribute to the improvement of LLMs.


AlphaCrystal-II: Distance matrix based crystal structure prediction using deep learning

arXiv.org Artificial Intelligence

Computational prediction of stable crystal structures has a profound impact on the large-scale discovery of novel functional materials. However, predicting the crystal structure solely from a material's composition or formula is a promising yet challenging task, as traditional ab initio crystal structure prediction (CSP) methods rely on time-consuming global searches and first-principles free energy calculations. Inspired by the recent success of deep learning approaches in protein structure prediction, which utilize pairwise amino acid interactions to describe 3D structures, we present AlphaCrystal-II, a novel knowledge-based solution that exploits the abundant inter-atomic interaction patterns found in existing known crystal structures. AlphaCrystal-II predicts the atomic distance matrix of a target crystal material and employs this matrix to reconstruct its 3D crystal structure. By leveraging the wealth of inter-atomic relationships of known crystal structures, our approach demonstrates remarkable effectiveness and reliability in structure prediction through comprehensive experiments. This work highlights the potential of data-driven methods in accelerating the discovery and design of new materials with tailored properties.


Large Language Model Evaluation via Matrix Entropy

arXiv.org Artificial Intelligence

Large language models (LLMs) have revolutionized the field of natural language processing, extending their strong capabilities into multi-modal domains. Thus, it is vital to define proper and diversified metrics for the evaluation of LLMs. In this paper, we introduce matrix entropy, a novel metric rooted in information theory and geometry principles to quantify the data compression proficiency in LLMs. It reflects the model's ability to extract relevant information and eliminate unnecessary elements, thereby providing insight into the language model's intrinsic capability. Specifically, we demonstrate its applicability in both single-modal (language) and multi-modal settings. For language models, our findings reveal that the matrix entropy of representations follows a scaling law type reduction when the model scales up, serving as a complement to the traditional loss scaling law. For the multi-modal setting, we also propose an evaluation method based on matrix entropy for assessing alignment quality and we find that modern large multi-modal models exhibit great alignment performance.


Multi-view Subspace Clustering via An Adaptive Consensus Graph Filter

arXiv.org Artificial Intelligence

Multiview subspace clustering (MVSC) has attracted an increasing amount of attention in recent years. Most existing MVSC methods first collect complementary information from different views and consequently derive a consensus reconstruction coefficient matrix to indicate the subspace structure of a multi-view data set. In this paper, we initially assume the existence of a consensus reconstruction coefficient matrix and then use it to build a consensus graph filter. In each view, the filter is employed for smoothing the data and designing a regularizer for the reconstruction coefficient matrix. Finally, the obtained reconstruction coefficient matrices from different views are used to create constraints for the consensus reconstruction coefficient matrix. Therefore, in the proposed method, the consensus reconstruction coefficient matrix, the consensus graph filter, and the reconstruction coefficient matrices from different views are interdependent. We provide an optimization algorithm to obtain their optimal values. Extensive experiments on diverse multi-view data sets demonstrate that our approach outperforms some state-of-the-art methods.


InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

arXiv.org Artificial Intelligence

Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6\% of the instruction-following data used in the alignment dataset for MiniGPT-4. To achieve this, we first propose several metrics to access the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data. By employing this method, InstructionGPT-4 outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that less but high-quality instruction tuning data is efficient in enabling multimodal large language models to generate better output. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.


Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex

arXiv.org Artificial Intelligence

While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) scan response to natural images. VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%. We further probe the trained networks to reveal representational biases in different visual areas, generate experimentally testable hypotheses, and formulate an interpretable metric to associate these hypotheses with cortical functions. With both a model and evaluation metric, the cost and time burdens associated with designing and implementing functional analysis on the visual cortex could be reduced. Our work suggests that the evolution of computational models may shed light on our fundamental understanding of the visual cortex and provide a viable approach toward reliable brain-machine interfaces.