Li, Lin
Accelerating and enhancing thermodynamic simulations of electrochemical interfaces
Du, Xiaochen, Liu, Mengren, Peng, Jiayu, Chun, Hoje, Hoffman, Alexander, Yildiz, Bilge, Li, Lin, Bazant, Martin Z., Gómez-Bombarelli, Rafael
Electrochemical interfaces are crucial in catalysis, energy storage, and corrosion, where their stability and reactivity depend on complex interactions between the electrode, adsorbates, and electrolyte. Predicting stable surface structures remains challenging, as traditional surface Pourbaix diagrams tend to either rely on expert knowledge or costly $\textit{ab initio}$ sampling, and neglect thermodynamic equilibration with the environment. Machine learning (ML) potentials can accelerate static modeling but often overlook dynamic surface transformations. Here, we extend the Virtual Surface Site Relaxation-Monte Carlo (VSSR-MC) method to autonomously sample surface reconstructions modeled under aqueous electrochemical conditions. Through fine-tuning foundational ML force fields, we accurately and efficiently predict surface energetics, recovering known Pt(111) phases and revealing new LaMnO$_\mathrm{3}$(001) surface reconstructions. By explicitly accounting for bulk-electrolyte equilibria, our framework enhances electrochemical stability predictions, offering a scalable approach to understanding and designing materials for electrochemical applications.
A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images
Liang, Xiaoyi, Bian, Mouxiao, Chen, Moxin, Liu, Lihao, He, Junjun, Xu, Jie, Li, Lin
In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from limitations such as small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder the accurate assessment of MLLMs' ability to interpret OCT scans and their broader applicability in ophthalmology. Our dataset, curated through rigorous quality control and expert annotation, consists of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.
Reducing Large Language Model Safety Risks in Women's Health using Semantic Entropy
Penny-Dimri, Jahan C., Bachmann, Magdalena, Cooke, William R., Mathewlynn, Sam, Dockree, Samuel, Tolladay, John, Kossen, Jannik, Li, Lin, Gal, Yarin, Jones, Gabriel Davis
Large language models (LLMs) hold substantial promise for clinical decision support. However, their widespread adoption in medicine, particularly in healthcare, is hindered by their propensity to generate false or misleading outputs, known as hallucinations. In high-stakes domains such as women's health (obstetrics & gynaecology), where errors in clinical reasoning can have profound consequences for maternal and neonatal outcomes, ensuring the reliability of AI-generated responses is critical. Traditional methods for quantifying uncertainty, such as perplexity, fail to capture meaning-level inconsistencies that lead to misinformation. Here, we evaluate semantic entropy (SE), a novel uncertainty metric that assesses meaning-level variation, to detect hallucinations in AI-generated medical content. Using a clinically validated dataset derived from UK RCOG MRCOG examinations, we compared SE with perplexity in identifying uncertain responses. SE demonstrated superior performance, achieving an AUROC of 0.76 (95% CI: 0.75-0.78), compared to 0.62 (0.60-0.65) for perplexity. Clinical expert validation further confirmed its effectiveness, with SE achieving near-perfect uncertainty discrimination (AUROC: 0.97). While semantic clustering was successful in only 30% of cases, SE remains a valuable tool for improving AI safety in women's health. These findings suggest that SE could enable more reliable AI integration into clinical practice, particularly in resource-limited settings where LLMs could augment care. This study highlights the potential of SE as a key safeguard in the responsible deployment of AI-driven tools in women's health, leading to safer and more effective digital health interventions.
Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science
Wu, Xiaohua, Tao, Xiaohui, Wu, Wenjie, Li, Yuefeng, Li, Lin
Social surveys in computational social science are well-designed by elaborate domain theories that can effectively reflect the interviewee's deep thoughts without concealing their true feelings. The candidate questionnaire options highly depend on the interviewee's previous answer, which results in the complexity of social survey analysis, the time, and the expertise required. The ability of large language models (LLMs) to perform complex reasoning is well-enhanced by prompting learning such as Chain-of-thought (CoT) but still confined to left-to-right decision-making processes or limited paths during inference. This means they can fall short in problems that require exploration and uncertainty searching. In response, a novel large language model prompting method, called Random Forest of Thoughts (RFoT), is proposed for generating uncertainty reasoning to fit the area of computational social science. The RFoT allows LLMs to perform deliberate decision-making by generating diverse thought space and randomly selecting the sub-thoughts to build the forest of thoughts. It can extend the exploration and prediction of overall performance, benefiting from the extensive research space of response. The method is applied to optimize computational social science analysis on two datasets covering a spectrum of social survey analysis problems. Our experiments show that RFoT significantly enhances language models' abilities on two novel social survey analysis problems requiring non-trivial reasoning.
Exploring Incremental Unlearning: Techniques, Challenges, and Future Directions
Qureshi, Sadia, Shaik, Thanveer, Tao, Xiaohui, Xie, Haoran, Li, Lin, Yong, Jianming, Jia, Xiaohua
The growing demand for data privacy in Machine Learning (ML) applications has seen Machine Unlearning (MU) emerge as a critical area of research. As the `right to be forgotten' becomes regulated globally, it is increasingly important to develop mechanisms that delete user data from AI systems while maintaining performance and scalability of these systems. Incremental Unlearning (IU) is a promising MU solution to address the challenges of efficiently removing specific data from ML models without the need for expensive and time-consuming full retraining. This paper presents the various techniques and approaches to IU. It explores the challenges faced in designing and implementing IU mechanisms. Datasets and metrics for evaluating the performance of unlearning techniques are discussed as well. Finally, potential solutions to the IU challenges alongside future research directions are offered. This survey provides valuable insights for researchers and practitioners seeking to understand the current landscape of IU and its potential for enhancing privacy-preserving intelligent systems.
Representational Alignment with Chemical Induced Fit for Molecular Relational Learning
Zhang, Peiliang, Yuan, Jingling, Xie, Qing, Zhu, Yongjun, Li, Lin
Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance in chemical space (\textit{e.g.}, functional group, scaffold) shifted data. With theoretical justification, we propose the \textbf{Re}presentational \textbf{Align}ment with Chemical Induced \textbf{Fit} (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings. Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances model's stability in both rule-shifted and scaffold-shifted data distributions.
YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks
Bandyopadhyay, Saptarashmi, Bahirwani, Vikas, Aggarwal, Lavisha, Guda, Bhanu, Li, Lin, Colaco, Andrea
Multimodal AI Agents are AI models that have the capability of interactively and cooperatively assisting human users to solve day-to-day tasks. Augmented Reality (AR) head worn devices can uniquely improve the user experience of solving procedural day-to-day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI Agents. Such AR capabilities can help AI Agents see and listen to actions that users take which can relate to multimodal capabilities of human users. Existing AI Agents, either Large Language Models (LLMs) or Multimodal Vision-Language Models (VLMs) are reactive in nature, which means that models cannot take an action without reading or listening to the human user's prompts. Proactivity of AI Agents on the other hand can help the human user detect and correct any mistakes in agent observed tasks, encourage users when they do tasks correctly or simply engage in conversation with the user - akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances that may require the agent to intervene proactively. This allows the agent to understand when it can intervene in a conversation with human users that can help the user correct mistakes on tasks, like cooking, using AR. Our YETI Agent learns scene understanding signals based on interpretable notions of Structural Similarity (SSIM) on consecutive video frames. We also define the alignment signal which the AI Agent can learn to identify if the video frames corresponding to the user's actions on the task are consistent with expected actions. These signals are used by our AI Agent to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark for an expert agent guiding a user to complete procedural tasks.
BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention
Liu, Weihua, Qiu, Jianhua, Boumaraf, Said, lin, Chaochao, liyuan, Pan, Li, Lin, Bennamoun, Mohammed, Werghi, Naoufel
In response to the growing threat of deepfake technology, we introduce BENet, a Cross-Domain Robust Bias Expansion Network. BENet enhances the detection of fake faces by addressing limitations in current detectors related to variations across different types of fake face generation techniques, where ``cross-domain" refers to the diverse range of these deepfakes, each considered a separate domain. BENet's core feature is a bias expansion module based on autoencoders. This module maintains genuine facial features while enhancing differences in fake reconstructions, creating a reliable bias for detecting fake faces across various deepfake domains. We also introduce a Latent-Space Attention (LSA) module to capture inconsistencies related to fake faces at different scales, ensuring robust defense against advanced deepfake techniques. The enriched LSA feature maps are multiplied with the expanded bias to create a versatile feature space optimized for subtle forgeries detection. To improve its ability to detect fake faces from unknown sources, BENet integrates a cross-domain detector module that enhances recognition accuracy by verifying the facial domain during inference. We train our network end-to-end with a novel bias expansion loss, adopted for the first time, in face forgery detection. Extensive experiments covering both intra and cross-dataset demonstrate BENet's superiority over current state-of-the-art solutions.
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
Zhao, Yaqi, Yin, Yuanyang, Li, Lin, Lin, Mingan, Huang, Victor Shea-Jay, Chen, Siwei, Chen, Weipeng, Yin, Baoqun, Zhou, Zenan, Zhang, Wentao
Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with LLM's cognitive framework, leading to a mismatch where visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data-images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits LVLM's capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM's embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.
MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
Ding, Ning, Tang, Yehui, Qin, Haochen, Zhou, Zhenli, Xu, Chao, Li, Lin, Han, Kai, Liao, Heng, Wang, Yunhe
In order to reduce the computational complexity of large language models, great efforts have been made to to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors combined together will form the output embedding, which provides an estimation of the result of matrix multiplication operation in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computations. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.