Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Yan, Yibo, Wang, Shen, Huo, Jiahao, Ye, Jingheng, Chu, Zhendong, Hu, Xuming, Yu, Philip S., Gomes, Carla, Selman, Bart, Wen, Qingsong
Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short in multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving the full potential of MLLMs. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).
Graph Structure Learning for Tumor Microenvironment with Cell Type Annotation from non-spatial scRNA-seq data
Huang, Yu-An, Li, Yue-Chao, You, Hai-Ru, Pan, Jie, Cao, Xiyue, Li, Xinyuan, Huang, Zhi-An, You, Zhu-Hong
The exploration of cellular heterogeneity within the tumor microenvironment (TME) via single-cell RNA sequencing (scRNA-seq) is essential for understanding cancer progression and response to therapy. Current scRNA-seq approaches, however, lack spatial context and rely on incomplete datasets of ligand-receptor interactions (LRIs), limiting accurate cell type annotation and cell-cell communication (CCC) inference. This study addresses these challenges using a novel graph neural network (GNN) model that enhances cell type prediction and cell interaction analysis. Our study utilized a dataset consisting of 49,020 cells from 19 patients across three cancer types: Leukemia, Breast Invasive Carcinoma, and Colorectal Cancer. The proposed scGSL model demonstrated robust performance, achieving an average accuracy of 84.83%, precision of 86.23%, recall of 81.51%, and an F1 score of 80.92% across all datasets. These metrics represent a significant improvement over existing methods. Additionally, a review of existing literature on gene interactions within the TME shows that the scGSL model robustly identifies biologically meaningful gene interactions in an unsupervised manner, as validated by significant expression differences in key gene pairs across various cancers. The source code and data used in this paper are available at https://github.com/LiYuechao1998/scGSL.
Extracting Problem Structure with LLMs for Optimized SAT Local Search
Schidler, André, Szeider, Stefan
Existing SAT local search tools apply basic strategies that work well for random problems but miss critical patterns in structured instances. SAT encodings of real problems contain patterns inherited from graph layouts, data connections, and domain-specific rules. The transformation to Conjunctive Normal Form (CNF) obscures these patterns. Current local search methods ignore these structures in favor of general approaches. This paper addresses these limitations by introducing a framework that leverages LLMs to generate local search strategies tailored to encoding structures, enabling solvers to exploit these patterns for improved performance. Our research addresses three questions: 1. How can LLMs analyze PySAT [Ignatiev et al., 2024] code to interpret how problem structure translates to SAT clauses? 2. How can we create local search strategies that recognize and exploit these encoding patterns?
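As a rough illustration of how a CNF encoding hides problem structure, the sketch below encodes graph coloring as clauses of signed integers and runs a generic WalkSAT-style local search over them. It is plain Python and does not use PySAT; the function names (`encode_coloring`, `unsatisfied`, `walksat`) and the choice of graph coloring are assumptions made for illustration, not the framework described in the paper.

```python
import random
from itertools import combinations


def encode_coloring(edges, n_vertices, n_colors):
    """Encode graph coloring as CNF clauses (lists of signed integers)."""

    def var(vertex, color):
        # 1-based variable index, as in the DIMACS convention.
        return vertex * n_colors + color + 1

    clauses = []
    for v in range(n_vertices):
        # Each vertex gets at least one color ...
        clauses.append([var(v, c) for c in range(n_colors)])
        # ... and at most one color.
        for c1, c2 in combinations(range(n_colors), 2):
            clauses.append([-var(v, c1), -var(v, c2)])
    for u, v in edges:
        # Adjacent vertices must not share a color.
        for c in range(n_colors):
            clauses.append([-var(u, c), -var(v, c)])
    return clauses


def unsatisfied(clauses, assignment):
    """Return the clauses that the assignment does not satisfy."""
    return [
        clause
        for clause in clauses
        if not any((lit > 0) == assignment[abs(lit)] for lit in clause)
    ]


def walksat(clauses, n_vars, max_flips=10000, noise=0.5, seed=0):
    """Generic WalkSAT-style search: it sees only integers, not the graph."""
    rng = random.Random(seed)
    assignment = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}

    def cost_after_flip(v):
        # Number of unsatisfied clauses if variable v were flipped.
        assignment[v] = not assignment[v]
        cost = len(unsatisfied(clauses, assignment))
        assignment[v] = not assignment[v]
        return cost

    for _ in range(max_flips):
        unsat = unsatisfied(clauses, assignment)
        if not unsat:
            return assignment
        clause = rng.choice(unsat)
        if rng.random() < noise:
            chosen = abs(rng.choice(clause))  # random-walk step
        else:
            chosen = min((abs(lit) for lit in clause), key=cost_after_flip)
        assignment[chosen] = not assignment[chosen]
    return None  # no satisfying assignment found within the flip budget


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    cnf = encode_coloring(triangle, n_vertices=3, n_colors=3)
    print(cnf[:4])  # only integers remain; vertices and colors are implicit
    print(walksat(cnf, n_vars=9))
```

Once the clauses are built, the solver no longer knows which variables belong to the same vertex or which clauses come from the same edge. A structure-aware strategy could, for example, recolor a whole vertex in one move instead of flipping single variables.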
Exploring Spatial Language Grounding Through Referring Expressions
Tumu, Akshar, Kordjamshidi, Parisa
Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation
Liu, Jian, Sun, Wei, Yang, Hui, Deng, Pengchao, Liu, Chongpei, Sebe, Nicu, Rahmani, Hossein, Mian, Ajmal
Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at https://github.com/CNJianLiu/Diff9D.
Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis
Zhu, Haonan, Goncalves, Andre R., Valdes, Camilo, Ranganathan, Hiranmayi, Zhang, Boya, Martí, Jose Manuel, Kok, Car Reen, Borucki, Monica K., Mulakken, Nisha J., Thissen, James B., Jaing, Crystal, Hero, Alfred, Be, Nicholas A.
This paper proposes a hierarchical Bayesian multitask learning model that is applicable to the general multitask binary classification problem, where the model assumes a shared sparsity structure across different tasks. We derive a computationally efficient inference algorithm based on variational inference to approximate the posterior distribution. We demonstrate the potential of the new approach on various synthetic datasets and for predicting human health status based on microbiome profiles. Our analysis incorporates data pooled from multiple microbiome studies, along with a comprehensive comparison with other benchmark methods. Results on synthetic datasets show that the proposed approach has superior support recovery properties when the underlying regression coefficients share a common sparsity structure across different tasks. Our experiments on microbiome classification demonstrate the utility of the method in extracting informative taxa while providing well-calibrated predictions with uncertainty quantification and achieving competitive performance in terms of prediction metrics. Notably, despite the heterogeneity of the pooled datasets (e.g., different experimental objectives, laboratory setups, sequencing equipment, patient demographics), our method delivers robust results.
Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review
Gaya-Morey, F. Xavier, Buades-Rubio, Jose M., Palanque, Philippe, Lacuesta, Raquel, Manresa-Yee, Cristina
The rapid aging of the global population has highlighted the need for technologies to support the elderly, particularly in healthcare and emotional well-being. Facial expression recognition (FER) systems offer a non-invasive means of monitoring emotional states, with applications in assisted living, mental health support, and personalized care. This study presents a systematic review of deep learning-based FER systems, focusing on their applications for the elderly population. Following a rigorous methodology, we analyzed 31 studies published over the last decade, addressing challenges such as the scarcity of elderly-specific datasets, class imbalances, and the impact of age-related facial expression differences. Our findings show that convolutional neural networks remain dominant in FER, especially lightweight versions suited to resource-constrained environments. However, existing datasets often lack diversity in age representation, and real-world deployment remains limited. Additionally, privacy concerns and the need for explainable artificial intelligence (XAI) emerged as key barriers to adoption. This review underscores the importance of developing age-inclusive datasets, integrating multimodal solutions, and adopting XAI techniques to enhance system usability, reliability, and trustworthiness. We conclude by offering recommendations for future research to bridge the gap between academic progress and real-world implementation in elderly care.
Policy-Guided Causal State Representation for Offline Reinforcement Learning Recommendation
Wang, Siyu, Chen, Xiaocong, Yao, Lina
In offline reinforcement learning-based recommender systems (RLRS), learning effective state representations is crucial for capturing user preferences that directly impact long-term rewards. However, raw state representations often contain high-dimensional, noisy information and components that are not causally relevant to the reward. Additionally, missing transitions in offline data make it challenging to accurately identify features that are most relevant to user satisfaction. To address these challenges, we propose Policy-Guided Causal Representation (PGCR), a novel two-stage framework for causal feature selection and state representation learning in offline RLRS. In the first stage, we learn a causal feature selection policy that generates modified states by isolating and retaining only the causally relevant components (CRCs) while altering irrelevant components. This policy is guided by a reward function based on the Wasserstein distance, which measures the causal effect of state components on the reward and encourages the preservation of CRCs that directly influence user interests. In the second stage, we train an encoder to learn compact state representations by minimizing the mean squared error (MSE) loss between the latent representations of the original and modified states, ensuring that the representations focus on CRCs. We provide a theoretical analysis proving the identifiability of causal effects from interventions, validating the ability of PGCR to isolate critical state components for decision-making. Extensive experiments demonstrate that PGCR significantly improves recommendation performance, confirming its effectiveness for offline RL-based recommender systems.
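The two quantities named in the PGCR abstract can be sketched in a few lines of plain Python: an empirical one-dimensional Wasserstein distance between two equal-sized reward samples, and the MSE between the latent codes of an original and a modified state. The linear encoder, the function names, and the toy numbers are all assumptions for illustration; the abstract does not specify how PGCR computes its reward or what encoder it uses.

```python
def wasserstein_1d(sample_a, sample_b):
    """Empirical 1-Wasserstein distance between equal-sized 1-D samples.

    For two samples of the same size, this equals the mean absolute
    difference between their sorted values.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


def encode(weights, state):
    """Hypothetical linear encoder: one latent unit per weight row."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def latent_mse(weights, state, modified_state):
    """MSE between the latent codes of the original and modified state."""
    z = encode(weights, state)
    z_mod = encode(weights, modified_state)
    return sum((a - b) ** 2 for a, b in zip(z, z_mod)) / len(z)


if __name__ == "__main__":
    # Toy rewards observed with and without altering a state component.
    print(wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0]))

    state = [1.0, 2.0, 3.0]
    modified = [1.0, 2.0, 9.0]  # only the third component was altered
    ignores_third = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    uses_third = [[0.0, 0.0, 1.0]]
    print(latent_mse(ignores_third, state, modified))
    print(latent_mse(uses_third, state, modified))
```

In this toy setting, an encoder that ignores the altered component incurs zero loss, while one that depends on it is penalized. This is the intuition behind training the representation to focus on components that the selection policy preserved.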
FAB-PPI: Frequentist, Assisted by Bayes, Prediction-Powered Inference
Cortinovis, Stefano, Caron, François
Prediction-powered inference (PPI) enables valid statistical inference by combining experimental data with machine learning predictions. When a sufficient number of high-quality predictions is available, PPI results in more accurate estimates and tighter confidence intervals than traditional methods. In this paper, we propose to inform the PPI framework with prior knowledge on the quality of the predictions. The resulting method, which we call frequentist, assisted by Bayes, PPI (FAB-PPI), improves over PPI when the observed prediction quality is likely under the prior, while maintaining its frequentist guarantees. Furthermore, when using heavy-tailed priors, FAB-PPI adaptively reverts to standard PPI in low prior probability regions. We demonstrate the benefits of FAB-PPI in real and synthetic examples.
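For orientation, the standard PPI estimate of a mean (not FAB-PPI itself) can be sketched as follows. The average prediction on a large unlabeled set is corrected by the average prediction error measured on a small labeled set, and both sources of variance enter a normal-approximation confidence interval. The function name `ppi_mean_ci` and the toy data are assumptions for illustration. The prior-informed treatment of the correction term that distinguishes FAB-PPI is not implemented here.

```python
from statistics import NormalDist, mean, variance


def ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Standard PPI point estimate and confidence interval for a mean.

    y_labeled:      true outcomes on the small labeled set
    pred_labeled:   model predictions on the same labeled set
    pred_unlabeled: model predictions on the large unlabeled set
    """
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: how far the predictions are off, measured on labeled data.
    rectifier = [f - y for f, y in zip(pred_labeled, y_labeled)]
    estimate = mean(pred_unlabeled) - mean(rectifier)
    # Variance contributions from the unlabeled and labeled samples.
    std_err = (variance(pred_unlabeled) / big_n + variance(rectifier) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Toy data: the predictor is biased upward by about 0.5.
    y = [1.0, 2.0, 3.0, 2.0]
    f_labeled = [1.4, 2.6, 3.5, 2.5]
    f_unlabeled = [2.3, 2.7, 2.4, 2.6, 2.5, 2.5]
    print(ppi_mean_ci(y, f_labeled, f_unlabeled))
```

When the predictions are accurate, the rectifier has small variance and the interval is driven mostly by the large unlabeled sample, which is where the tighter intervals mentioned in the abstract come from.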
Dementia Classification Using Acoustic Speech and Feature Selection
Niemelä, Marko, von Bonsdorff, Mikaela, Äyrämö, Sami, Kärkkäinen, Tommi
Dementia is a general term for a group of syndromes that affect cognitive functions such as memory, thinking, reasoning, and the ability to perform daily tasks. The number of dementia patients is increasing as the population ages, and it is estimated that over 10 million people develop dementia each year. Dementia progresses gradually, and the sooner a patient receives help and support, the better their chances of maintaining their functional abilities. For this reason, early diagnosis of dementia is important. In recent years, machine learning models based on naturally spoken language have been developed for the early diagnosis of dementia. These methods have proven to be user-friendly, cost-effective, scalable, and capable of providing extremely fast diagnoses. This study utilizes the well-known ADReSS challenge dataset for classifying healthy controls and Alzheimer's patients. The dataset contains speech recordings from a picture description task featuring a kitchen scene, collected from both healthy controls and dementia patients. Unlike most studies, this research does not segment the audio recordings into active speech segments; instead, acoustic features are extracted from entire recordings. The study employs Ridge linear regression, Extreme Minimal Learning Machine (EMLM), and Linear Support Vector Machine models to compute feature importance scores based on model outputs. The Ridge model performed best in Leave-One-Subject-Out cross-validation, achieving a classification accuracy of 87.8%. The EMLM model proved to be effective in both cross-validation and the classification of a separate test dataset, with accuracies of 85.3% and 79.2%, respectively. The study's results rank among the best reported by studies using the same dataset and acoustic feature extraction for dementia diagnosis.
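Leave-One-Subject-Out cross-validation, as used in the dementia study above, can be sketched as follows: each fold holds out every recording from one subject and trains on all the others, so no subject appears in both training and test data. The splitter, the toy threshold classifier, and the one-dimensional toy features are assumptions for illustration. They stand in for the study's acoustic features and its Ridge, EMLM, and SVM models, which are not reproduced here.

```python
from statistics import mean


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per fold."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def loso_accuracy(features, labels, subject_ids, fit, predict):
    """Fraction of samples classified correctly across all folds."""
    correct = 0
    for _, train, test in leave_one_subject_out(subject_ids):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, features[i]) == labels[i] for i in test)
    return correct / len(labels)


def fit_threshold(xs, ys):
    """Toy classifier: threshold halfway between the two class means."""
    mean0 = mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2, mean1 >= mean0


def predict_threshold(model, x):
    threshold, one_is_high = model
    return int((x > threshold) == one_is_high)


if __name__ == "__main__":
    # Hypothetical data: one scalar feature per recording, four subjects.
    features = [0.1, 0.2, 0.9, 0.8]
    labels = [0, 0, 1, 1]
    subjects = ["s1", "s2", "s3", "s4"]
    print(loso_accuracy(features, labels, subjects, fit_threshold, predict_threshold))
```

Grouping folds by subject instead of by sample matters whenever a subject contributes more than one recording, since a random split could otherwise leak speaker identity into the test set.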