Overview
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery
Zheng, Tianshi, Deng, Zheye, Tsang, Hong Ting, Wang, Weiqi, Bai, Jiaxin, Wang, Zihao, Song, Yangqiu
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy-Tool, Analyst, and Scientist-to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement. Github Repository: https://github.com/HKUST-KnowComp/Awesome-LLM-Scientific-Discovery.
Exploring the Relationship between Brain Hemisphere States and Frequency Bands through Deep Learning Optimization Techniques
Islam, Robiul, Ignatov, Dmitry I., Kaberg, Karl, Nabatchikov, Roman
This study investigates classifier performance across EEG frequency bands using various optimizers and evaluates efficient class prediction for the left and right hemispheres. Three neural network architectures - a deep dense network, a shallow three-layer network, and a convolutional neural network (CNN) - are implemented and compared using the TensorFlow and PyTorch frameworks. Results indicate that the Adagrad and RMSprop optimizers consistently perform well across different frequency bands, with Adadelta exhibiting robust performance in cross-model evaluations. Specifically, Adagrad excels in the beta band, while RMSprop achieves superior performance in the gamma band. Conversely, SGD and FTRL exhibit inconsistent performance. Among the models, the CNN demonstrates the second highest accuracy, particularly in capturing spatial features of EEG data. The deep dense network shows competitive performance in learning complex patterns, whereas the shallow three-layer network, sometimes being less accurate, provides computational efficiency. SHAP (Shapley Additive Explanations) plots are employed to identify efficient class prediction, revealing nuanced contributions of EEG frequency bands to model accuracy. Overall, the study highlights the importance of optimizer selection, model architecture, and EEG frequency band analysis in enhancing classifier performance and understanding feature importance in neuroimaging-based classification tasks.
CrowdAgent: Multi-Agent Managed Multi-Source Annotation System
Qin, Maosheng, Zhu, Renyu, Xia, Mingxuan, Chen, Chenkai, Zhu, Zhen, Lin, Minmin, Zhao, Junbo, Xu, Lu, Fan, Changjie, Wu, Runze, Wang, Haobo
High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources-including Large Language Models (LLMs), Small Language Models (SLMs), and human experts-they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality-cost trade-offs in a unified manner. Inspired by real-world crowdsourcing companies, we introduce CrowdAgent, a multi-agent system that provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.
An Exhaustive DPLL Approach to Model Counting over Integer Linear Constraints with Simplification Techniques
Zhang, Mingwei, Gu, Zhenhao, Fang, Liangda, Ge, Cunjing, Chen, Ziliang, Lai, Zhao-Rong, Guan, Quanlong
Linear constraints are one of the most fundamental constraints in fields such as computer science, operations research and optimization. Many applications reduce to the task of model counting over integer linear constraints (MCILC). In this paper, we design an exact approach to MCILC based on an exhaustive DPLL architecture. To improve the efficiency, we integrate several effective simplification techniques from mixed integer programming into the architecture. We compare our approach to state-of-the-art MCILC counters and propositional model counters on 2840 random and 4131 application benchmarks. Experimental results show that our approach significantly outperforms all exact methods in random benchmarks solving 1718 instances while the state-of-the-art approach only computes 1470 instances. In addition, our approach is the only approach to solve all 4131 application instances.
Understanding the Process of Human-AI Value Alignment
McKinlay, Jack, De Vos, Marina, Hoffmann, Janina A., Theodorou, Andreas
Background: Value alignment in computer science research is often used to refer to the process of aligning artificial intelligence with humans, but the way the phrase is used often lacks precision. Objectives: In this paper, we conduct a systematic literature review to advance the understanding of value alignment in artificial intelligence by characterising the topic in the context of its research literature. We use this to suggest a more precise definition of the term. Methods: We analyse 172 value alignment research articles that have been published in recent years and synthesise their content using thematic analyses. Results: Our analysis leads to six themes: value alignment drivers & approaches; challenges in value alignment; values in value alignment; cognitive processes in humans and AI; human-agent teaming; and designing and developing value-aligned systems. Conclusions: By analysing these themes in the context of the literature we define value alignment as an ongoing process between humans and autonomous agents that aims to express and implement abstract values in diverse contexts, while managing the cognitive limits of both humans and AI agents and also balancing the conflicting ethical and political demands generated by the values in different groups. Our analysis gives rise to a set of research challenges and opportunities in the field of value alignment for future work.
Large Language Models Discriminate Against Speakers of German Dialects
Bui, Minh Duc, Holtermann, Carolin, Hofmann, Valentin, Lauscher, Anne, von der Wense, Katharina
Dialects represent a significant component of human culture and are found across all regions of the world. In Germany, more than 40% of the population speaks a regional dialect (Adler and Hansen, 2022). However, despite cultural importance, individuals speaking dialects often face negative societal stereotypes. We examine whether such stereotypes are mirrored by large language models (LLMs). We draw on the sociolinguistic literature on dialect perception to analyze traits commonly associated with dialect speakers. Based on these traits, we assess the dialect naming bias and dialect usage bias expressed by LLMs in two tasks: an association task and a decision task. To assess a model's dialect usage bias, we construct a novel evaluation corpus that pairs sentences from seven regional German dialects (e.g., Alemannic and Bavarian) with their standard German counterparts. We find that: (1) in the association task, all evaluated LLMs exhibit significant dialect naming and dialect usage bias against German dialect speakers, reflected in negative adjective associations; (2) all models reproduce these dialect naming and dialect usage biases in their decision making; and (3) contrary to prior work showing minimal bias with explicit demographic mentions, we find that explicitly labeling linguistic demographics--German dialect speakers--amplifies bias more than implicit cues like dialect usage.
MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation
Bian, Zhipeng, Zhu, Jieming, Xie, Xuyang, Dai, Quanyu, Zhao, Zhou, Dong, Zhenhua
The rapid advancement of generative AI technologies is driving the integration of diverse AI-powered services into smartphones, transforming how users interact with their devices. To simplify access to predefined AI services, this paper introduces MIRA, a pioneering framework for task instruction recommendation that enables intuitive one-touch AI tasking on smartphones. With MIRA, users can long-press on images or text objects to receive contextually relevant instruction recommendations for executing AI tasks. Our work introduces three key innovations: 1) A multimodal large language model (MLLM)-based recommendation pipeline with structured reasoning to extract key entities, infer user intent, and generate precise instructions; 2) A template-augmented reasoning mechanism that integrates high-level reasoning templates, enhancing task inference accuracy; 3) A prefix-tree-based constrained decoding strategy that restricts outputs to predefined instruction candidates, ensuring coherent and intent-aligned suggestions. Through evaluation using a real-world annotated datasets and a user study, MIRA has demonstrated substantial improvements in the accuracy of instruction recommendation. The encouraging results highlight MIRA's potential to revolutionize the way users engage with AI services on their smartphones, offering a more seamless and efficient experience.
DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models
Large Language Model (LLM) hallucination is a significant barrier to their reliable deployment. Current methods like Retrieval-Augmented Generation (RAG) are often reactive. We introduce **Dynamic Self-reinforcing Calibration for Hallucination Suppression (DSCC-HS)**, a novel, proactive framework that intervenes during autoregressive decoding. Inspired by dual-process cognitive theory, DSCC-HS uses a compact proxy model, trained in adversarial roles as a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). During inference, these proxies dynamically steer a large target model by injecting a real-time steering vector, which is the difference between FAP and HDP logits, at each decoding step. This plug-and-play approach requires no modification to the target model. Our experiments on TruthfulQA and BioGEN show DSCC-HS achieves state-of-the-art performance. On TruthfulQA, it reached a 99.2% Factual Consistency Rate (FCR). On the long-form BioGEN benchmark, it attained the highest FActScore of 46.50. These results validate DSCC-HS as a principled and efficient solution for enhancing LLM factuality.
Op-Fed: Opinion, Stance, and Monetary Policy Annotations on FOMC Transcripts Using Active Learning
Kanganis, Alisa, Keith, Katherine A.
The U.S. Federal Open Market Committee (FOMC) regularly discusses and sets monetary policy, affecting the borrowing and spending decisions of millions of people. In this work, we release Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts. We faced two major technical challenges in dataset creation: imbalanced classes -- we estimate fewer than 8% of sentences express a non-neutral stance towards monetary policy -- and inter-sentence dependence -- 65% of instances require context beyond the sentence-level. To address these challenges, we developed a five-stage hierarchical schema to isolate aspects of opinion, monetary policy, and stance towards monetary policy as well as the level of context needed. Second, we selected instances to annotate using active learning, roughly doubling the number of positive instances across all schema aspects. Using Op-Fed, we found a top-performing, closed-weight LLM achieves 0.80 zero-shot accuracy in opinion classification but only 0.61 zero-shot accuracy classifying stance towards monetary policy -- below our human baseline of 0.89. We expect Op-Fed to be useful for future model training, confidence calibration, and as a seed dataset for future annotation efforts.
A Domain Knowledge Informed Approach for Anomaly Detection of Electric Vehicle Interior Sounds
Kunte, Deepti, Cornelis, Bram, Colangeli, Claudio, Janssens, Karl, Van Baelen, Brecht, Gryllias, Konstantinos
The detection of anomalies in automotive cabin sounds is critical for ensuring vehicle quality and maintaining passenger comfort. In many real-world settings, this task is more appropriately framed as an unsupervised learning problem rather than the supervised case due to the scarcity or complete absence of labeled faulty data. In such an unsupervised setting, the model is trained exclusively on healthy samples and detects anomalies as deviations from normal behavior. However, in the absence of labeled faulty samples for validation and the limited reliability of commonly used metrics, such as validation reconstruction error, effective model selection remains a significant challenge. To overcome these limitations, a domain-knowledge-informed approach for model selection is proposed, in which proxy-anomalies engineered through structured perturbations of healthy spectrograms are used in the validation set to support model selection. The proposed methodology is evaluated on a high-fidelity electric vehicle dataset comprising healthy and faulty cabin sounds across five representative fault types viz., Imbalance, Modulation, Whine, Wind, and Pulse Width Modulation. This dataset, generated using advanced sound synthesis techniques, and validated via expert jury assessments, has been made publicly available to facilitate further research. Experimental evaluations on the five fault cases demonstrate the selection of optimal models using proxy-anomalies, significantly outperform conventional model selection strategies.