Collaborating Authors: Mallick, Tanwi


EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants

arXiv.org Artificial Intelligence

Recent advancements have positioned AI, and particularly Large Language Models (LLMs), as transformative tools for scientific research, capable of addressing complex tasks that require reasoning, problem-solving, and decision-making. Their exceptional capabilities suggest their potential as scientific research assistants but also highlight the need for holistic, rigorous, and domain-specific evaluation to assess effectiveness in real-world scientific applications. This paper describes a multifaceted methodology for Evaluating AI models as scientific Research Assistants (EAIRA) developed at Argonne National Laboratory. The methodology incorporates four primary classes of evaluations: 1) Multiple Choice Questions to assess factual recall; 2) Open Response to evaluate advanced reasoning and problem-solving skills; 3) Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments; and 4) Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications. These complementary methods enable a comprehensive analysis of LLM strengths and weaknesses with respect to their scientific knowledge, reasoning abilities, and adaptability. Recognizing the rapid pace of LLM advancements, we designed the methodology to evolve and adapt so as to ensure its continued relevance and applicability. This paper describes the state of the methodology at the end of February 2025. Although developed within a subset of scientific domains, the methodology is designed to be generalizable to a wide range of scientific domains.
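
As a concrete illustration of the first evaluation class, the sketch below shows a minimal multiple-choice factual-recall loop. The MCQItem container, the prompt wording, and the query_model() call are placeholders introduced here for illustration; they are not the EAIRA harness.

```python
# Minimal sketch of a multiple-choice factual-recall evaluation loop.
# MCQItem, the prompt template, and query_model() are illustrative
# placeholders, not the EAIRA implementation.
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    choices: dict          # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str            # key of the correct choice

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError

def evaluate_mcq(items: list) -> float:
    """Return the fraction of questions answered with the correct letter."""
    correct = 0
    for item in items:
        options = "\n".join(f"{key}. {text}" for key, text in item.choices.items())
        prompt = (f"{item.question}\n{options}\n"
                  "Answer with the letter of the correct choice only.")
        reply = query_model(prompt).strip().upper()
        if reply[:1] == item.answer.upper():
            correct += 1
    return correct / len(items)
```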


Debiasing Watermarks for Large Language Models via Maximal Coupling

arXiv.org Machine Learning

Watermarking language models is essential for distinguishing between human and machine-generated text and thus maintaining the integrity and trustworthiness of digital communication. We present a novel green/red list watermarking approach that partitions the token set into "green" and "red" lists, subtly increasing the generation probability for green tokens. To correct token distribution bias, our method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal. Theoretical analysis confirms this approach's unbiased nature and robust detection capabilities. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability, and it demonstrates resilience to targeted modifications aimed at improving text quality. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality.
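
The coupling idea can be made concrete with a short sampling sketch. The construction below is an illustrative unbiased green/red sampler in the spirit of the abstract, assuming a hash of the preceding tokens as the pseudorandom key and a 50/50 vocabulary split; it is not the paper's exact algorithm, and the detection side is omitted.

```python
# Hedged sketch of unbiased green/red watermark sampling via a coupling
# argument: a pseudorandom U derived from the context chooses whether the
# next token is drawn from the green or the red slice of the model
# distribution, so the marginal distribution of emitted tokens is unchanged.
# The hashing scheme and the 50/50 split are illustrative assumptions.
import hashlib
import numpy as np

def green_list_and_coin(context_ids, vocab_size, green_fraction=0.5):
    """Derive a reproducible green list and a uniform 'coin' U from context."""
    digest = hashlib.sha256(np.asarray(context_ids, dtype=np.int64).tobytes()).hexdigest()
    rng = np.random.default_rng(int(digest, 16) % (2**32))
    green = np.zeros(vocab_size, dtype=bool)
    green[rng.permutation(vocab_size)[: int(green_fraction * vocab_size)]] = True
    return green, rng.random()

def sample_watermarked(probs, context_ids):
    """Sample the next token; mixing the green and red slices with weights
    P(green) and P(red) recovers the original distribution exactly."""
    probs = np.asarray(probs, dtype=float)
    green, u = green_list_and_coin(context_ids, len(probs))
    take_green = u < probs[green].sum()
    sliced = np.where(green if take_green else ~green, probs, 0.0)
    return int(np.random.default_rng().choice(len(probs), p=sliced / sliced.sum()))
```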


ChatVis: Automating Scientific Visualization with a Large Language Model

arXiv.org Artificial Intelligence

We develop an iterative assistant we call ChatVis that can synthetically generate Python scripts for data analysis and visualization using a large language model (LLM). The assistant allows a user to specify the desired operations in natural language; it then generates a Python script for those operations and prompts the LLM to revise the script as needed until it executes correctly. The iterations include an error detection and correction mechanism that extracts error messages from the execution of the script and subsequently prompts the LLM to correct the error. Our method demonstrates correct execution on five canonical visualization scenarios, comparing results with ground truth. We also compare our results with scripts generated by several other LLMs without any assistance. In every instance, ChatVis successfully generated the correct script, whereas the unassisted LLMs failed to do so. The code is available on GitHub: https://github.com/tanwimallick/ChatVis/.
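
A minimal version of the generate-execute-repair loop described above might look like the sketch below. The prompts and the llm_complete() call are placeholders, not the ChatVis prompt templates; only the error-message feedback mechanism from the abstract is illustrated.

```python
# Minimal sketch of an iterative generate-execute-repair loop like the one
# ChatVis describes. llm_complete() and the prompts are illustrative
# placeholders, not the ChatVis implementation.
import subprocess
import sys
import tempfile

def llm_complete(prompt: str) -> str:
    """Placeholder for a chat-completion call to the underlying LLM."""
    raise NotImplementedError

def generate_script(task: str, max_iters: int = 5) -> str:
    script = llm_complete(f"Write a Python visualization script that: {task}")
    for _ in range(max_iters):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(script)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, text=True)
        if result.returncode == 0:
            return script  # the script executed cleanly
        # Feed the extracted error message back and ask for a corrected script.
        script = llm_complete(
            "The following script failed with this error:\n"
            f"{result.stderr}\n\nScript:\n{script}\n\n"
            "Return the full corrected script."
        )
    raise RuntimeError("script did not execute correctly within the iteration budget")
```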


Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey

arXiv.org Artificial Intelligence

Rationality is the quality of being guided by reason, characterized by logical thinking and decision-making that align with evidence and logical rules. This quality is essential for effective problem-solving, as it ensures that solutions are well-founded and systematically derived. Despite the advancements of large language models (LLMs) in generating human-like text with remarkable accuracy, they present biases inherited from the training data, inconsistency across different contexts, and difficulty understanding complex scenarios involving multiple layers of context. Therefore, recent research attempts to leverage the strength of multiple agents working collaboratively with various types of data and tools for enhanced consistency and reliability. To that end, this paper aims to understand whether multi-modal and multi-agent systems are advancing toward rationality by surveying the state-of-the-art works, identifying advancements over single-agent and single-modal systems in terms of rationality, and discussing open problems and future directions. We maintain an open repository at https://github.com/bowen-upenn/MMMA_Rationality.


A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

arXiv.org Artificial Intelligence

This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantees, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.
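
One simple way to instantiate such a test, assuming paired model responses on original and superficially perturbed problems, is an exact McNemar-style comparison of the discordant pairs. This is an illustrative statistic, not the paper's exact framework.

```python
# Hedged sketch of a token-bias hypothesis test: under the null of genuine
# reasoning, a superficial token perturbation (e.g., swapping entity names)
# should not systematically break previously solved problems.
# The McNemar-style exact test below is illustrative, not the paper's method.
from scipy.stats import binomtest

def token_bias_test(correct_original, correct_perturbed, alpha=0.05):
    """Paired boolean lists of per-problem correctness before/after perturbation."""
    broke = sum(o and not p for o, p in zip(correct_original, correct_perturbed))
    fixed = sum(p and not o for o, p in zip(correct_original, correct_perturbed))
    if broke + fixed == 0:
        return 1.0, False            # no discordant pairs, nothing to test
    p_value = binomtest(broke, n=broke + fixed, p=0.5, alternative="greater").pvalue
    return p_value, p_value < alpha  # rejection suggests token bias
```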


WildfireGPT: Tailored Large Language Model for Wildfire Analysis

arXiv.org Artificial Intelligence

Understanding and adapting to climate change is paramount for professionals such as urban planners, emergency managers, and infrastructure operators, as it directly influences urban development, disaster response, and the maintenance of essential services. Nonetheless, this task presents a complex challenge that necessitates the integration of advanced technology and scientific insights. Recent advances in LLMs present an innovative solution, particularly in democratizing climate science. They possess the unique capability to interpret and explain technical aspects of climate change through conversations, making this crucial information accessible to people from all backgrounds (Rillig et al. [2023], Bulian et al. [2023], Chen et al. [2023]). However, given that LLMs are generalized models, their performance can be improved by providing additional domain-specific information. Recent research has focused on augmenting LLMs with external tools and data sources to ensure that the information provided is scientifically accurate: for example, leveraging authoritative data sources such as ClimateWatch (Kraus et al. [2023]) and findings from the IPCC AR6 reports (Vaghefi et al. [2023]) helps refine the LLM's outputs, ensuring that the information is grounded in the latest research.
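
The grounding pattern referenced above, retrieving passages from an authoritative corpus before answering, can be sketched as follows. The embed() function, the cosine-similarity retrieval, and the prompt wording are assumptions introduced for illustration, not the WildfireGPT implementation.

```python
# Hedged sketch of retrieval-augmented prompting: retrieve the most similar
# passages from an authoritative corpus and ask the LLM to answer only from
# that context. The embedding model and prompt are illustrative placeholders.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any sentence-embedding model."""
    raise NotImplementedError

def retrieve(question: str, passages: list, k: int = 3) -> list:
    q = embed(question)
    p = np.stack([embed(t) for t in passages])
    scores = p @ q / (np.linalg.norm(p, axis=1) * np.linalg.norm(q) + 1e-12)
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

def grounded_prompt(question: str, passages: list) -> str:
    context = "\n\n".join(retrieve(question, passages))
    return ("Answer the question using only the context below, and say so if "
            f"the context is insufficient.\n\nContext:\n{context}\n\n"
            f"Question: {question}")
```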


Analyzing Regional Impacts of Climate Change using Natural Language Processing Techniques

arXiv.org Artificial Intelligence

Understanding the multifaceted effects of climate change across diverse geographic locations is crucial for timely adaptation and the development of effective mitigation strategies. As the volume of scientific literature on this topic continues to grow exponentially, manually reviewing these documents has become an immensely challenging task. Utilizing Natural Language Processing (NLP) techniques to analyze this wealth of information presents an efficient and scalable solution. By gathering extensive amounts of peer-reviewed articles and studies, we can extract and process critical information about the effects of climate change in specific regions. We employ BERT (Bidirectional Encoder Representations from Transformers) for Named Entity Recognition (NER), which enables us to efficiently identify specific geographies within the climate literature. This, in turn, facilitates location-specific analyses. We conduct region-specific climate trend analyses to pinpoint the predominant themes or concerns related to climate change within a particular area, trace the temporal progression of these identified issues, and evaluate their frequency, severity, and potential development over time. These in-depth examinations of location-specific climate data enable the creation of more customized policy-making, adaptation, and mitigation strategies, addressing each region's unique challenges and providing more effective solutions rooted in data-driven insights. This approach, founded on a thorough exploration of scientific texts, offers actionable insights to a wide range of stakeholders, from policymakers to engineers to environmentalists. By proactively understanding these impacts, societies are better positioned to prepare, allocate resources wisely, and design tailored strategies to cope with future climate conditions, ensuring a more resilient future for all.
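
The location-extraction step can be illustrated with a short NER sketch. The public checkpoint dslim/bert-base-NER and the LOC-filtering rule are illustrative choices, not necessarily the model or post-processing used in the paper.

```python
# Minimal sketch of BERT-based NER used to identify geographies in climate
# literature so that region-specific subsets can be analyzed. The checkpoint
# and the LOC filter are illustrative assumptions.
from collections import Counter
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def extract_locations(abstracts):
    """Count location mentions across a collection of abstracts."""
    counts = Counter()
    for text in abstracts:
        for entity in ner(text):
            if entity["entity_group"] == "LOC":
                counts[entity["word"]] += 1
    return counts
```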


Wavelet-Inspired Multiscale Graph Convolutional Recurrent Network for Traffic Forecasting

arXiv.org Artificial Intelligence

Traffic forecasting is the foundation for intelligent transportation systems. Spatiotemporal graph neural networks have demonstrated state-of-the-art performance in traffic forecasting. However, these methods do not explicitly model some of the natural characteristics in traffic data, such as the multiscale structure that encompasses spatial and temporal variations at different levels of granularity or scale. To that end, we propose a Wavelet-Inspired Graph Convolutional Recurrent Network (WavGCRN), which combines a multiscale analysis (MSA)-based method with a deep learning (DL)-based method. In WavGCRN, the traffic data is decomposed into time-frequency components with the Discrete Wavelet Transformation (DWT), constructing a multi-stream input structure; then Graph Convolutional Recurrent Networks (GCRNs) are employed as encoders for each stream, extracting spatiotemporal features at different scales; and finally the learnable inverse DWT and a GCRN are combined as the decoder, fusing the information from all streams for traffic metric reconstruction and prediction. Furthermore, road-network-informed graphs and data-driven graph learning are combined to accurately capture spatial correlation. The proposed method offers well-defined interpretability, powerful learning capability, and competitive forecasting performance on real-world traffic data sets.
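
A minimal sketch of the multiscale decomposition step, using PyWavelets for the DWT, is shown below. The wavelet, the decomposition level, and the random example data are illustrative assumptions, and the GCRN encoders and learnable inverse-DWT decoder are omitted.

```python
# Hedged sketch of the DWT-based multi-stream input construction: each
# sensor's traffic series is split into time-frequency sub-bands that would
# feed the per-stream encoders. Wavelet and level are illustrative choices.
import numpy as np
import pywt

def decompose(traffic, wavelet="db4", level=2):
    """traffic: array of shape (num_timesteps, num_sensors).
    Returns one array per stream: the coarse approximation plus the
    detail coefficients at each level."""
    return [np.asarray(c) for c in pywt.wavedec(traffic, wavelet, level=level, axis=0)]

# Example: one day of 5-minute readings from 10 sensors -> 3 streams.
streams = decompose(np.random.rand(288, 10))
print([s.shape for s in streams])
```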


Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach

arXiv.org Artificial Intelligence

Natural language processing (NLP) is a promising approach for analyzing large volumes of climate-change and infrastructure-related scientific literature. However, best-in-practice NLP techniques require large collections of relevant documents (a corpus). Furthermore, NLP techniques based on machine learning and deep learning require labels that group the articles according to user-defined criteria for a significant subset of the corpus in order to train a supervised model. Even labeling a few hundred documents with human subject-matter experts is a time-consuming process. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. In comparison with a months-long process of subject-matter-expert labeling, we assign category labels to the whole corpus using weak supervision and supervised learning in about 13 hours. The labeled climate and national critical function (NCF) corpus enables targeted, efficient identification of documents discussing a topic (or combination of topics) of interest and identification of various effects of climate change on critical infrastructure, improving the usability of scientific literature and ultimately supporting enhanced policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. This method is useful for analysts and decision-makers to quickly grasp the relevant topics and the most important documents linked to a topic.
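
The semantic-similarity weak labeling step might be sketched as follows, assuming a public sentence-embedding model (all-MiniLM-L6-v2 here) and an illustrative similarity threshold; neither choice is necessarily what the paper used.

```python
# Hedged sketch of weak labeling via semantic similarity: embed category
# descriptions and documents, assign the best-matching category when the
# cosine similarity clears a threshold, and abstain otherwise.
# Model name and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def weak_labels(documents, category_descriptions, threshold=0.35):
    doc_emb = model.encode(documents, normalize_embeddings=True)
    cat_emb = model.encode(list(category_descriptions.values()),
                           normalize_embeddings=True)
    names = list(category_descriptions.keys())
    sims = doc_emb @ cat_emb.T            # cosine similarity (embeddings normalized)
    labels = []
    for row in sims:
        best = int(np.argmax(row))
        labels.append(names[best] if row[best] >= threshold else None)  # abstain
    return labels
```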


A data-centric weak supervised learning for highway traffic incident detection

arXiv.org Artificial Intelligence

Using data from loop detector sensors for near-real-time detection of traffic incidents on highways is crucial to averting major traffic congestion. While recent supervised machine learning methods offer solutions to incident detection by leveraging human-labeled incident data, the false alarm rate is often too high to be used in practice. Specifically, inconsistency in the human labeling of incidents significantly affects the performance of supervised learning models. To that end, we focus on a data-centric approach to improve the accuracy and reduce the false alarm rate of traffic incident detection on highways. We develop a weak supervised learning workflow to generate high-quality training labels for the incident data without ground truth labels, and we use those generated labels in a supervised learning setup for final detection. This approach comprises three stages. First, we introduce a data preprocessing and curation pipeline that processes traffic sensor data to generate high-quality training data by leveraging labeling functions, which can be domain-knowledge-related or simple heuristic rules. Second, we evaluate the training data generated by weak supervision using three supervised learning models -- random forest, k-nearest neighbors, and a support vector machine ensemble -- as well as long short-term memory classifiers. The results show that the accuracy of all of the models improves significantly after using the training data generated by weak supervision. Third, we develop an online real-time incident detection approach that leverages the model ensemble and uncertainty quantification while detecting incidents. Overall, we show that our proposed weak supervised learning workflow achieves a high incident detection rate (0.90) and a low false alarm rate (0.08).
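
The labeling functions from the first stage can be illustrated with a few heuristic rules that vote on a sensor-reading window. The thresholds and the simple majority-vote combination below are illustrative assumptions, not the paper's calibrated rules or its label model.

```python
# Hedged sketch of heuristic labeling functions for incident detection:
# each rule votes INCIDENT, NORMAL, or abstains, and the votes are combined
# into a weak training label. Thresholds are illustrative, not calibrated.
INCIDENT, NORMAL, ABSTAIN = 1, 0, -1

def lf_speed_drop(window):
    """Sharp speed drop relative to the window's earlier readings."""
    baseline = sum(window["speed"][:6]) / 6
    return INCIDENT if window["speed"][-1] < 0.5 * baseline else ABSTAIN

def lf_occupancy_spike(window):
    """Unusually high detector occupancy at the latest reading."""
    return INCIDENT if window["occupancy"][-1] > 0.35 else ABSTAIN

def lf_free_flow(window):
    """Free-flow speeds suggest normal conditions."""
    return NORMAL if window["speed"][-1] > 55 else ABSTAIN

def weak_label(window, lfs=(lf_speed_drop, lf_occupancy_spike, lf_free_flow)):
    votes = [v for v in (lf(window) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return INCIDENT if votes.count(INCIDENT) >= votes.count(NORMAL) else NORMAL
```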