Incongruence Identification in Eyewitness Testimony
Nair, Akshara, Afroz, Zeba, Akhtar, Md Shad
Incongruence detection in eyewitness narratives is critical for understanding the reliability of testimonies, yet traditional approaches often fail to address the nuanced inconsistencies inherent in such accounts. In this paper, we introduce a novel task of incongruence detection in eyewitness testimonies. Given a pair of testimonies containing multiple question-answer pairs from two subjects, we identify contextually related incongruences between the two subjects. We also mark the spans of the incongruences in the utterances. To achieve this, we developed MIND (MultI-EyewitNess Deception), a comprehensive dataset consisting of 2927 pairs of contextually related answers designed to capture both explicit and implicit contradictions. We also propose INTEND, an INstruction-TunEd iNcongruity Detection framework based on 6W and multi-hop reasoning. Drawing on investigative techniques, INTEND addresses the task as a cloze-style problem, probing for contradictions in the who, what, when, where, and why aspects of the content. Our findings show that prompt tuning, especially when utilizing our framework, enhances the detection of incongruences by a margin of +5.63 percent. We compare our approach with multiple fine-tuning and prompt-tuning techniques on MLMs and LLMs. Empirical results demonstrate convincing performance improvements in F1-score over fine-tuned and regular prompt-tuning techniques, highlighting the effectiveness of our approach.
Mitigating Sensitive Information Leakage in LLMs4Code through Machine Unlearning
Geng, Ruotong, Geng, Mingyang, Wang, Shangwen, Wang, Haotian, Lin, Zhipeng, Dong, Dezun
Large Language Models for Code (LLMs4Code) excel at code generation tasks, promising to relieve developers of heavy software development burdens. Nonetheless, these models have been shown to carry significant privacy risks due to the potential leakage of sensitive information embedded during training, known as the memorization problem. Addressing this issue is crucial for ensuring privacy compliance and upholding user trust, but to date there is a dearth of dedicated studies in the literature that focus on this specific direction. Recently, machine unlearning has emerged as a promising solution by enabling models to "forget" sensitive information without full retraining, offering an efficient and scalable approach compared to traditional data cleaning methods. In this paper, we empirically evaluate the effectiveness of unlearning techniques for addressing privacy concerns in LLMs4Code. Specifically, we investigate three state-of-the-art unlearning algorithms and three well-known open-source LLMs4Code on a benchmark that takes into consideration both the private data to be forgotten and the code generation capabilities of these models. Results show that it is feasible to mitigate the privacy concerns of LLMs4Code through machine unlearning while maintaining their code generation capabilities. We also dissect the forms of privacy protection and leakage after unlearning and observe a shift from direct leakage to indirect leakage, which underscores the need for future studies addressing this risk.
Online Bidding Algorithms with Strict Return on Spend (ROS) Constraint
The auto-bidding problem under a strict return-on-spend constraint (ROSC) is considered, where an algorithm has to decide how much to bid for an ad slot depending on the revealed value and on the hidden allocation and payment functions, which describe the probability of winning the ad slot and the resulting payment as functions of the bid. The objective of an algorithm is to maximize the expected utility (the product of the ad value and the probability of winning the ad slot) summed across all time slots, subject to the total expected payment being less than the total expected utility; this condition is called the ROSC. A (surprising) impossibility result is derived, showing that no online algorithm can achieve sub-linear regret even when the value, allocation, and payment functions are drawn i.i.d. from an unknown distribution. The problem is non-trivial even when the revealed value remains constant across time slots; for this case, an algorithm with a regret guarantee that is optimal up to a logarithmic factor is derived.
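The optimization problem described in this abstract can be written out as a rough sketch. The notation here is an assumption, since the abstract introduces no symbols: $T$ is the number of time slots, $v_t$ the revealed value, $b_t$ the bid, $x_t(b_t)$ the hidden allocation function (probability of winning), and $p_t(b_t)$ the hidden expected payment. The constraint is written with $\le$, although the abstract says only "less than".

$$
\max_{b_1,\dots,b_T} \; \mathbb{E}\!\left[\sum_{t=1}^{T} v_t \, x_t(b_t)\right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[\sum_{t=1}^{T} p_t(b_t)\right] \;\le\; \mathbb{E}\!\left[\sum_{t=1}^{T} v_t \, x_t(b_t)\right].
$$

In this notation the constraint couples all time slots, so a bid that overpays in one slot must be offset by surplus utility in others.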
Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents
Kumbhar, Shrinidhi, Mishra, Venkatesh, Coutinho, Kevin, Handa, Divij, Iquebal, Ashif, Baral, Chitta
Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials science experts, we curated a novel dataset from recent journal publications, featuring real-world goals, constraints, and methods for designing real-world applications. Using this dataset, we test LLM-based agents that generate hypotheses for achieving given goals under specific constraints. To assess the relevance and quality of these hypotheses, we propose a novel scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. Our curated dataset, proposed method, and evaluation framework aim to advance future research in accelerating materials discovery and design with LLMs.
Convolutional Fourier Analysis Network (CFAN): A Unified Time-Frequency Approach for ECG Classification
Machine learning has transformed the classification of biomedical signals such as electrocardiograms (ECGs). Advances in deep learning, particularly convolutional neural networks (CNNs), enable automatic feature extraction, raising the question: Can combining time- and frequency-domain attributes enhance classification accuracy? To explore this, we evaluated three ECG classification tasks: (1) arrhythmia classification, (2) identity recognition, and (3) apnea detection. We initially tested three methods: (i) 2-D spectrogram-based time-frequency classification (SPECT), (ii) time-domain classification using a 1-D CNN (CNN1D), and (iii) frequency-domain classification using a Fourier transform-based CNN (FFT1D). Performance was validated using K-fold cross-validation. Among these, CNN1D (time only) performed best, followed by SPECT (time-frequency) and FFT1D (frequency only). Surprisingly, SPECT, which integrates time- and frequency-domain features, performed worse than CNN1D, suggesting a need for a more effective time-frequency fusion approach. To address this, we tested the recently proposed Fourier Analysis Network (FAN), which combines time- and frequency-domain features. However, FAN performed comparably to CNN1D, excelling in some tasks while underperforming in others. To enhance this approach, we developed the Convolutional Fourier Analysis Network (CFAN), which integrates FAN with CNN. CFAN outperformed all previous methods across all classification tasks. These findings underscore the advantages of combining time- and frequency-domain features, demonstrating CFAN's potential as a powerful and versatile solution for ECG classification and broader biomedical signal analysis.
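To make the idea of combining time- and frequency-domain attributes concrete, the following is a minimal sketch that concatenates raw samples with Fourier magnitudes into one feature vector. It is not the CFAN, FAN, or FFT1D architecture from the abstract. The function names `dft_magnitudes` and `fused_features` and the toy sine signal are hypothetical, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Naive O(n^2) discrete Fourier transform.

    Returns normalized magnitudes for the non-negative frequency bins 0..n//2.
    """
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc) / n)
    return mags


def fused_features(signal):
    """Concatenate time-domain samples with frequency-domain magnitudes."""
    return list(signal) + dft_magnitudes(signal)


# Toy input (assumption: a pure sine with 4 cycles over 32 samples, not ECG data).
n = 32
toy = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]

features = fused_features(toy)   # 32 time samples followed by 17 frequency bins
spectrum = dft_magnitudes(toy)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

A downstream classifier would receive `features` as input. In the abstract, the fusion is learned inside the network rather than computed by hand as it is here.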
RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Yang, Ziqi, Lu, Yuxuan, Bagdasarian, Jennifer, Swain, Vedant Das, Agarwal, Ritu, Campbell, Collin, Al-Refaire, Waddah, El-Bayoumi, Jehan, Gao, Guodong, Wang, Dakuo, Yao, Bingsheng, Shara, Nawar
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.
Evaluating Vision-Language Models for Emotion Recognition
Bhattacharyya, Sree, Wang, James Z.
Large Vision-Language Models (VLMs) have achieved unprecedented success in several objective multimodal reasoning tasks. However, to further enhance their capabilities of empathetic and effective communication with humans, improving how VLMs process and understand emotions is crucial. Despite significant research attention on improving affective understanding, there is a lack of detailed evaluations of VLMs for emotion-related tasks, which can potentially help inform downstream fine-tuning efforts. In this work, we present the first comprehensive evaluation of VLMs for recognizing evoked emotions from images. We create a benchmark for the task of evoked emotion recognition and study the performance of VLMs for this task, from perspectives of correctness and robustness. Through several experiments, we demonstrate important factors that emotion recognition performance depends on, and also characterize the various errors made by VLMs in the process. Finally, we pinpoint potential causes for errors through a human evaluation study. We use our experimental results to inform recommendations for the future of emotion research in the context of VLMs.
Football Manager 25 cancelled after two delays
The latest update in the popular Football Manager series has been cancelled, its makers have announced. Fans of the long-running video game began to speculate about its fate when an update due to be unveiled late last month did not arrive. In a blog post, developer Sports Interactive told players it had made the "difficult decision" to cancel the 2025 edition as it was "too far away from the standards you deserve". It said it would now shift focus to the 2026 version of the game and fans who had preordered the cancelled release could obtain a refund. Football Manager, first launched in 2004, allows fans to step into the shoes of a gaffer and guide a chosen team through a season.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
Update after rebuttal: I thank the authors for a comprehensive rebuttal and extra experiments. It has addressed most of my concerns, and I have updated my score. The authors should make sure to properly tone down the claims about improved training for LDA (vs. It seems to me that we do not really understand very well what is happening in these models at this stage; this perplexity experiment is just scratching the surface (and should be presented as such). I am also a bit puzzled by the use of alpha 1.001 (vs.
Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject
Duan, Zenghao, Duan, Wenbin, Yin, Zhiyi, Shen, Yinghan, Jing, Shaoling, Zhang, Jie, Shen, Huawei, Cheng, Xueqi
Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the $\text{S}^2\text{RKE}$ (Same-Subject Related Knowledge Editing) benchmark. Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit "related knowledge perturbation," where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.