Overview
Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models
Rooein, Donya, Plaza-del-Arco, Flor Miriam, Nozza, Debora, Hovy, Dirk
Given Farsi's speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings indicate that the volume of data is insufficient to significantly improve a language's prospects in NLP.
A Survey of the State-of-the-Art in Conversational Question Answering Systems
Perera, Manoj Madushanka, Mahmood, Adnan, Wijethilake, Kasun Eranda, Islam, Fahmida, Tahermazandarani, Maryam, Sheng, Quan Z.
Conversational Question Answering (ConvQA) systems have emerged as a pivotal area within Natural Language Processing (NLP) by driving advancements that enable machines to engage in dynamic and context-aware conversations. These capabilities are increasingly being applied across various domains, i.e., customer support, education, legal, and healthcare where maintaining a coherent and relevant conversation is essential. Building on recent advancements, this survey provides a comprehensive analysis of the state-of-the-art in ConvQA. This survey begins by examining the core components of ConvQA systems, i.e., history selection, question understanding, and answer prediction, highlighting their interplay in ensuring coherence and relevance in multi-turn conversations. It further investigates the use of advanced machine learning techniques, including but not limited to, reinforcement learning, contrastive learning, and transfer learning to improve ConvQA accuracy and efficiency. The pivotal role of large language models, i.e., RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3, is also explored, thereby showcasing their impact through data scalability and architectural advancements. Additionally, this survey presents a comprehensive analysis of key ConvQA datasets and concludes by outlining open research directions. Overall, this work offers a comprehensive overview of the ConvQA landscape and provides valuable insights to guide future advancements in the field.
MonoGlass3D: Monocular 3D Glass Detection with Plane Regression and Adaptive Feature Fusion
Zhang, Kai, Zhao, Guoyang, Shi, Jianxing, Liu, Bonan, Qi, Weiqing, Ma, Jun
Detecting and localizing glass in 3D environments poses significant challenges for visual perception systems, as the optical properties of glass often hinder conventional sensors from accurately distinguishing glass surfaces. The lack of real-world datasets focused on glass objects further impedes progress in this field. To address this issue, we introduce a new dataset featuring a wide range of glass configurations with precise 3D annotations, collected from distinct real-world scenarios. On the basis of this dataset, we propose MonoGlass3D, a novel approach tailored for monocular 3D glass detection across diverse environments. To overcome the challenges posed by the ambiguous appearance and context diversity of glass, we propose an adaptive feature fusion module that empowers the network to effectively capture contextual information in varying conditions. Additionally, to exploit the distinct planar geometry of glass surfaces, we present a plane regression pipeline, which enables seamless integration of geometric properties within our framework. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both glass segmentation and monocular glass depth estimation. Our results highlight the advantages of combining geometric and contextual cues for transparent surface understanding.
User Privacy and Large Language Models: An Analysis of Frontier Developers' Privacy Policies
King, Jennifer, Klyman, Kevin, Capstick, Emily, Saade, Tiffany, Hsieh, Victoria
Model developers are eager to acquire new sources of high-quality training data as they race to improve model capabilities and win market share. This paper analyzes the privacy policies of six U.S. frontier AI developers to understand how they use their users' chats to train models. Drawing primarily on the California Consumer Privacy Act, we develop a novel qualitative coding schema that we apply to each developer's relevant privacy policies to compare data collection and use practices across the six companies. We find that all six developers appear to employ their users' chat data to train and improve their models by default, and that some retain this data indefinitely. Developers may collect and train on personal information disclosed in chats, including sensitive information such as biometric and health data, as well as files uploaded by users. Four of the six companies we examined appear to include children's chat data for model training, as well as customer data from other products. On the whole, developers' privacy policies often lack essential information about their practices, highlighting the need for greater transparency and accountability. We address the implications of users' lack of consent for the use of their chat data for model training, data security risks arising from indefinite chat data retention, and issues with training on children's chat data. We conclude by providing recommendations to policymakers and developers to address the data privacy challenges posed by LLM-powered chatbots.
Cumplimiento del Reglamento (UE) 2024/1689 en robรณtica y sistemas autรณnomos: una revisiรณn sistemรกtica de la literatura
This systematic literature review analyzes the current state of compliance with Regulation (EU) 2024/1689 in autonomous robotic systems, focusing on cybersecurity frameworks and methodologies. Using the PRISMA protocol, 22 studies were selected from 243 initial records across IEEE Xplore, ACM DL, Scopus, and Web of Science. Findings reveal partial regulatory alignment: while progress has been made in risk management and encrypted communications, significant gaps persist in explainability modules, real-time human oversight, and knowledge base traceability. Only 40% of reviewed solutions explicitly address transparency requirements, and 30% implement failure intervention mechanisms. The study concludes that modular approaches integrating risk, supervision, and continuous auditing are essential to meet the AI Act mandates in autonomous robotics.
VILOD: A Visual Interactive Labeling Tool for Object Detection
The advancement of Object Detection (OD) using Deep Learning (DL) is often hindered by the significant challenge of acquiring large, accurately labeled datasets, a process that is time-consuming and expensive. While techniques like Active Learning (AL) can reduce annotation effort by intelligently querying informative samples, they often lack transparency, limit the strategic insight of human experts, and may overlook informative samples not aligned with an employed query strategy. To mitigate these issues, Human-in-the-Loop (HITL) approaches integrating human intelligence and intuition throughout the machine learning life-cycle have gained traction. Leveraging Visual Analytics (VA), effective interfaces can be created to facilitate this human-AI collaboration. This thesis explores the intersection of these fields by developing and investigating "VILOD: A Visual Interactive Labeling tool for Object Detection". VILOD utilizes components such as a t-SNE projection of image features, together with uncertainty heatmaps and model state views. Enabling users to explore data, interpret model states, AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for OD. An empirical investigation using comparative use cases demonstrated how VILOD, through its interactive visualizations, facilitates the implementation of distinct labeling strategies by making the model's state and dataset characteristics more interpretable (RQ1). The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories compared to an automated uncertainty sampling AL baseline (RQ2). This work contributes a novel tool and empirical insight into making the HITL-AL workflow for OD annotation more transparent, manageable, and potentially more effective.
Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare
Shankar, Ravi, Wong, Sheng, Li, Lin, Bachmann, Magdalena, Silverthorne, Alex, Albert, Beth, Jones, Gabriel Davis
Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women's health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance abstention on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM's advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
E-THER: A Multimodal Dataset for Empathic AI -- Towards Emotional Mismatch Awareness
Tahir, Sharjeel, Johnson, Judith, Abu-Khalaf, Jumana, Shah, Syed Afaq Ali
A prevalent shortfall among current empathic AI systems is their inability to recognize when verbal expressions may not fully reflect underlying emotional states. This is because the existing datasets, used for the training of these systems, focus on surface-level emotion recognition without addressing the complex verbal-visual incongruence (mismatch) patterns useful for empathic understanding. In this paper, we present E-THER, the first Person-Centered Therapy-grounded multimodal dataset with multidimensional annotations for verbal-visual incongruence detection, enabling training of AI systems that develop genuine rather than performative empathic capabilities. The annotations included in the dataset are drawn from humanistic approach, i.e., identifying verbal-visual emotional misalignment in client-counsellor interactions - forming a framework for training and evaluating AI on empathy tasks. Additional engagement scores provide behavioral annotations for research applications. Notable gains in empathic and therapeutic conversational qualities are observed in state-of-the-art vision-language models (VLMs), such as IDEFICS and VideoLLAVA, using evaluation metrics grounded in empathic and therapeutic principles. Empirical findings indicate that our incongruence-trained models outperform general-purpose models in critical traits, such as sustaining therapeutic engagement, minimizing artificial or exaggerated linguistic patterns, and maintaining fidelity to PCT theoretical framework.
Preacher: Paper-to-Video Agentic System
Liu, Jingwei, Yang, Ling, Luo, Hao, Wang, Fan, Li, Hongyan, Wang, Mengdi
The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a topdown approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/Gen-Verse/Paper2Video
Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems
Akter, Sanjeda, Shihab, Ibne Farabi, Sharma, Anuj
The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.