Law
Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA
Turski, Michał, Chiliński, Mateusz, Borchmann, Łukasz
Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: https://github.com/Snowflake-Labs/CheckboxQA
Phase two of military AI has arrived
As I also write in my story, this push raises alarms from some AI safety experts about whether large language models are fit to analyze subtle pieces of intelligence in situations with high geopolitical stakes. It also accelerates the US toward a world where AI is not just analyzing military data but suggesting actions--for example, generating lists of targets. Proponents say this promises greater accuracy and fewer civilian deaths, but many human rights groups argue the opposite. With that in mind, here are three open questions to keep your eye on as the US military, and others around the world, bring generative AI to more parts of the so-called "kill chain." Talk to as many defense-tech companies as I have and you'll hear one phrase repeated quite often: "human in the loop."
Four arrested over obscene AI images in Japan first: reports
Police have arrested four people for selling obscene images created using generative AI in the first crackdown of its kind, local media reports said Tuesday. The four, aged in their 20s to 50s, allegedly made posters featuring indecent images of women and sold them on internet auction sites, public broadcaster NHK and other outlets said, citing police sources. Police could not immediately confirm the reports. NHK said the suspects had used free AI software to create images of naked adult women, who do not exist in the real world, using prompts including terms such as "legs open". They reportedly sold the posters for several thousand yen (several multiples of $7) each.
Confirmation Bias in Generative AI Chatbots: Mechanisms, Risks, Mitigation Strategies, and Future Research Directions
Drawing on cognitive psychology and computational linguistics, this article examines how confirmation bias--commonly understood as the tendency to seek information that aligns with existing beliefs--can be replicated and amplified by the design and functioning of large language models. The article analyzes the mechanisms by which confirmation bias may manifest in chatbot interactions, assesses the ethical and practical risks associated with such bias, and proposes a range of mitigation strategies. These include technical interventions, interface redesign, and policy measures aimed at promoting balanced AI-generated discourse. The article concludes by outlining future research directions, emphasizing the need for interdisciplinary collaboration and empirical evaluation to better understand and address confirmation bias in generative AI systems.
Keywords: confirmation bias, generative AI, chatbots, large language models, AI ethics, user interaction
1. Introduction
The emergence of generative AI chatbots has marked a significant turning point in the field of artificial intelligence (AI) (Chang et al., 2024). These systems, underpinned by large-scale language models, have demonstrated a remarkable capacity for producing coherent, contextually relevant, and often creative responses to human queries (Wang et al., 2024).
DICE: A Framework for Dimensional and Contextual Evaluation of Language Models
Shrivastava, Aryan, Aoyagui, Paula Akemi
Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
Gigant, Théo, Guinaudeau, Camille, Dufaux, Frédéric
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input against the raw video, and that a structured representation from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
Privacy Meets Explainability: Managing Confidential Data and Transparency Policies in LLM-Empowered Science
Shanmugarasa, Yashothara, Pan, Shidong, Ding, Ming, Zhao, Dehai, Rakotoarivelo, Thierry
As Large Language Models (LLMs) become integral to scientific workflows, concerns over the confidentiality and ethical handling of sensitive data have emerged. This paper explores, from scientists' perspectives, the data exposure risks of LLM-powered scientific tools, which can inadvertently leak confidential information, including intellectual property and proprietary data. We propose "DataShield", a framework designed to detect confidential data leaks, summarize privacy policies, and visualize data flow, ensuring alignment with organizational policies and procedures. Our approach aims to inform scientists about data handling practices, enabling them to make informed decisions and protect sensitive information. User studies with scientists are underway to evaluate the framework's usability, trustworthiness, and effectiveness in tackling real-world privacy challenges.
Refining Financial Consumer Complaints through Multi-Scale Model Interaction
Chen, Bo-Wei, Yen, An-Zi, Chen, Chung-Chi
Legal writing demands clarity, formality, and domain-specific precision, qualities often lacking in documents authored by individuals without legal training. To bridge this gap, this paper explores the task of legal text refinement that transforms informal, conversational inputs into persuasive legal arguments. We introduce FinDR, a Chinese dataset of financial dispute records, annotated with official judgments on claim reasonableness. Our proposed method, Multi-Scale Model Interaction (MSMI), leverages a lightweight classifier to evaluate outputs and guide iterative refinement by Large Language Models (LLMs). Experimental results demonstrate that MSMI significantly outperforms single-pass prompting strategies. Additionally, we validate the generalizability of MSMI on several short-text benchmarks, showing improved adversarial robustness. Our findings reveal the potential of multi-model collaboration for enhancing legal document generation and broader text refinement tasks.
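The abstract describes a small classifier that scores outputs and steers repeated rewriting by a larger model. The following is a minimal sketch of that general pattern only, not the authors' MSMI implementation: `refine`, `toy_score`, `toy_rewrite`, the threshold, and the round limit are all hypothetical, and the toy functions merely stand in for a trained classifier and an LLM call.

```python
from typing import Callable, Tuple

# Hypothetical stand-ins: a real system would use a trained classifier
# and an LLM call here. These toy versions only make the loop runnable.
REPLACEMENTS = {"gonna": "going to", "kinda": "somewhat"}


def toy_score(text: str) -> float:
    """Share of words that are not on the informal-word list (0.0 to 1.0)."""
    words = text.lower().split()
    hits = sum(w in REPLACEMENTS for w in words)
    return 1.0 - hits / max(len(words), 1)


def toy_rewrite(text: str, _score: float) -> str:
    """Replace the first informal word found; return text unchanged if none."""
    for informal, formal in REPLACEMENTS.items():
        if informal in text:
            return text.replace(informal, formal, 1)
    return text


def refine(
    text: str,
    rewrite: Callable[[str, float], str],
    score: Callable[[str], float],
    max_rounds: int = 3,
    threshold: float = 0.9,
) -> Tuple[str, float]:
    """Keep the best-scoring draft; stop at the threshold or round limit."""
    best, best_score = text, score(text)
    for _ in range(max_rounds):
        if best_score >= threshold:
            break
        candidate = rewrite(best, best_score)
        candidate_score = score(candidate)
        # Accept a rewrite only if the scorer rates it higher.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


refined_text, refined_score = refine(
    "bank is gonna kinda refund me", toy_rewrite, toy_score
)
print(refined_text, refined_score)
```

Keeping only rewrites that the scorer rates higher means a poor rewrite cannot make the draft worse, at the cost of trusting the scorer's judgment.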
Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts
Johnson, Emily, Holt, Xavier, Wilson, Noah
Legal multi-label classification is a critical task for organizing and accessing the vast amount of legal documentation. Despite its importance, it faces challenges such as the complexity of legal language, intricate label dependencies, and significant label imbalance. In this paper, we propose Legal-LLM, a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning. We reframe the multi-label classification task as a structured generation problem, instructing the LLM to directly output the relevant legal categories for a given document. We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores. Our experimental results demonstrate that Legal-LLM outperforms a range of strong baseline models, including traditional methods and other Transformer-based approaches. Furthermore, ablation studies and human evaluations validate the effectiveness of our approach, particularly in handling label imbalance and generating relevant and accurate legal labels.
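Because the abstract reports both micro-F1 and macro-F1 and stresses label imbalance, a small worked example may help show how the two differ. This is a sketch using the standard definitions, not the authors' evaluation code; the function names and the toy labels (`contract`, `tort`, `tax`) are illustrative assumptions.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro-F1 pools counts over all labels; macro-F1 averages per-label F1."""
    labels = sorted(set().union(*gold, *pred))
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for label in p & g:
            counts[label][0] += 1  # predicted and correct
        for label in p - g:
            counts[label][1] += 1  # predicted but wrong
        for label in g - p:
            counts[label][2] += 1  # missed
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    macro = sum(f1(*c) for c in counts.values()) / len(labels) if labels else 0.0
    return {"micro": f1(tp, fp, fn), "macro": macro}


# Illustrative toy data: three documents, each with a set of gold labels.
gold = [{"contract", "tort"}, {"contract"}, {"tax"}]
pred = [{"contract"}, {"contract", "tort"}, {"contract"}]
scores = micro_macro_f1(gold, pred)
print(scores)  # micro = 0.5, macro = 0.8 / 3 (about 0.267)
```

In this toy case the frequent label `contract` scores 0.8 while the rare labels `tort` and `tax` score 0, so macro-F1 falls well below micro-F1. That gap is why macro-F1 is the more sensitive of the two to performance on infrequent labels.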
Towards Stepwise Domain Knowledge-Driven Reasoning Optimization and Reflection Improvement
Liu, Chengyuan, Wang, Shihang, Qing, Lizhi, Song, Kaisong, Cao, Junjie, Lin, Jun, Zhang, Ji, Li, Ang, Kuang, Kun, Wu, Fei
Recently, stepwise supervision on Chains of Thought (CoTs), obtained with the help of Monte Carlo Tree Search (MCTS), has improved performance on logical reasoning tasks such as coding and math. However, its contribution to tasks requiring domain-specific expertise and knowledge remains unexplored. Motivated by this gap, we identify several potential challenges of vanilla MCTS within this context and propose the framework of Stepwise Domain Knowledge-Driven Reasoning Optimization, which employs the MCTS algorithm to develop step-level supervision for problems that require essential comprehension, reasoning, and specialized knowledge. We also introduce Preference Optimization towards Reflection Paths, which iteratively learns self-reflection on the reasoning thoughts from better perspectives. We have conducted extensive experiments to evaluate the advantages of these methods. Empirical results demonstrate their effectiveness on various legal-domain problems. We also report a diverse set of findings, hoping to encourage further research on domain-specific LLMs and MCTS.