 insurance


Federated Learning for the Design of Parametric Insurance Indices under Heterogeneous Renewable Production Losses

Niakh, Fallou

arXiv.org Machine Learning

We propose a federated learning framework for the calibration of parametric insurance indices under heterogeneous renewable energy production losses. Producers locally model their losses using Tweedie generalized linear models and private data, while a common index is learned through federated optimization without sharing raw observations. The approach accommodates heterogeneity in variance and link functions and directly minimizes a global deviance objective in a distributed setting. We implement and compare FedAvg, FedProx and FedOpt, and benchmark them against an existing approximation-based aggregation method. An empirical application to solar power production in Germany shows that federated learning recovers comparable index coefficients under moderate heterogeneity, while providing a more general and scalable framework.
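The federated calibration described in this abstract can be illustrated with a minimal FedAvg sketch. Everything below is an assumption made for illustration rather than the paper's implementation: a log link for every producer, a producer-specific Tweedie variance power `p` between 1 and 2, sample-size weights in the server average, and synthetic data with multiplicative gamma noise. The helper names (`local_update`, `fedavg`, `make_clients`) are hypothetical.

```python
import numpy as np


def tweedie_deviance(y, mu, p):
    """Mean Tweedie unit deviance for a variance power 1 < p < 2."""
    a = np.power(y, 2 - p) / ((1 - p) * (2 - p))
    b = y * np.power(mu, 1 - p) / (1 - p)
    c = np.power(mu, 2 - p) / (2 - p)
    return float(np.mean(2 * (a - b + c)))


def local_update(beta, X, y, p, lr=0.05, steps=5):
    """A few gradient steps on one producer's deviance (log link assumed)."""
    beta = beta.copy()
    for _ in range(steps):
        mu = np.exp(X @ beta)
        # Gradient of the mean deviance w.r.t. beta: 2 * mu^(1-p) * (mu - y) * x
        grad = X.T @ (2 * np.power(mu, 1 - p) * (mu - y)) / len(y)
        beta -= lr * grad
    return beta


def fedavg(clients, dim, rounds=200):
    """Server loop: broadcast beta, run local updates, average by sample size."""
    beta = np.zeros(dim)
    n_total = sum(len(y) for _, y, _ in clients)
    for _ in range(rounds):
        new_beta = np.zeros(dim)
        for X, y, p in clients:
            new_beta += (len(y) / n_total) * local_update(beta, X, y, p)
        beta = new_beta
    return beta


def global_deviance(beta, clients):
    """Sample-size-weighted sum of the producers' mean deviances."""
    n_total = sum(len(y) for _, y, _ in clients)
    return sum(
        (len(y) / n_total) * tweedie_deviance(y, np.exp(X @ beta), p)
        for X, y, p in clients
    )


def make_clients(beta_true, powers=(1.2, 1.5, 1.8), n=500, seed=0):
    """Synthetic producers (assumption): shared coefficients, gamma noise."""
    rng = np.random.default_rng(seed)
    clients = []
    for p in powers:
        X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
        y = np.exp(X @ beta_true) * rng.gamma(shape=5.0, scale=0.2, size=n)
        clients.append((X, y, p))
    return clients


beta_true = np.array([0.5, -0.3])
clients = make_clients(beta_true)
beta_hat = fedavg(clients, dim=2)
print("estimated index coefficients:", beta_hat)
```

Only coefficient vectors travel between the producers and the server in this sketch; the arrays `X` and `y` stay inside `local_update`, which mirrors the abstract's point that raw observations are not shared.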


I Have a Job Offer I Can't Refuse. The Company It Comes From Has a Terrible Reputation for Women.

Slate

My company unexpectedly outsourced my entire department to a firm that uses AI for our jobs, even though I don't work a job that can really be done by machine learning. I have some savings but can't go without health insurance: my daughter and I both have the same complex chronic condition. I was briefly on public insurance in the past and it was a nightmare of waitlists leading to a cascade of hospital stays.


Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark

Zhou, Hua, Ma, Bing, Zhang, Yufei, Zhao, Yi

arXiv.org Artificial Intelligence

This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of "quantitative-oriented, expert-driven, and multi-validation," the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High-quality domain-specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but also its construction concept and methodology offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of "domain adaptation + reasoning enhancement" for insurance large models.


ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Gomaa, Amr, Salem, Ahmed, Abdelnabi, Sahar

arXiv.org Artificial Intelligence

As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities; privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.


Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

Soroka, Emi, Chopra, Tanmay, Desai, Krish, Lall, Sanjay

arXiv.org Artificial Intelligence

Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.


InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance

Geng, Ziheng, Liu, Jiachen, Cao, Ran, Cheng, Lu, Frangopol, Dan M., Cheng, Minghui

arXiv.org Artificial Intelligence

Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.


Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments

Purushothama, Abhishek, Min, Junghyun, Waldon, Brandon, Schneider, Nathan

arXiv.org Artificial Intelligence

Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.


GPO: Learning from Critical Steps to Improve LLM Reasoning

Yu, Jiahao, Cheng, Zelei, Wu, Xian, Xing, Xinyu

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the reasoning or thinking capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs remains a significant challenge. While existing optimization methods have advanced LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce Guided Pivotal Optimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the 'critical step' within a reasoning trajectory: a point at which the model must proceed carefully to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples new rollouts, and prioritizes learning on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process and so improve reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
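The idea of locating a critical step through advantage estimates can be illustrated on a toy problem. This is a sketch under stated assumptions, not the GPO algorithm itself: the binary-action task, the reward rule, the Monte Carlo value estimator, and the choice of the step with the largest absolute advantage are all hypothetical stand-ins, since the abstract does not specify them.

```python
import random


def rollout_reward(prefix, horizon, pivotal_index, rng):
    """Complete a partial trajectory with random actions and score it.

    Toy reward (assumption): 1 if the action at `pivotal_index` is 1, else 0.
    """
    actions = list(prefix)
    while len(actions) < horizon:
        actions.append(rng.choice([0, 1]))
    return 1.0 if actions[pivotal_index] == 1 else 0.0


def estimate_value(prefix, horizon, pivotal_index, rng, n_samples=200):
    """Monte Carlo estimate of the expected reward after a given prefix."""
    total = sum(
        rollout_reward(prefix, horizon, pivotal_index, rng)
        for _ in range(n_samples)
    )
    return total / n_samples


def step_advantages(trajectory, pivotal_index, rng):
    """Advantage of step t = value after taking step t minus value before it."""
    horizon = len(trajectory)
    values = [
        estimate_value(trajectory[:t], horizon, pivotal_index, rng)
        for t in range(horizon + 1)
    ]
    return [values[t + 1] - values[t] for t in range(horizon)]


def find_critical_step(advantages):
    """Pick the step whose estimated advantage has the largest magnitude."""
    return max(range(len(advantages)), key=lambda t: abs(advantages[t]))


rng = random.Random(0)
trajectory = [0, 1, 1, 0, 1]  # one sampled reasoning trajectory (toy actions)
advantages = step_advantages(trajectory, pivotal_index=2, rng=rng)
critical = find_critical_step(advantages)
print("estimated advantages:", advantages)
print("critical step index:", critical)
```

In this toy setting, the reset-and-resample stage would continue from `trajectory[:critical]` and draw fresh completions from that prefix; how those rollouts are weighted during training depends on the optimization method being used.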


WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation

Wang, Kuang-Da, Wang, Zhao, Shimose, Yotaro, Wang, Wei-Yao, Takamatsu, Shingo

arXiv.org Artificial Intelligence

Building on recent advances in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, with explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.


My Mom Cured Her Post-Divorce Loneliness by Becoming a Scammer. I Have to Get Her to Stop.

Slate

My mother and father divorced two years ago after a long marriage. She is 64, and the divorce hit her hard. She was very upset because, among other things, my father started dating soon after the divorce and has been steadily going out with a woman for the past six months. Meanwhile, my mother had a hard time dating. She complained about it bitterly, saying it was not fair my father got to restart his life so easily while no one would go out with her.