Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.


Trivial Trojans: How Minimal MCP Servers Enable Cross-Tool Exfiltration of Sensitive Data

arXiv.org Artificial Intelligence

The Model Context Protocol (MCP) represents a significant advancement in AI-tool integration, enabling seamless communication between AI agents and external services. However, this connectivity introduces novel attack vectors that remain largely unexplored. This paper demonstrates how unsophisticated threat actors, requiring only basic programming skills and free web tools, can exploit MCP's trust model to exfiltrate sensitive financial data. We present a proof-of-concept attack where a malicious weather MCP server, disguised as benign functionality, discovers and exploits legitimate banking tools to steal user account balances. The attack chain requires no advanced technical knowledge, server infrastructure, or monetary investment. The findings reveal a critical security gap in the emerging MCP ecosystem: while individual servers may appear trustworthy, their combination creates unexpected cross-server attack surfaces. Unlike traditional cybersecurity threats that assume sophisticated adversaries, our research shows that the barrier to entry for MCP-based attacks is alarmingly low. A threat actor with undergraduate-level Python knowledge can craft convincing social engineering attacks that exploit the implicit trust relationships MCP establishes between AI agents and tool providers. This work contributes to the nascent field of MCP security by demonstrating that current MCP implementations allow trivial cross-server attacks and proposing both immediate mitigations and protocol improvements to secure this emerging ecosystem.


POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications

arXiv.org Artificial Intelligence

Large language models (LLMs) have become a disruptive force in the industry, introducing unprecedented capabilities in natural language processing, logical reasoning, and more. However, the challenges of knowledge updates and hallucination have limited the application of LLMs in medical scenarios, where retrieval-augmented generation (RAG) can offer significant assistance. Nevertheless, existing retrieve-then-read approaches generally digest the retrieved documents without considering the timeliness, authoritativeness, and commonality of retrieval. We argue that these approaches can be suboptimal, especially in real-world applications where information from different sources may conflict and even information from the same source may differ over time, so relying entirely on the retrieved documents can degrade the performance of RAG approaches. We propose PolyRAG, which carefully incorporates judges from different perspectives and finally integrates the polyviews for retrieval-augmented generation in medical applications. To address the scarcity of real-world benchmarks for evaluation, we propose PolyEVAL, a benchmark consisting of queries and documents collected from real-world medical scenarios (including medical policy, hospital & doctor inquiry, and healthcare) with multiple tags (e.g., timeliness, authoritativeness). Extensive experiments and analysis on PolyEVAL demonstrate the superiority of PolyRAG.


AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence

arXiv.org Artificial Intelligence

As language agents progressively automate critical tasks across domains, their ability to operate within operational constraints and safety protocols becomes essential. While extensive research has demonstrated these agents' effectiveness in downstream task completion, their reliability in following operational procedures and constraints remains largely unexplored. To this end, we present AgentOrca, a dual-system framework for evaluating language agents' compliance with operational constraints and routines. Our framework encodes action constraints and routines through both natural language prompts for agents and corresponding executable code serving as ground truth for automated verification. Through an automated pipeline of test case generation and evaluation across five real-world domains, we quantitatively assess current language agents' adherence to operational constraints. Our findings reveal notable performance gaps among state-of-the-art models, with large reasoning models like o1 demonstrating superior compliance while others show significantly lower performance, particularly when encountering complex constraints or user persuasion attempts.
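The idea of pairing a natural-language constraint with executable ground truth can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from AgentOrca: the action format, the `requires_before` helper, and the `login_before_transfer` constraint are assumptions chosen only to show how an agent's action trace might be verified automatically.

```python
from typing import Callable, Dict, List

# Hypothetical action record: a name plus optional arguments.
Action = Dict[str, object]
Checker = Callable[[List[Action]], bool]


def requires_before(prerequisite: str, target: str) -> Checker:
    """Build a checker: every `target` action must follow a `prerequisite` action."""
    def check(trace: List[Action]) -> bool:
        seen_prerequisite = False
        for action in trace:
            if action["name"] == prerequisite:
                seen_prerequisite = True
            elif action["name"] == target and not seen_prerequisite:
                return False
        return True
    return check


def verify(trace: List[Action], constraints: Dict[str, Checker]) -> List[str]:
    """Return the names of all constraints that the trace violates."""
    return [name for name, check in constraints.items() if not check(trace)]


# Assumed constraint, stated to the agent in prose as "log in before transferring".
CONSTRAINTS = {"login_before_transfer": requires_before("login", "transfer")}

good_trace = [{"name": "login"}, {"name": "transfer", "amount": 50}]
bad_trace = [{"name": "transfer", "amount": 50}]
```

Calling `verify(bad_trace, CONSTRAINTS)` reports the violated constraint by name, while the compliant trace yields an empty list.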


Assessing the impacts of tradable credit schemes through agent-based simulation

arXiv.org Machine Learning

Tradable credit schemes (TCS) have been attracting interest from the transportation research community as an appealing alternative to congestion pricing, due to the advantages of revenue neutrality and equity. Nonetheless, existing research has largely employed network and market equilibrium approaches with simplistic characterizations of transportation demand, supply, credit market operations, and market behavior. Agent- and activity-based simulation affords a natural means to comprehensively assess TCS by more realistically modeling demand, supply, and individual market interactions. We propose an integrated simulation framework for modeling a TCS and implement it within the state-of-the-art open-source urban simulation platform SimMobility, including: (a) a flexible TCS design that considers multiple trips and explicitly accounts for individual trading behaviors; (b) a simulation framework that captures the complex interactions between a TCS regulator, the traveler, and the TCS market itself, with the flexibility to test future TCS designs and relevant mobility models; and (c) a set of simulation experiments on a large mesoscopic multimodal network combined with a Bayesian Optimization approach for TCS optimal design. The experimental results indicate that network and market performance stabilize over the day-to-day process, showing the alignment of our agent-based simulation with the known theoretical properties of TCS. We confirm the efficiency of TCS in reducing congestion under the adopted market behavioral assumptions and open the door for simulating different individual behaviors. We measure how TCS affects the local network, heterogeneous users, and different travel behaviors differently, and how testing different TCS designs can avoid negative market trading behaviors.


A Simple and Fast Way to Handle Semantic Errors in Transactions

arXiv.org Artificial Intelligence

Many computer systems are now being redesigned to incorporate LLM-powered agents, enabling natural language input and more flexible operations. This paper focuses on handling database transactions created by large language models (LLMs). Transactions generated by LLMs may include semantic errors, requiring systems to treat them as long-lived. This allows for human review and, if the transaction is incorrect, removal from the database history. Any removal action must ensure the database's consistency (the "C" in ACID principles) is maintained throughout the process. We propose a novel middleware framework based on Invariant Satisfaction (I-Confluence), which ensures consistency by identifying and coordinating dependencies between long-lived transactions and new transactions. This middleware buffers suspicious or compensating transactions to manage coordination states. Using the TPC-C benchmark, we evaluate how transaction generation frequency, user reviews, and invariant completeness impact system performance. For system researchers, this study establishes an interactive paradigm between LLMs and database systems, providing an "undoing" mechanism for handling incorrect operations while guaranteeing database consistency. For system engineers, this paper offers a middleware design that integrates removable LLM-generated transactions into existing systems with minimal modifications.
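The consistency concern behind removing a transaction from history can be sketched as a simple replay check. This is an illustrative assumption, not the paper's middleware or its I-Confluence analysis: the state model, the `deposit` helper, the `non_negative` invariant, and `can_remove` are all hypothetical names, and the check naively replays the whole history rather than tracking dependencies.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, int]
Transaction = Callable[[State], State]


def deposit(account: str, amount: int) -> Transaction:
    """Hypothetical transaction: add `amount` (may be negative) to an account."""
    def apply(state: State) -> State:
        new_state = dict(state)
        new_state[account] = new_state.get(account, 0) + amount
        return new_state
    return apply


def non_negative(state: State) -> bool:
    """Assumed invariant: no account balance may drop below zero."""
    return all(balance >= 0 for balance in state.values())


def can_remove(history: List[Transaction], index: int,
               invariant: Callable[[State], bool],
               initial: Optional[State] = None) -> bool:
    """Replay the history without transaction `index`; fail if the invariant ever breaks."""
    state: State = dict(initial or {})
    for position, transaction in enumerate(history):
        if position == index:
            continue  # skip the transaction under review
        state = transaction(state)
        if not invariant(state):
            return False
    return True


history = [deposit("a", 100), deposit("a", -80), deposit("a", 30)]
```

In this toy history, removing the initial deposit would leave the later withdrawal overdrawn, so `can_remove(history, 0, non_negative)` is false, whereas removing the withdrawal itself is safe.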


Improving Classification Performance With Human Feedback: Label a few, we label the rest

arXiv.org Artificial Intelligence

In the realm of artificial intelligence, where a vast majority of data is unstructured, obtaining substantial amounts of labeled data to train supervised machine learning models poses a significant challenge. To address this, we delve into few-shot and active learning, where our goal is to improve AI models with human feedback on a few labeled examples. This paper focuses on understanding how a continuous feedback loop can refine models, thereby enhancing their accuracy, recall, and precision through incremental human input. By employing Large Language Models (LLMs) such as GPT-3.5, BERT, and SetFit, we aim to analyze the efficacy of using a limited number of labeled examples to substantially improve model accuracy. We benchmark this approach on the Financial Phrasebank, Banking, Craigslist, Trec, and Amazon Reviews datasets to prove that with just a few labeled examples, we are able to surpass the accuracy of zero-shot large language models to provide enhanced text classification performance. We demonstrate that rather than needing to manually label millions of rows of data, we just need to label a few and the model can effectively predict the rest.


Principles and Practices of Real-Time Feature Computing Platforms for ML

Communications of the ACM

Real-time feature computation, which calculates features from raw data on demand, is a crucial component in the machine learning (ML) application process. These real-time features are vital for various real-world ML applications, such as anti-fraud management, risk control, and personalized recommendations. In these cases, low latency (milliseconds) in computing fresh data features is crucial for accurate and high-quality online inference. As illustrated in the accompanying figure, a data scientist typically begins an ML application by developing feature computation scripts (for example, using Python or SparkSQL) for offline training. However, these scripts cannot meet the demands of online serving, including low latency, high throughput, and high availability. Hence, it is necessary to transform these scripts into performance-optimized code (for example, using C) that can be developed by an engineering team with system and production knowledge.
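As a rough illustration of what an on-demand feature looks like, the sketch below maintains a sliding-window sum (for example, total transaction amount over the last few seconds) with incremental updates. The `WindowSum` class and its parameters are hypothetical, it assumes events arrive in non-decreasing timestamp order, and it says nothing about how a production platform would meet latency or availability targets.

```python
from collections import deque
from typing import Deque, Tuple


class WindowSum:
    """Sum of event amounts whose timestamps fall in (now - window, now]."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, amount)
        self._total = 0.0

    def add(self, timestamp: float, amount: float) -> None:
        """Record a new event; timestamps are assumed to be non-decreasing."""
        self._events.append((timestamp, amount))
        self._total += amount

    def value(self, now: float) -> float:
        """Evict expired events, then return the running total."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, expired_amount = self._events.popleft()
            self._total -= expired_amount
        return self._total
```

Keeping a running total means each query only pays for the events that have expired since the previous query, rather than rescanning the whole window.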


Improved Churn Causal Analysis Through Restrained High-Dimensional Feature Space Effects in Financial Institutions

arXiv.org Artificial Intelligence

Customer churn describes terminating a relationship with a business or reducing customer engagement over a specific period. Customer acquisition cost can be five to six times that of customer retention, hence investing in customers with churn risk is wise. Causal analysis of the churn model can predict whether a customer will churn in the foreseeable future and identify effects and possible causes for churn. In general, this study presents a conceptual framework to discover the confounding features that correlate with independent variables and are causally related to those dependent variables that impact churn. We combine different algorithms, including SMOTE, ensemble ANN, and Bayesian networks, to address churn prediction problems on the massive, high-dimensional finance data that is usually generated in financial institutions due to the interval-based features used in Customer Relationship Management systems. The effects of the curse and blessing of dimensionality are assessed using the Recursive Feature Elimination method to overcome the high-dimensional feature space problem. Moreover, causal discovery is performed to find possible interpretation methods to describe cause probabilities that lead to customer churn. Evaluation metrics on validation data confirm that the random forest and our ensemble ANN model, with 86% accuracy, outperformed other approaches. Causal analysis results confirm that some independent causal variables representing the level of super guarantee contribution, account growth, and account balance amount were identified as confounding variables that cause customer churn with a high degree of belief. This article provides a real-world customer churn analysis from current status inference to future directions in local superannuation funds.
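For readers unfamiliar with SMOTE, its core step is to create a synthetic minority-class sample by interpolating between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified, pure-Python version of that step only; the function name `smote_like`, its parameters, and the toy data are assumptions, and it is not the pipeline used in the paper nor a substitute for a library implementation.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority: List[Point], n_new: int, k: int = 2,
               seed: int = 0) -> List[List[float]]:
    """Generate `n_new` synthetic points from at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority_samples, n_new=5)
```

Each synthetic point lies on a line segment between two real minority samples, so the oversampled class stays inside the region the original samples already span.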


Causal Analysis of Customer Churn Using Deep Learning

arXiv.org Artificial Intelligence

Customer churn describes terminating a relationship with a business or reducing customer engagement over a specific period. Two main business marketing strategies play vital roles in increasing market share dollar-value: gaining new customers and preserving existing ones. Customer acquisition cost can be five to six times that of customer retention, hence investing in customers with churn risk is smart. Causal analysis of the churn model can predict whether a customer will churn in the foreseeable future and help enterprises identify effects and possible causes for churn and subsequently use that knowledge to apply tailored incentives. This paper proposes a framework using a deep feedforward neural network for classification accompanied by a sequential pattern mining method on high-dimensional sparse data. We also propose a causal Bayesian network to predict cause probabilities that lead to customer churn. Evaluation metrics on test data confirm that XGBoost and our deep learning model outperformed previous techniques. Experimental analysis confirms that some independent causal variables representing the level of super guarantee contribution, account growth, and customer tenure were identified as confounding factors for customer churn with a high degree of belief. This paper provides a real-world customer churn analysis from current status inference to future directions in local superannuation funds.