sensitive data
The Age of the All-Access AI Agent Is Here
Big AI companies courted controversy by scraping wide swaths of the public internet. With the rise of AI agents, the next data grab is far more private. For years, the cost of using "free" services from Google, Facebook, Microsoft, and other Big Tech firms has been handing over your data. Uploading your life into the cloud and using free tech brings conveniences, but it puts personal information in the hands of giant corporations that will often be looking to monetize it. Now, the next wave of generative AI systems are likely to want more access to your data than ever before. Over the past two years, generative AI tools--such as OpenAI's ChatGPT and Google's Gemini--have moved beyond the relatively straightforward, text-only chatbots that the companies initially released.
- Asia > China (0.05)
- North America > United States > California (0.05)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.05)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.76)
Towards Contextual Sensitive Data Detection
Telkamp, Liang, Hulsebos, Madelon
The emergence of open data portals necessitates more attention to protecting sensitive data before datasets get published and exchanged. While an abundance of methods for suppressing sensitive data exist, the conceptualization of sensitive data and methods to detect it, focus particularly on personal data that, if disclosed, may be harmful or violate privacy. We observe the need for refining and broadening our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Based on this definition, we introduce two mechanisms for contextual sensitive data detection that consider the broader context of a dataset at hand. First, we introduce type contextualization, which first detects the semantic type of particular data values, then considers the overall context of the data values within the dataset or document. Second, we introduce domain contextualization which determines sensitivity of a given dataset in the broader context based on the retrieval of relevant rules from documents that specify data sensitivity (e.g., data topic and geographic origin). Experiments with these mechanisms, assisted by large language models (LLMs), confirm that: 1) type-contextualization significantly reduces the number of false positives for type-based sensitive data detection and reaches a recall of 94% compared to 63% with commercial tools, and 2) domain-contextualization leveraging sensitivity rule retrieval is effective for context-grounded sensitive data detection in non-standard data domains such as humanitarian datasets. Evaluation with humanitarian data experts also reveals that context-grounded LLM explanations provide useful guidance in manual data auditing processes, improving consistency. We open-source mechanisms and annotated datasets for contextual sensitive data detection at https://github.com/trl-lab/sensitive-data-detection.
- North America > United States (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Monaco (0.04)
- (2 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Social Security Data Is Openly Being Shared With DHS to Target Immigrants
For months, the Social Security Administration was quietly sharing sensitive data about immigrants with DHS. Last week, the Social Security Administration (SSA) quietly updated a public notice to reveal that the agency would be sharing "citizenship and immigration information" with the Department of Homeland Security (DHS). This data sharing was already happening: WIRED reported in April that the Trump administration had already started pooling sensitive data from across the government for the purpose of immigration enforcement. This public notice issued by SSA makes that official, months after the fact. The notice is known as a system of record notice (SORN), a document that outlines how an agency will share the data it has, with whom, and for what purpose. This notice is required under the Privacy Act of 1974.
- Asia > Nepal (0.14)
- North America > United States > Texas (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.05)
- (9 more...)
The Role of Federated Learning in Improving Financial Security: A Survey
Kennedy, Cade Houston, Hilal, Amr, Momeni, Morteza
With the growth of digital financial systems, robust security and privacy have become a concern for financial institutions. Even though traditional machine learning models have shown to be effective in fraud detections, they often compromise user data by requiring centralized access to sensitive information. In IoT-enabled financial endpoints such as ATMs and POS Systems that regularly produce sensitive data that is sent over the network. Federated Learning (FL) offers a privacy-preserving, decentralized model training across institutions without sharing raw data. FL enables cross-silo collaboration among banks while also using cross-device learning on IoT endpoints. This survey explores the role of FL in enhancing financial security and introduces a novel classification of its applications based on regulatory and compliance exposure levels ranging from low-exposure tasks such as collaborative portfolio optimization to high-exposure tasks like real-time fraud detection. Unlike prior surveys, this work reviews FL's practical use within financial systems, discussing its regulatory compliance and recent successes in fraud prevention and blockchain-integrated frameworks. However, FL deployment in finance is not without challenges. Data heterogeneity, adversarial attacks, and regulatory compliance make implementation far from easy. This survey reviews current defense mechanisms and discusses future directions, including blockchain integration, differential privacy, secure multi-party computation, and quantum-secure frameworks. Ultimately, this work aims to be a resource for researchers exploring FL's potential to advance secure, privacy-compliant financial systems.
- Research Report > New Finding (1.00)
- Overview (1.00)
- Law Enforcement & Public Safety (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (2 more...)
You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
Dubniczky, Richard A., Borsos, Bertalan, Norbert, Tihanyi
The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
- North America > United States (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Asia > China (0.04)
- (5 more...)
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Green, Tommaso, Gubri, Martin, Puerto, Haritz, Yun, Sangdoo, Oh, Seong Joon
We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model's internal thinking, not just its outputs.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- (8 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Small language models (SLMs) become unprecedentedly appealing due to their approximately equivalent performance compared to large language models (LLMs) in certain fields with less energy and time consumption during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of the chatbot based on SLM. We first finetune a new chatbot, i.e., ChatBioGPT based on the backbone of BioGPT using medical datasets Alpaca and HealthCareMagic. It shows a matchable performance in BERTscore compared with previous studies of ChatDoctor and ChatGPT. Based on this model, we prove that the previous template-based PII attacking methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, which is a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP and the results show an increment of up to 60 more leakage compared with the previous template-based methods. We further expand the capability of GEP in the case of a more complicated and realistic situation by conducting free-style insertion where the inserted PII in the dataset is in the form of various syntactic expressions instead of fixed templates, and GEP is still able to reveal a PII leakage rate of up to 4.53%. LLM is one of the most centric research concentrations in the Artificial Intelligence (AI) field. It contributes dramatically to various domains (Zhao et al., 2023; Xu et al., 2024) and tasks (Zhao et al., 2023).
- North America > United States (0.28)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Norway > Northern Norway > Troms > Tromsø (0.04)
- Information Technology > Security & Privacy (1.00)
- Government (0.68)
- Health & Medicine > Therapeutic Area (0.68)
Can You Trust Your Copilot? A Privacy Scorecard for AI Coding Assistants
The rapid integration of AI-powered coding assistants into developer workflows has raised significant privacy and trust concerns. As developers entrust proprietary code to services like OpenAI's GPT, Google's Gemini, and GitHub Copilot, the unclear data handling practices of these tools create security and compliance risks. This paper addresses this challenge by introducing and applying a novel, expert-validated privacy scorecard. The methodology involves a detailed analysis of four document types; from legal policies to external audits; to score five leading assistants against 14 weighted criteria. A legal expert and a data protection officer refined these criteria and their weighting. The results reveal a distinct hierarchy of privacy protections, with a 20-point gap between the highest- and lowest-ranked tools. The analysis uncovers common industry weaknesses, including the pervasive use of opt-out consent for model training and a near-universal failure to filter secrets from user prompts proactively. The resulting scorecard provides actionable guidance for developers and organizations, enabling evidence-based tool selection. This work establishes a new benchmark for transparency and advocates for a shift towards more user-centric privacy standards in the AI industry.
- Europe > Germany (0.40)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Europe > Italy (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.36)
Enterprise AI Must Enforce Participant-Aware Access Control
Bhatt, Shashank Shreedhar, Rajore, Tanmay, Aggarwal, Khushboo, Ananthanarayanan, Ganesh, Chandra, Ranveer, Chandran, Nishanth, Choudhury, Suyash, Gupta, Divya, Kiciman, Emre, Pandey, Sumit Kumar, Setty, Srinath, Sharma, Rahul, Zhao, Teijia
Large language models (LLMs) are increasingly deployed in enterprise settings where they interact with multiple users and are trained or fine-tuned on sensitive internal data. While fine-tuning enhances performance by internalizing domain knowledge, it also introduces a critical security risk: leakage of confidential training data to unauthorized users. These risks are exacerbated when LLMs are combined with Retrieval-Augmented Generation (RAG) pipelines that dynamically fetch contextual documents at inference time. We demonstrate data exfiltration attacks on AI assistants where adversaries can exploit current fine-tuning and RAG architectures to leak sensitive information by leveraging the lack of access control enforcement. We show that existing defenses, including prompt sanitization, output filtering, system isolation, and training-level privacy mechanisms, are fundamentally probabilistic and fail to offer robust protection against such attacks. We take the position that only a deterministic and rigorous enforcement of fine-grained access control during both fine-tuning and RAG-based inference can reliably prevent the leakage of sensitive data to unauthorized recipients. We introduce a framework centered on the principle that any content used in training, retrieval, or generation by an LLM is explicitly authorized for \emph{all users involved in the interaction}. Our approach offers a simple yet powerful paradigm shift for building secure multi-user LLM systems that are grounded in classical access control but adapted to the unique challenges of modern AI workflows. Our solution has been deployed in Microsoft Copilot Tuning, a product offering that enables organizations to fine-tune models using their own enterprise-specific data.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Singapore (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (3 more...)
Secure, Scalable and Privacy Aware Data Strategy in Cloud
Butte, Vijay Kumar, Butte, Sujata
The enterprises today are faced with the tough challenge of processing, storing large amounts of data in a secure, scalable manner and enabling decision makers to make quick, informed data driven decisions. This paper addresses this challenge and develops an effective enterprise data strategy in the cloud. Various components of an effective data strategy are discussed and architectures addressing security, scalability and privacy aspects are provided.