ip address


Windscribe review: Despite the annoyances, it has the right idea

Engadget

The first step is always to figure out how easy or hard the VPN is to use. Windscribe and other VPNs are important tools, but you'll never use them if the UI gets in the way. I tested Windscribe's desktop apps on Windows and Mac, its mobile apps on iOS and Android and its Chrome and Firefox browser extensions. To start with, let me say that installing Windscribe is a breeze no matter where you do it. The downloaders and installers handle their own business, only requiring you to grant a few permissions. The apps arrive on your system ready to use out of the box.


Understanding Privacy Risks in Code Models Through Training Dynamics: A Causal Approach

Yang, Hua, Velasco, Alejandro, Fang, Sen, Xu, Bowen, Poshyvanyk, Denys

arXiv.org Artificial Intelligence

Large language models for code (LLM4Code) have greatly improved developer productivity but also raise privacy concerns due to their reliance on open-source repositories containing abundant personally identifiable information (PII). Prior work shows that commercial models can reproduce sensitive PII, yet existing studies largely treat PII as a single category and overlook the heterogeneous risks among different types. We investigate whether distinct PII types vary in their likelihood of being learned and leaked by LLM4Code, and whether this relationship is causal. Our methodology includes building a dataset with diverse PII types, fine-tuning representative models of different scales, computing training dynamics on real PII data, and formulating a structural causal model to estimate the causal effect of learnability on leakage. Results show that leakage risks differ substantially across PII types and correlate with their training dynamics: easy-to-learn instances such as IP addresses exhibit higher leakage, while harder types such as keys and passwords leak less frequently. Ambiguous types show mixed behaviors. This work provides the first causal evidence that leakage risks are type-dependent and offers guidance for developing type-aware and learnability-aware defenses for LLM4Code.
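The training dynamics the abstract refers to can be made concrete with a dataset-cartography-style computation: track the probability the model assigns to each instance's gold tokens across epochs, then summarize with mean confidence and variability. The numbers and PII labels below are illustrative assumptions, not the paper's data; this is a minimal sketch of the statistic, not the authors' pipeline.

```python
from statistics import mean, pstdev

# Hypothetical per-epoch probabilities a model assigns to each training
# instance's gold PII tokens (one row per instance, one column per epoch).
probs = {
    "ip_address": [0.55, 0.70, 0.85, 0.92],  # easy to learn
    "password":   [0.10, 0.15, 0.12, 0.18],  # hard to learn
    "username":   [0.30, 0.60, 0.25, 0.55],  # ambiguous
}

# Training dynamics: confidence = mean probability across epochs,
# variability = standard deviation across epochs.
dynamics = {pii: (mean(p), pstdev(p)) for pii, p in probs.items()}

for pii, (conf, var) in dynamics.items():
    print(f"{pii:>10}: confidence={conf:.3f} variability={var:.3f}")
```

Under the paper's finding, high-confidence types like the IP-address row would be flagged as higher leakage risks than the low-confidence password row.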


Rise of the 'porno-trolls': how one porn platform made millions suing its viewers

The Guardian

Instead, it was a subpoena. He had been sued in federal court for illegally downloading 80 movies. Some of the titles sounded cryptic - Do Not Worry, We Are Only Friends - or banal, like International Relations Part 2. Others were less subtle: He Loved My Big Ass, He Loved My Big Butt, and My Big Booty Loves Anal. Brown, who had spent decades investigating sex crimes, claimed he had never watched any of them. His years "dealing with pimping", he wrote in a court filing, left him "with no interest in pornography". He had been married for 40 years; he did not need to download Hot Wife, another title on the list.


Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR

Grossman, Riley, Smith, Michael, Borcea, Cristian, Chen, Yi

arXiv.org Artificial Intelligence

The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.
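Once a salient-object-detection model has assigned each banner element an attention score, the manipulation check itself is a simple comparison. The sketch below assumes such per-element saliency scores are already available (the numbers and the 1.5 ratio threshold are hypothetical, not the paper's) and shows only the flagging logic.

```python
# Given per-element saliency scores from a vision model, flag banners
# whose consent ("accept") button is markedly more attention-drawing
# than the refusal option.
def is_aesthetically_manipulated(saliency, ratio=1.5):
    """Return True when the accept button's saliency exceeds the
    reject button's by at least the given ratio."""
    return saliency["accept"] >= ratio * saliency["reject"]

banner = {"accept": 0.82, "reject": 0.31, "settings": 0.12}
print(is_aesthetically_manipulated(banner))
```

A threshold on relative salience is what lets this approach catch manipulation beyond color contrast, such as button placement, since the saliency model responds to position as well as appearance.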


Meta Claims Downloaded Porn at Center of AI Lawsuit Was for 'Personal Use'

WIRED

In a motion to dismiss filed earlier this week, Meta denied claims that employees had downloaded pornography from Strike 3 Holdings to train its artificial intelligence models. This week, Meta asked a US district court to toss a lawsuit alleging that the tech giant illegally torrented pornography to train AI. The move comes after Strike 3 Holdings discovered illegal downloads of some of its adult films on Meta corporate IP addresses, as well as other downloads that Meta allegedly concealed using a "stealth network" of 2,500 "hidden IP addresses." Accusing Meta of stealing porn to secretly train an unannounced adult version of its AI model powering Movie Gen, Strike 3 sought damages that could have exceeded $350 million, TorrentFreak reported. Meta also claimed that Strike 3 cited "no facts to suggest that Meta has ever trained an AI model on adult images or video, much less intentionally so."


RobustFlow: Towards Robust Agentic Workflow Generation

Xu, Shengxiang, Zhang, Jiayi, Di, Shimin, Luo, Yuyu, Yao, Liang, Liu, Hanmo, Zhu, Jia, Liu, Fan, Zhang, Min-Ling

arXiv.org Artificial Intelligence

The automated generation of agentic workflows is a promising frontier for enabling large language models (LLMs) to solve complex tasks. However, our investigation reveals that the robustness of agentic workflow generation remains a critical, unaddressed challenge. Current methods often generate wildly inconsistent workflows when provided with instructions that are semantically identical but differently phrased. This brittleness severely undermines their reliability and trustworthiness for real-world applications. To quantitatively diagnose this instability, we propose metrics based on nodal and topological similarity to evaluate workflow consistency against common semantic variations such as paraphrasing and noise injection. We further propose a novel training framework, RobustFlow, that leverages preference optimization to teach models invariance to instruction variations. By training on sets of synonymous task descriptions, RobustFlow boosts workflow robustness scores to 70% to 90%, a substantial improvement over existing approaches. The code is publicly available at https://github.com/DEFENSE-SEU/RobustFlow.
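One natural way to realize the nodal and topological similarity metrics the abstract names is Jaccard similarity over a workflow's node set and edge set respectively. The sketch below is an assumption about the metric's shape, not RobustFlow's exact formulation, and the two toy workflows stand in for generations from paraphrased instructions.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def workflow_similarity(wf1, wf2):
    """Nodal similarity compares the agent/tool nodes;
    topological similarity compares the directed edges."""
    nodal = jaccard(wf1["nodes"], wf2["nodes"])
    topological = jaccard(wf1["edges"], wf2["edges"])
    return nodal, topological

# Two workflows generated from paraphrased versions of the same task.
wf_a = {"nodes": {"plan", "search", "summarize"},
        "edges": {("plan", "search"), ("search", "summarize")}}
wf_b = {"nodes": {"plan", "search", "write"},
        "edges": {("plan", "search"), ("search", "write")}}
print(workflow_similarity(wf_a, wf_b))
```

A robust generator would score near 1.0 on both measures across paraphrases; the brittle behavior the paper describes shows up as scores well below that.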


You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models

Dubniczky, Richard A., Borsos, Bertalan, Tihanyi, Norbert

arXiv.org Artificial Intelligence

The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
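The pattern-matching stage can be illustrated with a toy scanner that strips LaTeX comments from a source file and checks them against credential-shaped regexes. The patterns and helper names here are illustrative, not the ones LaTeXpOsEd uses, and the embedded token is synthetic.

```python
import re

# Illustrative credential patterns (not LaTeXpOsEd's actual rule set).
SECRET_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z\-_]{35}"),
    "github_token": re.compile(r"ghp_[0-9A-Za-z]{36}"),
}

def latex_comments(source):
    """Yield comment text per line, skipping escaped percent signs (\\%)."""
    for line in source.splitlines():
        m = re.search(r"(?<!\\)%(.*)", line)
        if m:
            yield m.group(1).strip()

def scan(source):
    hits = []
    for comment in latex_comments(source):
        for name, pat in SECRET_PATTERNS.items():
            if pat.search(comment):
                hits.append((name, comment))
    return hits

# A synthetic token embedded in a comment; the escaped \% is not a comment.
tex = r"""
\section{Results}  % TODO: remove key ghp_""" + "a" * 36 + r""" before submission
Accuracy is 99\% on the test set.
"""
print(scan(tex))
```

Even this crude version finds secrets that never appear in the compiled PDF, which is the core risk the paper highlights: the comment is invisible to readers but sits in the public source archive.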


A Graph-Based Approach to Alert Contextualisation in Security Operations Centres

Eckhoff, Magnus Wiik, Flydal, Peter Marius, Peters, Siem, Eian, Martin, Halvorsen, Jonas, Mavroeidis, Vasileios, Grov, Gudmund

arXiv.org Artificial Intelligence

Interpreting the massive volume of security alerts is a significant challenge in Security Operations Centres (SOCs). Effective contextualisation is important, enabling quick distinction between genuine threats and benign activity to prioritise what needs further analysis. This paper proposes a graph-based approach to enhance alert contextualisation in a SOC by aggregating alerts into graph-based alert groups, where nodes represent alerts and edges denote relationships within defined time-windows. By grouping related alerts, we enable analysis at a higher abstraction level, capturing attack steps more effectively than individual alerts. Furthermore, to show that our format is well suited for downstream machine learning methods, we employ Graph Matching Networks (GMNs) to correlate incoming alert groups with historical incidents, providing analysts with additional insights.
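The grouping step the paper describes can be sketched as building edges between alerts that share an attribute within a time window, then taking connected components as the alert groups. The relatedness rule (shared source IP) and the 300-second window below are illustrative assumptions, not the paper's actual edge criteria.

```python
from itertools import combinations

WINDOW = 300  # seconds; illustrative time window

alerts = [
    {"id": 1, "src": "10.0.0.5", "t": 0},
    {"id": 2, "src": "10.0.0.5", "t": 120},
    {"id": 3, "src": "10.0.0.9", "t": 130},
    {"id": 4, "src": "10.0.0.5", "t": 900},  # outside the window of 1 and 2
]

def related(a, b):
    """Edge rule: same source IP within the time window."""
    return a["src"] == b["src"] and abs(a["t"] - b["t"]) <= WINDOW

# Union-find over alert ids; connected components become alert groups.
parent = {a["id"]: a["id"] for a in alerts}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for a, b in combinations(alerts, 2):
    if related(a, b):
        parent[find(a["id"])] = find(b["id"])

groups = {}
for a in alerts:
    groups.setdefault(find(a["id"]), []).append(a["id"])
print(sorted(groups.values()))
```

Each resulting group is what the analyst (or a downstream Graph Matching Network) sees as one unit, capturing a multi-alert attack step rather than isolated events.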


PHASE: Passive Human Activity Simulation Evaluation

Lamp, Steven, Hiser, Jason D., Nguyen-Tuong, Anh, Davidson, Jack W.

arXiv.org Artificial Intelligence

Cybersecurity simulation environments, such as cyber ranges, honeypots, and sandboxes, require realistic human behavior to be effective, yet no quantitative method exists to assess the behavioral fidelity of synthetic user personas. This paper presents PHASE (Passive Human Activity Simulation Evaluation), a machine learning framework that analyzes Zeek connection logs and distinguishes human from non-human activity with over 90% accuracy. PHASE operates entirely passively, relying on standard network monitoring without any user-side instrumentation or visible signs of surveillance. All network activity used for machine learning is collected via a Zeek network appliance to avoid introducing unnecessary network traffic or artifacts that could disrupt the fidelity of the simulation environment. The paper also proposes a novel labeling approach that utilizes local DNS records to classify network traffic, thereby enabling machine learning analysis. Furthermore, we apply SHAP (SHapley Additive exPlanations) analysis to uncover temporal and behavioral signatures indicative of genuine human users. In a case study, we evaluate a synthetic user persona and identify distinct non-human patterns that undermine behavioral realism. Based on these insights, we develop a revised behavioral configuration that significantly improves the human-likeness of synthetic activity, yielding a more realistic and effective synthetic user persona.
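One temporal signature of the kind such a classifier can exploit: scripted personas often open connections at metronomic intervals, while human browsing is bursty. The sketch below computes inter-connection gap statistics from connection timestamps; the feature names, coefficient-of-variation threshold, and toy timestamps are illustrative assumptions, not PHASE's actual feature set.

```python
from statistics import mean, pstdev

def gap_features(timestamps):
    """Summarize the gaps between consecutive connection start times."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {"mean_gap": mean(gaps), "gap_stdev": pstdev(gaps)}

def looks_scripted(timestamps, cv_threshold=0.1):
    f = gap_features(timestamps)
    # A coefficient of variation near zero means metronomic, bot-like timing.
    return f["gap_stdev"] / f["mean_gap"] < cv_threshold

beacon = [0, 60, 120, 180, 240]   # heartbeat every 60 s
human = [0, 3, 40, 41, 200, 230]  # bursty browsing session
print(looks_scripted(beacon), looks_scripted(human))
```

Because the timestamps come straight from connection logs, this kind of feature needs no agent on the simulated host, which is what makes the evaluation passive.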


SDLog: A Deep Learning Framework for Detecting Sensitive Information in Software Logs

Aghili, Roozbeh, Wu, Xingfang, Khomh, Foutse, Li, Heng

arXiv.org Artificial Intelligence

Software logs are messages recorded during the execution of a software system that provide crucial run-time information about events and activities. Although software logs play a critical role in software maintenance and operation tasks, publicly accessible log datasets remain limited, hindering advances in log analysis research and practice. The presence of sensitive information, particularly Personally Identifiable Information (PII) and quasi-identifiers, introduces serious privacy and re-identification risks, discouraging the publishing and sharing of real-world logs. In practice, log anonymization techniques primarily rely on regular expression patterns, which involve manually crafting rules to identify and replace sensitive information. However, these regex-based approaches suffer from significant limitations, such as extensive manual effort and poor generalizability across diverse log formats and datasets. To mitigate these limitations, we introduce SDLog, a deep learning-based framework designed to identify sensitive information in software logs. Our results show that SDLog overcomes regex limitations and outperforms the best-performing regex patterns in identifying sensitive information. With only 100 fine-tuning samples from the target dataset, SDLog can correctly identify 99.5% of sensitive attributes and achieves an F1-score of 98.4%. To the best of our knowledge, this is the first deep learning alternative to regex-based methods in software log anonymization.
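The regex baseline that SDLog is compared against looks roughly like the sketch below: one hand-crafted pattern per sensitive attribute, with anything uncovered by a pattern leaking through. The patterns and the log line are illustrative, not taken from the paper.

```python
import re

# Hand-crafted patterns, one per sensitive attribute (illustrative).
PATTERNS = {
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def anonymize(line):
    """Replace every pattern match with a placeholder label."""
    for label, pat in PATTERNS.items():
        line = pat.sub(f"<{label}>", line)
    return line

log = "user alice@example.com logged in from 192.168.1.7"
print(anonymize(log))
```

Note the limitation the paper targets: a bare username or other quasi-identifier appearing outside any pattern would pass through untouched, and every new log format requires another round of manual rule-writing.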