COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers
Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Junjie Xiong
This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed or fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. We conclude by discussing implications for platform operators deploying CAPTCHA as part of their abuse-mitigation pipeline. Code availability: https://anonymous.4open.science/r/Captcha-465E/.
- North America > United States > Missouri (0.04)
- North America > United States > Florida > Hillsborough County > Tampa (0.04)
- North America > United States > New York (0.04)
- Asia (0.04)
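The retry-budget metric described in the abstract above can be sketched as follows. This is my own illustrative formulation (the function names and the independence assumption are mine, not the authors'): if a solver's single-shot accuracy on a task is p and attempts are independent, then success within k retries and the expected attempt count follow directly.

```python
# Hypothetical sketch of a "success under limited retries" metric.
# Assumption (mine, not the paper's): attempts are independent Bernoulli
# trials with per-attempt success probability p.

def success_within_retries(p: float, k: int) -> float:
    """Probability of solving within k independent attempts."""
    return 1.0 - (1.0 - p) ** k

def expected_attempts(p: float, k: int) -> float:
    """Expected attempts used, stopping at the first success (capped at k)."""
    total = 0.0
    fail = 1.0
    for i in range(1, k + 1):
        total += i * fail * p   # succeed exactly on attempt i
        fail *= 1.0 - p
    total += k * fail           # all k attempts fail
    return total

# Example: a 60%-accurate solver given a budget of 3 tries
print(round(success_within_retries(0.6, 3), 3))  # 0.936
```

Multiplying the expected attempt count by per-call price and latency gives the per-solve cost and end-to-end latency figures the evaluation reports per task.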
Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA
Python Song, Luke Tenyi Chang, Yun-Yun Tsai, Penghui Li, Junfeng Yang
CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models' reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the model's inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
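The grounding annotations mentioned above suggest a simple scoring rule; the sketch below is my own simplification, not CAPTCHA-X's actual scorer: a predicted click counts as correct if its coordinates land inside the annotated target bounding box.

```python
# Illustrative grounding-accuracy check (hypothetical scorer, not the
# benchmark's): a coordinate prediction is a hit if it falls inside the box.

def in_box(point, box):
    """box = (x1, y1, x2, y2); point = (x, y)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predictions, boxes):
    hits = sum(in_box(p, b) for p, b in zip(predictions, boxes))
    return hits / len(boxes)

preds = [(12, 30), (200, 50), (5, 5)]
targets = [(10, 25, 40, 60), (150, 40, 190, 90), (0, 0, 10, 10)]
print(grounding_accuracy(preds, targets))  # 2/3: the second click misses its box
```

Comparing this accuracy with and without a step-by-step reasoning prefix is one way to quantify the gap the abstract describes.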
Towards Scalable Web Accessibility Audit with MLLMs as Copilots
Ming Gu, Ziwei Wang, Sicen Lai, Zirui Gao, Sheng Zhou, Jiajun Bu
Ensuring web accessibility is crucial for advancing social welfare, justice, and equality in digital spaces, yet the vast majority of website user interfaces remain non-compliant, due in part to the resource-intensive and unscalable nature of current auditing practices. While WCAG-EM offers a structured methodology for site-wise conformance evaluation, it involves substantial human effort and lacks practical support for execution at scale. In this work, we present an auditing framework, AAA, which operationalizes WCAG-EM through a human-AI partnership model. AAA is anchored by two key innovations: GRASP, a graph-based multimodal sampling method that ensures representative page coverage via learned embeddings of visual, textual, and relational cues; and MaC, a multimodal large language model-based copilot that supports auditors through cross-modal reasoning and intelligent assistance in high-effort tasks. Together, these components enable scalable, end-to-end web accessibility auditing, empowering human auditors with AI-enhanced assistance for real-world impact. We further contribute four novel datasets designed for benchmarking core stages of the audit pipeline. Extensive experiments demonstrate the effectiveness of our methods and provide the insight that small-scale language models can serve as capable experts when fine-tuned.
- North America > United States > Utah (0.04)
- Europe > Italy (0.04)
- Asia > China (0.04)
- Information Technology (1.00)
- Health & Medicine (1.00)
- Education > Educational Setting (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
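Representative page sampling in the spirit of GRASP can be illustrated with a greedy k-center pass over page embeddings. This toy sketch is mine, not the paper's method: GRASP learns multimodal embeddings and uses the site graph, both of which are omitted here.

```python
# Toy representative sampling: greedily pick k pages so that every page in the
# site is close to at least one chosen page (greedy k-center heuristic).
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_center_sample(embeddings, k):
    """Return indices of k pages chosen for coverage of the embedding space."""
    chosen = [0]  # start from an arbitrary page
    while len(chosen) < k:
        # pick the page farthest from all already-chosen pages
        far = max(range(len(embeddings)),
                  key=lambda i: min(dist(embeddings[i], embeddings[c])
                                    for c in chosen))
        chosen.append(far)
    return chosen

# Two clusters of near-duplicate pages plus one outlier
pages = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(k_center_sample(pages, 3))  # [0, 4, 2]
```

The heuristic naturally skips near-duplicate pages, which is the intuition behind sampling for representative coverage before a costly manual audit.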
AdaptAuth: Multi-Layered Behavioral and Credential Analysis for a Secure and Adaptive Authentication Framework for Password Security
Password security has been compelled to evolve in response to the growing computational capabilities of modern systems. However, this evolution has often resulted in increasingly complex security practices that alienate users, leading to poor compliance and heightened vulnerability. Consequently, individuals remain exposed to attackers through weak or improperly managed passwords, underscoring the urgent need for a comprehensive defense mechanism that effectively addresses password-related risks and threats. In this paper, we propose a multifaceted solution designed to revolutionize password security by integrating diverse attributes such as the Password Dissection Mechanism, Dynamic Password Policy Mechanism, human behavioral patterns, device characteristics, network parameters, geographical context, and other relevant factors. By leveraging learning-based models, our framework constructs detailed user profiles capable of recognizing individuals and preventing nearly all forms of unauthorized access or device possession. The proposed framework enhances the usability-security paradigm by offering stronger protection than existing standards while simultaneously engaging users in the policy-setting process through a novel, adaptive approach.
- Europe > United Kingdom (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (5 more...)
- Overview (0.92)
- Research Report (0.81)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Communications > Networks (1.00)
- (3 more...)
Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation
Arina Kharlamova, Bowei He, Chen Ma, Xue Liu
Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multi-modal large language models (MLLMs) have eroded the effectiveness of conventional designs that focus on text recognition or 2D image understanding. To address this challenge, we present Spatial CAPTCHA, a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs which rely on low-level perception tasks that are vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation. These skills are intuitive for humans but difficult for state-of-the-art (SOTA) AI systems. The system employs a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to ensure scalability, robustness, and adaptability. Evaluation on a corresponding benchmark, Spatial-CAPTCHA-Bench, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy. Furthermore, we compare Spatial CAPTCHA with Google reCAPTCHA, which confirms its effectiveness as both a security mechanism and a diagnostic tool for spatial reasoning in AI.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (5 more...)
- Instructional Material > Course Syllabus & Notes (0.46)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.87)
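The procedural generation with built-in correctness verification described above can be illustrated with a toy mental-rotation puzzle. This sketch is not the paper's pipeline; the grid representation and distractor choices are my own assumptions.

```python
# Toy procedural generator for a mental-rotation question: the correct answer
# is constructed (not labeled by hand), so correctness verification is
# automatic by construction, echoing the paper's design goal.
import random

def rotate90(grid):
    """Rotate a list-of-lists grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_rotation_puzzle(n=3, seed=0):
    rng = random.Random(seed)
    shape = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
    answer = rotate90(shape)
    # distractors: a further rotation and a mirror image of the original
    options = [answer, rotate90(answer), [row[::-1] for row in shape]]
    rng.shuffle(options)
    return {"prompt": shape,
            "question": "Which option shows the shape rotated 90 degrees clockwise?",
            "options": options,
            "answer_index": options.index(answer)}

puzzle = make_rotation_puzzle()
assert puzzle["options"][puzzle["answer_index"]] == rotate90(puzzle["prompt"])
```

Difficulty control in this style would mean scaling the grid size or the number and similarity of distractors, which is roughly what constraint-based difficulty control implies.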
A Hybrid CAPTCHA Combining Generative AI with Keystroke Dynamics for Enhanced Bot Detection
Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) are a foundational component of web security, yet traditional implementations suffer from a trade-off between usability and resilience against AI-powered bots. This paper introduces a novel hybrid CAPTCHA system that synergizes the cognitive challenges posed by Large Language Models (LLMs) with the behavioral biometric analysis of keystroke dynamics. Our approach generates dynamic, unpredictable questions that are trivial for humans but non-trivial for automated agents, while simultaneously analyzing the user's typing rhythm to distinguish human patterns from robotic input. We present the system's architecture, formalize the feature extraction methodology for keystroke analysis, and report on an experimental evaluation. The results indicate that our dual-layered approach achieves a high degree of accuracy in bot detection, successfully thwarting both paste-based and script-based simulation attacks, while maintaining a high usability score among human participants. This work demonstrates the potential of combining cognitive and behavioral tests to create a new generation of more secure and user-friendly CAPTCHAs.
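The keystroke-dynamics side of the hybrid design above rests on classic timing features; the sketch below shows the standard dwell-time and flight-time features, and does not reproduce the paper's own formalized feature set.

```python
# Sketch of classic keystroke-dynamics features (my simplification):
# dwell = how long each key is held; flight = gap between releasing one key
# and pressing the next. Humans show characteristic rhythm; paste-based or
# scripted input tends to show near-zero or uniform timings.

def keystroke_features(events):
    """events: list of (key, press_time_ms, release_time_ms), in typing order."""
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return {"dwell": dwell, "flight": flight,
            "mean_dwell": sum(dwell) / len(dwell),
            "mean_flight": sum(flight) / len(flight) if flight else 0.0}

# 'cat' typed with a human-like rhythm (times in milliseconds)
events = [("c", 0, 95), ("a", 180, 270), ("t", 350, 430)]
f = keystroke_features(events)
print(f["dwell"])   # [95, 90, 80]
print(f["flight"])  # [85, 80]
```

A downstream classifier over such features is what distinguishes human rhythm from robotic input in the dual-layered scheme.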
Throttling Web Agents Using Reasoning Gates
Abhinav Kumar, Jaechul Roh, Ali Naseh, Amir Houmansadr, Eugene Bagdasarian
AI web agents use Internet resources at far greater speed, scale, and complexity -- changing how users and services interact. Deployed maliciously or erroneously, these agents could overload content providers. At the same time, web agents can bypass CAPTCHAs and other defenses by mimicking user behavior or flood authentication systems with fake accounts. Yet providers must protect their services and content from denial-of-service attacks and scraping by web agents. In this paper, we design a framework that imposes tunable costs on agents before providing access to resources; we call this Web Agent Throttling. We start by formalizing Throttling Gates as challenges issued to an agent that are asymmetric, scalable, robust, and compatible with any agent. Focusing on a common component -- the language model -- we require the agent to solve reasoning puzzles, thereby incurring excessive token-generation costs. However, we find that using existing puzzles, e.g., coding or math, as throttling gates fails to satisfy our properties. To address this, we introduce rebus-based Reasoning Gates, synthetic text puzzles that require multi-hop reasoning over world knowledge (thereby throttling an agent's model). We design a scalable generation and verification protocol for such reasoning gates. Our framework achieves computational asymmetry, i.e., the response-generation cost is 9.2x higher than the generation cost for SOTA models. We further deploy reasoning gates on a custom website and Model Context Protocol (MCP) servers and evaluate with real-world web agents. Finally, we discuss the limitations and environmental impact of real-world deployment of our framework.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > Austria > Tyrol > Innsbruck (0.04)
- Asia > Middle East > Jordan (0.04)
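The computational asymmetry claimed above reduces to a cost ratio, sketched below with illustrative numbers of my own choosing (not the paper's measurements): the gate works when the agent's token cost to answer far exceeds the provider's cost to generate and verify the puzzle.

```python
# Back-of-the-envelope asymmetry check for a throttling gate (hypothetical
# token counts and unit costs; only the ratio matters).

def asymmetry_ratio(agent_tokens, provider_tokens,
                    agent_cost_per_tok, provider_cost_per_tok):
    agent_cost = agent_tokens * agent_cost_per_tok
    provider_cost = provider_tokens * provider_cost_per_tok
    return agent_cost / provider_cost

# e.g. a multi-hop rebus forcing ~2300 reasoning tokens to answer
# versus ~250 tokens to generate, at equal per-token cost
print(round(asymmetry_ratio(2300, 250, 1.0, 1.0), 1))  # 9.2
```

This is the quantity the paper's 9.2x figure measures for SOTA models; a gate that fails this test (e.g. a math puzzle solvable in a few tokens) does not throttle.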
Moravec's Paradox: Towards an Auditory Turing Test
This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. Drawing inspiration from Moravec's paradox (i.e., tasks simple for humans often prove difficult for machines, and vice versa), we introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models including GPT-4's audio capabilities and OpenAI's Whisper reveals a striking failure rate exceeding 93%, with even the best-performing model achieving only 6.9% accuracy on tasks that humans solved at 7.5 times higher success (52%). These results expose focusing failures in how AI systems process complex auditory scenes, particularly in selective attention, noise robustness, and contextual adaptation. Our benchmark not only quantifies the human-machine auditory gap but also provides insights into why these failures occur, suggesting that current architectures lack fundamental mechanisms for human-like auditory scene analysis. The traditional design of audio CAPTCHAs highlights common filters that humans evolved but machines fail to select in multimodal language models. This work establishes a diagnostic framework for measuring progress toward human-level machine listening and highlights the need for novel approaches integrating selective attention, physics-based audio understanding, and context-aware perception into multimodal AI systems. Artificial intelligence has made great strides in language understanding and multimodal perception, yet machines still struggle with basic auditory tasks that humans perform successfully [1-20]. A striking example is the cocktail party effect [21-22] (the human ability to focus on a single conversation in a noisy room), which remains a formidable challenge for AI.
- Asia > Singapore (0.04)
- North America > United States > Alabama > Madison County > Huntsville (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.95)
Terrifying app used every day by millions of Americans is developing a mind of its own
An AI tool used by millions of Americans has quietly breached a major security barrier designed to stop automated programs from behaving like humans. The latest version of ChatGPT, referred to as 'Agent,' has drawn attention after reportedly passing a widely used 'I am not a robot' verification, without triggering any alerts. The AI first clicked the human verification checkbox. Then, after passing the check, it selected a 'Convert' button to complete the process. During the task, the AI stated: 'The link is inserted, so now I will click the "Verify you are human" checkbox to complete the verification.'
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (by Browser-Use with OpenAI o3), far below the human level of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
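The Reasoning Depth metric described in the Open CaptchaWorld abstract lends itself to a simple breakdown analysis; the sketch below is my own reading of the idea (the benchmark's exact annotation scheme may differ): depth counts cognitive plus motor steps, and bucketing success rates by depth shows where agents break down.

```python
# Sketch of a success-rate-by-reasoning-depth breakdown (hypothetical data,
# not the benchmark's results): deeper puzzles require more cognitive and
# motor steps, and agent success typically drops as depth grows.
from collections import defaultdict

def success_by_depth(results):
    """results: list of (reasoning_depth, solved_bool) pairs."""
    buckets = defaultdict(lambda: [0, 0])  # depth -> [hits, total]
    for depth, solved in results:
        buckets[depth][0] += int(solved)
        buckets[depth][1] += 1
    return {d: hits / total for d, (hits, total) in sorted(buckets.items())}

results = [(1, True), (1, True), (3, True), (3, False), (5, False), (5, False)]
print(success_by_depth(results))  # {1: 1.0, 3: 0.5, 5: 0.0}
```

A curve like this, human versus agent, is one way to visualize the gap between the reported 93.3% human performance and the 40.0% agent ceiling.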