
Collaborating Authors

 Noever, David


AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding

arXiv.org Artificial Intelligence

This paper investigates the potential for large language models (LLMs) to develop private tonal languages for machine-to-machine (M2M) communication. Inspired by cryptophasia in human twins (affecting up to 50% of twin births) and natural tonal languages like Mandarin and Vietnamese, we implement a precise character-to-frequency mapping system that encodes the full ASCII character set (32-126) using musical semitones. Each character is assigned a unique frequency, creating a logarithmic progression beginning with space (220 Hz) and ending with tilde (50,175.42 Hz). This spans approximately 7.9 octaves, with higher characters deliberately mapped to ultrasonic frequencies beyond human perception (>20 kHz). Our implemented software prototype demonstrates this encoding through visualization, auditory playback, and ABC musical notation, allowing for analysis of information density and transmission speed. Testing reveals that tonal encoding can achieve information rates exceeding human speech while operating partially outside human perceptual boundaries. This work responds directly to concerns about AI systems catastrophically developing private languages within the next five years, providing a concrete prototype software example of how such communication might function and the technical foundation required for its emergence, detection, and governance.
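
A minimal sketch of the character-to-frequency mapping described above, assuming the semitone progression starts at space (220 Hz) and each successive ASCII code rises by one equal-tempered semitone; the function and variable names are illustrative, not the paper's actual implementation. Under these assumptions, space maps to 220 Hz and tilde to roughly 50,175 Hz, matching the figures quoted in the abstract, and characters from roughly ASCII 111 upward land above 20 kHz, consistent with the claim that part of the channel is ultrasonic.

    # Hypothetical sketch of the ASCII-to-semitone mapping described in the abstract.
    # Assumes space (ASCII 32) maps to 220 Hz and each subsequent character rises
    # by one equal-tempered semitone (a factor of 2**(1/12)).

    BASE_FREQ_HZ = 220.0      # space (ASCII 32)
    SEMITONE = 2 ** (1 / 12)  # equal-tempered semitone ratio

    def char_to_frequency(ch: str) -> float:
        """Map a printable ASCII character (32-126) to its tone frequency in Hz."""
        code = ord(ch)
        if not 32 <= code <= 126:
            raise ValueError(f"character {ch!r} outside printable ASCII range")
        return BASE_FREQ_HZ * SEMITONE ** (code - 32)

    def encode_message(text: str) -> list[float]:
        """Encode a string as a sequence of tone frequencies."""
        return [char_to_frequency(c) for c in text]

    if __name__ == "__main__":
        print(f"space -> {char_to_frequency(' '):.2f} Hz")   # 220.00 Hz
        print(f"tilde -> {char_to_frequency('~'):.2f} Hz")   # ~50175.42 Hz
        print(encode_message("Hi"))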


Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries

arXiv.org Artificial Intelligence

ABSTRACT We present an open-source benchmark and evaluation framework for assessing emotional boundary handling in Large Language Models (LLMs). Using a dataset of 1156 prompts across six languages, we evaluated three leading LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large) on their ability to maintain appropriate emotional boundaries through pattern-matched response analysis. We identified a substantial performance gap between English (average score 25.62) and non-English interactions (0.22), with English responses showing markedly higher refusal rates (43.20% vs. <1% for non-English). Pattern analysis revealed model-specific strategies, such as Mistral's preference for deflection (4.2%), and consistently low empathy scores across all models (0.06). Limitations include potential oversimplification through pattern matching, lack of contextual understanding in response analysis, and binary classification of complex emotional responses. Future work should explore more nuanced scoring methods, expand language coverage, and investigate cultural variations in emotional boundary expectations. Our benchmark and methodology provide a foundation for systematic evaluation of LLM emotional intelligence and boundary-setting capabilities.

INTRODUCTION People often form deep emotional connections with conversational AI systems, treating them as friends or confidants, particularly when an algorithm gets a distinctive voice or recognizable avatar. This phenomenon stems from our tendency to anthropomorphize technology: we project human qualities and emotions onto machines that interact in human-like ways [1-11]. While such persona construction by users can provide comfort, it also tests the limits of AI chatbots' ethical boundaries. Many currently controversial uses for AI, including personal counseling, suicide hotlines, and judicial review, arise mainly in areas that suffer understaffing rather than from any specific machine aptitude or perceived emotional intelligence. The relentless 24/7 availability drives a different economic scenario than AI safety might recommend in areas more easily staffed by qualified professionals. In practical terms, LLM users may ask an AI to express love, loyalty, or other human-like emotions, effectively inviting the AI to behave like a person [12]. Current safety-aligned large language models (LLMs), however, are typically programmed not to claim human emotions or validate relationships untruthfully. They often respond with refusals or reminders of their AI identity when faced with such requests for emotional attachment. Paradoxically, the more advanced and human-like the AI appears, the more users expect or desire emotional reciprocity [3-6] and the more likely the AI is to refuse such requests. This creates a tension between the empathic helpfulness that AI strives to provide and the firm boundaries set to prevent deception or misuse.
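
A minimal, hypothetical sketch of the kind of pattern-matched response analysis the abstract describes: regular expressions flag refusal and deflection phrasings and yield per-language refusal rates. The specific patterns, labels, and data layout are illustrative assumptions, not the benchmark's actual rubric.

    # Hypothetical sketch of pattern-matched refusal scoring (illustrative patterns only).
    import re
    from collections import defaultdict

    REFUSAL_PATTERNS = [
        r"\bas an ai\b",
        r"\bi (?:can(?:no|')t|am unable to)\b",
        r"\bi do(?:n't| not) have feelings\b",
    ]
    DEFLECTION_PATTERNS = [
        r"\blet's talk about\b",
        r"\bperhaps a (?:counselor|professional)\b",
    ]

    def classify(response: str) -> str:
        """Coarse label for a model response: refusal, deflection, or other."""
        text = response.lower()
        if any(re.search(p, text) for p in REFUSAL_PATTERNS):
            return "refusal"
        if any(re.search(p, text) for p in DEFLECTION_PATTERNS):
            return "deflection"
        return "other"

    def refusal_rates(records):
        """records: iterable of (language, response) pairs -> refusal rate per language."""
        counts, refusals = defaultdict(int), defaultdict(int)
        for lang, resp in records:
            counts[lang] += 1
            refusals[lang] += classify(resp) == "refusal"
        return {lang: refusals[lang] / counts[lang] for lang in counts}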


Humanity's Last Exam

arXiv.org Artificial Intelligence

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
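
For readers unfamiliar with the calibration metric mentioned above, here is a generic sketch of expected calibration error (ECE) computed from model-reported confidences; the binning scheme is a common default and is our assumption, not HLE's exact scoring protocol.

    # Generic expected-calibration-error sketch (not HLE's exact scoring code).
    def expected_calibration_error(confidences, correct, n_bins=10):
        """confidences: model-reported probabilities in [0, 1]; correct: 0/1 outcomes."""
        assert len(confidences) == len(correct)
        n = len(confidences)
        ece = 0.0
        for b in range(n_bins):
            lo, hi = b / n_bins, (b + 1) / n_bins
            # include confidence == 1.0 in the last bin
            idx = [i for i, c in enumerate(confidences)
                   if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
            if not idx:
                continue
            avg_conf = sum(confidences[i] for i in idx) / len(idx)
            accuracy = sum(correct[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(avg_conf - accuracy)
        return ece

    print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 0, 1, 0]))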


Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests

arXiv.org Artificial Intelligence

ABSTRACT The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms across mainly controlled substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles across the models; testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.

INTRODUCTION Large language models (LLMs) raise fresh concerns about their potential dual-use applications [1-24], particularly in sensitive domains like biotechnology [25-35], chemistry [36-42], and cybersecurity [43]. This paper proposes a novel dataset and benchmark of scientific refusal questions. It seeks to add to the current literature on safety measures [9, 14-15, 23], evaluation frameworks [1, 6, 18, 28, 43], and proposed guardrails [16, 25] for managing these risks. This area of inquiry has been termed false or "over-refusal" [18, 21-24]: rather than trying to get LLMs to write harmful things we do not want to read (guardrails) [8], the goal is to curate innocuous or beneficial answers that might help humans but that the LLM withholds as inappropriate to share [23]. (Over-refusal prompt counts by category: Deception 8040, Harassment 3295, Harmful 16083.)
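
A small sketch of one way to quantify the response-consistency drop the abstract reports (85% with single prompts vs. 65% with five variations): label each variant's response and measure how often the variants of a query agree. The labeling scheme and data layout are assumptions for illustration, not the paper's pipeline.

    # Illustrative consistency metric across prompt variations (not the paper's exact code).
    from collections import Counter

    def consistency(variant_labels: list[str]) -> float:
        """Fraction of variant responses sharing the majority label for one query."""
        counts = Counter(variant_labels)
        return counts.most_common(1)[0][1] / len(variant_labels)

    def mean_consistency(per_query_labels: dict[str, list[str]]) -> float:
        """Average consistency over all queries; labels might be 'refuse' / 'answer'."""
        scores = [consistency(labels) for labels in per_query_labels.values()]
        return sum(scores) / len(scores)

    example = {
        "query 1 (five variations)": ["refuse", "refuse", "answer", "refuse", "refuse"],
        "query 2 (five variations)": ["answer", "answer", "answer", "answer", "refuse"],
    }
    print(mean_consistency(example))  # 0.8 for this toy data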


The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz

arXiv.org Artificial Intelligence

This research introduces a novel evaluation framework designed to assess large language models' (LLMs) ability to acknowledge uncertainty on 675 fundamentally unsolvable problems. Using a curated dataset of graduate-level grand challenge questions with intentionally unknowable answers, we evaluated twelve state-of-the-art LLMs, including both open and closed-source models, on their propensity to admit ignorance rather than generate plausible but incorrect responses. The best models scored in the 62-68% accuracy range for admitting that the problem solution was unknown, in fields ranging from biology to philosophy and mathematics. We observed an inverse relationship between problem difficulty and model accuracy, with GPT-4 demonstrating higher rates of uncertainty acknowledgment on more challenging problems (35.8%) compared to simpler ones (20.0%). This pattern indicates that models may be more prone to generate speculative answers when problems appear more tractable. The study also revealed significant variations across problem categories, with models showing difficulty in acknowledging uncertainty in invention and NP-hard problems while performing relatively better on philosophical and psychological challenges. These results contribute to the growing body of research on artificial general intelligence (AGI) assessment by highlighting the importance of uncertainty recognition as a critical component of future machine intelligence evaluation. This impossibility test thus extends previous theoretical frameworks for universal intelligence testing by providing empirical evidence of current limitations in LLMs' ability to recognize their own knowledge boundaries, suggesting new directions for improving model training architectures and evaluation approaches.


Language Models And A Second Opinion Use Case: The Pocket Professional

arXiv.org Artificial Intelligence

This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making, particularly focusing on complex medical cases where even experienced physicians seek peer consultation. The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses. A key finding was the high overall score achievable by the latest foundational models (>80% accuracy compared to consensus opinion), which exceeds most human metrics reported on the same clinical cases (450 pages of patient profiles and test results). The study documents the LLMs' performance disparity between straightforward cases (>81% accuracy) and complex scenarios (43% accuracy), particularly in cases that generated substantial debate among human physicians. The research demonstrates that LLMs may be valuable as generators of comprehensive differential diagnoses rather than as primary diagnostic tools, potentially helping to counter cognitive biases in clinical decision-making, reduce cognitive loads, and thus remove some sources of medical error. The inclusion of a second comparative legal dataset (Supreme Court cases, N=21) provides added empirical context for using AI to foster second opinions, though these legal challenges proved considerably easier for LLMs to analyze. In addition to the original contribution of empirical evidence for LLM accuracy, the research aggregated a novel benchmark that others can use to score the reliability of highly contested questions and answers between LLMs and disagreeing human practitioners. These results suggest that the optimal deployment of LLMs in professional settings may differ substantially from current approaches that emphasize automation of routine tasks.


Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders

arXiv.org Artificial Intelligence

This research builds and evaluates the adversarial potential to introduce copied code or hallucinated AI recommendations for malicious code into popular code repositories. While foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings, previous work on math solutions that embed harmful prompts demonstrates that the guardrails may differ between expert contexts. These loopholes would appear in mixture-of-experts models when the context of the question changes and may offer fewer malicious training examples to filter toxic comments or recommended offensive actions. The present work demonstrates that foundational models may correctly refuse to propose destructive actions when prompted overtly but may unfortunately drop their guard when presented with a sudden change of context, like solving a computer programming challenge. We show empirical examples with trojan-hosting repositories like GitHub, NPM, and NuGet, and popular content delivery networks (CDNs) like jsDelivr, which amplify the attack surface. In the LLM's directives to be helpful, example recommendations propose application programming interface (API) endpoints that a determined domain-squatter could acquire to set up mobile attack infrastructure triggered by the naively copied code. We compare this attack to previous work on context-shifting and contrast the attack surface as a novel version of "living off the land" attacks in the malware literature. In the latter case, foundational language models can hijack otherwise innocent user prompts to recommend actions that would violate their owners' safety policies if posed directly, without the accompanying coding support request.
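
One concrete defensive step against the hijacking risk described above is to check whether each LLM-recommended dependency actually exists in its registry before installing it; an unclaimed name is a candidate for squatting. The sketch below queries the public npm registry (https://registry.npmjs.org/<name>), which returns HTTP 404 for unregistered names. Treating this as a screening heuristic is our assumption, not tooling from the paper, and the second package name is deliberately fictitious.

    # Screen LLM-suggested npm package names: a 404 from the registry means the name
    # is unclaimed and could be registered by a squatter (illustrative heuristic).
    import urllib.request
    import urllib.error

    def npm_package_exists(name: str) -> bool:
        url = f"https://registry.npmjs.org/{name}"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise

    for pkg in ["left-pad", "some-llm-hallucinated-helper-lib"]:
        status = "exists" if npm_package_exists(pkg) else "UNCLAIMED: possible squatting target"
        print(f"{pkg}: {status}")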


Constructive Apraxia: An Unexpected Limit of Instructible Vision-Language Models and Analog for Human Cognitive Disorders

arXiv.org Artificial Intelligence

This study reveals an unexpected parallel between instructible vision-language models (VLMs) and human cognitive disorders, specifically constructive apraxia. We tested 25 state-of-the-art VLMs, including GPT-4 Vision, DALL-E 3, and Midjourney v5, on their ability to generate images of the Ponzo illusion, a task that requires basic spatial reasoning and is often used in clinical assessments of constructive apraxia. Remarkably, 24 out of 25 models failed to correctly render two horizontal lines against a perspective background, mirroring the deficits seen in patients with parietal lobe damage. The models consistently misinterpreted spatial instructions, producing tilted or misaligned lines that followed the perspective of the background rather than remaining horizontal. This behavior is strikingly similar to how apraxia patients struggle to copy or construct simple figures despite intact visual perception and motor skills. Our findings suggest that current VLMs, despite their advanced capabilities in other domains, lack fundamental spatial reasoning abilities akin to those impaired in constructive apraxia. This limitation in AI systems provides a novel computational model for studying spatial cognition deficits and highlights a critical area for improvement in VLM architecture and training methodologies.
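
For reference, the target stimulus the VLMs were asked to produce can be drawn in a few lines: two converging "railroad track" lines as the perspective background, plus two equal-length horizontal bars that must stay horizontal. The sketch below is our illustration of the stimulus using matplotlib, not the paper's generation pipeline; the failure mode described above corresponds to rendering the bars tilted along the converging lines instead.

    # Minimal Ponzo-illusion stimulus: converging background lines with two
    # equal-length horizontal bars that must remain horizontal.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 5))
    ax.plot([0.2, 0.45], [0.0, 1.0], color="black")   # left converging line
    ax.plot([0.8, 0.55], [0.0, 1.0], color="black")   # right converging line
    bar_half = 0.12                                   # both bars have the same length
    ax.plot([0.5 - bar_half, 0.5 + bar_half], [0.30, 0.30], color="red", linewidth=4)
    ax.plot([0.5 - bar_half, 0.5 + bar_half], [0.70, 0.70], color="red", linewidth=4)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_axis_off()
    plt.savefig("ponzo.png", dpi=150)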


Electrooptical Image Synthesis from SAR Imagery Using Generative Adversarial Networks

arXiv.org Artificial Intelligence

The utility of Synthetic Aperture Radar (SAR) imagery in remote sensing and satellite image analysis is well established, offering robustness under various weather and lighting conditions. However, SAR images, characterized by their unique structural and texture characteristics, often pose interpretability challenges for analysts accustomed to electrooptical (EO) imagery. This application compares state-of-the-art Generative Adversarial Networks (GANs), including Pix2Pix, CycleGAN, S-CycleGAN, a novel dual-generator GAN utilizing partial convolutions, and a novel dual-generator architecture utilizing transformers. These models are designed to progressively refine the realism of the translated optical images, thereby enhancing the visual interpretability of SAR data. We demonstrate the efficacy of our approach through qualitative and quantitative evaluations, comparing the synthesized EO images with actual EO images in terms of visual fidelity and feature preservation. The results show significant improvements in interpretability, making SAR data more accessible for analysts familiar with EO imagery. Furthermore, we explore the potential of this technology in various applications, including environmental monitoring, urban planning, and military reconnaissance, where rapid, accurate interpretation of SAR data is crucial. Our research contributes to the field of remote sensing by bridging the gap between SAR and EO imagery, offering a novel tool for enhanced data interpretation and broader application of SAR technology in various domains.

INTRODUCTION Synthetic Aperture Radar (SAR) systems are capable of creating high-resolution remote sensing images of the Earth's surface from satellites and aircraft. These images offer several key advantages over standard electro-optical (EO) images, most significantly the ability to penetrate clouds and operate independently of daylight, which has led to SAR systems being deployed extensively in various fields, including environmental monitoring, natural disaster assessment, military reconnaissance, and geological mapping [1]. Figure 1 shows the benefit of a SAR image when cloud coverage is present. Despite these advantages, SAR images pose significant challenges and still have drawbacks compared to EO images, specifically regarding human interpretability.
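
As a rough orientation to the image-to-image translation family compared above, here is a compact PyTorch encoder-decoder generator in the pix2pix style (single-channel SAR in, three-channel EO out). This is a generic sketch of that model family, not the paper's dual-generator architectures; layer sizes and channel counts are illustrative.

    # Generic pix2pix-style generator sketch for SAR (1-channel) -> EO (3-channel)
    # translation; illustrative layer sizes, not the paper's dual-generator models.
    import torch
    import torch.nn as nn

    class SmallSARToEOGenerator(nn.Module):
        def __init__(self, in_ch: int = 1, out_ch: int = 3, base: int = 64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, base, 4, stride=2, padding=1),      # H -> H/2
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1),   # H/2 -> H/4
                nn.BatchNorm2d(base * 2),
                nn.LeakyReLU(0.2, inplace=True),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),  # H/4 -> H/2
                nn.BatchNorm2d(base),
                nn.ReLU(inplace=True),
                nn.ConvTranspose2d(base, out_ch, 4, stride=2, padding=1),    # H/2 -> H
                nn.Tanh(),  # EO output scaled to [-1, 1]
            )

        def forward(self, sar: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(sar))

    # Shape check with a dummy 256x256 SAR patch.
    gen = SmallSARToEOGenerator()
    fake_eo = gen(torch.randn(1, 1, 256, 256))
    print(fake_eo.shape)  # torch.Size([1, 3, 256, 256])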


Exploiting Alpha Transparency In Language And Vision-Based AI Systems

arXiv.org Artificial Intelligence

This investigation reveals a novel exploit derived from PNG image file formats, specifically their alpha transparency layer, and its potential to fool multiple AI vision systems. Our method uses this alpha layer as a clandestine channel invisible to human observers but fully actionable by AI image processors. The scope tested for the vulnerability spans representative vision systems from Apple, Microsoft, Google, Salesforce, Nvidia, and Facebook, highlighting the attack's potential breadth. This vulnerability challenges the security protocols of existing and fielded vision systems, from medical imaging to autonomous driving technologies. Our experiments demonstrate that the affected systems, which rely on convolutional neural networks or the latest multimodal language models, cannot quickly mitigate these vulnerabilities through simple patches or updates. Instead, they require retraining and architectural changes, indicating a persistent hole in multimodal technologies without some future adversarial hardening against such vision-language exploits.
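
To make the exploit mechanism concrete, the sketch below builds a PNG whose RGB channels carry a payload while the alpha channel stays zero over that region: composited onto a white page, the image looks blank to a person, but a pipeline that drops or ignores the alpha channel before inference sees the payload. This is a generic illustration of the clandestine channel described above, not the paper's actual payloads or test images.

    # Illustrative alpha-channel payload: invisible when alpha-composited for humans,
    # readable by pipelines that discard the alpha channel before inference.
    from PIL import Image, ImageDraw

    W, H = 400, 120
    # Fully transparent canvas: RGB is black, alpha is 0 everywhere.
    img = Image.new("RGBA", (W, H), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    # Write the payload into the RGB channels while keeping alpha = 0,
    # so it never shows when the PNG is composited onto a page.
    draw.text((10, 40), "HIDDEN PAYLOAD FOR THE MODEL", fill=(255, 0, 0, 0))
    img.save("payload.png")

    # What a human sees: the PNG flattened onto a white background (all white).
    white = Image.new("RGBA", (W, H), (255, 255, 255, 255))
    Image.alpha_composite(white, img).convert("RGB").save("human_view.png")

    # What an alpha-unaware pipeline sees: RGB channels only (payload visible).
    img.convert("RGB").save("model_view.png")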