Rethinking Technological Readiness in the Era of AI Uncertainty

arXiv.org Artificial Intelligence

Advances in artificial intelligence (AI) promise enhanced capabilities for military combat systems, from autonomous drones to decision-support algorithms.[2] These benefits come with new risks: AI systems can behave unpredictably, lack transparency, and perform inconsistently outside of controlled settings.[3] To overcome these challenges, a dedicated AI Readiness Framework is needed to systematically assess whether AI-enabled military systems are truly prepared for deployment. This article contends that defense organizations should adopt an AI-specific readiness assessment, analogous to (but more comprehensive than) traditional metrics like Technology Readiness Levels (TRLs),[1] to ensure justified confidence in AI systems before they are fielded. We begin by examining the limitations of current readiness assessment metrics (such as TRLs) when applied to AI. We then introduce a new framework with specific criteria designed to evaluate AI system maturity, explaining our rationale for each criterion and discussing implementation considerations.[4] Next, we analyze how the proposed framework addresses critical AI system challenges, including "hallucinations," lack of explainability, and performance variability in operational scenarios. Finally, we outline the framework's applicability to current military AI programs and conclude with recommendations for integrating this approach into defense technology management.


Cartridges: Lightweight and general-purpose long context representations via self-study

arXiv.org Artificial Intelligence

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
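
As a rough illustration of why the KV cache makes long-context serving costly, the sketch below computes cache size as a function of input length using the standard accounting (one key and one value vector per layer, per KV head, per token). The model dimensions are hypothetical placeholders, not taken from the paper, and the function name `kv_cache_bytes` is introduced here only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model configuration (an assumption, not from the abstract).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (1_000, 128_000, 484_000):
    gib = kv_cache_bytes(tokens, **cfg) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

Because every term except `seq_len` is fixed by the model, the cache grows linearly with the number of tokens held in context, which is the cost a smaller trained cache aims to avoid.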


D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

arXiv.org Artificial Intelligence

Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman's rho 0.99, Kendall's tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness of the generated distractors. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.
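
One plausible reading of the ranking-alignment check is to score the same set of models twice, once on questions with ground-truth distractors and once with generated ones, and then measure how well the two model rankings agree. The sketch below computes Spearman's rho and Kendall's tau for that comparison in plain Python. It is not the authors' implementation; the accuracy values are made up for illustration, and both functions assume no tied scores.

```python
from itertools import combinations


def ranks(scores):
    """Rank positions (1 = highest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(a, b):
    """Spearman rank correlation via the no-ties formula."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / total pairs."""
    n = len(a)
    s = 0
    for i, j in combinations(range(n), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        s += (product > 0) - (product < 0)
    return s / (n * (n - 1) / 2)


# Hypothetical accuracies of five models (illustrative values only).
acc_ground_truth = [0.81, 0.74, 0.69, 0.62, 0.55]
acc_generated = [0.79, 0.75, 0.64, 0.66, 0.51]

print("Spearman rho:", spearman_rho(acc_ground_truth, acc_generated))
print("Kendall tau:", kendall_tau(acc_ground_truth, acc_generated))
```

Values near 1 on both statistics would indicate that the generated distractors separate strong and weak models in much the same order as the ground-truth distractors do.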


A Step-by-Step Guide to Creating a Robust Autonomous Drone Testing Pipeline

arXiv.org Artificial Intelligence

Autonomous drones are rapidly reshaping industries ranging from aerial delivery and infrastructure inspection to environmental monitoring and disaster response. Ensuring the safety, reliability, and efficiency of these systems is paramount as they transition from research prototypes to mission-critical platforms. This paper presents a step-by-step guide to establishing a robust autonomous drone testing pipeline, covering each critical stage: Software-in-the-Loop (SIL) Simulation Testing, Hardware-in-the-Loop (HIL) Testing, Controlled Real-World Testing, and In-Field Testing. Using practical examples, including the marker-based autonomous landing system, we demonstrate how to systematically verify drone system behaviors, identify integration issues, and optimize performance. Furthermore, we highlight emerging trends shaping the future of drone testing, including the integration of Neurosymbolic and LLMs, creating co-simulation environments, and Digital Twin-enabled simulation-based testing techniques. By following this pipeline, developers and researchers can achieve comprehensive validation, minimize deployment risks, and prepare autonomous drones for safe and reliable real-world operations.


MANBench: Is Your Multimodal Model Smarter than Human?

arXiv.org Artificial Intelligence

The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench.


Perception-Driven Bias Detection in Machine Learning via Crowdsourced Visual Judgment

arXiv.org Artificial Intelligence

Machine learning systems are increasingly deployed in high-stakes domains, yet they remain vulnerable to bias: systematic disparities that disproportionately impact specific demographic groups. Traditional bias detection methods often depend on access to sensitive labels or rely on rigid fairness metrics, limiting their applicability in real-world settings. This paper introduces a novel, perception-driven framework for bias detection that leverages crowdsourced human judgment. Inspired by reCAPTCHA and other crowd-powered systems, we present a lightweight web platform that displays stripped-down visualizations of numeric data (for example, salary distributions across demographic clusters) and collects binary judgments on group similarity. We explore how users' visual perception, shaped by layout, spacing, and question phrasing, can signal potential disparities. User feedback is aggregated to flag data segments as biased, which are then validated through statistical tests and machine learning cross-evaluations. Our findings show that perceptual signals from non-expert users reliably correlate with known bias cases, suggesting that visual intuition can serve as a powerful, scalable proxy for fairness auditing. This approach offers a label-efficient, interpretable alternative to conventional fairness diagnostics, paving the way toward human-aligned, crowdsourced bias detection pipelines.


ChemHGNN: A Hierarchical Hypergraph Neural Network for Reaction Virtual Screening and Discovery

arXiv.org Artificial Intelligence

Reaction virtual screening and discovery are fundamental challenges in chemistry and materials science, where traditional graph neural networks (GNNs) struggle to model multi-reactant interactions. In this work, we propose ChemHGNN, a hypergraph neural network (HGNN) framework that effectively captures high-order relationships in reaction networks. Unlike GNNs, which require constructing complete graphs for multi-reactant reactions, ChemHGNN naturally models multi-reactant reactions through hyperedges, enabling more expressive reaction representations. To address key challenges, such as combinatorial explosion, model collapse, and chemically invalid negative samples, we introduce a reaction center-aware negative sampling strategy (RCNS) and a hierarchical embedding approach combining molecule, reaction and hypergraph level features. Experiments on the USPTO dataset demonstrate that ChemHGNN significantly outperforms HGNN and GNN baselines, particularly in large-scale settings, while maintaining interpretability and chemical plausibility. Our work establishes HGNNs as a superior alternative to GNNs for reaction virtual screening and discovery, offering a chemically informed framework for accelerating reaction discovery.


Fox News AI Newsletter: Hollywood studios sue 'bottomless pit of plagiarism'

FOX News

'PIRACY IS PIRACY': Two major Hollywood studios are suing Midjourney, a popular AI image generator, over its use and distribution of intellectual property. AI RACE: Meta CEO Mark Zuckerberg is reportedly building a team of experts to develop artificial general intelligence (AGI) that can meet or exceed human capabilities. TECH HUB: New York is poised to play a central role in the development of artificial intelligence (AI), OpenAI executives told key business and civic leaders on Tuesday.


Facial recognition error sees woman accused of theft

BBC News

In one email from Facewatch seen by the BBC, the firm told Ms Horan it "relies on information submitted by stores" and the Home Bargains branches involved had since been "suspended from using the Facewatch system". Madeleine Stone, senior advocacy officer at the civil liberties campaign group Big Brother Watch, said they had been contacted by more than 35 people who have complained of being wrongly placed on facial recognition watchlists. "They're being wrongly flagged as criminals," Ms Stone said. "They've given no due process, kicked out of stores. This is having a really serious impact."


The Chatbot Disinfo Inflaming the LA Protests

WIRED

In recent days, Los Angeles residents have taken to the streets to protest the Trump administration's immigration policies and the increasingly frequent ICE raids. WIRED's senior politics editor Leah Feiger joins Zoë Schiffer, director of business and industry, to discuss the related flood of information on social media, and how AI chatbots like Grok and ChatGPT are delivering incorrect and, at times, inflammatory answers. Mentioned in today's episode: "AI Chatbots Are Making LA Protest Disinformation Worse" by David Gilbert; "I Joined Every Class Action Lawsuit I Could Find, and So Can You" by Andy Vasoyan; "Vibe Coding Is Coming for Engineering Jobs" by Will Knight. Write to us at uncannyvalley@wired.com.