BRIDGE: Building Representations In Domain Guided Program Verification

George, Robert Joseph, Eisenach, Carson, Ghai, Udaya, Perrault-Joncas, Dominique, Anandkumar, Anima, Foster, Dean

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved impressive results in code generation, yet struggle with program verification, especially in interactive proof frameworks such as Lean4. A central challenge is scalability: verified synthesis requires not just code, but also precise specifications and correctness proofs, and existing approaches rarely span all three domains. We present BRIDGE, the first systematic study of structured prompting for scalable verified program generation. BRIDGE decomposes verification into three interconnected domains: Code (executable implementations), Specifications (formal intent statements), and Proofs (constructive correctness arguments). Our key idea is to elicit distinct reasoning behaviors (functional, specification-driven, and proof-oriented) as intermediate representations that preserve semantic structure and connect these domains. Through systematic ablations, we show that this approach substantially improves both accuracy and efficiency beyond standard error-feedback methods. For example, functional reasoning improves the correctness of code in formal languages (Lean4) by nearly 1.5x (pass@5) over direct baselines. Functional reasoning is also 2x more efficient in inference-time compute, achieving higher pass rates with fewer generations and lower total sampling budgets. Similarly, we find that specification-driven prompting boosts Python coding pass rates by up to 17.5%. These findings suggest that structured domain alignment is a promising direction for advancing verified synthesis. BRIDGE establishes a foundation for training via expert iteration or RLVR, enabling models to internalize these reasoning strategies across code, specifications, and proofs.
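The three domains the abstract names can be illustrated with a toy Lean 4 snippet (a hypothetical example for orientation, not taken from the paper): an executable implementation, a formal specification of intent, and a constructive proof connecting the two.

```lean
-- Assumed toy example of the Code / Specification / Proof split,
-- not from the BRIDGE paper itself.

-- Code: an executable implementation.
def double (n : Nat) : Nat := n + n

-- Specification: a formal statement of intent.
-- Proof: a constructive correctness argument
-- (omega decides linear arithmetic over Nat).
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

A verified-synthesis pipeline must produce all three pieces together; BRIDGE's point is that prompting for each reasoning style separately helps the model do so.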



Supplementary Materials for MLP-Mixer: An all-MLP Architecture for Vision

Neural Information Processing Systems

We did not observe any noticeable improvements. In other words, token-mixing MLPs operate by looking at only one channel at once. All layers in Mixer retain the same, isotropic design. However, these did not lead to consistent improvements, so we dropped them.

Table 1: Hyperparameter settings used for pre-training Mixer models.
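The remark that token-mixing MLPs "look at only one channel at once" can be made concrete with a minimal NumPy sketch (shapes and weights are illustrative, not the paper's configuration): the mixing MLP acts along the token dimension, and each channel column is processed independently with shared weights.

```python
# Minimal sketch of token mixing, assuming illustrative shapes.
import numpy as np

tokens, channels, hidden = 4, 3, 8
X = np.random.randn(tokens, channels)

W1 = np.random.randn(hidden, tokens)  # mixes tokens, shared across channels
W2 = np.random.randn(tokens, hidden)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Multiplying from the left acts along the token axis, so each channel
# (each column of X) is transformed independently.
Y = W2 @ gelu(W1 @ X)
assert Y.shape == (tokens, channels)
```

Channel-mixing MLPs are the transposed case: they act along the channel axis, one token at a time.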


Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark

Shen, Xinjie, Li, Mufei, Li, Pan

arXiv.org Artificial Intelligence

The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy uses procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically aware alignment. Code and datasets will be available at https://github.com/Graph-COM/EAPrivacy.




XML Prompting as Grammar-Constrained Interaction: Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols

Alpay, Faruk, Alpay, Taylan

arXiv.org Artificial Intelligence

Structured prompting with XML tags has emerged as an effective way to steer large language models (LLMs) toward parseable, schema-adherent outputs in real-world systems. We develop a logic-first treatment of XML prompting that unifies (i) grammar-constrained decoding, (ii) fixed-point semantics over lattices of hierarchical prompts, and (iii) convergent human-AI interaction loops. We formalize a complete lattice of XML trees under a refinement order and prove that monotone prompt-to-prompt operators admit least fixed points (Knaster-Tarski) that characterize steady-state protocols; under a task-aware contraction metric on trees, we further prove Banach-style convergence of iterative guidance. We instantiate these results with context-free grammars (CFGs) for XML schemas and show how constrained decoding guarantees well-formedness while preserving task performance. A set of multi-layer human-AI interaction recipes demonstrates practical deployment patterns, including multi-pass "plan-verify-revise" routines and agentic tool use. We provide mathematically complete proofs and tie our framework to recent advances in grammar-aligned decoding, chain-of-verification, and programmatic prompting. Keywords: XML prompting; grammar-constrained decoding; fixed-point theorems; Banach contraction; Knaster-Tarski; modal µ-calculus; structured outputs; human-AI interaction; arXiv cs.AI; arXiv cs.CL
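The fixed-point machinery the abstract describes can be sketched in a few lines of Python (an illustrative toy, not the paper's formalism): model a prompt tree as a nested dict, apply a monotone refinement operator repeatedly, and stop when the tree no longer changes, which is the Kleene-style iteration toward a least fixed point.

```python
# Toy sketch of iterating a monotone prompt-refinement operator to a
# fixed point. The operator and field names here are hypothetical.

def refine(tree):
    """A monotone operator: only ever adds missing required fields."""
    out = dict(tree)
    if "plan" not in out:
        out["plan"] = {}
    if "verify" not in out["plan"]:
        out["plan"] = {**out["plan"], "verify": "pending"}
    return out

def fixed_point(op, tree, max_iters=100):
    """Apply op until the tree stops changing (Kleene iteration)."""
    for _ in range(max_iters):
        nxt = op(tree)
        if nxt == tree:
            return tree
        tree = nxt
    raise RuntimeError("no fixed point within max_iters")

t = fixed_point(refine, {})
# t == {"plan": {"verify": "pending"}}
```

Monotonicity (the operator only refines, never retracts) is what the Knaster-Tarski theorem needs to guarantee such a least fixed point exists on the lattice of trees.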


Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification

Li, Zehua

arXiv.org Artificial Intelligence

This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
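The three-tier verification idea can be sketched as follows (function names, tolerances, and the agreement criterion are assumptions for illustration, not the paper's implementation): one check for tensor-level closeness, one for alignment of intermediate activations, and one for task-level agreement.

```python
# Hypothetical sketch of three-tier cross-backend verification.
import numpy as np

def tensor_close(a, b, rtol=1e-5, atol=1e-6):
    """Tier 1: elementwise closeness of final output tensors."""
    return np.allclose(a, b, rtol=rtol, atol=atol)

def activation_aligned(acts_a, acts_b, atol=1e-4):
    """Tier 2: max absolute deviation across intermediate activations."""
    return all(np.max(np.abs(x - y)) <= atol for x, y in zip(acts_a, acts_b))

def task_metric_agree(preds_a, preds_b):
    """Tier 3: task-level agreement (here, identical argmax predictions)."""
    return np.array_equal(np.argmax(preds_a, -1), np.argmax(preds_b, -1))

cpu_out = np.array([[0.1, 0.9], [0.8, 0.2]])
gpu_out = cpu_out + 1e-7  # tiny backend drift
print(tensor_close(cpu_out, gpu_out),
      task_metric_agree(cpu_out, gpu_out))  # True True
```

Separating the tiers matches the paper's observation that most failures appear only at the strictest (tensor-level) thresholds while task-level metrics still agree.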



How and why parents and teachers are introducing young children to AI

The Guardian

Since the release of ChatGPT in late 2022, generative artificial intelligence has trickled down from adults in their offices to university students in campus libraries to teenagers in high school hallways. Now it's reaching the youngest among us, and parents and teachers are grappling with the most responsible way to introduce their under-13s to a new technology that may fundamentally reshape the future. Though the terms of service for ChatGPT, Google's Gemini and other AI models specify that the tools are only meant for those over 13, parents and teachers are taking the matter of AI education into their own hands. Inspired by a story we published on parents who are teaching their children to use AI to set them up for success in school and at work, we asked Guardian readers how and why – or why not – others are doing the same. Though our original story only concerned parents, we have also included teachers in the responses published below, as preparing children for future studies and jobs is one of educators' responsibilities as well.