Mixer
Grouping the channels together. Token-mixing MLPs take S-dimensional vectors as inputs. Every such vector contains the values of a single feature across S different spatial locations. In other words, token-mixing MLPs operate on only one channel at a time. For stochastic depth, following the original paper [3], we linearly increase the probability of dropping a layer from 0.0. Models are fine-tuned at resolution 224 unless mentioned otherwise. We follow the setup of [2].
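The token-mixing operation described above can be sketched in a few lines of NumPy (a minimal illustration with hypothetical weight shapes; the actual Mixer block also uses GELU, layer normalization, and skip connections, which are omitted here):

```python
import numpy as np

def token_mix(X, W1, W2):
    """Apply a token-mixing MLP to each channel (column) of X independently.

    X: (S, C) array -- S spatial tokens, C channels.
    W1: (H, S), W2: (S, H) -- weights shared across all channels.
    """
    # Each column X[:, c] is an S-dimensional vector holding one feature
    # across S spatial locations; the MLP mixes along that axis only.
    hidden = np.maximum(0.0, W1 @ X)  # ReLU stand-in for the paper's GELU
    return W2 @ hidden                # back to shape (S, C)

S, C, H = 4, 3, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((S, C))
out = token_mix(X, rng.standard_normal((H, S)), rng.standard_normal((S, H)))
```

Because the weights act on the token axis, every channel is transformed by the same S-to-S mapping, which is exactly the "looking at one channel at a time" behavior described above.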
BRIDGE: Building Representations In Domain Guided Program Verification
George, Robert Joseph, Eisenach, Carson, Ghai, Udaya, Perrault-Joncas, Dominique, Anandkumar, Anima, Foster, Dean
Large language models (LLMs) have achieved impressive results in code generation, yet struggle with program verification, especially in interactive proof frameworks such as Lean4. A central challenge is scalability: verified synthesis requires not just code, but also precise specifications and correctness proofs, and existing approaches rarely span all three domains. We present BRIDGE, the first systematic study of structured prompting for scalable verified program generation. BRIDGE decomposes verification into three interconnected domains: Code (executable implementations), Specifications (formal intent statements), and Proofs (constructive correctness arguments). Our key idea is to elicit distinct reasoning behaviors (functional, specification-driven, and proof-oriented) as intermediate representations that preserve semantic structure and connect these domains. Through systematic ablations, we show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods. For example, functional reasoning improves correctness of code in formal languages (Lean4) by nearly 1.5x (pass@5) over direct baselines. In terms of inference-time compute, functional reasoning is also 2x more efficient, achieving higher pass rates with fewer generations and lower total sampling budgets. Similarly, we find that specification-driven prompting boosts Python coding pass rates by up to 17.5%. These findings suggest that structured domain alignment is a promising direction for advancing verified synthesis. BRIDGE establishes a foundation for training via expert iteration or RLVR, enabling models to internalize these reasoning strategies across code, specifications, and proofs.
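The three-domain decomposition described in the abstract might look roughly like the following prompt-construction sketch (the section names, preamble wording, and `bridge_prompt` helper are hypothetical; the paper does not publish its exact templates here):

```python
# Hypothetical sketch of BRIDGE-style structured prompting: one preamble per
# reasoning behavior, steering the model before it emits all three domains.
def bridge_prompt(task: str, mode: str) -> str:
    preambles = {
        "functional": "Reason about the function's input-output behavior first.",
        "specification": "State the formal specification of the intent first.",
        "proof": "Outline the constructive correctness argument first.",
    }
    return (
        f"{preambles[mode]}\n\n"
        f"Task: {task}\n"
        "Produce: (1) Code, (2) Specification, (3) Proof, in that order."
    )

p = bridge_prompt("Implement list reversal in Lean4", "functional")
```

The point of the sketch is only the structure: a single task is routed through a chosen reasoning behavior while the output format still spans code, specification, and proof.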
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
Shen, Xinjie, Li, Mufei, Li, Pan
The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically-aware alignment. Codes and datasets will be available at https://github.com/Graph-COM/EAPrivacy.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.68)
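The per-tier accuracy reporting described in the abstract can be sketched as follows (a toy scorer, not the benchmark's actual harness; the `(tier, passed)` record format is an assumption):

```python
from collections import defaultdict

def per_tier_accuracy(results):
    """results: iterable of (tier, passed) pairs -> {tier: accuracy}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    return {t: passes[t] / totals[t] for t in totals}

# Two tier-1 scenarios passed, one of two tier-2 scenarios failed.
acc = per_tier_accuracy([(1, True), (1, True), (2, False), (2, True)])
```

Breaking accuracy out by tier is what lets a benchmark like this separate, say, sensitive-object handling from adaptation to changing environments.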
XML Prompting as Grammar-Constrained Interaction: Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols
Structured prompting with XML tags has emerged as an effective way to steer large language models (LLMs) toward parseable, schema-adherent outputs in real-world systems. We develop a logic-first treatment of XML prompting that unifies (i) grammar-constrained decoding, (ii) fixed-point semantics over lattices of hierarchical prompts, and (iii) convergent human-AI interaction loops. We formalize a complete lattice of XML trees under a refinement order and prove that monotone prompt-to-prompt operators admit least fixed points (Knaster-Tarski) that characterize steady-state protocols; under a task-aware contraction metric on trees, we further prove Banach-style convergence of iterative guidance. We instantiate these results with context-free grammars (CFGs) for XML schemas and show how constrained decoding guarantees well-formedness while preserving task performance. A set of multi-layer human-AI interaction recipes demonstrates practical deployment patterns, including multi-pass "plan-verify-revise" routines and agentic tool use. We provide mathematically complete proofs and tie our framework to recent advances in grammar-aligned decoding, chain-of-verification, and programmatic prompting. Keywords: XML prompting; grammar-constrained decoding; fixed-point theorems; Banach contraction; Knaster-Tarski; modal µ-calculus; structured outputs; human-AI interaction; arXiv cs.AI; arXiv cs.CL
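The Knaster-Tarski least fixed point mentioned above can be illustrated by Kleene iteration on a finite lattice (a toy model: subsets of XML tags ordered by inclusion, with a made-up tag-dependency table standing in for the paper's XML-tree refinement order):

```python
def lfp(f, bottom, max_iter=100):
    """Least fixed point of a monotone operator on a finite lattice,
    by iterating f from the bottom element until it stabilizes."""
    x = bottom
    for _ in range(max_iter):
        nxt = f(x)
        if nxt == x:
            return x
        x = nxt
    raise RuntimeError("no fixed point within iteration budget")

# Hypothetical schema dependency: each tag forces further required tags.
requires = {"plan": {"verify"}, "verify": {"revise"}, "revise": set()}

def close(tags: frozenset) -> frozenset:
    """Monotone operator: add every tag required by an already-present tag."""
    extra = set().union(*(requires[t] for t in tags)) if tags else set()
    return frozenset(tags | extra)

fixed = lfp(close, frozenset({"plan"}))
```

Since `close` is monotone and the lattice is finite, the iteration is guaranteed to stop, and the result is the least protocol containing the starting tag that is closed under the dependencies, i.e. the steady-state the abstract describes.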
Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification
This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
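The tensor-level tier of the three-tier protocol amounts to an elementwise closeness check between backend outputs. A minimal sketch (the tolerance values and the `tier1_closeness` name are illustrative, not the paper's settings):

```python
import numpy as np

def tier1_closeness(ref, other, rtol=1e-5, atol=1e-6):
    """Tier-1 check: elementwise closeness between the reference backend's
    output and another backend's output; returns (pass, worst abs error)."""
    ok = np.allclose(ref, other, rtol=rtol, atol=atol)
    return ok, float(np.max(np.abs(ref - other)))

cpu_out = np.array([1.0, 2.0, 3.0])
gpu_out = cpu_out + 1e-7  # simulated cross-backend numerical drift
ok, err = tier1_closeness(cpu_out, gpu_out)
```

Sweeping `rtol`/`atol` is how a framework like this would surface the paper's observation that most failures appear only under stricter thresholds.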
Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
Ranaldi, Federico, Zugarini, Andrea, Ranaldi, Leonardo, Zanzotto, Fabio Massimo
We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying in the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
- Europe (1.00)
- North America > United States (0.28)
- North America > Mexico (0.28)
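A lexical Knowledge Activation Task of the kind described above can be sketched as a masked-triple completion check (a hypothetical sketch: `kat_lexical` and `toy_model` are illustrative stand-ins, with `toy_model` playing the role of an actual LLM call):

```python
def kat_lexical(triple, query_model):
    """Check whether the model can complete a KG triple whose object is masked."""
    subj, pred, obj = triple
    prompt = f"Complete the triple: ({subj}, {pred}, ?)"
    prediction = query_model(prompt)
    return prediction.strip().lower() == obj.lower()

def toy_model(prompt: str) -> str:
    # Pretend the model memorized this fact during pretraining.
    return "Paris" if "hasCapital" in prompt else "unknown"

activated = kat_lexical(("France", "hasCapital", "Paris"), toy_model)
```

A successful completion is taken as evidence that the relevant protoknowledge was activated, which is the signal the paper's analysis framework correlates with downstream Text-to-SPARQL predictions.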