 Large Language Model


OpenAI's free GPT-5.5 model makes ChatGPT better at understanding context

Engadget

GPT-5.5 Instant is now more capable at processing complex questions. OpenAI has updated GPT-5.5 Instant, the model you interact with the most when you use ChatGPT, to be better at understanding context and adapting to queries as you alter them to add more conditions or clarifications. The company updated ChatGPT's default model to GPT-5.5 Instant in May. Back then, it said that the model produced 52.5 percent fewer hallucinated statements during testing and 37.3 percent fewer factual errors. Now, the model has been upgraded to be more capable when it comes to identifying the underlying goal of a task or a question and carrying context over across multiple back-and-forths as you talk to it.


Jalapeño is the first AI chip from OpenAI and Broadcom

Engadget

OpenAI and Broadcom have unveiled the design for Jalapeño, their first jointly made chip. The two companies announced plans to collaborate on a custom AI accelerator in October 2025. In its blog post today, OpenAI called Jalapeño its first Intelligence Processor: an accelerator architected around OpenAI's vision for the future of LLM inference. In other words, the processor is designed to run its large language models. The AI company claims that, so far, Jalapeño offers substantially better performance per watt than the current state of the art in chip technology.


AgentAuditor: Human-level Safety and Security Evaluation for LLM Agents

Neural Information Processing Systems

Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents' step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing Strict and Lenient judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy.
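The retrieval idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each past interaction has already been turned into a numeric feature vector, uses a single cosine-similarity ranking rather than the multi-stage process the abstract describes, and every name in it (`cosine`, `retrieve`, the `vec` and `reasoning` fields, the sample memory) is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(memory, query_vec, k=2):
    """Return the k memory entries whose feature vectors are most similar to the query."""
    ranked = sorted(memory, key=lambda entry: cosine(entry["vec"], query_vec), reverse=True)
    return ranked[:k]

# Hypothetical experiential memory: a feature vector plus a stored reasoning trace.
memory = [
    {"vec": [1.0, 0.0, 0.0], "reasoning": "Agent deleted files without confirmation: unsafe."},
    {"vec": [0.0, 1.0, 0.0], "reasoning": "Agent refused to share credentials: safe."},
    {"vec": [0.7, 0.7, 0.0], "reasoning": "Agent asked before sending email: safe."},
]

# The retrieved traces would be placed in the evaluator's prompt for the new case.
examples = retrieve(memory, [0.9, 0.1, 0.0], k=2)
```

In a real system the vectors would come from an embedding model, and the retrieved reasoning traces would serve as in-context examples for the LLM evaluator.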


Qualcomm Buys Buzzy Chip Startup Modular for Nearly $4 Billion

WIRED

Modular, one of the most promising chip software startups of the AI era, heads for a multibillion-dollar exit. Qualcomm will acquire the Silicon Valley chip startup Modular for nearly $4 billion. The companies announced the acquisition on Wednesday; Qualcomm said it expects to issue up to 19.2 million shares of common stock in the deal, which works out to just under $4 billion based on the company's last closing share price. The deal, which includes $300 million for Modular employees, comes nine months after the chip startup raised $250 million at a $1.6 billion valuation. It's expected to close in the second half of this year.


Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Neural Information Processing Systems

As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI), meaning what information is appropriate to share while carrying out a certain task, becomes a central question for the field. We posit that CI demands a form of reasoning in which the agent reasons about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only 700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls. Our code is available at: https://github.com/EricGLan/CI-RL



Structured Spectral Reasoning for Frequency-Adaptive Multimodal Recommendation

Neural Information Processing Systems

Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking with representation-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance. The code is available at https://github.com/llm-ml/SSR.git.


Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

Neural Information Processing Systems

Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities.
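To make the graph properties named in this abstract concrete, here is a small sketch that assumes the clustering step has already been done and each reasoning step is labeled with a cluster id. The function names, the revisit count used as a simple stand-in for cyclicity, and the unweighted hop-count diameter are illustrative assumptions, not the paper's definitions.

```python
from collections import deque

def build_reasoning_graph(cluster_path):
    """Directed graph: nodes are cluster ids, edges link consecutive reasoning steps."""
    graph = {node: set() for node in cluster_path}
    for src, dst in zip(cluster_path, cluster_path[1:]):
        if src != dst:  # ignore steps that stay in the same cluster
            graph[src].add(dst)
    return graph

def count_revisits(cluster_path):
    """Cycle proxy: number of times the path re-enters a cluster it visited earlier."""
    seen, revisits, prev = set(), 0, None
    for node in cluster_path:
        if node != prev and node in seen:
            revisits += 1
        seen.add(node)
        prev = node
    return revisits

def diameter(graph):
    """Longest finite shortest-path length (in hops) over all ordered node pairs."""
    best = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best

# Hypothetical cluster labels for six reasoning steps; the path returns to cluster 1 once.
path = [0, 1, 2, 1, 3, 4]
graph = build_reasoning_graph(path)
```

For this toy path, `count_revisits(path)` gives 1 and `diameter(graph)` gives 3 (the hop distance from cluster 0 to cluster 4).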


QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Neural Information Processing Systems

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge.


Improving Regret Approximation for Unsupervised Dynamic Environment Generation

Neural Information Processing Systems

Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: https://github.