Large Language Model
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Huang, Yizheng, Zeng, Wenjun, Kumaresan, Aditi, Wang, Zi
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
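To make the Bayesian quadrature framing concrete, here is a minimal sketch of estimating a model's mean score over a finite pool of test inputs from a handful of evaluated points. It is not ProEval's implementation. The RBF kernel, the fixed lengthscale, the zero prior mean, and the "query the point with the largest posterior variance" rule are all assumptions chosen for illustration, and the names `rbf_kernel`, `gp_posterior`, `bq_estimate`, and `active_bq` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel between two sets of row vectors (assumed kernel)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(pool, idx, y, prior_mean=0.0, noise=1e-6):
    """GP posterior mean and covariance over the whole pool, given scores y at pool[idx]."""
    K = rbf_kernel(pool, pool)
    K_ss = K[np.ix_(idx, idx)] + noise * np.eye(len(idx))
    K_ps = K[:, idx]
    alpha = np.linalg.solve(K_ss, y - prior_mean)
    mean = prior_mean + K_ps @ alpha
    cov = K - K_ps @ np.linalg.solve(K_ss, K_ps.T)
    return mean, cov

def bq_estimate(pool, idx, y, prior_mean=0.0):
    """Posterior mean and variance of the average score over the pool (uniform weights)."""
    mean, cov = gp_posterior(pool, idx, y, prior_mean)
    n = len(pool)
    return mean.mean(), cov.sum() / n**2

def active_bq(pool, score_fn, budget, prior_mean=0.0):
    """Evaluate `budget` inputs, each time picking the most uncertain one (a simple heuristic)."""
    idx = [0]
    y = [score_fn(pool[0])]
    while len(idx) < budget:
        _, cov = gp_posterior(pool, idx, np.array(y), prior_mean)
        var = np.diag(cov).copy()
        var[idx] = -np.inf  # never re-query an input that was already evaluated
        nxt = int(np.argmax(var))
        idx.append(nxt)
        y.append(score_fn(pool[nxt]))
    return bq_estimate(pool, idx, np.array(y), prior_mean)

# Toy usage: a 1-D pool of 50 inputs and a synthetic score function.
pool = np.linspace(0.0, 1.0, 50)[:, None]
score_fn = lambda x: float(np.sin(3.0 * x[0]) ** 2)
estimate, variance = active_bq(pool, score_fn, budget=10)
truth = np.mean([score_fn(x) for x in pool])
print(f"BQ estimate {estimate:.4f}, pool average {truth:.4f}")
```

In this toy setting the estimate uses 10 of the 50 inputs. A pre-trained GP, as described in the abstract, would replace the hand-picked kernel and prior mean with ones learned from related evaluation data.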
Learning to Think from Multiple Thinkers
Joshi, Nirmit, Magen, Roey, Srebro, Nathan, Tsilivis, Nikolaos, Vardi, Gal
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \mathrm{poly}\log\frac{1}{\varepsilon}$.
Schema-learning and rebinding as mechanisms of in-context learning and emergence
In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method, namely clone-structured causal graphs (CSCGs). A key property of CSCGs is that, unlike transformer-based LLMs, they are interpretable, which considerably simplifies the task of explaining how ICL works. We show that ICL in CSCG uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding novel tokens to appropriate slots in the templates. We go on to marshal evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.
Elon Musk and Sam Altman are going to court over OpenAI's future
Elon Musk says he's suing to save the company's mission. The case could have huge consequences for OpenAI and the AI race. After a yearslong legal feud, Elon Musk and OpenAI CEO Sam Altman are heading to trial this week in Northern California in a case that could have sweeping consequences. Ahead of OpenAI's highly anticipated IPO, the court could rule on whether the company is allowed to exist as a for-profit enterprise and might even oust its current executive leadership, including Altman. Musk is suing OpenAI, alleging that Altman and OpenAI president Greg Brockman deceived him into bankrolling the company in its early days by promising to maintain it as a nonprofit dedicated to developing AI that benefits humanity, only to later restructure the company to operate a for-profit subsidiary. Musk cofounded OpenAI with Altman and others in 2015, but he left in 2018 after a bitter power struggle.
Elon Musk Boosts New Yorker's Sam Altman Exposé on X as Trial Begins
The move comes as the trial for Elon Musk's lawsuit against OpenAI kicks off in federal court in Oakland. Elon Musk is boosting a post on X promoting The New Yorker's extensive investigation into Sam Altman's allegedly deceptive behavior, WIRED has confirmed. The move comes just as Musk's lawsuit against OpenAI and Altman heads to a jury trial in a federal courtroom on Monday morning. People scrolling X on Monday reported seeing an April 6 post from Ronan Farrow, a coauthor of the New Yorker article, promoting the investigation. A pop-up on the post on X's mobile app says it was boosted by @elonmusk, who also owns the platform.
OpenAI's GPT-5.5 is faster, smarter, and a step toward its 'super app'
PCWorld reports that OpenAI has launched GPT-5.5, its most advanced AI model, exclusively for paying ChatGPT subscribers on Plus, Pro, Business, and Enterprise plans. The new model delivers faster, more efficient performance in coding, research, and math while outperforming competitors like Google's Gemini 3.1 Pro and Anthropic's Claude Opus 4.7. GPT-5.5 represents a significant step toward OpenAI's 'super app' vision, integrating various AI services into one comprehensive platform. OpenAI recently launched GPT-5.5, which the company describes as its most advanced and intuitive AI model to date. The new model is said to be both faster and more efficient, with specific improvements in areas including coding, research, and math. At the same time, it's said to outperform competing models like Google's Gemini 3.1 Pro and Anthropic's Claude Opus 4.7. According to OpenAI co-founder Greg Brockman, GPT-5.5 is also a step towards the company's vision of a future "super app," where services such as ChatGPT, Codex, and an AI-driven web browser are integrated into a single platform, reports TechCrunch. GPT-5.5 is currently rolling out to paying ChatGPT users, which includes those on Plus, Pro, Business, and Enterprise plans. This article originally appeared on our sister publication PC för Alla and was translated and localized from Swedish.
OpenAI breaks out of exclusivity agreements in its partnership with Microsoft
The two companies announced an amended partnership that lets OpenAI use other cloud platforms and offer its models to other companies. OpenAI is opening up its partnership with Microsoft in the latest amendment to the major multi-year collaboration between the tech giants. The latest changes allow OpenAI to offer its latest AI models to other companies and through other cloud providers, stripping Microsoft of its exclusivity rights. According to a joint announcement posted on OpenAI's and Microsoft's websites, Microsoft will still be OpenAI's primary cloud partner with the latest products shipping first on Azure, but OpenAI is now allowed to use any cloud provider. Sam Altman, OpenAI's CEO, posted on X that the company is now able to make its products and services available across all clouds.
Why Elon Musk and Sam Altman are fighting over OpenAI
Musk, who co-founded the company that created ChatGPT with Altman, wants more than $130 billion in damages in a lawsuit that could shake up the artificial intelligence landscape. The BBC's Lily Jamali explains why the two tech giants are facing off in court.
The Download: DeepSeek's latest AI breakthrough, and the race to build world models
Plus: China has blocked Meta's $2 billion acquisition of AI startup Manus. On Friday, Chinese AI firm DeepSeek released a preview of V4, its long-awaited new flagship model. Notably, the model can process much longer prompts than its last generation, thanks to a new design that handles large amounts of text more efficiently. While the model remains open source, its performance matches leading closed-source rivals from Anthropic, OpenAI, and Google. Here are three ways V4 could shake up AI. AI systems have already gained impressive mastery over the digital world, but the physical world remains humanity's domain.
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning, compared to existing approaches.
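As a rough illustration of what "matching token distributions" can mean, the sketch below measures the gap between two sets of token embeddings with a squared maximum mean discrepancy (MMD). The abstract does not say which discrepancy the authors use, so the choice of MMD, the RBF kernel, the bandwidth, and the synthetic embeddings are all assumptions, and the mean-shift "alignment" is only a stand-in for a trained audio encoder.

```python
import numpy as np

def rbf_gram(a, b, bandwidth):
    """RBF kernel matrix between two sets of row-vector embeddings."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=3.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_gram(x, x, bandwidth).mean()
            + rbf_gram(y, y, bandwidth).mean()
            - 2.0 * rbf_gram(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
image_tokens = rng.normal(0.0, 1.0, size=(200, 8))  # synthetic image-captioner tokens
audio_tokens = rng.normal(2.0, 1.0, size=(200, 8))  # synthetic audio tokens, offset to mimic a modality gap

# Stand-in for alignment: shift the audio tokens so their mean matches the image tokens.
aligned_audio = audio_tokens - audio_tokens.mean(0) + image_tokens.mean(0)

gap_before = mmd2(audio_tokens, image_tokens)
gap_after = mmd2(aligned_audio, image_tokens)
print(f"MMD^2 before: {gap_before:.3f}, after: {gap_after:.3f}")
```

In a training setup, a discrepancy of this kind would be minimized with respect to the audio encoder's parameters rather than by a fixed shift.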