Large Language Model
QATCH: Benchmarking SQL-centric tasks with Table Representation Learning Models on Your Data
Table Representation Learning (TRL) models are commonly pre-trained on large open-domain datasets comprising millions of tables and then used to address downstream tasks. Choosing the right TRL model to use on proprietary data can be challenging, as the best results depend on the content domain, schema, and data quality. Our purpose is to support end-users in testing TRL models on proprietary data in two established SQL-centric tasks, i.e., Question Answering (QA) and Semantic Parsing (SP). We present QATCH (Query-Aided TRL Checklist), a toolbox to highlight TRL models' strengths and weaknesses on relational tables unseen at training time. For an input table, QATCH automatically generates a testing checklist tailored to QA and SP. Checklist generation is driven by a SQL query engine that crafts tests of different complexity. This design facilitates inherent portability, allowing the checks to be used by alternative models. We also introduce a set of cross-task performance metrics evaluating the TRL model's performance over its output. Finally, we show how QATCH automatically generates tests for proprietary datasets to evaluate various state-of-the-art models including TAPAS, TAPEX, and ChatGPT.
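The checklist idea described above can be illustrated with a minimal sketch. It is not QATCH's actual implementation: the function name `generate_checklist`, the two test categories, and the question wording are all assumptions made for illustration. The sketch uses Python's standard `sqlite3` module to read a table's columns, fill simple SQL templates, and run each query so that its result serves as the expected answer.

```python
import sqlite3


def generate_checklist(conn, table):
    """Return a list of test dicts for one table (hypothetical helper)."""
    cur = conn.cursor()
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in cur.execute(f'PRAGMA table_info("{table}")')]

    templates = []
    for col in columns:
        # Simple test: project a single column.
        templates.append((
            "project",
            f"Show all {col} in the table {table}",
            f'SELECT "{col}" FROM "{table}"',
        ))
        # Slightly harder test: return every row sorted by one column.
        templates.append((
            "orderby",
            f"Show all rows of {table} ordered by {col} in ascending order",
            f'SELECT * FROM "{table}" ORDER BY "{col}" ASC',
        ))

    # Running each query against the table yields the expected answer.
    return [
        {"category": cat, "question": q, "sql": sql,
         "expected": cur.execute(sql).fetchall()}
        for cat, q, sql in templates
    ]
```

In this sketch, a QA model would be given the table and `question` and scored against `expected`, while an SP model would be given the schema and `question` and scored against `sql` or its execution result. Because the tests are plain question, query, and answer triples, the same list can be reused across models.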
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Huang, Yizheng, Zeng, Wenjun, Kumaresan, Aditi, Wang, Zi
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
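As a rough illustration of Bayesian quadrature with a GP surrogate, the sketch below estimates the average score over a finite pool of test inputs from a handful of scored inputs. It is a simplified stand-in, not ProEval's method: the RBF kernel, the constant prior mean, the fixed lengthscale, and the helper names `bq_estimate` and `next_query` are all assumptions, and the selection rule is plain maximum-variance sampling rather than the paper's decision strategies.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def bq_estimate(pool, observed_idx, scores, prior_mean=0.0,
                noise=1e-6, lengthscale=1.0):
    """Posterior mean and variance of the average score over the pool."""
    X = pool[observed_idx]
    y = np.asarray(scores, dtype=float)
    K = rbf_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    # z[j] is the kernel between observed point j and the pool, averaged
    # over the pool (uniform measure on the pool).
    z = rbf_kernel(pool, X, lengthscale).mean(axis=0)
    mean = prior_mean + z @ np.linalg.solve(K, y - prior_mean)
    var = rbf_kernel(pool, pool, lengthscale).mean() - z @ np.linalg.solve(K, z)
    return float(mean), float(max(var, 0.0))


def next_query(pool, observed_idx, noise=1e-6, lengthscale=1.0):
    """Index of the pool input with the largest GP posterior variance."""
    X = pool[observed_idx]
    K = rbf_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    K_pool = rbf_kernel(pool, X, lengthscale)
    # The RBF prior variance is 1 at every point.
    post_var = 1.0 - np.einsum("ij,ji->i", K_pool, np.linalg.solve(K, K_pool.T))
    return int(np.argmax(post_var))
```

A typical loop under these assumptions would alternate `next_query` and scoring of the selected input, stopping once the variance returned by `bq_estimate` falls below a chosen threshold. In a transfer-learning setting, the prior mean and kernel hyperparameters would come from pre-training on related benchmarks rather than being fixed by hand.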
Learning to Think from Multiple Thinkers
Joshi, Nirmit, Magen, Roey, Srebro, Nathan, Tsilivis, Nikolaos, Vardi, Gal
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \operatorname{poly}\log\frac{1}{\varepsilon}$.
Schema-learning and rebinding as mechanisms of in-context learning and emergence
In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method, namely clone-structured causal graphs (CSCGs). A key property of CSCGs is that, unlike transformer-based LLMs, they are interpretable, which considerably simplifies the task of explaining how ICL works. We show that ICL in CSCG uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding novel tokens to appropriate slots in the templates. We go on to marshall evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.
Elon Musk and Sam Altman are going to court over OpenAI's future
Elon Musk says he's suing to save the company's mission. The case could have huge consequences for OpenAI and the AI race. After a yearslong legal feud, Elon Musk and OpenAI CEO Sam Altman are heading to trial this week in Northern California in a case that could have sweeping consequences. Ahead of OpenAI's highly anticipated IPO, the court could rule on whether the company is allowed to exist as a for-profit enterprise and might even oust its current executive leadership, including Altman. Musk is suing OpenAI, alleging that Altman and OpenAI president Greg Brockman deceived him into bankrolling the company in its early days by promising to maintain it as a nonprofit dedicated to developing AI that benefits humanity, only to later restructure the company to operate a for-profit subsidiary. Musk cofounded OpenAI with Altman and others in 2015, but he left in 2018 after a bitter power struggle.
Elon Musk Boosts New Yorker's Sam Altman Exposé on X as Trial Begins
The move comes as the trial for Elon Musk's lawsuit against OpenAI kicks off in federal court in Oakland. Elon Musk is boosting a post on X promoting The New Yorker's extensive investigation into Sam Altman's allegedly deceptive behavior, WIRED has confirmed. The move comes just as Musk's lawsuit against OpenAI and Altman heads to a jury trial in a federal courtroom on Monday morning. People scrolling X on Monday reported seeing an April 6 post from Ronan Farrow, a coauthor on the New Yorker article, promoting the investigation. A pop-up on the post on X's mobile app says it was boosted by @elonmusk, who also owns the platform.
OpenAI's GPT-5.5 is faster, smarter, and a step toward its 'super app'
PCWorld reports that OpenAI has launched GPT-5.5, its most advanced AI model, exclusively for paying ChatGPT subscribers on Plus, Pro, Business, and Enterprise plans. The new model delivers faster, more efficient performance in coding, research, and math while outperforming competitors like Google's Gemini 3.1 Pro and Anthropic's Claude Opus 4.7. GPT-5.5 represents a significant step toward OpenAI's 'super app' vision, integrating various AI services into one comprehensive platform. OpenAI recently launched GPT-5.5, which the company describes as its most advanced and intuitive AI model to date. The new model is said to be both faster and more efficient, with specific improvements in areas including coding, research, and math. At the same time, it's said to perform better compared to competing models like Google's Gemini 3.1 Pro and Anthropic's Claude Opus 4.7. According to OpenAI co-founder Greg Brockman, GPT-5.5 is also a step towards the company's vision of a future "super app," where services such as ChatGPT, Codex, and an AI-driven web browser are integrated into a single platform, reports TechCrunch. GPT-5.5 is currently rolling out to paying ChatGPT users, which includes those on Plus, Pro, Business, and Enterprise plans. This article originally appeared on our sister publication PC för Alla and was translated and localized from Swedish.
OpenAI breaks out of exclusivity agreements in its partnership with Microsoft
The two companies announced an amended partnership that lets OpenAI use other cloud platforms and offer its models to other companies. OpenAI is opening up its partnership with Microsoft in the latest amendment to the major multi-year collaboration between the tech giants. The latest changes allow OpenAI to offer its latest AI models to other companies and through other cloud providers, stripping Microsoft of its exclusivity rights. According to a joint announcement posted on OpenAI's and Microsoft's websites, Microsoft will still be OpenAI's primary cloud partner, with the latest products shipping first on Azure, but OpenAI is now allowed to use any cloud provider. Sam Altman, OpenAI's CEO, posted on X that the company is now able to make its products and services available across all clouds.