powerful model
C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning
Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost. We introduce C3PO (Cost Controlled Cascaded Prediction Optimization), a self-supervised framework for optimizing LLM cascades under probabilistic cost constraints. By focusing on minimizing regret with respect to the most powerful model (MPM), C3PO avoids the need for labeled data by constructing a cascade using only unlabeled model outputs. It leverages conformal prediction to bound the probability that inference cost exceeds a user-specified budget. We provide theoretical guarantees on both cost control and generalization error, and show that our optimization procedure is effective even with small calibration sets. Empirically, C3PO achieves state-of-the-art performance across a diverse set of reasoning benchmarks including GSM8K, MATH-500, BigBench-Hard and AIME, outperforming strong LLM cascading baselines in both accuracy and cost-efficiency. Our results demonstrate that principled, label-free cascade optimization can enable scalable LLM deployment.
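The idea of a cascade whose cost is bounded with conformal prediction can be illustrated with a minimal sketch. This is not the C3PO algorithm itself. It assumes a hypothetical two-model cascade in which a query is escalated when the small model's confidence falls below a threshold. The function names, the per-query costs, and the rule of picking the largest feasible threshold are all illustrative assumptions. Selecting a threshold on the same calibration data used for the bound also weakens the formal guarantee, so treat this as intuition only.

```python
import math

def conformal_upper_bound(costs, alpha):
    """Split-conformal upper quantile of calibration costs.

    Under exchangeability, a fresh cost exceeds the returned value
    with probability at most alpha.
    """
    n = len(costs)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # calibration set too small for this alpha
    return sorted(costs)[k - 1]

def cascade_cost(confidence, threshold, small_cost, large_cost):
    """Cost of one query: the small model always runs; the large model
    runs only when the small model's confidence is below the threshold."""
    if confidence >= threshold:
        return small_cost
    return small_cost + large_cost

def pick_threshold(confidences, candidates, budget, alpha,
                   small_cost=1.0, large_cost=10.0):
    """Return the largest candidate threshold (most escalation) whose
    conformal cost bound stays within the budget, or None if none does.

    All cost values here are hypothetical units, not measured numbers.
    """
    feasible = []
    for t in candidates:
        costs = [cascade_cost(c, t, small_cost, large_cost)
                 for c in confidences]
        if conformal_upper_bound(costs, alpha) <= budget:
            feasible.append(t)
    return max(feasible) if feasible else None

# Hypothetical small-model confidences on an unlabeled calibration set.
calibration_confidences = [0.9, 0.8, 0.95, 0.3, 0.85, 0.7, 0.99, 0.6, 0.92, 0.88]
chosen = pick_threshold(calibration_confidences,
                        candidates=[0.5, 0.65, 0.75, 0.9],
                        budget=5.0, alpha=0.3)
```

Only unlabeled confidences are needed for the cost side of the problem, which matches the label-free setting the abstract describes; how the paper controls regret against the most powerful model is not shown here.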
The Download: meet Cathy Tie, and Anthropic's new AI models
Since the Chinese biophysicist He Jiankui was released from prison in 2022, he has sought to make a scientific comeback and to repair his reputation after a three-year incarceration for illegally creating the world's first gene-edited children. One area of visible success on his come-back trail has been his X.com account. Over the past few years, his account has evolved from sharing mundane images of his daily life to spreading outrageous, antagonistic messages. This has left observers unsure what to take seriously. Last month, in reply to MIT Technology Review's questions about who was responsible for the account's transformation into a font of clever memes, He emailed us back: "It's thanks to Cathy Tie." Tie is no stranger to the public spotlight.
eufy launches the world's first robot vacuum with a portable deep cleaner (plus other powerful model)
Are you ready to completely overhaul your floor cleaning routines? The eufy E28 and E25 have just landed, and we bet you're going to want one of these robotic cleaners delivered to your home sooner rather than later. Under pre-sale now, be sure to order and save your spot in line with these powerful units. The E25 and E28 are two brand-new models unveiled by the company. Both feature eufy's award-winning HydroJet mopping technology and deliver a jaw-dropping 20,000Pa suction power for a deep clean.
How Do You Measure AI?
Millions of people use artificial intelligence (AI) tools like ChatGPT daily to do everything from generating code to drawing images to creating business ideas. Those AI tools appear to be getting better. Back in November 2022 when it was launched, ChatGPT was powered by GPT-3.5, at the time the most powerful model offered by OpenAI. Yet GPT-3.5 was quickly eclipsed by GPT-4 just a few months later. GPT-4 crushed GPT-3.5 on a range of benchmarks, including its performance on the bar exam (GPT-4 scored in the 90th percentile; GPT-3.5 in the 10th).
Under Trump, AI Scientists Are Told to Remove 'Ideological Bias' From Powerful Models
The National Institute of Standards and Technology (NIST) has issued new instructions to scientists that partner with the US Artificial Intelligence Safety Institute (AISI) that eliminate mention of "AI safety," "responsible AI," and "AI fairness" in the skills it expects of members and introduces a request to prioritize "reducing ideological bias, to enable human flourishing and economic competitiveness." The information comes as part of an updated cooperative research and development agreement for AI Safety Institute consortium members, sent in early March. Previously, that agreement encouraged researchers to contribute technical work that could help identify and fix discriminatory model behavior related to gender, race, age, or wealth inequality. Such biases are hugely important because they can directly affect end users and disproportionately harm minorities and economically disadvantaged groups. The new agreement removes mention of developing tools "for authenticating content and tracking its provenance" as well as "labeling synthetic content," signaling less interest in tracking misinformation and deep fakes.
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Anagnostidis, Sotiris, Bachmann, Gregor, Kim, Yeongmin, Kohler, Jonas, Georgopoulos, Markos, Sanakoyeu, Artsiom, Du, Yuming, Pumarola, Albert, Thabet, Ali, Schönfeld, Edgar
Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones, dubbed FlexiDiT, allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.
Revisiting Robust RAG: Do We Still Need Complex Robust Training in the Era of Powerful LLMs?
Ding, Hanxing, Tao, Shuchang, Pang, Liang, Wei, Zihao, Chen, Liwei, Xu, Kun, Shen, Huawei, Cheng, Xueqi
Retrieval-augmented generation (RAG) systems often suffer from performance degradation when encountering noisy or irrelevant documents, driving researchers to develop sophisticated training strategies to enhance their robustness against such retrieval noise. However, as large language models (LLMs) continue to advance, the necessity of these complex training methods is increasingly questioned. In this paper, we systematically investigate whether complex robust training strategies remain necessary as model capacity grows. Through comprehensive experiments spanning multiple model architectures and parameter scales, we evaluate various document selection methods and adversarial training techniques across diverse datasets. Our extensive experiments consistently demonstrate that as models become more powerful, the performance gains brought by complex robust training methods drop off dramatically. We delve into the rationale and find that more powerful models inherently exhibit superior confidence calibration, better generalization across datasets (even when trained with randomly selected documents), and optimal attention mechanisms learned with simpler strategies. Our findings suggest that RAG systems can benefit from simpler architectures and training strategies as models become more powerful, enabling more scalable applications with minimal complexity.