frontier model
Google says Gemini 3.5 Flash rivals 'large flagship models' for coding and agentic tasks
It can complete tasks in a fraction of the time of other frontier models, Google claims. Google has unveiled Gemini 3.5, starting with the Gemini 3.5 Flash model that promises to outperform Gemini 3.1 Pro in real-world agentic and coding tasks. Announced at Google I/O 2026, this will be Google's default AI model (not to be confused with Flash-Lite), designed to deliver better speed than the current Gemini Pro models at a more affordable price. The tradeoff is lower performance than the 3.5 Pro model (coming next month) in tasks that require deep reasoning and high-context understanding. However, Google has reduced the compromise between the Pro and Flash models, saying Gemini 3.5 Flash delivers intelligence that rivals large flagship models on multiple dimensions.
Introducing ARFBench: A time series question-answering benchmark based on real incidents
More than a trillion dollars are lost every year due to system failures. To resolve them, engineers must troubleshoot outages quickly. An important task in incident response involves analyzing observability metrics, or time series data that snapshot the health of software systems. For example, an engineer for a service may use Datadog to answer questions like "When did latency start increasing?" and "What metrics outside of latency are also behaving abnormally?" to localize the root cause of the anomalous behavior. These time series question-answering (TSQA) tasks are essential for engineers, and present challenging and necessary tasks for SRE models and agents to perform.
Reid Hoffman Thinks Doctors Should Ask AI for a Second Opinion
The LinkedIn cofounder now has an AI drug discovery startup--and thinks not asking chatbots for medical advice is "bordering on committing malpractice." Following a three-decade career at the helm of some of Silicon Valley's most powerful companies--cofounding LinkedIn and sitting on the boards of PayPal and OpenAI--Reid Hoffman recently turned his attention to health care. Hoffman's startup, Manas AI, is building an AI engine that aims to fast-track the traditionally slow process of drug discovery for various cancers. Inspired by a dinner with renowned cancer physician Siddhartha Mukherjee, the company's cofounder and CEO, its mission statement is to "shift drug discovery from a decade-long process to one that takes a few years." But Hoffman's enthusiasm for generative AI, in particular, stretches far beyond novel drug targets and small molecules.
Amazon Has New Frontier AI Models--and a Way for Customers to Build Their Own
Nova Forge lets Amazon's customers train frontier models for different tasks--a potential breakthrough in making AI actually useful for businesses. Amazon has announced a new family of frontier artificial intelligence models--and a new way for customers to build frontier models of their own. The ecommerce giant announced the second generation of its Nova AI models at re:Invent, a company conference held in Las Vegas. The models are nowhere near as popular as those offered by rivals like OpenAI and Google, but Amazon's plan to make them highly customizable could see them gain traction with its cloud users. Amazon detailed two improved large language models, Nova Lite and Nova Pro; a new realtime voice model called Nova Sonic; and a more experimental model called Nova Omni that performs a simulated kind of reasoning using images, audio, and video as well as text.
A Rosetta Stone for AI Benchmarks
Ho, Anson, Denain, Jean-Stanislas, Atanasov, David, Albanie, Samuel, Shah, Rohin
Most AI benchmarks saturate within years or even months after they are introduced, making it hard to study long-run trends in AI capabilities. To address this challenge, we build a statistical framework that stitches benchmarks together, putting model capabilities and benchmark difficulties on a single numerical scale. This acts as a "Rosetta Stone", allowing us to compare models across a wide range of abilities and time, even if they are not evaluated on the same benchmarks. Moreover, this works without assuming how capabilities evolve across time or with training compute. We demonstrate three applications of this framework. First, we use it to measure the speed of AI progress over time, and to forecast future AI capabilities. Second, we estimate the rate of improvements in algorithmic efficiency, finding estimates that are higher, but broadly consistent with prior work. Finally, we find that our approach can be used to detect rapid accelerations in AI progress.
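The idea of placing model capabilities and benchmark difficulties on one scale can be illustrated with a minimal sketch. It assumes a logistic link between (capability minus difficulty) and benchmark score, which is an illustrative choice and not necessarily the authors' model; the function names, the `slope` parameter, and the difficulty values are hypothetical, and a real framework would fit difficulties and capabilities jointly from data.

```python
import math


def predicted_score(capability: float, difficulty: float, slope: float = 1.0) -> float:
    """Assumed logistic link: score rises as capability exceeds difficulty."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - difficulty)))


def implied_capability(score: float, difficulty: float, slope: float = 1.0) -> float:
    """Invert the assumed link to place a model on the shared scale."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return difficulty + math.log(score / (1.0 - score)) / slope


# Hypothetical example: two models evaluated on different benchmarks.
# Model A scores 0.8 on an easy benchmark (difficulty 0.0);
# model B scores 0.5 on a hard benchmark (difficulty 2.0).
capability_a = implied_capability(0.8, difficulty=0.0)
capability_b = implied_capability(0.5, difficulty=2.0)
model_b_is_stronger = capability_b > capability_a
```

Under these assumed difficulties, model B lands higher on the shared scale despite its lower raw score, which is the kind of cross-benchmark comparison the abstract describes.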
AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
Jackson, Declan, Keating, William, Cameron, George, Hill-Smith, Micah
We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
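A minimal sketch of a metric with the properties described above: bounded between -100 and 100, zero when correct and incorrect answers balance, and no penalty for abstaining. The exact formula is an assumption inferred from the abstract, and the function and parameter names are hypothetical.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Assumed formula: 100 * (correct - incorrect) / total questions.

    Abstentions count toward the total but are neither rewarded nor
    penalized, so declining to answer beats answering incorrectly.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Hypothetical counts: same number of correct answers, but the cautious
# model abstains where the guessing model answers incorrectly.
cautious = omniscience_index(correct=40, incorrect=20, abstained=40)
guessing = omniscience_index(correct=40, incorrect=60, abstained=0)
```

With these hypothetical counts the cautious model scores above zero and the guessing model below it, even though both answer the same number of questions correctly.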
GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Chatzikyriakidis, Stergios, Papadakis, Dimitris, Papaioannou, Sevasti-Ioanna, Psaltaki, Erofili
We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Balasubramanian, Sriram, Basu, Samyadeep, Goswami, Koustava, Rossi, Ryan, Manjunatha, Varun, Santhosh, Roshan, Zhang, Ruiyi, Feizi, Soheil, Lipka, Nedim
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
Yaron, Nissan, Bystritsky, Dan, Yaron, Ben-Etzion
We introduce Humains-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humains-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humains-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section 5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
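The equivalence claim above can be illustrated with a rough sketch: a percentile bootstrap over paired per-question differences, followed by a check that the interval sits inside the margin. This is a simplified stand-in for the TOST procedure the abstract names, not the authors' code; the function names, resample count, and seed are assumptions.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample question-level differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


def within_margin(ci, margin):
    """True when the whole interval lies strictly inside (-margin, +margin)."""
    low, high = ci
    return -margin < low and high < margin


# The interval reported in the abstract, in percentage points.
reported_ci = (-3.1, 4.7)
equivalent_at_5pp = within_margin(reported_ci, 5.0)
equivalent_at_3pp = within_margin(reported_ci, 3.0)
```

Applied to the reported interval of -3.1 to +4.7 pp, the check passes at a 5 pp margin and fails at 3 pp, matching the abstract's statement. A formal TOST at the 5% level corresponds to a 90% interval, so the 95% interval used here is the more conservative check.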
QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture
Prakash, Shvetank, Cheng, Andrew, Tschand, Arya, Mazumder, Mark, Gohil, Varun, Ma, Jeffrey, Yik, Jason, Wan, Zishen, Quaye, Jessica, Alvanaki, Elisavet Lydia, Kumar, Avinash, Mazumdar, Chandrashis, Khare, Tuhin, Ingare, Alexander, Uchendu, Ikechukwu, Ghosal, Radhika, Tyagi, Abhishek, Wang, Chenyu, Garavagno, Andrea Mattia, Gu, Sarah, Guo, Alice, Hur, Grace, Carloni, Luca, Krishna, Tushar, Nayak, Ankita, Yazdanbakhsh, Amir, Reddi, Vijay Janapa
The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To this end, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems. With over 140 contributors from 40 institutions, this benchmark represents a community effort to set the standard for architectural reasoning in LLM evaluation.