Large Language Model
Cost-aware LLM-based Online Dataset Annotation
Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.
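The weighted majority voting step mentioned above can be illustrated with a minimal sketch. It assumes each selected LLM comes with an estimated accuracy (for example, a lower bound such as the one CaMVo is described as estimating) and weights each vote by the log-odds of that estimate, which is a standard choice and not necessarily the weighting used in the paper. The function name `weighted_majority_vote` and the example numbers are hypothetical, and the LinUCB-based model selection is not shown.

```python
import math
from collections import defaultdict


def weighted_majority_vote(labels, accuracies):
    """Return the label with the largest total log-odds weight.

    labels:     one predicted label per queried LLM.
    accuracies: one estimated accuracy in (0, 1) per queried LLM
                (assumed given; how it is estimated is out of scope here).
    """
    scores = defaultdict(float)
    for label, acc in zip(labels, accuracies):
        # Clamp to avoid division by zero or log(0) at the extremes.
        acc = min(max(acc, 1e-6), 1 - 1e-6)
        # Log-odds weight: models estimated above 50% get a positive vote.
        scores[label] += math.log(acc / (1 - acc))
    return max(scores, key=scores.get)


# One highly reliable model can outweigh two weaker ones that disagree.
strong_minority = weighted_majority_vote(["pos", "neg", "neg"], [0.95, 0.6, 0.6])
# With equal estimates, the rule reduces to plain majority voting.
plain_majority = weighted_majority_vote(["pos", "neg", "neg"], [0.6, 0.6, 0.6])
```

Under these assumptions, the weighting lets a small subset of well-calibrated models stand in for a full vote, which is the intuition behind querying fewer LLMs per instance.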
Evaluating Program Semantics Reasoning with Type Inference in System F
Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute reasoning capabilities promise significant potential in understanding program logic and semantics beyond mere token recognition. However, current benchmarks evaluating reasoning LLMs for code lack a formal, program-centric deductive framework to ensure sound evaluation, and so cannot assess whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as *program semantics reasoning*. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet)
Claude's 'too dangerous' AI model is finally public. But there's a catch
Anthropic's Claude Fable 5 AI model is now publicly available through the Claude desktop app, showing major improvements in coding, spatial reasoning, and cybersecurity capabilities. PCWorld reports that paid subscribers can access this powerful "Mythos-class" model until June 23rd, after which it requires separate usage credits due to capacity constraints. The release includes conservative safeguards due to extreme cybersecurity capabilities that could potentially be misused, with a less-restricted version available only to select cyberdefenders.
Version of AI tool 'too powerful for public' released to public
A version of an artificial intelligence (AI) tool which the company said was too powerful to be released to the public has just been released to the public. Claude Fable 5 is a version of Anthropic's Claude Mythos, an AI program which caused serious concerns among technology, finance, and government leaders when it was released privately in April for previewing and testing. Some worry the tool is so powerful it could pose financial security risks, though others have questioned how much of the hype is marketing spin. Anthropic said on Tuesday Fable will be released with safeguards and user limitations in place, though it said releasing a model this capable comes with risks. "Fable's capabilities exceed those of any model we've ever made generally available," it added.
Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs
Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model's true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements.
Transcending Cost-Quality Tradeoff in Agent Serving via Session-Awareness
Large Language Model (LLM) agents are capable of task execution across various domains by autonomously interacting with environments and refining LLM responses based on feedback. However, existing model serving systems are not optimized for the unique demands of serving agents. Compared to classic model serving, agent serving has different characteristics: predictable request pattern, increasing quality requirement, and unique prompt formatting. We identify a key problem for agent serving: LLM serving systems lack session-awareness. They neither perform effective KV cache management nor precisely select the cheapest yet competent model in each round. This leads to a cost-quality tradeoff, and we identify an opportunity to surpass it in an agent serving system. To this end, we introduce AgServe for AGile AGent SERVing.
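The idea of selecting "the cheapest yet competent model in each round" can be sketched as follows. This is only an illustration under stated assumptions, not AgServe's actual algorithm: the `ModelOption` type, its fields, the quality scores, and the fallback rule are all hypothetical, and a real system would also need to estimate quality and account for KV cache reuse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float   # hypothetical price, arbitrary units
    estimated_quality: float    # hypothetical score in [0, 1]


def pick_model(options, required_quality):
    """Pick the cheapest model whose estimated quality meets the requirement.

    If no model is competent enough, fall back to the highest-quality one.
    """
    competent = [m for m in options if m.estimated_quality >= required_quality]
    if not competent:
        return max(options, key=lambda m: m.estimated_quality)
    return min(competent, key=lambda m: m.cost_per_1k_tokens)


options = [
    ModelOption("small", 0.1, 0.60),
    ModelOption("medium", 0.5, 0.80),
    ModelOption("large", 2.0, 0.95),
]

# A quality requirement that rises over the rounds of one agent session.
choices = [pick_model(options, q).name for q in (0.5, 0.7, 0.9)]
```

In this toy session the router moves from the small to the large model only as the requirement tightens, which is the kind of per-round decision a session-aware serving system could make.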
Anthropic Offers Mythos Upgrade for Cyber Partners and a 'Safe' Version for the Rest of You
Anthropic is releasing Claude Mythos 5 to trusted organizations and Claude Fable 5 to the public, a version it says can't be used for cyberattacks. Anthropic released two new AI models called Claude Fable 5 and Claude Mythos 5 on Tuesday, which the company says have greater capabilities than the Mythos Preview model it released in April to a limited set of tech industry partners. Anthropic has said the initial, limited release stemmed from concerns that the model's capabilities could be exploited by bad actors to develop hacking tools that could catch defenders off guard. Anthropic is currently only releasing Claude Mythos 5 to a limited set of industry partners, many of which received access to Mythos Preview, and the company says it is collaborating with the US government on the rollout. Claude Fable 5, which is being publicly released, uses the same underlying model as Mythos 5, but will have "guardrails" in place at launch, the company said Tuesday, that will block the model from answering many user questions related to cybersecurity, biology, and chemistry.
Variational Uncertainty Decomposition for In-Context Learning
As large language models (LLMs) gain popularity in conducting prediction tasks in-context, understanding the sources of uncertainty in in-context learning becomes essential to ensuring reliability. The recent hypothesis of in-context learning performing predictive Bayesian inference opens the avenue for Bayesian uncertainty estimation, particularly for decomposing uncertainty into epistemic uncertainty due to lack of in-context data and aleatoric uncertainty inherent in the in-context prediction task. However, the decomposition idea remains under-explored due to the intractability of the latent parameter posterior from the underlying Bayesian model. In this work, we introduce a variational uncertainty decomposition framework for in-context learning without explicitly sampling from the latent parameter posterior, by optimising auxiliary inputs as probes to obtain an upper bound to the aleatoric uncertainty of an LLM's in-context learning procedure. Through experiments on synthetic and real-world tasks, we show quantitatively and qualitatively that the decomposed uncertainties obtained from our method exhibit desirable properties of epistemic and aleatoric uncertainty.
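For readers unfamiliar with the epistemic/aleatoric split, the classic entropy-based decomposition can be sketched in a few lines. Note the assumption: this version averages over explicit samples of predictive distributions (as from an ensemble or posterior samples), which is exactly what the abstract says its variational method avoids, so it illustrates the quantities being decomposed rather than the paper's procedure. The function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def decompose(predictions):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    predictions: list of class-probability vectors, one per sampled model.
    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (the mutual information term)
    """
    n = len(predictions)
    k = len(predictions[0])
    mean = [sum(p[i] for p in predictions) / n for i in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in predictions) / n
    return total, aleatoric, total - aleatoric


# Models that confidently disagree: all uncertainty is epistemic.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
# Models that agree the outcome is a coin flip: all uncertainty is aleatoric.
agree = decompose([[0.5, 0.5], [0.5, 0.5]])
```

The two toy cases have the same total uncertainty but opposite decompositions, which is why the split matters: more in-context data can reduce the first kind but not the second.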
ChatGPT can be hijacked without you knowing. Lockdown Mode is the fix
PCWorld reports that OpenAI launched Lockdown Mode for ChatGPT to combat prompt injection attacks that can hijack AI systems and steal personal information. These attacks have previously compromised AI browsers like Perplexity and controlled smart home devices through Google Gemini by tricking systems with malicious instructions. Lockdown Mode restricts features like live web browsing and Deep Research across all ChatGPT plans, though OpenAI acknowledges risks from uploaded files remain. OpenAI has launched a new security feature in ChatGPT called Lockdown Mode, designed to provide additional protection against so-called "prompt injection attacks." A prompt injection attack is when someone crafts a deceptive prompt in an attempt to trick the LLM into following malicious instructions and/or revealing sensitive information.
Quadratic Coreset Selection: Certifying and Reconciling Sequence and Token Mining for Efficient Instruction Tuning
Instruction-Tuning (IT) has recently been found to offer impressive data efficiency in post-training large language models (LLMs). However, the pursuit of efficiency predominantly focuses on sequence-level curation, often overlooking the nuanced impact of critical tokens and the inherent risks of token noise and biases. Drawing inspiration from bi-level coreset selection, our work provides a principled view of the motivation behind selecting instructions' responses. It leads to our approach, Quadratic Coreset Selection (QCS), which reconciles sequence-level and token-level influence contributions, deriving more expressive LLMs with established theoretical results. Although the original QCS framework is challenged by the prohibitive computation of inverting LLM-scale Hessian matrices, we overcome this barrier by proposing a novel probabilistic variant of QCS, which relaxes the original formulation through re-parameterized densities. This solver is efficiently learned using hierarchical policy gradients without requiring back-propagation, achieving provable convergence and certified asymptotic equivalence to the original objective. Our experiments demonstrate QCS's superior sequence-level data efficiency and reveal how strategically leveraging token-level influence elevates the performance ceiling of data-efficient IT. Furthermore, QCS's adaptability is showcased through its successes in regular IT and challenging targeted IT scenarios, particularly in the cases of free-form complex instruction-following and CoT reasoning. These results underscore QCS's potential for a wide array of versatile post-training applications.