Technology
1 Supplementary Material
To investigate this further, we first observe that Claude-3.7-Sonnet [...] Figure 1 shows the average pass rate under budgets of 12,000, 14,000, 16,000, and 17,000 tokens. As the data demonstrate, enlarging the thinking budget yields no appreciable improvement in performance. This finding underscores the challenging nature of ENGDESIGN and suggests its value as a rigorous testbed for future efforts to enhance LLMs' engineering design proficiency. Figure 1: Average pass rate (%) of Claude-3.7-Thinking
Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets
We prove that the solution space of 2-layer neural networks with quadratic activation and L2 loss, trained on reasoning tasks in Abelian groups (e.g., modular addition), has a rich algebraic structure. This structure enables the analytical construction of globally optimal solutions from partial solutions that satisfy only part of the loss, despite the loss being highly nonlinear.
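As a rough illustration of the setting (not the paper's construction), the sketch below hand-builds a 2-layer network with squaring activation that solves addition modulo d. It assumes Fourier-feature weights and uses the identity xy = ((x+y)^2 - (x-y)^2)/4 to obtain products from squared hidden units. The function names `quad_net_mod_add` and `predict` are hypothetical.

```python
import math

def quad_net_mod_add(a, b, d):
    """Scores for every candidate c from a hand-built net with x**2 activation.

    Assumption: weights are Fourier features chosen by hand for illustration.
    Each hidden unit squares a sum of one weight selected by a and one by b,
    which is what a linear layer on one-hot inputs followed by x**2 computes.
    """
    scores = [0.0] * d
    for k in range(d):  # eight hidden units per frequency k
        w = 2 * math.pi * k / d
        ca, sa = math.cos(w * a), math.sin(w * a)
        cb, sb = math.cos(w * b), math.sin(w * b)
        # Products recovered from squared units: xy = ((x+y)^2 - (x-y)^2) / 4
        cos_cos = ((ca + cb) ** 2 - (ca - cb) ** 2) / 4
        sin_sin = ((sa + sb) ** 2 - (sa - sb) ** 2) / 4
        sin_cos = ((sa + cb) ** 2 - (sa - cb) ** 2) / 4
        cos_sin = ((ca + sb) ** 2 - (ca - sb) ** 2) / 4
        cos_sum = cos_cos - sin_sin  # cos(w * (a + b))
        sin_sum = sin_cos + cos_sin  # sin(w * (a + b))
        # Linear output layer: adds cos(w * (a + b - c)) to the score of c
        for c in range(d):
            scores[c] += cos_sum * math.cos(w * c) + sin_sum * math.sin(w * c)
    return scores

def predict(a, b, d):
    """Return the candidate c with the highest score."""
    scores = quad_net_mod_add(a, b, d)
    return max(range(d), key=scores.__getitem__)
```

Summing over all frequencies gives a score of d at c = (a + b) mod d and approximately 0 elsewhere, so `predict(a, b, d)` recovers the modular sum up to floating-point error.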
ChartSketcher Reasoning with Feedback and Reflection for Chart Understanding
Charts are high-density visualization carriers for complex data, serving as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) due to the need for precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, enabling MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library, iteratively feeding these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.
ChatGPT can be made to generate sexualised and violent images, researchers find
The latest public version of ChatGPT can be made to generate sexualised images or depict scenes of graphic violence with a simple prompt, researchers have told the BBC. British AI security startup Mindgard figured out how to make ChatGPT create graphic pictures by slightly altering a widely-shared instruction, or prompt, which was originally designed to produce humorous results. After being contacted by the BBC, ChatGPT's maker OpenAI said it had taken action to stop the chatbot responding with those types of images. "After investigating this trend, we've introduced additional safeguards against this type of prompt," it said in a statement. It also said it has multiple layers of protection to prevent users making content which breaches its terms and conditions.
'We had to get out of the way': The backlash over delivery robots
The first time Chicago resident John Roberts saw a delivery robot trundling down the sidewalk on his street he was impressed. "I actually thought they were kind of neat - it felt futuristic," he says. But his attitude started to change when, soon after, he was out for a walk with his family. As another robot approached, they found themselves having to dodge it. "To us it felt a little off - the fact that we were on the one strip reserved for walking, and we were having to get out of the way," says Roberts.
MoCha: Towards Movie-Grade Talking Character Generation
Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text.
Simultaneous Statistical Inference for Off-Policy Evaluation in Reinforcement Learning
This work presents the first theoretically justified simultaneous inference framework for off-policy evaluation (OPE). In contrast to existing methods that focus on point estimates or pointwise confidence intervals (CIs), the new framework quantifies global uncertainty across an infinite or continuous initial state space, offering valid inference over the entire state space.
The Korean Telecom Giant at the Center of Anthropic's Mythos Controversy
Days before Anthropic took its most advanced AI models offline, the White House ordered the company to revoke SK Telecom's access to Claude Mythos over claims of alleged ties to China. The Trump administration's move to impose export controls on Anthropic's most powerful AI technology followed a spat over the company granting South Korean telecom giant SK Telecom access to its Claude Mythos model, according to people familiar with the matter. US officials were concerned about what they alleged were SK Telecom's ties to China, those people said. Those concerns appear to have compounded when Amazon later flagged vulnerabilities it identified in Fable 5 to the White House. Fable 5 is a highly safeguarded version of Mythos that Anthropic released to the public on June 9.
Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage finetuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3-8B by 9.2% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and Zebra Logic under two-stage finetuning.
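To make the idea of turning code execution into a step-by-step rationale concrete, here is a minimal sketch that records variable states while a function runs and renders them as text. It is not the corpus's actual pipeline, which the abstract does not describe. The helpers `trace_execution` and `render` and the `gcd` example are hypothetical, and the sketch relies only on Python's standard `sys.settrace` hook.

```python
import sys

def trace_execution(func, *args):
    """Run func(*args), recording local variables at each executed line.

    Each step is (line offset within func, locals just before that line runs);
    the final step is ("return", return value).
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        elif event == "return":
            steps.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore any tracer that was already active
    return result, steps

def render(steps):
    """Turn recorded steps into a plain-text, step-by-step rationale."""
    lines = []
    for where, state in steps:
        if where == "return":
            lines.append(f"return {state!r}")
        else:
            values = ", ".join(f"{k}={v!r}" for k, v in state.items())
            lines.append(f"line +{where}: {values}")
    return "\n".join(lines)

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, steps = trace_execution(gcd, 12, 18)
rationale = render(steps)
```

Here `rationale` holds one line per executed statement with the variable values at that point, which is the kind of explicit trace a variable-tracing question could be built from.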
Multi-Kernel Correlation-Attention Vision Transformer for Enhanced Contextual Understanding and Multi-Scale Integration
Significant progress has been achieved using Vision Transformers (ViTs) in computer vision. However, challenges persist in modeling multi-scale spatial relationships, hindering effective integration of fine-grained local details and long-range global dependencies. To address this limitation, a Multi-Kernel Correlation-Attention Vision Transformer (MK-CAViT) grounded in the Hirschfeld-Gebelein-Rényi (HGR) theory was proposed, introducing three key innovations. A parallel multi-kernel architecture was utilized to extract multi-scale features through small, medium, and large kernels, overcoming the single-scale constraints of conventional ViTs. The cross-scale interactions were enhanced through the Fast-HGR attention mechanism, which models nonlinear dependencies and applies adaptive scaling to weigh connections and refine contextual reasoning. Additionally, a stable multi-scale fusion strategy was adopted, integrating dynamic normalization and staged learning to mitigate gradient variance, progressively fusing local and global contexts, and improving training stability.