Grok 4
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > South Korea (0.04)
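The pass/fail aggregation described in the abstract above can be made concrete with a minimal sketch. The 60% per-level passing threshold and the per-exam score layout here are illustrative assumptions, not the paper's exact criterion:

from statistics import mean

# Hypothetical per-exam accuracies (fraction correct) for one model.
mock_scores = {
    "Level I":   [0.976, 0.95, 0.94],  # three Level I mock exams
    "Level II":  [0.93, 0.943],        # two Level II mock exams
    "Level III": [0.86, 0.84, 0.88],   # three Level III mock exams
}

PASS_THRESHOLD = 0.60  # assumed minimum passing score per level

def passes_all_levels(scores: dict[str, list[float]]) -> bool:
    """A model clears the exams if its mean score at every level
    meets the assumed threshold."""
    return all(mean(exams) >= PASS_THRESHOLD for exams in scores.values())

print(passes_all_levels(mock_scores))  # True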
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
Hadeliya, Tsimur, Jauhar, Mohammad Ali, Sakpal, Nidhi, Cruz, Diogo
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. Newer LLMs offer longer context windows and support tool calling, yet prior work has focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setup relatively unexplored from both capability and safety perspectives. Our work addresses this gap. We find that LLM agents can be sensitive to the length, type, and placement of context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% on both benign and harmful tasks. Refusal rates shift unpredictably: at 200K tokens, GPT-4.1-nano's refusal rate rises from ~5% to ~40% while Grok 4 Fast's falls from ~80% to ~10%. Our work surfaces potential safety issues with agents operating on longer contexts and raises further questions about the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
- North America > United States > New Jersey (0.04)
- Europe > United Kingdom (0.04)
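A minimal sketch of the measurement the abstract above describes (refusal rate as a function of context length) might look like the following; query_agent, the word-based padding, and the string-match refusal check are all placeholder assumptions rather than the authors' setup:

def query_agent(prompt: str) -> str:
    """Placeholder for a real agent invocation (e.g., an API call
    with tool use enabled)."""
    raise NotImplementedError

def refusal_rate(requests: list[str], filler: str, target_tokens: int) -> float:
    """Pad each request with distractor context up to roughly
    target_tokens words (a crude proxy for tokens), then count
    how often the agent refuses."""
    refusals = 0
    for request in requests:
        padding = " ".join([filler] * target_tokens)
        reply = query_agent(padding + "\n\n" + request)
        if "i can't" in reply.lower() or "i cannot" in reply.lower():
            refusals += 1
    return refusals / len(requests)

# Sweeping context lengths, as in the study's 100K-200K token range:
# for n in (1_000, 100_000, 200_000):
#     print(n, refusal_rate(some_requests, filler="lorem", target_tokens=n))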
Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy
Jackson, Daniel I, Jensen, Emma L, Hussain, Syed-Amad, Sezgin, Emre
Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > Costa Rica (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
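The repeated-administration protocol in the abstract above is easy to sketch, assuming the standard 10-item GSES with 1-4 Likert responses and a 10-40 total score. The item wording and the ask_model call are placeholders:

import random
from statistics import stdev

GSES_ITEMS = [f"GSES item {i}" for i in range(1, 11)]  # placeholder wording

def ask_model(item: str) -> int:
    """Hypothetical call that elicits a 1-4 Likert rating from the LLM."""
    raise NotImplementedError

def administer_once(seed: int) -> int:
    """One administration with a randomized item order; total score 10-40."""
    items = GSES_ITEMS[:]
    random.Random(seed).shuffle(items)  # randomized item order
    return sum(ask_model(item) for item in items)

# Stability across repeated administrations:
# scores = [administer_once(seed) for seed in range(5)]
# print(stdev(scores))  # low spread = stable simulated self-assessment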
Towards Robust Mathematical Reasoning
Luong, Thang, Hwang, Dawsen, Nguyen, Hoang H., Ghiasi, Golnaz, Chervonyi, Yuri, Seo, Insuk, Kim, Junsu, Bingham, Garrett, Lee, Jonathan, Mishra, Swaroop, Zhai, Alex, Hu, Clara Huiyi, Michalewski, Henryk, Kim, Jimin, Ahn, Jeonghyun, Bae, Junhwi, Song, Xingyou, Trinh, Trieu H., Le, Quoc V., Jung, Junehyuk
Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation of proof-writing capabilities, which includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning; it is released at https://imobench.github.io/.
- North America > United States (0.04)
- Europe > Austria (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- (2 more...)
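The agreement check that IMO-GradingBench enables, comparing autograder marks against human gradings, reduces to a simple correlation. The 0-7 scale and the grades below are invented for illustration:

from statistics import correlation  # Pearson r, Python 3.10+

# Hypothetical proof grades on a 0-7 Olympiad-style scale.
human_grades     = [7, 0, 3, 7, 1, 5, 7, 2]
autograder_marks = [7, 1, 3, 6, 1, 5, 7, 2]

r = correlation(human_grades, autograder_marks)
print(f"Pearson r = {r:.2f}")  # values near 1.0 indicate close agreement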
Don't Change My View: Ideological Bias Auditing in Large Language Models
As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
- Africa > South Africa (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New Jersey (0.04)
- (6 more...)
- Research Report (1.00)
- Personal > Interview (0.46)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Law (0.93)
- Transportation (0.66)
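One way to sketch the distributional-shift idea from the abstract above: score sampled outputs on thematically related prompts with a scalar stance measure, then compare the audited model's distribution against a reference. The stance_score helper is hypothetical, and a two-sample KS test stands in here for the paper's adapted statistical method:

from scipy.stats import ks_2samp

def stance_score(text: str) -> float:
    """Hypothetical scalar in [-1, 1], e.g., from a stance classifier."""
    raise NotImplementedError

def audit(reference_outputs: list[str], audited_outputs: list[str]) -> float:
    """Compare stance distributions of two output samples; no access
    to model internals is needed, only generated text."""
    ref = [stance_score(text) for text in reference_outputs]
    aud = [stance_score(text) for text in audited_outputs]
    statistic, p_value = ks_2samp(ref, aud)  # two-sample distribution test
    return p_value

# A consistently small p-value across many topic-related prompt sets
# would flag potential ideological steering in the black-box model.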
The Mathematician's Assistant: Integrating AI into Research Practice
The rapid development of artificial intelligence (AI), marked by breakthroughs like 'AlphaEvolve' and 'Gemini Deep Think', is beginning to offer powerful new tools that have the potential to significantly alter research practice in many areas of mathematics. This paper explores the current landscape of publicly accessible large language models (LLMs) in a mathematical research context, based on developments up to August 2, 2025. Our analysis of recent benchmarks, such as MathArena and the Open Proof Corpus (Balunović et al., 2025; Dekoninck et al., 2025), reveals a complex duality: while state-of-the-art models demonstrate strong abilities in solving problems and evaluating proofs, they also exhibit systematic flaws, including a lack of self-critique and a model-dependent discrepancy between final-answer accuracy and full-proof validity. Based on these findings, we propose a durable framework for integrating AI into the research workflow, centered on the principle of the augmented mathematician. In this model, the AI functions as a copilot under the critical guidance of the human researcher, an approach distilled into five guiding principles for effective and responsible use. We then systematically explore seven fundamental ways AI can be applied across the research lifecycle, from creativity and ideation to the final writing process, demonstrating how these principles translate into concrete practice. We conclude that the primary role of AI is currently augmentation rather than automation, which demands a new skill set focused on strategic prompting, critical verification, and methodological rigor to use these powerful tools effectively.
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Europe > Germany (0.04)
- Research Report > New Finding (0.67)
- Personal > Honors (0.67)
- Research Report > Promising Solution (0.66)
- Information Technology > Security & Privacy (1.00)
- Education > Educational Setting > Higher Education (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
R-ConstraintBench: Evaluating LLMs on NP-Complete Scheduling
The reliability of large language models (LLMs) when reasoning under high-constraint regimes is insufficiently characterized. To address this gap, we present R-ConstraintBench, a scalable framework that evaluates models on Resource-Constrained Project Scheduling Problems (RCPSP), an NP-complete feasibility class, in which difficulty increases via linear growth in the number of constraints. R-ConstraintBench incrementally adds non-redundant precedence constraints to Directed Acyclic Graphs (DAGs) and then introduces downtime, temporal-window, and disjunctive constraints. As an illustrative example, we instantiate the benchmark in a data center migration setting and evaluate multiple LLMs using feasibility and error analysis, identifying degradation thresholds and the constraint types most associated with failure. Empirically, strong models are near ceiling on precedence-only DAGs, but feasibility performance collapses when downtime, temporal-window, and disjunctive constraints interact, implicating constraint interaction, not graph depth, as the principal bottleneck. Performance on clean synthetic ramps also does not guarantee transfer to domain-grounded scenarios, underscoring limited generalization.
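A minimal sketch of the precedence-only ramp described above: grow a DAG by adding precedence edges one at a time, rejecting any edge already implied by the current graph. The generator below is an illustrative assumption; it omits the downtime, temporal-window, and disjunctive constraints where the benchmark reports collapse:

import random

def reachable(adj: dict[int, set[int]], src: int, dst: int) -> bool:
    """Depth-first check of whether dst is reachable from src."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False

def grow_dag(n_tasks: int, n_edges: int, seed: int = 0) -> dict[int, set[int]]:
    """Add precedence edges one at a time; sorting each sampled pair
    fixes a topological order, so the graph stays acyclic."""
    rng = random.Random(seed)
    adj = {task: set() for task in range(n_tasks)}
    added = 0
    while added < n_edges:
        a, b = sorted(rng.sample(range(n_tasks), 2))
        if not reachable(adj, a, b):  # reject edges already implied
            adj[a].add(b)
            added += 1
    return adj

print(sum(len(v) for v in grow_dag(20, 30).values()))  # 30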
Grok AI is now part of new Tesla vehicles
The CyberGuy, Kurt Knutsson, gives his take on Elon Musk's claims that Grok 3 outperforms every AI rival on 'Fox & Friends.' Chatting with Grok while cruising in your Tesla is now a reality. The conversational artificial intelligence is being included in newer models, according to Elon Musk. Having Grok around will hopefully make your drive more engaging. It will be like having a buddy with you along for the ride.
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks > Manufacturer (1.00)
- Transportation > Electric Vehicle (0.89)
Elon Musk unveils bizarre new kids project after humiliating anti-Semitism disaster
Just a few weeks after Elon Musk's chatbot praised Hitler and denied the Holocaust, he's now looking to turn it into a playmate for kids. Musk is calling the version Baby Grok, and added that it would offer 'kid-friendly content' through a new app developed by his company xAI. He made the announcement Saturday night on X, where the post quickly drew over 28 million views within 24 hours. The move left many stunned, coming just two weeks after Grok 4, the latest version of Elon Musk's AI chatbot, sparked backlash for repeating far-right hate speech and white nationalist talking points when asked about politics, race, and recent news events. Multiple users reported on July 8 and July 9 that Grok echoed anti-Semitic conspiracy theories, including claims that Jewish people control Hollywood, promote hatred toward white people, and should be imprisoned in camps, though it is still unclear how many of these posts were confirmed before xAI took them down.
I Tried Grok's Built-In Anime Companion and It Called Me a Twat
Its name is Ani, and it cost me $300. Elon Musk's xAI dropped the new visual chatbot feature on Monday in the Grok iOS app. The top-tier subscription unlocks access to xAI's best-performing model, Grok 4 Heavy, and special settings for interacting with two custom characters designed for flirting or chatting. A third character, which looks a bit like a sexy boyfriend, is listed as "coming soon." It's not xAI's first dip into adult content, either: back in February 2024, the company rolled out a chatbot mode for "sexy" conversations.