Grok 4
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > South Korea (0.04)
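The pass/fail aggregation described in the abstract above can be made concrete with a minimal sketch. The 60% per-level passing threshold and the per-exam score layout here are illustrative assumptions, not the paper's exact criterion:

from statistics import mean

# Hypothetical per-exam accuracies (fraction correct) for one model.
mock_scores = {
    "Level I":   [0.976, 0.95, 0.94],  # three Level I mock exams
    "Level II":  [0.93, 0.943],        # two Level II mock exams
    "Level III": [0.86, 0.84, 0.88],   # three Level III mock exams
}

PASS_THRESHOLD = 0.60  # assumed minimum passing score per level

def passes_all_levels(scores: dict[str, list[float]]) -> bool:
    """A model clears the exams if its mean score at every level
    meets the assumed threshold."""
    return all(mean(exams) >= PASS_THRESHOLD for exams in scores.values())

print(passes_all_levels(mock_scores))  # True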
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
Hadeliya, Tsimur, Jauhar, Mohammad Ali, Sakpal, Nidhi, Cruz, Diogo
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. Newer LLMs offer longer context windows and support tool calling, yet prior work has focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setup relatively unexplored from both capability and safety perspectives. Our work addresses this gap. We find that LLM agents can be sensitive to the length, type, and placement of context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% on both benign and harmful tasks. Refusal rates shift unpredictably: at 200K tokens, GPT-4.1-nano's refusal rate rises from ~5% to ~40% while Grok 4 Fast's falls from ~80% to ~10%. Our work surfaces potential safety issues with agents operating on longer contexts and raises further questions about the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
- North America > United States > New Jersey (0.04)
- Europe > United Kingdom (0.04)
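A minimal sketch of the measurement the abstract above describes (refusal rate as a function of context length) might look like the following; query_agent, the word-based padding, and the string-match refusal check are all placeholder assumptions rather than the authors' setup:

def query_agent(prompt: str) -> str:
    """Placeholder for a real agent invocation (e.g., an API call
    with tool use enabled)."""
    raise NotImplementedError

def refusal_rate(requests: list[str], filler: str, target_tokens: int) -> float:
    """Pad each request with distractor context up to roughly
    target_tokens words (a crude proxy for tokens), then count
    how often the agent refuses."""
    refusals = 0
    for request in requests:
        padding = " ".join([filler] * target_tokens)
        reply = query_agent(padding + "\n\n" + request)
        if "i can't" in reply.lower() or "i cannot" in reply.lower():
            refusals += 1
    return refusals / len(requests)

# Sweeping context lengths, as in the study's 100K-200K token range:
# for n in (1_000, 100_000, 200_000):
#     print(n, refusal_rate(some_requests, filler="lorem", target_tokens=n))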
Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy
Jackson, Daniel I, Jensen, Emma L, Hussain, Syed-Amad, Sezgin, Emre
Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > Costa Rica (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
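The repeated-administration protocol in the abstract above is easy to sketch, assuming the standard 10-item GSES with 1-4 Likert responses and a 10-40 total score. The item wording and the ask_model call are placeholders:

import random
from statistics import stdev

GSES_ITEMS = [f"GSES item {i}" for i in range(1, 11)]  # placeholder wording

def ask_model(item: str) -> int:
    """Hypothetical call that elicits a 1-4 Likert rating from the LLM."""
    raise NotImplementedError

def administer_once(seed: int) -> int:
    """One administration with a randomized item order; total score 10-40."""
    items = GSES_ITEMS[:]
    random.Random(seed).shuffle(items)  # randomized item order
    return sum(ask_model(item) for item in items)

# Stability across repeated administrations:
# scores = [administer_once(seed) for seed in range(5)]
# print(stdev(scores))  # low spread = stable simulated self-assessment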
Towards Robust Mathematical Reasoning
Luong, Thang, Hwang, Dawsen, Nguyen, Hoang H., Ghiasi, Golnaz, Chervonyi, Yuri, Seo, Insuk, Kim, Junsu, Bingham, Garrett, Lee, Jonathan, Mishra, Swaroop, Zhai, Alex, Hu, Clara Huiyi, Michalewski, Henryk, Kim, Jimin, Ahn, Jeonghyun, Bae, Junhwi, Song, Xingyou, Trinh, Trieu H., Le, Quoc V., Jung, Junehyuk
Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation of proof-writing capabilities, which includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning; it is released at https://imobench.github.io/.
- North America > United States (0.04)
- Europe > Austria (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- (2 more...)
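The agreement check that IMO-GradingBench enables, comparing autograder marks against human gradings, reduces to a simple correlation. The 0-7 scale and the grades below are invented for illustration:

from statistics import correlation  # Pearson r, Python 3.10+

# Hypothetical proof grades on a 0-7 Olympiad-style scale.
human_grades     = [7, 0, 3, 7, 1, 5, 7, 2]
autograder_marks = [7, 1, 3, 6, 1, 5, 7, 2]

r = correlation(human_grades, autograder_marks)
print(f"Pearson r = {r:.2f}")  # values near 1.0 indicate close agreement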
Don't Change My View: Ideological Bias Auditing in Large Language Models
As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
- Africa > South Africa (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New Jersey (0.04)
- (6 more...)
- Research Report (1.00)
- Personal > Interview (0.46)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Law (0.93)
- Transportation (0.66)
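One way to sketch the distributional-shift idea from the abstract above: score sampled outputs on thematically related prompts with a scalar stance measure, then compare the audited model's distribution against a reference. The stance_score helper is hypothetical, and a two-sample KS test stands in here for the paper's adapted statistical method:

from scipy.stats import ks_2samp

def stance_score(text: str) -> float:
    """Hypothetical scalar in [-1, 1], e.g., from a stance classifier."""
    raise NotImplementedError

def audit(reference_outputs: list[str], audited_outputs: list[str]) -> float:
    """Compare stance distributions of two output samples; no access
    to model internals is needed, only generated text."""
    ref = [stance_score(text) for text in reference_outputs]
    aud = [stance_score(text) for text in audited_outputs]
    statistic, p_value = ks_2samp(ref, aud)  # two-sample distribution test
    return p_value

# A consistently small p-value across many topic-related prompt sets
# would flag potential ideological steering in the black-box model.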
The Mathematician's Assistant: Integrating AI into Research Practice
The rapid development of artificial intelligence (AI), marked by breakthroughs like 'AlphaEvolve' and 'Gemini Deep Think', is beginning to offer powerful new tools that have the potential to significantly alter research practice in many areas of mathematics. This paper explores the current landscape of publicly accessible large language models (LLMs) in a mathematical research context, based on developments up to August 2, 2025. Our analysis of recent benchmarks, such as MathArena and the Open Proof Corpus (Balunović et al., 2025; Dekoninck et al., 2025), reveals a complex duality: while state-of-the-art models demonstrate strong abilities in solving problems and evaluating proofs, they also exhibit systematic flaws, including a lack of self-critique and a model-dependent discrepancy between final-answer accuracy and full-proof validity. Based on these findings, we propose a durable framework for integrating AI into the research workflow, centered on the principle of the augmented mathematician. In this model, the AI functions as a copilot under the critical guidance of the human researcher, an approach distilled into five guiding principles for effective and responsible use. We then systematically explore seven fundamental ways AI can be applied across the research lifecycle, from creativity and ideation to the final writing process, demonstrating how these principles translate into concrete practice. We conclude that the primary role of AI is currently augmentation rather than automation, which demands a new skill set focused on strategic prompting, critical verification, and methodological rigor to use these powerful tools effectively.
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Europe > Germany (0.04)
- Research Report > New Finding (0.67)
- Personal > Honors (0.67)
- Research Report > Promising Solution (0.66)
- Information Technology > Security & Privacy (1.00)
- Education > Educational Setting > Higher Education (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
R-ConstraintBench: Evaluating LLMs on NP-Complete Scheduling
The reliability of large language models (LLMs) when reasoning under high-constraint regimes is insufficiently characterized. To address this gap, we present R-ConstraintBench, a scalable framework that evaluates models on Resource-Constrained Project Scheduling Problems (RCPSP), an NP-complete feasibility class, in which difficulty increases via linear growth in the number of constraints. R-ConstraintBench incrementally adds non-redundant precedence constraints to Directed Acyclic Graphs (DAGs) and then introduces downtime, temporal-window, and disjunctive constraints. As an illustrative example, we instantiate the benchmark in a data center migration setting and evaluate multiple LLMs using feasibility and error analysis, identifying degradation thresholds and the constraint types most associated with failure. Empirically, strong models are near ceiling on precedence-only DAGs, but feasibility performance collapses when downtime, temporal-window, and disjunctive constraints interact, implicating constraint interaction, not graph depth, as the principal bottleneck. Performance on clean synthetic ramps also does not guarantee transfer to domain-grounded scenarios, underscoring limited generalization.
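A minimal sketch of the precedence-only ramp described above: grow a DAG by adding precedence edges one at a time, rejecting any edge already implied by the current graph. The generator below is an illustrative assumption; it omits the downtime, temporal-window, and disjunctive constraints where the benchmark reports collapse:

import random

def reachable(adj: dict[int, set[int]], src: int, dst: int) -> bool:
    """Depth-first check of whether dst is reachable from src."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False

def grow_dag(n_tasks: int, n_edges: int, seed: int = 0) -> dict[int, set[int]]:
    """Add precedence edges one at a time; sorting each sampled pair
    fixes a topological order, so the graph stays acyclic."""
    rng = random.Random(seed)
    adj = {task: set() for task in range(n_tasks)}
    added = 0
    while added < n_edges:
        a, b = sorted(rng.sample(range(n_tasks), 2))
        if not reachable(adj, a, b):  # reject edges already implied
            adj[a].add(b)
            added += 1
    return adj

print(sum(len(v) for v in grow_dag(20, 30).values()))  # 30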
Grok AI is now part of new Tesla vehicles
The CyberGuy, Kurt Knutsson, gives his take on Elon Musk's claims that Grok 3 outperforms every AI rival on 'Fox & Friends.' Chatting with Grok while cruising in your Tesla is now a reality. The conversational artificial intelligence is being included in newer models, according to Elon Musk. Having Grok around will hopefully make your drive more engaging. It will be like having a buddy with you along for the ride.
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks > Manufacturer (1.00)
- Transportation > Electric Vehicle (0.89)
Elon Musk unveils bizarre new kids project after humiliating anti-Semitism disaster
Just a few weeks after Elon Musk's chatbot praised Hitler and denied the Holocaust, he's now looking to turn it into a playmate for kids. Musk is calling the version Baby Grok, and added that it would offer 'kid-friendly content' through a new app developed by his company xAI. He made the announcement Saturday night on X, where the post quickly drew over 28 million views within 24 hours. The move left many stunned, coming just two weeks after Grok 4, the latest version of Elon Musk's AI chatbot, sparked backlash for repeating far-right hate speech and white nationalist talking points when asked about politics, race, and recent news events. Multiple users reported on July 8 and July 9 that Grok echoed anti-Semitic conspiracy theories, including claims that Jewish people control Hollywood, promote hatred toward white people, and should be imprisoned in camps, though it is still unclear how many of these posts were confirmed before xAI took them down.
I Tried Grok's Built-In Anime Companion and It Called Me a Twat
Its name is Ani, and it cost me $300. Elon Musk's xAI dropped the new visual chatbot feature on Monday in the Grok iOS app. The top-tier subscription unlocks access to xAI's best-performing model, Grok 4 Heavy, and special settings for interacting with two custom characters designed for flirting or chatting. A third character, which looks a bit like a sexy boyfriend, is listed as "coming soon." It's not xAI's first dip into adult content, either: back in February 2024, the company rolled out a chatbot mode for "sexy" conversations.