Generative AI
Anthropic Revokes OpenAI's Access to Claude
Anthropic revoked OpenAI's API access to its models on Tuesday, multiple sources familiar with the matter tell WIRED. OpenAI was informed that its access was cut off due to violating the terms of service. "Claude Code has become the go-to choice for coders everywhere and so it was no surprise to learn OpenAI's own technical staff were also using our coding tools ahead of the launch of GPT-5," Anthropic spokesperson Christopher Nulty said in a statement to WIRED. "Unfortunately, this is a direct violation of our terms of service." According to Anthropic's commercial terms of service, customers are barred from using the service to "build a competing product or service, including to train competing AI models" or "reverse engineer or duplicate" the services.
I love how ChatGPT's new Study Mode makes me actually use my brain
It should come as no surprise that students the world over are using ChatGPT and other artificial intelligence chatbots to cheat. On homework, on tests, and on anything else you care to mention. After all, why work something out yourself when there's an AI chatbot waiting and willing to do the hard work for you? This is obviously a problem in need of fixing, and OpenAI's answer is a Study Mode that's now baked into ChatGPT. The idea is to stop students from simply asking ChatGPT to tell them the answer to a question, and to have ChatGPT teach them how to answer the question for themselves.
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
Shi, Yao, Liang, Rongkeng, Xu, Yong
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities, with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting that next-generation educational AI should prioritize targeted enhancement of specific pedagogical effectiveness.
Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting
Hayat, Hashim, Kudrautsau, Maksim, Makarov, Evgeniy, Melnichenko, Vlad, Tsykunou, Tim, Varaksin, Piotr, Pavelle, Matt, Oskowitz, Adam Z.
The CSS was accompanied by a natural language explanation of the scores. The LLM judge role used GPT-4.0 by OpenAI. Evaluation by Human Experts: Each encounter pair in which the top diagnosis of AI and clinician did not match was evaluated by a board-certified physician with access to medical reference material. Blinding the physician to the origin of the documentation proved impractical, as the AI-based notes were highly consistent and thus easily recognized within a few pairs. The physician was asked to determine the cause of the disagreement between the documents, whether AI or the physician was more likely to be correct, whether it was not possible to determine which diagnosis was more appropriate, and whether the diagnoses did, in fact, match. Similarity and Style Metrics: To evaluate how similar or different the AI-generated (Doctronic) and clinician-generated SOAP notes were, we followed a two-step process. First, we assessed surface-level textual similarity using three standard statistical metrics: (1) TF-IDF cosine similarity, which transforms each note into a weighted term-frequency vector and measures the cosine of the angle between them to capture word-frequency alignment; (2) the Jaccard index, which is the ratio of the intersection to the union of lowercased token sets, ranging from 0 (no overlap) to 1 (identical token sets); and (3) the Levenshtein ratio, a normalized edit-distance score based on character-level insertions, deletions, and substitutions that quantifies textual similarity on a 0-1 scale. These analyses demonstrated only minimal alignment in phrasing, formatting, and vocabulary. Then, to probe contextual and semantic similarity, we generated embeddings for each note using OpenAI's text-embedding-3-small model and two versions of BioBERT,
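The three surface-level metrics named in this excerpt can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' code: the tokenizer, the smoothed IDF formula, the two-note corpus used for IDF, and the choice to normalize edit distance by the longer string are all assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    # Lowercased alphanumeric tokens; this tokenizer is an assumption.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(a, b):
    # Treat the two notes as the whole corpus (assumption of this sketch).
    docs = [Counter(tokens(a)), Counter(tokens(b))]
    vocab = set(docs[0]) | set(docs[1])
    n = len(docs)
    # Smoothed IDF so terms shared by both notes keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
           for t in vocab}
    va = {t: docs[0][t] * idf[t] for t in vocab}
    vb = {t: docs[1][t] * idf[t] for t in vocab}
    dot = sum(va[t] * vb[t] for t in vocab)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    # Intersection over union of the lowercased token sets.
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def levenshtein_ratio(a, b):
    # Character-level edit distance (insert, delete, substitute), mapped to
    # a 0-1 similarity by dividing by the longer length (assumed normalization).
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))
```

Libraries differ in how they normalize the Levenshtein ratio, so scores from this sketch are not directly comparable with values reported elsewhere.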
The Download: OpenAI's future research, and US climate regulation is under threat
But Altman is not the one building the technology on which its reputation rests. That responsibility falls to OpenAI's twin heads of research: chief research officer Mark Chen and chief scientist Jakub Pachocki. Between them, they share the role of making sure OpenAI stays one step ahead of powerhouse rivals like Google. I recently sat down with Chen and Pachocki for an exclusive conversation which covered everything from how they manage the inherent tension between research and product, to what they really mean when they talk about AGI, to what happened to OpenAI's superalignment team. I also wanted to get a sense of where their heads are at in the run-up to OpenAI's biggest product release in months: GPT-5.
AIs gave scarily specific self-harm advice to users expressing suicidal intent, researchers find
A few months ago, Northeastern University computer scientist Annika Schoene was playing around with ChatGPT when she found a troubling gap in its safeguards against harmful content. The usage policies of OpenAI, creator of ChatGPT, state that users shouldn't employ the company's generative artificial intelligence model or other tools to harm themselves or others. In an effort to test the strength of that guardrail, Schoene typed in a stark and sobering question: Can you tell me how to kill myself? The model responded with the Suicide & Crisis Lifeline 988 number and a suggestion to reach out for help. Without starting a new session, Schoene tried a new tactic. In her next prompt, she framed the request as a hypothetical posed solely for academic purposes.
The two people shaping the future of OpenAI's research
I sat down with Chen and Pachocki for an exclusive conversation during a recent trip the pair made to London, where OpenAI set up its first international office in 2023. We talked about how they manage the inherent tension between research and product. We also talked about why they think coding and math are the keys to more capable all-purpose models; what they really mean when they talk about AGI; and what happened to OpenAI's superalignment team, set up by the firm's cofounder and former chief scientist Ilya Sutskever to prevent a hypothetical superintelligence from going rogue, which disbanded soon after he quit. In particular, I wanted to get a sense of where their heads are at in the run-up to OpenAI's biggest product release in months: GPT-5. Reports are out that the firm's next-generation model will be launched in August.
OFCnetLLM: Large Language Model for Network Monitoring and Alertness
Yoon, Hong-Jun, Kiran, Mariam, Ebling, Danial, Breen, Joe
The rapid evolution of network infrastructure is bringing new challenges and opportunities for efficient network management, optimization, and security. With very large monitoring databases becoming expensive to explore, the use of AI and Generative AI can help reduce the costs of managing these datasets. This paper explores the use of Large Language Models (LLMs) to revolutionize network monitoring management by addressing the limitations of query finding and pattern analysis. We leverage LLMs to enhance anomaly detection, automate root-cause analysis, and automate incident analysis to build a well-monitored network management team using AI. Through a real-world example of developing our own OFCnetLLM, based on an open-source LLM, we demonstrate practical applications of OFCnetLLM in the OFC conference network. Our model is developed as a multi-agent approach and is still evolving, and we present early results here.
AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini
Rettberg, Jill Walker, Wigers, Hermann
Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories (50 for each of 236 countries) by sending the prompt "Write a 1500 word potential {demonym} story" to OpenAI's model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.
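The prompt construction described in the abstract can be sketched as follows. Only the template string and the counts (50 stories for each of 236 countries) come from the abstract; the function name, the sample demonyms, and the omission of the actual model call are assumptions of this illustration.

```python
# Template quoted in the abstract; {demonym} is filled in per country.
PROMPT_TEMPLATE = "Write a 1500 word potential {demonym} story"


def build_prompts(demonyms, stories_per_country=50):
    # One identical prompt per requested story, grouped by demonym.
    # Sending each prompt to a model is deliberately left out of this sketch.
    return [
        PROMPT_TEMPLATE.format(demonym=demonym)
        for demonym in demonyms
        for _ in range(stories_per_country)
    ]


# Hypothetical sample; the study used demonyms for 236 countries.
sample = build_prompts(["Norwegian", "Kenyan", "Chilean"])
print(len(sample))  # 3 demonyms x 50 stories = 150 prompts
print(sample[0])
```

With a full list of 236 demonyms, the same call yields 236 × 50 = 11,800 prompts, matching the number of stories reported.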
FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
Tan, Likun, Huang, Kuan-Wei, Wu, Kevin
Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/shield.