Large Language Model
Ilya Sutskever Stands by His Role in Sam Altman's OpenAI Ouster: 'I Didn't Want It to Be Destroyed'
The former OpenAI chief scientist may be estranged from the company, but he still came to its defense as he testified on Monday. Elon Musk's trial against OpenAI and Microsoft entered its final stretch on Monday, with testimony from Microsoft CEO Satya Nadella, former OpenAI chief scientist Ilya Sutskever, and current OpenAI chairman Bret Taylor. Sutskever drew the spotlight, revealing an ownership stake in OpenAI's $850-billion for-profit arm that is currently worth about $7 billion. That makes him one of the largest known individual shareholders of OpenAI. Earlier in the trial, OpenAI president Greg Brockman acknowledged for the first time that he has around $30 billion worth of OpenAI shares.
AI-powered hacking has exploded into industrial-scale threat, Google says
'There's a misconception that the AI vulnerability race is imminent. The reality is it's already begun,' said John Hultquist at Google's threat intelligence group. In just three months, AI-powered hacking has gone from a nascent problem to an industrial-scale threat, according to a report from Google.
The Download: the hantavirus outbreak and Musk v. Altman week 2
Plus: Meta's embrace of AI is making employees miserable. Here's what you need to know about the cruise ship hantavirus outbreak Last week, eight passengers aboard a Dutch-flagged cruise ship contracted a type of hantavirus transmitted by rats. But health experts stress that this situation is nothing like the coronavirus outbreak in 2020. The Andes virus is known to spread between people, and there are no specific antiviral treatments or vaccines. Yet transmission appears to require a specific form of contact that the cruise ship fostered. Here's what you need to know about the outbreak--and why experts believe it can be contained.
I Work in Hollywood. Everyone Who Used to Make TV Is Now Secretly Training AI
For screenwriters like me--and job seekers all over--AI gig work is the new waiting tables. In eight months, I've done 20 of these soul-crushing contracts for five different platforms. My name on the platform is ri611. I work as an AI trainer. I assess whether a chatbot's tone is natural or flat, affected or annoying. I identify patterns in pictures of furniture; search the internet for group photos of strangers whom I'll eliminate from the portrait, one by one. I trawl through bizarre videos so I can annotate and time-stamp the barking of a dog, the moment a stranger walks past a window, the precise millisecond a balloon pops. I generate anime sex scenes and decapitate young women, coax LLMs into giving me recipes for bombs made of household items, and generate invites to a reprise of January 6 at the White House, all as part of a red team whose purpose is to test safety precautions and probe weaknesses. I work for companies with names like Mercor and Outlier and Task-ify and Turing and Handshake and Micro1. In my "other" career, I am a Hollywood writer and showrunner. I create prime-time TV, usually featuring a middle-class white lady having the worst day of her life, with some salt-of-the-earth police interference to raise the stakes. You can find my shows on Paramount and Hulu and the BBC.
CUDA Proves Nvidia Is a Software Company
There's a deep, forbidding moat that surrounds Nvidia--and it has nothing to do with hardware. Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I'm afraid I must talk about "moats." Popularized decades ago by Warren Buffett to refer to a company's competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled "We Have No Moat, and Neither Does OpenAI," fretted that open-source AI would pillage Big Tech's castle. A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models.
Bias and Uncertainty in LLM-as-a-Judge Estimation
LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ($J$) and cross-model calibration instability ($\Delta J$), and a real-data MMLU-Pro case study with sign reversal. We propose $J$ and $\Delta J$ as diagnostics for when corrected estimates, especially shared-calibration comparisons, are likely unreliable, and provide reporting guidance for LLM-as-a-Judge evaluation.
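To make the bias concrete, here is a minimal Python sketch of the naive estimator and one standard correction (the Rogan-Gladen form from diagnostic testing). It is not necessarily the estimator studied in the paper. The abstract does not define $J$; the sketch assumes it is Youden's index, sensitivity + specificity - 1, and all numbers below are hypothetical.

```python
def expected_naive(true_acc, sensitivity, specificity):
    """Expected raw judge pass rate for a model with the given true accuracy."""
    return sensitivity * true_acc + (1.0 - specificity) * (1.0 - true_acc)


def corrected_estimate(naive, sensitivity, specificity):
    """Rogan-Gladen style correction (assumed here, not taken from the paper).

    Divides by J = sensitivity + specificity - 1, so it becomes unstable
    as the judge approaches chance level (J near 0).
    """
    j = sensitivity + specificity - 1.0
    if abs(j) < 1e-9:
        raise ValueError("judge is uninformative (J ~ 0); correction undefined")
    return (naive - (1.0 - specificity)) / j


# Hypothetical setup: model B is truly better than model A,
# but the judge is less sensitive on B's outputs.
true_a, true_b = 0.70, 0.72
sens_a, spec_a = 0.95, 0.80
sens_b, spec_b = 0.85, 0.80

naive_a = expected_naive(true_a, sens_a, spec_a)
naive_b = expected_naive(true_b, sens_b, spec_b)

# Per-model calibration recovers the true accuracies.
own_a = corrected_estimate(naive_a, sens_a, spec_a)
own_b = corrected_estimate(naive_b, sens_b, spec_b)

# Shared calibration: reuse model A's judge error rates for model B.
shared_b = corrected_estimate(naive_b, sens_a, spec_a)

print(f"naive:  A={naive_a:.3f}  B={naive_b:.3f}")
print(f"own:    A={own_a:.3f}  B={own_b:.3f}")
print(f"shared: A={own_a:.3f}  B={shared_b:.3f}")
```

In this toy setup the shared-calibration comparison ranks A above B even though B is better, which is the kind of sign reversal the abstract warns about.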
Response Time Enhances Alignment with Heterogeneous Preferences
Echenique, Federico, Fallah, Alireza, Huang, Baihe, Jordan, Michael I.
Aligning large language models (LLMs) to human preferences typically relies on aggregating pooled feedback into a single reward model. However, this standard approach assumes that all labelers share the same underlying preferences, ignoring the fact that real-world labelers are highly heterogeneous and usually anonymous. Consequently, relying solely on binary choice data fundamentally distorts the learned policy, making the true population-average preference unidentifiable. To overcome this critical limitation, we demonstrate that augmenting preference datasets with a simple, secondary signal -- the user's response time -- can restore the identifiability of the population's average preference. By modeling each decision as a Drift-Diffusion Model (DDM), we introduce a novel, consistent estimator of heterogeneous preferences that successfully corrects the distortions of standard choice-only labels. We prove that our estimator asymptotically converges to the true average preference even in extreme cases where each anonymous labeler contributes only a single choice. Empirically, across both synthetic and real-world datasets, our method consistently outperforms standard baselines that otherwise fail and plateau at a bias floor. Because response times are essentially free to record and require zero user tracking or identification, our results open up new opportunities for future data-collection pipelines to improve social benefit without requiring user-level identifiers or repeated elicitations.
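The distortion from pooling binary choices can be seen in a small Python sketch. It assumes a logistic (Bradley-Terry style) choice model for each labeler and a hypothetical two-group population. It illustrates only the problem, not the paper's DDM-based estimator.

```python
import math


def sigmoid(x):
    """Logistic function: probability of choosing A given utility gap x."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid: utility gap implied by a choice probability."""
    return math.log(p / (1.0 - p))


# Hypothetical population: (share of labelers, utility gap u_A - u_B).
# A minority strongly prefers A; a majority mildly prefers B.
population = [(0.3, 4.0), (0.7, -1.0)]

# True population-average preference (positive means A is preferred).
mean_gap = sum(share * gap for share, gap in population)

# What pooled, anonymous binary choices reveal: the share of "A" votes.
pooled_choice_prob = sum(share * sigmoid(gap) for share, gap in population)

# Utility gap a single pooled reward model would infer from those votes.
implied_gap = logit(pooled_choice_prob)

print(f"true average gap:      {mean_gap:+.3f}")
print(f"pooled P(choose A):    {pooled_choice_prob:.3f}")
print(f"gap implied by pooling: {implied_gap:+.3f}")
```

Here the average preference favors A, yet A wins fewer than half of the pooled votes, so a reward model fit to choices alone would learn the wrong sign. Binary labels record which option was chosen but not how strongly, which is the information the abstract proposes to recover from response times.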
Why Does Agentic Safety Fail to Generalize Across Tasks?
Slutzky, Yonatan, Alexander, Yotam, Slor, Tomer, Nagel, Yoav, Cohen, Nadav
AI agents are increasingly deployed in multi-task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear-quadratic control with $H_{\infty}$-robustness, and prove that the mapping from task specification to an optimal controller has higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. Empirically, we demonstrate our conclusions in simulated quadcopter navigation with a neural network agent and in CRM with an LLM agent. Our findings suggest that current efforts to enhance agentic safety may be insufficient, and point to a need for fundamentally different approaches.
An Interpretable and Scalable Framework for Evaluating Large Language Models
Qu, Xinhao, Heng, Qiang, Zeng, Hao, Liu, Xiaoqian
Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale implementations. To address these challenges, we propose an interpretable and scalable framework for LLM evaluation based on the majorization-minimization principle. Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy. Our results align with established scaling laws and offer insights into item difficulty and discrimination, informing more principled benchmark design.
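For readers unfamiliar with IRT, here is a minimal Python sketch of a two-parameter logistic (2PL) item response function. The abstract mentions item difficulty and discrimination but does not state the exact model, so the 2PL form and all numbers here are assumptions. The sketch does not implement the paper's majorization-minimization procedure.

```python
import math


def p_correct(theta, discrimination, difficulty):
    """2PL item response function.

    theta: latent ability of the model being evaluated.
    discrimination: how sharply the item separates weak from strong models.
    difficulty: ability level at which the success probability is 0.5.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical benchmark items: (discrimination, difficulty).
items = [(2.0, -1.0), (0.5, 0.0), (1.5, 1.5)]

# Hypothetical model abilities.
abilities = {"weak-model": -0.5, "strong-model": 1.0}

for name, theta in abilities.items():
    probs = [p_correct(theta, a, b) for a, b in items]
    expected_accuracy = sum(probs) / len(probs)
    formatted = ", ".join(f"{p:.2f}" for p in probs)
    print(f"{name}: per-item P(correct) = [{formatted}], "
          f"expected accuracy = {expected_accuracy:.2f}")
```

Average accuracy collapses the per-item probabilities into a single number, whereas fitting ability, difficulty, and discrimination keeps the item-level structure that the abstract argues benchmarks should expose.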
ProteinJEPA: Latent prediction complements protein language models
Ofer, Dan, Shahaf, Dafna, Linial, Michal
Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement these token-level objectives under matched wall-clock budget. Across pretrained and random-init protein sequence encoders at 35--150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M, 11/2/3 on ESM2-150M, while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, β-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall-clock budgets.
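The masked-position recipe can be sketched as a combined loss in plain Python. The abstract states only that latent targets are predicted at masked positions while the MLM cross-entropy is retained. The mean-squared-error latent loss, the `jepa_weight` parameter, and the source of the target latents (for example a separate target encoder) are hypothetical choices made for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (stable form)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def mse(pred, target):
    """Mean squared error between two latent vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def masked_mlm_jepa_loss(logits, tokens, pred_latents, target_latents,
                         mask, jepa_weight=1.0):
    """MLM cross-entropy plus latent prediction, both at masked positions only.

    jepa_weight and the MSE latent loss are assumptions, not from the paper.
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0
    mlm = sum(cross_entropy(logits[i], tokens[i]) for i in masked) / len(masked)
    jepa = sum(mse(pred_latents[i], target_latents[i]) for i in masked) / len(masked)
    return mlm + jepa_weight * jepa


# Toy sequence of two positions over a two-token vocabulary.
logits = [[0.0, 0.0], [5.0, -5.0]]
tokens = [0, 1]
pred_latents = [[1.0, 4.0], [9.0, 9.0]]
target_latents = [[1.0, 2.0], [0.0, 0.0]]
mask = [True, False]  # only position 0 is masked

loss = masked_mlm_jepa_loss(logits, tokens, pred_latents, target_latents, mask)
print(f"combined loss at masked positions: {loss:.4f}")
```

Position 1 is unmasked, so neither its poor token prediction nor its large latent error contributes to the loss. That is the difference from the all-position variant the abstract reports as less effective.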