Large Language Model
DeepSeek reportedly gets China's approval to buy NVIDIA's H200 AI chips
ByteDance, Alibaba and Tencent received permission, as well, according to Reuters. The Chinese government has given DeepSeek its approval to purchase NVIDIA's H200 AI chips, according to Reuters. ByteDance, Alibaba and Tencent have also reportedly received permission from Beijing to buy a total of 400,000 H200 GPUs. Reuters says Chinese authorities are still finalizing the conditions the companies must meet before they can proceed with their orders, so it may take a while before they receive their shipments. In addition, NVIDIA CEO Jensen Huang told reporters that his company has yet to receive orders from the aforementioned firms and that he believed China is still finalizing their licenses. In December 2025, the US government allowed NVIDIA to sell its second-best H200 processors to vetted Chinese companies, in addition to its H20 model, in exchange for a 25 percent tariff on those sales.
Silicon Valley Tech Workers Are Campaigning to Get ICE Out of US Cities
Even as Big Tech CEOs curry favor with President Trump, Silicon Valley employees are calling on their bosses to use their influence to help stop his immigration policies. The first Trump administration, and the tech industry that stood up to it, are both looking quainter by the day. Here's one example: In 2017, when President Trump issued a series of executive orders instituting a travel ban on foreigners from certain countries (predominantly Muslim-majority ones), people from across the United States vigorously protested the policy. They included some of tech's most elite: Google cofounder Sergey Brin, who joined a demonstration at the San Francisco airport; Amazon founder Jeff Bezos, who wrote a company-wide email outlining "legal options" that Amazon was considering to fight the ban; and Facebook founder Mark Zuckerberg, who took to Instagram to describe his own family's immigrant roots. On Saturday, hours after federal agents shot and killed ICU nurse Alex Pretti in the streets of Minneapolis, several prominent tech executives attended a private White House screening of a documentary being released by (of course) Amazon MGM Studios. The timing was not lost on the group of Silicon Valley workers who recently launched ICEout.tech. The letter, posted following Renee Nicole Good's killing earlier this month, has now been signed by more than 1,000 tech employees. Those workers, who come from across the spectrum of Big Tech companies and startups, are asking that executives use their clout to demand Immigration and Customs Enforcement agents leave American cities, that they cancel company contracts with the agency, and that they speak publicly about ICE's violent and deadly tactics. Worker-led demands like those were commonplace during Trump 1.0, when tech employees at the world's biggest companies often spoke out, internally and externally, about the cruelty of the US administration and the industry's role in facilitating or tempering its most craven policies.
Meanwhile, the executives leading those companies have been busy kissing the ring, over dinner at the White House or with outlandishly expensive documentaries nobody's watching, at every opportunity. Is the dam finally breaking? This week, Silicon Valley leaders including Anthropic heads Dario and Daniela Amodei, OpenAI CEO Sam Altman, and Apple CEO Tim Cook finally spoke out about ICE's outrageous overreach.
Microdosing for Depression Appears to Work About as Well as Drinking Coffee
For years, people from CEOs to novelists have taken tiny amounts of psychedelics to support well-being. New research shows that benefits for depression may be attributable to a placebo effect. Typically using psilocybin mushrooms or LSD, the archetypal microdoser sought less melting walls and open-eye kaleidoscopic visuals than boosts in mood and energy, like a gentle spring breeze blowing through the mind. Anecdotal reports pitched microdosing as a kind of psychedelic Swiss Army knife, providing everything from increased focus to a spiked libido and (perhaps most promisingly) lowered reported levels of depression. It was a miracle for many.
AI-generated news should carry 'nutrition' labels, thinktank says
The IPPR recommended standardised labels for AI-generated news, showing what information had been used to create those answers. AI-generated news should carry "nutrition" labels and tech companies must pay publishers for the content they use, according to a left-of-centre thinktank, amid rising use of the technology as a source for current affairs. The Institute for Public Policy Research (IPPR) said AI firms were rapidly emerging as the new "gatekeepers" of the internet and intervention was needed to create a healthy AI news environment. It recommended standardised labels for AI-generated news, showing what information had been used to create those answers, including peer-reviewed studies and articles from professional news organisations.
Efficient Evaluation of LLM Performance with Statistical Guarantees
Wu, Skyler, Nair, Yash, Candès, Emmanuel J.
Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference -- a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead cost, FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites, across varying historical-data missingness levels: this means that it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.
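For orientation, the uniform-sampling baseline that the abstract compares against can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' released code: it draws n questions uniformly without replacement from a finite benchmark of N questions and forms a normal-approximation interval with the finite-population correction. The function name `uniform_sampling_ci` and the 0/1 encoding of per-question correctness are hypothetical choices made for this example.

```python
import math
import random


def uniform_sampling_ci(outcomes, n, z=1.96, seed=0):
    """Normal-approximation CI for the mean of a finite population.

    outcomes : list of 0/1 correctness values, one per benchmark question
               (in practice only the sampled entries would be queried)
    n        : number of questions to query, 2 <= n <= len(outcomes)
    z        : normal quantile (1.96 gives a nominal 95% interval)
    """
    N = len(outcomes)
    if not 2 <= n <= N:
        raise ValueError("need 2 <= n <= population size")
    rng = random.Random(seed)
    sample = rng.sample(outcomes, n)  # uniform, without replacement
    p_hat = sum(sample) / n
    # Sample variance with the usual n - 1 denominator.
    s2 = sum((y - p_hat) ** 2 for y in sample) / (n - 1)
    # Finite-population correction: uncertainty vanishes as n approaches N.
    fpc = 1.0 - n / N
    se = math.sqrt(fpc * s2 / n)
    return p_hat, (p_hat - z * se, p_hat + z * se)


if __name__ == "__main__":
    # Hypothetical benchmark: 1000 questions, 700 answered correctly.
    population = [1] * 700 + [0] * 300
    for n in (50, 250):
        estimate, (lo, hi) = uniform_sampling_ci(population, n)
        print(n, round(estimate, 3), round(hi - lo, 3))
```

Because the width of this interval shrinks roughly as 1/sqrt(n), a 5x gain in effective sample size corresponds to reaching about the same width with one fifth of the queries.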
Best Arm Identification with LLM Judges and Limited Human
Ao, Ruicheng, Chen, Hongyu, Gao, Siyang, Li, Hanwei, Simchi-Levi, David
We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy (e.g., LLM judge) is available for every sample, while an expensive ground-truth label can only be acquired selectively when using a human for auditing. Unlike classical multi-fidelity BAI, the proxy is biased (arm- and context-dependent) and ground truth is selectively observed. Consequently, standard multi-fidelity methods can mis-select the best arm, and uniform auditing, though accurate, wastes scarce resources and is inefficient. We prove that without bias correction and propensity adjustment, mis-selection probability may not vanish (even with unlimited proxy data). We then develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals and form anytime-valid confidence sequences for that estimator. Based on the estimator and confidence sequence, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and close arms and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
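The estimator described above can be illustrated with a small sketch. The abstract does not spell out the formula, so this assumes the standard form of such a correction: every sample contributes its proxy score, and each audited sample additionally contributes its residual (human label minus proxy score) divided by the probability with which it was audited. The function and argument names are hypothetical.

```python
def ipw_corrected_mean(proxy, audited, propensity, truth):
    """Estimate an arm's mean from proxy scores plus audited residuals.

    proxy[i]      : cheap proxy score (e.g. an LLM judge), always observed
    audited[i]    : True if sample i was sent for a human audit
    propensity[i] : probability (> 0) with which sample i was audited
    truth[i]      : human label; only read when audited[i] is True
    """
    if not proxy:
        raise ValueError("need at least one sample")
    total = 0.0
    for f, a, pi, y in zip(proxy, audited, propensity, truth):
        # Unaudited samples contribute only the proxy score; audited ones
        # add the residual, up-weighted by 1 / propensity.
        correction = (y - f) / pi if a else 0.0
        total += f + correction
    return total / len(proxy)


if __name__ == "__main__":
    # Hypothetical data: four samples, two of them audited with probability 0.5.
    proxy = [0.9, 0.8, 0.7, 0.6]
    audited = [True, False, True, False]
    propensity = [0.5, 0.5, 0.5, 0.5]
    truth = [1.0, None, 0.0, None]
    print(ipw_corrected_mean(proxy, audited, propensity, truth))
```

When the audit decisions are made with the stated probabilities, the weighting makes the correction term average out to the proxy's bias, which is why the proxy alone (no audits) can stay biased no matter how much proxy data is collected.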
Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text
Zhou, Hongyi, Zhu, Jin, Xu, Erhan, Ye, Kai, Yang, Ying, Shi, Chengchun
Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).
The past few years have witnessed the emergence and rapid development of large language models (LLMs) such as GPT (Hurst et al., 2024), DeepSeek (Liu et al., 2024), Claude (Anthropic, 2024), Gemini (Comanici et al., 2025), Grok (xAI, 2025) and Qwen (Yang et al., 2025). Their impact is everywhere, from education, academia and software development to healthcare and everyday life (Arora & Arora, 2023; Chan & Hu, 2023; Hou et al., 2024). On one side of the coin, LLMs can support users with conversational question answering, help students learn more effectively, draft emails, write computer code, prepare presentation slides and more.
On the other side, their ability to closely mimic human-written text also raises serious concerns, including the generation of biased or harmful content, the spread of misinformation in the news ecosystem, and the challenges related to authorship attribution and intellectual property (Dave et al., 2023; Fang et al., 2024; Messeri & Crockett, 2024; Mahajan et al., 2025; Laurito et al., 2025). Addressing these concerns requires effective algorithms to distinguish between human-written and LLM-generated text, which has become an active and popular research direction in recent literature (see Crothers et al., 2023; Wu et al., 2025, for reviews).
A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth
Xu, Mingyuan, Tan, Xinzi, Wu, Jiawei, Zhou, Doudou
Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
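A minimal sketch of how a judge-specific discrimination parameter can enter a Bradley-Terry-Luce comparison is given below. The abstract does not state the exact parametrization, so the form used here, a logistic function of the judge's discrimination times the quality gap, is an assumption, as are the names `win_prob` and `log_likelihood`. The sketch only evaluates the likelihood; it does not fit the parameters.

```python
import math


def win_prob(theta_i, theta_j, gamma_k):
    """Probability that judge k prefers model i over model j.

    theta_i, theta_j : latent quality scores of the two models
    gamma_k          : discrimination of judge k (0 means a coin flip,
                       larger values mean more decisive judgments)
    """
    return 1.0 / (1.0 + math.exp(-gamma_k * (theta_i - theta_j)))


def log_likelihood(comparisons, theta, gamma):
    """Log-likelihood of observed pairwise outcomes.

    comparisons : iterable of (winner, loser, judge) index triples
    theta       : list of model quality scores
    gamma       : list of judge discrimination parameters
    """
    return sum(
        math.log(win_prob(theta[w], theta[l], gamma[k]))
        for w, l, k in comparisons
    )


if __name__ == "__main__":
    theta = [1.0, 0.0]   # hypothetical: model 0 is better than model 1
    gamma = [2.0, 0.2]   # hypothetical: judge 0 is sharp, judge 1 is noisy
    print(win_prob(theta[0], theta[1], gamma[0]))
    print(win_prob(theta[0], theta[1], gamma[1]))
    print(log_likelihood([(0, 1, 0), (1, 0, 1)], theta, gamma))
```

Under this form, a judge with small discrimination produces outcomes close to 50/50 regardless of the quality gap, so its comparisons carry little information about the ranking. Setting every discrimination to the same value recovers the unweighted model in which all judges are treated equally.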
More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
Meir, Sagi, Keidar, Tommer D., Levi, Noam, Reuveni, Shlomi, Hirshberg, Barak
The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a method for querying LLMs that increases coverage@cost for any given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws.
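The link between pass@k and coverage under a fixed budget can be illustrated with a toy model. The sketch below assumes each question has a fixed success probability and that attempts are independent; both are simplifying assumptions for illustration, and the per-question probabilities are made up. It covers only the plain scheme of k attempts on every question and does not implement ReD, whose procedure the abstract does not detail.

```python
def pass_at_k(success_probs, k):
    """Mean over questions of the chance of at least one success in k tries."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


def expected_coverage(success_probs, k):
    """Expected number of distinct questions solved with k tries per question.

    The total cost of this scheme is k * len(success_probs) attempts.
    """
    return len(success_probs) * pass_at_k(success_probs, k)


if __name__ == "__main__":
    # Hypothetical per-question success probabilities, from easy to very hard.
    probs = [0.9, 0.5, 0.1, 0.01]
    for k in (1, 2, 4, 8):
        cost = k * len(probs)
        print(cost, round(expected_coverage(probs, k), 3))
```

Each doubling of k doubles the cost but adds less coverage than the previous one, since extra attempts are increasingly spent on questions that are either already solved or very unlikely to be solved. This is the diminishing-returns pattern the abstract refers to.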
Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors
Zhang, Erica, Sagan, Naomi, Tse, Danny, Zhang, Fangzhao, Pilanci, Mert, Blanchet, Jose
We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.
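The idea of calibrating a prior-guided learner's influence on held-out data can be illustrated with a deliberately small sketch. This is not the Statsformer procedure, which the abstract describes only at a high level. It is a generic stand-in that picks, by grid search on validation predictions, the convex weight between two base learners with the lowest mean squared error. The function name and the two-learner setup are assumptions made for the example.

```python
def best_convex_weight(pred_a, pred_b, y, grid=21):
    """Pick w in [0, 1] minimizing validation MSE of w*pred_a + (1-w)*pred_b.

    pred_a : held-out predictions of learner A (e.g. one using an LLM prior)
    pred_b : held-out predictions of learner B (e.g. one without the prior)
    y      : held-out targets
    grid   : number of evenly spaced candidate weights, at least 2
    """
    if grid < 2:
        raise ValueError("grid must contain at least two candidate weights")

    def mse(w):
        return sum(
            (w * a + (1.0 - w) * b - t) ** 2
            for a, b, t in zip(pred_a, pred_b, y)
        ) / len(y)

    weights = [i / (grid - 1) for i in range(grid)]
    return min(weights, key=mse)


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    helpful = [1.1, 1.9, 3.2, 3.9]     # hypothetical: prior-guided, accurate
    misleading = [4.0, 1.0, 0.0, 2.0]  # hypothetical: prior-guided, poor
    plain = [1.5, 2.5, 2.5, 3.5]       # hypothetical: learner without prior
    print(best_convex_weight(helpful, plain, y))
    print(best_convex_weight(misleading, plain, y))
```

Because the weights 0 and 1 are both on the grid, the selected combination is never worse on the validation data than either base learner alone. This is a much weaker, sample-level analogue of the oracle-style guarantee the abstract states.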