
Collaborating Authors: Stanford




Reid Hoffman Wants Silicon Valley to 'Stand Up' Against the Trump Administration

WIRED

The LinkedIn cofounder and frequent Trump target has a simple message for his peers: "Just speak up about the things that you think are true."

Reid Hoffman doesn't do much in half measures. He cofounded LinkedIn, of course, and helped bankroll companies including Meta and Airbnb in their startup days. He has also fashioned himself, via books, podcasts, and other public appearances, as something of a public intellectual: a pro-capitalist philosopher who still insists that tech can be a force for good. Most recently, Hoffman has emerged as one of Silicon Valley's most prominent defenders of artificial intelligence. His newest book, published in 2025, makes the case that AI won't diminish human capacity but will instead amplify it. Hoffman even relied on AI to make one of the most unconventional Christmas gifts I've heard of lately (and perhaps one of the most uncomfortable, depending on your view of AI-generated creativity).

Whatever you think of Hoffman's utopian views on AI, credit where due: he is also a very outspoken critic of President Trump, a rare trait in a tech world that has grown increasingly quiet, or cozy, when it comes to the cruelties of the US administration. Hoffman's overt political views haven't been without consequence: Trump has twice threatened to launch investigations into him, most recently calling on Attorney General Pam Bondi to dig into Hoffman's ties to Jeffrey Epstein. (Hoffman has subsequently called for the government to release the Epstein files in full.) Despite those threats, Hoffman isn't pulling punches: when we sat down to tape this episode in mid-December, he readily called out the administration for degrading American government, criticized his peers for keeping their heads down, and urged Silicon Valley to stop pretending that neutrality is a virtue. If only more billionaires were saying it.

So glad to have you here. I'm glad to be here.
We like to start these conversations with some very fast questions. What's the hardest lesson you've ever had to learn? Probably when to give up.


Supplementary Material of STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases — Website/Platform and Hosting

Neural Information Processing Systems

We provide a persistent, dereferenceable identifier, DOI: https://doi.org/10.57967/hf/2530. The STaRK retrieval datasets are released under the CC-BY-4.0 license, as stated on our website. We will maintain our GitHub repository with pull requests and open issues. Code: We have provided the complete codebase in our GitHub repository. Evaluation Procedures: All evaluation procedures are thoroughly documented.


The Biggest AI Companies Met to Find a Better Path for Chatbot Companions

WIRED

In a closed-door workshop led by Anthropic and Stanford, leading AI startups and researchers discussed guidelines for chatbot companions, especially for younger users. For eight hours on Monday at Stanford, representatives from Anthropic, Apple, Google, OpenAI, Meta, and Microsoft met in a closed-door workshop to discuss the use of chatbots as companions or in roleplay scenarios. Interactions with AI tools are often mundane, but they can also lead to dire outcomes. Users sometimes experience mental breakdowns during lengthy conversations with chatbots or confide in them about their suicidal ideations. "We need to have really big conversations across society about what role we want AI to play in our future as humans who are interacting with each other," says Ryn Linthicum, head of user well-being policy at Anthropic.


Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

Brunello, Andrea, Geatti, Luca, Mignani, Michele, Montanari, Angelo, Saccomanno, Nicola

arXiv.org Artificial Intelligence

Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs' actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.


Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

McCoy, Liam G., Haredasht, Fateme Nateghi, Chopra, Kanav, Wu, David, Wu, David JH, Conteh, Abass, Khemani, Sarita, Maharaj, Saloni Kumar, Ravi, Vishnu, Pahwa, Arth, Weng, Yingjie, Rosengaus, Leah, Giang, Lena, Li, Kelvin Zhenghao, Jee, Olivia, Shirvani, Daniel, Goh, Ethan, Chen, Jonathan H.

arXiv.org Artificial Intelligence

This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication.


Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation

Dong, Yifei, Wu, Fengyi, Chen, Guangyu, Cheng, Zhi-Qi, Hu, Qiyu, Zhou, Yuxuan, Sun, Jingdong, He, Jun-Yan, Dai, Qi, Hauptmann, Alexander G.

arXiv.org Artificial Intelligence

Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.

