agent
Hands-On With Gemini Spark: I Gave It Access to My Life and It Friend-Zoned My Boyfriend
Google's new AI agent combed through my emails, documents, and calendar to plan a birthday party and still didn't clock the person most important to me. At its recent I/O developer conference, Google introduced Gemini Spark as an always-on agent that connects to your personal data, completes online tasks, and automates aspects of your daily interactions. It's Google's take on the viral OpenClaw agent that rocked Silicon Valley at the start of 2026. OpenClaw's early adopters handed their entire lives over to an AI agent for messaging and scheduling automation--sometimes with bot-induced mishaps causing embarrassing results.
The Role of Causal Features in Strategic Classification for Robustness and Alignment
Gois, Antonio, Gunluk, Sophia, Rosenfeld, Nir, Hegde, Nidhi, Lacoste-Julien, Simon, Sridhar, Dhanya
In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these as-
As we develop better algorithms under varying assumptions about adaptation (Levanon and Rosenfeld, 2022; Kleinberg and Raghavan, 2018), there are growing concerns about negative social impact on the agents who adapt to these systems, whether outcomes are static (Milli et al., 2019) or dynamic (Gois et al., 2025). When agents adapt, depending on the underlying causal model (Horowitz and Rosenfeld, 2018; Miller et al., 2020), some changes improve agent outcomes while others constitute gaming the classifier, worsening classification error. In this paper, we study whether classifiers can maintain accuracy without sacrificing alignment with predicted agent's goals.
AI Is Taking Over the Most Cursed Job in the World
There's a mad dash to automate the world's most hated calls. You'll hear from an AI debt collector sometime soon. She introduced herself as Eve, but Ben knew right away that the voice on the other end of the line was a bot. She also knew how much money he'd owed a former landlord ($266). She didn't seem to know that he'd settled with a collection agency five months prior. Eve said she was an AI agent from ProCollect and was calling to collect a debt.
AI Agents Plunged the Tech World Into Chaos. Here's Exactly How That Happened
The definitive story of how Claude Code and OpenClaw kicked off computing's biggest transformation possibly ever. "Hi, my name is Peter, and I'm a Claudeholic." It was August 2025 and Peter Steinberger was addressing a meetup in London called Claude Code Anonymous. Steinberger and some fellow addicts had arranged the event to network with people like themselves--techies swept up by coding tools such as Anthropic's paradigm-busting Claude Code. "I dedicate pretty much all my waking time to this, yet it doesn't feel enough," he told the gathering in a cozy, brick-walled room. A few months later, Anthropic released a new version of Claude Code, and the ranks of Claudeholics exploded. Called Opus 4.5, it could handle more complicated programming tasks, retain much more in its memory, run for many hours on end, and manage a team of AI subagents. Anthropic has what it describes as a "notoriously difficult" take-home exam for prospective engineering hires; in a head-to-head comparison of those people and its models, Anthropic claimed that Opus 4.5 "scored higher than any human candidate ever," which "raises questions on how AI will change engineering as a profession."
7 Ways to Get So Good at AI, People Will Think You Are AI
From killing your chatbots to optimizing your prompts, here are the best ways to go full AI native and conquer the new world. Sam Liang is appalled as I confess my technique for recording an interview: running the Voice Memos app on an iPhone and transferring the transcript manually to a Google Doc. The CEO of Otter, a transcription service for analyzing meetings, looks at me as if I tried to call into our video chat using a rotary phone. He believes, naturally, that I should switch to Otter. Time-saving productivity tools like next-gen note-takers, task-based agents, and chatty inbox assistants are exploding in popularity as they invade every nook and cranny of our digital lives.
The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible
Lovén, Lauri, Do, Nam, Mehmood, Hassan, Sah, Dinesh Kumar, Tarkoma, Sasu
We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent's reliable competence: the Behavioral Credibility Trilemma. The impossibility is geometric -- adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows detection requires $Ω(1/Δ^2)$ observations. We prove the principal's optimal oversight rule is necessarily non-affine, making the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment, domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable-$(H, C, A)$ surface geometry showing a plateau-truncated frontier consistent with the predicted inflation saturation.
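The inflation effect described in this abstract can be illustrated with a toy model. The sketch below is not the paper's formalization: it assumes a hypothetical step-shaped autonomy bonus `w_a` that is paid whenever the reported confidence `q` reaches an approval threshold `tau`, combined with a Brier penalty weighted by `w_c`. The function names, the gate shape, and the grid search are all illustrative assumptions.

```python
def expected_reward(q, p, w_c, w_a, tau):
    """Expected reward for reporting confidence q when the true success
    probability is p (toy model, assumed for illustration only)."""
    # Expected Brier loss of report q under true probability p.
    brier = p * (1 - q) ** 2 + (1 - p) * q ** 2
    # Hypothetical autonomy gate: a flat bonus once the report reaches tau.
    bonus = w_a if q >= tau else 0.0
    return -w_c * brier + bonus


def best_report(p, w_c, w_a, tau, grid=1001):
    """Reward-maximizing report, found by search over an evenly spaced grid."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(q, p, w_c, w_a, tau))


# Slightly below the threshold, inflating the report up to tau pays off.
inflated = best_report(p=0.6, w_c=1.0, w_a=0.05, tau=0.7)
# Far below the threshold, the Brier penalty dominates and honesty wins.
honest = best_report(p=0.3, w_c=1.0, w_a=0.05, tau=0.7)
```

In this toy model, reporting `tau` instead of the true `p < tau` costs `w_c * (tau - p) ** 2` in expected Brier loss, so the agent inflates exactly when that cost is smaller than `w_a`. The combined reward is then no longer strictly proper, which is the qualitative behavior the abstract describes for tasks below the approval threshold.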
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
Wiemann, Matt L., Smith, Lindsay M., Melchior, Peter, Mishra-Sharma, Siddharth, Wilson, Andrew Gordon, Izmailov, Pavel, Cuesta-Lázaro, Carolina
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
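The first evaluation axis, trajectory MSE on held-out particles, can be sketched as follows. This is a minimal illustration and not the benchmark's scoring code: the data layout (one list of position tuples per particle) and the uniform averaging over particles, time steps, and coordinates are assumptions, and `trajectory_mse` is a hypothetical name.

```python
def trajectory_mse(predicted, observed):
    """Mean squared position error, averaged uniformly over every particle,
    time step, and coordinate (the averaging scheme is an assumption)."""
    total, count = 0.0, 0
    for pred_particle, obs_particle in zip(predicted, observed):
        for pred_pos, obs_pos in zip(pred_particle, obs_particle):
            for a, b in zip(pred_pos, obs_pos):
                total += (a - b) ** 2
                count += 1
    if count == 0:
        raise ValueError("no trajectory points to compare")
    return total / count


# One held-out particle in 2D, two time steps; only the last y-value differs.
predicted = [[(0.0, 0.0), (1.0, 1.0)]]
observed = [[(0.0, 0.0), (1.0, 3.0)]]
score = trajectory_mse(predicted, observed)  # (0 + 0 + 0 + 4) / 4 = 1.0
```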
HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation
Deng, Zewei, Ye, Tinghan, Xie, Liyan
Agentic text-simulation systems write in sequence, with each item becoming possible context for later steps. That makes uncertainty path-dependent: an early ambiguity can affect later outputs. This paper studies this problem with HawkesLLM, a framework that separates temporal influence modeling from text generation. We represent the cascade as a network whose nodes are text-generating agents. A multivariate Hawkes process models how these nodes activate over time and which earlier node outputs should influence later prompts. A language model then writes each new event from the compact memory selected by this temporal model. We evaluate the framework on a held-out Global Database of Events, Language, and Tone (GDELT) news-cascade case study. The diagnostics track semantic alignment with local held-out references and separate local drift from global drift. In this setting, HawkesLLM improves late-stage semantic alignment under a compact prompt-memory budget.
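To make the temporal component concrete, the sketch below evaluates the conditional intensity of a multivariate Hawkes process with an exponential kernel, a common textbook choice. The abstract does not say which kernel HawkesLLM uses, so the kernel and the names `baseline`, `alpha`, and `beta` are assumptions for illustration.

```python
import math


def hawkes_intensity(t, node, events, baseline, alpha, beta):
    """Intensity of `node` at time t under an assumed exponential kernel:
    lambda_i(t) = mu_i + sum over past events (t_k, j) of
                  alpha[i][j] * beta * exp(-beta * (t - t_k)).
    `events` is a list of (time, source_node) pairs."""
    rate = baseline[node]
    for event_time, source in events:
        if event_time < t:  # only strictly earlier events excite the node
            rate += alpha[node][source] * beta * math.exp(-beta * (t - event_time))
    return rate


baseline = [0.1, 0.2]            # background activation rate per node
alpha = [[0.0, 0.0],             # alpha[i][j]: influence of node j on node i
         [0.5, 0.0]]
events = [(0.0, 0)]              # node 0 produced an output at time 0

excited = hawkes_intensity(1.0, 1, events, baseline, alpha, beta=1.0)
unaffected = hawkes_intensity(1.0, 0, events, baseline, alpha, beta=1.0)
```

In a setup like the one the abstract describes, the per-event contributions to `rate` could serve as weights for choosing which earlier outputs enter a later prompt, although how HawkesLLM performs that selection is not stated here.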
Election Officials Are Getting Ready for ICE to Show Up at the Polls
The Trump administration keeps threatening to send federal agents to oversee elections. State and local officials are preparing, and even gaming out what happens if they're arrested. Last week, as President Donald Trump prepared to leave the White House on his way to China for a state visit, he was asked if he would be willing to deploy troops from the National Guard or agents from Immigration and Customs Enforcement (ICE) to polling locations during November's midterms. "I would do anything necessary to make sure we have honest elections," Trump responded. Trump's comments are the latest in a litany of confusing and sometimes contradictory statements from his administration about the possibility of deploying federal agents to oversee the elections.
When Individually Calibrated Models Become Collectively Miscalibrated
A natural assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically--where "strategically" refers to the game-theoretic sense of Brier-optimal local response, not deliberate gaming or collusion, and arises naturally whenever agents are independently trained on overlapping data. This phenomenon affects multiple independent agents in federated healthcare, multi-vendor intrusion detection, and crowdsourced forecasting, where agents optimize their own objectives. Specifically, we prove that under Brier-score-based aggregation with positively correlated beliefs each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy strictly greater than one whenever Cov(b_i, b_j) > 0. At our canonical setting (n=5 agents, pairwise correlation ρ=0.5, base rate µ=0.3, threshold τ=0.3) the empirically measured PoA in false-negative rate is 7.25 (mean aggregate bias 0.375). In contrast, VCG-based aggregation, which rewards each agent's marginal contribution to aggregate accuracy, achieves dominant-strategy incentive compatibility and the lowest empirical PoA among all mechanisms studied (PoA 1.0). On three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) with feature-partitioned agents, VCG provides the strongest robustness guarantees among the aggregation methods we evaluate, while maintaining comparable accuracy. In data-sparse regimes (n 500), VCG consistently outperforms stacking and majority voting; under adversarial agents, VCG maintains substantially lower false-negative rates than robust aggregation baselines. Adaptive weight updates further reduce false negatives by 20-22% under distribution shift, with O( T) online regret guarantees. These results establish that how probabilistic predictions are aggregated matters as much as how well individual models are calibrated.