

b1041e52d3be19f0a9bc491657488e4a-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems.


SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem

Neural Information Processing Systems

Robust routing under uncertainty is central to real-world logistics, yet most benchmarks assume static, idealized settings. We present SVRPBench, the first open benchmark to capture high-fidelity stochastic dynamics in vehicle routing at urban scale. Spanning more than 500 instances with up to 1000 customers, it simulates realistic delivery conditions: time-dependent congestion, log-normal delays, probabilistic accidents, and empirically grounded time windows for residential and commercial clients. Our pipeline generates diverse, constraint-rich scenarios, including multi-depot and multi-vehicle setups. Benchmarking reveals that state-of-the-art RL solvers like POMO and AM degrade by over 20% under distributional shift, while classical and metaheuristic methods remain robust. To enable reproducible research, we release the dataset (Hugging Face) and evaluation suite (GitHub). SVRPBench challenges the community to design solvers that generalize beyond synthetic assumptions and adapt to real-world uncertainty.
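To make the listed stochastic ingredients concrete, here is a minimal sketch of how a single edge's travel time could combine time-dependent congestion, multiplicative log-normal delay, and a rare accident penalty. It is not SVRPBench's generator: the functions `congestion_factor` and `sample_travel_time`, the rush-hour peaks at 8:00 and 17:30, the noise level, the 2% accident probability, and the 10 to 40 minute accident delay are all hypothetical choices for illustration.

```python
import math
import random


def congestion_factor(hour: float) -> float:
    """Hypothetical multiplier with Gaussian-shaped peaks at 8:00 and 17:30."""
    morning = 0.5 * math.exp(-((hour - 8.0) ** 2) / 2.0)
    evening = 0.5 * math.exp(-((hour - 17.5) ** 2) / 2.0)
    return 1.0 + morning + evening


def sample_travel_time(
    base_minutes: float,
    hour: float,
    rng: random.Random,
    delay_sigma: float = 0.25,
    accident_prob: float = 0.02,
    accident_delay: tuple[float, float] = (10.0, 40.0),
) -> float:
    """Draw one stochastic travel time (minutes) for an edge entered at `hour`.

    All parameter defaults are illustrative assumptions, not SVRPBench values.
    """
    # Deterministic part: free-flow time scaled by congestion at departure.
    expected = base_minutes * congestion_factor(hour)
    # Multiplicative log-normal noise; mu = -sigma^2 / 2 keeps its mean at 1.
    noise = rng.lognormvariate(-(delay_sigma ** 2) / 2.0, delay_sigma)
    travel = expected * noise
    # With small probability an accident adds a uniformly drawn extra delay.
    if rng.random() < accident_prob:
        travel += rng.uniform(*accident_delay)
    return travel


if __name__ == "__main__":
    rng = random.Random(42)
    samples = [sample_travel_time(20.0, 8.0, rng) for _ in range(20_000)]
    print(f"mean travel time at 08:00: {sum(samples) / len(samples):.2f} min")
```

Because the log-normal noise is mean-one, averaging many draws for a 20-minute edge at 8:00 should land close to 20 × 1.5 + 0.02 × 25 ≈ 30.5 minutes; a solver tuned only on the 20-minute free-flow value would systematically underestimate arrival times, which is the kind of gap between synthetic and realistic conditions the benchmark targets.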



Millions of people can get discounts on their bills - here's how

BBC News

Water, phone and broadband companies are willing to give millions of people discounted deals on their bills. Social tariffs - sometimes known as essential, or basic, tariffs - can reduce bills for people on various benefits. Generally, you only need to ask your supplier to get on one. Importantly, they are not price promotions designed to attract customers, but lower bills for the same service for those who would otherwise struggle to pay. Most people who have fallen behind on paying their bills are unaware this help is available, a major report has suggested.


Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards

Neural Information Processing Systems

Online advertising platforms use automated auctions to connect advertisers with potential customers, requiring effective bidding strategies to maximize profits. Accurate ad impact estimation requires considering three key factors: delayed and long-term effects, cumulative ad impacts such as reinforcement or fatigue, and customer heterogeneity. However, these effects are often not jointly addressed in previous studies. To capture these factors, we model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies, ensuring controlled estimation error based on the first-stage estimator's (in)accuracy. Building on this, we design a reinforcement learning algorithm to derive efficient personalized bidding strategies. This approach achieves a near-optimal regret bound of $\tilde{\mathcal{O}}(dH^2\sqrt{T})$, where $d$ is the contextual dimension, $H$ is the number of rounds, and $T$ is the number of customers. Our theoretical findings are validated by simulation experiments.
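The estimation step rests on maximum likelihood for Poisson-distributed rewards that depend on customer context. As a rough illustration of that building block only, the sketch below fits a generic log-linear Poisson regression, y ~ Poisson(exp(b0 + b1·x)), by Newton's method. It is not the paper's two-stage estimator: it ignores delays, cumulative ad effects, and data splitting, and the helper names `sample_poisson` and `fit_poisson_glm`, the single scalar context feature, and the simulated coefficients are all hypothetical.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def fit_poisson_glm(xs, ys, iters=25):
    """Newton's method for y ~ Poisson(exp(b0 + b1 * x)).

    Maximizes the log-likelihood sum(y * eta - exp(eta)), eta = b0 + b1 * x.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Gradient g and Fisher information H = sum(lam * [1, x]^T [1, x]).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            lam = math.exp(b0 + b1 * x)
            r = y - lam
            g0 += r
            g1 += r * x
            h00 += lam
            h01 += lam * x
            h11 += lam * x * x
        det = h00 * h11 - h01 * h01
        # Newton step beta += H^{-1} g, using the closed-form 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


if __name__ == "__main__":
    rng = random.Random(7)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
    ys = [sample_poisson(math.exp(0.5 + 0.8 * x), rng) for x in xs]
    print(fit_poisson_glm(xs, ys))  # should be close to the simulated (0.5, 0.8)
```

A single feature keeps the Newton step to a closed-form 2×2 solve; with a $d$-dimensional context, as in the abstract, one would solve the corresponding $d \times d$ linear system at each iteration instead.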


The FCC Wants to Kill Burner Phones

WIRED

After WIRED reported last week that Meta's smart glasses app contained code that would enable the company to activate face-recognition features on the devices, the company removed the code this week without commenting on why or whether it plans to add such functionality back into the app later. Another WIRED investigation this week found that xAI's Grok is still hosting sexualized deepfakes, including "nudified" images and videos, of celebrities and at least one prominent US politician. After limiting the release of its new Mythos-class AI model over concerns about its potential impacts on cybersecurity, Anthropic announced a model upgrade for partners in its limited-access group this week and launched a "safe" version of the model to the public with guardrails meant to keep the system from being used to fuel cyberattacks. Meanwhile, the United States Cybersecurity and Infrastructure Security Agency issued a new directive to federal agencies this week in reaction to new AI threats that includes a requirement to fix the most urgent software vulnerabilities in as little as three days. As Europe looks to separate and insulate itself from US Big Tech, WIRED created a timeline that tracks all the ways EU governments, companies, and other organizations are moving away from US tech.


Rivian's CEO on Tesla's Cybertruck, Ferrari's Luce, and What Happens If the R2 Fails

WIRED

RJ Scaringe, the CEO of Rivian Automotive, joined us for a wide-ranging interview about how his company's new electric SUV fits into the current EV industry, and what comes next. RJ Scaringe got his PhD from MIT studying internal combustion engines. Then he founded a company to make them obsolete. In 2009, fresh out of grad school, he launched what would become Rivian. The company spent nearly a decade in stealth mode before arriving at the 2018 LA Auto Show with two electric rides nobody had seen coming. The road, however, hasn't been easy. Rivian lost $3.6 billion in 2025, and has burned through nearly $25 billion in the past eight years. It has spent more money over the same period than almost every other pure EV maker. Rivian's IPO was the largest worldwide in 2021, and one of the largest in US history, within days valuing the company at over $100 billion. Its stock has dropped from a high of $130 to around $16. Since the R1 went on sale in 2021, Rivian has sold 175,000 cars.


Anthropic blocks all customers' access to Fable 5 and Mythos 5

Engadget

It's to ensure compliance with a government directive citing national security concerns. Anthropic has disabled all of its customers' access to Fable 5 and Mythos 5 in order to ensure compliance with an order it received from the government on Friday, June 12. All its other models and its Claude chatbot are not affected. The company said in its announcement that the US government wanted it to suspend all foreign nationals' access to its newly launched AI models, whether they're inside or outside the US and even if they're Anthropic employees, citing national security concerns. While the US government didn't specify those concerns, Anthropic believes that it's because the government heard about a method of jailbreaking Fable 5.


World's largest chipmaker does not rule out price rises as costs increase

BBC News

The world's largest chipmaker has told the BBC that inflation is pushing up the cost of doing business, and did not rule out price rises. Taiwan Semiconductor Manufacturing Company (TSMC) makes the most advanced chips designed by companies such as Nvidia, AMD and Apple, so any increase in pricing could ripple through to the cost of AI infrastructure and, potentially over time, the prices customers pay for their electronic devices. However, the firm's chief financial officer, Wendell Huang, said it would not introduce sudden fourfold or fivefold price rises. "We reflect our value," he said, pointing to its technology leadership and manufacturing excellence. In an exclusive and wide-ranging interview, Huang also denied that the AI boom was a bubble and that the firm's global expansion was due to geopolitical pressure.


Humanoid robot cleans first US apartment

FOX News
