WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Neural Information Processing Systems

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks--Bolt.diy,


How to Make an Impact in the AI Economy

TIME - Tech


Realms for Integrated Agent Intelligence

Neural Information Processing Systems

AI agents today are mostly siloed -- they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning, and action -- but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce EMBODIEDWEBAGENTS, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning.


WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Neural Information Processing Systems

Autonomous UI agents powered by AI have tremendous potential to boost human productivity by automating routine tasks such as filing taxes and paying bills. However, a major challenge in unlocking their full potential is security, which is exacerbated by the agent's ability to take action on their user's behalf. Existing tests for prompt injections in web agents either over-simplify the threat by testing unrealistic scenarios or giving the attacker too much power, or look at single-step isolated tasks. To more accurately measure progress for secure web agents, we introduce WASP--a new publicly available benchmark for end-to-end evaluation of Web Agent Security against Prompt injection attacks. Evaluating with WASP shows that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections in very realistic scenarios. Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of cases, even state-of-the-art agents often struggle to fully complete the attacker goals--highlighting the current state of security by incompetence.


MIP against Agent: Malicious Image Patches Hijacking Multimodal OSAgents

Neural Information Processing Systems

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable capabilities, driving significant advancements across a wide range of applications. These models are typically fine-tuned to align with specific objectives, such as being "helpful and harmless" [39]. However, recent work on adversarial attacks has demonstrated that carefully crafted inputs can bypass these alignment safeguards [65, 10, 4, 26, 52]. While such adversarial attacks can elicit harmful responses, the output is usually constrained to text that is not directly actionable, limiting the scope of possible harm. While malicious text outputs are concerning, it remains unclear whether the associated risks exceed those posed by information already accessible through the internet [18].


https://papers.nips.cc/paper_files/paper/2025/file/09265e2568cf7a6ff47b506acbc2c6eb-Paper-Conference.pdf

Neural Information Processing Systems

Fraudulent activities have caused substantial negative social impacts and are exhibiting emerging characteristics such as intelligence and industrialization, posing challenges of high-order interactions, intricate dependencies, and the sparse yet concealed nature of fraudulent entities. Existing graph fraud detectors are limited by their narrow "receptive fields", as they focus only on the relations between an entity and its neighbors while neglecting longer-range structural associations hidden between entities. To address this issue, we propose a novel fraud detector based on Graph Path Aggregation (GPA). It operates through variable-length path sampling, semantic-associated path encoding, path interaction and aggregation, and aggregation-enhanced fraud detection. To further facilitate interpretable association analysis, we synthesize G-Internet, the first benchmark dataset in the field of internet fraud detection. Extensive experiments across datasets in multiple fraud scenarios demonstrate that the proposed GPA outperforms mainstream fraud detectors by up to +15% in Average Precision (AP). Additionally, GPA exhibits enhanced robustness to noisy labels and provides excellent interpretability by uncovering implicit fraudulent patterns across broader contexts.


The DOGE Bros Want Another Shot

The Atlantic - Technology

Two former staffers have created a new, perplexing company. And DOGE alumni make splashy announcements about entering complex industries with scant qualifications while promising to "root out waste." This, at least, is the premise of Special, a newly announced start-up co-founded by Justin Fox and Nate Cavanaugh, two former Department of Government Efficiency staffers who left the federal government "motivated to extend the ethos of our work at DOGE back into the private sector," as they wrote on Special's website. The company officially launched last week with funding from the Elon Musk-friendly contingent of Silicon Valley, including the venture groups Andreessen Horowitz and Human Capital. Special is also backed by investments from numerous Musk associates, including Steve Davis, Musk's top lieutenant at DOGE.


Tadej Pogacar

TIME - Tech

A four-time champion of the Tour de France--including the last two editions of the world's premier cycling event--Tadej Pogacar, 27, can win anywhere on two wheels: Grand Tours, one-week stage races, one-day classics, world-championship road races, and individual time trials. He can win across cobble or gravel and is a master uphill climber. "Pogi" was initially into soccer as a child in Slovenia, but after his older brother started training as a cyclist at a local club, he did the same.


Your health app may be failing you

FOX News

Kurt "CyberGuy" Knutsson lays out how to limit what health apps used by insurance companies can track about you, the user.


How to avoid garbage news on Google Search

Popular Science

'Preferred sources' ensures you're seeing the news outlets you want to see. Get Google News working the way you want it to. When you search Google for something topical, you might see a cluster of headlines from news outlets, reporting breaking stories related to your search query.