sneaker
T2I-ConBench: Text-to-Image Benchmark for Continual Post-training
Huang, Zhehao, Liu, Yuhang, Lou, Yixin, He, Zhengbao, He, Mingzhen, Zhou, Wenxing, Li, Tao, Li, Kehan, Huang, Zeyi, Huang, Xiaolin
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble
Duan, Lin, Xiu, Yanming, Gorlatova, Maria
Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
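The True Positive Rate quoted in this abstract is the share of samples that really are positive (here, scenes containing virtual content) that the model correctly flags. A minimal Python sketch of that calculation follows; the function name and the counts are illustrative placeholders, not figures or code from the DiverseAR evaluation.

```python
def true_positive_rate(true_positives: int, false_negatives: int) -> float:
    """Return TPR = TP / (TP + FN)."""
    total_positives = true_positives + false_negatives
    if total_positives == 0:
        raise ValueError("TPR is undefined when there are no positive samples")
    return true_positives / total_positives


# Hypothetical counts, chosen only to illustrate the arithmetic.
tpr = true_positive_rate(true_positives=45, false_negatives=5)
print(f"TPR = {tpr:.0%}")  # prints "TPR = 90%"
```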
GANFusion: Feed-Forward Text-to-3D with Diffusion in GAN Space
Attaiki, Souhaib, Guerrero, Paul, Ceylan, Duygu, Mitra, Niloy J., Ovsjanikov, Maks
We train a feed-forward text-to-3D diffusion generator for human characters using only single-view 2D data for supervision. Existing 3D generative models cannot yet match the fidelity of image or video generative models. State-of-the-art 3D generators are trained with explicit 3D supervision and are thus limited by the volume and diversity of existing 3D data. Meanwhile, generators that can be trained with only 2D data as supervision typically produce coarser results, cannot be text-conditioned, or must revert to test-time optimization. We observe that GAN- and diffusion-based generators have complementary qualities: GANs can be trained efficiently with 2D supervision to produce high-quality 3D objects but are hard to condition on text. In contrast, denoising diffusion models can be conditioned efficiently but tend to be hard to train with only 2D supervision. We introduce GANFusion, which starts by generating unconditional triplane features for 3D data using a GAN architecture trained with only single-view 2D data. We then generate random samples from the GAN, caption them, and train a text-conditioned diffusion model that directly learns to sample from the space of good triplane features that can be decoded into 3D objects.
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
Yang, Wenkai, Bi, Xiaohan, Lin, Yankai, Chen, Sishuo, Zhou, Jie, Sun, Xu
Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, such as finance, healthcare, and shopping. It is crucial to ensure the reliability and security of LLM-based agents in these applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one typical safety threat to LLM-based agents, the backdoor attack. We first formulate a general framework of agent backdoor attacks, then present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks, web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and that such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.
Google's Visual Search Can Now Answer Even More Complex Questions
When Google Lens was introduced in 2017, the search feature accomplished a feat that not too long ago would have seemed like the stuff of science fiction: Point your phone's camera at an object and Google Lens can identify it, show some context, maybe even let you buy it. It was a new way of searching, one that didn't involve awkwardly typing out descriptions of things you were seeing in front of you. Lens also demonstrated how Google planned to use its machine learning and AI tools to ensure its search engine shows up on every possible surface. As Google increasingly uses its foundational generative AI models to generate summaries of information in response to text searches, Google Lens' visual search has been evolving, too. And now the company says Lens, which powers around 20 billion searches per month, is going to support even more ways to search, including video and multimodal searches.
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Toubal, Imad Eddine, Avinash, Aditya, Alldrin, Neil Gordon, Dlabal, Jan, Zhou, Wenlei, Luo, Enming, Stretcu, Otilia, Xiong, Hao, Lu, Chun-Ta, Zhou, Howard, Krishna, Ranjay, Fuxman, Ariel, Duerig, Tom
From content moderation to wildlife conservation, the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally, developing classifiers for such concepts requires substantial manual effort measured in hours, days, or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques, which enable rapid bootstrapping of image classifiers, users are still required to spend 30 minutes or more of monotonous, repetitive data labeling just to train a single classifier. Drawing on Fiske's Cognitive Miser theory, we propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions, reducing the total effort required to define a concept by an order of magnitude: from labeling 2,000 images to only 100 plus some natural language interactions. Our framework leverages recent advances in foundation models, both large language models and vision-language models, to carve out the concept space through conversation and by automatically labeling training data points. Most importantly, our framework eliminates the need for crowd-sourced annotations. Moreover, our framework ultimately produces lightweight classification models that are deployable in cost-sensitive scenarios. Across 15 subjective concepts and across 2 public image classification datasets, our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models like ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.
Causal Reasoning of Entities and Events in Procedural Texts
Zhang, Li, Xu, Hainiu, Yang, Yue, Zhou, Shuyan, You, Weiqiu, Arora, Manni, Callison-Burch, Chris
Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused either exclusively on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn themselves by touching the pan), while these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning of event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind human performance at .87 F1. We boost model performance to .59 F1 by creatively representing events as programming languages while prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost the performance to .67 F1. Our findings indicate not only the challenge that CREPE brings for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multihop event reasoning.
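To make the code-like prompting idea concrete, here is a small Python sketch that renders a procedure as pseudo-code, with entity states written as intermediate comments ahead of the event question. The function name `build_code_prompt`, the prompt layout, and the example procedure are hypothetical illustrations; they are not the prompt format used by the CREPE authors.

```python
def build_code_prompt(goal, steps, event):
    """Render a procedure as a code-like prompt (hypothetical layout).

    steps is a list of (step_text, {entity: state}) pairs; the entity
    states act as intermediate reasoning placed before the event question.
    """
    lines = [f"# Goal: {goal}", "def procedure():"]
    for index, (text, states) in enumerate(steps, start=1):
        lines.append(f"    step_{index} = {text!r}")
        for entity, state in states.items():
            lines.append(f"    # entity state: {entity} is {state}")
    lines.append(f"    # event: {event}")
    lines.append("    # question: is this event plausible now?")
    return "\n".join(lines)


# Illustrative procedure based on the pan example in the abstract.
prompt = build_code_prompt(
    goal="fry an egg",
    steps=[
        ("turn on the stove", {"stove": "on"}),
        ("put the pan on the stove", {"pan": "hot"}),
    ],
    event="touching the pan burns your hand",
)
print(prompt)
```

The resulting string would be sent to a code-pretrained language model; how the model's completion is parsed into a label is left out of this sketch.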
The future of AI in building digital experiences online
Artificial intelligence (AI) is one of the most talked-about topics in technology these days. Many developers are using AI to build applications that can act like intelligent agents and help you accomplish tasks more efficiently. However, if you're a marketer or customer service representative, chances are that you don't have much knowledge about AI. So what exactly is artificial intelligence? What can it do for your business? And how will it affect digital experiences?
Deep Objects Is Using Artificial Intelligence to Democratize Good Design
A quick run through the popular program DALL-E 2 for terms like 'Virgil Abloh-inspired sneaker' or 'Yeezy sneaker' spits out a 'best guess' that resembles dollar-bin unlicensed bootlegs. It's clunky, sterile, and lacks the narrative of what excites us about these designers. If we want AI to help 'push culture forward', these are not the machines for the job. In rethinking how artificial intelligence can improve design, Deep Objects sought to create a model where human input was key, building an AI engine that democratizes the design of cultural artifacts. Built by the creative studio FTR (whose credits include Nike, PUMA, Google, Marni, Kendrick Lamar, Travis Scott, and Daft Punk), the team has been working on the project in secret for nearly two years.
Hey Google, tighten my sneakers! Nike adds virtual assistant to its Adapt BB basketball shoes
As if tightening your shoes wasn't easy enough, Nike will now let you adjust your kicks using your voice. The firm's Adapt BB basketball sneakers are designed with a power-lacing system that is activated by pushing a button on the shoe, but now Google's virtual assistant can do it for you. Google has added 'Hey, Google' abilities to the Nike Adapt app, allowing wearers to voice their need just by speaking into their smartphone. The capability is part of a larger launch for Google, which adds the virtual assistant to 30 third-party apps including Twitter, Spotify and MyFitnessPal. Nike's $400 sneaker is designed with a power-lacing system that users control by pushing buttons on the side of the shoe or in the companion Nike Adapt app.