


Locating What You Need: Towards Adapting Diffusion Models to OOD Concepts In-the-Wild

Neural Information Processing Systems

The recent large-scale text-to-image generative models have attained unprecedented performance, while people established adaptor modules like LoRA and DreamBooth to extend this performance to even more unseen concept tokens. However, we empirically find that this workflow often fails to accurately depict the out-of-distribution concepts. This failure is highly related to the low quality of training data. To resolve this, we present a framework called Controllable Adaptor Towards Out-of-Distribution Concepts (CATOD). Our framework follows the active learning paradigm which includes high-quality data accumulation and adaptor training, enabling a finer-grained enhancement of generative results. The aesthetics score and concept-matching score are two major factors that impact the quality of synthetic results. One key component of CATOD is the weighted scoring system that automatically balances between these two scores and we also offer comprehensive theoretical analysis for this point. Then, it determines how to select data and schedule the adaptor training based on this scoring system. The extensive results show that CATOD significantly outperforms the prior approaches with an 11.10 boost on the CLIP score and a 33.08% decrease on the CMMD metric.
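
The weighted scoring idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Candidate` fields, the fixed `weight` parameter, and the top-k selection rule are hypothetical, and the abstract states that CATOD balances the two scores automatically rather than through a hand-set weight.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A generated image with two quality scores (hypothetical structure)."""
    name: str
    aesthetic: float       # assumed to be normalised to [0, 1]
    concept_match: float   # assumed to be normalised to [0, 1]


def combined_score(candidate: Candidate, weight: float) -> float:
    """Convex combination of the two scores; `weight` favours aesthetics."""
    return weight * candidate.aesthetic + (1.0 - weight) * candidate.concept_match


def select_top_k(candidates: list[Candidate], weight: float, k: int) -> list[Candidate]:
    """Keep the k candidates with the highest combined score."""
    ranked = sorted(candidates, key=lambda c: combined_score(c, weight), reverse=True)
    return ranked[:k]


pool = [
    Candidate("a", aesthetic=0.9, concept_match=0.2),
    Candidate("b", aesthetic=0.4, concept_match=0.8),
    Candidate("c", aesthetic=0.7, concept_match=0.6),
]
```

With equal weighting, `select_top_k(pool, 0.5, 2)` keeps `c` and `b`; shifting the weight towards aesthetics (`0.9`) keeps `a` and `c` instead, which shows how the balance between the two scores changes which data is accumulated.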


Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection

Neural Information Processing Systems

While numerous Video Violence Detection (VVD) methods have focused on representation learning in Euclidean space, they struggle to learn sufficiently discriminative features, leading to weaknesses in recognizing normal events that are visually similar to violent events (i.e., ambiguous violence). In contrast, hyperbolic representation learning, renowned for its ability to model hierarchical and complex relationships between events, has the potential to amplify the discrimination between visually similar events. Inspired by these, we develop a novel Dual-Space Representation Learning (DSRL) method for weakly supervised VVD to utilize the strength of both Euclidean and hyperbolic geometries, capturing the visual features of events while also exploring the intrinsic relations between events, thereby enhancing the discriminative capacity of the features. DSRL employs a novel information aggregation strategy to progressively learn event context in hyperbolic spaces, which selects aggregation nodes through layer-sensitive hyperbolic association degrees constrained by hyperbolic Dirichlet energy. Furthermore, DSRL attempts to break the cyber-balkanization of different spaces, utilizing cross-space attention to facilitate information interactions between Euclidean and hyperbolic space to capture better discriminative features for final violence detection. Comprehensive experiments demonstrate the effectiveness of our proposed DSRL.
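
To make the contrast between the two geometries concrete, here is a small sketch of the standard distance function on the Poincaré ball, one common model of hyperbolic space. The choice of the Poincaré ball and the function name are assumptions; the abstract does not say which hyperbolic model DSRL uses.

```python
import math


def poincare_distance(u: list[float], v: list[float]) -> float:
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Two pairs of points with the same Euclidean gap of 0.1.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

The same Euclidean gap yields a larger hyperbolic distance near the boundary of the ball than near the origin, which is the property that lets hyperbolic embeddings separate points that look close in Euclidean space.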


Zero-Shot Reinforcement Learning from Low Quality Data

Neural Information Processing Systems

Zero-shot reinforcement learning (RL) promises to provide agents that can perform any task in an environment after an offline, reward-free pre-training phase. Methods leveraging successor measures and successor features have shown strong performance in this setting, but require access to large heterogeneous datasets for pre-training, which cannot be expected for most real problems. Here, we explore how the performance of zero-shot RL methods degrades when trained on small homogeneous datasets, and propose fixes inspired by conservatism, a well-established feature of performant single-task offline RL algorithms. We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets. Somewhat surprisingly, our proposals also outperform baselines that get to see the task during training.


A Supplementary Material A.1 Dataset Nutrition Labels

Neural Information Processing Systems

A.2 Mercury Data Distribution and Customized Data Structures. In addition to the built-in Python data structures, Mercury imports two further structures to enhance diversity and complexity, as shown in Figure 4. Table 6: Mercury-eval encompasses 256 tasks, the difficulty of which has been balanced for model evaluation. Mercury-train comprises the remaining 1,633 tasks. Figure 4: Mercury supports two customized data structures: TreeNode and ListNode. Each piece of code executed within the sandbox is subject to constraints that ensure fair use of resources and prevent any single program from monopolizing the system. Specifically, there are two primary constraints: a time limit and a memory limit. The time limit restricts how long the code can execute before being forcibly terminated, thereby ensuring that no infinite loops or excessively long computations negatively impact the availability of the sandbox.
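
The time-limit constraint described above can be sketched with the Python standard library. This is an illustrative assumption, not Mercury's actual sandbox: the function name `run_with_time_limit` and the returned dictionary are hypothetical, and the memory limit is not covered here.

```python
import subprocess
import sys


def run_with_time_limit(source: str, time_limit_s: float) -> dict:
    """Run Python source in a child process and stop it after time_limit_s seconds."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=time_limit_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # The code ran too long (for example, an infinite loop).
        return {"status": "timeout", "stdout": ""}
    status = "ok" if completed.returncode == 0 else "error"
    return {"status": status, "stdout": completed.stdout}
```

Calling `run_with_time_limit("while True: pass", 1)` reports a `timeout` status after about one second, while a short snippet such as `print(1 + 1)` completes normally and returns its output.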


Dell wants to be your one-stop shop for AI infrastructure

ZDNet

Michael Dell is pitching a "decentralized" future for artificial intelligence that his company's devices will make possible. "The future of AI will be decentralized, low-latency, and hyper-efficient," predicted the Dell Technologies founder, chairman, and CEO in his Dell World keynote, which you can watch on YouTube. "AI will follow the data, not the other way around," Dell said at Monday's kickoff of the company's four-day customer conference in Las Vegas. Dell is betting that the complexity of deploying generative AI on-premise is driving companies to embrace a vendor with all of the parts, plus 24-hour-a-day service and support, including monitoring. On day two of the show, Dell chief operating officer Jeffrey Clarke noted that Dell's survey of enterprise customers shows 37% want an infrastructure vendor to "build their entire AI stack for them," adding, "We think Dell is becoming an enterprise's 'one-stop shop' for all AI infrastructure."


Google releases its asynchronous Jules AI agent for coding - how to try it for free

ZDNet

The race to deploy AI agents is heating up. At its annual I/O developer conference yesterday, Google announced that Jules, its new AI coding assistant, is now available worldwide in public beta. The launch marks the company's latest effort to corner the burgeoning market for AI agents, widely regarded across Silicon Valley as essentially a more practical and profitable form of chatbot. Virtually every other major tech giant -- including Meta, OpenAI, and Amazon, just to name a few -- has launched its own agent product in recent months. Originally unveiled by Google Labs in December, Jules is positioned as a reliable, automated coding assistant that can manage a broad suite of time-consuming tasks on behalf of human users. The model is "asynchronous," which, in programming-speak, means it can start and work on tasks without having to wait for any single one of them to finish.
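
As a general illustration of what "asynchronous" means in programming (and not a description of how Jules itself is built), the sketch below starts two tasks together with Python's `asyncio`. The task names and durations are made up for the example.

```python
import asyncio


async def run_task(name: str, seconds: float, finished: list[str]) -> str:
    """Simulate a unit of work that takes `seconds` to complete."""
    await asyncio.sleep(seconds)  # yields control so other tasks can progress
    finished.append(name)
    return name


async def main() -> list[str]:
    finished: list[str] = []
    # Both tasks are started together; neither waits for the other to finish.
    await asyncio.gather(
        run_task("slow", 0.2, finished),
        run_task("fast", 0.05, finished),
    )
    return finished


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints ['fast', 'slow']
```

Although the slow task is started first, the fast task finishes first, because starting a task does not block the others from making progress.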


Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control

Neural Information Processing Systems

Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments.


AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties

Neural Information Processing Systems

Anomaly detection is widely used for identifying critical errors and suspicious behaviors, but current methods lack interpretability. We leverage common properties of existing methods and recent advances in generative models to introduce counterfactual explanations for anomaly detection. Given an input, we generate its counterfactual as a diffusion-based repair that shows what a non-anomalous version should have looked like. A key advantage of this approach is that it enables a domain-independent formal specification of explainability desiderata, offering a unified framework for generating and evaluating explanations. We demonstrate the effectiveness of our anomaly explainability framework, AR-Pro, on vision (MVTec, VisA) and time-series (SWaT, WADI, HAI) anomaly datasets. The code used for the experiments is accessible at: https://github.com/xjiae/arpro.


Mixture of Link Predictors on Graphs

Neural Information Processing Systems

Link prediction, which aims to forecast unseen connections in graphs, is a fundamental task in graph machine learning. Heuristic methods, leveraging a range of different pairwise measures such as common neighbors and shortest paths, often rival the performance of vanilla Graph Neural Networks (GNNs). Therefore, recent advancements in GNNs for link prediction (GNN4LP) have primarily focused on integrating one or a few types of pairwise information. In this work, we reveal that different node pairs within the same dataset necessitate varied pairwise information for accurate prediction and models that only apply the same pairwise information uniformly could achieve suboptimal performance. As a result, we propose a simple mixture of experts model Link-MoE for link prediction. Link-MoE utilizes various GNNs as experts and strategically selects the appropriate expert for each node pair based on various types of pairwise information. Experimental results across diverse real-world datasets demonstrate substantial performance improvement from Link-MoE. Notably, Link-MoE achieves a relative improvement of 18.71% on the MRR metric for the Pubmed dataset and 9.59% on the Hits@100 metric for the ogbl-ppa dataset, compared to the best baselines. The code is available at https://github.com/ml-ml/Link-MoE/.
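
The mixture-of-experts idea can be sketched as a gate that turns pairwise features into weights over expert scores. This is a simplified assumption rather than the Link-MoE implementation: the linear gate, the `gate_weights` matrix, and plain numbers standing in for GNN expert outputs are all hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(pair_features: list[float], gate_weights: list[list[float]]) -> list[float]:
    """One linear score per expert, computed from the pairwise features."""
    return [sum(w * f for w, f in zip(row, pair_features)) for row in gate_weights]


def mixture_score(
    pair_features: list[float],
    expert_scores: list[float],
    gate_weights: list[list[float]],
) -> float:
    """Weight each expert's link score by the gate's softmax output."""
    weights = softmax(gate_logits(pair_features, gate_weights))
    return sum(w * s for w, s in zip(weights, expert_scores))
```

With an all-zero gate the result is the plain average of the expert scores; a gate that responds strongly to one pairwise feature (for example, a common-neighbor count) shifts nearly all the weight to a single expert for node pairs where that feature is large.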


Smart home got the cold shoulder at Google's I/O keynote

PCWorld

From game-changing text diffusion models and cutting-edge AR glasses to AI videos with sound and virtual clothing try-ons, there was plenty of amazing tech to see during Google's I/O keynote on Tuesday. The closest we got to a smart home shout-out was when a Google exec said that Gemini--the star of the show--is "coming to your watch, your car dashboard, even your TV." As Google puts its Google TV Streamer under the umbrella of smart home, we'll count that as a fleeting reference. Officially, Google has promised that Gemini is coming to Nest devices. Gemini on Nest speakers has been available on a public-preview basis for months now, and back in March, Google confirmed that a "new experience powered by Gemini" is coming to smart speakers and displays.