Wang, Xi
Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator
Zhu, Yuanzhi, Wang, Xi, Lathuilière, Stéphane, Kalogeiton, Vicky
Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference requiring multiple sampling steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes the model's output logits in an on-policy framework with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher's training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, achieving performance competitive with the multi-step teacher while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
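A minimal PyTorch sketch of the token-level distribution matching idea; all names, shapes, and the exact form of the loss are illustrative rather than the paper's formulation:

```python
import torch.nn.functional as F

# Illustrative token-level distribution matching loss (hypothetical shapes;
# Di[M]O's exact objective and auxiliary-model role may differ).
def distribution_matching_loss(student_logits, teacher_logits, mask):
    """student_logits, teacher_logits: (batch, seq_len, vocab_size)
    mask: (batch, seq_len) bool, True at masked token positions.
    In an on-policy setup, the masked states would come from re-masking
    the one-step generator's own samples, not from teacher trajectories.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # Token-wise KL(teacher || student), averaged over masked positions only.
    kl = (p_teacher * (p_teacher.clamp_min(1e-8).log() - log_p_student)).sum(-1)
    return (kl * mask).sum() / mask.sum().clamp_min(1)
```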
Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning
Wang, Xi
In neural networks, the weight vector W of a neuron plays a crucial role in transforming input features into outputs. While W represents the synaptic weights connecting presynaptic neurons to a postsynaptic neuron, it can also be viewed as the neuron's encoding of the target concept it aims to represent. However, defining a target concept in isolation from other concepts often yields an insufficient representation; effective learning requires contrasting the target with non-targets. For instance, to accurately define a "dog," it is essential not only to understand the characteristics of dogs but also to distinguish them from non-dog entities. Without this contrast, differentiation remains incomplete. Similarly, when a neuron learns, it should capture the differences between the features of the target class (hereafter termed positive examples) and those of non-target classes (negative examples).
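A toy reading of the dual-weight idea (my interpretation of the abstract, not the paper's exact update rule): the neuron keeps separate traces for positive and negative examples, and its effective weight is their contrast:

```python
import numpy as np

class DualWeightNeuron:
    """Toy dual-weight neuron: W encodes the target class in contrast to
    non-targets, via separate positive and negative weight traces."""

    def __init__(self, dim, lr=0.01):
        self.w_pos = np.zeros(dim)  # running trace of positive-example features
        self.w_neg = np.zeros(dim)  # running trace of negative-example features
        self.lr = lr

    @property
    def w(self):
        return self.w_pos - self.w_neg  # effective weight: target minus non-target

    def update(self, x, is_positive):
        if is_positive:
            self.w_pos += self.lr * (x - self.w_pos)
        else:
            self.w_neg += self.lr * (x - self.w_neg)

    def activate(self, x):
        return float(self.w @ x)
```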
PGAD: Prototype-Guided Adaptive Distillation for Multi-Modal Learning in AD Diagnosis
Li, Yanfei, Yin, Teng, Shang, Wenyi, Liu, Jingyu, Wang, Xi, Zhao, Kaiyang
Missing modalities pose a major issue in Alzheimer's Disease (AD) diagnosis, as many subjects lack full imaging data due to cost and clinical constraints. While multi-modal learning leverages complementary information, most existing methods train only on complete data, ignoring the large proportion of incomplete samples in real-world datasets like ADNI. This reduces the effective training set and limits the full use of valuable medical data. While some methods incorporate incomplete samples, they fail to effectively address inter-modal feature alignment and knowledge transfer challenges under high missing rates. To address this, we propose a Prototype-Guided Adaptive Distillation (PGAD) framework that directly incorporates incomplete multi-modal data into training. PGAD enhances missing modality representations through prototype matching and balances learning with a dynamic sampling strategy. We validate PGAD on the ADNI dataset with varying missing rates (20%, 50%, and 70%) and demonstrate that it significantly outperforms state-of-the-art approaches. Ablation studies confirm the effectiveness of prototype matching and adaptive sampling, highlighting the potential of our framework for robust and scalable AD diagnosis in real-world clinical settings.
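An illustrative prototype-matching term in PyTorch (the loss form and names are hypothetical; PGAD's actual formulation may differ): features of incomplete samples are pulled toward class prototypes estimated from complete-modality samples:

```python
import torch.nn.functional as F

def prototype_matching_loss(features, labels, prototypes):
    """features: (batch, d) embeddings of samples with missing modalities.
    labels: (batch,) class indices.
    prototypes: (num_classes, d) running means from complete-modality samples.
    """
    target = prototypes[labels]  # matched class prototype per sample
    # Encourage missing-modality features to align with their prototype.
    return 1 - F.cosine_similarity(features, target, dim=-1).mean()
```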
GOD model: Privacy Preserved AI School for Personal Assistant
PIN AI Team, null, Sun, Bill, Guo, Gavin, Peng, Regan, Zhang, Boliang, Wang, Shouqiao, Florescu, Laura, Wang, Xi, Crapis, Davide, Wu, Ben
Personal AI assistants (e.g., Apple Intelligence, Meta AI) offer proactive recommendations that simplify everyday tasks, but their reliance on sensitive user data raises concerns about privacy and trust. To address these challenges, we introduce the Guardian of Data (GOD), a secure, privacy-preserving framework for training and evaluating AI assistants directly on-device. Unlike traditional benchmarks, the GOD model measures how well assistants can anticipate user needs, such as suggesting gifts, while protecting user data and autonomy. Functioning like an AI school, it addresses the cold start problem by simulating user queries and employing a curriculum-based approach to refine the performance of each assistant. Running within a Trusted Execution Environment (TEE), it safeguards user data while applying reinforcement and imitation learning to refine AI recommendations. A token-based incentive system encourages users to share data securely, creating a data flywheel that drives continuous improvement. Specifically, users mine with their data, and the mining rate is determined by GOD's evaluation of how well their AI assistant understands them across categories such as shopping, social interactions, productivity, trading, and Web3. By integrating privacy, personalization, and trust, the GOD model provides a scalable, responsible path for advancing personal AI assistants. For community collaboration, part of the framework is open-sourced at https://github.com/PIN-AI/God-Model.
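The abstract names the evaluation categories but not the aggregation, so the following mining-rate computation is only a guessed illustration with made-up weights:

```python
# Hypothetical category weights; the real GOD evaluation may aggregate differently.
CATEGORY_WEIGHTS = {
    "shopping": 0.25, "social": 0.20, "productivity": 0.20,
    "trading": 0.20, "web3": 0.15,
}

def mining_rate(scores, base_rate=1.0):
    """scores: per-category understanding scores in [0, 1]."""
    weighted = sum(w * scores.get(c, 0.0) for c, w in CATEGORY_WEIGHTS.items())
    return base_rate * weighted
```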
DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis
Li, Zeju, Xu, Changran, Shi, Zhengyuan, Peng, Zedong, Liu, Yi, Zhou, Yunhao, Zhou, Lingfeng, Ma, Chengyu, Zhong, Jianyuan, Wang, Xi, Zhao, Jieru, Chu, Zhufei, Yang, Xiaoyan, Xu, Qiang
This paper introduces DeepCircuitX, a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. This structure enables more nuanced training and evaluation of large language models (LLMs) for RTL-specific tasks. DeepCircuitX is enriched with Chain of Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels. These annotations enhance its utility for a wide range of tasks, including RTL code understanding, generation, and completion. Additionally, the dataset includes synthesized netlists and PPA metrics, facilitating early-stage design exploration and enabling accurate PPA prediction directly from RTL code. We demonstrate the dataset's effectiveness by fine-tuning various LLMs on it and confirm its quality with human evaluations. Our results highlight DeepCircuitX as a critical resource for advancing RTL-focused machine learning applications in hardware design automation. Our data is available at https://zeju.gitbook.io/lcm-team.
FreeTumor: Large-Scale Generative Tumor Synthesis in Computed Tomography Images for Improving Tumor Recognition
Wu, Linshan, Zhuang, Jiaxin, Zhou, Yanning, He, Sunan, Ma, Jiabo, Luo, Luyang, Wang, Xi, Ni, Xuefeng, Zhong, Xiaoling, Wu, Mingxiang, Zhao, Yinghua, Duan, Xiaohui, Vardhanabhuti, Varut, Rajpurkar, Pranav, Chen, Hao
Tumors are a leading cause of death worldwide, with an estimated 10 million deaths attributed to tumor-related diseases every year. AI-driven tumor recognition unlocks new possibilities for more precise and intelligent tumor screening and diagnosis. However, progress is heavily hampered by the scarcity of annotated datasets, since annotation demands extensive effort from radiologists. To tackle this challenge, we introduce FreeTumor, an innovative Generative AI (GAI) framework that enables large-scale tumor synthesis to mitigate data scarcity. Specifically, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. Unleashing the power of large-scale data, FreeTumor can synthesize a large number of realistic tumors on images to augment training datasets. To this end, we create the largest training dataset for tumor synthesis and recognition by curating 161,310 publicly available Computed Tomography (CT) volumes from 33 sources, of which only 2.3% contain annotated tumors. To validate the fidelity of synthetic tumors, we engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors: the radiologists achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing synthetic tumors from real ones. Through high-quality tumor synthesis, FreeTumor scales up the recognition training datasets by over 40 times, showing a notable superiority over state-of-the-art AI methods, including various synthesis methods and foundation models. These findings indicate promising prospects of FreeTumor in clinical applications, potentially advancing tumor treatments and improving patient survival rates.
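For reference, the Visual Turing Test numbers can be read with the standard definitions below, treating "synthetic" as the positive class (the paper's exact protocol may differ). A sensitivity near 50% means radiologists flagged synthetic tumors at roughly chance level:

```python
def turing_test_metrics(tp, fn, tn, fp):
    """tp: synthetic tumors correctly flagged as synthetic; fn: synthetic
    judged real; tn: real judged real; fp: real judged synthetic."""
    sensitivity = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, accuracy
```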
AKiRa: Augmentation Kit on Rays for optical video generation
Wang, Xi, Courant, Robin, Christie, Marc, Kalogeiton, Vicky
Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer users limited or no control over camera aspects, including dynamic camera motion, zoom, lens distortion, and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the gap between controllable video generation and camera optics. To achieve this, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-grained control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future optical video generation methods.
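A minimal sketch of one optical augmentation in the spirit of AKiRa (the parameterization is assumed, not the paper's camera model): radial lens distortion applied to normalized image coordinates before rays are cast:

```python
import numpy as np

def distort_points(xy, k1, k2=0.0):
    """xy: (N, 2) focal-normalized image coords; k1, k2: radial coefficients.
    k1 > 0 gives a barrel / fisheye look, k1 < 0 pincushion."""
    r2 = (xy ** 2).sum(axis=1, keepdims=True)
    return xy * (1.0 + k1 * r2 + k2 * r2 ** 2)

def rays_from_points(xy):
    """Turn (distorted) focal-normalized points into unit ray directions."""
    d = np.concatenate([xy, np.ones((xy.shape[0], 1))], axis=1)
    return d / np.linalg.norm(d, axis=1, keepdims=True)
```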
Adaptive Visual Perception for Robotic Construction Process: A Multi-Robot Coordination Framework
Xu, Jia, Dixit, Manish, Wang, Xi
Construction robots operate on unstructured construction sites, where effective visual perception is crucial for safe and seamless operation. However, construction robots often handle large elements and perform tasks across expansive areas, resulting in occluded views from onboard cameras and necessitating multiple environmental cameras to capture the large task space. This study proposes a multi-robot coordination framework in which a team of camera-equipped supervising robots adaptively adjust their poses to visually perceive the operation of the primary construction robot and its surrounding environment. A viewpoint selection method determines each supervising robot's camera viewpoint, optimizing visual coverage and proximity while accounting for the visibility of the upcoming construction robot operation. A case study on prefabricated wooden frame installation demonstrates the system's feasibility, and further experiments validate the performance and robustness of the proposed viewpoint selection method across various settings. This research advances visual perception of robotic construction processes and paves the way for integrating computer vision techniques to enable real-time adaptation and responsiveness. Such advancements contribute to the safe and efficient operation of construction robots on inherently unstructured construction sites.
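A toy scoring rule in the spirit of the proposed viewpoint selection (weights and functional form are illustrative, not the paper's): trade off visual coverage against camera proximity, and reject viewpoints from which the upcoming operation is occluded:

```python
def viewpoint_score(coverage, distance, operation_visible,
                    w_cov=1.0, w_prox=0.5, d_ref=5.0):
    if not operation_visible:                       # hard visibility constraint
        return float("-inf")
    proximity = max(0.0, 1.0 - distance / d_ref)    # closer is better, capped
    return w_cov * coverage + w_prox * proximity

def select_viewpoint(candidates):
    """candidates: iterable of (pose, coverage, distance, visible) tuples."""
    return max(candidates, key=lambda c: viewpoint_score(c[1], c[2], c[3]))[0]
```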
Holistic Understanding of 3D Scenes as Universal Scene Description
Halacheva, Anna-Maria, Miao, Yang, Zaech, Jan-Nico, Wang, Xi, Van Gool, Luc, Paudel, Danda Pani
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While numerous datasets approach the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered by current works. In this work, we address this shortcoming and introduce (1) an expertly curated dataset in the Universal Scene Description (USD) format, featuring high-quality manual annotations for instance segmentation and articulation across 280 indoor scenes; (2) a learning-based model together with a novel baseline capable of predicting part segmentation along with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark serving to compare upcoming methods for the task at hand. Overall, our dataset provides eight types of annotations: object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models. All data is provided in the USD format, allowing interoperability and easy integration with downstream tasks. We provide open access to our dataset, benchmark, and method's source code.
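Because the data ships as USD, annotations can be read with the standard OpenUSD Python bindings; the attribute name below ("motion:type") is a placeholder, not the dataset's documented schema:

```python
from pxr import Usd  # OpenUSD bindings, e.g. `pip install usd-core`

stage = Usd.Stage.Open("scene.usd")
for prim in stage.Traverse():
    attr = prim.GetAttribute("motion:type")  # placeholder attribute name
    if attr and attr.HasValue():
        print(prim.GetPath(), attr.Get())
```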
Understanding the World's Museums through Vision-Language Reasoning
Balauca, Ada-Astrid, Garai, Sanjana, Balauca, Stefan, Shetty, Rasesh Udayakumar, Agrawal, Naitik, Shah, Dhwanil Subhashbhai, Fu, Yuqian, Wang, Xi, Toutanova, Kristina, Paudel, Danda Pani, Van Gool, Luc
Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections whose records reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights into the complex and fine-grained understanding of museum exhibits. In particular, we show that questions whose answers can be derived directly from visual features are answered well by both types of models. On the other hand, questions that require grounding the visual features in repositories of human knowledge are better answered by the large vision-language models, demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: https://github.com/insait-institute/Museum-65
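As a hedged usage example, the public BLIP-VQA checkpoint below stands in for the paper's fine-tuned models (whether the Museum-65 checkpoints are released is not stated in the abstract):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Public base checkpoint, not the museum fine-tuned model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("exhibit.jpg").convert("RGB")
inputs = processor(image, "What material is this artifact made of?",
                   return_tensors="pt")
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```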