Sun, Qi
SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image
Jin, Qian, Jiang, Yuqi, Lu, Xudong, Liu, Yumeng, Chen, Yining, Gao, Dawei, Sun, Qi, Zhuo, Cheng
In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data and labels, as well as poor transferability. In this paper, we propose a novel few-shot learning approach, SEM-CLIP, for accurate defect classification and segmentation. SEM-CLIP customizes the Contrastive Language-Image Pretraining (CLIP) model to better focus on defect areas and minimize background distractions, thereby enhancing segmentation accuracy. We employ text prompts enriched with domain knowledge as prior information to assist in precise analysis. Additionally, our approach incorporates feature engineering with textual guidance to categorize defects more effectively. SEM-CLIP requires little annotated data, substantially reducing labor demands in the semiconductor industry. Extensive experimental validation demonstrates that our model achieves impressive classification and segmentation results under few-shot learning scenarios.
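To make the CLIP-based idea concrete, here is a minimal sketch of scoring dense image-patch features against defect and normal text prompts with a CLIP-style dual encoder; the prompt wording, encoder interfaces, and the softmax scoring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): score patch features against
# defect vs. normal text prompts with a CLIP-style dual encoder.
# `text_encoder` is assumed to return embeddings of shape [K, D];
# the prompt wording is illustrative only.
import torch
import torch.nn.functional as F

def defect_score_map(patch_feats: torch.Tensor,
                     text_encoder,
                     defect_prompts=("a SEM image of a wafer defect",),
                     normal_prompts=("a SEM image of a normal wafer surface",)):
    """patch_feats: [H*W, D] dense patch embeddings from the image encoder."""
    d = F.normalize(text_encoder(list(defect_prompts)), dim=-1).mean(0)  # [D]
    n = F.normalize(text_encoder(list(normal_prompts)), dim=-1).mean(0)  # [D]
    p = F.normalize(patch_feats, dim=-1)                                 # [H*W, D]
    logits = torch.stack([p @ n, p @ d], dim=-1)                         # [H*W, 2]
    return logits.softmax(-1)[..., 1]  # per-patch probability of "defect"
```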
Concurrent Learning with Aggregated States via Randomized Least Squares Value Iteration
Chen, Yan, Bai, Qinxun, Zhang, Yiteng, Dong, Shi, Dimakopoulou, Maria, Sun, Qi, Zhou, Zhengyuan
Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions on a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents \textit{concurrently} explore an environment. The theoretical results established in this work tender an affirmative answer to this question. We adapt the concurrent learning framework to \textit{randomized least-squares value iteration} (RLSVI) with \textit{aggregated state representation}. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $\Theta\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurrent learning. Our algorithm exhibits significantly lower space complexity compared to \cite{russo2019worst} and \cite{agrawal2021improved}: we reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$ increase in the worst-case regret bound. Additionally, we conduct numerical experiments to demonstrate our theoretical findings.
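For intuition about the algorithmic core, the following sketch shows a single perturbed least-squares backup in the spirit of RLSVI over an aggregated-state feature map; the noise scale, regularizer, and the omission of prior perturbations are simplifying assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch of one randomized least-squares value iteration (RLSVI)
# backup with an aggregated-state representation; sigma and lam are
# illustrative hyperparameters, not the paper's settings.
import numpy as np

def rlsvi_backup(phi, r, phi_next, Q_next, sigma=1.0, lam=1.0, rng=None):
    """
    phi:      [n, d] aggregated-state(-action) features of visited pairs
    r:        [n]    observed rewards
    phi_next: [n, d] features of the (greedy) next-step choices
    Q_next:   [d]    value parameter for the next stage (zeros at the horizon)
    Returns a randomized value parameter obtained from a perturbed regression.
    """
    rng = np.random.default_rng() if rng is None else rng
    targets = r + phi_next @ Q_next
    noisy = targets + sigma * rng.standard_normal(len(targets))  # inject randomization
    A = phi.T @ phi + lam * np.eye(phi.shape[1])                 # regularized Gram matrix
    theta = np.linalg.solve(A, phi.T @ noisy)                    # perturbed LSVI solution
    return theta
```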
Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark
Roger, Alexis, Humane, Prateek, Kaplan, Daniel Z., Gupta, Kshitij, Sun, Qi, Adamopoulos, George, Lim, Jonathan Siu Chi, Anthony, Quentin, Fennell, Edwin, Rish, Irina
The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

Significant advances have recently been made in Vision-Language Models (VLMs), driven by breakthroughs in computer vision and natural language processing Chen et al. (2022); Li et al. (2023b); Liu et al. (2023b); Sun et al. (2023). However, existing VLM benchmarks, often designed for specific tasks (e.g., VQAv2 Goyal et al. (2017)), struggle to accurately reflect real-world VLM performance and capture nuanced differences between models Hsieh et al. (2024). This is particularly evident when evaluating models with significant architectural variations, where standard benchmark scores remain similar despite noticeable differences in human-perceived model quality. To address this issue, we introduce CHIRP, a hybrid VLM benchmark that combines automated metrics' scalability with human evaluators' nuanced judgment. We argue that this approach is crucial for capturing the complexities of VLM behavior, which traditional benchmarks often fail to represent. To demonstrate the limitations of existing benchmarks and the efficacy of our proposed method, we introduce Robin, a suite of VLMs trained at various scales, inspired by the Pythia language model suite Biderman et al. (2023). By systematically varying the Vision Encoder (VE) and the Large Language Model (LLM) sizes, we show that while benchmark scores remain largely unaffected, human evaluations reveal significant differences in the quality of the models' outputs. Our findings underscore the need for more robust and human-centric VLM evaluation methodologies. CHIRP paves the way for developing more reliable and informative VLM benchmarks, ultimately leading to the creation of more effective and impactful VLMs.

Our Contributions: We investigate the drawbacks of relying on automatic metrics and show the benefits of AI-based and human-based evaluations of VLMs. We train and release an open-source collection of VLMs named Robin, a scaling suite based on LLMs and VEs of different sizes.
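As one concrete way to aggregate human- or AI-judged pairwise comparisons into per-model scores, the sketch below uses an Elo-style update; the rating scale, K-factor, and model names are illustrative assumptions, not necessarily the benchmark's exact scoring rule.

```python
# Sketch: aggregate pairwise preferences (human or AI judge) into Elo-style
# model ratings; the K-factor, base rating, and model names are illustrative.
def update_elo(ratings, winner, loser, k=32.0):
    ra, rb = ratings[winner], ratings[loser]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # expected win probability
    ratings[winner] = ra + k * (1.0 - expected_a)
    ratings[loser] = rb - k * (1.0 - expected_a)

ratings = {"model-small": 1000.0, "model-large": 1000.0}   # hypothetical entries
for winner, loser in [("model-large", "model-small")]:     # judged comparisons
    update_elo(ratings, winner, loser)
```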
$\text{Transformer}^2$: Self-adaptive LLMs
Sun, Qi, Cetin, Edoardo, Tang, Yujin
This dynamic adjustment has parallels to concepts like fast-weight memories, which enable networks to update weights in response to task demands (Schmidhuber, 1992; Gomez & Schmidhuber, 2005), and neural network weights being treated as dynamic programs (Schmidhuber, 2015). Recently, Panigrahi et al. (2023) introduced an approach where a smaller auxiliary transformer is updated dynamically within a larger model, aligning with the principles of self-adaptive behavior. This adaptation can be explored from two perspectives: a macroview, where multiple LLMs collaborate and/or compete, and a microview, where internal adaptations allow a single LLM to specialize in different tasks. Macroview: From this perspective, the system directs queries to LLMs with domain-specific expertise, prioritizing outputs from expert models and thereby achieving higher accuracy and task-specific optimization. Such task-specific ensembles can be realized through various mechanisms: multiple LLMs playing distinct roles and coordinating toward a shared goal (Zhuge et al., 2023), engaging in mutual listening and debate (Du et al., 2023), or using meticulously crafted prompt constructions (Zhang et al., 2024) to integrate knowledge libraries and skill planning.
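The macroview described above can be illustrated with a simple dispatcher that routes each query to a domain-expert model; the classifier, expert registry, and call signatures are hypothetical placeholders, and Transformer^2 itself focuses on the microview of adapting a single model's weights.

```python
# Sketch of the "macroview": route each query to a domain-expert model.
# `classify_domain`, `experts`, and `fallback` are assumed user-supplied
# callables; this is not Transformer^2's own mechanism.
from typing import Callable, Dict

def route_query(query: str,
                classify_domain: Callable[[str], str],
                experts: Dict[str, Callable[[str], str]],
                fallback: Callable[[str], str]) -> str:
    domain = classify_domain(query)          # e.g., "math", "code", "general"
    expert = experts.get(domain, fallback)   # prefer the matching expert model
    return expert(query)                     # prioritize the expert's output
```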
FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality
Liu, Wenxuan, Duinkharjav, Monde, Sun, Qi, Zhang, Sai Qian
Leveraging real-time eye-tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach uses eye-tracking techniques to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region, the small area of the retina where visual acuity is highest, while the peripheral view is rendered at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking errors, which can degrade user experience and reduce the benefits of foveated rendering by causing misalignment and decreased visual quality. This paper introduces \textit{FovealNet}, an advanced AI-driven gaze tracking framework designed to optimize system performance by strategically enhancing gaze tracking accuracy. To further reduce the implementation cost of the gaze tracking algorithm, FovealNet employs an event-based cropping method that eliminates over $64.8\%$ of irrelevant pixels from the input image. Additionally, it incorporates a simple yet effective token-pruning strategy that dynamically removes tokens on the fly without compromising tracking accuracy. Finally, to support different runtime rendering configurations, we propose a system performance-aware multi-resolution training strategy, allowing the gaze tracking DNN to adapt and optimize overall system performance more effectively. Evaluation results demonstrate that FovealNet achieves at least a $1.42\times$ speedup compared to previous methods and a 13\% increase in perceptual quality for foveated output.
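To illustrate the token-pruning idea, the sketch below keeps only the top-scoring tokens in a ViT-style tracker; the saliency score (mean attention received) and keep ratio are assumptions for illustration, not FovealNet's exact strategy.

```python
# Minimal sketch of on-the-fly token pruning: keep the top-k tokens ranked by
# how much attention they receive. Scoring rule and keep ratio are assumptions.
import torch

def prune_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """
    tokens: [B, N, D] token embeddings
    attn:   [B, heads, N, N] attention weights from the previous block
    """
    scores = attn.mean(dim=1).mean(dim=1)                 # [B, N] attention received per token
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                   # indices of tokens to keep
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)                   # [B, k, D] pruned token set
```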
BudgetFusion: Perceptually-Guided Adaptive Diffusion Models
Li, Qinchan, Chen, Kenneth, Su, Changyue, Sun, Qi
Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
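A minimal sketch of the per-prompt step-budgeting idea follows: a small regressor maps a prompt embedding to a step count that is then passed to the sampler; the architecture, step range, and embedding source are assumptions, not BudgetFusion's actual predictor.

```python
# Sketch: predict a perceptually sufficient number of denoising steps per
# prompt before sampling. The MLP, step range, and text-embedding source
# are illustrative assumptions.
import torch
import torch.nn as nn

class StepPredictor(nn.Module):
    def __init__(self, embed_dim=768, min_steps=10, max_steps=50):
        super().__init__()
        self.min_steps, self.max_steps = min_steps, max_steps
        self.mlp = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, prompt_embedding: torch.Tensor) -> int:
        frac = self.mlp(prompt_embedding).item()          # fraction of the step budget
        return self.min_steps + round(frac * (self.max_steps - self.min_steps))

# Hypothetical usage: steps = StepPredictor()(text_encoder(prompt)), then run
# the diffusion sampler with that number of inference steps.
```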
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
Sun, Qi, Hong, Pengfei, Pala, Tej Deep, Toh, Vernon, Tan, U-Xuan, Ghosal, Deepanway, Poria, Soujanya
Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, Emma-X. Emma-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which helps mitigate hallucination when generating grounded subtask reasoning. Experimental results demonstrate that Emma-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.
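The trajectory segmentation strategy can be illustrated as splitting a manipulation trajectory at gripper open/close transitions and near-stationary frames; the velocity threshold and array layout below are assumptions, not Emma-X's exact rule.

```python
# Sketch of segmenting a manipulation trajectory at gripper state changes and
# near-zero motion; threshold and data layout are illustrative assumptions.
import numpy as np

def segment_trajectory(eef_pos: np.ndarray, gripper: np.ndarray, vel_eps: float = 1e-3):
    """
    eef_pos: [T, 3] end-effector positions; gripper: [T] binary open/close states.
    Returns a list of (start, end) index pairs, one per sub-trajectory.
    """
    speed = np.linalg.norm(np.diff(eef_pos, axis=0), axis=1)        # per-step motion
    gripper_flip = np.flatnonzero(np.diff(gripper) != 0) + 1        # open/close transitions
    pause_runs = np.flatnonzero(speed < vel_eps) + 1                # near-stationary frames
    pauses = (pause_runs[np.insert(np.diff(pause_runs) > 1, 0, True)]
              if len(pause_runs) else pause_runs)                   # first frame of each pause
    cuts = sorted(set(gripper_flip.tolist()) | set(pauses.tolist()))
    bounds = [0] + [c for c in cuts if 0 < c < len(gripper)] + [len(gripper)]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
```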
An Evolved Universal Transformer Memory
Cetin, Edoardo, Sun, Qi, Zhao, Tianyu, Tang, Yujin
Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts down to a fraction of their original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.
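To sketch how a memory model conditioned on attention values might operate, the code below scores each KV-cache entry from simple attention statistics and keeps the highest-scoring ones; the feature choice and the gradient-trainable scorer are assumptions, whereas the paper evolves its memory model rather than training it by backpropagation.

```python
# Sketch: a small scorer reads per-token attention statistics and decides
# which KV-cache entries to retain. Features and scorer are illustrative.
import torch
import torch.nn as nn

class MemoryScorer(nn.Module):
    def __init__(self, n_feats: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feats, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        """attn: [heads, Q, K] attention weights; returns one score per key token."""
        received = attn.mean(dim=0)                                             # [Q, K]
        feats = torch.stack([received.mean(0), received.max(0).values], dim=-1) # [K, 2]
        return self.net(feats).squeeze(-1)                                      # [K]

def evict(kv_len: int, scores: torch.Tensor, keep: int):
    keep_idx = scores.topk(min(keep, kv_len)).indices.sort().values
    return keep_idx  # indices of KV entries to retain, in original order
```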
Detect an Object At Once without Fine-tuning
Hao, Junyu, Liu, Jianheng, Zhao, Yongjia, Chen, Zuofan, Sun, Qi, Chen, Jinlong, Wei, Jianguo, Yang, Minghao
When presented with one or a few photos of a previously unseen object, humans can instantly recognize it in different scenes. Although the human brain mechanism behind this phenomenon is still not fully understood, this work introduces a novel technical realization of this task. It consists of two phases: (1) generating a Similarity Density Map (SDM) by convolving the scene image with the given object image patch(es), so that the highlighted areas in the SDM indicate possible object locations; (2) obtaining the object-occupied areas in the scene through a Region Alignment Network (RAN). The RAN is built on a Deep Siamese Network (DSN) backbone and, unlike traditional DSNs, aims to obtain accurate object regions by regressing the location and area differences between the ground truths and the predictions indicated by the highlighted areas in the SDM. By pre-learning from labels annotated in traditional datasets, SDM-RAN can detect previously unknown objects without fine-tuning. Experiments were conducted on the MS COCO and PASCAL VOC datasets. The results indicate that the proposed method outperforms state-of-the-art methods on the same task.
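Phase (1) can be sketched as a normalized correlation between the scene and the object patch, implemented as a convolution; the exact normalization and any feature extraction used in the paper may differ from this minimal illustration.

```python
# Sketch of phase (1): build a Similarity Density Map by sliding the object
# patch over the scene with a cosine-similarity (normalized correlation)
# convolution. Normalization details are assumptions.
import torch
import torch.nn.functional as F

def similarity_density_map(scene: torch.Tensor, patch: torch.Tensor) -> torch.Tensor:
    """
    scene: [1, C, H, W] image (or feature map); patch: [1, C, h, w] object template.
    Returns a [1, 1, H-h+1, W-w+1] map with highlights at likely object locations.
    """
    kernel = F.normalize(patch.flatten(1), dim=1).view_as(patch)   # unit-norm template
    corr = F.conv2d(scene, kernel)                                 # raw correlation map
    local_norm = F.conv2d(scene.pow(2), torch.ones_like(patch)).clamp_min(1e-8).sqrt()
    return corr / local_norm                                       # cosine similarity per location
```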
Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models
Chia, Yew Ken, Sun, Qi, Bing, Lidong, Poria, Soujanya
Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at https://embodied-planning.github.io.
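The grounding-then-planning pipeline can be sketched at a high level as follows; every function name here is a hypothetical placeholder rather than the released code's API.

```python
# High-level sketch of the grounded-then-symbolic idea: ground the perceived
# environment state first, then let a symbolic planner produce or refine the
# action plan. All names (perceive_state, generate_goal, solve, generate_plan)
# are hypothetical placeholders.
def neuro_symbolic_plan(image, instruction, vlm, planner):
    state = vlm.perceive_state(image)              # e.g., objects and relations
    goal = vlm.generate_goal(instruction, state)   # goal grounded in the perceived state
    plan = planner.solve(init=state, goal=goal)    # symbolic planning engine
    if plan is None:                               # fall back to the model-generated plan
        plan = vlm.generate_plan(instruction, state)
    return plan
```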