sam
Segment Anything without Supervision
The Segment Anything Model (SAM) relies on labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM uses a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes: an unlabeled image is first divided into segments, and then, for all pixels within a segment, a bottom-up clustering method iteratively merges them into larger groups, thereby forming a hierarchy. These unsupervised multi-granular masks are then used to supervise model training.
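The bottom-up step can be illustrated with a minimal sketch of greedy agglomerative merging inside one segment. It assumes each pixel already has a feature vector (for example from a self-supervised backbone) and merges the pair of groups whose mean features have the highest cosine similarity; the function name, the merge rule and the `levels` parameter are illustrative assumptions, not UnSAM's exact procedure, and spatial adjacency is ignored for brevity.

```python
import numpy as np

def bottom_up_hierarchy(features, levels):
    """Greedy agglomerative merging of pixel features inside one segment.

    features: (N, D) float array, one feature vector per pixel in the segment.
    levels:   group counts at which to record a snapshot, e.g. [4, 2, 1].
    Returns {num_groups: list of pixel-index lists}, one entry per level reached.
    """
    groups = [[i] for i in range(len(features))]   # start: one group per pixel
    snapshots = {}
    wanted = set(levels)
    while True:
        if len(groups) in wanted:
            snapshots[len(groups)] = [list(g) for g in groups]
        if len(groups) == 1:
            break
        # Mean feature of each group, L2-normalised so the dot product is cosine similarity
        means = np.stack([features[g].mean(axis=0) for g in groups])
        means /= np.linalg.norm(means, axis=1, keepdims=True) + 1e-12
        sim = means @ means.T
        np.fill_diagonal(sim, -np.inf)               # forbid merging a group with itself
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = min(a, b), max(a, b)
        groups[a] = groups[a] + groups[b]            # merge the most similar pair
        del groups[b]
    return snapshots
```

Each recorded level is one granularity of the hierarchy, so coarser snapshots always contain unions of the finer groups, which is the property that makes the resulting masks multi-granular.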
Segment Any Change
Visual foundation models have achieved remarkable results in zero-shot image classification and segmentation, but zero-shot change detection remains an open problem. In this paper, we propose the Segment Any Change models (AnyChange), a new type of change detection model that supports zero-shot prediction and generalizes to unseen change types and data distributions. AnyChange is built on the Segment Anything Model (SAM) via our training-free adaptation method, bitemporal latent matching. By revealing and exploiting intra-image and inter-image semantic similarities in SAM's latent space, bitemporal latent matching endows SAM with zero-shot change detection capabilities without any training. We also propose a point query mechanism that enables AnyChange's zero-shot object-centric change detection. Extensive experiments confirm the effectiveness of AnyChange for zero-shot change detection. AnyChange sets a new record on the SECOND benchmark for unsupervised change detection, exceeding the previous SOTA by up to 4.4% F1 score, and achieves comparable accuracy for supervised change detection with negligible manual annotations (1 pixel per image).
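The inter-image part of bitemporal latent matching can be pictured with a small sketch: pool the embeddings of both dates inside the same mask and treat low similarity as evidence of change. This assumes co-registered images, embedding maps given as plain arrays (no SAM encoder is called), cosine distance as the change score, and a hypothetical `threshold`; it does not cover the intra-image similarity or the point query mechanism described above.

```python
import numpy as np

def mask_embedding(emb, mask):
    """Average an (H, W, C) embedding map over a boolean (H, W) mask."""
    return emb[mask].mean(axis=0)

def change_scores(emb_t1, emb_t2, masks):
    """Score each mask by how dissimilar its pooled embeddings are across time.

    emb_t1, emb_t2: (H, W, C) embedding maps of the two dates (assumed aligned).
    masks: list of boolean (H, W) masks, e.g. proposals from one of the images.
    Returns one score per mask in [0, 2]; higher means more likely changed.
    """
    scores = []
    for m in masks:
        a = mask_embedding(emb_t1, m)
        b = mask_embedding(emb_t2, m)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        scores.append(1.0 - cos)   # cosine distance used as the change score
    return np.array(scores)

def detect_changes(emb_t1, emb_t2, masks, threshold=0.5):
    # Hypothetical threshold: keep masks whose cosine distance exceeds it
    s = change_scores(emb_t1, emb_t2, masks)
    return [m for m, score in zip(masks, s) if score > threshold]
```

Because only mask pooling and a similarity comparison are involved, nothing in this sketch is trained, which mirrors the training-free character claimed for the method.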
Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization
Can we modify the training data distribution to steer the underlying optimization method toward solutions that generalize better on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we rigorously prove that SAM learns different features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias than GD. We also show that examples containing features that are learned early are separable from the rest based on the model's output.
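The last claim, that early-learned examples can be separated using the model's output, can be made concrete with a minimal sketch. The abstract names neither the output nor the separation rule, so this assumes each example's predicted probability of its true label, measured early in training, and splits it with a two-cluster 1-D k-means; `split_by_output` is a hypothetical helper.

```python
import numpy as np

def split_by_output(outputs, iters=50):
    """Split examples into two groups with 1-D k-means on a model output.

    outputs: (N,) array, e.g. each example's predicted probability of its true
    label measured early in training (an assumed choice of output).
    Returns a boolean array: True for the cluster with the higher centroid,
    i.e. the examples the model already fits well.
    """
    outputs = np.asarray(outputs, dtype=float)
    c_lo, c_hi = outputs.min(), outputs.max()      # initial centroids
    high = np.zeros(outputs.shape, dtype=bool)
    for _ in range(iters):
        high = np.abs(outputs - c_hi) < np.abs(outputs - c_lo)
        if high.all() or not high.any():
            break                                  # degenerate split, stop
        new_lo, new_hi = outputs[~high].mean(), outputs[high].mean()
        if new_lo == c_lo and new_hi == c_hi:
            break                                  # centroids converged
        c_lo, c_hi = new_lo, new_hi
    return high
```

Any such split only identifies the two groups; how the training distribution is then changed is a separate design decision not shown here.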
Sam's Club is adding AI to the shopping experience. Why are privacy advocacy groups worried?
Sam's Club is going register-free and introducing an all-digital, AI-powered shopping experience for its customers, a move that has privacy advocates worried that the new AI tool could be used to unfairly target some customers with higher-priced items based on their shopping habits. The all-digital approach began with the rebuilding of a Sam's Club in Grapevine, a Dallas suburb, after a tornado severely damaged the store in 2022. When the location opened two years later, it was the first of its kind to ditch its registers for a "Scan and Go" program that allowed customers to scan each item placed in their physical cart and pay through a mobile app. This program has since been piloted in nine Dallas metro locations and one store in Missouri, Retail Dive reported. Instead of handing a receipt to a Sam's Club employee to review before leaving the store, customers walk through an arch equipped with AI-powered cameras that capture images of the items in the cart and electronically match them with the items paid for through the app. Sam's Club did not disclose when the AI technology would come to California stores, but the chain has outlets in Torrance, Fountain Valley, El Monte and Riverside.
Attention-Guided Integration of CLIP and SAM for Precise Object Masking in Robotic Manipulation
Muttaqien, Muhammad A., Motoda, Tomohiro, Hanai, Ryo, Domae, Yukiyasu
Muhammad A. Muttaqien, Tomohiro Motoda, Ryo Hanai, Domae Yukiyasu
Automation Research Team, National Institute of AIST, Tokyo, Japan
muha.muttaqien@aist.go.jp, tomohiro.motoda@aist.go.jp, ryo.hanai@aist.go.jp, domae.yukiyasu@aist.go.jp
Abstract: This paper introduces a novel pipeline to enhance the precision of object masking for robotic manipulation within the specific domain of masking products in convenience stores. The approach integrates two advanced AI models, CLIP and SAM, focusing on their synergistic combination and the effective use of multimodal data (image and text). Emphasis is placed on using gradient-based attention mechanisms and customized datasets to fine-tune performance. While CLIP, SAM, and Grad-CAM are established components, their integration within this structured pipeline represents a significant contribution to the field. The resulting segmentation masks can be used as inputs for robotic systems, enabling more precise and adaptive manipulation of convenience store products.
Introduction
In recent years, the ability to recognize and manipulate specific objects within well-defined domains, such as products in convenience stores, has become increasingly important in the field of robotic manipulation [1] [2] [3]. As robots are expected to perform more complex tasks in diverse environments, the need for precise object identification and interaction grows, particularly in domains where a high level of accuracy is crucial.
For instance, in convenience stores (Figure 1), robots must reliably identify and handle a wide variety of products, each with unique visual characteristics, to automate tasks such as stocking, sorting, and customer assistance.
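The gradient-based attention component can be sketched with the standard Grad-CAM formula, where the heatmap is ReLU(sum over k of alpha_k * A^k) and alpha_k is the spatial mean of the gradient of a target score with respect to feature map A^k. This sketch assumes the target score is a CLIP image-text similarity and that the heatmap peak is turned into a point prompt for SAM; `point_prompt` is a hypothetical helper, and neither CLIP nor SAM is called, so plain arrays stand in for their activations and gradients.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one convolutional layer.

    activations, gradients: (K, H, W) arrays, the layer's feature maps and the
    gradient of a target score (assumed here: a CLIP image-text similarity)
    with respect to them.
    Returns an (H, W) map normalised to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))          # alpha_k: spatially pooled gradients
    cam = (weights[:, None, None] * activations).sum(axis=0)
    cam = np.maximum(cam, 0.0)                     # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def point_prompt(cam, image_hw):
    """Hypothetical helper: map the heatmap peak to an (x, y) point in image pixels."""
    h, w = cam.shape
    row, col = np.unravel_index(np.argmax(cam), cam.shape)
    H, W = image_hw
    # Centre of the peak cell, rescaled from heatmap resolution to image resolution
    x = (col + 0.5) * W / w
    y = (row + 0.5) * H / h
    return x, y
```

A single peak point is the simplest choice; a pipeline could equally pass several high-scoring points or a box around the thresholded heatmap as the prompt.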
Monge SAM: Robust Reparameterization-Invariant Sharpness-Aware Minimization Based on Loss Geometry
Jacobsen, Albert Kjøller, Arvanitidis, Georgios
Recent studies on deep neural networks show that flat minima of the loss landscape correlate with improved generalization. Sharpness-aware minimization (SAM) efficiently finds flat regions by updating the parameters according to the gradient at an adversarial perturbation. The perturbation depends on the Euclidean metric, making SAM non-invariant under reparametrizations, which blurs the connection between sharpness and generalization. We propose Monge SAM (M-SAM), a reparametrization-invariant version of SAM that uses a Riemannian metric on the parameter space induced naturally by the loss surface. Compared to previous approaches, M-SAM works under any modeling choice and relies only on mild assumptions, while being as computationally efficient as SAM. We theoretically argue that M-SAM varies between SAM and gradient descent (GD), which increases robustness to hyperparameter selection and reduces attraction to suboptimal equilibria such as saddle points. We demonstrate this behavior both theoretically and empirically on a multi-modal representation alignment task.
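The contrast between the Euclidean and the loss-induced perturbation can be sketched on a toy quadratic loss. SAM uses eps = rho * g / ||g||. For the Monge variant, this sketch assumes the metric is G = I + g g^T (the metric the graph of the loss inherits from the ambient Euclidean space) and that the Euclidean ball is replaced by eps^T G eps <= rho^2, whose first-order maximizer is eps = rho * G^{-1} g / sqrt(g^T G^{-1} g). These are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def sam_perturbation(g, rho):
    """SAM: eps = rho * g / ||g||, the first-order worst case in a Euclidean ball."""
    n = np.linalg.norm(g)
    return rho * g / n if n > 0 else np.zeros_like(g)

def monge_perturbation(g, rho):
    """Assumed M-SAM-style perturbation under the metric G = I + g g^T.

    Maximises the first-order loss increase g @ eps subject to
    eps^T G eps <= rho^2, giving eps = rho * G^{-1} g / sqrt(g^T G^{-1} g).
    """
    G = np.eye(g.size) + np.outer(g, g)
    v = np.linalg.solve(G, g)                      # G^{-1} g
    denom = np.sqrt(g @ v)
    return rho * v / denom if denom > 0 else np.zeros_like(g)

def sharpness_aware_step(w, grad_fn, lr, rho, perturb):
    """One SAM-style update: descend with the gradient evaluated at w + eps."""
    eps = perturb(grad_fn(w), rho)
    return w - lr * grad_fn(w + eps)

# Toy quadratic loss L(w) = 0.5 * w^T A w, whose gradient is A w
A = np.diag([1.0, 10.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad_fn(w):
    return A @ w
```

Under this metric the perturbation norm works out to rho / sqrt(1 + ||g||^2): for small gradients it matches SAM's rho, and for large gradients it shrinks toward zero, so the update approaches plain GD. This is one way to see the "varies between SAM and GD" behavior described above.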
Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows, which limits their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches while training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs 1,000× faster and with 3,000× less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring 100,000s of time steps and memories. We also show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.
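The core idea of a sparse read can be sketched as a content-based read that assigns nonzero weight to only the k most similar memory rows, so the output and its gradient touch just those k rows. This is a minimal sketch: `sparse_read` and `k` are illustrative names, exact top-k via `np.argpartition` is used, and the sketch still scores every row (O(N)) for clarity, whereas reaching the sublinear cost described above would require something like an approximate nearest-neighbour index in place of the full scan.

```python
import numpy as np

def sparse_read(memory, query, k):
    """Content-based read that weights only the k most similar memory rows.

    memory: (N, D) array of memory words; query: (D,) read key; 1 <= k <= N.
    Returns (read vector of shape (D,), indices of the k rows used).
    """
    mem_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-12)
    q_norm = query / (np.linalg.norm(query) + 1e-12)
    sims = mem_norm @ q_norm                       # cosine similarity per row (full scan)
    idx = np.argpartition(-sims, k - 1)[:k]        # indices of the k largest, unordered
    w = np.exp(sims[idx] - sims[idx].max())
    w /= w.sum()                                   # softmax over the k selected rows only
    return w @ memory[idx], idx
```

Every other row receives exactly zero weight, so a backward pass through this read only produces gradients for the k selected memory words.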