
Mind the Gap Between Prototypes and Images in Cross-domain Finetuning

Neural Information Processing Systems

In cross-domain few-shot classification (CFC), recent works mainly focus on adapting a simple transformation head on top of a frozen pre-trained backbone with a few labeled samples to project embeddings into a task-specific metric space, where classification is performed by measuring similarities between image-instance and prototype representations. Technically, such a framework implicitly assumes that the prototype and image-instance embeddings share the same representation transformation. However, in this paper, we find that there naturally exists a gap, resembling the modality gap, between the prototype and image-instance embeddings extracted from the frozen pre-trained backbone, and that simply applying the same transformation during the adaptation phase constrains the exploration of optimal representations and shrinks the gap between prototype and image representations. To solve this problem, we propose a simple yet effective method, contrastive prototype-image adaptation (CoPA), which adapts different transformations for prototypes and images, similarly to CLIP, by treating prototypes as text prompts. Extensive experiments on Meta-Dataset demonstrate that CoPA achieves state-of-the-art performance more efficiently. Further analyses also indicate that CoPA learns better representation clusters, enlarges the gap, and achieves the minimal validation loss at the enlarged gap.
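As a rough illustration of the idea, the following PyTorch sketch adapts two separate linear heads, one for prototypes and one for image instances, and trains them with a CLIP-style contrastive cross-entropy. All names, dimensions, and the temperature value here are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoPAHeadSketch(nn.Module):
    """Hypothetical sketch of CoPA-style adaptation: *different* linear
    transformations for prototype and image-instance embeddings."""
    def __init__(self, dim=512, temperature=0.07):
        super().__init__()
        self.proto_head = nn.Linear(dim, dim)  # transformation for prototypes
        self.image_head = nn.Linear(dim, dim)  # separate transformation for images
        self.temperature = temperature

    def forward(self, image_emb, proto_emb):
        # Project with separate heads, then L2-normalize, as in CLIP.
        z_img = F.normalize(self.image_head(image_emb), dim=-1)  # (B, D)
        z_pro = F.normalize(self.proto_head(proto_emb), dim=-1)  # (C, D)
        return z_img @ z_pro.t() / self.temperature              # (B, C) logits

# Usage: adapt only the heads on a few-shot task, backbone stays frozen.
head = CoPAHeadSketch(dim=512)
image_emb = torch.randn(32, 512)   # frozen-backbone image embeddings
proto_emb = torch.randn(5, 512)    # class prototypes (e.g., means of supports)
labels = torch.randint(0, 5, (32,))
loss = F.cross_entropy(head(image_emb, proto_emb), labels)  # CLIP-style loss
loss.backward()
```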



RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer, Jian Wang, Qiman Wu

Neural Information Processing Systems

Recently, transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate the field, due to the time-consuming computation mechanism of transformers. We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, our RTFormer leverages GPU-Friendly Attention with linear complexity and discards the multi-head mechanism. Besides, we find that cross-resolution attention is more efficient at gathering global context information for the high-resolution branch by spreading the high-level knowledge learned from the low-resolution branch. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer: it achieves state-of-the-art performance on Cityscapes, CamVid and COCOStuff, and shows promising results on ADE20K.
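The abstract's core ingredient, a single-head attention with linear complexity in the number of pixels, can be approximated with learnable external keys and values. The sketch below is an assumption-laden illustration of that principle, not RTFormer's exact GPU-Friendly Attention (the paper's normalization and grouping details differ); all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class LinearAttentionSketch(nn.Module):
    """Sketch in the spirit of single-head, GPU-friendly attention:
    queries attend to a small set of M learnable external tokens,
    so cost is O(N*M) rather than O(N^2) in the pixel count N."""
    def __init__(self, dim=64, num_tokens=128):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Parameter(torch.randn(num_tokens, dim))  # external keys
        self.v = nn.Parameter(torch.randn(num_tokens, dim))  # external values
        self.scale = dim ** -0.5

    def forward(self, x):                             # x: (B, N, D), N = H*W
        attn = (self.q(x) @ self.k.t()) * self.scale  # (B, N, M), linear in N
        attn = attn.softmax(dim=-1)
        return attn @ self.v                          # (B, N, D)

x = torch.randn(2, 4096, 64)           # e.g., a flattened 64x64 feature map
y = LinearAttentionSketch()(x)
print(y.shape)                         # torch.Size([2, 4096, 64])
```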



Peek-a-boo, Big Tech sees you: Expert warns just 20 cloud images can make an AI deepfake video of your child

FOX News

Texas high school student Elliston Berry joins 'Fox & Friends' to discuss the House's passage of a new bill that criminalizes the sharing of non-consensual intimate images, including content created with artificial intelligence. Parents love capturing their kids' big moments, from first steps to birthday candles. But a new study out of the U.K. shows many of those treasured images may be scanned, analyzed and turned into data by cloud storage services, and nearly half of parents don't even realize it. A survey of 2,019 U.K. parents, conducted by Perspectus Global and commissioned by Swiss privacy tech company Proton, found that 48% of parents were unaware that providers like Google Photos, Apple iCloud, Amazon Photos and Dropbox can access and analyze the photos they upload. First lady Melania Trump, joined by President Donald Trump, delivers remarks before President Trump signed the Take It Down Act into law in the Rose Garden of the White House May 19, 2025, in Washington, D.C. (Chip Somodevilla/Getty Images) These companies use artificial intelligence to sort images into albums, recognize faces and locations and suggest memories.


Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Neural Information Processing Systems

Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. When computed, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable at high input resolutions, as it requires processing irrelevant information. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide especially relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, in this paper we propose boosting IR performance by sharing the key semantics via a Transformer for IR (i.e., SemanIR).
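To make the "share only key semantics" idea concrete, here is a hypothetical top-k sparse-attention sketch in which each query attends only to its k most similar keys. Note that this naive version still computes the full score matrix before masking, so it illustrates only the selection principle, not SemanIR's actual efficiency mechanism; the function name and parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_semantic_attention(q, k, v, topk=16):
    """Each query attends only to its top-k most semantically similar keys;
    all other keys are masked out before the softmax (illustrative only)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, N, N)
    kth = scores.topk(topk, dim=-1).values[..., -1:]       # k-th largest score
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                   # (B, N, D)

q = k = v = torch.randn(1, 256, 32)   # 256 tokens with 32-dim features
out = topk_semantic_attention(q, k, v, topk=8)
print(out.shape)                      # torch.Size([1, 256, 32])
```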


Learning to Orient Surfaces by Self-supervised Spherical CNNs, Federico Stella, Luciano Silva

Neural Information Processing Systems

Defining and reliably finding a canonical orientation for 3D surfaces is key to many Computer Vision and Robotics applications. This task is commonly addressed by handcrafted algorithms that exploit geometric cues deemed distinctive and robust by the designer. Yet, one might conjecture that humans learn the notion of the inherent orientation of 3D objects from experience, and that machines may do so as well. In this work, we show the feasibility of learning a robust canonical orientation for surfaces represented as point clouds. Based on the observation that the quintessential property of a canonical orientation is equivariance to 3D rotations, we propose to employ Spherical CNNs, recently introduced machinery that can learn equivariant representations defined on the Special Orthogonal group SO(3). Specifically, spherical correlations compute feature maps whose elements define 3D rotations. Our method learns such feature maps from raw data by a self-supervised training procedure and robustly selects a rotation that transforms the input point cloud into a learned canonical orientation. Thereby, we realize the first end-to-end learning approach to define and extract the canonical orientation of 3D shapes, which we aptly dub Compass. Experiments on several public datasets prove its effectiveness at orienting local surface patches as well as whole objects.
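A minimal sketch of the selection step described above: given scores over a discrete set of candidate rotations (standing in for the SO(3) feature map produced by spherical correlations), pick the argmax rotation and apply it to the point cloud. The scorer and the rotation sampling below are hypothetical placeholders, not the paper's spherical-CNN machinery.

```python
import torch

def canonicalize(points, candidate_rots, score_fn):
    """Score each candidate 3D rotation of the point cloud and apply the
    argmax one, yielding a (learned) canonical orientation. `score_fn`
    stands in for a learned scorer over the SO(3) feature map."""
    scores = torch.stack([score_fn(points @ R.t()) for R in candidate_rots])
    best = candidate_rots[scores.argmax()]
    return points @ best.t(), best

def random_rotations(n):
    # Sample rotation matrices via QR decomposition, forcing det = +1.
    qs, _ = torch.linalg.qr(torch.randn(n, 3, 3))
    return qs * torch.det(qs).sign().view(-1, 1, 1)

pts = torch.randn(1024, 3)             # a point cloud with 1024 points
rots = random_rotations(64)            # discretized rotation candidates
canon_pts, R = canonicalize(pts, rots, score_fn=lambda p: p[:, 2].mean())
```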


Jan P. Bauer

Neural Information Processing Systems

Andrew M. Saxe, Christopher Summerfield, Ali Hummos (Exp. Psychology, Oxford; ELSC, HebrewU; Department of Computing, Imperial College London; Brain Mind Institute, EPFL; Gatsby Unit, UCL)



Urgent warning to Americans over 'dangerous' technology quietly rolled out in 80 airports

Daily Mail - Science & tech

Within seconds, you've been scanned, stored, and tracked, before even reaching airport security. Without ever handing over your ID, the Transportation Security Administration (TSA) already knows exactly who you are. This is happening at 84 airports across the US. And chances are, you didn't even notice. Marketed as a tool to enhance security, TSA's facial recognition system is drawing criticism for its potential to track Americans from the terminal entrance to their final destination.