

NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations

Neural Information Processing Systems

Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where Structure-from-Motion (SfM) techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose 'NAVI': a new dataset of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract accurate derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation.
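The abstract's key mechanism is that a high-quality scan plus near-perfect camera parameters suffices to derive the other annotations. Below is a minimal sketch of that idea: projecting scan vertices through a pinhole camera to obtain a sparse depth map and foreground mask. The function name, the pinhole model, and the z-buffer splatting are illustrative assumptions, not NAVI's released tooling.

```python
import numpy as np

def derive_annotations(vertices, R, t, K, height, width):
    """Project 3D scan vertices into an image with known pose (R, t) and
    intrinsics K, yielding a sparse depth map and a foreground mask."""
    cam = vertices @ R.T + t                   # world -> camera frame
    z = cam[:, 2]
    front = z > 1e-6                           # keep points in front of camera
    pix = (cam[front] / z[front, None]) @ K.T  # pinhole projection -> (u, v, 1)
    u = np.round(pix[:, 0]).astype(int)
    v = np.round(pix[:, 1]).astype(int)
    ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    # z-buffer splat: keep the nearest surface point per pixel
    np.minimum.at(depth, (v[ok], u[ok]), z[front][ok])
    mask = np.isfinite(depth)                  # object segmentation (sparse)
    depth[~mask] = 0.0
    return depth, mask
```

Dense 2D-2D correspondences follow the same recipe: pixels in two images that receive projections of the same scan vertex are matched.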



EV-Eye: Rethinking High-frequency Eye Tracking through the Lenses of Event Cameras, Yurun Yang

Neural Information Processing Systems

In this paper, we present EV-Eye, a first-of-its-kind large-scale multimodal eye tracking dataset aimed at inspiring research on high-frequency eye/gaze tracking. EV-Eye utilizes the emerging bio-inspired event camera to capture independent pixel-level intensity changes induced by eye movements, achieving submicrosecond latency. Our dataset was curated over two weeks and collected from 48 participants encompassing diverse genders and age groups. It comprises over 1.5 million near-eye grayscale images and 2.7 billion event samples generated by two DAVIS346 event cameras. Additionally, the dataset contains 675 thousand scene images and 2.7 million gaze references captured by a Tobii Pro Glasses 3 eye tracker for cross-modality validation. Compared with existing event-based high-frequency eye tracking datasets, our dataset is significantly larger in size, and the gaze references involve more natural and diverse eye movement patterns, i.e., fixation, saccade, and smooth pursuit. Alongside the event data, we also present a hybrid eye tracking method as a benchmark, which leverages both the near-eye grayscale images and event data for robust and high-frequency eye tracking. We show that our method achieves higher accuracy for both pupil and gaze estimation tasks compared to the existing solution.
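As a hedged illustration of what such a hybrid frame-plus-event tracker might look like, the sketch below anchors the pupil estimate on low-rate grayscale detections and nudges it with high-rate events in between. The gating radius, smoothing gain, and function names are assumptions for illustration; the paper's actual benchmark method may differ substantially.

```python
import numpy as np

def track_pupil(frame_times, frame_pupils, events):
    """frame_pupils: (N, 2) pupil centers from a frame-based detector.
    events: (M, 4) array of (t, x, y, polarity) rows, time-sorted.
    Returns high-rate pupil estimates as (t, x, y) rows."""
    estimates = []
    pupil = frame_pupils[0].astype(float)
    fi = 0
    for t, x, y, _pol in events:
        # Re-anchor on the latest grayscale frame once it is available.
        while fi + 1 < len(frame_times) and frame_times[fi + 1] <= t:
            fi += 1
            pupil = frame_pupils[fi].astype(float)
        # Events near the current estimate pull it toward the moving edge.
        offset = np.array([x, y]) - pupil
        if np.linalg.norm(offset) < 30.0:   # assumed pupil gate radius (px)
            pupil += 0.05 * offset          # assumed smoothing gain
        estimates.append((t, pupil[0], pupil[1]))
    return np.array(estimates)
```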


Meta-Adapter: An Online Few-shot Learner for Vision-Language Model, Lin Song, Hang Wang

Neural Information Processing Systems

The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of over-fitting in certain domains. To tackle these challenges, we propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.
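One plausible reading of "a lightweight residual-style adapter that refines CLIP features guided by few-shot samples online" is a small cross-attention module with a gated residual connection, sketched below in PyTorch. The class name, feature sizes, and single-attention design are assumptions, not the released Meta-Adapter.

```python
import torch
import torch.nn as nn

class MetaAdapterSketch(nn.Module):
    """Residual-style adapter: class (text) features attend to few-shot
    support features and are refined by a gated residual update."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.tensor(0.1))  # learned residual scale

    def forward(self, class_feats, support_feats):
        # class_feats: (C, D) CLIP text embeddings, one per category
        # support_feats: (C, K, D) CLIP image embeddings of K shots per class
        q = class_feats.unsqueeze(1)                        # (C, 1, D)
        refined, _ = self.attn(q, support_feats, support_feats)
        out = class_feats + self.gate * refined.squeeze(1)  # residual update
        return out / out.norm(dim=-1, keepdim=True)         # CLIP-style norm

# Online use: refine the classifier from the support set, no fine-tuning:
# logits = image_feats @ adapter(text_feats, support_feats).T
```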




Disentangling Voice and Content with Self-Supervision for Speaker Recognition, Kong Aik Lee

Neural Information Processing Systems

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without using any labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since it requires neither additional model training nor extra data, it is readily applicable in practice.
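One way to read "a Gaussian inference layer with a learnable transition model" is as a Kalman-style filter with learned dynamics and noise, applied per speech component. The sketch below (diagonal-covariance, one layer of the three) is an interpretation of the abstract, not the authors' model.

```python
import torch
import torch.nn as nn

class GaussianInferenceLayer(nn.Module):
    """Kalman-style filter with a learnable linear transition; a
    'strengthened' transition could swap the Linear for an MLP."""
    def __init__(self, dim):
        super().__init__()
        self.transition = nn.Linear(dim, dim, bias=False)  # learnable dynamics
        self.log_q = nn.Parameter(torch.zeros(dim))  # process noise (log var)
        self.log_r = nn.Parameter(torch.zeros(dim))  # observation noise

    def forward(self, obs):
        # obs: (T, D) frame-level features; returns the filtered means,
        # i.e. one extracted speech component per frame.
        mean = torch.zeros(obs.shape[-1])
        var = torch.ones(obs.shape[-1])
        out = []
        for x in obs:
            mean = self.transition(mean)          # predict next state
            var = var + self.log_q.exp()          # diagonal approximation
            k = var / (var + self.log_r.exp())    # Kalman gain (per dim)
            mean = mean + k * (x - mean)          # correct with observation
            var = (1 - k) * var
            out.append(mean)
        return torch.stack(out)
```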


Enhancing User Intent Capture in Session-Based Recommendation with Attribute Patterns, Zheng Li, Yifan Gao, Jingfeng Yang

Neural Information Processing Systems

The goal of session-based recommendation in E-commerce is to predict the next item that an anonymous user will purchase based on the browsing and purchase history. However, constructing global or local transition graphs to supplement session data can lead to noisy correlations and vanishing user intent. In this work, we propose the Frequent Attribute Pattern Augmented Transformer (FAPAT) that characterizes user intents by building attribute transition graphs and matching attribute patterns. Specifically, frequent and compact attribute patterns serve as memory to augment session representations, followed by a gate and a transformer block to fuse the whole session information. Through extensive experiments on two public benchmarks and 100 million industrial data samples across three domains, we demonstrate that FAPAT consistently outperforms state-of-the-art methods by an average of 4.5% across various evaluation metrics (Hits, NDCG, MRR). Beyond next-item prediction, we also evaluate the models' ability to capture user intents via attribute prediction and period-item recommendation.
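To make the "patterns as memory, then gate and transformer" pipeline concrete, here is a minimal PyTorch sketch: session item embeddings read from a memory of pre-mined attribute-pattern embeddings via attention, a sigmoid gate mixes the read-out back in, and a transformer layer fuses the session. All module choices and shapes are illustrative assumptions, not the FAPAT implementation.

```python
import torch
import torch.nn as nn

class PatternMemorySketch(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.fuse = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, session, patterns):
        # session: (B, T, D) item embeddings; patterns: (B, P, D) embeddings
        # of frequent attribute patterns (assumed mined offline)
        mem, _ = self.read(session, patterns, patterns)   # read from memory
        g = self.gate(torch.cat([session, mem], dim=-1))  # per-position gate
        mixed = g * mem + (1 - g) * session               # gated residual mix
        return self.fuse(mixed)                           # fuse whole session
```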


ADT and Yale partner on Z-Wave lock with fingerprint recognition

PCWorld

ADT offers Yale Assure locks with its ADT home security systems, and now the security service provider has partnered with Yale and the Z-Wave Alliance to introduce the Yale Assure Lock 2 Touch with Z-Wave. This is the first Z-Wave lock with fingerprint recognition that is certified to use the Z-Wave User Credential Command Class specification, which was released in June 2024. The new lock also features the latest-generation Z-Wave 800 chipset, which promises longer battery life and improved range on a Z-Wave mesh network. Thanks to its use of the Z-Wave User Credential Command Class spec, ADT subscribers will be able to arm and disarm their security system at the same time they lock or unlock the new deadbolt, simply by touching a previously enrolled finger to the lock. ADT offers the Yale Assure Lock 2 Touch with Z-Wave with its home security systems, which can be self- or professionally installed.


Google AI Mode rolls out to more testers with new image search feature

Engadget

Google is bringing AI Mode to more people in the US. The company announced on Monday that it would make the new search tool, first launched at the start of last month, available to millions more Labs users across the country. For the uninitiated, AI Mode is a new dedicated tab within Search. It allows you to ask more complicated questions of Google, with a custom version of Gemini 2.0 doing the legwork to deliver a nuanced AI-generated response. Labs, meanwhile, is a beta program you can enroll your Google account in to gain access to new Search features before the company rolls them out to the public.