Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Moummad, Ilyass, Zaher, Kawtar, Goëau, Hervé, Joly, Alexis
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
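The cross-view alignment principle described in the abstract can be sketched in NumPy: two augmented views are mapped by a small MLP hashing head (a stand-in for HashCoder), batch-normalized logits encourage balanced bits, and a symmetric binary cross-entropy pulls each view's bit probabilities toward the other view's hard codes. This is a minimal illustration, not the authors' implementation: the shapes, the random stand-in embeddings, and the omission of the coding-rate regularizer and LoRA fine-tuning are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_head(x, W1, W2):
    """Lightweight MLP hashing head (hypothetical shapes): one ReLU hidden
    layer, then batch normalization on the logits so each bit is roughly
    zero-mean across the batch, encouraging balanced codes."""
    h = np.maximum(x @ W1, 0.0)
    z = h @ W2
    return (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_view_bce(z_a, z_b):
    """Symmetric alignment loss: bit probabilities of one view are trained
    toward the hard (sign) codes of the other view, and vice versa."""
    p_a, p_b = sigmoid(z_a), sigmoid(z_b)
    t_a, t_b = (z_a > 0).astype(float), (z_b > 0).astype(float)
    eps = 1e-7
    bce = lambda p, t: -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)).mean()
    return 0.5 * (bce(p_a, t_b) + bce(p_b, t_a))

# Random stand-ins for two augmented views' frozen-encoder embeddings.
d, k, n = 64, 16, 128                     # embedding dim, code bits, batch size
W1 = rng.normal(size=(d, 32)) * 0.1
W2 = rng.normal(size=(32, k)) * 0.1
x_a, x_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))

loss = cross_view_bce(hash_head(x_a, W1, W2), hash_head(x_b, W1, W2))
codes = (hash_head(x_a, W1, W2) > 0).astype(np.uint8)  # binary codes for Hamming-distance retrieval
```

At retrieval time only the binary `codes` are kept, so search reduces to XOR-and-popcount Hamming distance rather than dense nearest-neighbor search in the embedding space.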
Soft Task-Aware Routing of Experts for Equivariant Representation Learning
Jeon, Jaebyeong, Jang, Hyeonseo, Sohn, Jy-yong, Lee, Kibok
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
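The routing idea can be sketched as follows: a learned gate produces per-sample softmax weights over a set of expert projection heads, and the final projection is the gated mixture of expert outputs. The linear-ReLU experts and the single-linear gate here are hypothetical simplifications; the paper's exact gating mechanism and expert architecture may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_route(x, experts, gate_W):
    """Soft routing over expert projection heads: a softmax gate mixes
    expert outputs per sample, letting experts specialize in shared
    versus task-specific information."""
    logits = x @ gate_W                                   # (n, num_experts)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                     # softmax gate weights
    # Each (hypothetical) expert is a linear-ReLU projection head.
    outs = np.stack([np.maximum(x @ We, 0.0) for We in experts], axis=1)  # (n, E, d_out)
    return (w[..., None] * outs).sum(axis=1)              # gated mixture, (n, d_out)

n, d, d_out, E = 8, 32, 16, 3
x = rng.normal(size=(n, d))
experts = [rng.normal(size=(d, d_out)) * 0.1 for _ in range(E)]
gate_W = rng.normal(size=(d, E)) * 0.1
z = soft_route(x, experts, gate_W)                        # routed projection
```

In a joint invariant/equivariant setup, the invariant and equivariant objectives would each be computed on such routed projections, with the gate free to assign some experts to information shared between the two tasks.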
The top-1 accuracy of JCL pre-trained features is 48.6%, which outperforms MoCo v2 (47.3%). Generalization of JCL to other data modalities (sound, language, video) will be included in our future work. Regarding your concerns about the writing quality and typos (e.g., Algorithm 1 ...): the top-1 accuracy on ImageNet100 for the vanilla ResNet-50 is 80.9%, while JCL achieves 82.0%. The top-5 accuracy we reported for SimCLR (87.3%) was extracted from the ... Thus, there is no one-to-one correspondence between the data in Table 1 and Figure 2. MS COCO is used for the object detection and instance segmentation tasks.