Inductive Learning
Speaker Diarization for Low-Resource Languages Through Wav2vec Fine-Tuning
Abdullah, Abdulhady Abas, Karim, Sarkhel H. Taher, Ahmed, Sara Azad, Tariq, Kanar R., Rashid, Tarik A.
Speaker diarization, a core problem in speech processing, entails partitioning a given audio stream according to the speakers. Although progress has been made in developing models for high-resource languages, applying the same process to low-resource languages such as Kurdish poses specific difficulties: very few annotated datasets are available, the language has several dialects, and speakers code-switch frequently. This study addresses these challenges by fine-tuning the Wav2Vec 2.0 self-supervised learning (SSL) model on a Kurdish dataset prepared for this purpose. Transfer learning made it possible to adapt multilingual representations learned from other languages to the phonetic and acoustic features of Kurdish speech. Compared to the baseline algorithm, the overall Diarization Error Rate (DER) was reduced by 7.2% and cluster purity increased by 13%. These results show that improving a state-of-the-art model can enhance performance for under-resourced languages. Applications of this work include transcription services for Kurdish-language media programs, as well as speaker segmentation in multilingual call centers, teleconferencing, and videoconferencing systems. This work therefore demonstrates that self-supervised and transfer learning techniques can improve speaker diarization for Kurdish and other low-resource languages with diverse features. The approach provides a basis for building effective diarization systems in other understudied languages, which remains essential for equity in speech technology.
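The two metrics reported in this abstract can be illustrated with a minimal Python sketch. The abstract does not say how DER or cluster purity were computed, so the definitions below are the commonly used ones and are an assumption here: DER as the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time, and purity as the fraction of frames whose cluster's majority speaker matches their reference speaker. The function names and the frame-level label format are hypothetical.

```python
from collections import Counter, defaultdict


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Common DER definition (assumed, not taken from the paper).

    All arguments are durations in the same unit (e.g. seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def cluster_purity(reference, hypothesis):
    """Frame-level cluster purity (assumed definition).

    reference:  reference speaker label per frame
    hypothesis: predicted cluster label per frame
    """
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Count reference speakers inside each predicted cluster.
    by_cluster = defaultdict(Counter)
    for ref, hyp in zip(reference, hypothesis):
        by_cluster[hyp][ref] += 1
    # Each cluster contributes the frame count of its majority speaker.
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(reference)
```

For example, `cluster_purity(["a", "a", "b", "b"], [0, 0, 0, 1])` credits two frames from cluster 0 and one from cluster 1, giving 3 of 4 frames.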
Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation
Kerdreux, Thomas, Tuel, Alexandre, Febvre, Quentin, Mouche, Alexis, Chapron, Bertrand
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO), demonstrating strong transferability across diverse remote sensing tasks. While prior work has focused on network architectures and training strategies, the role of dataset curation, especially in balancing and diversifying pre-training datasets, remains underexplored. In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery, which can lead to biased representations and inefficient training. In this work, we propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance. Our method iteratively refines the training set without requiring a pre-existing feature extractor, making it well-suited for domains where curated datasets are limited or unavailable. We demonstrate our approach on the Sentinel-1 Wave Mode (WV) Synthetic Aperture Radar (SAR) archive, a challenging dataset dominated by ocean observations. We train models from scratch on the entire Sentinel-1 WV archive spanning 10 years. Across three downstream tasks, our results show that dynamic pruning improves both computational efficiency and representation quality, leading to stronger transferability. We also release the weights of OceanSAR-1, the first model in the OceanSAR family, a series of foundation models for ocean observation and analysis using SAR imagery, at github.com/galeio-research/OceanSAR-models/.
SSLR: A Semi-Supervised Learning Method for Isolated Sign Language Recognition
Algafri, Hasan, Luqman, Hamzah, Alyami, Sarah, Laradji, Issam
Sign language is the primary communication language for people with disabling hearing loss. Sign language recognition (SLR) systems aim to recognize sign gestures and translate them into spoken language. One of the main challenges in SLR is the scarcity of annotated datasets. To address this issue, we propose a semi-supervised learning (SSL) approach for SLR (SSLR), employing a pseudo-label method to annotate unlabeled samples. The sign gestures are represented using pose information that encodes the signer's skeletal joint points. This information is used as input for the Transformer backbone model utilized in the proposed approach. To demonstrate the learning capabilities of SSL across various labeled data sizes, several experiments were conducted using different percentages of labeled data with varying numbers of classes. The performance of the SSL approach was compared with a fully supervised learning-based model on the WLASL-100 dataset. The obtained results of the SSL model outperformed the supervised learning-based model with less labeled data in many cases.
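The pseudo-label step mentioned in this abstract can be sketched roughly as follows. The abstract does not give the selection rule, so the confidence threshold, the function name, and the input format (one list of class probabilities per unlabeled sample) are assumptions made for illustration. This is a generic confidence-thresholded pseudo-labeling sketch, not the authors' exact procedure.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Pick confident predictions on unlabeled samples as pseudo-labels.

    probabilities: list of per-sample class-probability lists, as produced
                   by a model trained on the labeled subset (hypothetical).
    threshold:     minimum confidence to accept a pseudo-label (assumed).
    Returns a list of (sample_index, predicted_class) pairs.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        # Most likely class for this sample.
        label = max(range(len(probs)), key=probs.__getitem__)
        # Keep the sample only if the model is confident enough.
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected
```

In a typical loop, the selected samples would be added to the labeled set with their predicted classes and the model retrained, while low-confidence samples stay unlabeled for a later round.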
A Self-supervised Learning Method for Raman Spectroscopy based on Masked Autoencoders
Ren, Pengju, Zhou, Ri-gui, Li, Yaochong
Raman spectroscopy serves as a powerful and reliable tool for analyzing the chemical information of substances. The integration of Raman spectroscopy with deep learning methods enables rapid qualitative and quantitative analysis of materials. Most existing approaches adopt supervised learning methods. Although supervised learning has achieved satisfactory accuracy in spectral analysis, it is still constrained by costly and limited well-annotated spectral datasets for training. When spectral annotation is challenging or the amount of annotated data is insufficient, the performance of supervised learning in spectral material identification declines. In order to address the challenge of feature extraction from unannotated spectra, we propose a self-supervised learning paradigm for Raman Spectroscopy based on a Masked AutoEncoder, termed SMAE. SMAE does not require any spectral annotations during pre-training. By randomly masking and then reconstructing the spectral information, the model learns essential spectral features. The reconstructed spectra exhibit certain denoising properties, improving the signal-to-noise ratio (SNR) by more than twofold. Utilizing the network weights obtained from masked pre-training, SMAE achieves clustering accuracy of over 80% for 30 classes of isolated bacteria in a pathogenic bacterial dataset, demonstrating significant improvements compared to classical unsupervised methods and other state-of-the-art deep clustering methods. After fine-tuning the network with a limited amount of annotated data, SMAE achieves an identification accuracy of 83.90% on the test set, presenting competitive performance against the supervised ResNet (83.40%).
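The random-masking pretext task described in this abstract can be sketched for a one-dimensional spectrum as below. The patch size, mask ratio, zero-filling of masked patches, and the function names are all assumptions; the abstract only states that spectral information is randomly masked and then reconstructed. The loss is computed on masked positions only, which is the usual masked-autoencoder convention rather than a detail confirmed by the source.

```python
import numpy as np


def mask_spectrum(spectrum, patch_size=10, mask_ratio=0.5, rng=None):
    """Randomly mask whole patches of a 1D spectrum (illustrative sketch).

    Returns the masked spectrum and a boolean mask (True = masked point).
    Points beyond the last full patch are never masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.asarray(spectrum, dtype=float)
    n_patches = len(spectrum) // patch_size
    n_masked = int(round(mask_ratio * n_patches))
    # Choose which patches to hide, without repetition.
    masked_patches = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(len(spectrum), dtype=bool)
    for p in masked_patches:
        mask[p * patch_size:(p + 1) * patch_size] = True
    masked = spectrum.copy()
    masked[mask] = 0.0  # zero-fill is an assumption; a learned token is also common
    return masked, mask


def masked_mse(reconstruction, target, mask):
    """Mean squared error restricted to the masked positions."""
    diff = np.asarray(reconstruction, dtype=float)[mask] - np.asarray(target, dtype=float)[mask]
    return float(np.mean(diff ** 2))
```

During pre-training, an encoder-decoder network would receive `masked` as input and be trained to minimize `masked_mse` against the original spectrum; no annotations are involved at this stage.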
MAGIC: Near-Optimal Data Attribution for Deep Learning
Ilyas, Andrew, Engstrom, Logan
A fundamental problem when building machine learning systems is to predict counterfactuals about model behavior. For example, scaling laws [KMH+20; Has21; MRB+23] aim to predict the performance of systems trained with more data and more compute than is currently available; interpretability techniques [KWG+18] predict how models behave under counterfactual inputs. Analogously, in this work we study predictive data attribution (or datamodeling [IPE+22]), where the goal is to predict how a model would behave if it had been trained on a different dataset. This well-studied problem encompasses, e.g., estimating the effect (on the resulting trained model's predictions) of modifying a training example [KL17], removing a group of training examples [KAT+19; BNL+22; PGI+23], or adding entire training data sources [LSZ+24]. Predictive data attribution in large-scale settings is challenging: it requires simulating training a model on a different dataset without actually training [GWP+23; IGE+24]. In "classical" settings, when learning corresponds to minimizing a convex loss, statistical tools like the influence function [Ham47] allow us to accurately and efficiently estimate how different training data choices change trained model predictions [RM18; KAT+19; GSL+19]. However, in the non-convex settings that are ubiquitous in natural domains like language/vision, current methods are less effective. Indeed, the best existing methods produce estimates that typically (a) only moderately correlate with the ground truth [BPF21; BNL+22; PGI+23] and (b) incur large absolute error [BNL+22].
Describe Anything: Detailed Localized Image and Video Captioning
Lian, Long, Ding, Yifan, Ge, Yunhao, Liu, Sifei, Mao, Hanzi, Li, Boyi, Pavone, Marco, Liu, Ming-Yu, Darrell, Trevor, Yala, Adam, Cui, Yin
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
Invariant Learning with Annotation-free Environments
Le, Phuong Quynh, Seifert, Christin, Schlötterer, Jörg
Invariant learning is a promising approach to improve domain generalization compared to Empirical Risk Minimization (ERM). However, most invariant learning methods rely on the assumption that training examples are pre-partitioned into different known environments. We instead infer environments without the need for additional annotations, motivated by observations of the properties within the representation space of a trained ERM model. We show the preliminary effectiveness of our approach on the ColoredMNIST benchmark, achieving performance comparable to methods requiring explicit environment labels and on par with an annotation-free method that poses strong restrictions on the ERM reference model.
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Arnaud, Sergio, McVay, Paul, Martin, Ada, Majumdar, Arjun, Jatavallabhula, Krishna Murthy, Thomas, Phillip, Partsey, Ruslan, Dugas, Daniel, Gejji, Abha, Sax, Alexander, Berges, Vincent-Pierre, Henaff, Mikael, Jain, Ayush, Cao, Ang, Prasad, Ishita, Kalakrishnan, Mrinal, Rabbat, Michael, Ballas, Nicolas, Assran, Mido, Maksymets, Oleksandr, Rajeswaran, Aravind, Meier, Franziska
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
Spectral Algorithms under Covariate Shift
Fan, Jun, Guo, Zheng-Chu, Shi, Lei
Spectral algorithms leverage spectral regularization techniques to analyze and process data, providing a flexible framework for addressing supervised learning problems. To deepen our understanding of their performance in real-world scenarios where the distributions of training and test data may differ, we conduct a rigorous investigation into the convergence behavior of spectral algorithms under distribution shifts, specifically within the framework of reproducing kernel Hilbert spaces. Our study focuses on the case of covariate shift. In this scenario, the marginal distributions of the input data differ between the training and test datasets, while the conditional distribution of the output given the input remains unchanged. Under this setting, we analyze the generalization error of spectral algorithms and show that they achieve minimax optimality when the density ratios between the training and test distributions are uniformly bounded. However, we also identify a critical limitation: when the density ratios are unbounded, the spectral algorithms may become suboptimal. To address this limitation, we propose a weighted spectral algorithm that incorporates density ratio information into the learning process. Our theoretical analysis shows that this weighted approach achieves optimal capacity-independent convergence rates. Furthermore, by introducing a weight clipping technique, we demonstrate that the convergence rates of the weighted spectral algorithm can approach the optimal capacity-dependent convergence rates arbitrarily closely. This improvement resolves the suboptimality issue in unbounded density ratio scenarios and advances the state-of-the-art by refining existing theoretical results.
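As a concrete illustration of the weighted approach with weight clipping, the sketch below uses kernel ridge regression (Tikhonov regularization), which is one member of the spectral algorithm family. The abstract does not give the estimator's exact form, so the objective, the Gaussian kernel, the parameter names, and the assumption that density ratio values are known at the training points are all illustrative choices. The objective assumed here is (1/n) Σ w_i (f(x_i) − y_i)² + λ‖f‖², with w_i = min(w(x_i), D) for a clipping level D, whose minimizer has coefficients α solving (W K + nλ I) α = W y.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def fit_weighted_krr(X, y, density_ratio, lam=1e-2, clip=None, bandwidth=1.0):
    """Importance-weighted kernel ridge regression (illustrative sketch).

    density_ratio: test/train density ratio at each training point
                   (assumed known or estimated separately).
    clip:          optional upper bound D applied to the weights.
    Returns the coefficient vector alpha.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(density_ratio, dtype=float)
    if clip is not None:
        w = np.minimum(w, clip)  # weight clipping
    n = len(y)
    K = gaussian_kernel(X, X, bandwidth)
    # Solve (W K + n*lam*I) alpha = W y, with W = diag(w).
    return np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y)


def predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate the fitted function at new inputs."""
    K_new = gaussian_kernel(np.asarray(X_new, dtype=float),
                            np.asarray(X_train, dtype=float), bandwidth)
    return K_new @ alpha
```

With all weights equal to one, the estimator reduces to ordinary kernel ridge regression, which matches the unweighted case analyzed for bounded density ratios. Clipping trades a small bias for lower variance when a few training points carry very large density ratios.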
An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research
Reizinger, Patrik, Balestriero, Randall, Klindt, David, Brendel, Wieland
Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.