Hérault, Romain
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers
Denize, Julien, Liashuha, Mykola, Rabarisoa, Jaonary, Orcesi, Astrid, Hérault, Romain
We present COMEDIAN, a novel pipeline that combines self-supervised learning and knowledge distillation to initialize spatiotemporal transformers for action spotting. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
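The second pretraining stage distills a pre-computed feature bank into the temporal transformer. A minimal numpy sketch of such a distillation objective is shown below; the mean-squared-error form and the names `student_feats` and `bank_feats` are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def distillation_loss(student_feats, bank_feats):
    """Mean squared error between the temporal transformer's outputs
    and the pre-computed feature bank aligned with the same clips.
    (Illustrative objective; the paper's exact loss may differ.)"""
    return np.mean((student_feats - bank_feats) ** 2)

# Toy example: 4 clips, 128-d features.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 128))
loss_self = distillation_loss(student, student)              # identical -> 0
loss_rand = distillation_loss(student, rng.normal(size=(4, 128)))
```

Driving this loss to zero aligns the student's clip-level features with the bank's global context before fine-tuning on action spotting.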
SoccerNet 2023 Challenges Results
Cioppa, Anthony, Giancola, Silvio, Somers, Vladimir, Magera, Floriane, Zhou, Xin, Mkhallati, Hassan, Deliège, Adrien, Held, Jan, Hinojosa, Carlos, Mansourian, Amir M., Miralles, Pierre, Barnich, Olivier, De Vleeschouwer, Christophe, Alahi, Alexandre, Ghanem, Bernard, Van Droogenbroeck, Marc, Kamal, Abdullah, Maglo, Adrien, Clapés, Albert, Abdelaziz, Amr, Xarles, Artur, Orcesi, Astrid, Scott, Atom, Liu, Bin, Lim, Byoungkwon, Chen, Chen, Deuser, Fabian, Yan, Feng, Yu, Fufu, Shitrit, Gal, Wang, Guanshuo, Choi, Gyusik, Kim, Hankyul, Guo, Hao, Fahrudin, Hasby, Koguchi, Hidenari, Ardö, Håkan, Salah, Ibrahim, Yerushalmy, Ido, Muhammad, Iftikar, Uchida, Ikuma, Be'ery, Ishay, Rabarisoa, Jaonary, Lee, Jeongae, Fu, Jiajun, Yin, Jianqin, Xu, Jinghang, Nang, Jongho, Denize, Julien, Li, Junjie, Zhang, Junpei, Kim, Juntae, Synowiec, Kamil, Kobayashi, Kenji, Zhang, Kexin, Habel, Konrad, Nakajima, Kota, Jiao, Licheng, Ma, Lin, Wang, Lizhi, Wang, Luping, Li, Menglong, Zhou, Mengying, Nasr, Mohamed, Abdelwahed, Mohamed, Liashuha, Mykola, Falaleev, Nikolay, Oswald, Norbert, Jia, Qiong, Pham, Quoc-Cuong, Song, Ran, Hérault, Romain, Peng, Rui, Chen, Ruilong, Liu, Ruixuan, Baikulov, Ruslan, Fukushima, Ryuto, Escalera, Sergio, Lee, Seungcheon, Chen, Shimin, Ding, Shouhong, Someya, Taiga, Moeslund, Thomas B., Li, Tianjiao, Shen, Wei, Zhang, Wei, Li, Wei, Dai, Wei, Luo, Weixin, Zhao, Wending, Zhang, Wenjie, Yang, Xinquan, Ma, Yanbiao, Joo, Yeeun, Zeng, Yingsen, Gan, Yiyang, Zhu, Yongqiang, Zhong, Yujie, Ruan, Zheng, Li, Zhiheng, Huang, Zhijian, Meng, Ziyu
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org. Baselines and development kits can be found at https://github.com/SoccerNet.
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning
Denize, Julien, Rabarisoa, Jaonary, Orcesi, Astrid, Hérault, Romain
Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, which contrastive learning harms by treating all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We empirically validate our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.
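The soft contrastive objective described above can be sketched as a cross-entropy against a target that mixes a one-hot "positive" distribution with a similarity distribution over instances. This is a simplified numpy sketch: the mixing coefficient `lam` and the two temperatures are illustrative parameters, and the full SCE formulation has more detail (momentum encoder, augmentation pairing).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sce_loss(sim_pred, sim_target, lam=0.5, tau=0.1, tau_t=0.07):
    """Soft contrastive objective: the target mixes a one-hot positive
    distribution with a similarity distribution over instances, instead
    of treating all negatives as pure noise. Both inputs are
    (batch, batch) similarity matrices. (Simplified sketch.)"""
    n = sim_pred.shape[0]
    one_hot = np.eye(n)
    target = lam * one_hot + (1 - lam) * softmax(sim_target / tau_t)
    log_q = np.log(softmax(sim_pred / tau))
    return -np.mean(np.sum(target * log_q, axis=1))

rng = np.random.default_rng(1)
sims = rng.normal(size=(8, 8))
loss = sce_loss(sims, sims)
```

With `lam=1` this reduces to the usual hard InfoNCE-style cross-entropy; `lam<1` lets learned inter-instance similarities soften the targets.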
Similarity Contrastive Estimation for Self-Supervised Soft Contrastive Learning
Denize, Julien, Rabarisoa, Jaonary, Orcesi, Astrid, Hérault, Romain, Canu, Stéphane
Contrastive representation learning has proven to be an effective self-supervised learning method. Most successful approaches are based on the Noise Contrastive Estimation (NCE) paradigm and consider different views of an instance as positives and other instances as noise that positives should be contrasted with. However, all instances in a dataset are drawn from the same distribution and share underlying semantic information that should not be considered as noise. We argue that a good data representation contains the relations, or semantic similarity, between the instances. Contrastive learning implicitly learns relations but considers the negatives as noise, which is harmful to the quality of the learned relations and therefore the quality of the representation. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective can be considered as soft contrastive learning. Instead of hard classifying positives and negatives, we propose a continuous distribution to push or pull instances based on their semantic similarities. The target similarity distribution is computed from weak augmented instances and sharpened to eliminate irrelevant relations. Each weak augmented instance is paired with a strong augmented instance that contrasts its positive while maintaining the target similarity distribution. Experimental results show that our proposed SCE outperforms its baselines MoCov2 and ReSSL on various datasets and is competitive with state-of-the-art algorithms on the ImageNet linear evaluation protocol.
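The "computed from weak augmented instances and sharpened" step can be illustrated as follows. This sketch builds a softmax similarity distribution and sharpens it with a temperature; the power-based sharpening rule and the temperature values are assumptions for illustration, not necessarily the paper's exact operators.

```python
import numpy as np

def sharpen(p, T=0.25):
    """Sharpen a probability distribution with temperature T < 1,
    suppressing small (irrelevant) relations."""
    q = p ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def target_distribution(weak_sims, tau=0.07, T=0.25):
    """Soft target from similarities between weakly augmented
    instances, then sharpened (illustrative sketch)."""
    s = weak_sims / tau
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    p = e / e.sum(axis=-1, keepdims=True)
    return sharpen(p, T)

# One instance with similarities to three others: the dominant
# relation survives sharpening, weak relations are suppressed.
p = target_distribution(np.array([[0.9, 0.1, 0.0]]))
```

Weak augmentations keep the similarity estimates reliable; the strongly augmented view is then trained against this sharpened target.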
Open Set Domain Adaptation using Optimal Transport
Kechaou, Marwa, Hérault, Romain, Alaya, Mokhtar Z., Gasso, Gilles
We present a 2-step optimal transport approach that performs a mapping from a source distribution to a target distribution. Here, the target has the particularity of containing new classes not present in the source domain. The first step of the approach rejects the samples issued from these new classes using an optimal transport plan. The second step solves the target (class ratio) shift, still as an optimal transport problem. We develop a dual approach to solve the optimization problem involved at each step, and we show that our results outperform recent state-of-the-art performances. We further apply the approach to the setting where the source and target distributions present both a label shift and an increasing covariate (features) shift to demonstrate its robustness.
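The rejection idea in the first step can be sketched with a toy entropic transport plan: target samples that receive almost no mass from the source are candidates for the new, unseen classes. This is a loose numpy sketch that enforces only the source marginal (a relaxed plan), not the paper's exact dual formulation; the rejection threshold is an arbitrary illustrative choice.

```python
import numpy as np

def relaxed_plan(a, M, reg=1.0):
    """Entropic transport plan with only the source marginal enforced:
    each source point spreads its mass over targets in proportion to
    exp(-cost/reg), so targets far from every source receive almost
    no mass. (Loose sketch of the rejection idea.)"""
    K = np.exp(-M / reg)
    return (a / K.sum(axis=1))[:, None] * K

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(20, 2))        # source: known classes
tgt_known = rng.normal(0.0, 1.0, size=(15, 2))  # target: known classes
tgt_new = rng.normal(8.0, 0.5, size=(5, 2))     # target: unseen "new" class
tgt = np.vstack([tgt_known, tgt_new])

M = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)  # squared-distance cost
a = np.full(len(src), 1 / len(src))                     # uniform source weights
plan = relaxed_plan(a, M)
mass = plan.sum(axis=0)            # mass received by each target sample
rejected = mass < 1e-6             # step 1: flag likely new-class samples
```

Here the five far-away target points receive negligible transport mass and are flagged; step 2 would then re-solve the transport on the retained samples while correcting class proportions.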
Neural Networks Regularization Through Class-wise Invariant Representation Learning
Belharbi, Soufiane, Chatelain, Clément, Hérault, Romain, Adam, Sébastien
Training deep neural networks is known to require a large number of training samples. However, in many applications only few training samples are available. In this work, we tackle the issue of training neural networks for classification tasks when few training samples are available. We propose a new regularization term that constrains the hidden layers of a network to learn class-wise invariant representations. In our regularization framework, learning invariant representations is generalized to class membership: samples with the same class should have the same representation. Numerical experiments on MNIST and its variants show that our proposal helps improve the generalization of neural networks, particularly when trained with few samples. We provide the source code of our framework at https://github.com/sbelharbi/learning-class-invariant-features .
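One simple way to express "samples with the same class should have the same representation" is a penalty on the within-class variance of the hidden activations. This numpy sketch is an illustrative form of such a regularizer, not the paper's exact criterion.

```python
import numpy as np

def class_invariance_penalty(hidden, labels):
    """Sum over classes of the variance of hidden representations
    within that class. Driving it to zero pushes same-class samples
    toward the same representation. (Illustrative form.)"""
    penalty = 0.0
    for c in np.unique(labels):
        h_c = hidden[labels == c]
        penalty += np.mean((h_c - h_c.mean(axis=0)) ** 2)
    return penalty

# Identical same-class representations incur no penalty...
h = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([0, 0, 1])
p = class_invariance_penalty(h, y)
# ...while differing same-class representations are penalized.
p2 = class_invariance_penalty(np.array([[1.0, 0.0], [0.0, 1.0]]),
                              np.array([0, 0]))
```

In training, such a term would be added (with a weight) to the usual classification loss and applied to one or more hidden layers.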
Inversion using a new low-dimensional representation of complex binary geological media based on a deep neural network
Laloy, Eric, Hérault, Romain, Lee, John, Jacques, Diederik, Linde, Niklas
Efficient and high-fidelity prior sampling and inversion for complex geological media is still a largely unsolved challenge. Here, we use a deep neural network of the variational autoencoder type to construct a parametric low-dimensional base model parameterization of complex binary geological media. For inversion purposes, it has the attractive feature that random draws from an uncorrelated standard normal distribution yield model realizations with spatial characteristics that are in agreement with the training set. In comparison with the most commonly used parametric representations in probabilistic inversion, we find that our dimensionality reduction (DR) approach outperforms principal component analysis (PCA), optimization-PCA (OPCA) and discrete cosine transform (DCT) DR techniques for unconditional geostatistical simulation of a channelized prior model. For the considered examples, substantial compression ratios (200-500) are achieved. Given that the construction of our parameterization requires a training set of several tens of thousands of prior model realizations, our DR approach is more suited for probabilistic (or deterministic) inversion than for unconditional (or point-conditioned) geostatistical simulation. Probabilistic inversions of 2D steady-state and 3D transient hydraulic tomography data are used to demonstrate the DR-based inversion. For the 2D case study, the performance is superior compared to current state-of-the-art multiple-point statistics inversion by sequential geostatistical resampling (SGR). Inversion results for the 3D application are also encouraging.
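The key mechanism, drawing an uncorrelated standard-normal latent vector and decoding it into a binary model realization, can be sketched as follows. The decoder here is a hypothetical stand-in (a random linear map with a threshold), used only to show the sampling interface and the compression-ratio arithmetic; the actual trained VAE decoder is of course nonlinear.

```python
import numpy as np

# Hypothetical stand-in for the trained VAE decoder: any function
# mapping a low-dimensional z to a binary model grid would do here.
rng = np.random.default_rng(0)
grid_shape = (100, 100)   # 10,000 grid cells
latent_dim = 25           # -> compression ratio 400, within the 200-500 range
W = rng.normal(size=(latent_dim, np.prod(grid_shape)))

def decode(z):
    """Dummy decoder: linear map + threshold to a binary facies grid."""
    return (z @ W > 0).astype(int).reshape(grid_shape)

# Unconditional realization: one draw from an uncorrelated standard
# normal in latent space yields one binary model realization.
z = rng.standard_normal(latent_dim)
realization = decode(z)
compression_ratio = np.prod(grid_shape) / latent_dim
```

Because inversion then operates on the 25 latent coordinates rather than 10,000 grid cells, standard samplers can explore the model space far more efficiently.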
Efficient training-image based geostatistical simulation and inversion using a spatial generative adversarial neural network
Laloy, Eric, Hérault, Romain, Jacques, Diederik, Linde, Niklas
Probabilistic inversion within a multiple-point statistics framework is still computationally prohibitive for large-scale problems. To partly address this, we introduce and evaluate a new training-image based simulation and inversion approach for complex geologic media. Our approach relies on a deep neural network of the spatial generative adversarial network (SGAN) type. After training using a training image (TI), our proposed SGAN can quickly generate 2D and 3D unconditional realizations. A key feature of our SGAN is that it defines a (very) low-dimensional parameterization, thereby allowing for efficient probabilistic (or deterministic) inversion using state-of-the-art Markov chain Monte Carlo (MCMC) methods. A series of 2D and 3D categorical TIs is first used to analyze the performance of our SGAN for unconditional simulation. The speed at which realizations are generated makes it especially useful for simulating over large grids and/or from a complex multi-categorical TI. Subsequently, synthetic inversion case studies involving 2D steady-state flow and 3D transient hydraulic tomography are used to illustrate the effectiveness of our proposed SGAN-based probabilistic inversion. For the 2D case, the inversion rapidly explores the posterior model distribution. For the 3D case, the inversion recovers model realizations that fit the data close to the target level and visually resemble the true model well. Future work will focus on the inclusion of direct conditioning data and application to continuous TIs.
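The SGAN-based probabilistic inversion can be illustrated with a plain Metropolis sampler in the low-dimensional latent space. Everything here is a simplified sketch: `generator` is a hypothetical stand-in for the trained SGAN, the Gaussian likelihood and step size are illustrative, and the paper uses state-of-the-art MCMC methods rather than this basic random-walk sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z):
    """Hypothetical stand-in for the trained SGAN generator."""
    return np.tanh(z)  # placeholder latent -> model-realization map

def log_likelihood(model, data, sigma=0.1):
    """Gaussian data misfit (illustrative likelihood)."""
    return -0.5 * np.sum((model - data) ** 2) / sigma**2

def mcmc_in_latent_space(data, dim=8, n_steps=500, step=0.2):
    """Metropolis sampling over the low-dimensional latent z with a
    standard-normal prior: inversion explores the posterior by
    proposing small latent perturbations (simplified sketch)."""
    z = rng.standard_normal(dim)
    lp = log_likelihood(generator(z), data) - 0.5 * z @ z
    samples = []
    for _ in range(n_steps):
        z_new = z + step * rng.standard_normal(dim)
        lp_new = log_likelihood(generator(z_new), data) - 0.5 * z_new @ z_new
        if np.log(rng.uniform()) < lp_new - lp:   # accept/reject
            z, lp = z_new, lp_new
        samples.append(z.copy())
    return np.array(samples)

true_z = rng.standard_normal(8)
data = generator(true_z)
chain = mcmc_in_latent_space(data)
```

The low dimensionality of z is what makes this feasible: each proposal perturbs a handful of latent coordinates, while the generator guarantees every visited model still looks like the training image.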
Key point selection and clustering of swimmer coordination through Sparse Fisher-EM
Komar, John, Hérault, Romain, Seifert, Ludovic
To investigate the existence of optimal swimmer learning/teaching strategies, this work introduces a two-level clustering to analyze the temporal dynamics of motor learning in breaststroke swimming. Each level is performed with Sparse Fisher-EM, an unsupervised framework that can be applied efficiently to large and correlated datasets. The induced sparsity selects key points of the coordination phase without any prior knowledge.
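The two-level scheme can be sketched with a much simpler clustering algorithm. This sketch uses plain k-means as a stand-in for Sparse Fisher-EM (which additionally learns a sparse discriminative latent subspace, and is what the paper actually uses); the synthetic "coordination profiles" are invented for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means, used here as a simple stand-in for Sparse
    Fisher-EM at each clustering level."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Level 1: cluster swimmers by overall coordination profile; level 2
# would then cluster the time course within each swimmer group.
rng = np.random.default_rng(1)
profiles = np.vstack([rng.normal(0, 0.3, (10, 4)),
                      rng.normal(3, 0.3, (10, 4))])
level1 = kmeans(profiles, 2)
```

In the two-level setup, the second level is run separately inside each level-1 group, which is what exposes the temporal dynamics of learning within each strategy.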