
 video sequence




License of the assets

Neural Information Processing Systems

License for the code: We use the code for MS-TCN [13], ASRF [24], and LAS [9], all of which are released under the MIT License (https://opensource.org/licenses/MIT). For the Jigsaws [18] dataset, we follow the data use agreement according to https://cs.jhu.

Action classification: Action classification is the task of identifying a single action, as opposed to a sequence of actions. Several methods use 2D CNNs to extract frame-wise features from an input video, which are then combined to predict a coarse action taking place in the video [56, 39, 59]. There are also several works that perform action classification from kinematic data [2, 12].

Action segmentation: Action segmentation is the problem of segmenting an input stream of data by labeling each frame according to the action that is being carried out. Earlier methods for action segmentation employed hidden Markov models [33, 22]. More recently, convolutional neural networks [58, 26] and recurrent neural networks [50] have been applied to this problem. Inspired by the success of temporal convolutional networks (TCNs) in speech synthesis, [37] adapted these models to action segmentation. MS-TCN [13], which uses a multi-stage TCN architecture, has become one of the most widely used architectures for action segmentation. Although these methods achieve high frame-wise accuracy, they still produce a significant number of over-segmentation errors. To address this, several boundary-aware methods have been developed that perform temporal smoothing of the frame-wise predictions [57, 24]. These methods use ground-truth boundary information to train a binary classification network to perform boundary detection. The boundary estimates are then used to aggregate the frame-wise predictions, either in a soft manner (boundary-aware pooling) or by setting a hard threshold. However, elemental actions, such as the functional primitives in the StrokeRehab dataset, have a very short duration. As a result, the boundaries between actions can be hard to detect or even hard to define (see Figure 4).

Sequence-to-sequence models: Our proposed method is based on sequence-to-sequence (seq2seq) models. These models allow us to learn a mapping from a variable-length input sequence to a variable-length output sequence [53].
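
To make the hard-threshold variant of boundary-aware aggregation from the action segmentation paragraph more concrete, here is a minimal sketch. It is not the procedure of any of the cited methods: the function names, the 0.5 threshold, the majority-vote rule, and the toy labels are all assumptions chosen for illustration. Frames whose boundary probability reaches the threshold start a new segment, and every frame in a segment is relabeled with that segment's most frequent frame-wise prediction.

```python
from collections import Counter


def refine_with_boundaries(frame_labels, boundary_probs, threshold=0.5):
    """Relabel each segment between detected boundaries by majority vote.

    Hypothetical helper: frame_labels are per-frame predicted actions and
    boundary_probs are per-frame boundary probabilities from a separate
    (assumed) boundary detector.
    """
    if len(frame_labels) != len(boundary_probs):
        raise ValueError("inputs must have the same length")
    if not frame_labels:
        return []
    # Frame 0 always opens a segment; later frames open one when the
    # boundary probability reaches the hard threshold.
    starts = [0] + [
        t for t in range(1, len(frame_labels)) if boundary_probs[t] >= threshold
    ]
    ends = starts[1:] + [len(frame_labels)]
    refined = []
    for start, end in zip(starts, ends):
        segment = frame_labels[start:end]
        majority = Counter(segment).most_common(1)[0][0]
        refined.extend([majority] * len(segment))
    return refined


def count_segments(labels):
    """Count maximal runs of identical consecutive labels."""
    return sum(1 for i, lab in enumerate(labels) if i == 0 or lab != labels[i - 1])


# Toy example (made-up labels): two spurious single-frame predictions
# cause over-segmentation in the raw frame-wise output.
frame_labels = ["reach", "reach", "idle", "reach",
                "transport", "transport", "reach", "transport"]
boundary_probs = [0.9, 0.1, 0.2, 0.1, 0.8, 0.1, 0.3, 0.2]

refined = refine_with_boundaries(frame_labels, boundary_probs)
print(count_segments(frame_labels), "->", count_segments(refined))
```

In this toy input the raw predictions contain six segments, while the refined sequence contains two. The sketch also shows why the approach depends on boundary quality: if a true boundary falls below the threshold, a short action is absorbed into its neighbor, which is the difficulty the text raises for very short actions.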


MBW: Multi-view Bootstrapping in the Wild

Neural Information Processing Systems

Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled.


Trading robust representations for sample complexity through self-supervised visual experience

Neural Information Processing Systems

Learning in small sample regimes is among the most remarkable features of the human perceptual system. This ability is related to robustness to transformations, which is acquired through visual experience in the form of weak or self-supervision during development. We explore the idea of allowing artificial systems to learn representations of visual stimuli through weak supervision prior to downstream supervised tasks. We introduce a novel loss function for representation learning using unlabeled image sets and video sequences, and experimentally demonstrate that these representations support one-shot learning and reduce the sample complexity of multiple recognition tasks. We establish the existence of a trade-off between the size of weakly supervised data sets, obtained automatically from video sequences, and the size of fully supervised data sets. Our results suggest that equivalence sets other than class labels, which are abundant in unlabeled visual experience, can be used for self-supervised learning of semantically relevant image embeddings.






Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements

Neural Information Processing Systems

In this paper, we present a novel robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that does not require any external training data corpus.