Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects
It can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating the expected motion of objects. This is achieved by explicitly encoding object numbers, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016).
In particular, we clarify some potential misunderstandings from R#3 and provide extra experiments as suggested by R#3.
We thank all reviewers for their valuable and constructive comments. Below, we address the detailed comments. It is shown that PR can be extended to "selectively" incorporate uncertain ... We will make this clearer in the final version. The odd columns are real data and the even ones are the reconstruction results; it was an oversight to omit the 8th column (i.e., the reconstruction ...). We will fix these issues for better presentation.
Reviews: Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects
I have read the other reviews and the author rebuttal. I am still very much in favor of accepting this paper, but I have revised my score down from a 9 to an 8; some of the issues pointed out by the other reviewers, while well addressed in the rebuttal, made me realize that my initial view of the paper was a bit too rosy. The model starts with the basic Attend, Infer, Repeat (AIR) framework and extends it to handle image sequences (SQAIR). This extension requires taking into account the fact that objects may enter or leave the frame over the course of a motion sequence. To support this behavior, SQAIR's generative and inference networks for each frame have two phases.
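As a reading aid, the two-phase structure the review describes can be sketched in a few lines of Python. This is a minimal sketch, not the paper's method: the names `infer_frame`, `propagate`, and `discover` are hypothetical stand-ins for the propagation and discovery modules.

```python
# Hypothetical sketch of two-phase per-frame inference: phase 1
# (propagation) updates latents of objects tracked from the previous
# frame; phase 2 (discovery) detects objects not yet explained.
# All function and variable names are illustrative, not the paper's.

def infer_frame(frame, prev_objects, propagate, discover):
    """Run one frame of two-phase inference.

    propagate(frame, obj) -> updated obj, or None if the object left.
    discover(frame, objects) -> list of newly appeared objects.
    """
    # Phase 1: propagate each previously tracked object; drop those
    # whose presence variable says they have left the frame.
    propagated = []
    for obj in prev_objects:
        updated = propagate(frame, obj)
        if updated is not None:
            propagated.append(updated)
    # Phase 2: discover objects in regions not explained by propagation.
    new_objects = discover(frame, propagated)
    return propagated + new_objects
```

Propagation runs first so that discovery only has to explain objects not already accounted for by the previous frame; this ordering is what lets the model tell a tracked object apart from a newly appeared one.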
Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects
Kosiorek, Adam, Kim, Hyunjik, Teh, Yee Whye, Posner, Ingmar
It can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating expected motion of objects. This is achieved by explicitly encoding object numbers, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et. We use a moving multi-\textsc{mnist} dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how \textsc{sqair} overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.
Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking
Crawford, Eric, Pineau, Joelle
The ability to detect and track objects in the visual world is a crucial skill for any intelligent agent, as it is a necessary precursor to any object-level reasoning process. Moreover, it is important that agents learn to track objects without supervision (i.e. without access to annotated training videos) since this will allow agents to begin operating in new environments with minimal human assistance. The task of learning to discover and track objects in videos, which we call unsupervised object tracking, has grown in prominence in recent years; however, most architectures that address it still struggle to deal with large scenes containing many objects. In the current work, we propose an architecture that scales well to the large-scene, many-object setting by employing spatially invariant computations (convolutions and spatial attention) and representations (a spatially local object specification scheme). In a series of experiments, we demonstrate a number of attractive features of our architecture; most notably, that it outperforms competing methods at tracking objects in cluttered scenes with many objects, and that it can generalize well to videos that are larger and/or contain more objects than videos encountered during training.
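The "spatially local object specification scheme" mentioned above can be illustrated with a small sketch, assuming (as in grid-based detectors generally) that each object's position is stored as an offset within its grid cell rather than as a global coordinate; the function and its parameters are hypothetical, not the paper's actual interface.

```python
# Illustrative sketch: a cell-local position representation. Because
# each position is expressed relative to its own grid cell, the same
# representation (and the convolutional network producing it) applies
# unchanged to frames of any size, which is what enables generalization
# to larger scenes. Names and the cell_size parameter are assumptions.

def to_global(cell_row, cell_col, local_offset, cell_size):
    """Convert a cell-local (y, x) offset in [0, 1) to global pixels."""
    y = (cell_row + local_offset[0]) * cell_size
    x = (cell_col + local_offset[1]) * cell_size
    return (y, x)
```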
Variational Tracking and Prediction with Generative Disentangled State-Space Models
Akhundov, Adnan, Soelch, Maximilian, Bayer, Justin, van der Smagt, Patrick
We address tracking and prediction of multiple moving objects in visual data streams as inference and sampling in a disentangled latent state-space model. By encoding objects separately and including explicit position information in the latent state space, we perform tracking via amortized variational Bayesian inference of the respective latent positions. Inference is implemented in a modular neural framework tailored towards our disentangled latent space. The generative and inference models are jointly learned from observations only. Compared to related prior work, we empirically show that our Markovian state-space assumption enables faithful and much improved long-term prediction well beyond the training horizon. Further, our inference model correctly decomposes frames into objects, even in the presence of occlusions. Tracking performance is increased significantly over prior art.
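The tracking loop the abstract describes can be sketched roughly as follows; `transition` (the Markovian prior over latent positions) and `encoder` (the amortized inference network) are hypothetical placeholders standing in for learned neural modules.

```python
# Rough sketch of one tracking step in a Markovian latent state-space
# model: each object's latent state holds an explicit position, the
# transition model predicts the next state, and an amortized encoder
# refines that prediction from the new frame. All names are
# illustrative, not the paper's notation.

def track_step(positions, frame, transition, encoder):
    """Advance each object's latent position by one time step."""
    predicted = [transition(p) for p in positions]    # Markovian prior
    refined = [encoder(frame, p) for p in predicted]  # amortized posterior
    return refined
```

For prediction beyond the training horizon, the same loop can be run with the encoder step dropped, sampling forward from the transition model alone; the abstract's long-term prediction claim rests on exactly that Markovian rollout.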
Scalable Object-Oriented Sequential Generative Models
Jiang, Jindong, Janghorbani, Sepehr, de Melo, Gerard, Ahn, Sungjin
In SCALOR, we achieve scalability with respect to the object density by parallelizing both the propagation and discovery processes, reducing the parallel time complexity per scene image from O(N) to O(1), with N the number of objects in an image. We also observe that the serial object processing in SQAIR based on an RNN not only increases the computation time but also deteriorates discovery performance. To this end, we propose a parallel discovery model with much better discovery capacity and performance. Temporally predicting and detecting trajectories of objects, SCALOR can also be regarded as a generative tracking model. In our experiments, we show that SCALOR can model videos with nearly one hundred moving objects along with complex backgrounds on synthetic datasets. Furthermore, we evaluate and demonstrate SCALOR on natural videos with tens of objects and complex backgrounds. The contributions of this work are: (i) we propose the SCALOR model that significantly improves (by two orders of magnitude) the scalability with regard to object density. It is applicable to nearly a hundred objects with computation time comparable to SQAIR, which scales only to a few objects.
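A rough illustration of the parallel-discovery idea, assuming discovery proposals are made independently per spatial grid cell rather than by an RNN emitting objects one at a time; the grid decomposition and the `propose` function are assumptions for illustration, not SCALOR's actual modules.

```python
# Hypothetical sketch: serial discovery with an RNN takes O(N) steps
# in the number of objects, because each emission conditions on the
# previous ones. If each grid cell instead proposes at most one object
# independently, all proposals can be evaluated in a single parallel
# pass, giving O(1) parallel time in the number of objects.

def parallel_discover(frame_cells, propose):
    """Run discovery independently per grid cell; no sequential state."""
    # Each call is independent of the others, so on parallel hardware
    # these evaluate simultaneously rather than in an RNN chain.
    proposals = [propose(cell) for cell in frame_cells]
    return [p for p in proposals if p is not None]
```

The independence across cells is the design choice doing the work here: it removes the sequential dependency that makes RNN-based discovery both slow and, per the abstract, worse at discovery itself.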