We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.
Below are some selected videos generated by our model. These videos are not real; they are hallucinated by a generative video model. While they are not photo-realistic, the motions are fairly reasonable for the scene category the model was trained on.
We can also use the model to add animations to static images by training it to generate the future. Of course, the future is uncertain, so the model rarely generates the "correct" future, but we think the prediction has some plausibility.
For more results, check out the links below. Warning: The pages are fairly large (100 MB), so please browse on a computer and not a phone, and wait for the GIFs to download completely.
Learning models that generate videos may also be a promising way to learn representations. For example, we can train generators on a large repository of unlabeled videos, then fine-tune the discriminator on a small labeled dataset in order to recognize some actions with minimal supervision. We report accuracy on UCF101 and compare to other unsupervised learning methods for videos:
We can also visualize what emerges in the representation for predicting the future. While not all units are semantic, we found a few hidden units that fire on objects that are sources of motion, such as people or train tracks. Since generating the future requires understanding moving objects, the network may learn to recognize these objects internally, even though it is not supervised to do so.

Hidden unit for "people" · Hidden unit for "train tracks"
The images above highlight regions where a particular convolutional hidden unit fires.
Our approach builds on generative image models that leverage adversarial learning, which we apply to video. The basic idea behind the approach is to compete two deep networks against each other. One network ("the generator") tries to generate a synthetic video, and another network ("the discriminator") tries to discriminate synthetic versus real videos. The generator is trained to fool the discriminator.
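The two-player objective described above can be sketched with the standard GAN losses. This is a minimal illustration, not our training code: `p_real` and `p_fake` stand for the discriminator's probability estimates on a real and a generated video, and the generator uses the common non-saturating loss.

```python
import math

def discriminator_loss(p_real, p_fake):
    # Binary cross-entropy: the discriminator is rewarded for scoring
    # real videos near 1 and synthetic videos near 0.
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake):
    # Non-saturating generator loss: the generator is rewarded when the
    # discriminator scores its synthetic video near 1 (i.e., it is fooled).
    return -math.log(p_fake)
```

During training the two losses are minimized in alternation: a discriminator step on a batch of real and generated videos, then a generator step through the discriminator's gradient.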
For the generator, we use a deep convolutional network that takes low-dimensional random noise as input and outputs a video. To model video, we use spatiotemporal up-convolutions (2D for space, 1D for time). The generator also models the background separately from the foreground. The network produces a static background (which is replicated over time) and a moving foreground, which are combined using a mask. We illustrate this below:
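The foreground/background composition amounts to a per-pixel convex combination: video = mask ⊙ foreground + (1 − mask) ⊙ background, with the static background broadcast across time. A small numpy sketch of that composition step (array shapes are illustrative):

```python
import numpy as np

def compose_video(foreground, mask, background):
    """Combine a moving foreground with a static background via a soft mask.

    foreground: (T, H, W, C) video stream
    mask:       (T, H, W, 1) values in [0, 1]; 1 selects foreground
    background: (H, W, C) single image, replicated over the T time steps
    """
    static = np.broadcast_to(background, foreground.shape)
    return mask * foreground + (1.0 - mask) * static
```

In the full model, all three components (foreground, mask, and background) are produced by the generator network itself from the same noise vector.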
We simultaneously train a discriminator network to distinguish real videos from fake videos. We use a deep spatiotemporal convolutional network for the discriminator, visualized here.
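A spatiotemporal discriminator shrinks the clip along both space and time with strided 3D convolutions until a single real/fake score remains. The walk-through below uses the standard output-size formula; the specific kernel, stride, and padding values are illustrative assumptions, not a spec of our exact architecture.

```python
def conv3d_out(size, kernel=4, stride=2, pad=1):
    # Output length of one axis after a strided convolution.
    return (size + 2 * pad - kernel) // stride + 1

# Assumed setup: a 32-frame, 64x64 clip passed through four stride-2 layers,
# halving each axis every layer: time 32->16->8->4->2, space 64->32->16->8->4.
t, h = 32, 64
for _ in range(4):
    t, h = conv3d_out(t), conv3d_out(h)
```

A final convolution over the remaining volume then produces the scalar real-vs-fake score.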
We downloaded two years of videos from Flickr, which we stabilize, automatically filter by scene category, and use for training.
To predict the future, one can attach an encoder to the generator, visualized here.
Adversarial learning is tricky to get right, and our model has several limitations we wish to point out.
This project builds upon many great ideas, which we encourage you to check out. They have very cool results!
Please send us any references we may have accidentally missed.
We are making the stabilized video frames available for others to use. You may download the dataset in full or in parts. Each tarball contains JPEG files, each of which packs 32 video frames vertically concatenated, corresponding to roughly a second of video. The videos have been stabilized by SIFT+RANSAC.
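Since each JPEG stores 32 frames stacked vertically, recovering the individual frames is a single reshape once the image is loaded as an array. A minimal sketch (loading the JPEG into a numpy array, e.g. with Pillow, is left out):

```python
import numpy as np

def split_frames(strip, num_frames=32):
    """Split a vertically concatenated frame strip into individual frames.

    strip: (num_frames * H, W, C) array decoded from one dataset JPEG
    returns: (num_frames, H, W, C) array of frames in temporal order
    """
    height = strip.shape[0] // num_frames
    return strip.reshape(num_frames, height, *strip.shape[1:])
```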
The total dataset is 9 TB, which consists of 35 million clips. This is over one year of video.
Or you can download pre-processed and stabilized frames directly:
Videos from "Videos1" are part of the Yahoo Flickr Creative Commons Dataset. Videos from "Videos2" were downloaded by querying Flickr for common tags and English words. There should be no overlap between the two.
If you use this data in your research project, please cite the Yahoo dataset and our paper.
We thank Yusuf Aytar for dataset discussions. We thank MIT TIG, especially Garrett Wollman, for troubleshooting issues on storing the 26 terabytes of video. We are grateful for the Torch7 community for answering many questions. NVidia donated GPUs used for this research. This work was partially supported by NSF grant #1524817 to AT, START program at UMBC to HP, and the Google PhD fellowship to CV.