Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DVD-GAN

Computer Vision · Introduced 2019 · 1 paper
Source Paper

Description

DVD-GAN is a generative adversarial network for video generation built upon the BigGAN architecture.

DVD-GAN uses two discriminators: a Spatial Discriminator $\mathcal{D}_S$ and a Temporal Discriminator $\mathcal{D}_T$. $\mathcal{D}_S$ critiques single-frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually. The temporal discriminator $\mathcal{D}_T$ must provide $G$ with the learning signal to generate movement (which is not evaluated by $\mathcal{D}_S$).
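The division of labour between the two discriminators can be sketched in terms of what each one sees. This is a shape-only numpy illustration under assumed dimensions (48 frames of 64×64 RGB), not the authors' code:

```python
import numpy as np

# D_S judges k randomly sampled full-resolution frames individually;
# D_T sees the entire clip and must account for motion across frames.
rng = np.random.default_rng(0)
T, H, W, C = 48, 64, 64, 3          # assumed clip dimensions
video = rng.standard_normal((T, H, W, C))

k = 8                                # frames sampled for D_S
frames = video[rng.choice(T, size=k, replace=False)]  # input to D_S
clip = video                                          # input to D_T
print(frames.shape, clip.shape)  # (8, 64, 64, 3) (48, 64, 64, 3)
```

Because $\mathcal{D}_S$ never sees consecutive frames, it cannot penalise implausible motion; that signal can only come from $\mathcal{D}_T$.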

The input to $G$ consists of a Gaussian latent noise $z \sim N(0, I)$ and a learned linear embedding $e(y)$ of the desired class $y$. Both inputs are 120-dimensional vectors. $G$ starts by computing an affine transformation of $[z; e(y)]$ to a $[4, 4, ch_0]$-shaped tensor. $[z; e(y)]$ is also used as the input to all class-conditional Batch Normalization layers throughout $G$. The resulting tensor is then treated as the input (at each frame we would like to generate) to a Convolutional GRU.
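A minimal sketch of this input path, assuming a base channel count `ch0 = 512` and a 10-class embedding table (both illustrative assumptions; the page only fixes the 120-dimensional sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(120)            # Gaussian latent z ~ N(0, I)
num_classes, y = 10, 3                  # assumed class count / label
E = rng.standard_normal((num_classes, 120))
e_y = E[y]                              # learned linear embedding e(y)

zy = np.concatenate([z, e_y])           # [z; e(y)], 240-dimensional
ch0 = 512                               # base channel count (assumption)
W_affine = rng.standard_normal((240, 4 * 4 * ch0)) * 0.01
b_affine = np.zeros(4 * 4 * ch0)
x0 = (zy @ W_affine + b_affine).reshape(4, 4, ch0)  # [4, 4, ch0] tensor
print(x0.shape)  # (4, 4, 512)
```

The same concatenated vector `zy` would also condition every class-conditional Batch Normalization layer in $G$.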

This RNN is unrolled once per frame. Its output is processed by two residual blocks; the time dimension is folded into the batch dimension here, so each frame proceeds through the blocks independently. The blocks double the width and height of their output (upsampling is skipped in the first block). This RNN + residual group is repeated several times, with the output of one group fed as the input to the next, until the output tensors reach the desired spatial dimensions.
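The progression of spatial resolutions follows from the doubling rule alone. A shape-arithmetic sketch, assuming a 4×4 starting tensor and 64×64 target frames (the target size is an illustrative assumption):

```python
# Each ConvGRU + residual group doubles spatial resolution, so reaching
# a 64x64 frame from the 4x4 starting tensor takes log2(64/4) groups.
target = 64          # assumed output frame size
size, groups = 4, 0  # generator starts from a [4, 4, ch0] tensor
while size < target:
    size *= 2        # residual blocks double width and height
    groups += 1      # one RNN + residual group per doubling
print(size, groups)  # 64 4
```

For a different target resolution only the number of groups changes; the per-group structure stays the same.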

The spatial discriminator $\mathcal{D}_S$ functions almost identically to BigGAN’s discriminator. A score is calculated for each of the uniformly sampled $k$ frames (default $k = 8$), and the $\mathcal{D}_S$ output is the sum of the per-frame scores. The temporal discriminator $\mathcal{D}_T$ has a similar architecture, but pre-processes the real or generated video with a $2 \times 2$ average-pooling downsampling function $\phi$. Furthermore, the first two residual blocks of $\mathcal{D}_T$ are 3-D, where every convolution is replaced with a 3-D convolution with a kernel size of $3 \times 3 \times 3$. The rest of the architecture follows BigGAN.
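The pre-processing function $\phi$ is a plain $2 \times 2$ spatial average pool applied to every frame. A numpy version of just this step (an illustration, not the authors' implementation):

```python
import numpy as np

def phi(video):
    """2x2 spatial average pooling per frame: (T, H, W, C) -> (T, H/2, W/2, C)."""
    T, H, W, C = video.shape
    return video.reshape(T, H // 2, 2, W // 2, 2, C).mean(axis=(2, 4))

rng = np.random.default_rng(0)
video = rng.standard_normal((16, 64, 64, 3))  # assumed clip dimensions
pooled = phi(video)
print(pooled.shape)  # (16, 32, 32, 3)
```

Halving the spatial resolution keeps $\mathcal{D}_T$'s 3-D convolutions affordable while leaving fine per-frame detail to $\mathcal{D}_S$, which sees full-resolution frames.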

Papers Using This Method

Adversarial Video Generation on Complex Datasets (2019-07-15)