The Alignment Problem in Video Generation

Autoregressive video models are trained with teacher forcing: at each step, the model receives the ground-truth previous frame as input. At inference, it receives its own generated frame. This training-inference mismatch (exposure bias) causes error accumulation: small mistakes compound over frames, leading to drift, artifacts, and temporal incoherence.
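The compounding effect is easy to see numerically. Below is a toy illustration (not a real video model): a one-step "model" with a small systematic error of +0.1 per step. Under teacher forcing the error stays bounded at 0.1; in free-running inference it grows with rollout length.

```python
def step(x):
    # Toy "model": learned transition with a small systematic error (+0.1).
    return x + 1 + 0.1

truth = list(range(11))  # ground-truth sequence: 0, 1, ..., 10

# Teacher forcing: each prediction conditions on the ground-truth previous
# value, so the error never exceeds the single-step error (~0.1).
tf_errors = [abs(step(truth[t]) - truth[t + 1]) for t in range(10)]

# Free-running (inference): each prediction conditions on the model's own
# previous output, so the 0.1 per-step error accumulates over the rollout.
x = truth[0]
fr_errors = []
for t in range(10):
    x = step(x)
    fr_errors.append(abs(x - truth[t + 1]))
```

After ten steps the free-running error is roughly ten times the teacher-forced error, which is exactly the drift that SCST and DPO try to correct.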

Two main approaches address this: Self-Critical Sequence Training (SCST), a reinforcement learning method, and Direct Preference Optimization (DPO), an alignment technique from language models. This guide compares both in the context of video generation.

Method Comparison

SCST (Self-Critical)

  • Origin: Image captioning (Rennie et al., 2017)
  • Mechanism: Policy gradient with self-baseline
  • Reward: Metric-based (FVD, CLIP, temporal coherence)
  • Training signal: Reward of sample minus reward of greedy baseline
  • Data needs: Only unpaired video clips
  • Compute: 2× forward passes per step (sample + greedy)

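The training signal above is the standard SCST policy-gradient loss (Rennie et al., 2017), with the reward of the greedy decode acting as the baseline:

```latex
\mathcal{L}_{\text{SCST}}(\theta) = -\big(r(y^{s}) - r(\hat{y})\big)\,\log p_{\theta}(y^{s} \mid x)
```

where \(y^{s}\) is a video sampled from the model, \(\hat{y}\) is the greedy decode for the same prompt \(x\), and \(r(\cdot)\) is the metric-based reward. Samples that beat the model's own greedy output get a positive advantage and are reinforced; worse samples are suppressed.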
DPO (Direct Preference)

  • Origin: LLM alignment (Rafailov et al., 2023)
  • Mechanism: Implicit reward via preference pairs
  • Reward: Human preference (no explicit metric needed)
  • Training signal: Log-probability ratio of preferred vs rejected
  • Data needs: Paired preference data (chosen vs rejected)
  • Compute: 2× forward passes per step (chosen + rejected)

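The log-probability-ratio signal above is the standard DPO objective (Rafailov et al., 2023):

```latex
\mathcal{L}_{\text{DPO}}(\theta) = -\log \sigma\!\left(\beta \left[ \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right]\right)
```

where \(y_w\) and \(y_l\) are the chosen and rejected videos for prompt \(x\), \(\pi_{\text{ref}}\) is a frozen reference model, \(\sigma\) is the sigmoid, and \(\beta\) controls how far the policy may drift from the reference.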
Benchmark Results

Metric                       Base Model    + SCST    + DPO
FVD (↓)                      285           212       228
CLIPSIM (↑)                  0.282         0.298     0.315
Temporal Coherence (↑)       0.71          0.89      0.84
Human Preference (↑)         32%           58%       72%
Training time (GPU-hours)    N/A           480       520

When to Use Which

  • Prefer SCST when no preference data is available: it needs only unpaired video clips and an automatic reward metric (FVD, CLIPSIM, temporal coherence).
  • Prefer DPO when human preference data exists or can be collected, or when automatic metrics correlate poorly with perceived quality.
  • The benchmark above reflects this split: SCST leads on metric-aligned scores (FVD, temporal coherence), while DPO leads on CLIPSIM and human preference.

Implementation Sketch

# SCST training loop (simplified)
for batch in dataloader:
    optimizer.zero_grad()
    # Sample from the model policy
    sampled_video = model.generate(batch.prompt, temperature=0.8)
    # Greedy decode for the baseline (no gradients needed)
    with torch.no_grad():
        baseline_video = model.generate(batch.prompt, temperature=0.0)
    # Rewards: FVD is lower-is-better, so negate it to get a reward
    r_sample = -compute_fvd(sampled_video, batch.target)
    r_baseline = -compute_fvd(baseline_video, batch.target)
    # SCST gradient: reward advantage × log-prob of the sampled sequence
    advantage = r_sample - r_baseline
    loss = -(advantage * model.log_prob(sampled_video)).mean()
    loss.backward()
    optimizer.step()
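For comparison, a DPO step involves no sampling or reward metric; each update is a supervised loss on one preference pair. A minimal sketch of the per-pair loss on scalar sequence log-probabilities (function and argument names are illustrative, not from any specific library):

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin measures how much more the policy prefers the chosen
    video over the rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2 ≈ 0.693, and it decreases toward zero as the policy separates the chosen video from the rejected one; in a real training loop the scalar log-probabilities would come from summing per-frame log-probs under the policy and reference models.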

FAQ

Can SCST and DPO be combined?

Yes. A common approach is to use SCST for initial alignment on metric-based rewards, then fine-tune with DPO on human preferences. This combines the efficiency of automated rewards with the quality of human judgment.

How much preference data does DPO need?

For video, 5,000–10,000 preference pairs are typically sufficient. Each pair consists of two generated videos from the same prompt, with one marked as "better" by human annotators.
