The Alignment Problem in Video Generation
Autoregressive video models are trained with teacher forcing: at each step, the model receives the ground-truth previous frame as input. At inference, it must instead consume its own generated frame. This training-inference mismatch, known as exposure bias, causes error accumulation: small mistakes compound over frames, leading to drift, artifacts, and temporal incoherence.
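The mismatch can be illustrated with a toy one-dimensional example (the scalar "frames" and the 5% per-step bias below are purely illustrative, not from any real model): under teacher forcing the error stays bounded, while a free-running rollout compounds it.

```python
# Toy illustration of exposure bias. A "model" with a small systematic
# error stays accurate under teacher forcing but drifts when it consumes
# its own outputs, as it must at inference time.

def model_step(prev_frame: float) -> float:
    # Hypothetical one-step predictor that over-brightens by 5% per step.
    return prev_frame * 1.05

ground_truth = [1.0] * 10  # a static scene: every frame should stay at 1.0

# Teacher forcing: each step sees the true previous frame.
tf_preds = [model_step(gt) for gt in ground_truth]

# Free-running (inference): each step sees the model's own output.
fr_preds, prev = [], ground_truth[0]
for _ in ground_truth:
    prev = model_step(prev)
    fr_preds.append(prev)

print(tf_preds[-1])  # 1.05  -- error stays bounded at one step's worth
print(fr_preds[-1])  # ~1.63 -- error compounds over 10 steps
```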
Two main approaches address this: Self-Critical Sequence Training (SCST), a reinforcement learning method, and Direct Preference Optimization (DPO), an alignment technique from language models. This guide compares both in the context of video generation.
Method Comparison
SCST (Self-Critical)
- Origin: Image captioning (Rennie et al., 2017)
- Mechanism: Policy gradient with self-baseline
- Reward: Metric-based (FVD, CLIP, temporal coherence)
- Training signal: Reward of sample minus reward of greedy baseline
- Data needs: Only unpaired video clips
- Compute: 2× forward passes per step (sample + greedy)
DPO (Direct Preference)
- Origin: LLM alignment (Rafailov et al., 2023)
- Mechanism: Implicit reward via preference pairs
- Reward: Human preference (no explicit metric needed)
- Training signal: Log-probability ratio of preferred vs rejected
- Data needs: Paired preference data (chosen vs rejected)
- Compute: 2× forward passes per step (chosen + rejected)
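The DPO training signal described above can be written out directly. This is a minimal sketch of the standard DPO objective on scalar sequence log-probabilities; the beta value of 0.1 is a common default, not taken from this guide.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # Implicit reward = beta * (policy log-prob minus reference log-prob).
    # The loss is -log sigmoid of the chosen-vs-rejected reward margin.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy favors the chosen video more than the reference model
# does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
print(loss)  # margin = +2, so loss < log(2) ~= 0.693
```

Note that no explicit reward model appears anywhere: the preference pair and the frozen reference policy carry the entire training signal.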
Benchmark Results
| Metric | Base Model | + SCST | + DPO |
|---|---|---|---|
| FVD (↓) | 285 | 212 | 228 |
| CLIPSIM (↑) | 0.282 | 0.298 | 0.315 |
| Temporal Coherence (↑) | 0.71 | 0.89 | 0.84 |
| Human Preference (↑) | 32% | 58% | 72% |
| Training time (GPU-hours) | n/a | 480 | 520 |
When to Use Which
- Use SCST when: you have a well-defined quality metric (FVD, temporal coherence), you lack preference data, or you need to optimize for a specific measurable objective.
- Use DPO when: you care about overall human perception, you have access to preference annotations, or your quality criteria are hard to formalize into a single metric.
- Use Self-Forcing + SCST (our recommended approach) when temporal consistency is the primary concern. Self-Forcing directly addresses exposure bias by training with model-generated inputs.
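The Self-Forcing rollout mentioned above can be sketched in a few lines. The `model_step` interface here is hypothetical; a real implementation would operate on frame tensors and carry gradients through the loop so that training sees inference-like inputs.

```python
# Minimal Self-Forcing rollout sketch. Unlike teacher forcing, each step
# consumes the model's OWN previous prediction, so the training input
# distribution matches the inference-time distribution.

def self_forcing_rollout(first_frame, n_steps, model_step):
    frames, prev = [], first_frame
    for _ in range(n_steps):
        prev = model_step(prev)  # feed back the model's own output
        frames.append(prev)
    return frames

# Toy usage: a "model" that halves its input at each step.
rollout = self_forcing_rollout(1.0, 3, lambda x: x * 0.5)
print(rollout)  # [0.5, 0.25, 0.125]
```

The training loss (e.g. the SCST objective below) is then applied to these self-generated frames rather than to teacher-forced predictions.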
Implementation Sketch
```python
# SCST training loop (simplified)
for batch in dataloader:
    optimizer.zero_grad()

    # Sample from the model policy
    sampled_video = model.generate(batch.prompt, temperature=0.8)

    # Greedy decode for the baseline (no gradient needed)
    with torch.no_grad():
        baseline_video = model.generate(batch.prompt, temperature=0.0)

    # Rewards: FVD is lower-is-better, so negate it to get a reward
    r_sample = -compute_fvd(sampled_video, batch.target)
    r_baseline = -compute_fvd(baseline_video, batch.target)

    # SCST gradient: reward advantage x log-prob of the sampled video
    advantage = r_sample - r_baseline
    loss = -advantage * model.log_prob(sampled_video)
    loss.backward()
    optimizer.step()
```
FAQ
Can SCST and DPO be combined?
Yes. A common approach is to use SCST for initial alignment on metric-based rewards, then fine-tune with DPO on human preferences. This combines the efficiency of automated rewards with the quality of human judgment.
How much preference data does DPO need?
For video, 5,000–10,000 preference pairs are typically sufficient. Each pair consists of two generated videos from the same prompt, with one marked as "better" by human annotators.
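A single preference record for this setup might look like the following (the field names are hypothetical, chosen only to illustrate the chosen/rejected pairing from a shared prompt):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One DPO training example: two generations from the same prompt."""
    prompt: str
    chosen_video: str    # path or ID of the generation marked "better"
    rejected_video: str  # path or ID of the other generation
    annotator_id: str    # who made the judgment

pair = PreferencePair(
    prompt="a cat jumping over a fence",
    chosen_video="gen_001_a.mp4",
    rejected_video="gen_001_b.mp4",
    annotator_id="rater_17",
)
print(pair.chosen_video)  # gen_001_a.mp4
```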