Closing the Teacher-Forcing Gap

Self-forcing techniques train sequence models on their own outputs, eliminating the mismatch between training and inference that causes error accumulation.

Neural network training and self-improvement
~8.7% relative WER gain
20% Mandarin error reduction
Open benchmarks & code

Core Concepts

📉

Teacher-Forcing Gap

The fundamental problem: models are trained on ground truth but tested on their own predictions. Self-forcing bridges this gap.

🎯

SCST / REINFORCE

Self-Critical Sequence Training optimizes non-differentiable metrics (WER, CIDEr) by using the model's own greedy output as baseline.

🔁

ASR⇄TTS Cycles

Iterative refinement loops: TTS generates speech → ASR transcribes → errors update both models. No extra labeled data needed.

🧪

Semi-Supervised

Leverage unlabeled data through self-training and pseudo-labeling. The model distills its own knowledge into better predictions.

🤖

LLM Self-Play

Same principles power DPO, RLHF, and self-play fine-tuning in large language models. Train on model outputs to improve alignment.

📊

Benchmarks

Reproducible benchmarks on LibriSpeech, AISHELL, and image captioning (CIDEr). Code and results on GitHub.

Self-Forcing: A Technical Deep Dive

The Problem: Exposure Bias

Autoregressive models—including transformers for text, RNNs for speech, and decoders for image captioning—are trained with teacher forcing: at each step, the model receives the ground-truth previous token as input. But at inference, it receives its own previous prediction. If the model makes a small error at step t, that error propagates to step t+1, and errors compound. This is called exposure bias, and it is a fundamental limitation of standard cross-entropy training.
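A toy calculation makes the compounding concrete. Under the simplifying assumption (ours, for illustration) that each decoding step errs independently with probability eps, the chance of an entirely correct length-T sequence decays geometrically with T:

```python
# Toy compounding-error model: assumes each decoding step errs independently
# with probability eps, so P(whole length-T sequence correct) = (1 - eps)^T.
# Real exposure bias is worse than this: an early mistake also shifts the
# inputs that every later step conditions on.
def p_correct_sequence(eps: float, T: int) -> float:
    return (1.0 - eps) ** T

for T in (10, 100, 1000):
    print(f"T={T:4d}  P(correct) = {p_correct_sequence(0.01, T):.4f}")
```

Even a 1% per-step error rate leaves long sequences almost certain to contain mistakes, which is why training signals computed on the model's own rollouts matter.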

Solution 1: Self-Critical Sequence Training (SCST)

SCST, introduced by Rennie et al. (2017), applies REINFORCE with a self-critical baseline. Instead of an external reward model, the model's own greedy-decoded output serves as the baseline. Sampled sequences that score higher than the baseline get reinforced; those that score lower get penalized. This is simple to implement, requires no additional models, and directly optimizes task metrics (WER for ASR, CIDEr for captioning).
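The self-critical objective for a single example fits in a few lines. This is a minimal scalar sketch (the function and argument names are ours; a real implementation would work on batched decoder log-probabilities and back-propagate through them):

```python
import numpy as np

def scst_loss(sampled_logprobs, reward_sampled, reward_greedy):
    """REINFORCE loss with a self-critical baseline:
        L = -(r(w_sampled) - r(w_greedy)) * sum_t log p(w_sampled_t)
    Minimizing L pushes up the log-probability of sampled sequences that
    beat the greedy baseline (positive advantage) and pushes it down for
    sequences that score worse."""
    advantage = reward_sampled - reward_greedy
    return -advantage * float(np.sum(sampled_logprobs))

# A sampled caption scores CIDEr 0.8 vs. the greedy baseline's 0.5:
loss = scst_loss(np.array([-0.1, -0.2, -0.3]),
                 reward_sampled=0.8, reward_greedy=0.5)
```

Note that no value network or learned baseline is trained; the greedy decode of the same model is the baseline, which is what makes the method "self-critical."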

Solution 2: ASR⇄TTS Self-Refining Cycles

For speech tasks, a powerful technique is the ASR⇄TTS cycle: a TTS model generates synthetic speech from unpaired text, an ASR model transcribes it back, and the reconstruction error updates both models. This creates a self-supervised loop that improves both systems simultaneously. On Mandarin ASR benchmarks, this approach cuts character error rates by up to 20% without additional labeled data.
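The loop structure can be sketched with toy stand-ins. Everything below is hypothetical scaffolding (real systems use neural TTS/ASR models and back-propagate a differentiable reconstruction loss into both); only the shape of the cycle matches the description above:

```python
class ToyTTS:
    """Stand-in 'TTS': encodes text as a list of character codes ('audio')."""
    def synthesize(self, text):
        return [ord(c) for c in text]

class ToyASR:
    """Stand-in 'ASR': decodes the codes back to text, optionally corrupting
    every k-th character to mimic recognition errors."""
    def __init__(self, error_every=0):
        self.error_every = error_every
    def transcribe(self, audio):
        chars = [chr(x) for x in audio]
        if self.error_every:
            corrupted = chars[::self.error_every]
            chars[::self.error_every] = ["?"] * len(corrupted)
        return "".join(chars)

def reconstruction_error(hyp, ref):
    """Character-level mismatch rate between hypothesis and reference."""
    return sum(a != b for a, b in zip(hyp, ref)) / max(len(ref), 1)

def cycle_step(tts, asr, text):
    # text -> synthetic speech -> transcript; in a real self-refining loop
    # this error signal would produce gradient updates for both models.
    return reconstruction_error(asr.transcribe(tts.synthesize(text)), text)

print(cycle_step(ToyTTS(), ToyASR(), "hello"))               # perfect cycle
print(cycle_step(ToyTTS(), ToyASR(error_every=2), "hello"))  # noisy ASR
```

The key property the toy preserves: the loop needs only unpaired text, because the reference transcript is the input text itself.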

Connection to Modern LLM Training

The principles behind self-forcing are now central to LLM alignment. Direct Preference Optimization (DPO) trains on pairs of model outputs—fundamentally the same idea as SCST but with pairwise comparisons instead of reward scores. Self-play fine-tuning generates synthetic training data from the model itself. These techniques all share the core insight: training on the model's own distribution closes the teacher-forcing gap.
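The pairwise form is visible in the DPO loss itself. A minimal scalar sketch (argument names are ours; real implementations compute the sequence log-probabilities from the policy and a frozen reference model over batches):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    Minimizing it widens the policy's preference margin for the chosen
    output relative to the frozen reference model, with beta controlling
    how far the policy may drift from the reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

As with SCST, both sequences being compared come from the model's own distribution, not from ground-truth targets.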

Visualizing Training Dynamics

Researchers studying self-forcing often need to visualize training curves, error distributions, and attention patterns. For creating publication-ready figures for ICML, NeurIPS, or ACL papers, SciDraw offers AI-powered scientific diagram generation that can produce architecture diagrams, training curve plots, and comparison charts directly from data.

Frequently Asked Questions

What is the teacher-forcing gap?

Models are trained on ground truth but tested on their own predictions. Self-forcing closes this gap by training on the model's own outputs.

What is SCST?

Self-Critical Sequence Training uses the model's own greedy output as a REINFORCE baseline. No external reward model needed.

How does the ASR⇄TTS cycle work?

TTS generates speech → ASR transcribes → errors update both models. Reduces Mandarin character error rate by up to 20%.

Does self-forcing help with LLMs?

Yes. DPO and self-play fine-tuning are modern applications of the same core principle.

Where can I find benchmarks?

GitHub repository with reproducible benchmarks on LibriSpeech, AISHELL, and image-captioning (CIDEr) tasks.

About Self-Forcing Tech

Self-Forcing Tech provides research summaries, benchmark code, and practical guides for training techniques that close the teacher-forcing gap. We cover SCST, scheduled sampling, ASR⇄TTS cycles, and their connections to modern LLM alignment methods.