Self-forcing techniques train sequence models on their own outputs, eliminating the mismatch between training and inference that causes error accumulation.
The fundamental problem: models are trained on ground truth but tested on their own predictions. Self-forcing bridges this gap.
Self-Critical Sequence Training optimizes non-differentiable metrics (WER, CIDEr) by using the model's own greedy output as the baseline.
Iterative refinement loops: TTS generates speech → ASR transcribes → errors update both models. No extra labeled data needed.
Leverage unlabeled data through self-training and pseudo-labeling. The model distills its own knowledge into better predictions.
Same principles power DPO, RLHF, and self-play fine-tuning in large language models. Train on model outputs to improve alignment.
Reproducible benchmarks on LibriSpeech, AISHELL, and CIDEr-scored captioning. Code and results on GitHub.
Autoregressive models—including transformers for text, RNNs for speech, and decoders for image captioning—are trained with teacher forcing: at each step, the model receives the ground-truth previous token as input. But at inference, it receives its own previous prediction. If the model makes a small error at step t, that error propagates to step t+1, and errors compound. This is called exposure bias, and it is a fundamental limitation of standard cross-entropy training.
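The compounding effect is easy to see with a toy next-value predictor. In this sketch (illustrative only, not any particular model), the "model" has a small systematic bias of +0.1 per step; under teacher forcing the error stays constant, while under free-running decoding it accumulates with sequence length:

```python
# Toy illustration of exposure bias: a "model" that predicts the next
# value of the sequence 0, 1, 2, ... but with a small learned bias.
def predict_next(prev: float) -> float:
    return prev + 1.0 + 0.1  # learned step is slightly off (+0.1 bias)

truth = [float(t) for t in range(10)]

# Teacher forcing: every input is the ground-truth previous value,
# so each prediction is off by only the per-step bias.
tf_preds = [predict_next(truth[t - 1]) for t in range(1, len(truth))]

# Free running (inference): each input is the model's own previous
# prediction, so the per-step bias compounds.
fr_preds = [truth[0]]
for _ in range(1, len(truth)):
    fr_preds.append(predict_next(fr_preds[-1]))

tf_err = abs(tf_preds[-1] - truth[-1])  # stays ~0.1 regardless of length
fr_err = abs(fr_preds[-1] - truth[-1])  # grows to ~0.9 after nine steps
```

The gap between `tf_err` and `fr_err` is exactly the train/test mismatch that self-forcing methods attack: the model never sees its own (slightly wrong) outputs during teacher-forced training.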
SCST, introduced by Rennie et al. (2017), applies REINFORCE with a self-critical baseline. Instead of an external reward model, the model's own greedy-decoded output serves as the baseline. Sampled sequences that score higher than the baseline get reinforced; those that score lower get penalized. This is simple to implement, requires no additional models, and directly optimizes task metrics (WER for ASR, CIDEr for captioning).
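The per-sequence loss is a one-liner. The sketch below assumes the sequence log-probability and the metric scores (e.g. CIDEr or negative WER) are computed elsewhere; only the self-critical advantage logic is shown:

```python
def scst_loss(sample_logprob: float, sample_reward: float,
              greedy_reward: float) -> float:
    """Self-critical REINFORCE loss for one sampled sequence.

    The model's own greedy-decoded output supplies the baseline, so no
    learned value function or external reward model is needed.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the log-prob of samples that beat the
    # greedy baseline and lowers it for samples that fall short.
    return -advantage * sample_logprob

# A sample that scores 0.8 against a greedy baseline of 0.5 gets
# reinforced; one that scores 0.2 gets penalized.
reinforced = scst_loss(-2.0, sample_reward=0.8, greedy_reward=0.5)
penalized = scst_loss(-2.0, sample_reward=0.2, greedy_reward=0.5)
```

In practice `sample_logprob` is the sum of per-token log-probabilities of the sampled sequence, and the gradient flows only through it, not through the rewards.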
For speech tasks, a powerful technique is the ASR⇄TTS cycle: a TTS model generates synthetic speech from unpaired text, an ASR model transcribes it back, and the reconstruction error updates both models. This creates a self-supervised loop that improves both systems simultaneously. On Mandarin ASR benchmarks, this approach cuts character error rates by up to 20% without additional labeled data.
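A minimal sketch of the cycle's training signal, with hypothetical `toy_tts` and `toy_asr` stand-ins for the real models (neither function is from any actual system). The character error rate of the round trip is the reconstruction error that would drive updates to both sides:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                         # deletion
                       d[j - 1] + 1,                     # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[n] / max(m, 1)

# Hypothetical stand-ins for the two models in the cycle.
def toy_tts(text):    # text -> "speech" (here just a symbol list)
    return [c.upper() for c in text]

def toy_asr(speech):  # "speech" -> text, with one systematic confusion
    return "".join("n" if s == "M" else s.lower() for s in speech)

text = "mama"                              # unpaired text, no audio label
reconstruction = toy_asr(toy_tts(text))    # round trip through both models
cycle_loss = cer(text, reconstruction)     # reconstruction error drives training
```

The appeal is that `text` needs no paired audio: any unpaired text corpus yields a reconstruction loss, which is why the cycle improves ASR without additional labeled data.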
The principles behind self-forcing are now central to LLM alignment. Direct Preference Optimization (DPO) trains on pairs of model outputs—fundamentally the same idea as SCST but with pairwise comparisons instead of reward scores. Self-play fine-tuning generates synthetic training data from the model itself. These techniques all share the core insight: training on the model's own distribution closes the teacher-forcing gap.
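The DPO objective makes the parallel to SCST concrete. This sketch assumes per-sequence log-probabilities for the chosen and rejected outputs under both the policy and a frozen reference model; the loss is the negative log-sigmoid of a scaled preference margin:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of model outputs.

    As in SCST, the training signal comes from comparing the model's own
    outputs -- here a pairwise preference margin instead of a metric score.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy matches the reference, the margin is zero and the
# loss sits at log(2); preferring the chosen output lowers it.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Where SCST uses advantage = reward(sample) - reward(greedy), DPO uses the log-probability margin between two of the model's own outputs, with the reference model playing the role of the baseline.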
Models are trained on ground truth but tested on their own predictions. Self-forcing closes this gap by training on the model's own outputs.
Self-Critical Sequence Training uses the model's own greedy output as a REINFORCE baseline. No external reward model needed.
TTS generates speech → ASR transcribes → errors update both models. Reduces Mandarin character error rate by up to 20%.
Yes. DPO and self-play fine-tuning are modern applications of the same core principle.
GitHub repository with reproducible benchmarks on LibriSpeech, AISHELL, and CIDEr-scored captioning tasks.
Self-Forcing Tech provides research summaries, benchmark code, and practical guides for training techniques that close the teacher-forcing gap. We cover SCST, scheduled sampling, ASR⇄TTS cycles, and their connections to modern LLM alignment methods.