Self-forcing techniques train sequence models on their own outputs, eliminating the mismatch between training and inference that causes error accumulation.
The fundamental problem: models are trained on ground truth but tested on their own predictions. Self-forcing bridges this gap.
Self-Critical Sequence Training optimizes non-differentiable metrics (WER, CIDEr) by using the model's own greedy output as the baseline.
Iterative refinement loops: TTS generates speech → ASR transcribes → errors update both models. No extra labeled data needed.
Leverage unlabeled data through self-training and pseudo-labeling. The model distills its own knowledge into better predictions.
Same principles power DPO, RLHF, and self-play fine-tuning in large language models. Train on model outputs to improve alignment.
Reproducible benchmarks on LibriSpeech, AISHELL, and CIDEr-scored captioning. Code and results on GitHub.
Autoregressive models—including transformers for text, RNNs for speech, and decoders for image captioning—are trained with teacher forcing: at each step, the model receives the ground-truth previous token as input. But at inference, it receives its own previous prediction. If the model makes a small error at step t, that error propagates to step t+1, and errors compound. This is called exposure bias, and it is a fundamental limitation of standard cross-entropy training.
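The compounding effect is easy to see with a toy next-value predictor. In this sketch (illustrative only, not any particular model), the "model" has a small systematic bias of +0.1 per step; under teacher forcing the error stays constant, while under free-running decoding it accumulates with sequence length:

```python
# Toy illustration of exposure bias: a "model" that predicts the next
# value of the sequence 0, 1, 2, ... but with a small learned bias.
def predict_next(prev: float) -> float:
    return prev + 1.0 + 0.1  # learned step is slightly off (+0.1 bias)

truth = [float(t) for t in range(10)]

# Teacher forcing: every input is the ground-truth previous value,
# so each prediction is off by only the per-step bias.
tf_preds = [predict_next(truth[t - 1]) for t in range(1, len(truth))]

# Free running (inference): each input is the model's own previous
# prediction, so the per-step bias compounds.
fr_preds = [truth[0]]
for _ in range(1, len(truth)):
    fr_preds.append(predict_next(fr_preds[-1]))

tf_err = abs(tf_preds[-1] - truth[-1])  # stays ~0.1 regardless of length
fr_err = abs(fr_preds[-1] - truth[-1])  # grows to ~0.9 after nine steps
```

The gap between `tf_err` and `fr_err` is exactly the train/test mismatch that self-forcing methods attack: the model never sees its own (slightly wrong) outputs during teacher-forced training.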
SCST, introduced by Rennie et al. (2017), applies REINFORCE with a self-critical baseline. Instead of an external reward model, the model's own greedy-decoded output serves as the baseline. Sampled sequences that score higher than the baseline get reinforced; those that score lower get penalized. This is simple to implement, requires no additional models, and directly optimizes task metrics (WER for ASR, CIDEr for captioning).
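The per-sequence loss is a one-liner. The sketch below assumes the sequence log-probability and the metric scores (e.g. CIDEr or negative WER) are computed elsewhere; only the self-critical advantage logic is shown:

```python
def scst_loss(sample_logprob: float, sample_reward: float,
              greedy_reward: float) -> float:
    """Self-critical REINFORCE loss for one sampled sequence.

    The model's own greedy-decoded output supplies the baseline, so no
    learned value function or external reward model is needed.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the log-prob of samples that beat the
    # greedy baseline and lowers it for samples that fall short.
    return -advantage * sample_logprob

# A sample that scores 0.8 against a greedy baseline of 0.5 gets
# reinforced; one that scores 0.2 gets penalized.
reinforced = scst_loss(-2.0, sample_reward=0.8, greedy_reward=0.5)
penalized = scst_loss(-2.0, sample_reward=0.2, greedy_reward=0.5)
```

In practice `sample_logprob` is the sum of per-token log-probabilities of the sampled sequence, and the gradient flows only through it, not through the rewards.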
For speech tasks, a powerful technique is the ASR⇄TTS cycle: a TTS model generates synthetic speech from unpaired text, an ASR model transcribes it back, and the reconstruction error updates both models. This creates a self-supervised loop that improves both systems simultaneously. On Mandarin ASR benchmarks, this approach cuts character error rates by up to 20% without additional labeled data.
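A minimal sketch of the cycle's training signal, with hypothetical `toy_tts` and `toy_asr` stand-ins for the real models (neither function is from any actual system). The character error rate of the round trip is the reconstruction error that would drive updates to both sides:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                         # deletion
                       d[j - 1] + 1,                     # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[n] / max(m, 1)

# Hypothetical stand-ins for the two models in the cycle.
def toy_tts(text):    # text -> "speech" (here just a symbol list)
    return [c.upper() for c in text]

def toy_asr(speech):  # "speech" -> text, with one systematic confusion
    return "".join("n" if s == "M" else s.lower() for s in speech)

text = "mama"                              # unpaired text, no audio label
reconstruction = toy_asr(toy_tts(text))    # round trip through both models
cycle_loss = cer(text, reconstruction)     # reconstruction error drives training
```

The appeal is that `text` needs no paired audio: any unpaired text corpus yields a reconstruction loss, which is why the cycle improves ASR without additional labeled data.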
The principles behind self-forcing are now central to LLM alignment. Direct Preference Optimization (DPO) trains on pairs of model outputs—fundamentally the same idea as SCST but with pairwise comparisons instead of reward scores. Self-play fine-tuning generates synthetic training data from the model itself. These techniques all share the core insight: training on the model's own distribution closes the teacher-forcing gap.
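The DPO objective makes the parallel to SCST concrete. This sketch assumes per-sequence log-probabilities for the chosen and rejected outputs under both the policy and a frozen reference model; the loss is the negative log-sigmoid of a scaled preference margin:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of model outputs.

    As in SCST, the training signal comes from comparing the model's own
    outputs -- here a pairwise preference margin instead of a metric score.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy matches the reference, the margin is zero and the
# loss sits at log(2); preferring the chosen output lowers it.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Where SCST uses advantage = reward(sample) - reward(greedy), DPO uses the log-probability margin between two of the model's own outputs, with the reference model playing the role of the baseline.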
Models are trained on ground truth but tested on their own predictions. Self-forcing closes this gap by training on the model's own outputs.
Self-Critical Sequence Training uses the model's own greedy output as a REINFORCE baseline. No external reward model needed.
TTS generates speech → ASR transcribes → errors update both models. Reduces Mandarin character error rate by up to 20%.
Yes. DPO and self-play fine-tuning are modern applications of the same core principle.
GitHub repository with reproducible benchmarks on LibriSpeech, AISHELL, and CIDEr-scored captioning tasks.
Self-Forcing Tech provides research summaries, benchmark code, and practical guides for training techniques that close the teacher-forcing gap. We cover SCST, scheduled sampling, ASR⇄TTS cycles, and their connections to modern LLM alignment methods.