A Unified Encoder-Decoder for Spoken Language Processing

SpeechT5 is a method paper that introduces a shared encoder-decoder pre-training framework for spoken language processing. Inspired by T5’s text-to-text paradigm, SpeechT5 reformulates all spoken language tasks as “speech/text to speech/text” problems. The framework uses modal-specific pre-nets and post-nets to interface between raw speech or text and a shared Transformer encoder-decoder, enabling a single pre-trained model to handle six downstream tasks: automatic speech recognition (ASR), text-to-speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).

Bridging the Gap Between Speech and Text Pre-Training

Prior speech pre-training work (wav2vec 2.0, HuBERT) suffered from two key limitations. First, these models learned speech representations from unlabeled audio alone, ignoring the complementary information in text data that is critical for cross-modal tasks like ASR and TTS. Second, they relied on encoder-only architectures with task-specific prediction heads, leaving the decoder un-pretrained for sequence-to-sequence generation tasks.

SpeechT5 addresses both gaps by (1) jointly pre-training on unlabeled speech and text data, and (2) using a full encoder-decoder architecture that benefits generation tasks directly. The approach builds on the observation that speech and text, despite their surface differences, share underlying semantic structure that a unified representation can capture.

Cross-Modal Vector Quantization for Alignment

The core innovation in SpeechT5 is a cross-modal vector quantization (VQ) mechanism that aligns speech and text representations into a shared semantic space. The architecture consists of three components:

Shared encoder-decoder backbone. A Transformer with 12 encoder blocks and 6 decoder blocks (768-dim, 12 heads), using relative position embeddings.

Modal-specific pre/post-nets. Six specialized networks handle the conversion between raw modalities and the shared representation space:

  • Speech-encoder pre-net: a convolutional feature extractor (from wav2vec 2.0) downsampling raw waveforms
  • Speech-decoder pre-net: three FC layers with ReLU, processing 80-dimensional log Mel-filterbank features
  • Speech-decoder post-net: a linear layer predicting Mel features plus five 1D conv layers (256 channels) for residual refinement, with an x-vector speaker embedding concatenated for multi-speaker support
  • Text pre/post-nets: shared embedding layers mapping between character-level token indices and hidden states (768-dim)
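The shape flow through one of these adapters can be made concrete. Below is a minimal numpy sketch of the speech-decoder pre-net (three fully connected layers with ReLU, mapping 80-dim log Mel frames into the 768-dim shared space); the intermediate widths of 256 and the random weights are illustrative placeholders, not the paper's values.

```python
import numpy as np

def speech_decoder_prenet(mel_frames, dims=(80, 256, 256, 768), seed=0):
    """Three FC layers with ReLU, mapping 80-dim log Mel-filterbank
    frames into the 768-dim shared hidden space. Hidden widths (256)
    and the random weights are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    x = mel_frames
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        W = rng.standard_normal((d_in, d_out)) * 0.02
        x = np.maximum(x @ W, 0.0)  # FC + ReLU
    return x

frames = np.zeros((50, 80))        # 50 mel frames
hidden = speech_decoder_prenet(frames)
print(hidden.shape)                # (50, 768)
```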

Cross-modal vector quantization. A shared codebook $\mathbf{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$ of $K$ learnable embeddings bridges the two modalities. Encoder outputs $\mathbf{u}_i$ are quantized via nearest-neighbor lookup:

$$ \mathbf{c}_i = \arg\min_{j \in [K]} \| \mathbf{u}_i - \mathbf{c}_j \|_2 $$

A proportion (10%) of contextual representations are randomly replaced with these quantized latent units before being fed to the decoder’s cross-attention. This mixing forces the quantizer to capture cross-modal features. A diversity loss encourages full codebook utilization:

$$ \mathcal{L}_d = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k $$
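The lookup, the 10% mixing, and the diversity loss can be sketched together. This is a toy numpy illustration with made-up sizes ($K=8$, 16-dim vectors); the paper computes $p_k$ from averaged softmax probabilities, whereas hard usage counts are used here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 8, 16, 100              # toy codebook size, dim, sequence length
codebook = rng.standard_normal((K, D))
u = rng.standard_normal((T, D))   # stand-in for encoder outputs

# Nearest-neighbor lookup: c_i = argmin_j ||u_i - c_j||_2
dists = np.linalg.norm(u[:, None, :] - codebook[None, :, :], axis=-1)
idx = dists.argmin(axis=1)
quantized = codebook[idx]

# Randomly replace 10% of contextual representations with quantized units
# before they reach the decoder's cross-attention
mask = rng.random(T) < 0.10
mixed = np.where(mask[:, None], quantized, u)

# Diversity loss L_d = (1/K) sum_k p_k log p_k
# (hard usage counts here; the paper uses averaged softmax probabilities)
p = np.bincount(idx, minlength=K) / T
L_d = (p[p > 0] * np.log(p[p > 0])).sum() / K
print(mixed.shape, L_d)
```

Minimizing $\mathcal{L}_d$ (a negative entropy) pushes the code-usage distribution toward uniform, i.e. full codebook utilization.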

Pre-Training Objectives

SpeechT5 combines three pre-training objectives:

Speech pre-training uses two tasks. A bidirectional masked prediction loss $\mathcal{L}_{mlm}^{s}$ follows HuBERT’s approach, sampling 8% of timesteps as span starts, masking 10-step spans, and predicting frame-level targets from an acoustic unit discovery model:

$$ \mathcal{L}_{mlm}^{s} = \sum_{n \in \mathcal{M}} \log p(\mathbf{z}_n \mid \hat{\mathbf{H}}, n) $$
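The masking scheme itself is simple to sketch. Below is a small, hedged implementation of HuBERT-style span masking as described above (8% of timesteps sampled as span starts, each start masking a 10-step span; spans may overlap, so total coverage exceeds 8%).

```python
import numpy as np

def sample_span_mask(T, start_prob=0.08, span_len=10, seed=0):
    """HuBERT-style masking: ~8% of timesteps are sampled as span
    starts, and each start masks a 10-step span (spans may overlap)."""
    rng = np.random.default_rng(seed)
    starts = np.flatnonzero(rng.random(T) < start_prob)
    mask = np.zeros(T, dtype=bool)
    for s in starts:
        mask[s:s + span_len] = True
    return mask

mask = sample_span_mask(500)
print(int(mask.sum()), "of 500 frames masked")
```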

A reconstruction loss $\mathcal{L}_{1}^{s}$ minimizes the $L_1$ distance between predicted and original Mel-filterbank features, plus a binary cross-entropy stop-token loss $\mathcal{L}_{bce}^{s}$.

Text pre-training uses BART-style denoising, masking 30% of text spans (Poisson $\lambda = 3.5$) and training with maximum likelihood estimation:

$$ \mathcal{L}_{mle}^{t} = \sum_{n=1}^{N^t} \log p(\mathbf{y}_n^t \mid \mathbf{y}_{< n}^t, \hat{\mathbf{X}}^t) $$
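A rough sketch of the corruption step is below: spans with Poisson($\lambda=3.5$) lengths are replaced by a single mask token until roughly 30% of the input is covered, in the style of BART's text infilling. The greedy budgeting heuristic and the `<mask>` symbol are assumptions for illustration, and word tokens are used for readability where SpeechT5 actually operates on characters.

```python
import numpy as np

def corrupt_spans(tokens, mask_ratio=0.30, lam=3.5, seed=0):
    """BART-style text infilling sketch: sample span lengths from
    Poisson(lambda=3.5) and replace each span with one <mask> token
    until ~30% of the input has been masked (greedy budget heuristic)."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # tokens left to mask
    out, i = [], 0
    while i < len(tokens):
        span = rng.poisson(lam)
        if budget > 0 and span > 0 and rng.random() < mask_ratio:
            out.append("<mask>")  # whole span collapses to one token
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out

text = "the quick brown fox jumps over the lazy dog".split()
corrupted = corrupt_spans(text)
print(corrupted)
```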

The full pre-training loss combines all components:

$$ \mathcal{L} = \mathcal{L}_{mlm}^{s} + \mathcal{L}_{1}^{s} + \mathcal{L}_{bce}^{s} + \mathcal{L}_{mle}^{t} + \gamma \mathcal{L}_d $$

where $\gamma = 0.1$.

Evaluation Across Six Spoken Language Tasks

SpeechT5 was evaluated on six downstream tasks, each using a different combination of the shared encoder-decoder and task-appropriate pre/post-nets:

Automatic Speech Recognition (ASR)

Fine-tuned on LibriSpeech 100h with joint CTC/attention decoding. The decoding objective maximizes a combination of decoder, CTC, and language model log-probabilities:

$$ \alpha \log P_{Dec} + (1 - \alpha) \log P_{CTC} + \beta \log P_{LM} $$

where $\alpha = 0.5$ and $\beta = 1.0$ for the 100h setting (beam size 30). Results on the test sets:

| Model | LM | test-clean | test-other |
|---|---|---|---|
| wav2vec 2.0 BASE | – | 6.1 | 13.3 |
| HuBERT BASE | – | 5.8 | 13.3 |
| SpeechT5 | – | 4.4 | 10.4 |
| wav2vec 2.0 BASE | Transf. | 2.6 | 6.3 |
| SpeechT5 | Transf. | 2.4 | 5.8 |
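The joint decoding rule above is just a weighted log-probability combination per hypothesis. A minimal sketch, with toy scores (not from the paper):

```python
def joint_score(logp_dec, logp_ctc, logp_lm, alpha=0.5, beta=1.0):
    """Joint CTC/attention decoding score for one hypothesis:
    alpha * log P_dec + (1 - alpha) * log P_ctc + beta * log P_lm."""
    return alpha * logp_dec + (1 - alpha) * logp_ctc + beta * logp_lm

# Rank two toy hypotheses (logp_dec, logp_ctc, logp_lm) by combined score
hyps = {"a": (-1.2, -1.5, -0.8), "b": (-1.0, -2.5, -1.1)}
best = max(hyps, key=lambda h: joint_score(*hyps[h]))
print(best)  # -> "a" (-2.15 vs -2.85)
```

In beam search the CTC term is typically computed with a prefix score over partial hypotheses; this sketch only shows the final combination.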

Text-to-Speech Synthesis (TTS)

Fine-tuned on LibriTTS 460h clean sets with HiFi-GAN vocoder:

| Model | Naturalness | MOS | CMOS |
|---|---|---|---|
| Ground Truth | – | 3.87 ± 0.04 | – |
| Baseline | 2.76 | 3.56 ± 0.05 | 0 |
| SpeechT5 | 2.91 | 3.65 ± 0.04 | +0.29 |

Speech Translation (ST)

Evaluated on MUST-C English-to-German and English-to-French:

| Model | EN-DE | EN-FR |
|---|---|---|
| Fairseq ST | 22.70 | 32.90 |
| Adapter Tuning | 24.63 | 34.98 |
| Baseline (HuBERT init) | 23.43 | 33.76 |
| SpeechT5 | 25.18 | 35.30 |

Voice Conversion (VC)

Evaluated on CMU Arctic:

| Model | WER (bdl→slt) | MCD (bdl→slt) |
|---|---|---|
| VTN w/ TTS | 7.6% | 6.33 |
| Many-to-many VTN | – | 6.13 |
| SpeechT5 | 7.8% | 5.93 |

Speech Enhancement (SE)

On the WHAM! dataset, SpeechT5 reduced WER from 76.1% (noisy input) to 8.9%, an 18% relative improvement over the baseline’s 10.9%.

Speaker Identification (SID)

On VoxCeleb1, SpeechT5 achieved 96.49% accuracy, outperforming HuBERT LARGE at 90.33% (from SUPERB) and SpeechNet multi-task at 87.90%.

Ablation Study and Key Findings

The ablation study reveals the contribution of each pre-training component:

| Model | ASR (clean) | ASR (other) | VC (MCD) | SID (ACC) |
|---|---|---|---|---|
| SpeechT5 | 4.4 | 10.7 | 5.93 | 96.49% |
| w/o Speech PT | – | – | 6.49 | 38.61% |
| w/o Text PT | 5.4 | 12.8 | 6.03 | 95.60% |
| w/o Joint PT | 4.6 | 11.3 | 6.18 | 95.54% |
| w/o $\mathcal{L}_{mlm}^{s}$ | 7.6 | 22.4 | 6.29 | 90.91% |

Key findings:

  1. Speech pre-training is critical: without it, ASR fails to converge entirely, and SID accuracy drops to 38.61%.
  2. Text pre-training complements speech: removing it degrades ASR by ~20% relative, confirming that textual knowledge transfers to speech tasks.
  3. Joint pre-training enables cross-modal transfer: the vector quantization approach is essential for modality-bridging tasks like ASR.
  4. The masked prediction loss $\mathcal{L}_{mlm}^{s}$ is the most important single component, responsible for learning strong acoustic features.

The authors note limitations in the current scope (English-only, BASE model size) and propose scaling to larger models and multilingual settings as future work.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Speech pre-training | LibriSpeech | 960 hours | Full training set |
| Text pre-training | LibriSpeech LM text | 400M sentences | Normalized language model text |
| ASR fine-tuning | LibriSpeech | 100h / 960h subsets | |
| TTS fine-tuning | LibriTTS | 460h | clean sets |
| ST fine-tuning | MUST-C | | EN-DE, EN-FR |
| VC fine-tuning | CMU Arctic | 4 speakers | bdl, clb, slt, rms |
| SE fine-tuning | WHAM! | 16 kHz max | enhance-single task |
| SID fine-tuning | VoxCeleb1 | 100k+ utterances | 1,251 speakers |

Algorithms

  • Optimizer: Adam with warmup (8% of steps) to peak LR $2 \times 10^{-4}$, then linear decay
  • Speech masking: 8% of timesteps, 10-step spans
  • Text masking: 30% of spans, Poisson $\lambda = 3.5$
  • Vector quantization: 2 codebooks × 100 entries = $10^4$ theoretical maximum codes
  • CTC/attention joint decoding for ASR (beam size 30)
  • HiFi-GAN vocoder for TTS and SE waveform generation
  • Parallel WaveGAN vocoder for VC

Fine-Tuning Hyperparameters

| Task | GPUs | Steps | Peak LR | Batch (per GPU) | Schedule |
|---|---|---|---|---|---|
| ASR (100h) | 8×V100 | 80k | 6e-5 | 256k audio samples | Warmup 10%, hold 40%, linear decay |
| ASR (960h) | 8×V100 | 320k | 1.3e-4 | 256k audio samples | Warmup 10%, hold 40%, linear decay |
| TTS | 8×V100 | 120k | 4e-4 | 45k tokens | Warmup 10k steps, inv. sqrt decay |
| ST | 8×V100 | 80k | – | – | Warmup 10k steps |
| VC | 8×V100 | 60k | 1e-4 | 20k tokens | 6k warmup, inv. sqrt decay |
| SE | 8×V100 | 100k | 1e-4 | 16k tokens | 10k warmup, inv. sqrt decay |
| SID | 8×V100 | 60k | 5e-4 | 64 segments (3s each) | Triangular cyclical (1e-8 to 5e-4) |
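The tri-stage schedule used for ASR fine-tuning (warmup 10%, hold 40%, linear decay) can be sketched as a function of the step count; the decay-to-zero floor is an assumption, as the final LR is not stated here.

```python
def asr_lr(step, total=80_000, peak=6e-5, warmup=0.10, hold=0.40):
    """Tri-stage LR schedule for ASR (100h) fine-tuning: linear warmup
    over the first 10% of steps, hold at the peak LR for the next 40%,
    then linear decay (assumed to zero) over the remaining 50%."""
    w, h = int(total * warmup), int(total * (warmup + hold))
    if step < w:
        return peak * step / w                       # warmup
    if step < h:
        return peak                                  # hold
    return peak * max(0.0, (total - step) / (total - h))  # decay

print(asr_lr(4_000), asr_lr(20_000), asr_lr(80_000))  # 3e-05 6e-05 0.0
```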

Models

  • Encoder: 12 Transformer blocks (768-dim, 3072 FFN, 12 heads)
  • Decoder: 6 Transformer blocks (same dimensions)
  • Speech-encoder pre-net: 7 conv blocks (512 channels, strides [5,2,2,2,2,2,2], kernels [10,3,3,3,3,2,2])
  • Code and pre-trained models available at github.com/microsoft/SpeechT5 (MIT license)
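The stride/kernel geometry of the speech-encoder pre-net fixes its frame rate, which can be checked with a few lines of stdlib Python: the total hop is the product of strides, and the receptive field follows from the usual backward recursion over the conv stack.

```python
from math import prod

# wav2vec 2.0-style feature extractor geometry (from the list above)
strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

hop = prod(strides)  # total downsampling factor
rf = 1
for k, s in reversed(list(zip(kernels, strides))):
    rf = (rf - 1) * s + k  # receptive field of one output frame

# 320-sample hop (20 ms at 16 kHz, i.e. 50 frames/sec),
# 400-sample receptive field (25 ms)
print(hop, rf, 16_000 // hop)
```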

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| microsoft/SpeechT5 | Code | MIT | Official Fairseq-based implementation |
| Pre-trained models (via repo) | Model | MIT | SpeechT5 BASE encoder-decoder checkpoints |
| LibriSpeech | Dataset | CC-BY-4.0 | 960h speech pre-training and ASR fine-tuning |
| LibriTTS | Dataset | CC-BY-4.0 | 460h TTS fine-tuning |
| MUST-C | Dataset | CC-BY-NC-ND-4.0 | Speech translation fine-tuning |
| CMU Arctic | Dataset | Free | Voice conversion fine-tuning |
| WHAM! | Dataset | CC-BY-NC-4.0 | Speech enhancement fine-tuning |
| VoxCeleb1 | Dataset | CC-BY-SA-4.0 | Speaker identification fine-tuning |

Hardware

  • Pre-training: 32 NVIDIA V100 GPUs
  • Batch: ~90s speech per GPU + 12k text tokens per GPU, gradient accumulation 2
  • Pre-training steps: 500k

Paper Information

Citation: Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., & Wei, F. (2022). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5723-5738.

@inproceedings{ao2022speecht5,
  title={SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
  author={Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={5723--5738},
  year={2022},
  doi={10.18653/v1/2022.acl-long.393}
}