This repository contains the code for the following published papers:
- Speechformer: Reducing Information Loss in Direct Speech Translation
- Dealing with training and test segmentation mismatch: FBK@IWSLT2021
This repository contains the code for the preprocessing, training and evaluation steps of the PlainConvattention and Speechformer architectures, as well as the pretrained models.
For further details, please refer to the paper: Speechformer: Reducing Information Loss in Direct Speech Translation.
Clone this repository and install it as explained in the original Fairseq(-py) README below. For the experiments we used MuST-C (en-de, en-es, en-nl); make sure to download the corpus.
Before starting the training, the data has to be preprocessed.
After downloading the MuST-C dataset into the MUSTC_ROOT directory, create your working directory DATA_ROOT and link there the data for the target language LANG to be preprocessed, with the command:
mkdir $DATA_ROOT
for t in train dev tst-COMMON; do
ln -s ${MUSTC_ROOT}/en-$LANG/data/$t/txt/$t.* $DATA_ROOT
done
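The same linking step can be sketched in Python, which may be handy when the shell glob is awkward to debug. This is only a minimal sketch: the function name `link_split_files` is hypothetical, and it assumes the MuST-C layout used in the command above (`en-<LANG>/data/<split>/txt/<split>.*`).

```python
from pathlib import Path


def link_split_files(mustc_root, data_root, lang,
                     splits=("train", "dev", "tst-COMMON")):
    """Symlink <mustc_root>/en-<lang>/data/<split>/txt/<split>.* into data_root.

    Hypothetical helper mirroring the shell loop above; returns the names
    of the entries that now exist in data_root.
    """
    data_root = Path(data_root)
    data_root.mkdir(parents=True, exist_ok=True)
    linked = []
    for split in splits:
        txt_dir = Path(mustc_root) / f"en-{lang}" / "data" / split / "txt"
        for src in sorted(txt_dir.glob(f"{split}.*")):
            dst = data_root / src.name
            # Skip entries that are already present (including stale links).
            if not (dst.is_symlink() or dst.exists()):
                dst.symlink_to(src.resolve())
            linked.append(dst.name)
    return linked
```

Calling `link_split_files(MUSTC_ROOT, DATA_ROOT, "de")` should leave DATA_ROOT with one link per file of each split, as the shell loop does.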
Once your DATA_ROOT is ready, run the following command to preprocess the data, where FAIRSEQ_DIR is the path to this Fairseq installation and MUSTC_SAVE_DIR is the path where you want to save the preprocessed files (it can be equal to DATA_ROOT):
python ${FAIRSEQ_DIR}/examples/speech_to_text/preprocess_generic.py \
--data-root ${DATA_ROOT} --wav-dir ${MUSTC_ROOT}/wav \
--save-dir ${MUSTC_SAVE_DIR} \
--task st --src-lang en --tgt-lang ${LANG} \
--splits train dev tst-COMMON \
--vocab-type unigram \
--vocab-size 8000 \
--src-normalize
⭐️Pay attention! ➜ To replicate the experiments of the Speechformer, the source vocabulary size has to be 5000. You have to run this script again, changing --vocab-size 8000 to --vocab-size 5000 and adding the option --no-filterbank-extraction to avoid recomputing the mel-filterbank features.
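The two preprocessing runs differ only in the vocabulary size and in the extra flag, which a small command builder makes explicit. This is a sketch under assumptions: `preprocess_command` is a hypothetical helper, the paths are placeholders, and only the options shown in the command above are used.

```python
def preprocess_command(fairseq_dir, data_root, mustc_root, save_dir, lang,
                       vocab_size, extract_filterbank=True):
    """Build the argument list for preprocess_generic.py (hypothetical helper)."""
    cmd = [
        "python",
        f"{fairseq_dir}/examples/speech_to_text/preprocess_generic.py",
        "--data-root", data_root,
        "--wav-dir", f"{mustc_root}/wav",
        "--save-dir", save_dir,
        "--task", "st", "--src-lang", "en", "--tgt-lang", lang,
        "--splits", "train", "dev", "tst-COMMON",
        "--vocab-type", "unigram",
        "--vocab-size", str(vocab_size),
        "--src-normalize",
    ]
    if not extract_filterbank:
        # Second run: reuse the mel-filterbank features already computed.
        cmd.append("--no-filterbank-extraction")
    return cmd


# Placeholder paths, for illustration only.
paths = ("/path/to/fairseq", "/path/to/data", "/path/to/mustc", "/path/to/save")
first_run = preprocess_command(*paths, lang="de", vocab_size=8000)
second_run = preprocess_command(*paths, lang="de", vocab_size=5000,
                                extract_filterbank=False)
```

Each list can then be passed to `subprocess.run(cmd, check=True)` or joined with spaces to obtain the shell command.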
The following scripts train both the PlainConvattention and Speechformer architectures.
⭐️Please note that the training phase of PlainConvattention (which corresponds to the encoder pretraining of the Speechformer) is mandatory to successfully train the Speechformer architecture.
To start the training of the PlainConvattention architecture, run the following command, where ST_SAVE_DIR is the directory in which you want to save the trained model and CONFIG_YAML_NAME is the name of the .yaml file:
fairseq-train ${MUSTC_SAVE_DIR} \
--train-subset train_st_src --valid-subset dev_st_src \
--save-dir ${ST_SAVE_DIR} \
--num-workers 8 --max-update 100000 \
--max-tokens 10000 \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML_NAME}.yaml \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 --best-checkpoint-metric loss \
--arch speechformer_m \
--ctc-encoder-layer 8 \
--compressed 4 --compress-kernel-size 8 --stride 1 \
--shared-layer-kv-compressed --shared-kv-compressed \
--CNN-first-layer \
--optimizer adam --lr 1e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--seed 1 --update-freq 16 \
--skip-invalid-size-inputs-valid-test
The script above is intended to be run on 2 V100 GPUs with 32GB of RAM. If you have more GPUs, divide the --update-freq parameter accordingly, e.g. with 4 GPUs use 8 as --update-freq. If your GPUs have less RAM, you can halve the --max-tokens value and double --update-freq.
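The adjustments above can be read as keeping the product of GPUs, --max-tokens and --update-freq constant. The sketch below encodes that reading; it is an assumption drawn from the two examples given here, the helper name `update_freq_for` is hypothetical, and since --max-tokens is only an upper bound on the batch size the equivalence is approximate.

```python
# Reference setup from the training command above.
REFERENCE_GPUS = 2
REFERENCE_MAX_TOKENS = 10000
REFERENCE_UPDATE_FREQ = 16


def update_freq_for(num_gpus, max_tokens=REFERENCE_MAX_TOKENS):
    """Return the --update-freq that keeps GPUs * max-tokens * update-freq fixed."""
    reference = REFERENCE_GPUS * REFERENCE_MAX_TOKENS * REFERENCE_UPDATE_FREQ
    per_step = num_gpus * max_tokens
    if reference % per_step != 0:
        raise ValueError(
            f"{num_gpus} GPUs with max-tokens {max_tokens} cannot match "
            f"the reference setup with an integer update-freq"
        )
    return reference // per_step
```

For instance, `update_freq_for(4)` gives 8 and `update_freq_for(2, max_tokens=5000)` gives 32, matching the two cases described above.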
To start the training of the Speechformer architecture, the first step is to select only the first part of the PlainConvattention encoder (up to the layer to which the CTC is applied) by running this command:
python ${FAIRSEQ_DIR}/examples/speech_to_text/strip_after_ctc.py \
--user-dir examples/speech_to_text \
--model-path ${CHECKPOINT_PATH} \
--new-model-path ${STRIPPED_CHECKPOINT_PATH}
where CHECKPOINT_PATH is the absolute path to your PlainConvattention checkpoint (.pt) and STRIPPED_CHECKPOINT_PATH is the absolute path to the newly generated checkpoint (.pt), which contains only the first part of the encoder. The --num-encoder-layers and --ctc-encoder-layer options also have to be specified if they differ from our default architecture (12 and 8, respectively).
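Conceptually, the stripping step keeps the encoder parameters up to the CTC layer and discards the rest. The toy sketch below only illustrates that idea and is not the logic of strip_after_ctc.py: the function name, the `encoder.layers.<index>.` naming scheme and the 0-based layer indices are all assumptions made for the example.

```python
import re


def keep_up_to_ctc_layer(param_names, ctc_encoder_layer=8):
    """Keep encoder parameters that come before or within the CTC layer.

    Assumes hypothetical names such as 'encoder.layers.7.weight' with
    0-based indices, so the first 8 layers have indices 0..7.
    """
    layer_pattern = re.compile(r"^encoder\.layers\.(\d+)\.")
    kept = []
    for name in param_names:
        if not name.startswith("encoder."):
            continue  # decoder and any other non-encoder parameters are dropped
        match = layer_pattern.match(name)
        # Non-layer encoder parameters (e.g. the input front-end) are kept.
        if match is None or int(match.group(1)) < ctc_encoder_layer:
            kept.append(name)
    return kept


names = [
    "encoder.conv.weight",
    "encoder.layers.0.weight",
    "encoder.layers.7.weight",
    "encoder.layers.8.weight",
    "encoder.layers.11.weight",
    "decoder.layers.0.weight",
]
stripped = keep_up_to_ctc_layer(names)
```

With the default of 8, layers 8 to 11 of a 12-layer encoder and the whole decoder are left out of the result.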
⭐️Please note that, to replicate our paper, the checkpoint used is the average of 7 checkpoints, as explained in the Generate section.
Then, to start the training, run the following command:
fairseq-train ${MUSTC_SAVE_DIR} \
--train-subset train_st_src --valid-subset dev_st_src \
--save-dir ${ST_SAVE_DIR} \
--num-workers 8 --max-update 100000 \
--max-tokens 10000 \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML_NAME}.yaml \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 --best-checkpoint-metric loss \
--arch speechformer_m \
--load-pretrained-encoder-from ${STRIPPED_CHECKPOINT_PATH} \
--allow-partial-encoder-loading \
--transformer-after-compression \
--ctc-encoder-layer 8 \
--ctc-compress-strategy avg \
--compressed 4 --compress-kernel-size 8 --stride 1 \
--shared-layer-kv-compressed --shared-kv-compressed \
--CNN-first-layer \
--optimizer adam --lr 1e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--seed 1 --update-freq 16 \
--skip-invalid-size-inputs-valid-test
You can also use the --patience parameter to stop the training early once the loss does not improve for a certain number of epochs (15 in our case).
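As an illustration of what patience-based early stopping means, the following generic sketch stops once the validation loss has failed to improve for `patience` consecutive epochs. It is not fairseq's implementation, and the function name and the loss values are made up for the example.

```python
def epoch_of_early_stop(valid_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses, one per epoch.
losses = [3.0, 2.5, 2.6, 2.7, 2.4, 2.5, 2.6, 2.7]
```

With these values, a patience of 3 stops at epoch 8, while a patience of 15 never triggers.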
For the generate phase, you first have to average 7 checkpoints, among which the middle one is the best checkpoint
on the validation set (according to the loss) obtained during training.
Run the following command, setting BEST_CKP to the number of your best checkpoint (the upper bound is the best checkpoint plus 3, so that the 7 averaged checkpoints are centered on it) and AVERAGE_CHECKPOINT_NAME to the name that you want to give to the averaged checkpoint:
python ${FAIRSEQ_DIR}/scripts/average_checkpoints.py \
--inputs ${ST_SAVE_DIR} \
--output "${ST_SAVE_DIR}/${AVERAGE_CHECKPOINT_NAME}.pt" \
--num-epoch-checkpoints 7 \
--checkpoint-upper-bound $((BEST_CKP + 3))
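The relation between the best checkpoint and the upper bound can be checked with a few lines of Python. This is a sketch that assumes one checkpoint is saved per epoch and that the averaging script takes the last 7 epoch checkpoints not exceeding the upper bound; the helper name is hypothetical.

```python
def checkpoints_to_average(best_epoch, num_checkpoints=7):
    """Return (upper_bound, epochs) so that best_epoch is the middle one."""
    if num_checkpoints % 2 == 0:
        raise ValueError("an odd number of checkpoints is needed for a middle one")
    half = num_checkpoints // 2
    upper_bound = best_epoch + half
    first = upper_bound - num_checkpoints + 1
    if first < 1:
        raise ValueError("not enough checkpoints before the best one")
    return upper_bound, list(range(first, upper_bound + 1))


upper_bound, epochs = checkpoints_to_average(best_epoch=20)
```

For a best checkpoint at epoch 20, this yields an upper bound of 23 and the epochs 17 to 23, which also means training has to run at least 3 epochs past the best one.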
Then, run the following command to generate the output:
fairseq-generate ${MUSTC_SAVE_DIR} \
--config-yaml ${CONFIG_YAML_NAME}.yaml \
--gen-subset tst-COMMON_st_src \
--task speech_to_text_ctc \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
--user-dir examples/speech_to_text \
--path ${ST_SAVE_DIR}/${AVERAGE_CHECKPOINT_NAME}.pt \
--max-tokens 25000 --beam 5 --scoring sacrebleu --no-repeat-ngram-size 5 \
--results-path ${ST_SAVE_DIR}
Note that we set --max-tokens 25000 since we used a K80 GPU with 12 GB of RAM to generate the output.
Download our vocabulary and yaml files if you want to use our pretrained models:
- Generic yaml
- Source: En
- Target: De, Nl, Es
Click on the corresponding language pair to download the model:
Model | --arch | Params | en-de | en-nl | en-es |
---|---|---|---|---|---|
Baseline | s2t_transformer_m_fbk | 77M | 22.87 | 27.21 | 28.09 |
Baseline+compress. | s2t_transformer_m_fbk | 77M | 22.89 | 26.93 | 28.09 |
PlainConvattn | speechformer_m | 79M | 23.29 | 27.18 | 28.01 |
Speechformer | speechformer_m | 79M | 23.84 | 27.85 | 28.56 |
Remember that the results in our paper are the average BLEU score of 3 runs, while here you can download the checkpoint of a single run.
Please cite as:
@inproceedings{papi2021speechformer,
title = {{Speechformer: Reducing Information Loss in Direct Speech Translation}},
author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Turchi, Marco},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2021},
}
Below, there is the original Fairseq README file.
Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
We provide reference implementations of various sequence modeling papers:
List of implemented papers
- Convolutional Neural Networks (CNN)
- Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)
- Convolutional Sequence to Sequence Learning (Gehring et al., 2017)
- Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)
- Hierarchical Neural Story Generation (Fan et al., 2018)
- wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)
- LightConv and DynamicConv models
- Long Short-Term Memory (LSTM) networks
- Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
- Transformer (self-attention) networks
- Attention Is All You Need (Vaswani et al., 2017)
- Scaling Neural Machine Translation (Ott et al., 2018)
- Understanding Back-Translation at Scale (Edunov et al., 2018)
- Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)
- Lexically constrained decoding with dynamic beam allocation (Post & Vilar, 2018)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
- Adaptive Attention Span in Transformers (Sukhbaatar et al., 2019)
- Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)
- Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)
- Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)
- Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)
- Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)
- Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)
- Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models (Enarvi et al., 2020)
- Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)
- Cross-lingual Retrieval for Iterative Self-Supervised Training (Tran et al., 2020)
- Deep Transformers with Latent Depth (Li et al., 2020)
- Non-autoregressive Transformers
- Non-Autoregressive Neural Machine Translation (Gu et al., 2017)
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee et al., 2018)
- Insertion Transformer: Flexible Sequence Generation via Insertion Operations (Stern et al., 2019)
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models (Ghazvininejad et al., 2019)
- Levenshtein Transformer (Gu et al., 2019)
- Finetuning
- December 2020: GottBERT model and code released
- November 2020: Adopted the Hydra configuration framework
- November 2020: fairseq 0.10.0 released
- October 2020: Added R3F/R4F (Better Fine-Tuning) code
- October 2020: Deep Transformer with Latent Depth code released
- October 2020: Added CRISS models and code
- September 2020: Added Linformer code
- September 2020: Added pointer-generator networks
- August 2020: Added lexically constrained decoding
- August 2020: wav2vec2 models and code released
- July 2020: Unsupervised Quality Estimation code released
Previous updates
- May 2020: Follow fairseq on Twitter
- April 2020: Monotonic Multihead Attention code released
- April 2020: Quant-Noise code released
- April 2020: Initial model parallel support and 11B parameters unidirectional LM released
- March 2020: Byte-level BPE code released
- February 2020: mBART model and code released
- February 2020: Added tutorial for back-translation
- December 2019: fairseq 0.9.0 released
- November 2019: VizSeq released (a visual analysis toolkit for evaluating fairseq models)
- November 2019: CamemBERT model and code released
- November 2019: BART model and code released
- November 2019: XLM-R models and code released
- September 2019: Nonautoregressive translation code released
- August 2019: WMT'19 models released
- July 2019: fairseq relicensed under MIT license
- July 2019: RoBERTa models and code released
- June 2019: wav2vec models and code released
- multi-GPU training on one machine or across multiple machines (data and model parallel)
- fast generation on both CPU and GPU with multiple search algorithms implemented:
- beam search
- Diverse Beam Search (Vijayakumar et al., 2016)
- sampling (unconstrained, top-k and top-p/nucleus)
- lexically constrained decoding (Post & Vilar, 2018)
- gradient accumulation enables training with large mini-batches even on a single GPU
- mixed precision training (trains faster with less GPU memory on NVIDIA tensor cores)
- extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers
- flexible configuration based on Hydra allowing a combination of code, command-line and file based configuration
We also provide pre-trained models for translation and language modeling with a convenient torch.hub interface:
import torch

# Load a pre-trained English-German model and translate with beam search
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model')
en2de.translate('Hello world', beam=5)
# 'Hallo Welt'
See the PyTorch Hub tutorials for translation and RoBERTa for more examples.
- PyTorch version >= 1.5.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
# to install the latest stable release (0.10.0)
# pip install fairseq==0.10.0
- For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
- For large datasets install PyArrow:
pip install pyarrow
- If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run.
The full documentation contains instructions for getting started, training new models and extending fairseq with new model types and tasks.
We provide pre-trained models and pre-processed, binarized test sets for several tasks listed below, as well as example training and evaluation commands.
- Translation: convolutional and transformer models are available
- Language Modeling: convolutional and transformer models are available
We also have more detailed READMEs to reproduce results from specific papers:
- Cross-lingual Retrieval for Iterative Self-Supervised Training (Tran et al., 2020)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)
- Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)
- Training with Quantization Noise for Extreme Model Compression ({Fan*, Stock*} et al., 2020)
- Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)
- Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)
- Reducing Transformer Depth on Demand with Structured Dropout (Fan et al., 2019)
- Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)
- Levenshtein Transformer (Gu et al., 2019)
- Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)
- wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)
- Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)
- Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)
- Understanding Back-Translation at Scale (Edunov et al., 2018)
- Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)
- Hierarchical Neural Story Generation (Fan et al., 2018)
- Scaling Neural Machine Translation (Ott et al., 2018)
- Convolutional Sequence to Sequence Learning (Gehring et al., 2017)
- Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)
- Twitter: https://twitter.com/fairseq
- Facebook page: https://www.facebook.com/groups/fairseq.users
- Google group: https://groups.google.com/forum/#!forum/fairseq-users
fairseq(-py) is MIT-licensed. The license applies to the pre-trained models as well.
Please cite as:
@inproceedings{ott2019fairseq,
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
year = {2019},
}