A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.
MuAViC provides:
- 1200 hours of transcribed audio-visual speech for 9 languages (English, Arabic, German, Greek, Spanish, French, Italian, Portuguese and Russian)
- text translations for 6 English-to-X directions and 6 X-to-English directions (X = Greek, Spanish, French, Italian, Portuguese or Russian)
 
The raw data is collected from TED/TEDx talk recordings.
Audio-Visual Speech Recognition
| Language | Code | Train Hours (H+P) | Train Speakers | 
|---|---|---|---|
| English | En | 436 + 0 | 4.7K | 
| Arabic | Ar | 16 + 0 | 95 | 
| German | De | 10 + 0 | 53 | 
| Greek | El | 25 + 0 | 113 | 
| Spanish | Es | 178 + 0 | 987 | 
| French | Fr | 176 + 0 | 948 | 
| Italian | It | 101 + 0 | 487 | 
| Portuguese | Pt | 153 + 0 | 810 | 
| Russian | Ru | 49 + 0 | 238 | 
Audio-Visual En-X Speech-to-Text Translation
| Direction | Code | Train Hours (H+P) | Train Speakers | 
|---|---|---|---|
| English-Greek | En-El | 17 + 420 | 4.7K | 
| English-Spanish | En-Es | 21 + 416 | 4.7K | 
| English-French | En-Fr | 21 + 416 | 4.7K | 
| English-Italian | En-It | 20 + 417 | 4.7K | 
| English-Portuguese | En-Pt | 18 + 419 | 4.7K | 
| English-Russian | En-Ru | 20 + 417 | 4.7K | 
Audio-Visual X-En Speech-to-Text Translation
| Direction | Code | Train Hours (H+P) | Train Speakers | 
|---|---|---|---|
| Greek-English | El-En | 8 + 17 | 113 | 
| Spanish-English | Es-En | 64 + 114 | 987 | 
| French-English | Fr-En | 45 + 131 | 948 | 
| Italian-English | It-En | 48 + 53 | 487 | 
| Portuguese-English | Pt-En | 53 + 100 | 810 | 
| Russian-English | Ru-En | 8 + 41 | 238 | 
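The "Train Hours (H+P)" cells in the three tables above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the MuAViC scripts: the helper names `parse_hours` and `total_hours` are hypothetical, and the values are copied by hand from the tables.

```python
from __future__ import annotations

import re

# "Train Hours (H+P)" cells copied from the three tables above.
AVSR_HOURS = {
    "en": "436 + 0", "ar": "16 + 0", "de": "10 + 0",
    "el": "25 + 0", "es": "178 + 0", "fr": "176 + 0",
    "it": "101 + 0", "pt": "153 + 0", "ru": "49 + 0",
}
EN_X_HOURS = {
    "el": "17 + 420", "es": "21 + 416", "fr": "21 + 416",
    "it": "20 + 417", "pt": "18 + 419", "ru": "20 + 417",
}
X_EN_HOURS = {
    "el": "8 + 17", "es": "64 + 114", "fr": "45 + 131",
    "it": "48 + 53", "pt": "53 + 100", "ru": "8 + 41",
}


def parse_hours(cell: str) -> tuple[int, int]:
    """Split an 'H + P' cell such as '17 + 420' into its two integers."""
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*", cell)
    if match is None:
        raise ValueError(f"not an 'H + P' cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def total_hours(cell: str) -> int:
    """Return H + P for one table cell."""
    return sum(parse_hours(cell))


# Sum of the train hours over all nine AVSR languages.
avsr_total = sum(total_hours(cell) for cell in AVSR_HOURS.values())
print(f"AVSR train hours, all languages: {avsr_total}")

# Per-language comparison: X-En total, AVSR total, En-X total.
for lang in X_EN_HOURS:
    print(
        lang,
        total_hours(X_EN_HOURS[lang]),
        total_hours(AVSR_HOURS[lang]),
        total_hours(EN_X_HOURS[lang]),
    )
```

With the values as listed, each X-En total equals the AVSR train hours for the same language, and every En-X total comes to 437, one hour above the English AVSR figure of 436 (presumably a rounding difference). The tables list train hours only, so the sum is not directly comparable to the 1200-hour headline figure.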
We provide scripts to generate the audio/video data and AV-HuBERT training manifests for MuAViC.
First, clone this repo for the scripts:
git clone https://github.com/facebookresearch/muavic.git
Install required packages:
conda install -c conda-forge ffmpeg==4.2.2
conda install -c conda-forge sox
pip install -r requirements.txt
Then get audio-visual speech recognition and translation data via
python get_data.py --root-path ${ROOT} --src-lang ${SRC_LANG}
where the speech language ${SRC_LANG} is one of en, ar, de, el, es, fr, it, pt and ru.
Generated data will be saved to ${ROOT}/muavic:
- ${ROOT}/muavic/${SRC_LANG}/audio for processed audio files
- ${ROOT}/muavic/${SRC_LANG}/video for processed video files
- ${ROOT}/muavic/${SRC_LANG}/*.tsv for AV-HuBERT AVSR training manifests
- ${ROOT}/muavic/${SRC_LANG}/${TGT_LANG}/*.tsv for AV-HuBERT AVST training manifests
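To prepare several languages in one go, the `get_data.py` invocation and the output layout described above can be wrapped in a small helper. This is a sketch under stated assumptions: the function names `build_get_data_cmd` and `list_manifests` are hypothetical, only the two flags shown above are used, and `/path/to/root` is a placeholder.

```python
from __future__ import annotations

from pathlib import Path

# Source languages accepted by --src-lang, as listed above.
SRC_LANGS = ("en", "ar", "de", "el", "es", "fr", "it", "pt", "ru")


def build_get_data_cmd(root: str, src_lang: str) -> list[str]:
    """Build the get_data.py command line for one source language."""
    if src_lang not in SRC_LANGS:
        raise ValueError(f"unsupported source language: {src_lang!r}")
    return ["python", "get_data.py", "--root-path", str(root), "--src-lang", src_lang]


def list_manifests(root: str, src_lang: str, tgt_lang: str | None = None) -> list[Path]:
    """List the *.tsv manifests for AVSR (no tgt_lang) or AVST (with tgt_lang)."""
    base = Path(root) / "muavic" / src_lang
    if tgt_lang is not None:
        base = base / tgt_lang
    return sorted(base.glob("*.tsv"))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for lang in SRC_LANGS:
        print(" ".join(build_get_data_cmd("/path/to/root", lang)))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` from the directory that contains `get_data.py`.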
In the following table, we provide all end-to-end trained models mentioned in our paper:
| Task | Languages | Best Checkpoint | Dictionary | Tokenizer | 
|---|---|---|---|---|
| AVSR | ar | best_ckpt.pt | dict | tokenizer |
| AVSR | de | best_ckpt.pt | dict | tokenizer |
| AVSR | el | best_ckpt.pt | dict | tokenizer |
| AVSR | en | best_ckpt.pt | dict | tokenizer |
| AVSR | es | best_ckpt.pt | dict | tokenizer |
| AVSR | fr | best_ckpt.pt | dict | tokenizer |
| AVSR | it | best_ckpt.pt | dict | tokenizer |
| AVSR | pt | best_ckpt.pt | dict | tokenizer |
| AVSR | ru | best_ckpt.pt | dict | tokenizer |
| AVSR | ar,de,el,es,fr,it,pt,ru | best_ckpt.pt | dict | tokenizer |
| AVST | en-el | best_ckpt.pt | dict | tokenizer |
| AVST | en-es | best_ckpt.pt | dict | tokenizer |
| AVST | en-fr | best_ckpt.pt | dict | tokenizer |
| AVST | en-it | best_ckpt.pt | dict | tokenizer |
| AVST | en-pt | best_ckpt.pt | dict | tokenizer |
| AVST | en-ru | best_ckpt.pt | dict | tokenizer |
| AVST | el-en | best_ckpt.pt | dict | tokenizer |
| AVST | es-en | best_ckpt.pt | dict | tokenizer |
| AVST | fr-en | best_ckpt.pt | dict | tokenizer |
| AVST | it-en | best_ckpt.pt | dict | tokenizer |
| AVST | pt-en | best_ckpt.pt | dict | tokenizer |
| AVST | ru-en | best_ckpt.pt | dict | tokenizer |
| AVST | {el,es,fr,it,pt,ru}-en | best_ckpt.pt | dict | tokenizer |
To try out our state-of-the-art audio-visual models with different audio and video inputs, including a video recorded through the webcam or an uploaded video, check out our demo (demo.mp4).
You can read more about our model in the README file in the demo folder.
For training audio-visual models, we use the AV-HuBERT framework.
- Clone and install AV-HuBERT in the root directory:
      $ # Clone the "muavic" branch of av_hubert's repo
      $ git clone -b muavic https://github.com/facebookresearch/av_hubert.git
      $ # Set the fairseq version
      $ cd av_hubert
      $ git submodule init
      $ git submodule update
      $ # Install av-hubert's requirements
      $ pip install -r requirements.txt
      $ # Install fairseq
      $ cd fairseq
      $ pip install --editable ./
- Download an AV-HuBERT pre-trained model from here.
- Open the training script (scripts/train.sh) and replace these variables:
      # language direction (e.g. "en" or "en-fr")
      LANG=
      # path where output trained models will be located
      OUT_PATH=
      # path to the downloaded pre-trained model
      PRETRAINED_MODEL_PATH=
- Run the training script:
      $ bash scripts/train.sh
 
Note: All audio-visual models found here used the large_vox_iter5.pt pre-trained model.
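Since LANG in the training script takes either a single language code (AVSR) or a direction such as "en-fr" (AVST), it can help to validate the value before launching a long run. The sketch below is not part of the MuAViC scripts; `parse_lang` is a hypothetical helper, and it assumes the supported values are exactly the nine languages and twelve directions listed in this README. The multilingual model names from the table are not handled.

```python
from __future__ import annotations

# Languages with AVSR data, as listed above.
AVSR_LANGS = {"en", "ar", "de", "el", "es", "fr", "it", "pt", "ru"}
# Non-English side of the En-X and X-En translation directions.
AVST_NON_ENGLISH = {"el", "es", "fr", "it", "pt", "ru"}


def parse_lang(value: str) -> tuple[str, str | None]:
    """Return (source, target); target is None for an AVSR language code."""
    parts = value.split("-")
    if len(parts) == 1 and parts[0] in AVSR_LANGS:
        return parts[0], None
    if len(parts) == 2:
        src, tgt = parts
        en_to_x = src == "en" and tgt in AVST_NON_ENGLISH
        x_to_en = tgt == "en" and src in AVST_NON_ENGLISH
        if en_to_x or x_to_en:
            return src, tgt
    raise ValueError(f"unsupported LANG value: {value!r}")


if __name__ == "__main__":
    for candidate in ("en", "en-fr", "ru-en"):
        print(candidate, "->", parse_lang(candidate))
```

A value such as "ar-en" raises `ValueError`, because Arabic and German have transcriptions but no translations in the corpus.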
To evaluate your trained model (or our trained models) against MuAViC, follow these steps:
- Open the decoding script (scripts/decode.sh) and replace these variables:
      # language direction (e.g. "en" or "en-fr")
      LANG=???
      # data split (e.g. "test" or "valid")
      GROUP=???
      # inference modality (choices: "audio", "video", "audio,video")
      MODALITIES=???
      # path to the trained model
      MODEL_PATH=???
      # path where decoding results and scores will be located
      OUT_PATH=???
- Run the decoding script:
      $ bash scripts/decode.sh
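When comparing the three inference modalities, the variable block for the decoding script has to be filled in once per setting. The following sketch renders those assignments as text to paste into scripts/decode.sh. It is illustrative only: `render_decode_vars` is a hypothetical helper, the split names "test" and "valid" are taken from the example in the comment above and assumed to be the only ones, and all paths are placeholders.

```python
from __future__ import annotations

import shlex

# Modality choices named in the decoding script's comment.
MODALITY_CHOICES = ("audio", "video", "audio,video")
# Assumption: only the two splits given as examples above are accepted.
ASSUMED_GROUPS = ("test", "valid")


def render_decode_vars(
    lang: str, group: str, modalities: str, model_path: str, out_path: str
) -> str:
    """Render the five shell variable assignments used by the decoding script."""
    if modalities not in MODALITY_CHOICES:
        raise ValueError(f"unknown modalities: {modalities!r}")
    if group not in ASSUMED_GROUPS:
        raise ValueError(f"unknown data split: {group!r}")
    values = {
        "LANG": lang,
        "GROUP": group,
        "MODALITIES": modalities,
        "MODEL_PATH": model_path,
        "OUT_PATH": out_path,
    }
    # shlex.quote keeps paths containing spaces safe for the shell.
    return "\n".join(f"{name}={shlex.quote(value)}" for name, value in values.items())


if __name__ == "__main__":
    for modalities in MODALITY_CHOICES:
        # Hypothetical layout: one output directory per modality setting.
        out_dir = "/path/to/out/" + modalities.replace(",", "_")
        print(render_decode_vars("en-fr", "test", modalities, "/path/to/best_ckpt.pt", out_dir))
        print()
```

Writing each setting to its own output directory keeps the decoding results and scores of the audio-only, video-only and audio-visual runs from overwriting one another.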
 
CC-BY-NC 4.0
@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}