PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards high-fidelity zero-shot style transfer of out-of-distribution (OOD) custom voice.
We provide our implementation and pretrained models in this repository.
Visit our demo page for audio samples.
- December 2022: GenerSpeech (NeurIPS 2022) released on GitHub.
- Multi-level Style Transfer for expressive text-to-speech.
- Enhanced model generalization to out-of-distribution (OOD) style references.
We provide an example of how you can generate high-fidelity samples using GenerSpeech.
To try it on your own dataset, simply clone this repo to a local machine with an NVIDIA GPU and CUDA/cuDNN, and follow the instructions below.
You can use the pretrained models we provide here and the data here. Details of each folder are as follows:
| Model | Dataset (16 kHz) | Description |
|---|---|---|
| GenerSpeech | LibriTTS, ESD | Acoustic model (config) |
| HiFi-GAN | LibriTTS, ESD | Neural vocoder |
| Encoder | / | Emotion encoder |
More supported datasets are coming soon.
A suitable conda environment named generspeech can be created
and activated with:
```bash
conda env create -f environment.yaml
conda activate generspeech
```
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count().
You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
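For example, a minimal sketch of checking the visible GPUs and restricting a run to specific devices (the device ids below are arbitrary; the training command itself is described later in this README):

```bash
# Optional sanity check: how many CUDA devices PyTorch can see.
python -c "import torch; print(torch.cuda.device_count())"

# Restrict the run to GPUs 0 and 1 (example ids), then launch training.
export CUDA_VISIBLE_DEVICES=0,1
python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```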
Here we provide a speech synthesis pipeline using GenerSpeech.
- Prepare GenerSpeech (acoustic model): download and put the checkpoint at `checkpoints/GenerSpeech`
- Prepare HiFi-GAN (neural vocoder): download and put the checkpoint at `checkpoints/trainset_hifigan`
- Prepare Emotion Encoder: download and put the checkpoint at `checkpoints/Emotion_encoder.pt`
- Prepare dataset: download and put the statistical files at `data/binary/training_set`
- Prepare `path/to/reference_audio` (16 kHz): by default, GenerSpeech uses ASR + MFA to obtain the text-speech alignment from the reference (see the resampling sketch after the command below)

```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"
```

Generated wav files are saved in `infer_out` by default.
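The reference audio is expected at 16 kHz. If your recording has a different sample rate, one way to resample it is with ffmpeg (ffmpeg is not part of this repo, and the file names below are placeholders):

```bash
# Resample a reference recording to 16 kHz before passing it as ref_audio.
ffmpeg -i my_reference.wav -ar 16000 my_reference_16k.wav
```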
- Set `raw_data_dir`, `processed_data_dir`, and `binary_data_dir` in the config file, and download the dataset to `raw_data_dir`.
- Check `preprocess_cls` in the config file. The dataset structure needs to follow the processor `preprocess_cls`, or you could rewrite it according to your dataset. We provide a LibriTTS processor as an example in `modules/GenerSpeech/config/generspeech.yaml`.
- Download the global emotion encoder to `emotion_encoder_path`. For more details, please refer to this branch.
- Preprocess the dataset:
```bash
# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config

# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config

# Binarization step: binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
```

You could also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed dataset (16 kHz LibriTTS + ESD).
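Putting the three steps together, a concrete run using the provided LibriTTS example config might look like the sketch below (assuming that config's `raw_data_dir` already points at your downloaded data; the GPU id is arbitrary):

```bash
# Full preprocessing pass with the example config shipped in this repo.
CONFIG=modules/GenerSpeech/config/generspeech.yaml
python data_gen/tts/bin/preprocess.py --config $CONFIG
python data_gen/tts/bin/train_mfa_align.py --config $CONFIG
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config $CONFIG
```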
Train GenerSpeech, then run inference with the trained model:

```bash
# Training
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset

# Inference with the trained model
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --infer
```

This implementation uses parts of the code from the following GitHub repos: FastDiff and NATSpeech, as described in our code.
If you find this code useful in your research, please cite our work:
@inproceedings{huanggenerspeech,
title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
booktitle={Advances in Neural Information Processing Systems}
}

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.