Vietnamese F5-TTS🗣️ (Vi-F5-TTS)

Introduction

Vietnamese F5-TTS (Vi-F5-TTS) is a text-to-speech model for Vietnamese, fine-tuned from F5-TTS on the vin100h-preprocessed-v2 dataset. It generates natural Vietnamese audio 🎙️ for applications like voice assistants 🤖 and audiobooks 📚, and is built with PyTorch, Transformers, and Gradio 🚀.

Key Features

Vietnamese Speech 🎤: Produces natural Vietnamese audio from text.
Fine-Tuned 🔧: Optimized for better pronunciation and intonation.
Customizable Inference 🎛️: Uses reference audio/text for tailored output.
Gradio GUI 🖥️: Interactive interface for easy testing.
Open-Source 🌐: MIT-licensed with accessible code and checkpoints.
Multi-Platform ☁️: Supports Colab, Kaggle, SageMaker, and local use.
Flexible Training ⚙️: Configurable with accelerate support.

Notebook

Dataset

The vin100h-preprocessed-v2 dataset used for fine-tuning is hosted as a HuggingFace Dataset.

Base Model

This repository is built upon the F5-TTS model. The base checkpoints used here come from SWivid/F5-TTS.

Demonstration

Experience the magic of Vietnamese text-to-speech through:

HuggingFace Space:
Demo GUI:

To run the Gradio app locally (localhost:7860):

python apps/gradio_app.py

Usage Guide

Setup Instructions

Step 1: Clone the Repository

Clone the project repository and navigate to the project directory:

git clone https://github.com/danhtran2mind/Vi-F5-TTS.git
cd Vi-F5-TTS

Step 2: Install Dependencies

Install the required Python packages:

pip install -e .

Alternatively, install the dependencies from requirements.txt:

pip install -r requirements/requirements.txt
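
After either install route, a quick import check can confirm that the main libraries are visible to the interpreter. The sketch below is illustrative only: the helper `missing_modules` is hypothetical, and the import names (`torch`, `transformers`, `gradio`, `accelerate`) are assumed from the libraries this README mentions rather than read from the project's requirements files.

```python
import importlib.util

# Import names assumed from the libraries named in this README;
# the actual requirements files may list more or different packages.
REQUIRED_MODULES = ["torch", "transformers", "gradio", "accelerate"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages are importable.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as PyTorch.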

Step 3: Configure the Environment

Run the following scripts to set up the project:

Install Third-Party Dependencies
```
python scripts/setup_third_party.py
```

Download Model Checkpoints

To download the base checkpoint from SWivid/F5-TTS:

python scripts/download_ckpts.py \
    --repo_id "SWivid/F5-TTS" --local_dir "./ckpts" \
    --folder_name "F5TTS_v1_Base_no_zero_init"

To download the Vi-F5-TTS checkpoint from danhtran2mind/Vi-F5-TTS:

python scripts/download_ckpts.py \
    --repo_id "danhtran2mind/Vi-F5-TTS" \
    --local_dir "./ckpts" --pruning_model
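
To confirm that a download landed where the later commands expect it, the checkpoint directory can be scanned for model files. This is a minimal sketch, not part of the repository: `find_checkpoints` is a hypothetical helper, and the file extensions are assumed from the file names that appear elsewhere in this README (`model_1250000.safetensors`, `model_last.pt`).

```python
from pathlib import Path

# Extensions assumed from the checkpoint file names used in this README.
CHECKPOINT_SUFFIXES = {".safetensors", ".pt"}


def find_checkpoints(root):
    """Return checkpoint files found anywhere under `root`, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in CHECKPOINT_SUFFIXES
    )


if __name__ == "__main__":
    # "./ckpts" matches the --local_dir value used in the download commands.
    for path in find_checkpoints("./ckpts"):
        size_mb = path.stat().st_size / 1e6
        print(f"{path}  ({size_mb:.1f} MB)")
```

An empty listing usually means the download did not complete or was written to a different `--local_dir`.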

Prepare Dataset (Optional, for Training)
```
python scripts/process_dataset.py
```

Training

Config

Create a default accelerate configuration:

accelerate config default

Training Bash

To train the model:

accelerate launch ./src/f5_tts/train/finetune_cli.py \
    --exp_name F5TTS_Base \
    --dataset_name vin100h-preprocessed-v2 \
    --finetune \
    --tokenizer pinyin \
    --learning_rate 1e-05 \
    --batch_size_type frame \
    --batch_size_per_gpu 3200 \
    --max_samples 64 \
    --grad_accumulation_steps 2 \
    --max_grad_norm 1 \
    --epochs 80 \
    --num_warmup_updates 2761 \
    --save_per_updates 4000 \
    --keep_last_n_checkpoints 1 \
    --last_per_updates 4000 \
    --log_samples \
    --pretrain "<your_pretrain_model>"  # such as "./ckpts/F5TTS_v1_Base_no_zero_init/model_1250000.safetensors"
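
Because `--batch_size_type frame` makes `--batch_size_per_gpu` a frame budget rather than a sample count, it can help to work out how much audio one optimizer update may cover. The sketch below is a rough upper-bound calculation under stated assumptions: the sample rate (24 kHz) and mel hop length (256) are not given in this README and should be checked against the model config, and the function names are hypothetical.

```python
# Values taken from the training command above.
BATCH_SIZE_PER_GPU = 3200      # --batch_size_per_gpu, in frames
GRAD_ACCUMULATION_STEPS = 2    # --grad_accumulation_steps

# Assumptions not stated in this README; verify against the model config.
SAMPLE_RATE = 24_000           # audio samples per second
HOP_LENGTH = 256               # audio samples per mel frame


def max_frames_per_update(num_gpus):
    """Upper bound on mel frames contributing to one optimizer update."""
    return BATCH_SIZE_PER_GPU * GRAD_ACCUMULATION_STEPS * num_gpus


def frames_to_seconds(frames):
    """Convert a mel frame count to seconds of audio."""
    return frames * HOP_LENGTH / SAMPLE_RATE


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        frames = max_frames_per_update(gpus)
        print(f"{gpus} GPU(s): up to {frames} frames, "
              f"about {frames_to_seconds(frames):.1f} s of audio per update")
```

Under these assumptions a single GPU sees at most 6400 frames per update, roughly 68 seconds of audio. Actual batches can be smaller, since `--max_samples 64` also caps the number of utterances per batch.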

Training Arguments

Refer to the Training Documents for detailed arguments used in fine-tuning the model. ⚙️

Inference

Inference Bash

To generate speech audio with the trained model:

python src/f5_tts/infer/infer_cli.py \
    --model_cfg "vi-fine-tuned-f5-tts.yaml" \
    --ckpt_file "model_last.pt" \
    --vocab_file "vocab.txt" \
    --ref_audio "<path_to_your_reference_audio>" \
    --ref_text "<text_of_your_reference_audio>" \
    --gen_text "<your_generative_text>" \
    --output_dir "<output_folder>" \
    --output_file "<output_file_name>"
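
When generating many clips, it may be convenient to assemble the same command from Python instead of retyping it. The following is a sketch under assumptions: `build_infer_command` and `run_inference` are hypothetical helpers, not part of this repository. They only reuse the script path and flags shown in the command above and expect to be called from the repository root.

```python
import subprocess
import sys


def build_infer_command(ref_audio, ref_text, gen_text, output_dir, output_file,
                        model_cfg="vi-fine-tuned-f5-tts.yaml",
                        ckpt_file="model_last.pt",
                        vocab_file="vocab.txt"):
    """Build the argument list for the inference CLI shown in this README."""
    return [
        sys.executable, "src/f5_tts/infer/infer_cli.py",
        "--model_cfg", model_cfg,
        "--ckpt_file", ckpt_file,
        "--vocab_file", vocab_file,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--output_dir", output_dir,
        "--output_file", output_file,
    ]


def run_inference(**kwargs):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_infer_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with Vietnamese text that contains commas or quotation marks. A call such as `run_inference(ref_audio="ref.wav", ref_text="...", gen_text="...", output_dir="out", output_file="clip.wav")` would then launch one generation, where the file names are placeholders.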

Inference Arguments

Refer to the Inference Documents for detailed arguments used during inference. ⚙️

Inference Example

Example 1:

Reference Text: bạn và tôi đều như nhau nhé, rồi chúng ta đi đâu nè
Reference Audio:

refer_audio.mp4

Inference Text: chào mọi người, mọi người khỏe không?
Inference Audio:

infer_audio.mp4

Example 2:

Reference Text: Chúng thường sống hòa bình với các loài động vật khác, kể cả những loài săn mồi.
Reference Audio:

refer_audio.mp4

Inference Text: Tôi rất khỏe,cảm ơn mọi người đã quan tâm.
Inference Audio:

infer_audio.mp4

Example 3:

Reference Text: Sau nhà Ngô, lần lượt các triều Đinh, Tiền Lê, Lý và Trần tổ chức chính quyền tương tự các triều đại Trung Hoa, lấy Phật giáo làm tôn giáo chính của quốc gia và cho truyền bá cả Nho giáo và Đạo giáo.
Reference Audio:

refer_audio.mp4

Inference Text: Nhà Tiền Lê, Lý và Trần đã chống trả các cuộc tấn công của nhà Tống và nhà Mông – Nguyên, đều thắng lợi và bảo vệ được Đại Việt.
Inference Audio:

infer_audio.mp4

Example 4:

Reference Text: Cấu trúc sừng và mào là phổ biến ở tất cả các nhóm khủng long, và vài nhóm thậm chí còn phát triển các biến đổi bộ xương như giáp mô hoặc gai.
Reference Audio:

refer_audio.mp4

Inference Text: Người dân Đông Á cổ đại đã uống trà trong nhiều thế kỷ, thậm chí có thể là hàng thiên niên kỷ , trước khi sử dụng nó như một thức uống.
Inference Audio:

infer_audio.mp4

Limitations

Dataset Constraints: The model is trained on the vin100h-preprocessed-v2 dataset, which, while useful, is limited in size and diversity. This can result in suboptimal audio quality, particularly in capturing varied intonations, accents, and emotional nuances specific to Vietnamese speech.
Inference Audio Quality: Due to the dataset's limited scope, the generated audio may exhibit inconsistencies in pronunciation, unnatural pauses, or lack of expressiveness, especially for complex or context-heavy text inputs.
Resource Limitations: The current training setup is constrained by computational resources, which limits the model's ability to generalize across diverse speech patterns.
Accent and Dialect Coverage: The dataset primarily focuses on standard Vietnamese, potentially leading to reduced performance for regional dialects or non-standard accents.
Noise and Artifacts: Some generated audio may contain minor artifacts or background noise, stemming from the dataset's preprocessing limitations or insufficient training data to model clean speech effectively.
Future Improvements: The audio quality and robustness of the model are expected to improve with access to larger, higher-quality datasets and more powerful training infrastructure, enabling better generalization and richer speech output.

Project Description

This repository is trained from a fork of the upstream F5-TTS code, with numerous bug fixes and rewritten code for improved performance and stability. The base model is published on the Model Hub as SWivid/F5-TTS.
