This repository contains the official data preparation tools for the LRAC Challenge.
This repository is a fork of the URGENT 2025 Challenge repository and adapts its data preparation scripts and general structure for our challenge.
The goal of the challenge is to develop an audio codec that can compress speech to a very low bitrate while maintaining the highest possible perceptual quality and intelligibility.
❗️❗️**[2025-09-01]** Removed the sampling rate from the noise and RIR scp files for baseline support.
❗️❗️**[2025-08-25]** Added lists of files used for the open test set (`datafiles/open_testset`). Added evaluation data preparation for the baseline recipe.
❗️❗️**[2025-08-06]** First commit containing the data preparation core functionality.
- OS: Linux
- Disk Space: At least 1.2 TB of free disk space for datasets.
- Dependencies: `ffmpeg` is required for audio processing.
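
The requirements above can be checked before starting a long download. The following is a minimal sketch and not part of the repository: the helper names `have_cmd`, `free_kib`, and `has_free_space` are hypothetical, and it assumes the datasets will be stored on the filesystem that holds the current directory.

```shell
#!/usr/bin/env bash
# Pre-flight check sketch (hypothetical helpers, not shipped with the repository).

# Succeeds if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Prints the free space, in 1 KiB blocks, of the filesystem holding the given path.
# -P forces the portable one-line-per-filesystem layout; column 4 is "Available".
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds if the path ($1) has at least the requested number of KiB ($2) free.
has_free_space() {
    [ "$(free_kib "$1")" -ge "$2" ]
}

# 1.2 TB (1.2e12 bytes) expressed in 1 KiB blocks.
required_kib=1171875000

if have_cmd ffmpeg; then
    echo "ffmpeg: found"
else
    echo "ffmpeg: not found on PATH"
fi

if has_free_space . "$required_kib"; then
    echo "disk: at least 1.2 TB free"
else
    echo "disk: less than 1.2 TB free"
fi
```

The script only reports its findings and does not abort, so it can be run safely on any machine before deciding where to place the data.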
- Clone the repository:
git clone https://github.com/cisco-open/lrac_data_generation && cd lrac_data_generation
- Download and prepare the datasets: Run the main preparation script. This script automates the entire process:
  - It downloads the original large-scale corpora. Each downloaded corpus is kept in its compressed form in a directory named after the dataset.
  - It selects a high-quality subset using our pre-filtered file lists to ensure data quality.
  - It resamples all selected audio to a 24 kHz sampling rate for compatibility with the baseline model.
  - It places all final, ready-to-use data in the `./data` directory.
. ./prepare_espnet_data.sh
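
To illustrate the selection and resampling steps, here is a rough sketch. It is not the logic of `prepare_espnet_data.sh`, whose internals are not described here. The function names and the one-entry-per-line list format are assumptions, and only the 24 kHz target rate comes from the text above.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical helpers, not the repository's actual code.

# Print the lines of a candidate list ($1) that also appear in a keep-list ($2).
# Assumes one entry per line; -F = fixed strings, -x = whole-line match.
select_subset() {
    grep -Fxf "$2" "$1"
}

# Resample one audio file ($1) to 24 kHz and write it to $2.
# -ar sets the output sampling rate; -y overwrites an existing output file.
resample_24k() {
    ffmpeg -nostdin -loglevel error -y -i "$1" -ar 24000 "$2"
}

# Small demo of the selection step on throwaway lists (no audio needed).
demo_dir=$(mktemp -d)
printf '%s\n' utt1 utt2 utt3 > "$demo_dir/all.txt"
printf '%s\n' utt3 utt1 > "$demo_dir/keep.txt"
select_subset "$demo_dir/all.txt" "$demo_dir/keep.txt"   # prints utt1, then utt3
rm -rf "$demo_dir"
```

With `ffmpeg` installed, a call such as `resample_24k input.flac output.wav` (placeholder file names) would then convert one selected file.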
The datasets used in the challenge can be found under this link: https://lrac.short.gy/datasets
The datasets are automatically handled by the `prepare_espnet_data.sh` script.
All prepared data will be located in the `./data` directory.
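
Once preparation has finished, a quick sanity check over the generated scp files can catch broken paths early. The sketch below is an assumption-laden example rather than a repository tool: it assumes each scp line has the form `<id> <path>` (two whitespace-separated columns) and that the scp files live somewhere under `./data`. The helper name `count_missing` is hypothetical.

```shell
#!/usr/bin/env bash
# Sanity-check sketch for scp files (assumed format: "<id> <path>" per line).

# Print how many entries in the scp file ($1) point to a path that does not exist.
count_missing() {
    local missing=0 id path
    # The "|| [ -n "$id" ]" part also handles a last line without a trailing newline.
    while read -r id path || [ -n "$id" ]; do
        [ -z "$id" ] && continue              # skip blank lines
        [ -e "$path" ] || missing=$((missing + 1))
    done < "$1"
    echo "$missing"
}

# Report every *.scp file under ./data, if that directory exists.
if [ -d ./data ]; then
    find ./data -name '*.scp' | while read -r scp; do
        echo "$scp: $(wc -l < "$scp") entries, $(count_missing "$scp") missing"
    done
fi
```

If the scp files use a different column layout, the `read -r id path` line would need to be adapted accordingly.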
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.