This is the official repository for the Interspeech 2025 paper: Lessons Learned from the URGENT 2024 Speech Enhancement Challenge. We will refer to it as the URGENT 2024 analysis paper below for brevity.
The WADA-SNR algorithm was adopted to roughly analyze the signal-to-noise ratio (SNR) distribution of the reference speech samples in the datasets used in the URGENT 2024 Speech Enhancement Challenge. Our implementation of the WADA-SNR algorithm is provided in `wada_snr/`. The analysis results can be found below:
> [!TIP]
> We can observe that
> - The WADA-SNR distribution on each dataset remains almost the same before (Figure 2) and after (Figure 2a) applying the baseline TF-GridNet model for enhancement. This indicates that the label noise issue (as discussed in Section 2.1 of the URGENT 2024 analysis paper) may persist in the enhanced samples.
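For reference, the core of WADA-SNR is a scale-invariant statistic of the waveform amplitude distribution, G = log E[|z|] − E[log |z|], which is mapped to an SNR value through a precomputed table. Below is a minimal self-contained numpy sketch that rebuilds that table by Monte-Carlo simulation of the signal model from the original WADA-SNR paper (Gamma-distributed speech amplitudes with shape 0.4 plus Gaussian noise); the function names are ours, and the actual implementation in `wada_snr/` may differ (e.g., it may use the original precomputed lookup table instead):

```python
import numpy as np

def build_wada_table(snr_db_grid, n=200_000, seed=0):
    """Monte-Carlo version of the WADA lookup table: for each candidate SNR,
    simulate speech (Gamma-distributed amplitudes, shape 0.4, random signs)
    plus Gaussian noise and record G = log E[|z|] - E[log |z|]."""
    rng = np.random.default_rng(seed)
    amp = rng.gamma(shape=0.4, scale=1.0, size=n)    # speech amplitudes
    speech = amp * rng.choice([-1.0, 1.0], size=n)   # attach random signs
    speech /= speech.std()                           # ~unit-power speech
    noise = rng.standard_normal(n)                   # unit-power noise
    table = []
    for snr_db in snr_db_grid:
        z = np.maximum(np.abs(speech + 10 ** (-snr_db / 20) * noise), 1e-10)
        table.append(np.log(z.mean()) - np.log(z).mean())
    return np.asarray(table)

def wada_snr(wav, snr_db_grid=np.arange(-20, 101)):
    """Blindly estimate the SNR (in dB) of a single-channel waveform."""
    z = np.maximum(np.abs(np.asarray(wav, dtype=float)), 1e-10)
    g = np.log(z.mean()) - np.log(z).mean()          # scale-invariant statistic
    table = build_wada_table(snr_db_grid)
    # G grows monotonically with SNR, so pick the nearest grid point
    return float(snr_db_grid[np.argmin(np.abs(table - g))])
```

Note that the table is rebuilt on every call here only for brevity; a real implementation would precompute and cache it. The grid above matches the [-20, 100] dB value range listed for WADA-SNR in the metric table below.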
The tag occurrences of different samples in the non-blind and blind test sets can be found below:
This is the enlarged version of Figure 3 in the URGENT 2024 analysis paper.
For each tag (on the $x$-axis), there are two bars:
- the left one represents the number of occurrences of the tag in the non-blind test set,
- while the right one (hatched) represents the number of occurrences of the tag in the blind test set.
The color of each bar, as well as the numbers (`num1 : num2`) above it, indicates the number of occurrences of the corresponding tag in *other* and *hard* samples, respectively.
The detailed definitions of the *hard* samples and the tags can be found in `tagging/README.md`.
We also calculated the correlation between the human-annotated Mean Opinion Scores (MOS) and different objective metrics on the blind test data. The objective metrics include:
The metrics in orange (DNSMOS Pro, UTMOS, WV-MOS, SCOREQ, VQScore, and WADA-SNR) were not used during the challenge. The implementations of these additional metrics are provided in `mos/`. The remaining metrics were officially used in the challenge, and their implementations are available at https://github.com/urgent-challenge/urgent2024_challenge/tree/main/evaluation_metrics.
| Category | Metric | Need Reference Signals? | Supported Sampling Frequencies | Value Range | Run on CPU or GPU? |
|---|---|---|---|---|---|
| Non-intrusive SE metrics | DNSMOS ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | NISQA ↑ | ❌ | 48 kHz | [1, 5] | CPU or GPU |
| | DNSMOS Pro ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | UTMOS ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | WV-MOS ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | SCOREQ ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | VQScore ↑ | ❌ | 16 kHz | [-1, 1] | CPU or GPU |
| | WADA-SNR ↑ | ❌ | Any | [-20, 100] | CPU |
| Intrusive SE metrics | PESQ ↑ | ✔ | {8, 16} kHz | [-0.5, 4.5] | CPU |
| | ESTOI ↑ | ✔ | 10 kHz | [0, 1] | CPU |
| | SDR ↑ | ✔ | Any | (-∞, +∞) | CPU |
| | MCD ↓ | ✔ | Any | [0, +∞) | CPU |
| | LSD ↓ | ✔ | Any | [0, +∞) | CPU |
| | POLQA ↑ | ✔ | 8–48 kHz | [1, 5] | CPU (proprietary GUI program) |
| Downstream-task-independent metrics | SpeechBERTScore ↑ | ✔ | 16 kHz | [-1, 1] | CPU or GPU |
| | Levenshtein phone similarity (LPS) ↑ | ✔ | 16 kHz | (-∞, 1] | CPU or GPU |
| Downstream-task-dependent metrics | SpkSim ↑ | ✔ | 16 kHz | [-1, 1] | CPU or GPU |
| | WAcc (=1-WER) ↑ | ❌ | 16 kHz | (-∞, 1] | CPU or GPU |
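To illustrate how some of the intrusive metrics above can be computed, here is a sketch using the widely available open-source `pesq` and `pystoi` packages together with a plain SDR implementation. These packages are not necessarily identical to the challenge's official implementations (linked above), which should be used for exact reproduction; the signals below are random placeholders:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def sdr(ref, est, eps=1e-10):
    """Plain signal-to-distortion ratio in dB (no scale-invariant projection)."""
    return 10 * np.log10((ref**2).sum() / (((ref - est) ** 2).sum() + eps) + eps)

fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(2 * fs).astype(np.float32)                 # placeholder reference
est = (ref + 0.1 * rng.standard_normal(2 * fs)).astype(np.float32)   # placeholder enhanced signal

print(f"SDR:   {sdr(ref, est):.2f} dB")
print(f"PESQ:  {pesq(fs, ref, est, 'wb'):.2f}")           # wide-band PESQ at 16 kHz
print(f"ESTOI: {stoi(ref, est, fs, extended=True):.3f}")  # pystoi resamples to 10 kHz internally
```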
The MOS labels for each original/enhanced speech sample in the blind test set are released at https://huggingface.co/datasets/urgent-challenge/urgent2024_mos. The correlations between MOS and different objective metrics in the blind test data are shown below:
This is the refined version of Figure 4 in the URGENT 2024 analysis paper, with two major changes:
- A new correlation measure, Spearman's rank correlation coefficient (SRCC), is added.
- The overall ranking score calculated based solely on objective metrics (those colored in blue) is added, denoted as `Overall ranking score (w/o MOS)` in the figure.
> [!TIP]
> We can observe that
> - Both `Overall ranking score (w/o MOS)` and `Overall ranking score` are highly correlated with the human-annotated MOS, outperforming all individual objective metrics officially used in the challenge in terms of KRCC and SRCC. This highlights the importance and effectiveness of the comprehensive evaluation protocol design in the challenge.
> - Several non-intrusive metrics that were not specifically designed for the universal speech enhancement task (and were not used in the challenge), such as UTMOS and SCOREQ, also show very strong correlations with the human-annotated MOS, indicating their potential for future SE research.
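For reference, rank-correlation measures such as the SRCC and KRCC discussed above (along with the linear correlation coefficient) can be computed with `scipy.stats`. A minimal sketch, where `mos` and `metric` are hypothetical aligned per-sample arrays of MOS labels and one objective metric (the values below are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical aligned per-sample scores on the blind test set
mos = np.array([3.2, 4.1, 2.8, 3.9, 4.5, 3.0])
metric = np.array([2.9, 4.0, 3.1, 3.7, 4.6, 2.7])

lcc, _ = stats.pearsonr(mos, metric)     # linear correlation coefficient (LCC)
srcc, _ = stats.spearmanr(mos, metric)   # Spearman's rank correlation (SRCC)
krcc, _ = stats.kendalltau(mos, metric)  # Kendall's rank correlation (KRCC)
print(f"LCC={lcc:.3f}  SRCC={srcc:.3f}  KRCC={krcc:.3f}")
```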
If you find this repository useful, please consider citing our paper:
```bibtex
@inproceedings{Lessons-Zhang2025,
  title={Lessons Learned from the {URGENT} 2024 Speech Enhancement Challenge},
  author={Zhang, Wangyou and Saijo, Kohei and Cornell, Samuele and Scheibler, Robin and Li, Chenda and Ni, Zhaoheng and Kumar, Anurag and Sach, Marvin and Wang, Wei and Fu, Yihui and Watanabe, Shinji and Fingscheidt, Tim and Qian, Yanmin},
  booktitle={Proc. Interspeech},
  pages={853--857},
  year={2025},
  doi={10.21437/Interspeech.2025-1246},
  url={https://www.isca-archive.org/interspeech_2025/zhang25j_interspeech.html},
}
```