This is the official repository for the Interspeech 2025 paper: Lessons Learned from the URGENT 2024 Speech Enhancement Challenge. We will refer to it as the URGENT 2024 analysis paper below for brevity.
The WADA-SNR algorithm was adopted to roughly analyze the signal-to-noise ratio (SNR) distribution of the reference speech samples in the datasets used in the URGENT 2024 Speech Enhancement Challenge. Our implementation of the WADA-SNR algorithm is provided in `wada_snr/`. The analysis results can be found below:
> [!TIP]
> We can observe that
> - The WADA-SNR distribution on each dataset remains almost the same before (Figure 2) and after (Figure 2a) applying the baseline TF-GridNet model for enhancement. This indicates that the label noise issue (as discussed in Section 2.1 of the URGENT 2024 analysis paper) may persist in the enhanced samples.
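For reference, the core of WADA-SNR is a scale-invariant statistic of the waveform amplitude distribution, G = log E[|z|] − E[log |z|], which is mapped to an SNR value through a precomputed table. Below is a minimal self-contained numpy sketch that rebuilds that table by Monte-Carlo simulation of the signal model from the original WADA-SNR paper (Gamma-distributed speech amplitudes with shape 0.4 plus Gaussian noise); the function names are ours, and the actual implementation in `wada_snr/` may differ (e.g., it may use the original precomputed lookup table instead):

```python
import numpy as np

def build_wada_table(snr_db_grid, n=200_000, seed=0):
    """Monte-Carlo version of the WADA lookup table: for each candidate SNR,
    simulate speech (Gamma-distributed amplitudes, shape 0.4, random signs)
    plus Gaussian noise and record G = log E[|z|] - E[log |z|]."""
    rng = np.random.default_rng(seed)
    amp = rng.gamma(shape=0.4, scale=1.0, size=n)    # speech amplitudes
    speech = amp * rng.choice([-1.0, 1.0], size=n)   # attach random signs
    speech /= speech.std()                           # ~unit-power speech
    noise = rng.standard_normal(n)                   # unit-power noise
    table = []
    for snr_db in snr_db_grid:
        z = np.maximum(np.abs(speech + 10 ** (-snr_db / 20) * noise), 1e-10)
        table.append(np.log(z.mean()) - np.log(z).mean())
    return np.asarray(table)

def wada_snr(wav, snr_db_grid=np.arange(-20, 101)):
    """Blindly estimate the SNR (in dB) of a single-channel waveform."""
    z = np.maximum(np.abs(np.asarray(wav, dtype=float)), 1e-10)
    g = np.log(z.mean()) - np.log(z).mean()          # scale-invariant statistic
    table = build_wada_table(snr_db_grid)
    # G grows monotonically with SNR, so pick the nearest grid point
    return float(snr_db_grid[np.argmin(np.abs(table - g))])
```

Note that the table is rebuilt on every call here only for brevity; a real implementation would precompute and cache it. The grid above matches the [-20, 100] dB value range listed for WADA-SNR in the metric table below.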
The tag occurrences of different samples in the non-blind and blind test sets can be found below:
This is the enlarged version of Figure 3 in the URGENT 2024 analysis paper.
For each tag (on the $x$-axis), there are two bars:
- the left one represents the number of occurrences of the tag in the non-blind test set,
- while the right one (hatched) represents the number of occurrences of the tag in the blind test set.
The color of each bar, as well as the numbers (`num1 : num2`) above it, indicates the number of occurrences of the corresponding tag in *other* and *hard* samples, respectively.
The detailed definitions of the *hard* samples and the tags can be found in `tagging/README.md`.
We also calculated the correlation between the human-annotated Mean Opinion Scores (MOS) and different objective metrics on the blind test data. The objective metrics include:
The metrics in orange (DNSMOS Pro, UTMOS, WV-MOS, SCOREQ, VQScore, and WADA-SNR) were not used during the challenge. The implementations of these additional metrics are provided in `mos/`. The remaining metrics were officially used in the challenge, and their implementations are available at https://github.com/urgent-challenge/urgent2024_challenge/tree/main/evaluation_metrics.
| Category | Metric | Need Reference Signals? | Supported Sampling Frequencies | Value Range | Run on CPU or GPU? |
|---|---|---|---|---|---|
| Non-intrusive SE metrics | DNSMOS ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | NISQA ↑ | ❌ | 48 kHz | [1, 5] | CPU or GPU |
| | DNSMOS Pro ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | UTMOS ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | WV-MOS ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | SCOREQ ↑ | ❌ | 16 kHz | [1, 5] | CPU or GPU |
| | VQScore ↑ | ❌ | 16 kHz | [-1, 1] | CPU or GPU |
| | WADA-SNR ↑ | ❌ | Any | [-20, 100] | CPU |
| Intrusive SE metrics | PESQ ↑ | ✔ | {8, 16} kHz | [-0.5, 4.5] | CPU |
| | ESTOI ↑ | ✔ | 10 kHz | [0, 1] | CPU |
| | SDR ↑ | ✔ | Any | (-∞, +∞) | CPU |
| | MCD ↓ | ✔ | Any | [0, +∞) | CPU |
| | LSD ↓ | ✔ | Any | [0, +∞) | CPU |
| | POLQA ↑ | ✔ | 8–48 kHz | [1, 5] | CPU (proprietary GUI program) |
| Downstream-task-independent metrics | SpeechBERTScore ↑ | ✔ | 16 kHz | [-1, 1] | CPU or GPU |
| | Levenshtein phone similarity (LPS) ↑ | ✔ | 16 kHz | (-∞, 1] | CPU or GPU |
| Downstream-task-dependent metrics | SpkSim ↑ | ✔ | 16 kHz | [-1, 1] | CPU or GPU |
| | WAcc (=1-WER) ↑ | ❌ | 16 kHz | (-∞, 1] | CPU or GPU |
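To illustrate how some of the intrusive metrics above can be computed, here is a sketch using the widely available open-source `pesq` and `pystoi` packages together with a plain SDR implementation. These packages are not necessarily identical to the challenge's official implementations (linked above), which should be used for exact reproduction; the signals below are random placeholders:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def sdr(ref, est, eps=1e-10):
    """Plain signal-to-distortion ratio in dB (no scale-invariant projection)."""
    return 10 * np.log10((ref**2).sum() / (((ref - est) ** 2).sum() + eps) + eps)

fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(2 * fs).astype(np.float32)                 # placeholder reference
est = (ref + 0.1 * rng.standard_normal(2 * fs)).astype(np.float32)   # placeholder enhanced signal

print(f"SDR:   {sdr(ref, est):.2f} dB")
print(f"PESQ:  {pesq(fs, ref, est, 'wb'):.2f}")           # wide-band PESQ at 16 kHz
print(f"ESTOI: {stoi(ref, est, fs, extended=True):.3f}")  # pystoi resamples to 10 kHz internally
```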
The MOS labels for each original/enhanced speech sample in the blind test set are released at https://huggingface.co/datasets/urgent-challenge/urgent2024_mos. The correlations between MOS and different objective metrics in the blind test data are shown below:
This is the refined version of Figure 4 in the URGENT 2024 analysis paper, with two major changes:
- A new correlation measure, Spearman's rank correlation coefficient (SRCC), is added.
- The overall ranking score calculated based solely on objective metrics (those colored in blue) is added, denoted as `Overall ranking score (w/o MOS)` in the figure.
> [!TIP]
> We can observe that
> - Both `Overall ranking score (w/o MOS)` and `Overall ranking score` are highly correlated with the human-annotated MOS, outperforming all individual objective metrics officially used in the challenge in terms of KRCC and SRCC. This highlights the importance and effectiveness of the comprehensive evaluation protocol design in the challenge.
> - Several non-intrusive metrics that were not specifically designed for the universal speech enhancement task (and were not used in the challenge), such as UTMOS and SCOREQ, also show very strong correlations with the human-annotated MOS, indicating their potential for future SE research.
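For reference, rank-correlation measures such as the SRCC and KRCC discussed above (along with the linear correlation coefficient) can be computed with `scipy.stats`. A minimal sketch, where `mos` and `metric` are hypothetical aligned per-sample arrays of MOS labels and one objective metric (the values below are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical aligned per-sample scores on the blind test set
mos = np.array([3.2, 4.1, 2.8, 3.9, 4.5, 3.0])
metric = np.array([2.9, 4.0, 3.1, 3.7, 4.6, 2.7])

lcc, _ = stats.pearsonr(mos, metric)     # linear correlation coefficient (LCC)
srcc, _ = stats.spearmanr(mos, metric)   # Spearman's rank correlation (SRCC)
krcc, _ = stats.kendalltau(mos, metric)  # Kendall's rank correlation (KRCC)
print(f"LCC={lcc:.3f}  SRCC={srcc:.3f}  KRCC={krcc:.3f}")
```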
If you find this repository useful, please consider citing our paper:
```bibtex
@inproceedings{Lessons-Zhang2025,
  title={Lessons Learned from the {URGENT} 2024 Speech Enhancement Challenge},
  author={Zhang, Wangyou and Saijo, Kohei and Cornell, Samuele and Scheibler, Robin and Li, Chenda and Ni, Zhaoheng and Kumar, Anurag and Sach, Marvin and Wang, Wei and Fu, Yihui and Watanabe, Shinji and Fingscheidt, Tim and Qian, Yanmin},
  booktitle={Proc. Interspeech},
  pages={853--857},
  year={2025},
  doi={10.21437/Interspeech.2025-1246},
  url={https://www.isca-archive.org/interspeech_2025/zhang25j_interspeech.html},
}
```