Background
Speech enhancement is the task of recovering clean speech from noisy speech that has been degraded by, e.g., background noise, interfering speakers, or reverberation in poor acoustic conditions. Deep learning-based speech enhancement is typically trained in a supervised setup using datasets of separate clean speech and background noise samples: the two are mixed at training time, and the network is trained to recover the clean speech directly from the artificially corrupted mixture. This artificial mixing step limits the diversity of data that models are exposed to and harms the resulting model's ability to generalize to real-world conditions.
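For reference, a minimal sketch of this standard supervised setup (on-the-fly mixing at a random SNR, followed by a reconstruction loss). The model, tensor shapes, SNR range, and the waveform L1 loss are illustrative assumptions, not prescribed choices:

```python
import torch
import torch.nn.functional as F

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it to `clean`."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def supervised_step(model, clean, noise, optimizer):
    snr_db = float(torch.empty(1).uniform_(-5.0, 15.0))  # random SNR per example
    noisy = mix_at_snr(clean, noise, snr_db)              # artificial mixture
    estimate = model(noisy)                               # network predicts the clean waveform
    loss = F.l1_loss(estimate, clean)                     # e.g. waveform L1; spectral losses are also common
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```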
In this project we will investigate unsupervised alternatives to the usual speech enhancement setup. Using unpaired clean speech and real-world noisy speech data, the goal is to train a neural network to convert noisy speech into clean speech. Large noisy and clean speech datasets are openly available (e.g. LibriSpeech/GigaSpeech/Common Voice and AudioSet/FSD50K). Speech enhancement quality can be evaluated with objective metrics (e.g. SI-SDR) on paired datasets (e.g. LibriSpeech mixed with WHAM! or DEMAND noise), and with subjective metrics such as predicted MOS (mean opinion score).
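As an example of an objective metric, a small sketch of SI-SDR (scale-invariant signal-to-distortion ratio) computed directly from its definition on zero-mean signals; this is only for illustration, and an established implementation (e.g. from torchmetrics) could be used instead:

```python
import torch

def si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """SI-SDR in dB between a 1-D estimated waveform and the reference target."""
    # SI-SDR is defined on zero-mean signals
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # project the estimate onto the target to get the scaled target component
    alpha = (estimate * target).sum() / (target.pow(2).sum() + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
```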
One potential approach is to base the network on recent neural audio codecs, which are high-fidelity pre-trained autoencoders operating on raw waveform audio, and fine-tune these networks to produce disentangled latent variables for the speech and noise components of the waveform. The disentanglement can be induced by a discriminator penalizing the network for leaking non-speech information into a speech-specific subset of the latent variables. Unsupervised speech enhancement can then be performed by encoding noisy speech and reconstructing the waveform using only the speech components of the encoded latent variables.
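A hedged sketch of what this enhancement step could look like, assuming a hypothetical pre-trained codec encoder/decoder and assuming the fine-tuning has assigned the first `n_speech` latent channels to speech; the interface and the channel-split convention are illustrative assumptions:

```python
import torch

@torch.no_grad()
def enhance(encode, decode, noisy_waveform: torch.Tensor, n_speech: int) -> torch.Tensor:
    """Encode noisy speech and decode only the speech-specific latents."""
    z = encode(noisy_waveform)         # latents, e.g. shape (batch, channels, frames)
    z_speech_only = z.clone()
    z_speech_only[:, n_speech:] = 0.0  # discard the non-speech part of the latents
    return decode(z_speech_only)       # reconstruct the speech component only
```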
Objective(s)
i) Training a neural network to perform speech enhancement using unpaired clean and noisy speech.
ii) Evaluating speech enhancement quality using objective (e.g. SI-SDR) and subjective (e.g. predicted MOS) metrics.
iii) (Optional) Using the discriminator to also penalize the enhancement network for leaking speech information into non-speech latent variables.
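A minimal sketch of how the adversarial leakage penalty described in the Background (and extended symmetrically in objective iii) could be set up; the discriminator architecture, latent shapes, and the noisy-vs-clean labeling scheme are illustrative assumptions rather than a prescribed design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiscriminator(nn.Module):
    """Predicts whether a speech-latent sequence came from noisy or clean audio."""
    def __init__(self, n_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, z):                    # z: (batch, n_channels, frames)
        return self.net(z).mean(dim=(1, 2))  # one logit per example

def encoder_leakage_loss(disc, z_speech_from_noisy):
    # the encoder is rewarded when the speech latents of noisy audio look "clean",
    # i.e. carry no detectable noise information
    logits = disc(z_speech_from_noisy)
    return F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))

def discriminator_loss(disc, z_speech_from_noisy, z_speech_from_clean):
    # the discriminator is trained to tell the two apart (latents are detached
    # so this loss does not update the encoder)
    noisy_logits = disc(z_speech_from_noisy.detach())
    clean_logits = disc(z_speech_from_clean.detach())
    return (F.binary_cross_entropy_with_logits(noisy_logits, torch.ones_like(noisy_logits))
            + F.binary_cross_entropy_with_logits(clean_logits, torch.zeros_like(clean_logits)))
```

Objective iii) adds the symmetric penalty: a second discriminator on the non-speech latents discouraging them from carrying speech information.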
Requirements
Need to have:
- Familiarity with practical deep learning in Python (e.g. 02456 curriculum).
Nice to have:
- Experience with adversarial learning, GANs, etc.
- Experience with variational autoencoders.
- Experience working with audio data.
Maximum number of students
3-4 students max
Supervisors
Bjørn Sand Jensen (main supervisor)
Morten Mørup? (co-supervisor)
Kenny Falkær Olsen (co-supervisor)
Contact information
Name
Address
Phone nr.
References
UnSE: Unsupervised Speech Enhancement using Optimal Transport:
https://www.isca-speech.org/archive/interspeech_2023/jiang23b_interspeech.html
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss:
High-Fidelity Audio Compression with Improved RVQGAN: