Diffusion-Based Neural Speech Coding

Recently, neural speech codecs (NSCs) trained as generative models have shown superior performance compared to conventional codecs at low bitrates.
The most prominent NSCs are trained as Generative Adversarial Networks (GANs), a well-known paradigm in Generative Artificial Intelligence (GenAI).

Among GenAI paradigms, Diffusion Models (DMs) represent a promising alternative, owing to their superior performance over GANs in image generation and their success in audio applications, including audio and speech coding.

However, the design of diffusion-based NSCs has not yet been explored in a systematic way. We addressed this gap in the research literature with a comprehensive analysis of diffusion-based NSCs, in which we

  1. proposed a categorization based on the conditioning and output domains of the DM (see the sketch after this list)
  2. systematically investigated unexplored designs by creating and evaluating new diffusion-based NSCs within this conceptual framework
  3. compared the proposed models to existing GAN and DM baselines through objective metrics and subjective listening tests
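
As a rough illustration of this categorization, the sketch below enumerates the conditioning/output domain pairs over the three representations named later in this post. It is illustrative only; the exact set of configurations studied is defined in the paper.

```python
# Illustrative sketch of the design space described above: each diffusion-based
# NSC configuration is identified by the domain of the DM's conditioning input
# and the domain of its output. The three domains are those named in this post;
# the configurations actually evaluated are specified in the paper.
DOMAINS = ("waveform", "mel", "latent")

configs = [f"{cond}2{out}" for cond in DOMAINS for out in DOMAINS]
print(configs)  # includes e.g. 'mel2mel', the best-performing configuration
```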

Findings in a nutshell

In our experiments, we considered three audio representations: the waveform, the magnitude mel-spectrogram, and a latent representation learned by an NSC.
The best configuration proved to be “mel2mel”, where the DM receives a quantized magnitude mel-spectrogram as conditioning input and outputs a “clean” magnitude mel-spectrogram, which is then fed to a neural vocoder (HiFiGAN) to obtain the decoded waveform. Although this configuration was the best among the diffusion-based codecs we investigated, “mel2mel” did not outperform a GAN-based baseline trained from scratch on the same data.
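
To make the decoding path concrete, here is a minimal sketch of the “mel2mel” configuration. All interfaces (`mel2mel_decode`, `diffusion_model.sample`, `vocoder`) are hypothetical placeholders, not the paper's implementation; the sketch only assumes a DM that samples a clean mel-spectrogram conditioned on the quantized one, plus a pretrained HiFiGAN-style vocoder.

```python
import torch

def mel2mel_decode(quantized_mel: torch.Tensor, diffusion_model, vocoder) -> torch.Tensor:
    """Sketch of the "mel2mel" decoding path (hypothetical interfaces).

    quantized_mel   -- degraded magnitude mel-spectrogram from the
                       encoder/quantizer, shape (batch, n_mels, frames)
    diffusion_model -- assumed DM exposing sample(condition=...), which
                       iteratively denoises a clean mel-spectrogram
                       conditioned on the quantized one
    vocoder         -- neural vocoder (e.g., HiFiGAN) mapping mel -> waveform
    """
    # 1) The DM restores a "clean" mel-spectrogram from the quantized conditioning.
    clean_mel = diffusion_model.sample(condition=quantized_mel)
    # 2) A pretrained neural vocoder synthesizes the decoded waveform.
    return vocoder(clean_mel)
```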

We presented our contribution in September 2025 at EUSIPCO, Palermo.

Read the full paper: On the Design of Diffusion-based Neural Speech Codecs

Our work has been supported by the Free State of Bavaria in the DSgenAI project.

Authors: Pietro Foti and Andreas Brendel from Fraunhofer IIS
