SE phase reconstruction

This is the sample page for our ICASSP 2024 submission 'Phase reconstruction in single channel speech enhancement based on phase gradients and estimated clean-speech amplitudes'.

Audio samples

We will demonstrate the benefit of the proposed method in the following two aspects:

First, we show the improvement obtained by our proposed speech-enhancement-agnostic phase estimation method over the fully synthetic phase retrieval proposed in [1];
Second, we demonstrate the benefit of applying the proposed phase estimation on top of a speech enhancement approach.

All noisy samples are individually normalised to -26 dBov based on the active speech level (ASL).
We recommend using neutral-sounding headphones to better discern the differences among the presented methods.
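The level normalisation mentioned above can be sketched as follows. This is a simplified illustration, not the measurement used for the samples on this page: a formal ASL measurement (for example, ITU-T P.56) uses an envelope-based activity detector, whereas this sketch assumes a plain frame-energy threshold. The function name, `frame_len` and `activity_db` are hypothetical choices, and the signal is assumed to be floating point with full scale at ±1, so that a mean power of 1.0 corresponds to 0 dBov.

```python
import numpy as np


def normalise_to_dbov(signal, target_dbov=-26.0, frame_len=320, activity_db=-50.0):
    """Scale `signal` so that the mean power of its active frames is `target_dbov`.

    Assumption: frames whose mean power is above `activity_db` dBov count as
    active speech. This is a crude stand-in for a proper ASL measurement.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")

    # Split into non-overlapping frames and compute the mean power per frame.
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    frame_power = np.mean(frames ** 2, axis=1)

    # Keep only frames above the (assumed) activity threshold.
    active = frame_power > 10.0 ** (activity_db / 10.0)
    if not np.any(active):
        raise ValueError("no active frames found")

    # Level of the active part in dB relative to full scale (power reference 1.0).
    level_db = 10.0 * np.log10(np.mean(frame_power[active]))

    # Apply one broadband gain that moves the active level to the target.
    gain = 10.0 ** ((target_dbov - level_db) / 20.0)
    return signal * gain
```

Because a single gain is applied to the whole file, the SNR of each sample is unchanged; only its playback level is aligned across samples.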

Using synthesised phase [1] vs the proposed reconstructed phase

Here we demonstrate the effect of using the synthetic phase retrieved by the algorithm proposed in [1].
Using the purely synthesised phase leads to an unnatural output, especially at high SNR, where the underlying signal phase is not heavily distorted by the noise. In comparison, our proposed method provides a more natural-sounding output. The effect of the phase enhancement is especially perceptible in voiced segments, which exhibit fewer "vocoding-like" artefacts. Since our method combines the predicted phase with that of the mixture signal, it helps preserve naturalness when the underlying signal is less corrupted by noise.

Methods	Noisy input	Real CRUSE	Real CRUSE-synthetic	Real CRUSE-agnostic	Clean ref

Improvement by the proposed phase reconstruction

Now, we demonstrate the benefit of using the estimated phase obtained from the proposed method.
The samples below are processed by:

Real CRUSE: CRUSE net [2] predicting a real-valued mask from the noisy magnitude;
Real CRUSE - agnostic: reconstructing the phase based on the magnitude estimated by real CRUSE. The DNNs for phase gradient estimation are trained in an SE-agnostic manner, i.e., on magnitudes of clean speech;
Real CRUSE - clean phase: combining the magnitude estimated by real CRUSE with the clean signal phase. We consider this the performance upper bound of phase reconstruction;
Complex CRUSE: CRUSE net [2] predicting a complex mask. To enable complex mask prediction, the network takes the real and imaginary parts of the noisy STFT as input and employs the hyperbolic tangent function as the final activation function. Other parts of the network and the training scheme are kept the same as in [2];
Complex CRUSE - agnostic: reconstructing the phase based on the magnitude estimated by complex CRUSE. The DNNs for phase gradient estimation are trained in an SE-agnostic manner, i.e., on magnitudes of clean speech;
Complex CRUSE - clean phase: combining the magnitude estimated by complex CRUSE with the clean signal phase. We consider this the performance upper bound of phase reconstruction.
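The variants listed above differ only in which mask is applied and which phase is attached to the estimated magnitude. The sketch below illustrates this with plain NumPy arrays standing in for STFTs. It assumes the usual polar recombination (magnitude times a unit-modulus phase term) and element-wise masking. The function names are hypothetical, and the phase reconstruction itself is not shown.

```python
import numpy as np


def apply_real_mask(noisy_stft, mask):
    """Return the magnitude estimate: a real-valued mask scales the noisy magnitude."""
    return np.abs(noisy_stft) * mask


def apply_complex_mask(noisy_stft, mask_re, mask_im):
    """Return the complex estimate from a complex mask applied by complex multiplication.

    With a tanh output activation, mask_re and mask_im each lie in [-1, 1].
    """
    return (mask_re + 1j * mask_im) * noisy_stft


def combine_magnitude_phase(magnitude, phase):
    """Attach a phase (in radians) to a magnitude estimate."""
    return magnitude * np.exp(1j * phase)
```

Under these assumptions, the baseline output with the noisy phase is `combine_magnitude_phase(mag, np.angle(noisy_stft))`, the "clean phase" upper bound is `combine_magnitude_phase(mag, np.angle(clean_stft))`, and the "agnostic" variants pass the reconstructed phase as the second argument. For the complex-mask case, `mag` would be `np.abs(apply_complex_mask(...))`.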

Note that the effect of the phase reconstruction is most clearly perceived as a reduction of "vocoding-like" artefacts, which occur when noise is present between harmonics. This is also visible in the spectrograms.
Naturally, if the initial phase estimate is good, less difference is observed between the speech estimate with the noisy phase and the one with the reconstructed phase.

For ease of observation, all spectrograms are zoomed in to the range [0, 4] kHz.
Hover over the spectrogram of the proposed method to see the difference from using the noisy phase.
Click the spectrogram to enlarge/reset it.
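To reproduce the zoomed view on your own recordings, a one-sided spectrogram can be restricted to the [0, 4] kHz band as sketched below. The function name, the array layout (frequency bins along the first axis, starting at 0 Hz) and the parameters are assumptions for illustration, not the plotting code used for this page.

```python
import numpy as np


def crop_to_band(spec, sample_rate, n_fft, f_max_hz=4000.0):
    """Keep the rows of a one-sided spectrogram whose centre frequency is <= f_max_hz.

    `spec` is assumed to have shape (n_fft // 2 + 1, n_frames).
    """
    # Centre frequency of bin k is k * sample_rate / n_fft.
    freqs = np.arange(spec.shape[0]) * sample_rate / n_fft
    keep = freqs <= f_max_hz
    return spec[keep, :], freqs[keep]
```

With these assumed conventions, a 16 kHz signal analysed with a 512-point FFT has a bin spacing of 31.25 Hz, so the cropped view keeps bins 0 to 128.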

Methods	Noisy input	Real CRUSE	Real CRUSE - agnostic	Real CRUSE - clean phase	Complex CRUSE	Complex CRUSE - agnostic	Complex CRUSE - clean phase	Clean ref

References

[1] Y. Masuyama, K. Yatabe, K. Nagatomo, and Y. Oikawa, "Online phase reconstruction via DNN-based phase differences estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 163–176, 2022.
[2] S. Braun, H. Gamper, C. K. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 656–660.

ASPIRE@IDLAB
