This is the sample page for our ICASSP 2024 submission 'Phase reconstruction in single channel speech enhancement based on phase gradients and estimated clean-speech amplitudes'.
Audio samples
We demonstrate the benefit of the proposed method in two respects:
- First, we show the improvement obtained by our proposed speech-enhancement-agnostic method for phase estimation over the fully synthetic phase retrieval proposed in [1];
- Second, we demonstrate the benefit of applying the proposed phase estimation on top of a speech enhancement approach.
Each noisy sample is individually normalised to an active speech level (ASL) of -26 dBov.
We recommend using neutral-sounding headphones to better discern the differences among the presented methods.
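The level normalisation described above can be sketched roughly as follows. This is only an illustration, not the tool used to prepare the samples: it does not reproduce a standardised active-level meter such as ITU-T P.56, and instead substitutes a fixed per-frame power threshold to decide which frames count as active. The function names, the frame length and the threshold value are all assumptions made for this example. Samples are assumed to be floats with full scale at ±1.0, so that a mean power of 1.0 corresponds to 0 dBov.

```python
import math


def active_speech_level_dbov(samples, frame_len=160, threshold_db=-50.0):
    """Crude active-level estimate: mean power over frames above a threshold.

    Assumption: a frame is "active" if its power exceeds threshold_db (dBov).
    A standardised meter uses a more elaborate activity criterion.
    """
    active_powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        if power > 0.0 and 10.0 * math.log10(power) > threshold_db:
            active_powers.append(power)
    if not active_powers:
        raise ValueError("no active frames found")
    return 10.0 * math.log10(sum(active_powers) / len(active_powers))


def normalise_to_level(samples, target_dbov=-26.0, **level_kwargs):
    """Apply one broadband gain so the estimated active level hits the target."""
    gain_db = target_dbov - active_speech_level_dbov(samples, **level_kwargs)
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * s for s in samples]
```

Because a single gain is applied to the whole file, pauses stay silent and the relative level between speech and noise within a sample is unchanged.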
Using synthesised phase [1] vs the proposed reconstructed phase
Here we demonstrate the effect of using the synthetic phase retrieved by the algorithm proposed in [1].
Using the purely synthesised phase leads to an unnatural output, especially at high SNR, where the underlying signal phase is not heavily distorted by the noise. In comparison, our proposed method provides a more natural-sounding output. The effect of the phase enhancement is especially perceivable in voiced segments, which exhibit fewer "vocoding-like" artefacts. Since our method combines the predicted phase with that of the mixture signal, it helps preserve naturalness when the underlying signal is less corrupted by noise.
Methods |
Noisy input |
Real CRUSE |
Real CRUSE - synthetic |
Real CRUSE - agnostic |
Clean ref |
Improvement by the proposed phase reconstruction
Now, we demonstrate the benefit of using the estimated phase obtained from the proposed method.
The samples below are processed by:
- Real CRUSE: CRUSE net [2] predicting a real-valued mask from the noisy magnitude;
- Real CRUSE - agnostic: reconstructing the phase based on the magnitude estimated by real CRUSE. The DNNs for phase gradient estimation are trained in an SE-agnostic manner, i.e., on magnitudes of clean speech;
- Real CRUSE - clean phase: combining the magnitude estimated by real CRUSE and the clean signal phase. We consider this the performance upper bound of phase reconstruction;
- Complex CRUSE: CRUSE net [2] predicting a complex mask. To enable complex mask prediction, the network takes the real and imaginary parts of the noisy STFT as input and employs a hyperbolic tangent as the final activation function. The other parts of the network and the training scheme are kept the same as in [2];
- Complex CRUSE - agnostic: reconstructing the phase based on the magnitude estimated by complex CRUSE. The DNNs for phase gradient estimation are trained in an SE-agnostic manner, i.e., on magnitudes of clean speech;
- Complex CRUSE - clean phase: combining the magnitude estimated by complex CRUSE and the clean signal phase. We consider this the performance upper bound of phase reconstruction.
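To make the difference between these variants concrete, the following minimal sketch operates on a handful of STFT bins represented as Python complex numbers. It is an illustration under stated assumptions, not the implementation behind the samples: all function names are hypothetical, and applying the complex mask by complex multiplication is an assumption, since the text above only states that the mask is complex and that the final activation is a hyperbolic tangent. The network itself and the phase gradient DNNs are not modelled.

```python
import cmath
import math


def apply_real_mask(noisy_stft, mask):
    """Scale each noisy bin by a real-valued gain; the noisy phase is kept."""
    return [m * x for m, x in zip(mask, noisy_stft)]


def bounded_mask_component(raw_output):
    """tanh as final activation keeps a mask component inside (-1, 1)."""
    return math.tanh(raw_output)


def apply_complex_mask(noisy_stft, mask_re, mask_im):
    """Assumed application rule: complex multiplication, which changes
    both the magnitude and the phase of each noisy bin."""
    return [complex(mr, mi) * x
            for mr, mi, x in zip(mask_re, mask_im, noisy_stft)]


def combine_magnitude_and_phase(estimate, phase_source):
    """Keep |estimate| and take the phase from phase_source.

    With the clean STFT as phase_source this gives the "clean phase"
    (upper bound) variants; with a reconstructed phase it would give
    the "agnostic" variants.
    """
    return [cmath.rect(abs(e), cmath.phase(p))
            for e, p in zip(estimate, phase_source)]
```

For example, `combine_magnitude_and_phase(apply_real_mask(noisy, mask), clean)` would correspond to the "Real CRUSE - clean phase" condition for bins `noisy`, gains `mask` and reference bins `clean`.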
Note that the effect of the phase reconstruction is most clearly perceivable as a reduction of "vocoding-like" artefacts, which occur when noise is present between harmonics. This is also visible in the spectrograms.
As expected, when the initial phase estimate is already good, less difference is observed between the speech estimate with the noisy phase and the one with the reconstructed phase.
For ease of observation, all spectrograms are zoomed to the range [0, 4] kHz.
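As a small illustration of the zoom, restricting a one-sided spectrogram to 0-4 kHz amounts to keeping only the lowest frequency bins. The sketch below assumes a spectrogram stored as a list of frames, each a list of bin values ordered from 0 Hz upwards; the function names are hypothetical, and the sampling rate and FFT size are parameters because the page does not state them.

```python
def bins_up_to(max_freq_hz, sample_rate_hz, n_fft):
    """Number of one-sided STFT bins whose centre frequency is <= max_freq_hz."""
    bin_width_hz = sample_rate_hz / n_fft
    last_bin = min(int(max_freq_hz // bin_width_hz), n_fft // 2)
    return last_bin + 1  # bin 0 (DC) is included


def crop_spectrogram(spectrogram, max_freq_hz, sample_rate_hz, n_fft):
    """Keep only the low-frequency bins of every frame."""
    keep = bins_up_to(max_freq_hz, sample_rate_hz, n_fft)
    return [frame[:keep] for frame in spectrogram]
```

With an assumed 16 kHz sampling rate and a 512-point FFT, for instance, this keeps 129 of the 257 one-sided bins.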
By moving your mouse over the spectrogram of the proposed method, you can see the difference to using the noisy phase.
Methods |
Noisy input |
Real CRUSE |
Real CRUSE - agnostic |
Real CRUSE - clean phase |
Complex CRUSE |
Complex CRUSE - agnostic |
Complex CRUSE - clean phase |
Clean ref |
References
[1] Y. Masuyama, K. Yatabe, K. Nagatomo, and Y. Oikawa, "Online phase reconstruction via DNN-based phase differences estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 163-176, 2022.
[2] S. Braun, H. Gamper, C. K. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 656-660.