Hybrid Deep Learning Model for Singing Voice Separation

Rusul Amer; Ahmed Al Tmeme

doi:10.13164/mendel.2021.2.044

Rusul Amer, Information and Communication Eng. Dept., Al Khwarizmi Eng. College, University of Baghdad, Iraq
Ahmed Al Tmeme, Information and Communication Eng. Dept., Al Khwarizmi Eng. College, University of Baghdad, Iraq

Keywords: Monaural Source Separation, Hybrid Deep Learning, Time Frequency Masking, Convolution Neural Network, Dense Neural Network, Recurrent Neural Network

Abstract

Monaural source separation is challenging because only a single channel is available, yet an unlimited range of solutions is possible. In this paper, we present a monaural source separation model based on a hybrid deep learning architecture that combines a convolutional neural network (CNN), a dense neural network (DNN), and a recurrent neural network (RNN). The number of layers in the proposed model is chosen by trial and error. Moreover, the effects of the learning rate, the optimization algorithm, and the number of epochs on separation performance are explored. The model was evaluated on the MIR-1K dataset for singing voice separation, where it achieves gains of 4.81 dB in GNSDR, 7.28 dB in GSIR, and 3.39 dB in GSAR in comparison to current approaches.
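
In the MIR-1K singing voice separation literature, global metrics such as GNSDR, GSIR, and GSAR are conventionally computed as length-weighted averages of the per-clip values, with NSDR defined as the SDR of the estimate minus the SDR of the unprocessed mixture. The following is a minimal Python sketch under that assumption. The function names and the per-clip numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def nsdr(sdr_estimate_db: float, sdr_mixture_db: float) -> float:
    """Normalized SDR: improvement of the estimate over the raw mixture (dB)."""
    return sdr_estimate_db - sdr_mixture_db


def global_weighted_mean(values_db: Sequence[float],
                         lengths: Sequence[float]) -> float:
    """Length-weighted mean of per-clip metric values (dB).

    Assumed convention: each clip is weighted by its duration, so longer
    clips contribute more to the global score.
    """
    if len(values_db) != len(lengths) or len(values_db) == 0:
        raise ValueError("values_db and lengths must be non-empty and equal in size")
    total = sum(lengths)
    if total <= 0:
        raise ValueError("total clip length must be positive")
    return sum(v * w for v, w in zip(values_db, lengths)) / total


# Hypothetical per-clip SDR values (dB) and clip durations (seconds).
clip_nsdr = [nsdr(8.0, 2.0), nsdr(5.0, 3.0)]  # per-clip NSDR: 6.0 and 2.0
clip_lengths = [4.0, 12.0]
gnsdr = global_weighted_mean(clip_nsdr, clip_lengths)
print(f"GNSDR = {gnsdr:.2f} dB")
```

The same `global_weighted_mean` helper would apply to per-clip SIR and SAR values to obtain GSIR and GSAR under the same weighting assumption.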
