Speaker Diarization with Deep Learning Techniques
Abstract
Speaker diarization is the task of determining "who spoke when" in an audio or video recording that contains multiple speakers. Deep Learning (DL), a rapidly emerging subfield of Machine Learning (ML), has been applied effectively to a wide range of real-world problems in artificial intelligence (AI), including natural language processing, image processing, computer vision, speech and speaker recognition, emotion recognition, and cyber security. In speaker diarization, as in speaker recognition, DL techniques have recently outperformed conventional approaches. Diarization assigns class labels corresponding to speaker identity to regions of a speech recording, partitioning the audio into per-speaker segments, and is a crucial step in speech processing. This paper presents an in-depth analysis of speaker diarization using a variety of deep learning algorithms. Experiments were conducted on two speech corpora: NIST-2000 CALLHOME and our in-house database ALSD-DB. The baseline systems use TDNN-based embeddings (x-vectors), LSTM-based embeddings (d-vectors), and finally a fusion of the x-vector and d-vector embeddings. On the NIST-2000 CALLHOME database, the LSTM-based d-vector embeddings and the fused x-vector/d-vector embeddings show improved performance, with DERs of 8.25% and 7.65%, respectively; on the local ALSD-DB database they achieve DERs of 10.45% and 9.65%.
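The pipeline described above (extract per-segment embeddings, optionally fuse x-vectors and d-vectors, cluster segments by speaker, then score against a reference with the Diarization Error Rate) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's actual system: the greedy cosine clustering and the simplified DER (oracle segment boundaries, no overlap, so only speaker-confusion time is counted) are stand-ins for the TDNN/LSTM extractors and the NIST scoring tool, and all function names and thresholds here are our own assumptions.

```python
import numpy as np
from itertools import permutations

def fuse_embeddings(x_vec, d_vec):
    """Fuse embeddings by length-normalizing the x-vectors and d-vectors
    separately, then concatenating them segment by segment."""
    x = x_vec / np.linalg.norm(x_vec, axis=1, keepdims=True)
    d = d_vec / np.linalg.norm(d_vec, axis=1, keepdims=True)
    return np.concatenate([x, d], axis=1)

def greedy_cluster(emb, threshold=0.5):
    """Toy online clustering: assign each segment to the most similar
    existing centroid if cosine similarity exceeds the threshold,
    otherwise open a new speaker cluster."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    centroids, labels = [], []
    for e in emb:
        if centroids:
            sims = [float(e @ c) / np.linalg.norm(c) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels.append(best)
                centroids[best] = centroids[best] + e  # running sum centroid
                continue
        centroids.append(e.copy())
        labels.append(len(centroids) - 1)
    return np.array(labels)

def der_oracle_segments(ref, hyp, dur):
    """Simplified DER assuming oracle segment boundaries and no overlap:
    misattributed time / total time, after optimally mapping hypothesis
    cluster labels to reference speaker labels."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    dur = np.asarray(dur, dtype=float)
    ref_ids, hyp_ids = sorted(set(ref)), sorted(set(hyp))
    best_err = dur.sum()
    for perm in permutations(ref_ids, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        mapped = np.array([mapping[h] for h in hyp])
        best_err = min(best_err, dur[mapped != ref].sum())
    return best_err / dur.sum()
```

For example, clustering ten synthetic segments drawn around two well-separated embedding directions recovers the two speakers, and flipping one hypothesis label on ten equal-length segments yields a DER of 10%, mirroring how the percentages reported above are interpreted.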

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.