Turkish Journal of Electrical Engineering and Computer Sciences
DOI
10.3906/elk-1901-125
Abstract
Speaker diarization aims to determine "who spoke when?" in multispeaker recordings. In this paper, we propose learning a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. The embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames, and are then used in Gaussian mixture model based hierarchical clustering for diarization. The results show that these unsupervised embeddings yield a lower diarization error rate than MFCCs. Experiments conducted on the popular subset of the AMI meeting corpus, consisting of 5.4 h of recordings, show that the new embeddings decrease the average diarization error rate by 2.96%; for individual recordings, the maximum improvement is 8.05%.
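The abstract reports its results as diarization error rate (DER) but does not define the metric. As a rough illustration only, the sketch below computes a frame-level DER under the standard definition: missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The function name `frame_der`, the per-frame label format, and the `silence` marker are assumptions made for this example and are not taken from the paper. Real evaluations typically score time-stamped segments with a forgiveness collar and explicit handling of overlapped speech, all of which this sketch omits.

```python
from itertools import permutations

_UNMAPPED = object()  # placeholder for a cluster that maps to no reference speaker


def frame_der(reference, hypothesis, silence=None):
    """Frame-level diarization error rate (illustrative sketch).

    reference, hypothesis: equal-length sequences of per-frame labels,
    where `silence` marks non-speech frames. Hypothesis labels are
    arbitrary cluster ids, so they are mapped one-to-one onto reference
    speakers by the assignment that minimizes speaker confusion.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    pairs = list(zip(reference, hypothesis))
    total_speech = sum(1 for r, _ in pairs if r != silence)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    # Speech in the reference that the system labeled as silence.
    missed = sum(1 for r, h in pairs if r != silence and h == silence)
    # Silence in the reference that the system labeled as speech.
    false_alarm = sum(1 for r, h in pairs if r == silence and h != silence)

    # Frames where both sides contain speech can only differ in speaker.
    both = [(r, h) for r, h in pairs if r != silence and h != silence]
    ref_speakers = list({r for r, _ in both})
    hyp_clusters = list({h for _, h in both})

    # Pad the reference side so surplus clusters can stay unmapped.
    extra = max(0, len(hyp_clusters) - len(ref_speakers))
    candidates = ref_speakers + [_UNMAPPED] * extra

    # Exhaustive search over one-to-one mappings; fine for a few speakers.
    confusion = min(
        sum(1 for r, h in both if dict(zip(hyp_clusters, perm))[h] != r)
        for perm in permutations(candidates, len(hyp_clusters))
    )

    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    ref = ["A", "A", "A", "B", "B", "B", None, None]
    hyp = [1, 1, 2, 2, 2, 2, None, 1]
    # One confused frame and one false-alarm frame over six speech frames.
    print(f"DER = {frame_der(ref, hyp):.4f}")
```

The exhaustive permutation search keeps the example dependency-free; for recordings with many speakers, an assignment solver such as the Hungarian algorithm would be the usual replacement.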
Keywords
Diarization error rate, mel-frequency cepstral coefficients, hierarchical clustering, Gaussian mixture model, autoencoder
First Page
3138
Last Page
3149
Recommended Citation
AHMAD, REHAN and ZUBAIR, SYED (2019) "Unsupervised deep feature embeddings for speaker diarization," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 27: No. 4, Article 54. https://doi.org/10.3906/elk-1901-125
Available at: https://journals.tubitak.gov.tr/elektrik/vol27/iss4/54
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons