TY - JOUR
T1 - Deep normalization for speaker vectors
AU - Cai, Yunqi
AU - Li, Lantian
AU - Abel, Andrew
AU - Zhu, Xiaoyan
AU - Wang, Dong
N1 - Funding Information: Manuscript received February 18, 2020; revised June 23, 2020 and October 30, 2020; accepted November 8, 2020. Date of publication December 17, 2020; date of current version February 1, 2021. This work was supported by the National Natural Science Foundation of China (NSFC) under Project Nos. 61633013 and 61371136. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lei Xie. (Yunqi Cai and Lantian Li contributed equally to this work.) (Corresponding author: Dong Wang.) Yunqi Cai is with the Center for Speech and Language Technologies (CSLT) and the Department of Computer Science, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).
Publisher Copyright: © 2021 IEEE.
Y. Cai, L. Li, A. Abel, X. Zhu and D. Wang, "Deep Normalization for Speaker Vectors," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 733-744, 2021, doi: 10.1109/TASLP.2020.3039573
PY - 2021/2/1
Y1 - 2021/2/1
N2 - Deep speaker embedding has demonstrated state-of-the-art performance in speaker recognition tasks. However, one potential issue with this approach is that the speaker vectors derived from deep embedding models tend to be non-Gaussian for each individual speaker and non-homogeneous across the distributions of different speakers. These irregular distributions can seriously impact speaker recognition performance, especially with the popular PLDA scoring method, which assumes a homogeneous Gaussian distribution. In this article, we argue that deep speaker vectors require deep normalization, and propose a deep normalization approach based on a novel discriminative normalization flow (DNF) model. We demonstrate the effectiveness of the proposed approach with experiments on the widely used SITW and CNCeleb corpora. In these experiments, the DNF-based normalization delivered substantial performance gains and also showed strong generalization capability in out-of-domain tests.
AB - Deep speaker embedding has demonstrated state-of-the-art performance in speaker recognition tasks. However, one potential issue with this approach is that the speaker vectors derived from deep embedding models tend to be non-Gaussian for each individual speaker and non-homogeneous across the distributions of different speakers. These irregular distributions can seriously impact speaker recognition performance, especially with the popular PLDA scoring method, which assumes a homogeneous Gaussian distribution. In this article, we argue that deep speaker vectors require deep normalization, and propose a deep normalization approach based on a novel discriminative normalization flow (DNF) model. We demonstrate the effectiveness of the proposed approach with experiments on the widely used SITW and CNCeleb corpora. In these experiments, the DNF-based normalization delivered substantial performance gains and also showed strong generalization capability in out-of-domain tests.
KW - normalization flow
KW - speaker embedding
KW - speaker recognition
KW - training
KW - transforms
KW - task analysis
KW - covariance matrices
KW - probabilistic logic
KW - dimensionality reduction
UR - http://www.scopus.com/inward/record.url?scp=85098762552&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2020.3039573
DO - 10.1109/TASLP.2020.3039573
M3 - Article
AN - SCOPUS:85098762552
SN - 2329-9290
VL - 29
SP - 733
EP - 744
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9296778
ER -