USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wang, Wenbin; Song, Yang; Jha, Sanjay

doi:10.1109/TASLP.2024.3393714

arXiv:2404.18094 (cs)

[Submitted on 28 Apr 2024]

Abstract: Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant unresolved problem. Zero-shot and few-shot speaker-adaptive TTS approaches have been explored, but both have limitations. Zero-shot approaches tend to generalize insufficiently to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation but not both, constraining their utility across real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, neglecting the large population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation, respectively, based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we designed two discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce the storage cost of few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.

Comments:	15 pages, 13 figures. Copyright has been transferred to IEEE
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2404.18094 [cs.SD]
	(or arXiv:2404.18094v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2404.18094
Journal reference:	IEEE/ACM Transactions on Audio, Speech and Language Processing, 2024
Related DOI:	https://doi.org/10.1109/TASLP.2024.3393714

Submission history

From: Wenbin Wang
[v1] Sun, 28 Apr 2024 06:50:55 UTC (9,494 KB)
