MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Abstract

Voice conversion aims to modify the source speaker's voice to resemble the target speaker while preserving the original speech content. Despite notable recent advances in voice conversion, multi-lingual voice conversion (including both monolingual and cross-lingual scenarios) has yet to be extensively studied. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the rarity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that converts only timbre and keeps the original content and source-language prosody, without requiring multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps: in step one, the model is trained with monolingual speech-text data; steps two and three then take inspiration from back translation and construct a cyclical process to disentangle timbre from other information (content, prosody, and other language-related information) in the absence of multi-lingual data from the same speaker. Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts, demonstrating the system's efficacy and the viability of the three-step approach with cycle consistency.

Training Pipeline

Each training step of our proposed MulliVC model consists of three substeps, and the losses from all three substeps are summed to perform a single model update. The content input and timbre input of steps 2 and 3 come from different languages, simulating a cross-lingual voice conversion scenario. The input of step 2 is used as the timbre input of step 3, and the two steps together form a cycle-consistency loop.
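The way the three substep losses combine into one update can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MulliVC implementation: the names `convert`, `timbre_distance`, `split_convert`, and `timbre_l1` are hypothetical, the L1 and timbre-distance losses are placeholders for whatever losses the model actually uses, and the sketch assumes that the step-2 output serves as the content input of step 3 while the step-2 source serves as its timbre input.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical converter signature: (content_input, timbre_input) -> output
Converter = Callable[[Vector, Vector], Vector]


def l1(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def training_step_loss(
    convert: Converter,
    timbre_distance: Callable[[Vector, Vector], float],
    mono: Vector,
    source: Vector,
    reference: Vector,
) -> float:
    """Sum the losses of the three substeps for one training example."""
    # Step 1: monolingual case, content and timbre come from the same utterance.
    loss_1 = l1(convert(mono, mono), mono)

    # Step 2: content from `source`, timbre from `reference` (another language).
    # No ground truth exists here, so this sketch only scores timbre similarity.
    converted = convert(source, reference)
    loss_2 = timbre_distance(converted, reference)

    # Step 3 (assumed wiring): the step-2 output is the content input and the
    # step-2 source is the timbre input; the result should recover `source`.
    cycled = convert(converted, source)
    loss_3 = l1(cycled, source)

    # The three losses are summed so that a single update covers all substeps.
    return loss_1 + loss_2 + loss_3


def split_convert(content: Vector, timbre: Vector) -> Vector:
    """Toy converter: first half of a vector is 'content', second half 'timbre'."""
    half = len(content) // 2
    return content[:half] + timbre[half:]


def timbre_l1(a: Vector, b: Vector) -> float:
    """Toy timbre distance: L1 over the 'timbre' half only."""
    half = len(a) // 2
    return l1(a[half:], b[half:])


english = [1.0, 2.0, 0.1, 0.2]  # toy features for an English utterance
chinese = [3.0, 4.0, 0.8, 0.9]  # toy features for a Chinese utterance
total = training_step_loss(split_convert, timbre_l1, english, english, chinese)
```

In this toy setting, a converter that separates content from timbre perfectly incurs no loss in any substep, whereas a converter that ignores its timbre input is penalized in step 2 even though it still passes the reconstruction checks in steps 1 and 3.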

Model Overview

Model architecture of MulliVC. Note that modules marked with a lock icon are frozen during training.

Zero-shot cross-lingual voice conversion

Source speech: English - Target speech: Chinese

Source speech Target speech MulliVC ConsistencyVC DiffHierVC FreeVC FreeVC*

Source speech: Chinese - Target speech: English

Source speech Target speech MulliVC ConsistencyVC DiffHierVC FreeVC FreeVC*

Zero-shot cross-lingual voice conversion for unseen language

Source speech: French - Target speech: German

Source speech Target speech MulliVC ConsistencyVC DiffHierVC FreeVC FreeVC*

Source speech: German - Target speech: French

Source speech Target speech MulliVC ConsistencyVC DiffHierVC FreeVC FreeVC*

Zero-shot monolingual voice conversion

Source speech Target speech MulliVC ConsistencyVC DiffHierVC FreeVC FreeVC*

Ablation Study

Speech examples from the ablation study are listed below.

Source speech Target speech MulliVC MulliVC w/o steps 2, 3 MulliVC w/o step 3 MulliVC w/o ASR loss MulliVC w/o fine-grained timbre conformer