OneVoice: One Model, Triple Scenarios—Towards Unified Zero-shot Voice Conversion
Project Roadmap
This section will be updated as we progress.
- Audio demos for EVC, SVC, and LVC scenarios
- Technical paper pre-print with full details
- Open-source implementation on GitHub
- Pre-trained model checkpoints for public use
Model Capabilities
OneVoice is a unified model that supports all three voice conversion scenarios: Linguistic-preserving (LVC), Expressive (EVC), and Singing (SVC).
| Task | Specialized Models | OneVoice (Ours) | Capabilities |
|---|---|---|---|
| LVC (Linguistic-preserving VC) | SeedVC, MetisVC | ✓ Full Support | Linguistic Preservation • Speaker Cloning |
| EVC (Expressive VC) | Vevo, REFVC | ✓ Full Support | Linguistic Preservation • Speaker Cloning • + Prosody Transfer |
| SVC (Singing VC) | SeedVC-Sing, YINGSVC | ✓ Full Support | Linguistic Preservation • Speaker Cloning • + Melody Adherence |
Abstract
Recent progress in voice conversion (VC) has reached a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. The core of its unified design is a Mixture-of-Experts (MoE) architecture that explicitly models shared conversion knowledge alongside scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, comprising shared-expert isolation and scenario-aware domain-expert assignment driven by global and local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gating mechanism, allowing adaptive use of prosody information. Furthermore, to realize this design and mitigate data imbalance (abundant speech vs. scarce singing data), we adopt two-stage progressive training: foundational pre-training followed by scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while offering flexible control over scenarios and a fast-decoding variant that requires as few as 2 steps. Code and models will be released soon.
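To make the dual-path routing and gated prosody fusion concrete, here is a minimal, hypothetical PyTorch sketch based only on the abstract's description; it is not the released implementation. The module names, dimensions, top-k routing formula, and sigmoid gate are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def ffn(dim: int) -> nn.Sequential:
    """A plain feed-forward expert; the actual expert design is unspecified."""
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))


class DualPathMoELayer(nn.Module):
    """Sketch of one OneVoice-style layer: shared experts are always active
    (shared-expert isolation), domain experts are chosen by a scenario-aware
    router combining a global scenario cue with local token features, and
    prosodic conditioning enters through a learned gate."""

    def __init__(self, dim: int, n_shared: int = 1, n_domain: int = 3,
                 n_scenarios: int = 3, top_k: int = 1):
        super().__init__()
        self.shared_experts = nn.ModuleList([ffn(dim) for _ in range(n_shared)])
        self.domain_experts = nn.ModuleList([ffn(dim) for _ in range(n_domain)])
        self.local_router = nn.Linear(dim, n_domain)              # local cue
        self.scenario_bias = nn.Embedding(n_scenarios, n_domain)  # global cue
        self.prosody_proj = nn.Linear(dim, dim)
        self.prosody_gate = nn.Linear(2 * dim, dim)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, prosody: torch.Tensor,
                scenario_id: torch.Tensor) -> torch.Tensor:
        # x, prosody: (batch, seq, dim); scenario_id: (batch,) in {LVC, EVC, SVC}.
        # Path 1: shared experts carry conversion knowledge common to all tasks.
        out = sum(e(x) for e in self.shared_experts)
        # Path 2: routing logits mix per-token features with a scenario bias.
        logits = self.local_router(x) + self.scenario_bias(scenario_id).unsqueeze(1)
        weights = logits.softmax(dim=-1)            # (batch, seq, n_domain)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            for idx, expert in enumerate(self.domain_experts):
                # Dense dispatch for clarity; a real MoE would dispatch sparsely.
                mask = (top_i[..., k] == idx).unsqueeze(-1)
                out = out + mask * top_w[..., k:k + 1] * expert(x)
        # Gated prosody fusion: each layer decides how much prosodic
        # conditioning to inject for the current scenario.
        p = self.prosody_proj(prosody)
        gate = torch.sigmoid(self.prosody_gate(torch.cat([out, p], dim=-1)))
        return out + gate * p


layer = DualPathMoELayer(dim=256)
x, prosody = torch.randn(2, 120, 256), torch.randn(2, 120, 256)
y = layer(x, prosody, scenario_id=torch.tensor([1, 2]))  # e.g. EVC, SVC
print(y.shape)  # torch.Size([2, 120, 256])
```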


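The two-stage recipe can be sketched similarly. Below is a hypothetical illustration of stage two, assuming each domain expert is realized as a LoRA adapter over a frozen stage-one projection; the rank, scaling, and wrapping strategy are illustrative guesses, not details from the paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pre-trained projection plus a low-rank trainable update,
    so scenario enhancement touches only a small fraction of parameters."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # stage-1 weights stay frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


# Stage 2: derive, e.g., a singing (SVC) expert from a frozen shared projection,
# so scarce singing data only trains the small LoRA layers.
shared_proj = nn.Linear(256, 1024)         # pre-trained in stage 1
svc_expert_proj = LoRALinear(shared_proj)  # only the LoRA layers train
trainable = sum(p.numel() for p in svc_expert_proj.parameters() if p.requires_grad)
print(trainable)  # a small fraction of the frozen base's parameters
```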
Demo: Zero-shot EVC Results
Each row compares EVC baselines (Vevo, REFVC) against the proposed models (OneVoice, OneVoice-MF, OneVoice^L), given a target speaker reference and source speech; audio samples are embedded on the demo page.

| Sample Type | Target Speaker | Source Speech | Vevo | REFVC | OneVoice | OneVoice-MF | OneVoice^L |
|---|---|---|---|---|---|---|---|
| Emotion | | | | | | | |
| Dubbing | | | | | | | |
| Non-verbal | | | | | | | |
| Accent | | | | | | | |
Demo: Zero-shot SVC Results
Each row compares SVC baselines (SeedVC-Sing, YINGSVC) against the proposed models (OneVoice, OneVoice-MF), given a target speaker reference and source singing; audio samples are embedded on the demo page.

| Reference Type | Target Speaker | Source Singing | SeedVC-Sing | YINGSVC | OneVoice | OneVoice-MF |
|---|---|---|---|---|---|---|
| Target Singer | | | | | | |
| General Speaker | | | | | | |
Demo: Zero-shot LVC Results
Each row compares LVC baselines (SeedVC, MetisVC) against the proposed models (OneVoice, OneVoice-MF), given a target speaker reference and source speech; audio samples are embedded on the demo page.

| Target Speaker | Source Speech | SeedVC | MetisVC | OneVoice | OneVoice-MF |
|---|---|---|---|---|---|