OneVoice: One Model, Triple Scenarios—Towards Unified Zero-shot Voice Conversion

GitHub | arXiv | Demo

Project Roadmap

This section will be updated as we progress.

- Demo Audio Samples: audio demos for the EVC, SVC, and LVC scenarios
- Paper on arXiv: technical paper pre-print with full details
- Code Release: open-source implementation on GitHub
- Model Release: pre-trained model checkpoints for public use

Model Capabilities

OneVoice is a unified model that supports all three voice conversion scenarios: Linguistic-preserving (LVC), Expressive (EVC), and Singing (SVC).

| Task | Specialized Models | OneVoice (Ours) | Capability |
| --- | --- | --- | --- |
| LVC (Linguistic-preserving VC) | SeedVC, MetisVC | ✓ Full Support | Linguistic Preservation, Speaker Cloning |
| EVC (Expressive VC) | VEVO, REFVC | ✓ Full Support | Linguistic Preservation, Speaker Cloning, + Prosody Transfer |
| SVC (Singing VC) | SeedVC-Sing, YINGSVC | ✓ Full Support | Linguistic Preservation, Speaker Cloning, + Melody Adherence |
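
Since the code and checkpoints have not been released yet, the snippet below is only an illustrative sketch of what a single-checkpoint, three-scenario interface could look like. The `ConversionRequest` class, the `convert` function, and the `scenario` flag are hypothetical placeholders, not the actual OneVoice API.

```python
# Hypothetical, illustrative interface only -- the actual OneVoice API has not
# been released.  One checkpoint serves all three scenarios; only the
# `scenario` flag changes which domain experts and prosodic cues are used.

from dataclasses import dataclass


@dataclass
class ConversionRequest:
    source_wav: str   # speech/singing whose content (and prosody/melody) is kept
    target_wav: str   # reference utterance of the target speaker
    scenario: str     # "lvc" | "evc" | "svc"


def convert(request: ConversionRequest) -> str:
    """Placeholder for a unified zero-shot conversion call (hypothetical)."""
    assert request.scenario in {"lvc", "evc", "svc"}, "unknown scenario"
    # ... load the single OneVoice checkpoint, route to shared + domain experts,
    # condition on scenario-specific prosodic features, run few-step decoding ...
    return f"converted_{request.scenario}.wav"


# The same model would handle all three scenarios:
for scenario in ("lvc", "evc", "svc"):
    print(convert(ConversionRequest("source.wav", "target.wav", scenario)))
```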


Abstract

Recent progress in voice conversion (VC) has reached a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework that handles all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification is a Mixture-of-Experts (MoE) that explicitly models shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism comprising shared-expert isolation and scenario-aware domain-expert assignment driven by global and local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gating mechanism, allowing adaptive use of prosody information. Furthermore, to realize this design and mitigate the data imbalance (abundant speech vs. scarce singing data), we adopt a two-stage progressive training scheme consisting of foundational pre-training and scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while demonstrating flexible control over scenarios and offering a fast decoding variant that needs as few as 2 steps. The code and model will be released soon.
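
The core unification mechanism described above can be pictured with a small sketch: a shared expert that is isolated from routing and always applied, domain experts selected by a router that mixes a global scenario cue with local frame features, and a per-layer gate that adaptively injects scenario-specific prosody. This is only a conceptual illustration inferred from the abstract; the module shapes, the routing rule, and the use of plain feed-forward domain experts (the paper's domain experts are LoRA-based) are simplifying assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class DualPathMoELayer(nn.Module):
    """Conceptual sketch of one OneVoice-style layer: a shared (always-on) expert,
    scenario-routed domain experts, and a gated fusion of prosodic features.
    Shapes, routing, and expert form are illustrative assumptions only."""

    def __init__(self, d_model: int, n_domain_experts: int = 3):
        super().__init__()
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.domain_experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_domain_experts)]
        )
        # Router mixes a global cue (scenario embedding) with local cues (frame features).
        self.scenario_embed = nn.Embedding(n_domain_experts, d_model)
        self.router = nn.Linear(2 * d_model, n_domain_experts)
        # Gate controlling how much scenario-specific prosody enters this layer.
        self.prosody_proj = nn.Linear(d_model, d_model)
        self.prosody_gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x, scenario_id, prosody):
        # x, prosody: (batch, time, d_model); scenario_id: (batch,)
        g = self.scenario_embed(scenario_id).unsqueeze(1).expand_as(x)   # global cue
        weights = torch.softmax(self.router(torch.cat([x, g], dim=-1)), dim=-1)
        domain_out = sum(
            weights[..., i:i + 1] * expert(x)
            for i, expert in enumerate(self.domain_experts)
        )
        # Shared expert is isolated from routing and always applied.
        h = x + self.shared_expert(x) + domain_out
        # Gated, adaptive injection of prosodic conditioning.
        gate = torch.sigmoid(self.prosody_gate(torch.cat([h, prosody], dim=-1)))
        return h + gate * self.prosody_proj(prosody)


# Toy forward pass with random tensors, purely to show the shapes involved.
layer = DualPathMoELayer(d_model=256)
x = torch.randn(2, 100, 256)
prosody = torch.randn(2, 100, 256)
out = layer(x, scenario_id=torch.tensor([0, 2]), prosody=prosody)
print(out.shape)  # torch.Size([2, 100, 256])
```

In this reading, the always-on shared path carries conversion knowledge common to all scenarios, while the routed path carries scenario-specific expressivity, mirroring the dual-path idea in the abstract.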



Demo: Zero-shot EVC Results

[Audio players omitted. Each row (Emotion, Dubbing, Non-verbal, Accent) pairs a target speaker and source speech with outputs from the EVC baselines Vevo and REFVC and the proposed OneVoice, OneVoice-MF, and OneVoice^L models; see the demo page for the samples.]

Demo: Zero-shot SVC Results

[Audio players omitted. Rows are grouped by reference type (Target Singer, General Speaker), each pairing a target speaker and source singing with outputs from the SVC baselines SeedVC-Sing and YINGSVC and the proposed OneVoice and OneVoice-MF models; see the demo page for the samples.]

Demo: Zero-shot LVC Results

[Audio players omitted. Each row pairs a target speaker and source speech with outputs from the LVC baselines SeedVC and MetisVC and the proposed OneVoice and OneVoice-MF models; see the demo page for the samples.]