TTS & Zero-Shot Voice Cloning

From text and three seconds of a voice to convincing speech

Architecture introduced 05 Jan 2023

This system turns written text into spoken audio, and it can speak that text in a specific person's voice from just a few seconds of sample audio. The words get cleaned up and converted into the sounds to pronounce (phonemes), a short reference clip is turned into a 'voiceprint', and the two are combined into speech you can play. The same technology that gives a voice to people who can't speak also powers scam calls that sound exactly like your boss or your bank.
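To make the data flow above concrete, here is a minimal toy sketch in Python. It only shows the shape of the two inputs that get combined; it does not synthesize audio, and it is not the system's actual implementation. Every name in it (`TOY_LEXICON`, `normalize`, `to_phonemes`, `voiceprint`, `prepare`) is hypothetical. A real system would use a full pronunciation dictionary or a learned text-to-phoneme model, and a neural encoder rather than simple amplitude averages for the voiceprint.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical two-word lexicon using ARPAbet-style symbols (stress marks omitted).
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def normalize(text: str) -> List[str]:
    """Lowercase the text and keep only word-like tokens (stand-in for real cleanup)."""
    return re.findall(r"[a-z']+", text.lower())


def to_phonemes(words: List[str]) -> List[str]:
    """Look each word up in the toy lexicon; unknown words become '<unk>'."""
    phonemes: List[str] = []
    for word in words:
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


def voiceprint(samples: List[float], dims: int = 4) -> List[float]:
    """Toy 'voiceprint': mean absolute amplitude over `dims` consecutive chunks.

    This is an assumption for illustration only; it is far too crude to
    identify a real speaker.
    """
    chunk = max(1, len(samples) // dims)
    vector = []
    for i in range(dims):
        part = samples[i * chunk:(i + 1) * chunk]
        vector.append(sum(abs(s) for s in part) / len(part) if part else 0.0)
    return vector


@dataclass
class SynthesisInput:
    phonemes: List[str]   # what to say
    voice: List[float]    # whose voice to say it in


def prepare(text: str, reference_samples: List[float]) -> SynthesisInput:
    """Combine the two branches into the input a speech generator would consume."""
    return SynthesisInput(to_phonemes(normalize(text)), voiceprint(reference_samples))
```

Calling `prepare("Hello, world!", [0.5, -0.5, 1.0, -1.0])` yields the phoneme list for the two words plus a four-number voice vector. The point of the split is that the "what to say" branch and the "whose voice" branch are independent until the final step, which is why a few seconds of anyone's audio is enough to steer the output.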



Follow a request · step 1 of 6


You type the words you want spoken and pick a voice to use: maybe your own, maybe a clip of someone else's voice that you uploaded.

Next step: ASR + Speaker Diarization