ASR + Speaker Diarization

Who said what, when — a chain of separately-trained speech models

Architecture introduced Apr 2002

Feed in a recording of several people talking and this pipeline gives you back a labelled transcript: the words that were said, who said them (Speaker 1: ..., Speaker 2: ...), and roughly when. It is really two AI models working side by side: one writes down the words, the other figures out who was speaking, and a coordinator stitches their answers together. Neither model is told who is in the room, so both have to guess, and both can guess wrong: the transcriber can invent words during silence, and the speaker-tagger can mix up voices. To tell people apart at all, the speaker-tagger has to build a voice 'fingerprint' of every person in the recording.
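As a rough illustration of the coordinator's stitching step, here is a minimal sketch in Python. It assumes the transcriber returns words with start and end times and the speaker-tagger returns speaker turns with start and end times. The names Word, Turn, assign_speakers and to_transcript are hypothetical, chosen for this example, and are not taken from any particular library. Each word is given to the speaker whose turn overlaps it most in time; a word that overlaps no turn (for example, one the transcriber invented during silence) is labelled "Unknown" rather than forced onto a speaker.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with timestamps in seconds (assumed ASR output shape)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """One speaker turn with timestamps in seconds (assumed diarizer output shape)."""
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns, unknown="Unknown"):
    """Pair each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for w in words:
        best_speaker, best_overlap = unknown, 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_speaker, best_overlap = t.speaker, o
        labelled.append((best_speaker, w.text))
    return labelled


def to_transcript(labelled):
    """Merge consecutive words from the same speaker into 'Speaker: words' lines."""
    groups = []
    for speaker, text in labelled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(text)
        else:
            groups.append((speaker, [text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in groups]


# Made-up example data, not output from a real model.
words = [
    Word("hello", 0.0, 0.4),
    Word("there", 0.5, 0.9),
    Word("hi", 1.2, 1.5),
    Word("um", 3.0, 3.2),  # falls outside every speaker turn
]
turns = [
    Turn("Speaker 1", 0.0, 1.0),
    Turn("Speaker 2", 1.1, 2.0),
]

transcript = to_transcript(assign_speakers(words, turns))
for line in transcript:
    print(line)
```

Maximum overlap is only one possible matching rule. A real coordinator might instead use word midpoints, or smooth labels across neighbouring words, and the choice affects how errors from either model show up in the final transcript.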

Follow a request · step 1 of 6

You hand the pipeline an audio recording — say, a meeting or an interview with several voices.
