ASR + Speaker Diarization
Who said what, when — a chain of separately-trained speech models
Architecture introduced Apr 2002
Feed in a recording of several people talking and this pipeline gives you back a labelled transcript: the words that were said, who said them (Speaker 1: ..., Speaker 2: ...), and roughly when. It is really two AI models working side by side on the same audio: one writes down the words, the other figures out who was speaking, and a coordinator stitches their answers together. Neither model is told who is in the room, so both can guess wrong: the transcriber can invent words during silence, and the speaker-tagger, which has to build a voice 'fingerprint' of every person to tell them apart, can mix speakers up.
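The stitching step can be sketched in a few lines. This is a minimal illustration, not the coordinator any particular product uses: it assumes the transcriber returns words with start and end times and the speaker-tagger returns speaker turns with start and end times, then gives each word to the turn it overlaps most. The names `Word`, `Turn`, `assign_speakers`, and `to_transcript` are hypothetical, chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its time span in seconds (assumed ASR output)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """One speaker turn with its time span in seconds (assumed diarizer output)."""
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time spans (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns, unknown="Unknown"):
    """Pair each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn at all are labelled `unknown`. That is where a
    word invented during silence would tend to end up in this sketch.
    """
    labelled = []
    for w in words:
        best_speaker, best_overlap = unknown, 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_speaker, best_overlap = t.speaker, o
        labelled.append((best_speaker, w))
    return labelled


def to_transcript(labelled):
    """Group consecutive words from the same speaker into 'Speaker: text' lines."""
    groups = []
    for speaker, w in labelled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(w.text)
        else:
            groups.append((speaker, [w.text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in groups]


if __name__ == "__main__":
    # Made-up timings, purely for illustration.
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9),
             Word("hi", 1.2, 1.5), Word("um", 3.0, 3.2)]
    turns = [Turn("Speaker 1", 0.0, 1.0), Turn("Speaker 2", 1.1, 2.0)]
    for line in to_transcript(assign_speakers(words, turns)):
        print(line)
```

Maximum overlap is only one possible rule. A word that straddles a speaker change, or one that falls where the speaker-tagger detected no speech, is exactly the kind of case where the two models disagree and the coordinator has to pick.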
Follow a request · step 1 of 6
You hand the pipeline an audio recording — say, a meeting or an interview with several voices.