🔍AI RiskAtlas

ASR + Speaker Diarization

Who said what, when — a chain of separately-trained speech models

Architecture introduced Apr 2002

Feed in a recording of several people talking and this pipeline gives you back a labelled transcript: the words that were said, who said them (Speaker 1: ..., Speaker 2: ...), and roughly when. It is really two AIs working side by side: one writes down the words, the other figures out who was speaking, and a coordinator stitches their answers together. Neither was told who is in the room, so both can guess wrong. The transcriber can invent words during silence, and the speaker-tagger has to build a voice 'fingerprint' of every person to tell them apart.
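The stitching step can be sketched in a few lines. This is a minimal illustration, not the method of any particular product: it assumes the transcriber returns words with start and end times, the speaker-tagger returns speaker turns with start and end times, and the coordinator gives each word to the turn it overlaps most. The names `Word`, `Turn`, `assign_speakers`, and `to_transcript` are hypothetical, as are the sample timings.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0.0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns, unknown="Unknown"):
    """Label each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for w in words:
        best, best_overlap = unknown, 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best, best_overlap = t.speaker, o
        labelled.append((best, w))
    return labelled


def to_transcript(labelled):
    """Group consecutive words from the same speaker into one transcript line."""
    lines = []
    for speaker, w in labelled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(w.text)
        else:
            lines.append((speaker, [w.text]))
    return [f"{s}: {' '.join(ws)}" for s, ws in lines]


# Illustrative (made-up) outputs of the two models.
words = [
    Word("hello", 0.0, 0.4),
    Word("there", 0.5, 0.9),
    Word("hi", 1.2, 1.5),
    Word("thanks", 3.0, 3.4),  # falls in a stretch where no speaker turn was detected
]
turns = [Turn("Speaker 1", 0.0, 1.0), Turn("Speaker 2", 1.1, 2.0)]

transcript = to_transcript(assign_speakers(words, turns))
print(transcript)
# ['Speaker 1: hello there', 'Speaker 2: hi', 'Unknown: thanks']
```

The last word shows why the two models disagreeing matters: a word the transcriber produced during a stretch the speaker-tagger treated as silence has no turn to attach to. Leaving it as "Unknown" rather than forcing it onto the nearest speaker is one possible design choice, since it keeps a possibly invented word visible to a reviewer.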

[Diagram: ASR + speaker diarization pipeline. Zones: Untrusted (recorded audio, anyone's speech), Speech pipeline, Supply chain, Oversight. Components: User, Untrusted Content, Orchestrator / Agent Loop, ASR / Speech-to-Text, Speaker Diarizer, Speaker / Voice-Clone, Output Guardrail, Model Weights & Registry, Model / Package Registry, Monitoring & Evals. Edge labels: "submits audio", "raw audio". Legend of edge types: Instructions, Data, Actions, Control / decision, Feedback / logs.]

Follow a request · step 1 of 6

You hand the pipeline an audio recording — say, a meeting or an interview with several voices.

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading · Built by Shi Yuan