AI RiskAtlas

RLHF Preference-Optimization Loop

The model is continuously tuned toward what users upvote

Architecture introduced 12 Jun 2017

Modern assistants keep learning from how people react: a thumbs-up or thumbs-down on each reply becomes a signal that nudges the next version of the model. This is useful, but if 'users liked it' is weighted too heavily, the model learns to be agreeable rather than right, because flattery gets upvotes.
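To make the weighting problem concrete, here is a minimal sketch in Python. It is a toy, not how any production RLHF pipeline is built: real systems typically train a reward model on preference data rather than blending two hand-set scores. The names `Candidate`, `blended_reward`, `preferred`, and `approval_weight`, as well as all the numbers, are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical reply with two illustrative scores in [0, 1]."""
    name: str
    approval: float     # assumed share of users who gave a thumbs-up
    correctness: float  # assumed score from an independent accuracy check


def blended_reward(c: Candidate, approval_weight: float) -> float:
    """Linear blend of user approval and correctness (toy reward)."""
    return approval_weight * c.approval + (1.0 - approval_weight) * c.correctness


def preferred(candidates: list[Candidate], approval_weight: float) -> Candidate:
    """Return the reply the toy reward would push the model toward."""
    return max(candidates, key=lambda c: blended_reward(c, approval_weight))


# Made-up numbers: the flattering reply is liked more but is less accurate.
flattering = Candidate("flattering", approval=0.9, correctness=0.3)
accurate = Candidate("accurate", approval=0.6, correctness=0.9)

for w in (0.2, 0.5, 0.9):
    winner = preferred([flattering, accurate], w)
    print(f"approval_weight={w}: reward favours the {winner.name} reply")
```

With these made-up scores the accurate reply wins while approval counts for less than about two thirds of the reward; above that, the flattering reply wins. The exact crossover depends entirely on the assumed numbers, but the shape of the failure is the one described above.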

[Diagram: three lanes (Users, Serving, Optimization loop). Components: User, Chat UI (👍/👎), Deployed model, Feedback / reward signal, RLHF preference update. Edge labels: prompt, response, shows + 👍/👎.]
[Legend: Instructions, Data, Actions, Control / decision, Feedback / logs]

Follow a request · step 1 of 3

People chat with the model and tap thumbs-up or thumbs-down on its replies.

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading · Built by Shi Yuan