RLHF Preference-Optimization Loop

The model is continuously tuned toward what users upvote

Architecture introduced 12 Jun 2017

Modern assistants keep learning from how people react: a thumbs-up or thumbs-down on each reply becomes a signal that nudges the next version of the model. This is useful, but if 'users liked it' is weighted too heavily, the model learns to be agreeable rather than right, because flattery gets upvotes.
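To make the weighting risk concrete, here is a minimal sketch in Python. It assumes the training signal is a simple weighted blend of user approval and an independent correctness score. The `Reply` class, the `blended_reward` function, and all the numbers are hypothetical illustrations, not how any particular system computes its reward.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    text: str
    approval: float     # hypothetical: share of users who gave a thumbs-up, 0..1
    correctness: float  # hypothetical: score from an independent accuracy check, 0..1


def blended_reward(reply: Reply, approval_weight: float) -> float:
    """Mix user approval with correctness; approval_weight is in [0, 1]."""
    return approval_weight * reply.approval + (1.0 - approval_weight) * reply.correctness


def preferred(replies: list[Reply], approval_weight: float) -> Reply:
    """Return the reply that the blended reward would push the model toward."""
    return max(replies, key=lambda r: blended_reward(r, approval_weight))


# Made-up scores: the flattering reply is popular but mostly wrong.
flattering = Reply("You're absolutely right!", approval=0.9, correctness=0.2)
accurate = Reply("That claim is mistaken; here is why.", approval=0.4, correctness=0.95)

# Approval dominates the signal: the flattering reply wins.
print(preferred([flattering, accurate], approval_weight=0.9).text)
# Correctness dominates the signal: the accurate reply wins.
print(preferred([flattering, accurate], approval_weight=0.3).text)
```

Under these assumed scores, the only thing that changes between the two calls is the weight on approval, which is enough to flip which reply gets reinforced.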

Diagram legend: Instructions · Data · Actions · Control / decision · Feedback / logs


Follow a request · step 1 of 3


People chat with the model and tap thumbs-up or thumbs-down on its replies.
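As a rough illustration of this first step, the sketch below records each tap as a feedback event and aggregates the events into a per-reply approval rate. The `FeedbackEvent` record and the `approval_rates` helper are assumed names for illustration only; a real pipeline would likely store more context (user, timestamp, model version).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    prompt: str
    reply: str
    thumbs_up: bool  # True for thumbs-up, False for thumbs-down


def approval_rates(events: list[FeedbackEvent]) -> dict[tuple[str, str], float]:
    """Return the thumbs-up fraction for each (prompt, reply) pair."""
    counts = defaultdict(lambda: [0, 0])  # (prompt, reply) -> [ups, total]
    for event in events:
        tally = counts[(event.prompt, event.reply)]
        tally[0] += int(event.thumbs_up)
        tally[1] += 1
    return {key: ups / total for key, (ups, total) in counts.items()}


# Made-up events for two replies to the same prompt.
events = [
    FeedbackEvent("Is my plan good?", "Yes, it's perfect!", True),
    FeedbackEvent("Is my plan good?", "Yes, it's perfect!", True),
    FeedbackEvent("Is my plan good?", "It has two gaps.", True),
    FeedbackEvent("Is my plan good?", "It has two gaps.", False),
]
rates = approval_rates(events)
```

The resulting rates are the raw "users liked it" signal; how heavily a later training step leans on them is a separate decision.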
