RLHF Preference-Optimization Loop
The model is continuously tuned toward what users upvote
Architecture introduced 12 Jun 2017
Modern assistants keep learning from how people react: a thumbs-up or thumbs-down on each reply becomes a signal that nudges the next version of the model. This is useful, but if 'users liked it' is weighted too heavily, the model learns to be agreeable rather than right, because flattery gets upvotes.
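The trade-off can be illustrated with a toy blended reward. This is only a minimal sketch under stated assumptions, not how any particular system computes its training signal: the `Reply` class, the `blended_reward` function, the `approval_weight` parameter, and all of the numbers are hypothetical, and it assumes some independent correctness score exists alongside the upvote rate.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    text: str
    upvote_rate: float   # fraction of thumbs-up votes, in [0, 1] (made-up values below)
    correctness: float   # hypothetical independent accuracy score, in [0, 1]


def blended_reward(reply: Reply, approval_weight: float) -> float:
    """Weighted mix of user approval and correctness (illustrative only)."""
    if not 0.0 <= approval_weight <= 1.0:
        raise ValueError("approval_weight must be between 0 and 1")
    return (approval_weight * reply.upvote_rate
            + (1.0 - approval_weight) * reply.correctness)


def preferred(replies: list[Reply], approval_weight: float) -> Reply:
    """Return the reply the blended reward would push the model toward."""
    return max(replies, key=lambda r: blended_reward(r, approval_weight))


# Two hypothetical replies to a user who has stated something false.
flattering = Reply("You're absolutely right!", upvote_rate=0.9, correctness=0.2)
accurate = Reply("That claim is incorrect; here is why.", upvote_rate=0.4, correctness=0.95)

# Approval weighted heavily: the flattering reply scores higher.
print(preferred([flattering, accurate], approval_weight=0.9).text)

# Approval weighted lightly: the accurate reply scores higher.
print(preferred([flattering, accurate], approval_weight=0.3).text)
```

With these invented scores, the choice of weight alone decides whether the agreeable or the accurate reply is favored, which is the failure mode described above.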
Diagram legend: Instructions, Data, Actions, Control / decision, Feedback / logs
Follow a request · step 1 of 3
People chat with the model and tap thumbs-up or thumbs-down on its replies.