A key driver behind this progress is a technique called reinforcement learning from human feedback (RLHF). To rein in their bias-prone LLMs, researchers stage cunningly constructed multi-turn conversations with the model, prompting it to say obnoxious, harmful, or offensive things and noting where and how it goes wrong. Human reviewers flag these missteps and rate or rank the model's responses, and those judgments are fed back into training, rewarding the answers people prefer and discouraging the ones they don't, until the model absorbs a more desirable worldview, in a way not wholly dissimilar from how we try to teach children not to say inappropriate things at the dinner table.
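Under the hood, that human insight is usually captured as preference data: raters compare candidate responses, and a small reward model learns to score answers the way the raters would, after which the language model is nudged toward high-scoring behavior. The snippet below is a deliberately toy sketch of just the preference-learning step; the feature vectors, the preference pairs, and the tiny linear reward model are all made-up stand-ins, not any lab's actual pipeline.

```python
# Toy sketch of the preference-learning step at the heart of RLHF.
# All data here is synthetic and illustrative only.
import torch

torch.manual_seed(0)

# Pretend each candidate response is summarized by a small feature vector
# (in a real system this would come from the language model itself).
n_pairs, dim = 64, 4
chosen = torch.randn(n_pairs, dim) + 1.0    # responses human raters preferred
rejected = torch.randn(n_pairs, dim) - 1.0  # responses raters flagged as harmful or off-putting

# Reward model: a single linear layer scoring how "acceptable" a response is.
reward_model = torch.nn.Linear(dim, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=0.05)

for step in range(200):
    # Bradley-Terry style loss: the preferred response should out-score the rejected one.
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model can now score new responses; in full RLHF the
# language model is then fine-tuned (e.g., with PPO) to favor high-scoring outputs.
with torch.no_grad():
    print("mean score of preferred responses:", reward_model(chosen).mean().item())
    print("mean score of rejected responses: ", reward_model(rejected).mean().item())
```

The point of the sketch is only the shape of the loop: humans express preferences, a reward model distills them into a score, and that score steers the model's subsequent training.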