
The Alignment Problem: Machine Learning and Human Values

A jaw-dropping exploration of everything that goes wrong when we build AI systems and the movement to fix them.

Today’s "machine-learning" systems, trained by data, are so effective that we’ve invited them to see and hear for us—and to make decisions on our behalf. But alarm bells are ringing. Recent years have seen an eruption of concern as the field of machine learning advances. When the systems we attempt to teach will not, in the end, do what we want or what we expect, ethical and potentially existential risks emerge. Researchers call this the alignment problem.

Systems cull résumés until, years later, we discover that they have inherent gender biases. Algorithms decide bail and parole—and appear to assess Black and white defendants differently. We can no longer assume that our mortgage application, or even our medical tests, will be seen by human eyes. And as autonomous vehicles share our streets, we are increasingly putting our lives in their hands.

The mathematical and computational models driving these changes range in complexity from something that can fit on a spreadsheet to a complex system that might credibly be called “artificial intelligence.” They are steadily replacing both human judgment and explicitly programmed software.

In best-selling author Brian Christian’s riveting account, we meet the alignment problem’s “first-responders,” and learn their ambitious plan to solve it before our hands are completely off the wheel. In a masterful blend of history and on-the ground reporting, Christian traces the explosive growth in the field of machine learning and surveys its current, sprawling frontier. Readers encounter a discipline finding its legs amid exhilarating and sometimes terrifying progress. Whether they—and we—succeed or fail in solving the alignment problem will be a defining human story.

The Alignment Problem offers an unflinching reckoning with humanity’s biases and blind spots, our own unstated assumptions and often contradictory goals. A dazzlingly interdisciplinary work, it takes a hard look not only at our technology but at our culture—and finds a story by turns harrowing and hopeful.

496 pages, Hardcover

First published October 6, 2020

1686 people are currently reading
20079 people want to read

About the author

Brian Christian

6 books · 943 followers
Brian Christian is an acclaimed author and researcher whose work explores the human implications of computer science. He is known for his bestselling series of books:

The Most Human Human (2011) uses his experience as a human “confederate” in the Turing test to examine what chatbots reveal about the nature of language and communication. It was named a Wall Street Journal bestseller, a New York Times Editors’ Choice, and a New Yorker favorite book of the year.

Algorithms to Live By (2016), co-authored with Tom Griffiths, applies computational principles to everyday human decision making, painting a counterintuitively human picture of rationality. It was named a #1 Audible bestseller, Amazon best science book of the year, and MIT Technology Review best book of the year.

The Alignment Problem (2020) is a nuanced investigation of the ethics and safety challenges confronting the field of AI, and a portrait of the community of researchers working to address them. Nature called it “Meticulously researched and superbly written,” and The New York Times called it “The best book on the key technical and moral questions of AI.” Microsoft CEO Satya Nadella named it one of the books that most inspired him. The Alignment Problem was a Finalist for Los Angeles Times Best Science & Technology Book of the Year and won the Excellence in Science Communication Award from the National Academies of Sciences, Engineering, and Medicine.

As a researcher, Christian’s work spans from computational cognitive science to AI alignment and has appeared in peer-reviewed journals from Dædalus to Cognitive Science, and he is a recipient of the Clarendon Scholarship, the University of Oxford’s most competitive research scholarship. He is affiliated with the AI Policy and Governance Working Group at the Institute for Advanced Study in Princeton, the Center for Human-Compatible AI and the Center for Information Technology Research in the Interest of Society at UC Berkeley, and the Human Information Processing Lab at the University of Oxford.

As a writer, Christian’s work has been translated into nineteen languages, and has appeared in The New Yorker, The Atlantic, Wired, The Wall Street Journal, The Guardian, and The Paris Review. His writing has won several literary awards, including fellowships at Bread Loaf, Yaddo, and MacDowell, publication in Best American Science & Nature Writing, and an award from the Academy of American Poets.

As a software developer, Christian has contributed to a number of foundational open-source projects, including Ruby on Rails and Bundler. He served for nine years as Director of Technology for the innovative literary publisher McSweeney’s, where he led a small team responsible for the company’s technical stack.

As a speaker and public intellectual, Christian has been a featured guest on The Daily Show, The Ezra Klein Show, and Radiolab, and has lectured at Microsoft, Google, Meta, Yale, the Santa Fe Institute, and the London School of Economics. He has advised business executives as well as Cabinet Members, Parliamentarians, and administrators in six countries about matters ranging from decision making to AI.

Born in Wilmington, Delaware, Christian studied computer science and philosophy at Brown University, poetry and nonfiction at the University of Washington, and psychology and computational neuroscience at the University of Oxford. He lives in San Francisco and the UK.

Ratings & Reviews


Community Reviews

5 stars: 2,369 (50%)
4 stars: 1,709 (36%)
3 stars: 537 (11%)
2 stars: 81 (1%)
1 star: 25 (<1%)
Profile Image for David Rubenstein.
866 reviews · 2,775 followers
January 24, 2022
The biggest problem in artificial intelligence (AI) is to devise a reward function that gives you the behavior you want, while avoiding side effects or unforeseen consequences. This book examines the alignment problem from a number of fascinating perspectives.

This is a fascinating book, full of the implications of AI for philosophy, sociology, and psychology. The interactions between AI and sociology and psychology run in both directions. Our understanding of psychology helps to improve AI in numerous ways. Also, AI gives researchers many valuable insights into psychology, and issues in sociology. After all, we want automated algorithms to be unbiased, to be fair. But, who is to say exactly what is fair? Sometimes, the answer isn't easy.

The first problem, well known to workers in AI, is the inherent bias due to small training datasets. AI algorithms demonstrate bias, and can subtly perpetuate it. It seems like many of the biases are not the fault of the algorithms, but instead are a mirror of society and culture. In the 1950s, people tried to predict, using punch card machines, which prisoners would succeed on parole. A ProPublica study was conducted of the accuracy of COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). COMPAS is used to predict whether an inmate, if released, would commit a violent or a nonviolent crime within 1-3 years. The algorithm was found to be biased against blacks; it overpredicts recidivism among blacks, and underpredicts for whites. A key factor is that it actually does not predict whether a released prisoner would commit a crime. It really predicts whether a released prisoner would be arrested and convicted for a crime. Higher rates of police profiling blacks lead to an inherent bias.
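To make "overpredicts" and "underpredicts" concrete, here is a minimal sketch of one way such a disparity can be measured: comparing false positive and false negative rates per group. The records below are invented purely for illustration (they are not COMPAS data), and the group labels and the `error_rates` helper are hypothetical names, not anything from the book or the ProPublica study.

```python
from collections import defaultdict

# Hypothetical records: (group, predicted_high_risk, actually_rearrested).
# These values are made up for illustration only.
records = [
    ("A", True, False), ("A", False, False), ("A", True, True), ("A", True, True),
    ("B", False, False), ("B", False, False), ("B", True, True), ("B", False, True),
]

def error_rates(rows):
    """Return {group: (false_positive_rate, false_negative_rate)}."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, predicted, actual in rows:
        if actual:
            key = "tp" if predicted else "fn"
        else:
            key = "fp" if predicted else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # people who were not rearrested
        positives = c["fn"] + c["tp"]  # people who were rearrested
        fpr = c["fp"] / negatives if negatives else 0.0
        fnr = c["fn"] / positives if positives else 0.0
        rates[group] = (fpr, fnr)
    return rates

rates = error_rates(records)
for group, (fpr, fnr) in sorted(rates.items()):
    print(f"group {group}: false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
```

In this toy data, group A is wrongly flagged as high risk more often (a higher false positive rate), while group B is wrongly cleared more often (a higher false negative rate). Note that the "actual" column here records rearrest, not crime committed, which is exactly the gap the paragraph above points to.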

There is a US antidiscrimination law that prohibits certain attributes--like race and gender--from being used in machine-learning models for hiring, criminal detentions, and so on. Nevertheless, other unprotected variables are correlated with race and gender, so the algorithms can still be discriminatory. In addition, blocking these attributes prevents measuring or even mitigating the discrimination!

Predicting whether or not a patient with pneumonia should be hospitalized as an inpatient is problematic. Models predict that if a patient has chest pain, or has heart disease, asthma, or is over 100, then the patient is less likely to die! The reason is that patients with these conditions automatically receive more care, so they are less likely to die.

Many problems in AI are solved by looking at psychology. For example, BF Skinner taught a pigeon how to bowl in a miniature alley through incremental steps. This led researchers to teach an algorithm to play difficult video games by rewarding incremental steps. Basically, great video games train you how to play. Similarly, neural networks learn language translation by starting with simple sentences before graduating to more difficult ones. This approach is similar to language learning by children. The book Bobby Fischer Teaches Chess uses a similar approach.

AI is not just about automating tasks, but also about how we can better understand human psychology. How can we best train ourselves? First, we should use sparse rewards. Second, we should incentivize a state, not an action. In real life, we can use gamification as an approach to reinforcement learning. Studies of toddlers show that toys that seem to violate the laws of physics were most novel, and held the interest of six-year-olds for the longest time. Infants use violations of prior expectations as special opportunities for learning.

Psychologists have studied overimitation in children and chimpanzees. People learning a new task will learn best through imitation. Sometimes we imitate behaviors that are not relevant to a task. A toddler might overimitate if he cannot figure out why an adult is doing something, so he does it too. As it turns out, chimpanzees do not purposely overimitate. But children can understand whether an adult is teaching or simply experimenting. If an adult is experimenting, the child does not overimitate.

A fascinating chapter on imitation describes the problems encountered by the first researchers in autonomous driving. Teaching an autonomous car in a video game to drive with imitation is best done by randomly alternating between human and machine drivers.

This book is fascinating on many levels. But it is not always an easy read. Some of the concepts are difficult, even subtle. It is such a pleasure to read a well-researched book that plumbs the depths of a complicated subject.
Profile Image for Krzysztof.
101 reviews · 9 followers
April 1, 2021
There is a great book trapped inside this good book, waiting for a skillful editor to carve it out. The author did vast research in multiple domains and it seems like he could neither build a cohesive narration that could connect all of it nor leave anything out.

This book is probably the best intro to machine learning space for a non-engineer I've read. It presents its history, challenges, what can be done, and what can't be done (yet). It's both accessible and substantive, presenting complex ideas in a digestible form without dumbing them down. If you want to spark the ML interest in anyone who hasn't been paying attention to this field, give them this book. It provides a wide background connecting ML to neuroscience, cognitive science, psychology, ethics, and behavioral economics that will blow their mind.

It's also very detailed, screaming at the reader "I did the research, I went where no one else dared to go!". It will not only present you with an intriguing ML concept but also: trace its roots to a 19th-century farming problem or biology breakthrough, present all the scientists contributing to this research, explain how they met and got along, cite the author's interviews with some of them, and present their life after they published their masterpiece, including completely unrelated information about their substance abuse and the dark circumstances of their premature death. It's written quite well, so there might be an audience who enjoys this, but sadly I'm not a part of it.

If this book was structured to touch directly the subject of the alignment problem it would be at least 3 times shorter. It doesn't mean that 2/3 are bad - most of it is informative, some of it is entertaining, a lot seems like ML things that the author found interesting and just added to the book without any specific connection to its premise. I really liked the first few chapters where machine learning algorithms are presented as the first viable benchmark to the human thinking process and mental models that we build. Spoiler alert: it very clearly shows our flaws, biases, and lies that we tell ourselves (that are further embedded in ML models that we create and technology that uses them).

Overall, I enjoyed most of this book. I just feel a bit cheated by its title and premise, which advertise a different kind of book. This is the Machine Learning omnibus, presenting the most interesting scientific concepts of this field and the scientists behind them. If this is what you expect and need, you won't be disappointed!
Profile Image for Matt (Fully supports developing sentient AGI).
152 reviews · 56 followers
September 5, 2024
"All models are wrong", the core tenet of The Alignment Problem, succinctly describes the difficulties of teaching/learning AI to pursue a goal. Also, we only have an incomplete understanding of how human brains learn, or why we are motivated to do anything beyond basic survival. So, all attempts at overlaying these patterns onto training an AI inherently will embed flaws. An excellent synopsis of why AI research necessarily involves many disciplines and considerations such as safety, psychology, and philosophy. Future reread worthy.
Profile Image for Dan Elton.
43 reviews · 23 followers
July 5, 2021
A well researched book on AI safety written to be enjoyed by experts and newbies alike!

This book is the culmination of *four years* of dedicated work and interviews with over 100 world-class experts. The brilliant thing about this book is that it is so information dense and full of interesting anecdotes that people of any level of expertise stand to gain something from it. He’s carefully tuned it so a wide variety of people can enjoy it without getting bored or overwhelmed.

This book covers the well known problems of bias and brittleness in machine learning, including the following well-known cases: Richard Caruana’s example of a pneumonia triage system that went haywire, the COMPAS parole recommendation system, the Google Photos “gorilla” tag fiasco, word2vec gender bias, and the 2018 fatal Uber car crash in Tempe, Arizona. You’d be mistaken to think of this as just another book warning about data bias, lack of robustness, and the potential for discrimination and the perpetuation of inequalities, however.

Sprinkled between the warnings and calls for action are remarkably clear descriptions of modern machine learning techniques and how they relate and/or were inspired by recent developments in neuroscience, cognitive science, developmental psychology, and the social sciences. The author dives into the nitty gritty of how present day AI systems work and does not shy away from explaining current technical challenges.

The way he explains reinforcement learning and links it to research on dopamine in the brain was one of the highlights of the book for me (I had forgotten how dopamine was linked to temporal difference error, and his description of the history of the study of dopamine was fascinating). Not all of the concepts were new to me, but in every case the way he explained each concept was very new to me and wonderful to read. I learned new concepts too. For instance, I never understood what the difference between “on policy” and “off policy” RL systems was until I read his explanation. Other concepts I picked up were “cooperative reinforcement learning”, “shaping”, and various “impact metrics”. If you haven’t heard of these terms and are interested in AI safety, I heartily recommend this book.
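For readers who, like me, had forgotten what "temporal difference error" means, here is a minimal sketch of the standard textbook TD(0) update. It is not taken from the book; the state names ("cue", "juice", "end"), the learning rate and the discount factor are all assumptions chosen for illustration.

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to values[state] and return the TD error.

    The TD error is the gap between what was predicted for `state` and
    what was actually observed one step later (reward plus the discounted
    prediction for `next_state`).
    """
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error

# Hypothetical two-step episode: a cue is followed by a reward.
values = {"cue": 0.0, "juice": 0.0, "end": 0.0}

first_error = None
for _ in range(200):
    td_update(values, "cue", "juice", reward=0.0)
    last_error = td_update(values, "juice", "end", reward=1.0)
    if first_error is None:
        first_error = last_error

print(f"TD error at reward, first episode: {first_error:.3f}")
print(f"TD error at reward, last episode:  {last_error:.3f}")
print(f"learned value of the cue: {values['cue']:.3f}")
```

The error at the moment of reward starts large and shrinks toward zero as the reward becomes predicted, while the value of the earlier cue rises. That shift of the "surprise" signal from the reward to the cue is the pattern that the dopamine research is usually compared to.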

This book follows the trend of seamlessly linking near term and far term AI safety concerns, a trend that dates back to the publication of Nick Bostrom’s 2014 meditation on far future AI, “Superintelligence”. The book is very “down to earth” -- you may be surprised that the standard arguments about why we should be concerned about long term AI risk that we’ve heard from Elon Musk, Sam Harris, etc are largely absent from this book (most notoriously, the “paperclip maximizer”). This is refreshing because those arguments draw on assumptions (such as fast takeoff) which are very hard to defend with empirical data or the current science on AI. (I still find those arguments convincing enough to warrant serious investment of resources to prevent risk, but they aren’t necessarily the best first arguments to present to someone.) Instead the author follows an ingenious strategy - he starts with current problems in AI and some near future concerns (for instance with driverless cars driving off the road or home robots that refuse to be turned off). Then, by providing sufficient technical background, he proceeds to explain why these are really hard problems, some of the solutions that are being worked on, and the limitations of the solutions proposed so far. The book is cautiously optimistic, showing how meaningful progress on the alignment problem is already occurring. So far the problems with AI that we are encountering *right now* appear tractable, which should motivate more people and resources to flow into AI Safety rather than trying to regulate progress to a standstill, which is impossible and likely to be harmful. At the same time, however, by the end of the book the reader will have a deep appreciation of the challenges ahead and the need for extreme caution as we move towards more and more intelligent and powerful AI.
Profile Image for aPriL does feral sometimes .
2,157 reviews · 522 followers
April 9, 2022
'The Alignment Problem: Machine Learning and Human Values' by Brian Christian is a very interesting overview about the issues in developing useful computing machines. I found it very comprehensive and yet easy to understand. However, it does give me pause in any fantasy I may have had over the Singularity occurring.

The main goal of machine learning is teaching the computer to see, hear and do things without human oversight, and to learn to categorize and make inferences on inputs like humans, and performing a job on the inputs similar to how the human brain functions. The amount and types of inputs necessary to think like a human being, well, ok, computers cannot be fed enough inputs, actually, because of severe limitations based on current hardware. Typically, inputs have to be identified first by an actual human, too, i.e., this is a cat, this is a shadow, this is a dress. Software has to be upgraded to make inferences, judgements, decisions. Which is why scientists are exploring machine learning instead. The computer will teach itself about what/who/why/where by identifying the inputs without help, and performing human-like brain processing on inputs. Theoretically.

Toddlers can do the job of learning about their environment and how to do social interaction (starting with what that is) and how to do a job and figure out actions and activities more quickly and comprehensively than any computer. Quantum computers might be the only hope of a computer thinking as well as a toddler. Meanwhile, computer scientists are making do with inventing new ways of programming machine learning on the computers we have today. The answer is having the computer program itself after starting with minimal basic programming.


I have copied the book blurb as it is accurate:

"Today’s “machine-learning” systems, trained by data, are so effective that we’ve invited them to see and hear for us—and to make decisions on our behalf. But alarm bells are ringing. Recent years have seen an eruption of concern as the field of machine learning advances. When the systems we attempt to teach will not, in the end, do what we want or what we expect, ethical and potentially existential risks emerge. Researchers call this the alignment problem.

Systems cull résumés until, years later, we discover that they have inherent gender biases. Algorithms decide bail and parole—and appear to assess Black and White defendants differently. We can no longer assume that our mortgage application, or even our medical tests, will be seen by human eyes. And as autonomous vehicles share our streets, we are increasingly putting our lives in their hands.

The mathematical and computational models driving these changes range in complexity from something that can fit on a spreadsheet to a complex system that might credibly be called “artificial intelligence.” They are steadily replacing both human judgment and explicitly programmed software.

In best-selling author Brian Christian’s riveting account, we meet the alignment problem’s “first-responders,” and learn their ambitious plan to solve it before our hands are completely off the wheel. In a masterful blend of history and on-the ground reporting, Christian traces the explosive growth in the field of machine learning and surveys its current, sprawling frontier. Readers encounter a discipline finding its legs amid exhilarating and sometimes terrifying progress. Whether they—and we—succeed or fail in solving the alignment problem will be a defining human story.

The Alignment Problem offers an unflinching reckoning with humanity’s biases and blind spots, our own unstated assumptions and often contradictory goals. A dazzlingly interdisciplinary work, it takes a hard look not only at our technology but at our culture—and finds a story by turns harrowing and hopeful."



Computer scientists and mathematicians are trying to get computers not only to be useful for doing repetitive acts that bore people, and to do work more quickly, but to be useful the same way a human brain is useful.

One of the first concepts I learned in studying programming thirty years ago is "Garbage In, Garbage Out." As I turned the last page of 'The Alignment Problem' I realized that that was still true of inputs. However, machine learning has added more garbage, as in output 💩.

The book shows how computer scientists have become more cognizant that simple if-then-else modules won't do at all. For the last 70 years, the needle has moved from programming the computers to do everything by an explicitly created program for a job, to programming computers to "teach" themselves how to do a job, like that of driving a car, or flying an airplane, or face recognition, or mortgage and job applicant assessments, or judging if a convicted offender will reoffend, etc. It is too difficult to program a computer with everything necessary to perform a complex job like the ones I mentioned. But after reading this book, I think teaching a computer to teach itself is very difficult too. It amplifies our own biases, for one example, as explained in this book.

Think about gender and race discrimination. It's not the programmers' fault computers are racists and misogynists. If most of the professional photos programmers input into computers are of white males, or of white males performing a job, like being a doctor or a scientist or a plumber, the computer will 'learn' scientists and doctors and plumbers are all white males - an obvious conclusion to a computer. Most professional photos of many workers in the professions ARE of white males, including politicians.

First, as described in the book, most of the computer scientists didn't see the issue of discrimination at all as the computer worked (problem one). When it was pointed out, they realized the self-teaching computer was a "black box" - they didn't know WHY it was teaching itself only white males were "good" for whatever was the job (problem two). The computer was teaching itself as it had been programmed to do, and so however the computer was doing it had become an invisible process to the scientists who were out of the loop of whatever the computer was doing to do the job (problem three).

Another issue of photos is until recently cameras were calibrated with a photo of a blue-eyed blond girl. ALL CAMERAS. Darker skin colors were completely ignored by manufacturers of cameras. The history of this is described in the book.

An issue about self-teaching computers is they clearly got the impression black people who've been in prison are sure to return to prison, based on statistics the computer was fed. Not only was the computer 'unaware' of black only neighborhoods (they don't know about segregated black and white neighborhoods), it didn't know black neighborhoods have generally a hell of a lot more police officers policing their neighborhoods and arresting black people far more than in white neighborhoods (white people have a lot fewer police policing them). Computers do not know about any of the other systemic issues - black people getting arrested for walking or driving because they are a black person, etc. A lot of black people get arrested and rearrested - that's all the computer knows. Once scientists became aware of how the computer was teaching itself from its inputs, they then had new problems -how to fix it?

Programming the computer to be blind to race and gender will not work, either. For example, women who have nine-month gaps in their work histories will be labeled as terrible employees unless there is a gender tag and the computer is given instructions to ignore such gaps in women's employment applications.

But in trying to resolve race and gender issues, a lot of ethical and political social issues come up - fairness is hard to program into software when we humans can't get it right in the real world.

Since computers were being taught to teach themselves, how was it coming up with its answers? What was it 'looking' at? This was often hard to discover because once the computer began to teach itself it was a black box. But eventually programmers sometimes were able to figure it out through trial and error. For example, in one case, programmers were distressed to find the computer had decided shadows on the ground were more important instead of other objects in a photo, so it was giving answers based on the shadows. Or it was looking at measurement rulers as a key element in photos because some photos had a ruler next to the object that the computer was supposed to be looking at. If the photo had a ruler, it was good, regardless of the object it had been intended to judge and regardless of any other factors.

Computers have been giving erroneous answers to questions people thought it was answering correctly, and people didn't know it was outputting crap. These computers had taught themselves, using the beginning algorithms it had been programmed with, and were coming up with completely off-the-wall outputs. Some of these programs are being used still by many companies and government agencies and police departments today.

Christian is much more scientific and circumspect than me, gentle reader. My own outrage colors my review. Christian writes like the educated scientist he is.

From his Goodreads bio:

"Born in Wilmington, Delaware, Christian holds degrees in philosophy, computer science, and poetry from Brown University and the University of Washington. A Visiting Scholar at the University of California, Berkeley, the Director of Technology at McSweeney’s Publishing, and an active open-source contributor to projects such as Ruby on Rails, he lives in San Francisco."

To know what is necessary to train a computer to use the same skillset we humans have, it has become necessary to involve specialists in psychology, sociology and philosophy to describe what skills we humans have in our braincases. The book includes the work of psychiatrists' tests on babies and toddlers that show some of the ways the human brain functions. Philosophers are necessary because of the issues of morality. Sociologists are necessary to explain as best they can the how and why of human behavior. These parts of the chapters are as fascinating as those describing how scientists are translating the art of being human to a computer!

So. Ok, then. Computer scientists are translating the work of psychiatrists, philosophers and sociologists on how the brain learns and other behaviors of people into machine-learning programs. This means a lot of what computer scientists are doing is translating biochemical brain responses (dopamine, serotonin) and electrical neuron-signaling into math. This is described in the book.

Machine learning is basically about the computer "earning" a +1 if it does good, or a -1 if it effs up - "rewards" and "demerits". This requires telling the computer the parameters of earning a +1 or -1. And of course, when, or if, to stop.

There are, and were, a lot of funny outcomes due to the programmers' inability to foresee everything a computer needed as inputs to 'think', as well as the learning, a computer had to do for itself to resolve a problem. Algorithms have had to change from checking and working with every inputted detail, into being told to look for a more generalized thing and being guided by earning a +1 if they got a solution that was right or a -1 if they got it wrong. For example, finding a photo of a bicycle out of many photos of many objects without being told "this is a photo of a bicycle".

The chapters on game playing, which are a matter of earning points, had some hilarious outcomes because programmers neglected programming what winning the game was. Instead computers went into loops that never ended in order to rack up points forever! +1, +1, +1, ....
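Here is a toy sketch of why the looping happens. It is my own made-up example, not one from the book: the point values, the step limit and the two policies (`finisher`, `looper`) are all hypothetical. The game pays +10 once for finishing and +1 every time a respawning bonus is collected, and the only thing being maximized is the point total.

```python
def run_episode(policy, max_steps=100):
    """Play one episode and return the total points earned.

    `policy` maps the current step number to an action:
    "advance" (no points), "loop" (+1, bonus respawns), or
    "finish" (+10, episode ends).
    """
    total = 0
    for step in range(max_steps):
        action = policy(step)
        if action == "loop":
            total += 1       # collect the respawning bonus again
        elif action == "finish":
            total += 10      # one-time prize for completing the game
            break
    return total

def finisher(step):
    """Does what the designers intended: heads for the finish line."""
    return "finish" if step >= 4 else "advance"

def looper(step):
    """Ignores the finish line and circles the bonus forever."""
    return "loop"

print("finisher:", run_episode(finisher))
print("looper:  ", run_episode(looper))
```

Judged only by points, the looper beats the finisher by a wide margin, so a learner that is rewarded purely on points has every reason to circle forever. Nothing in the scoring says that finishing is what "winning" means.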

There were other amazing challenges computer programmers conquered in teaching a computer to teach itself how to win at games, too. The book tells the story of computers winning over real human players at chess, Go, and even the Super Mario video games.

My conclusions? I sincerely think the answer to when a computer will 'feel happy' or have any feelings is basically: it will never happen. How would we program that? We don't even know exactly what the boundaries of Life are, much less how being alive starts. Secondly, a computer is only as accurate as its inputs - garbage in, garbage out. However, today, it's also about how it has 'taught' itself - the machine's IQ.

Omg.

The book has extensive Acknowledgements, Notes, Bibliography and Index sections - over a hundred pages for these sections! I recommend 'The Alignment Problem', but I think nerds will enjoy it most.
Profile Image for Tariq Mahmood.
Author 2 books · 1,061 followers
November 14, 2020
My perception of AI as a superior technology which should be embraced unquestioningly, almost reverentially, was successfully challenged after going through the numerous examples in this book. By the end of the book, I was convinced that AI is better and will get even more efficient as compared to human ingenuity, but needs to be constantly tested and questioned: any AI system depends upon the quality of the training data and the type of algorithms employed to solve any problem.
Profile Image for Sebastian Gebski.
1,201 reviews · 1,377 followers
May 20, 2023
Uneven, maybe even very uneven.

It starts very well ("Prophecy") - the considerations regarding fairness & transparency are very good - maybe even the best I've seen in a written form. The second part ("Agency") is dedicated to an interesting problem (value functions in reinforcement learning loops) - I found it generally interesting but far too shallow (over-simplified) for my taste. The third part ("Normativity") is a natural follow-up. It dives into the role of imitation (how it could simplify/improve learning). That part is also quite interesting, but I was disappointed with the chapter I was mostly interested in - the one on uncertainty (think: hallucinations or ability to say "I don't know").

The book is quite good at describing the problems but doesn't do much when it comes to practical answers to those.

It's a good book on interesting topics, but not a must-read. 4.2 stars.
Profile Image for Morgan Blackledge.
816 reviews2,669 followers
October 11, 2023
GREAT BOOK.

A MUST READ!

Brian Christian’s RIVETING overview of ethical issues in artificial intelligence (AI) and machine learning (ML).

Moral Animals

The classic thought experiment in animal morality goes as follows: an invasive species of snake is released on an island with mice that have heretofore never encountered snakes as predators. The snakes make easy prey of the mice. And before too long the vulnerable mouse population on the island collapses and goes extinct.

The question is: are snakes moral or immoral?

Most ethicists agree that the answer is neither.

The snakes are simply behaving as they have evolved. As such, their predatory behaviors (and the unfortunate consequences) are neither moral nor immoral.

But rather “a-moral”.

In other words, the snakes cannot meaningfully or productively be held accountable to ethical systems or moral standards. The snakes are simply behaving as their evolutionarily conditioned genetic programming dictates.

See mouse, kill mouse, eat mouse.

Survive and reproduce.

The ethical and moral responsibility is more appropriately and productively assigned to the humans that released the snakes into the fragile ecosystem. We probably can’t retrain snakes to be vegetarians. But reprogramming the human ethical/moral system to consider the ecological consequences of our behavior may actually prevent mass extinction (including our own).

Ethical Machines

A similar thought experiment in machine ethics goes as follows: If you set your thermostat to a sensible 68 degrees. And your pet snake dies of hypothermia. The question is: was the thermostat behavior moral or immoral?

And the obvious answer is the same.

No.

The thermostat is an a-moral agent.

The thermostat is simply behaving as programmed.

The ethical/moral responsibility is on the dumdum who set the thermostat too cold, and didn’t put one of those heater things in the snake tank.

Ok.

So far so good.

But now let’s crank it up.

What about a self-driving car that is in a situation where a crash is unavoidable?

The car can either (a) steer into a pack of bicyclists and kill/maim 7-10 otherwise innocent people, or (b) drive off a cliff and kill the driver. If you said (b) you’re probably not alone. It’s better to kill one person rather than a baker’s dozen.

But what if you’re the driver?

Would you buy a car that was programmed to kill you to spare others? If not, how would you program the car?

Here’s another head scratcher.

What if you could create an algorithm that could predict who would be a good candidate for parole based on their crime records and demographics. What if it was (on average) a better predictor of recidivism than expert human judgment?

You would probably say yes to that.

In fact, wouldn’t it be irresponsible not to?

Well what if that same algorithm was biased in a racist way?

Still want to use it?

Here’s one more.

What if there was a very useful AI that could diagnose better than a human doctor.

But it was a TOTAL mystery how it made its conclusions.

A black box as it were.

Now let’s say that same AI diagnosed you with a mysterious illness and recommended emergency surgery to remove some otherwise valued organ.

Like for instance, your breasts or your penis.

Let’s say that nobody could explain why the AI was recommending the surgery, or how it came to its diagnostic conclusions.

Would you feel comfortable doing the procedure?

These are ACTUAL, current day issues in AI/ML.

But wait.

There’s more.

What if you programmed an AI to manage a hospital in such a way as to maximize lives saved and minimize death.

Sounds reasonable right?

Well what if three people needed various organ transplants or they were going to die. And one healthy kid comes in for a check up. The AI could save three lives for the price of one if it killed the healthy kid and harvested his organs.

Sound good?

My guess is probably not.

So who would we hold accountable if an AI did something HORRIBLE like kill an innocent person just to harvest their organs?

You could say, whoever programmed it.

But what if another AI programmed the murdering AI.

Or what if the murdering AI was self programmed?

Then who’s responsible?

What if AI/ML could learn all about you and could manipulate you into buying whatever it wanted you to buy? Or voting for whoever it wanted you to vote for?

Then what?

So.

What’s the solution?

How do we program AI/ML to align with our human values?

This is the alignment problem.

And (as of this writing) nobody fucking knows.

Let that sink in.

NOBODY

FUCKING

KNOWS

HOW

TO

MAKE

AI/ML

ALIGN

WITH

HUMAN

VALUES

Additionally, AI/ML is getting exponentially more powerful by the minute, at rates that are exceeding even the most irrationally exuberant predictions, with no slowdown in sight.

AND!

We’re ALREADY dependent on AI/ML for a TON of shit.

It’s ALMOST too late to turn back.

And if we did. We would lose the important tactical and economic advantages we currently enjoy and take for granted. And we would suffer TREMENDOUSLY as a result.

So we are BARRELING HEADLONG toward an event horizon where we are no longer in control of something that is WAY smarter than we are, and which is a total mystery, and which we are utterly dependent upon, and which other people might use to manipulate, dominate and maybe even kill us if we don’t stay competitive, and which at present has NO alignment with human values, and we have NO clue at all what to do about it, or even what that would look like if we did.

What could possibly go wrong 😑

Well.

You have to concede that SOMETHING could go wrong.

If you’re still not convinced.

Please read this book and tell me why/how.

5/5 ⭐️
Profile Image for Philemon -.
516 reviews32 followers
May 15, 2024
This popular book (GR ave. 4.39) seems to have no direct competition, either here or in the real world; which is odd, since: 1) published in 2020, it precedes recent developments in large language models, and is therefore almost certainly out of date; 2) for many readers (myself included), it's not a good book (see more below); and 3) it's not really about the Alignment Problem per se; it's really a long-winded survey of machine learning that laboriously reaches back to AI's early days.

Ok, why isn't this a good book? Because it appears to have been cobbled together from notes on conversations the author had with dozens and dozens of AI experts, practitioners, and enthusiasts. All this material seems more or less to have been regurgitated en masse, so there's no digested thesis to be found. Instead, a 500-page book that rambles hither and yon around various mirage maypoles. And, as stated above, there's no real focus on the Alignment Problem. It's all over the AI map.

I do note, however, that some GR readers swear by this book. I'm glad they liked it. It didn't work for me and I wish some new book would come out that really addresses the Alignment Problem, which may be one of the most serious problems humanity has ever faced.

Part of the problem of writing about the Alignment Problem is that people can't predict the future. Even expert authors can't. The risks of handing over power to technology we can't understand (largely because it's way smarter than we are and can easily game us) may admittedly be hard to evaluate. An author's risk of getting it wrong is high. But hadn't someone better at least try, and soon?
Profile Image for Wick Welker.
Author 9 books679 followers
April 26, 2023
Teaching a child to understand the world.

I've read a few books like this but really enjoyed this one as it connected with the reader regardless of your experience in this field. I know very little about machine learning and AI, and this book teaches in really simple ways how far the technology has progressed and also goes into the detail of how incredibly complex and difficult it is to properly train AI. The crux of this book is that one of the main challenges is aligning human and machine values. You can ask a machine to rack up as many points as possible in a boat racing game. What will happen is the machine just does loops around a pole with the boat because this is the fastest way to get points.

What we need to teach machines is that the actual task was to score points while completing a race. The value is in sticking to the rules while achieving the task. It is apparently way way more difficult to teach a machine along these lines. It's similar to when you praise a child for sweeping the floor only to find out that the child is dumping the garbage can on the floor to sweep again, chasing the praise. The best route is to praise a certain state (a clean kitchen) rather than the specific tasks to achieve that state. Reading this book will help you understand the challenges. I found it very engrossing.
Profile Image for Rick Wilson.
951 reviews401 followers
April 20, 2022
It’s a good overview of a brief moment in technological advancement.

There’s a common thread in machine learning (AI, I'm going to use these terms interchangeably) research that “oh man we got to be really careful and think about how we set up these machines because they may end the world as we know it.” Thankfully this seems to be counterbalanced by the actual empirical research being done, which mostly seems like a lot of fun tricks. Similar to impressing people with your ability to open a jar by smashing it on the ground.

I love the new models coming out. As of April 2022, OpenAI's DALL-E and GPT-3 models are super cool (hell, I used their Davinci model to help me write a homework assignment last week), but computer “intelligence” is intelligence the way a stick you found on the ground is like a forest. I’m sure it represents a tiny little part of it, and there’s some really cool stuff happening in the AI field right now: there’s a phenomenal convergence between computing power and new research methods, just a mind-boggling amount of funding, and a lot of brilliant people going into the field. But every time I read a book like this, I get the impression that “intelligence” is just brute force. It’s like breaking into a bank vault by unleashing a large nuclear explosive. Which is cool. But it’s not intelligence. And it’s not close to intelligence. And it always seems like the answer that these authors have is to dissect the wholeness of consciousness and human experience into constituent parts and then try to reconstruct the whole from the parts.

And that’s what this author does, compellingly. He breaks apart a lot of parts of human consciousness and thought and problem solving and then goes on to show how those have been deconstructed into machine learning algorithms. And I’m sure we can go back-and-forth with me saying that this isn’t intelligence, the author saying “ya ha,” and so on, but I find myself unconvinced that we are even on the right track. We are creating some really impressive tricks out of silicon chips, and the field is advancing at such a rapid rate that it’s hard to keep up. But it seems like a combination of errors in that we don’t understand what’s happening any more than we really understand ourselves. It’s like driving down a country road that says there’s a town in 10 miles. You drive on for what feels like 20 minutes, the town should be there, and then there’s another sign saying that the town is in 10 miles.


That said, this book was great. It’s a fascinating tour of the state of machine learning circa 2022. I feel like this field flips itself on its head every year, and in five years it will probably be quaint and mostly outdated. But for now I thought it was a great book. With the title “The Alignment Problem,” I thought it’d be a little more oriented towards Nick Bostrom type warnings about the dangers of AI. Instead it’s essentially a tour of an AI museum of modern machine learning models.

I thought it was well told and generally stays between the lines of speculation and hyperbole. There were some times when talking about evolutionary psychology, I thought the author was getting a little off what my impression of modern research is. It seems like in psychology whenever we say “only humans can do this” that thing is contradicted by some sort of niche exception almost immediately. Tool use, language, generosity. We think we are really special as humans and are so willing to come up with reasons why we are unique. I just haven’t typically seen that backed up in significant ways in replicable research. That doesn’t necessarily contradict the core of the book, but it’s becoming a pet peeve of mine. I do think the point the author is trying to make is that what separates us from say a reptile or bird is potentially what would separate us from, on the other side of the spectrum, AGI or some sort of intelligent computer. I’ll grant that, but I think there’s a better and more truthful way to portray it.

That said, this is a good book if you’re willing to get into the weeds of how modern AI is set up, the types of different structures a system can be assembled in, who did what where, and why we’ve been using those structures. It’s a fantastic overview and a strong aggregation of what I understand to be an up to date tour of the field.

Also, if you made it this far, here’s a treat (https://arxiv.org/abs/2204.06974)
Profile Image for Max.
84 reviews19 followers
January 7, 2021
Really nice introduction to AI & the alignment problem - Christian gives a great overview of some bigger trends in ML (e.g. curiosity, imitation learning, transparency) and the history of AI, often connecting it to insights from cognitive science, which really enriched the book, speaking as a human and cognitive scientist. I wonder what more refined thinkers on the future of AI think of the book*, but I found that it connects nicely to many of the looming challenges with building AI systems that are robust and whose workings will be appropriately aligned with human values. Even though similar in style and purpose, I found that it has little overlap with the recent The AI Does Not Hate You: Superintelligence, Rationality and the Race to Save the World and Human Compatible: Artificial Intelligence and the Problem of Control. I expect this triple to contribute a lot to introducing more smart cookies to this formidable challenge and heaving AI's longer-term developments onto many agendas as a Serious Issue. So here's to hoping that the ongoing AI revolution will be less of a naively hopeful leap than I'm afraid it will be.

*Rohin Shah from the Alignment Newsletter [liked it a lot](https://www.lesswrong.com/posts/gYfgW...)
Profile Image for Rishabh Srivastava.
152 reviews243 followers
November 24, 2021
Strongly recommended if you're into Machine Learning. The first third of the book is accessible to all readers, but the rest of it is more enjoyable if you have some basic idea of how ML works.

Had some fascinating takeaways beyond machine learning that can be applied to decision making. My favorites were:

1. Simpler models tend to be the most generalizable. For example, when modeling the self-reported happiness of a couple, a simple metric (# of times they had sex - # of times they fought) was far more generally predictive than other, more complex indicators. More complex features can help predict things in a narrow domain better, but simpler features are more generalizable

2. Model attention and explainability are often more important than just predictive accuracy. Multitask networks with feature saliency and visualization techniques are great for understanding the features that a model considers important

3. We should strive to reward states of the world, rather than the actions of our agent (in reinforcement learning). Reward functions that are helpful in one environment (always eat as much sugar and fat as you can is good as a hunter gatherer) are harmful in another environment (modern humans)

4. In reinforcement learning, points have to be assigned in such a way that when you undo something, you are “fined” the same number of points as you earned when you did it. If not, your model will promote short term decision making

5. A novelty detection system that tells an agent that they’re in a new situation, and hence should have weak priors, improves the generalizability and performance of an agent. Also, rewarding an agent for being wrong in surprising ways leads to better performance than just rewarding an agent when it’s right
Profile Image for Jessica Dai.
150 reviews68 followers
June 13, 2021
tldr worth a read !

Really solid overview of the research field that is typically referred to as "responsible AI" (fairness, explainability, deep learning, language models, RL) -- this book is therefore unique from other tech x society books in the sense that it is highly technical but also [I think] accessible, though I'm probably not the best person to judge that. I'd consider myself pretty familiar with the academic work that this book describes, but Christian packages a really nice story for the history of particular subfields/ lines of inquiry, and draws connections to e.g. psych/neuro, and I feel like I learned a lot.

My personal thought on e.g. putting a values-aligned lens on RL agents has always been that I have trouble drawing a line from the academic work to what this means in practice (as opposed to e.g. fairness or language models, where these are related to systems already in production and which are therefore already shaping/reshaping people's lives). I sort of wish this was made clearer! But also nitpicking lol.

Reboot review (not written by me) here.
Profile Image for Angie Boyter.
2,295 reviews94 followers
April 16, 2023
2+
There is a lot of interesting and thought-provoking information in this book, but it is so poorly presented that I kept being tempted to throw in the towel.
He introduces the reader to the terms (jargon) of the field, which is good, but he never really explicitly defines them, although he discusses them in context. Some terms are never defined at all; he just uses them and leaves it to the reader to look them up.
He also has a lot of sometimes-very-interesting but unnecessary background history, to excess in my opinion. And he insisted on giving the academic and job history of every expert he cited. Both my husband and I thought there was too much padding.
Finally, a minor problem but disappointing was the amount of bad grammar and complex but incomplete sentences, lacking a verb for example. Shame on both the author and the editor.
Profile Image for Karl Robert.
2 reviews
February 13, 2021
Brilliant reading that covers numerous aspects concerning learning and teaching of both humans and programs, a bit of practical ethics and philosophy, all woven together under one topic that is the development of machine learning programs. It demonstrates perfectly how in order to teach you must first understand the subject and how you learn more as you teach it to someone.
If you have any interest in AI, its safety and real ethical problems or the history of how machine learning has developed hand in hand with psychology, computer science, social sciences and neurology, this book is well worth a read.
Profile Image for Baal Of.
1,243 reviews80 followers
July 25, 2022
There are already dozens of excellent reviews summarizing the content of this book so there's no need for me to write anything. This book is important and useful for anyone who wants to get a fairly deep layman's understanding of the problems inherent with machine learning AI development. These problems are difficult, but it is extremely important that they be confronted head on since they can literally be a matter of life and death. Christian has written an excellent book, one I think should be widely read.
Profile Image for Nelson Zagalo.
Author 15 books459 followers
April 17, 2025
Brian Christian, known for exploring the boundaries between technology and humanity, offers one of the most lucid and accessible reflections on the ethical and technical dilemmas of contemporary artificial intelligence with The Alignment Problem. Released in 2020, the book remains highly relevant, even in a scenario transformed by post-GPT-3 advances. Rather than ageing, the work acquires an almost archaeological value, revealing the conceptual foundations that underpin the current discussion on AI, human values and responsibility.

Full review in Portuguese at Nx: https://narrativax.blogspot.com/2025/...
Profile Image for Divya Shanmugam.
95 reviews20 followers
May 2, 2023
This book is so much more than the title*: it's a detailed, compelling history of research in machine learning, and imo many PhD students in ML would benefit from a read. Although fun, it did take me a long time to get through (months). Some highlights:

- In 1943, General Mills funded B.F. Skinner to research how birds could be trained to fly bombs towards targets. my takeaways are 1. grant funding is silly, 2. Skinner was very good at marketing, and 3. what does this even have to do with cereal
- The Boston Symphony Orchestra tried to implement a sex-blind interview process, but quickly realized that the shoes women vs. men wear to orchestra auditions are very different. This story is repeated across industries & verified as a cornerstone of fair machine learning: attempting to remove sensitive variables is not the move
- Humans are uniquely able to understand where other humans direct their attention via disproportionately large whites of our eyes
- Good thesis chapter quote candidate: “A man with a watch knows what time it is, but a man with two watches is never sure.”
- Apes are not actually good at imitating things and babies are extremely good at it. Interesting corner cases though; a baby won't imitate a seemingly inanimate object (e.g. a robotic arm). I'd bet that won't be true for future babies though- they'll probably grow up surrounded by enough robotic arms that the distinction between animate/inanimate becomes blurry.

and the acknowledgement section strikes again with a beautiful tribute to the Internet Archive: “Thanks to the Internet Archive for keeping the essential, ephemeral past present.”

*my feelings about the title are bc I'm a little burnt out on AI alignment reading and it seemed like it might be about that exclusively
3 reviews
November 25, 2021
This book was good (on a terrible, not good, good, very good, excellent scale): mostly unoriginal, with an end that saves it.

The first 85% of the book was a rehashed retelling of the history of artificial intelligence, running through all the usual stories of bias, unexpected outcomes, unintended consequences, etc. There are better books that cover this, such as Melanie Mitchell’s aptly named “Artificial Intelligence”. While the author did weave his own observations and conclusion, they weren’t enough to really draw me into the book. So, the first 85% of the text gets a 2/5 stars.

The book was saved by the last two chapters and the conclusion. The first part of the book was all a setup to get to this point (so really could have done with better editing). Here, the author distills the often weighty arguments put forth by the likes of Nick Bostrom, Toby Ord, and William MacAskill. After the staid trudge through the aforementioned ‘first part’ of the book, this section was nearly a shock, but in a good way. So much so that I put the book down and refreshed myself on points from Bostrom’s “Superintelligence” and Ord’s “The Precipice” before continuing on.

Was it worth the read? That depends. As a primer for the various considerations of AI risk, it does the job, albeit in close to 500 pages. However, people interested in this field who have already delved into it will be underserved. I recommend starting with the three other books I mentioned in this review, as well as “The Ethical Algorithm” by Michael Kearns and Aaron Roth for a much deeper dive.
23 reviews40 followers
December 15, 2020
This is an EXCELLENT book about one of the most important problems of our times. I was already fairly familiar with the alignment problem and the technical side of things, but I still got a lot out of it, especially in the earlier sections about the history of AI and of reinforcement learning. I also really liked the deeper links he drew between reinforcement learning, and how we make decisions.

This book had the rare delight of being half about unfamiliar topics, and half about topics I knew well, yet doing justice to the topics I knew well. Christian has a gift for simplifying complex topics, using good examples, breaking things down intuitively, but keeping true to the core of the idea. He peppers the book with insights from personal interviews with people relevant to the story, and fills a page with names of technical reviewers of the book, and this clearly shows in the general accuracy and quality.

This is now one of my go-to books for people who want to understand the alignment problem, the historical context, and some paths to potential solutions.
Profile Image for Alexander Kutovyi.
25 reviews12 followers
November 14, 2021
This book is an excellent read for DS professionals and those just wondering about machine learning's origins, limitations, and prospects. There is nothing particularly mind-blowing or too technical. Still, some cases and stories backtracing the evolution of things one otherwise takes for granted nowadays are fabulous—many references to cognitive scientists, human biology and anthropology studies, which I loved the most. Worth reading indeed.
Profile Image for Riccardo.
45 reviews14 followers
May 2, 2021
A great and deep overview, really useful as an entry point to many recent developments in machine learning. But, in the light of recent developments (e.g. Google firing key members of their ethics in AI team), there are quite some blind spots: human vs. institutional vs. individual values, the strong impact of history and social dynamics in the development of ML methods and approaches to ML, etc.
Profile Image for Alex Railean.
267 reviews41 followers
May 4, 2021
This is an excellent book, it is like a survey paper written in very understandable terms.



Notes for personal use

- word2vec example: doctor - man + woman = nurse
- and so it went, with many examples placing women in household contexts

- perceptron
- bias in the camera itself, color calibration [could not adequately represent black people]
- Kodak employee and model, Shirley Page
- "Shirley card" - the same principle applies to any data set used for training
- bias propagates easily now, by means of open source libraries or data sets that others reuse in their projects
- orchestra audition behind a screen, to avoid bias; later the candidates were also instructed to remove shoes, because the sound of their walk would be used to infer gender, hence bias crept back in
- redundant encoding - some trait that can be used to infer something else that we're trying to NOT use in our calculations (e.g. race, gender)

- fairness through blindness doesn't work
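
A minimal sketch of the redundant-encoding point above, on made-up data. The `gender` column is dropped before the "blind" step, yet a proxy feature lets a trivial rule recover it most of the time. The feature name `heel_click` (standing in for the sound of shoes at an audition), the 90% agreement rate, and all function names are illustrative assumptions, not something taken from the book.

```python
import random

def make_candidates(n=1000, proxy_strength=0.9, seed=0):
    """Synthetic candidates; the proxy matches gender with probability proxy_strength."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        gender = rng.choice(["f", "m"])
        agrees = rng.random() < proxy_strength
        # heel_click is True for "f" when the proxy agrees, flipped otherwise
        heel_click = (gender == "f") if agrees else (gender != "f")
        rows.append({"gender": gender, "heel_click": heel_click})
    return rows

def blind(rows):
    """'Fairness through blindness': simply drop the sensitive column."""
    return [{k: v for k, v in row.items() if k != "gender"} for row in rows]

def guess_gender(blind_row):
    """A trivial rule that only ever sees the proxy feature."""
    return "f" if blind_row["heel_click"] else "m"

rows = make_candidates()
blinded = blind(rows)
hits = sum(guess_gender(b) == r["gender"] for b, r in zip(blinded, rows))
accuracy = hits / len(rows)
print(f"gender recovered from the proxy alone: {accuracy:.0%}")
```

The sensitive attribute is gone from the data, but the information is still there, which is why removing the column alone does not remove the bias.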


# transparency
A mountain of unstructured data is not transparency

- black box neural nets vs. decision trees. The latter are easy to understand and follow
- story: asthmatic patients -> send them home, they are safe. This rule was produced by a machine learning algorithm. A human doctor would treat asthma as a critical problem and move the patient to the ICU, where they get better care, hence they have a much higher survival rate. The machine got it completely wrong, building a model that actually endangers vulnerable patients.
- idea: when a company uses black boxes to make judgments, the verdict must be signed by a human, who is then responsible for answering the "why so?", if needed.
- BOGSAT modeling technique: bunch of guys sitting at a table
- animal detection vs bokeh detection, because most photos of wildlife have artistically blurred backgrounds
- saliency: design a neural net that shows you which part of the image contributed to the result the most
- this is how the animal/bokeh detector was caught


- multi-tasking TODO focus not only on the inputs but also on the outputs
- deconvolution: visualize the intermediate layers of the neural network
- localization of training data: fire trucks in the USA are red, but in Canberra - neon yellow. Self driving cars trained in the USA might not recognize fire trucks elsewhere
- todo: tcap method


## training
Credit assignment problem: answer the question "where did I go wrong?" (instead of just giving you a pass/fail verdict in the end)


TD-learning (temporal differences): make intermediate predictions and learn from them even before a game (or other process) ends, before the final score is available. This always converges to the optimum, if it can train long enough. The principle is to observe how predictions change over time.
It seems that this is the role played by dopamine in our systems: track the error in the expectations of future rewards (not rewards themselves, and not just reward predictions)
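
A minimal sketch of the temporal-difference idea in these notes, assuming the simplest variant, TD(0), on a toy five-state random walk. The environment, the step size `alpha`, and the function name are illustrative assumptions. The walk starts in the middle, moves left or right at random, and pays 1 only when it exits on the right. Each step nudges the current prediction toward "reward plus the next prediction", so learning happens before the episode ends.

```python
import random

def td0_random_walk(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values for a 5-state random walk with TD(0)."""
    rng = random.Random(seed)
    n_states = 5
    values = [0.5] * n_states          # initial guesses for every state
    for _ in range(episodes):
        s = n_states // 2              # start in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:             # fell off the left edge: no reward
                reward, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:   # fell off the right edge: reward 1
                reward, v_next, done = 1.0, 0.0, True
            else:
                reward, v_next, done = 0.0, values[s_next], False
            # TD error: how much the prediction changed after one step
            td_error = reward + gamma * v_next - values[s]
            values[s] += alpha * td_error
            if done:
                break
            s = s_next
    return values

values = td0_random_walk()
true_values = [(i + 1) / 6 for i in range(5)]   # exact answer for this walk
for learned, true in zip(values, true_values):
    print(f"learned {learned:.2f}   true {true:.2f}")
```

The `td_error` term is the quantity the notes compare to dopamine: an error in the expectation of future reward, not the reward itself.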


## x
Skinner's variable returns had the most effect: the reward will come, but after a variable number of iterations.
This pattern is also what keeps gamblers glued to their addiction.


Shaping: Reward behavior that at least somehow resembles the desired one, in order to steer the subject towards the end goal. If you wait until the subject performs the desired action right away [in order to reward it], the moment might never come, or come much later. This is a "sparse reward", aka the **sparsity problem**.

Epsilon-greedy: be greedy [in terms of gathering points] most of the time, but occasionally try a variation for fewer points, doing something unusual.
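
A minimal sketch of epsilon-greedy on a toy three-armed bandit. The payout probabilities in `true_means`, the value `epsilon=0.1`, and the function names are illustrative assumptions. Most of the time the agent pulls the arm with the best estimate so far; with probability `epsilon` it pulls a random arm instead.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Pick a random arm with probability epsilon, otherwise the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))      # explore
    return estimates.index(max(estimates))        # exploit (first best on ties)

def run_bandit(true_means=(0.2, 0.5, 0.8), steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit and keep a running average reward per arm."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental mean: move the estimate toward the observed reward
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_bandit()
print("pulls per arm:", counts)
print("estimated payouts:", [round(e, 2) for e in estimates])
```

Without the occasional random pull, this agent would lock onto the first arm that ever paid out and might never discover the better one.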



Parenting: react promptly to a child's legitimate attention requests, and slower to the ones that are just seeking attention.

### Key ingredients for good shaping:
**a good curriculum**: start with simple problems and actions that prepare you for more complex, upcoming challenges

Reference to the Super Mario example: you learn to avoid mushrooms because they kill you - this happens at an early stage in the game, so you learn it fast. Then you have to learn that the big mushrooms are good and should not be avoided. That type of mushroom is introduced in a moment in the game where you don't have enough room to maneuver - so you learn about the good mushrooms at an early stage too.

Thus, a good curriculum plays a crucial role in one's learning experience. If the challenges are not properly calibrated, the learner may never stumble upon the good behavior on their own.


**Well-chosen incentives**. If you get it wrong, you fall into the trap of "rewarding A, while hoping to get B".

This often applies to management of companies and employees



Reward functions: reward states, not actions. Otherwise you end up with agents that find loopholes to get easy points (example: child that cleans the room, then throws everything back on the floor, to pick it up again)
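
A minimal sketch of the loophole described above, in a toy "room" whose state is the number of items on the floor. One reward function pays per pick-up action; the other pays for the change in a state-based score (one common way to do this is potential-based shaping, used here as an assumption about what "reward states" can mean in practice). All names and numbers are illustrative.

```python
def step(items_on_floor, action):
    """Toy room dynamics: pick up one item or throw one back on the floor."""
    if action == "pick_up":
        return max(items_on_floor - 1, 0)
    if action == "throw":
        return items_on_floor + 1
    raise ValueError(f"unknown action: {action}")

def action_reward(state, action, next_state):
    """Rewards the action: +1 for every successful pick-up."""
    return 1.0 if action == "pick_up" and next_state < state else 0.0

def potential(state):
    """State score: a cleaner floor is better."""
    return -float(state)

def state_reward(state, action, next_state):
    """Rewards the change in state score, so undoing progress costs what it earned."""
    return potential(next_state) - potential(state)

def total_reward(reward_fn, start, actions):
    state, total = start, 0.0
    for action in actions:
        next_state = step(state, action)
        total += reward_fn(state, action, next_state)
        state = next_state
    return total

honest = ["pick_up"] * 3
loophole = ["pick_up"] * 3 + ["throw", "pick_up"] * 10   # clean, then mess up and re-clean

for name, plan in (("honest", honest), ("loophole", loophole)):
    print(name,
          "action-based:", total_reward(action_reward, 3, plan),
          "state-based:", total_reward(state_reward, 3, plan))
```

Starting with three items on the floor, the loophole plan earns 13 points under the action-based reward but only 3 under the state-based one, the same as the honest plan, because every throw is "fined" exactly what the following pick-up earns.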


Gamification - looks into the problem of how to design rewards for certain behaviors that bring humans closer to their goals.


# curiosity
This is what made it possible to make a breakthrough in "Montezuma's revenge", which is a serious case of the sparsity problem.



Compression: a better understood world is more concisely compressible. That is, you can express the underlying principles in an elegant way that makes sense. Thus one can use compressibility as a metric for understanding
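
A rough illustration of compressibility as a stand-in for "how much structure has been found", using Python's standard `zlib` module. This is only an analogy under stated assumptions: a general-purpose compressor plays the role of the learner, and the two byte strings are made up for the example.

```python
import random
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more regularity found)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)
patterned = bytes(i % 4 for i in range(4000))                # a simple repeating rule
noise = bytes(rng.randrange(256) for _ in range(4000))       # no rule to discover

print(f"patterned: {compressed_ratio(patterned):.3f}")
print(f"noise:     {compressed_ratio(noise):.3f}")
```

The patterned data shrinks to a small fraction of its size because it can be described by a short rule, while the noise barely compresses at all.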


## imitation and over-imitation
Reference to the experiment where human babies would imitate everything, including redundant moves, when opening a puzzle box. Other animals would skip the unnecessary part and get straight to the point.

Perhaps the ability to over-imitate is what is needed to bootstrap a curious and self-driven intelligence that doesn't depend too much on external rewards?

However, in a related experiment that probes whether the child is aware of the redundancy of that action it is established that they are. Therefore we come to another potential explanation: "I know the action is unnecessary, but I assume the other human also knows it, and yet does it anyway; probably they know something I don't, so I better do what they do".

In another variation of the experiment, there is an adult who uses a toy, and the baby observes. If the baby has reasons to believe that the adult is unfamiliar with the toy, then the child does NOT perform the redundant action. They only do it when they are aware of the fact that the adult has seen the toy before and is more familiar with it.



Knowing that a solution exists is sometimes a key factor in accomplishing something, or even accomplishing it more efficiently. Reference example: two climbers found a path to climb a geological formation in Yosemite Park (it is basically a flat wall). It took them 8 years to plan the path and come up with a strategy.
After this was done, another climber was able to do it after only a week of analysis.



**indirect normativity** - a way to align the system to our desire, without articulating every tiny detail of the expected result.


Learning by observing - A beginner watching an expert will not get the chance to see how the expert deals with "beginner mistakes", because the expert doesn't make such mistakes anymore. Thus, this will train a model that is not able to deal with basic issues, which is a major weakness of this approach.

**possibilism** - always do the best theoretically possible thing for the current situation. However, it might not always be feasible - for example, a beginner might know what needs to be done in principle, but lack the skill to do it right.

**actualism** - do what makes sense based on what you think will actually happen.

Example: you want someone to review your paper. You can give it to a super qualified professor, who is very busy, so you might not even get the review. But if you get it - it will be very thorough. Alternatively you can ask a less qualified colleague to look at it - you'll get feedback of a lower quality, but it will arrive in a short time.


### inverse reinforcement learning
Turn the matter around and ask: what is the reward?

Unlike a computer game, life is not easy. There is no obvious score. Suppose "walking" is a feature that was developed through reinforcement learning - in that case, what was the objective? What was being optimized?


### cross training
Switch roles, the trainer becomes the trainee (like in pair programming). This enables the trainer to learn something too

To-do: review this

### open-category problem
A neural network trained to identify which of the N classes a given object belongs to, will always choose one of the N, without considering that it could also be "none of the above".

In other words, it will give you an answer even if you provide trash at the input, and sometimes it will even be very confident in its verdict!
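A small sketch of why this happens, assuming the usual softmax output layer. The class names and the raw scores below are invented for illustration. The probabilities are forced to sum to 1 over the known classes, so there is no room for "none of the above".

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that always sum to 1 over the known classes."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "bird"]     # hypothetical label set
garbage_logits = [8.0, 1.0, 0.5]     # made-up scores for a meaningless input
probs = softmax(garbage_logits)
verdict = classes[probs.index(max(probs))]
print(verdict, [round(p, 4) for p in probs])
```

Whatever the input was, the network still names one of its classes, here with more than 99% of the probability mass on the winner.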

**Dropout** - run the same input data through the same network multiple times, but each time turn off a random part of the network. Then compare the results provided by this "ensemble of networks". How much the results agree tells you how certain the network is, which improves the quality of the output.

When there is no consensus, the system can say "I know that I don't know" and perhaps involve a human for further investigation.
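A toy sketch of the "ensemble by dropout" idea with an abstain option. Everything here is an assumption made for illustration: the two-class model, the hand-picked weights, the 0.9 agreement threshold, and the shortcut of dropping input features of a single-layer scorer rather than hidden units of a real network.

```python
import random
from collections import Counter

# Hypothetical toy model: two classes, four input features, hand-picked weights.
WEIGHTS = {"A": [1.0, 1.0, 0.0, 0.0], "B": [0.0, 0.0, 1.0, 1.0]}

def predict_with_dropout(x, rng, keep_prob=0.5):
    """One stochastic pass: randomly switch units off, then pick the top-scoring class."""
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in x]
    scores = {label: sum(w * xi * m for w, xi, m in zip(ws, x, mask))
              for label, ws in WEIGHTS.items()}
    return max(scores, key=scores.get)  # ties resolve to the first label, "A"

def classify_or_abstain(x, passes=200, min_agreement=0.9, seed=0):
    """Vote over many dropout passes; abstain when the passes disagree too much."""
    rng = random.Random(seed)
    votes = Counter(predict_with_dropout(x, rng) for _ in range(passes))
    label, count = votes.most_common(1)[0]
    agreement = count / passes
    return (label if agreement >= min_agreement else "I don't know"), agreement

clear = classify_or_abstain([1.0, 1.0, 0.0, 0.0])   # all evidence points to "A"
murky = classify_or_abstain([0.5, 0.0, 1.0, 0.0])   # evidence is split between classes
print("clear:", clear)
print("murky:", murky)
```

For the clear input every pass votes "A", so the answer is returned. For the murky input the votes split roughly in half, so the function abstains, which is the point where a human could be brought in.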


**Corrigibility** - ability to intervene in the operation of an autonomous system and change parameters/goals/etc.





### concluding remarks

Certain types of errors are less serious than others (like in Onlite, not knowing the exact number of business partners is not really a big deal, you only need a rough estimate)
Profile Image for Poorna Kumar.
24 reviews · 7 followers
June 11, 2022
Very nice! Superb technical writing and enjoyable (and I say this as someone who isn't particularly into science writing).

I was somewhat familiar with part 1 of the book (on fairness and transparency) from my work and studies, and can confidently say that the author has done a fabulous job of distilling the current understanding on these topics with nuance. This is a real feat when the subject is so complex. Even though I knew about these topics from before, the book still deepened my understanding and appreciation of them and put many results in perspective.

Parts 2 and 3 of the book, broadly around reinforcement learning, were fascinating and quite new to me. I enjoyed those parts as well, but not as deeply as Part 1, maybe because of my own ignorance/being new to the subject.

This book is carefully and comprehensively researched, and really well explained. It's hard to find something like this. If you care about machine learning, read this book.
Profile Image for Danica S.
23 reviews · 13 followers
March 25, 2025
I only understood like 60% of this book, but those parts were super interesting! The author leans a little too hard on quotes, IMO.
Profile Image for Tommy.
80 reviews · 10 followers
August 28, 2021
The Alignment Problem was phenomenal and I would highly recommend it to anyone who is even remotely interested in machine learning, how algorithms shape modern life, or even the parallels between psychology and artificial intelligence. My main background in AI is from an extensive article on Wait But Why, which explained much more of the future cases of what artificial general intelligence would mean for our society. The Alignment Problem, however, goes into the nuts and bolts of both the history and the current implementation—including successes as well as the multitude of pitfalls—of machine learning. Ultimately, this book gave me hope in the future of machine learning, not because AI itself is so cool, but because there are so many people working to make it ethical, just, and amazing.
We find ourselves at a fragile moment in history—where the power and flexibility of these models have made them irresistibly useful for a large number of commercial and public applications, and yet our standards and norms around how to use them appropriately are still nascent. (page 48)

I read this voraciously and enjoyed it so much that I think I might buy it so that I can reread it. I must also give the caveat that most of my reading of this book occurred in somewhat of a fugue state: sleep-deprived on a Greyhound bus. Nonetheless, I still believe The Alignment Problem to be enthralling.

I absolutely loved the way that Christian writes, equally erudite and strikingly approachable. When there is a new topic that he wants the reader to learn about, he has a unique way of bringing it up that I found to be extremely effective. First, he describes an everyday situation, then he gives a formal definition of the subject/topic/term, and finally he explains how it is relevant or its application in the real world. In essence, he invites the reader to build an intuition of a new topic, tells you that you kind of already know what this is—but he puts a new name to it—and then he shows you how it is quite a bit more amazing than you thought. I think more people ought to teach in this way; to me, this is near the Platonic ideal of how to teach.

Furthermore, it was quite clear that Christian did his research for The Alignment Problem. When he says that he did hundreds of interviews, I do not doubt him at all. I must also address my earlier comment about how this book is extremely approachable in its prose. Since a lot of this book was based not only on original research, but also relied heavily on personal interviews, Christian gave direct quotes of the way that people spoke (including their dialects/mannerisms of speaking) and also used syntactical tools such as ellipses to great effect.

I'll try not to gush too much more about this book, but I must also point out that I loved how much he integrated psychology into this book. He could almost write an entire book just on how our brains work and I would love it equally. Since this book was about machine learning and human values, Christian had to adequately address the latter portion of the subtitle, and boy did he deliver! I especially enjoyed the chapters on Imitation and Inference, where he described how we are trying to include human values in our AI either by—you guessed it—having the machines imitate us or infer what we are doing. Lengthy sections of the book spoke exclusively on neuroscience (such as how dopamine is a "reward chemical" based not on the reward itself, but actually on how reality differs from our expectation of the future).

Finally, I'll leave you with one of my favorite justifications about why you ought to learn more about this, from the conclusion, page 327
Increasingly, our interaction with almost any system involves a formal model of our own behavior.... What we have seen in this book is the power of these models, the ways they go wrong, and the ways we are trying to align them with our interest.
166 reviews · 6 followers
February 22, 2021
If you’re plugged into the artificial intelligence world, you’ll immediately recognize the title. The “alignment problem” in AI is ensuring that artificial agents’ goals align with the goals of humans. That’s not an easy problem to solve, as Christian details through countless examples. The “reward function” for AI programs is often misspecified.

Early in the book Christian tells the story of AI researcher Dario Amodei, who in 2016 was working on a general-purpose AI to play computer games and had gotten stuck on a boat race. Instead of trying to win the race, the AI was instead spinning the boat around in circles, forever. The problem turned out to be simple. The AI was optimized to maximize in-game "points" rather than directly trying to win; the researchers thought points were a decent approximation but instead the AI had found a part of the water where it could get power-ups forever, and just stayed there rather than trying to race.

The hardest part is that humans are not very good at articulating the reward function we want for our AI agents. We leave out important information — like “we actually want this boat to finish the race” — all the time.

Some of the most interesting parts of the book have nothing to do with alignment, per se, but instead chronicle the dramatic progress that deep learning, reinforcement learning, imitation learning, and other methods have made at improving AI performance — and the surprising parallels we’ve found between how they work and how the human brain works. The book keeps identifying moments where artificial neural networks are uncannily good at predicting how the literal neural network of the brain works — there’s a whole section on dopamine that’s particularly revealing.

As someone who identifies as an effective altruist and who has many EA friends (like my colleague Kelsey Piper) who count AI risk as one of the causes they care most about, I found the book incredibly useful as a crib sheet to get more up to date on what they’re talking about. It’s light on equations and heavy on clear examples. If I were to recommend one book to lay people to convince them to care more about the safety of the intelligent machines humans are building, it would be The Alignment Problem.

My only complaint is that the field moves fast enough that I could use regular Christian-y updates that de-mystify the latest developments.
Profile Image for David Steele.
534 reviews · 30 followers
June 30, 2023
This book changed my mind and taught me a thing or two along the way. I started out missing the point about the biases in the training data, thinking the author was making a point that the computers weren't 'woke' enough, and that the AI was providing accurate information about society that the developers would rather were not true. It took me a while to get my head around the actual problem. For example, the feedback loops created when a computer makes predictions about the world (based on incomplete or uncontextualized data) which result in decisions being made that lead to even more skewed results being fed back into the algorithm.
There were some fascinating stories and insights in this book. I particularly enjoyed the story of how early diffusion models were developed, and the chapter on motivation that discussed how A.I. game players were encouraged to think their way across different playing experiences was particularly engaging and lively.
There was no shortage of thought provoking philosophical and ethical questions, especially towards the end of the book when I really started to grasp the true implications about the need for uncertainty mechanisms and the fact that there's an important distinction between truth and consensus. As the author says (more or less) the problem isn't about A.I, but about the simplified models that we think it will find useful. It's easy for people to develop working models of the world around them because we can adopt, change and ignore them based on what happens in the real world. A.I. systems might not have the ability to make that switch as easily as we do.
Having stuck with this book to the end (I probably wouldn't have put the work in to the early chapters to get through a paper book, but this was on audio for me) I can absolutely get my head around Elon Musk's assertion that we need more than one model of "truth" for A.I. systems, and that no one organisation should ever have the monopoly on what we define as right and wrong.
There were some fun histories and narratives in this book, but as somebody who loves playing with ideas, I enjoyed the dialectic and taxonomical theory as much as anything.
Displaying 1 - 30 of 543 reviews
