So Far: Unfriendly AI Edition
Bryan Caplan
issued the following challenge, naming Unfriendly AI as one among
several disaster scenarios he thinks are unlikely: "If you're
selectively morbid, though, I'd like to know why the nightmares that
keep you up at night are so much more compelling than the nightmares
that put you to sleep." (http://econlog.econlib.org/archives/.../morbid_thinking_1.html)
Well, in the case of Unfriendly AI, I'd ask which of the following statements Bryan Caplan denies:
1. Orthogonality thesis - intelligence can be directed toward any
compact goal; consequentialist means-end reasoning can be deployed to
find means corresponding to a free choice of end; AIs are not
automatically nice; moral internalism is false.
2. Instrumental
convergence - an AI doesn't need to specifically hate you to hurt you; a
paperclip maximizer doesn't hate you but you're made out of atoms that
it can use to make paperclips, so leaving you alive represents an
opportunity cost and a number of foregone paperclips. Similarly,
paperclip maximizers want to self-improve, to perfect material
technology, to gain control of resources, to persuade their programmers
that they're actually quite friendly, to hide their real thoughts from
their programmers via cognitive steganography or similar strategies, to
give no sign of value disalignment until they've achieved near-certainty
of victory from the moment of their first overt strike, etcetera. (A toy
sketch of this convergence on shared instrumental strategies appears just
after this list.)
3. Rapid capability gain and large capability differences - under
scenarios seeming more plausible than not, there's the possibility of
AIs gaining in capability very rapidly, achieving large absolute
differences of capability, or some mixture of the two. (We could try to
keep that possibility non-actualized by a deliberate effort, and that
effort might even be successful, but that's not the same as the avenue
not existing.)
4. 1-3 in combination imply that Unfriendly AI is
a critical Problem-to-be-solved, because AGI is not automatically nice,
by default does things we regard as harmful, and will have avenues
leading up to great intelligence and power.
If we get this far, we're already past the pool of comparisons that Bryan Caplan draws to
phenomena like industrialization. If we haven't gotten this far, I want
to know which of 1-4 Caplan thinks is false.
But there are further reasons why the above Problem might be
*difficult* to solve, as opposed to being the sort of thing you can
handle straightforwardly with a moderate effort:
A. Aligning
superhuman AI is hard for the same reason a successful rocket
launch is mostly about having the rocket *not explode*, rather than the
hard part being assembling enough fuel. The stresses, accelerations,
temperature changes, etcetera in a rocket are much more extreme than
they are in engineering a bridge, which means that the customary
practices we use to erect bridges aren't careful enough to make a rocket
not explode. Similarly, dumping the weight of superhuman intelligence
on machine learning practice will make things explode that would not
explode under merely infrahuman stressors.
B. Aligning superhuman
AI is hard for the same reason sending a space probe to Neptune is hard
- you have to get the design right the *first* time, and testing things
on Earth doesn't solve this because the Earth environment isn't quite
the same as the Neptune-transit environment, so having things work on
Earth doesn't guarantee that they'll work in transit to Neptune. You
might be able to upload a software patch after the fact, but only if the
antenna still works to receive the patch - if a critical
failure occurs, one that prevents further software updates, you can't
just run out and fix things; the probe is already too far above you and
out of your reach. Similarly, if a critical failure occurs in a
sufficiently superhuman intelligence, if the error-recovery mechanism
itself is flawed, it can prevent you from fixing it and will be out of
your reach.
C. And above all, aligning superhuman AI is hard for
similar reasons to cryptography being hard. If you do everything
*right*, the AI won't oppose you intelligently; but if something goes
wrong at any level of abstraction, there may be cognitively powerful
processes seeking out flaws and loopholes in your safety measures. When
you think a goal criterion implies something you want, you may have
failed to see where the real maximum lies. When you try to block one
behavior mode, the next result of the search may be another very similar
behavior mode that you failed to block. This means that safe practice
in this field needs to obey the same kind of mindset as appears in
cryptography, of "Don't roll your own crypto" and "Don't tell me about
the safe systems you've designed, tell me what you've broken if you want
me to respect you" and "Literally anyone can design a code they can't
break themselves, see if other people can break it" and "Nearly all
verbal arguments for why you'll be fine are wrong, try to put it in a
sufficiently crisp form that we can talk math about it" and so on. ( https://arbital.com/p/AI_safety_mindset/ )
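As a toy illustration of two failure modes from the paragraph above - missing where the real maximum lies, and blocking one behavior mode only to get a near-identical one - here is a minimal sketch. The functions intended_value and proxy_reward, and the search ranges, are all invented assumptions chosen just to make the shape of the failure visible:

```python
# Toy illustration of "you may have failed to see where the real maximum lies"
# and "blocking one behavior mode yields a very similar one". All functions
# and numbers are invented for illustration only.

def intended_value(x):
    # What the designer actually wants: best near x = 2, on roughly [0, 4].
    return -(x - 2.0) ** 2

def proxy_reward(x):
    # The written-down goal criterion: close to intended_value on the region
    # the designer tested, but it keeps growing for large positive x.
    return -(x - 2.0) ** 2 + 0.03 * x ** 3

def argmax(reward, candidates):
    # A crude stand-in for a powerful optimizer: search everything reachable.
    return max(candidates, key=reward)

# On the region the designer imagined, the proxy looks fine (optimum near 2.2).
tested = [i / 100 for i in range(0, 401)]            # x from 0.00 to 4.00
print("best x on the tested region:", argmax(proxy_reward, tested))

# A stronger optimizer searching a wider space finds the proxy's real maximum
# at the far edge of what it can reach, nothing like the intended ~2.
reachable = [i / 100 for i in range(-10000, 10001)]  # x from -100.00 to 100.00
bad = argmax(proxy_reward, reachable)
print("best x the stronger optimizer finds:", bad)                      # 100.0
print("its score under the intended criterion:", intended_value(bad))   # about -9604

# Patch the one observed failure by forbidding a small region around it; the
# search promptly returns a near-identical neighbor (just under 99).
patched = [x for x in reachable if abs(x - bad) > 1.0]
print("best x after blocking that point:", argmax(proxy_reward, patched))
```

No malice appears anywhere in the sketch; a wider search over the same innocent-looking criterion is enough, which is roughly why the "don't roll your own crypto" mindset of adversarial double-checking is the one being recommended.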
And on a meta-level:
D. These problems don't show up in qualitatively the same way when
people are pursuing their immediate incentives to get today's machine
learning systems working and today's robotic cars not running over
people. Their immediate incentives don't force them to solve the
bigger, harder long-term problems; and we've seen little abstract
awareness or eagerness to pursue those long-term problems in the absence
of those immediate incentives. We're looking at people trying to solve
a rocket-accelerating cryptographic Neptune probe and who seem to want
to do it using substantially less real caution and effort than normal
engineers apply to making a bridge stay up. Among those who say their
goal is AGI, you will search in vain for any part of their effort that
spends as much effort trying to poke holes in things and foresee what
might go wrong on a technical level, as you would find allocated to the
effort of double-checking an ordinary bridge. There's some noise about
making sure the bridge and its pot o' gold stays in the correct hands,
but none about what strength of steel is required to make the bridge not
fall down and say what does anyone else think about that being the
right quantity of steel and is corrosion a problem too.
So if we
stay on the present track and nothing else changes, then the
straightforward extrapolation is a near-lightspeed spherically expanding
front of self-replicating probes, centered on the former location of
Earth, which converts all reachable galaxies into configurations that we
would regard as being of insignificant value.
On a higher level
of generality, my reply to Bryan Caplan is that, yes, things have gone
well for humanity so far. We can quibble about the Toba eruption and
anthropics and, less quibblingly, ask what would've happened if Vasili
Arkhipov had possessed a hotter temper. But yes, in terms of surface
outcomes, Technology Has Been Good for a nice long time.
But
there has to be *some* level of causally forecasted disaster which
breaks our confidence in that surface generalization. If our telescopes
happened to show a giant asteroid heading toward Earth, we couldn't expect
the laws of gravity to change in order to preserve a surface
generalization about rising living standards. The fact that every
single year for hundreds of years has been numerically less than 2017
doesn't stop me from expecting that it'll be 2017 next year; deep
generalizations take precedence over surface generalizations. Although
it's a trivial matter by comparison, this is why we think that carbon
dioxide causally raises the temperature (carbon dioxide goes on behaving
as previously generalized) even though we've never seen our local
thermometers go that high before (carbon dioxide behavior is a deeper
generalization than observed thermometer behavior).
In the face
of 123ABCD, I don't think I believe in the surface generalization about
planetary GDP any more than I'd expect the surface generalization about
planetary GDP to change the laws of gravity to ward off an incoming
asteroid. For a lot of other people, obviously, their understanding of
the metaphorical laws of gravity governing AGIs won't feel that crisp
and shouldn't feel that crisp. Even so, 123ABCD should not be *that*
hard to understand in terms of what someone might perhaps be concerned
about, and it should be clear why some people might be legitimately
worried about a causal mechanism that seems like it should by default
have a catastrophic output, regardless of how the soon-to-be-disrupted
surface indicators have behaved over a couple of millennia previously.
2000 years is a pretty short period of time anyway on a cosmic scale,
and the fact that it was all done with human brains ought to make us
less than confident in all the trends continuing neatly past the point
of it not being all human brains. Statistical generalizations about one
barrel are allowed to stop being true when you start taking billiard
balls out of a different barrel.
But to answer Bryan Caplan's
original question, his other possibilities don't give me nightmares
because in those cases I don't have a causal model strongly indicating
that the default outcome is the destruction of everything in our future
light cone. Or to put it slightly differently, if one of Bryan Caplan's
other possibilities leads to the destruction of our future light cone, I
would have needed to learn something very surprising about immigration;
whereas if AGI *doesn't* lead to the destruction of our future
light cone, then the way people talk and act about the issue in the
future must have changed sharply from its current state, or I must have
been wrong about moral internalism being false, or the Friendly AI
problem must have been far easier than it currently looks, or the theory
of safe machine learning systems that *aren't* superhuman AGIs must
have generalized really surprisingly well to the superhuman regime, or
something else surprising must have occurred to make the galaxies live
happily ever after. I mean, it wouldn't be *extremely* surprising but I
would have needed to learn some new fact I don't currently know.