Kindle Notes & Highlights
Read between November 28, 2024 and January 6, 2025
This is also an exponential, 2^n. Since there are sixty-four squares on a chessboard, the last square would require 2^64 = 1.8446744 × 10^19 grains of wheat, which is about the number of atoms in the universe.
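A quick Python check of that arithmetic (a minimal sketch of my own; note that 2^64 - 1 is the total across all sixty-four squares, while the last square alone holds 2^63):

```python
# Wheat-on-the-chessboard arithmetic: square k holds 2**(k - 1) grains,
# so the 64th square alone holds 2**63, and the whole board 2**64 - 1.
last_square = 2 ** 63
whole_board = 2 ** 64 - 1

print(f"{last_square:.4e}")   # ~9.2234e18
print(f"{whole_board:.4e}")   # ~1.8447e19, the figure quoted above
```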
The purpose of a language model is to remove from consideration word sequences that might correspond to the sounds but make no sense. The classic problem is homophones: words with different spellings and different meanings that sound exactly the same.
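A toy illustration of that idea (my own sketch, not the book's model; the probabilities are invented): a bigram language model prefers whichever homophone sequence makes sense.

```python
# Toy bigram language model: P(word | previous word), hand-set for illustration.
# Real models estimate these probabilities from large text corpora.
bigram_prob = {
    ("ice", "cream"): 0.30,
    ("i", "scream"): 0.001,
}

def score(words, default=1e-6):
    """Probability of a word sequence under the bigram model."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob.get((prev, word), default)
    return p

# Both candidates sound identical; the language model picks the sensible one.
print(score(["ice", "cream"]) > score(["i", "scream"]))  # True
```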
You may remember in our discussion of learning for computer vision, we said that it was important to divide our research data, such as sets of labeled images, into two or three subsets, one each for training, validation, and testing.
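Such a split might look like the following (a generic sketch of my own; the 80/10/10 proportions are a common convention, not anything the book prescribes):

```python
import random

def split_dataset(examples, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle labeled examples and split them into train/validation/test sets."""
    examples = examples[:]                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(examples)
    n_train = int(len(examples) * train_frac)
    n_valid = int(len(examples) * valid_frac)
    train = examples[:n_train]
    valid = examples[n_train:n_train + n_valid]
    test  = examples[n_train + n_valid:]    # everything left over
    return train, valid, test

# 1,000 labeled images, e.g., (filename, class label) pairs.
train, valid, test = split_dataset([(f"img_{i}.png", i % 10) for i in range(1000)])
print(len(train), len(valid), len(test))   # 800 100 100
```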
A common joke at the time was that the saying “The spirit is willing, but the flesh is weak” (but expressed in Russian) was translated into English as “The vodka is good, but the meat is rotten.”
Canadian law requires that Parliament publish its proceedings in both of the country's official languages, French and English. Also, very early on, the government made the proceedings available online.
Language models have the nice feature of being practically self-evaluating. We introduced language models as a way of improving machine translation. But they are only a small part of MT, and evaluating MT is difficult. (The best way is to have each system translate the same set of sentences and simply ask a person which did the better job.)
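The standard self-evaluation metric here is perplexity: how surprised the model is, on average, by each word of held-out text. A minimal sketch of the computation (mine; the per-word probabilities are invented):

```python
import math

def perplexity(probs):
    """Perplexity of a language model on held-out text, given the
    probability it assigned to each successive word. Lower is better:
    the model found the text less surprising."""
    avg_log_prob = sum(math.log(p) for p in probs) / len(probs)
    return math.exp(-avg_log_prob)

# Probabilities a model assigned to the four words of a test sentence.
print(perplexity([0.2, 0.1, 0.05, 0.3]))  # ~7.6
```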
While I found grammar useless, I have spent much of my academic career trying to teach it to computers. The real purpose of grammar is not to improve your writing but, rather, to help construct the meaning of sentences from the meanings of their parts.
“All grammars leak,” wrote the linguist Edward Sapir, by which he meant that it is not possible to write a complete grammar for a natural language.
Perhaps even worse, as you include more rules in the grammar, sentences that previously were unambiguous suddenly acquire extra structures that make little or no sense.
The solution came from an unexpected direction. In 1994, a group at the University of Pennsylvania led by Mitchell Marcus published the Penn Treebank, a collection of one million words of text from the Wall Street Journal together with the structure of the sentences in the form of syntactic trees [61]. From the treebank, it is possible to read off all of the context-free rules necessary to assign the correct tree to every sentence therein. This pretty much solved the grammar-leakage problem.
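“Reading off” the rules is mechanical: every internal node of a syntactic tree contributes one context-free rule. A minimal sketch of my own (using nested tuples rather than the Treebank's actual bracketed file format):

```python
def rules(tree, out=None):
    """Collect context-free rules (parent -> children) from a syntax tree.
    A tree is (label, child, child, ...); a leaf is a plain word string."""
    if out is None:
        out = []
    label, *children = tree
    rhs = [c if isinstance(c, str) else c[0] for c in children]
    out.append((label, tuple(rhs)))
    for c in children:
        if not isinstance(c, str):
            rules(c, out)
    return out

# The tree for "the market fell": (S (NP (DT the) (NN market)) (VP (VBD fell)))
tree = ("S", ("NP", ("DT", "the"), ("NN", "market")), ("VP", ("VBD", "fell")))
for lhs, rhs in rules(tree):
    print(lhs, "->", " ".join(rhs))
# S -> NP VP
# NP -> DT NN
# DT -> the   ... and so on
```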
Vacuum tubes and transistors can act as switches, and it was a big deal when transistors replaced vacuum tubes in both radios and computers.
It may be possible to buy an individual transistor these days, but don't look for them inside your laptop or cell phone; they are there, but invisible, because they have been miniaturized so that millions can fit on a single chip.
In the early days of the field, AI researchers used much the same computer hardware as everyone else.
“Matrix” is just a fancy word for a two-dimensional “array” or table.
So, any machine designed for computer gaming has a graphics processing unit, or GPU, besides its standard CPU. You can think of a GPU as 1,000 slow CPUs, each, say, one-tenth the speed, but because there are so many of them, when they are set to a task for which they are designed (matrix operations are the most common), they are 100 times faster than a CPU.
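The arithmetic behind that claim, spelled out (the numbers simply restate the book's; nothing here is measured):

```python
cpu_speed = 1.0          # one CPU core, normalized
gpu_core_speed = 0.1     # each GPU core runs at one-tenth CPU speed
gpu_cores = 1000         # but there are about 1,000 of them

# When the work parallelizes perfectly (as matrix operations do),
# throughput is number of cores times per-core speed.
speedup = gpu_cores * gpu_core_speed / cpu_speed
print(speedup)  # 100.0, hence "100 times faster"
```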
Since then, AI has become a booming business, with a multicompany race to build the next generation of still faster AI hardware. The basic idea is to make the hardware increasingly specialized to AI tasks. One we have seen is convolution. Another is a recent scheme called transformers, to be discussed in chapter 10. Since these processors are specialized for AI and not graphics, the new term for this kind of processing unit is AI accelerator. Accelerators typically have the equivalent of several GPU processors.
After all, visual information comes as light intensities, which are just numbers, and numbers are at the center of how perceptrons, neural networks, deep learning, and so on all work.
There is a famous saying in linguistics, “You shall know a word by the company it keeps,” the idea being that similar words have similar neighbors [108].
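That slogan cashes out computationally: represent each word by counts of the words it appears near, and compare those count vectors. A toy sketch (my own; the corpus counts are invented):

```python
import math

# Toy co-occurrence vectors: how often each word appears near the
# context words "drink", "meow", and "pet" in some (invented) corpus.
neighbors = {
    "cat":    [1, 9, 8],
    "kitten": [0, 7, 9],
    "coffee": [9, 0, 1],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(neighbors["cat"], neighbors["kitten"]))  # ~0.98: similar company
print(cosine(neighbors["cat"], neighbors["coffee"]))  # ~0.16: different company
```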
In 2016, Google started replacing its statistical MT system, which was based on the technology described in chapter 7, with neural network methods. (I remember the chatter on the web as people noticed their translations improving from one week to the next.)
As noted earlier, Go is a two-player game of perfect information. It is also a board game, though crucially, the board has more positions: 19 × 19 = 361 points, compared with 8 × 8 = 64 squares for chess.
Go has been played in China for about 2,500 years—it makes Western civilization look very young indeed.
That the key to DeepMind’s success was its use of deep learning [97] had a major impact on the field, your author most definitely included.
Instead, AI researchers, not just at MIT but also at Stanford and CMU, pursued planning or problem solving.
Contrast this with reinforcement learning. In RL, the program is given a reward function, not a goal.
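The distinction shows up directly in code: the agent below is never told a goal state, only a number after each action, and learns which action tends to pay off. A schematic two-armed-bandit sketch of my own, not any particular system from the book:

```python
import random

# Reinforcement learning in miniature: the agent is never handed a goal,
# only a numeric reward after each action, and must learn what pays off.
def pull(arm):
    """The environment: arm 0 pays 1 with probability 0.3, arm 1 with 0.7."""
    return 1.0 if random.random() < (0.3, 0.7)[arm] else 0.0

value = [0.0, 0.0]   # running estimate of each arm's average reward
alpha = 0.05         # learning rate
for _ in range(5000):
    arm = random.randrange(2)                     # explore both actions uniformly
    value[arm] += alpha * (pull(arm) - value[arm])  # update toward observed reward

print(value)  # roughly [0.3, 0.7]: the reward structure, learned from experience
```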
In 2017, in an attempt to improve machine translation, a group at Google published a new NN model for MT. They called it the transformer model, and the paper was titled “Attention Is All You Need” [104]; see figure 10.1. While transformers were initially created to improve machine translation, they were quickly adapted to language modeling, and that is where they have had the largest impact. Thus, we present the simplified version that was adopted early on by the group at OpenAI for LMs.
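The heart of that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V. Here is a minimal NumPy reduction of that published formula (mine; it omits the multiple heads and all the surrounding layers):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position mixes the values V,
    weighted by how well its query matches every position's key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 word positions, vectors of dimension 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)  # (5, 8): one mixed vector per position
```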
GPT stands for generative pretrained transformer; a GPT is a transformer-based language model.
When you use a language model to generate text, you first compute the probabilities of all possible next word pieces.
Another possibility is to choose a word piece according to the distribution proposed by the LM.
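Concretely, the two generation strategies just described look like this (a toy distribution of my own; real models spread probability over tens of thousands of word pieces):

```python
import random

# Toy next-word-piece distribution from a language model, given some prefix.
dist = {"blue": 0.5, "cloud": 0.3, "ly": 0.15, "zebra": 0.05}

# Option 1: always take the most probable piece (greedy decoding).
greedy = max(dist, key=dist.get)                        # "blue"

# Option 2: sample a piece in proportion to its probability, as described
# above, so "cloud" comes up about 30 percent of the time.
sampled = random.choices(list(dist), weights=list(dist.values()))[0]

print(greedy, sampled)
```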
In the paper announcing GPT-2, the OpenAI group makes the case that language models are by their very nature multitask programs; there is no need to design separate programs, one per task.
Indeed, two subsequent models, one from OpenAI, GPT-3 with 175 billion parameters in 2020 [10], and one from Google, PaLM with 540 billion (in 2022) [16], have racked up the best scores yet.
It’s unclear how we’d distinguish “real understanding” from “fake understanding.” Until such time as we can make such a distinction, we should probably just retire the idea of “fake understanding.”
The idea is that since my laptop does not have an inner life, neither does LaMDA. The philosopher David Chalmers has been a proponent of this idea [13].
Unlike in thriller movies with science backgrounds, it is very unusual for any small group of scientists to be so far ahead of the rest of the field that others cannot replicate their work if they really care to, and the protein structure prediction problem is important, so the others really cared to! Knowing that something can be done is half the battle, and other groups' results started improving rather quickly.
The start of 2023 saw more publicity and controversy about AI than all the previous sixty-five years combined.
Suddenly, it became apparent that this was no longer your grandmother’s AI.
Explaining how ChatGPT differs from GPT-3 is complicated. ChatGPT starts with GPT-3 as its base and, thus, can be correctly described as an LLM. However, it moves beyond that description in that it also uses reinforcement learning (see section 9.3).
GPT-4 followed a few months later. Besides being much more accurate than ChatGPT, it has one significant addition—it can accept images as part of the prompt.
One final note: besides writing, another occupation that requires putting down one symbol after another is programming, and both ChatGPT and GPT-4 have been trained on not just language texts but also computer programs. GPT-4, in particular, is pretty good. I do not think programmers are in danger of mass layoffs, but it will change the nature of the job. I know it is already changing college-level courses on the topic.
Most AI researchers disagree with me, but in my estimation, AI has little to show for its first fifty years.
Furthermore, large language models are impressive, but their abilities have large gaps. Common-sense reasoning is a big one, so some other ingredients are still required.
To add fuel to the fire, it asks that, if this is not agreed to voluntarily, “governments should step in.” This is scary stuff—not AI, but the idea that in the twenty-first century there are calls to put engineering and scientific researchers in jail.
AI agents are going to be built, not evolved, so unless their builders go to a lot of trouble, AIs will have no such compulsions.
As for me, I tend to side with the cognitive scientist Steven Pinker, who said about the AI apocalypse in general: “It depends on the hypothesis that humans are so gifted that they can design an omniscient and omnipotent AI, yet so moronic that they would give it control of the universe without testing how it works” [79]. To which I would only add, “or including an off switch.”
The physical aspects of robotics are hard—in my estimation, much harder than the rest of AI, and to a large degree separate. That is my reason for not including robotics in this history.
Large language models in particular are telling us something profound about the nature of understanding in both computers and people, even if we don’t understand yet what it is telling us.

