learn from carefully hand-labeled data. Quite often, the quality of the AI’s predictions depends on the quality of the labels in the training data. However, a key ingredient of the LLM revolution is that, for the first time, very large models could be trained directly on raw, messy, real-world data, without the need for carefully curated, human-labeled data sets. As a result, almost all textual data on the web became useful: the more, the better. Today’s LLMs are trained on trillions of words. Imagine digesting Wikipedia wholesale, consuming all the subtitles and comments on YouTube, reading

