Smaller models do relatively well when the task is narrow, like using historical data to predict temperatures. But language is fundamentally different. Because the number of ways to start a sentence is essentially infinite, even a transformer trained on hundreds of billions of tokens of text can't simply recall memorized quotes to complete a prompt verbatim. Instead, with many billions of parameters, it processes the words in the prompt at the level of associative meaning and then uses the available context to piece together a completion that has never appeared anywhere before.
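
You can see this for yourself with any publicly available language model. As a minimal sketch (assuming the Hugging Face `transformers` library and the public GPT-2 checkpoint; the prompt below is an invented example), feeding the model an unusual prompt still yields a fluent continuation, even though that exact sentence almost certainly never appeared in its training data:

```python
from transformers import pipeline

# Load a small, publicly available text-generation model.
generator = pipeline("text-generation", model="gpt2")

# A deliberately odd prompt, unlikely to exist word-for-word anywhere.
prompt = "The violinist apologized to the toaster because"

# Sample a short continuation rather than retrieving a memorized quote.
result = generator(prompt, max_new_tokens=20, do_sample=True, top_p=0.9)

print(result[0]["generated_text"])
```

The point is not that the output is profound, but that the model composes it on the fly from the associations it has learned, rather than looking up a stored completion.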

