The model learns to associate words with visual concepts. They then start with a randomized background image that looks like old-fashioned television static, and use a process called diffusion to turn random noise into a clear image by gradually refining it over multiple steps. Each step removes a bit more noise based on the text description, until a realistic image emerges. Once trained, diffusion models can take just a text prompt and generate a unique image matching that description. Unlike language models that produce text, diffusion models specialize in visual outputs, inventing pictures
...more