Just like LLMs, these tools had been in development for years, though only recently did the technology allow them to become truly useful. Rather than learning from text alone, these models are trained on vast collections of images paired with captions describing what's in each picture, so the model learns to associate words with visual concepts. To generate a picture, the model then starts with a randomized image that looks like old-fashioned television static and uses a process called diffusion to turn that random noise into a clear image, refining it gradually over many steps.
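The refinement loop can be sketched in a few lines. This is a toy illustration, not a real diffusion model: the `predict_denoised` function stands in for the trained neural network (which would be conditioned on the text prompt), and here it simply returns a fixed target image so the shape of the loop is visible.

```python
import numpy as np

# Hypothetical stand-in for a trained denoiser. A real diffusion model
# uses a neural network guided by the caption-trained associations;
# here it just returns a known "clean" image for illustration.
def predict_denoised(noisy_image, target):
    return target

rng = np.random.default_rng(0)
target = rng.random((8, 8))          # the picture the "model" is aiming for
image = rng.standard_normal((8, 8))  # start from TV-static-like noise

steps = 50
for t in range(steps):
    estimate = predict_denoised(image, target)
    # Blend a small amount of the estimate into the current image,
    # so noise is gradually refined into a clear picture.
    image = image + (estimate - image) / (steps - t)
```

By the final step the blend weight reaches 1, so the static has been fully replaced by the denoiser's estimate of the image.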