A History of Image-Generation AI — A Chronicle of "It's Not There Yet"

Posted in December 2025 — adapted and polished from a series I ran on my Instagram stories.

These days there are an overwhelming number of AI tools for creators. So many, in fact, that it’s dizzying. Having watched people react to these tools up close, I’ve found that responses to AI technology tend to fall into roughly three camps.

The first camp adopts the technology quickly and makes it their own. The second waves it off with a “meh, it’s not there yet” and refuses to take it seriously. And the last is somewhere in the middle, without any strong opinion either way.

Of these, I think the people in the second camp can actually be more dangerous than the people in the third. Using a range of AI services and getting a realistic sense of what doesn’t work yet is great, and important. Someone who has actually tried the tools is better off than someone who hasn’t. But concluding that because something doesn’t work right now, it has no future value either — that’s where the problem lies.

Now, you might assume I think this way because I’m boundlessly optimistic about the future of AI. Wrong. If anything, the opposite is true. I studied AI for quite a while, going all the way through a PhD program in graduate school, watching the field develop the whole time. And the more I studied AI, the more I kept thinking the same thing:

Is there really anything left to improve here?

2013 — VAE: “What’s this even good for?”

In 2013, Auto-Encoding Variational Bayes (VAE) appeared. The images AI produced back then barely qualified as images. They were tiny (it could just about generate 28×28 grayscale images), blurry, and smeared — at best it could mimic MNIST, those black-and-white handwritten digits, or conjure up some wobbly suggestion of a face.

Back then people said, “Elegant in theory, sure — but what’s this even good for?” They saw it purely as a math problem, not as a tool for creation. At that point I hadn’t even started university yet.

MNIST digits reconstructed by a VAE — A VAE’s reconstruction of MNIST handwritten digits — blurry and smeared. Source: mdhabibi/DeepLearning-VAE · Paper: Auto-Encoding Variational Bayes (Kingma & Welling, 2013)

2015 — DCGAN

Two years later, in 2015, came DCGAN (Deep Convolutional GAN), which brought GAN-style training to convolutional networks.

As the GAN approach stabilized, the images got a little better. But the quality was still horrendous. Training was so unstable that one wrong move and the model would spit out nothing but noise, and training on large images was out of the question.

A face generated by DCGAN. Source: carpedm20/DCGAN-tensorflow · Paper: Radford et al., 2015

2018 — PGGAN, and then StyleGAN

Three years on, in 2018, GANs hit their golden age. The Progressive Growing of GANs (PGGAN) paper proposed training by starting at low resolution and progressively growing the image to high resolution. This approach could produce high-quality 1024-pixel images. The results were fairly impressive — but still not at a level you could actually use for anything.

A face generated by PGGAN — PGGAN. Source: tkarras/progressive_growing_of_gans · Paper: Karras et al., 2017

Then came a paper out of NVIDIA titled A Style-Based Generator Architecture for Generative Adversarial Networks — the arrival of StyleGAN. The authors proposed a way to build a better latent space and a way to inject style at each layer, which improved image quality by leaps and bounds. This technology could render the face of a person who doesn’t exist, down to the finest peach fuzz. The FFHQ dataset used for training at the time consisted of 50,000 face images — and by the standards of the day, that was a staggering number. For comparison, a recent image dataset like LAION-5B contains five billion images.

Faces generated by StyleGAN — StyleGAN. Source: NVlabs/stylegan · Paper: Karras et al., 2018

Even so, people’s reaction at the time went something like this:

“Wow, the faces actually look real. But that’s about it, right? The backgrounds are mush, and it can’t handle a complex scene with multiple objects at all. It’s basically a passport-photo generator.”

And they were right. StyleGAN trained well on FFHQ, where the data is neatly aligned, but generating images from unaligned data was still really hard, and the quality of those images wasn’t great. To anyone’s eye, they looked uncanny and fake.

2020 — VQGAN

In 2020, the mood started to shift. A paper called Taming Transformers for High-Resolution Image Synthesis (VQGAN) appeared. It brought Vector Quantization and the transformer architecture into image generation, and that turned out to be a wonderful idea. The model produced pretty decent images. And yet, still, the thought was: “Cool — but where would I even use this?”

Images generated by VQGAN — VQGAN. Source: CompVis/taming-transformers · Paper: Esser et al., 2020

Around this time I played around with VQGAN myself. It was just a toy for making fun little images.

Enter Diffusion — DDPM

While GANs were stalling, a paper inspired by physics came out: Denoising Diffusion Probabilistic Models (DDPM). This was the arrival of what we now call diffusion models. By building up an image through the gradual removal of noise, this approach was far more stable than GANs and produced far more varied images.

But still — it only worked on small images, and the generation quality wasn’t good. What on earth are you supposed to do with a crummy 256×256 image?

Samples generated by DDPM (CelebA-HQ faces, LSUN churches and bedrooms, CIFAR-10). Source: hojonathanho/diffusion · Paper: Ho et al., 2020

2021 — GLIDE and DALL·E

In 2021, papers like GLIDE started to appear, proposing ways to guide diffusion models — which until then had generated images at random — using CFG. Around the same time, OpenAI and other big-tech companies launched services like DALL·E. Ask one of these services to draw an “avocado-shaped chair,” and it would draw it for you.

An 'avocado armchair' generated by DALL·E — The “avocado armchair” that became DALL·E’s signature image. Source: OpenAI / MIT Technology Review · DALL·E paper: Ramesh et al., 2021 · GLIDE paper: Nichol et al., 2021

It was neat, sure, but the resolution was still way too low and everything looked cartoonish. You couldn’t control the details, either. In a word, it was just a fun toy. Finding something like creativity — long considered the exclusive domain of humans — in an AI model was a little startling, but the quality was still less “artwork” and more “didn’t even rise to the level of an internet meme.”

2021 — Stable Diffusion

And then came High-Resolution Image Synthesis with Latent Diffusion Models (LDM, 2021) — the Stable Diffusion we all know. From this point on, AI began to understand text, draw complex high-resolution pictures, and churn out fairly convincing results. Bit by bit, some people started to get scared.

Images generated by Stable Diffusion — Stable Diffusion (LDM). Source: CompVis/stable-diffusion · Paper: Rombach et al., 2021

Even so, some people reacted like this: “Impressive. But look, there are six fingers. I can spot it at a glance. And it can’t write text either. Detail is still where humans win. Plus it’s too heavy to run on my computer.”

2022 — DiT, onto the Transformer

In 2022, Scalable Diffusion Models with Transformers (DiT) arrived. Now diffusion models ditched the old UNet structure and started running on top of the transformer architecture. With papers like Video Diffusion Models, diffusion began to make its way into video and audio generation too. It was around this point that creators themselves started to take an interest in AI.

Images generated by DiT — DiT. Source: facebookresearch/DiT · Paper: Peebles & Xie, 2022

2024 — MMDiT, Handling Text and Image Separately

If DiT brought the transformer into diffusion, what came next was MMDiT (Multimodal Diffusion Transformer). This is the architecture adopted by Stable Diffusion 3: rather than lumping text and image together, it processes each with its own separate set of weights and then lets them meet in attention. On top of that, it adds a straighter training method called Rectified Flow.

The result? Things that had stubbornly refused to work started working. Fingers came out fine, and text in the images is written reasonably legibly now. Most of today’s hot models, FLUX among them, belong to this family. That weakness people mocked just a few years ago — “it can’t even write text” — has been patched over, just like that.

An 'astronaut riding a horse' generated by Stable Diffusion 3.5 (MMDiT) — A generation example from Stable Diffusion 3.5 (MMDiT). Source: Wikimedia Commons · Paper: Esser et al., 2024 (SD3)

And the Present

Time passed, and here we are at the present. The shelf life of the phrase “it’s not there yet” — from the blurry grayscale digits of 2013 to Nano Banana in 2025. It took a little over ten years.

Why “It’s Not There Yet” Is Dangerous

At every single stage, people said “it’s not there yet,” “this doesn’t work.” I said it too. “What more is there to come up with here?” “Is there even any room left for the technology to improve?” At the time, those looked like fairly accurate judgments. But not anymore.

This is exactly why the second camp — the people who dismiss the technology — are dangerous. They mistake the current limitations for the technology’s inherent limitations. The moment you conclude “AI can’t draw hands” and tune out, it takes time to catch up once the technology has moved on to the next stage. By the time you look back, the gap will already have widened enormously. If you have no need to catch up, then fine, don’t — but how many people are actually in that position?

What we ought to be doing right now isn’t sneering at what doesn’t work, but thinking hard about what would change if the thing that doesn’t work today gets solved in six months, a year, ten years. Knowing what doesn’t work is very important. But fixating only on what doesn’t work and retreating back into your old, comfortable, familiar way of doing things — that is extremely dangerous.

What Has Been Stacked Up

In an interview, OpenAI co-founder Ilya Sutskever said that the difference between the capital being poured in back when we were making black-and-white images and the capital being poured in now is beyond imagination. Today’s situation didn’t just appear out of nowhere in a sudden poof!

As we’ve seen, everyone did the very best they could with what they had at the time, and that got stacked up, layer upon layer, until we arrived here.

There were countless skeptics. I was one of them too.