The State of AI Image Generation 03/2024

A robot is painting a scene on a canvas

A few months ago, I launched my new business, which ushered me into a highly creative period of my life – a period that continues as I write these lines. I have been working on presentations, webinars, and courses, all of which shared a common need:

A need for an abundance of images and visuals – images, backgrounds, thumbnails, and more.

If this period had occurred a year ago, I would likely have scoured various free image databases online, attempting to find the perfect image for each scenario I was working on. From firsthand experience, I can attest that this method is inefficient. I did precisely that a few years ago when I posted on my newsletter, and it was a tedious process. It was challenging to find the exact image that resonated with the scenario or the topic of my posts.

But then, GenAI entered the realm of image generation. Yes, it existed before, but it wasn’t as accessible or mature as it is today.

This time, I decided to take a different approach. Since I have a subscription with OpenAI, I went all-in on GenAI for image generation, producing most of the visuals I needed using this technology.

As a result, I’ve spent quality time with ChatGPT over the last couple of months, but I also decided to test some other providers to ensure I wasn’t missing out on anything.

In this post, I will share key highlights from my findings, comparing the various services, their strengths, and their limitations.

Although I have a technical background and have managed data science teams in the past, which makes me familiar with the engineering concepts behind these technologies, I will focus on my experience as a regular user of these services. This post is intended for readers with an ‘average’ technical background or above who are curious about the current state of these technologies (so if you are looking for a deep engineering debate and model comparisons, this is not the post for you).

Services Overview

OpenAI

As I noted, I have a subscription with OpenAI, which gives me access to GPT-4 and Dall-E 3. Dall-E is the engine producing the images, while requests are made through the ChatGPT interface. GPT-4 translates these requests into prompts for Dall-E, which then generates the best images it can based on GPT-4’s input.

It’s a very interesting concept, where a human ‘feeds’ a machine with an input, and that machine, in turn, feeds a second machine with an altered and more precise input in order to optimize the chances of producing the exact image the human (user) wanted.
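
The two-step flow described above can be sketched roughly as follows. This is only an illustration of the idea, not OpenAI's actual implementation: `rewrite_prompt` and `render_image` are hypothetical stand-ins for the GPT-4 step and the Dall-E step, and the "enrichment" is just a hard-coded string.

```python
from dataclasses import dataclass


@dataclass
class GeneratedImage:
    """Placeholder for an image; a real service would return pixels or a URL."""
    prompt_used: str


def rewrite_prompt(user_request: str) -> str:
    """Hypothetical stand-in for the language model (the GPT-4 step).

    A real model would expand the request with style and composition
    details; here we only append a fixed hint.
    """
    return f"{user_request.strip()}, detailed illustration, clear composition"


def render_image(detailed_prompt: str) -> GeneratedImage:
    """Hypothetical stand-in for the image model (the Dall-E step)."""
    return GeneratedImage(prompt_used=detailed_prompt)


def generate(user_request: str) -> GeneratedImage:
    """Human -> first machine -> second machine."""
    detailed = rewrite_prompt(user_request)
    return render_image(detailed)


image = generate("An old Chinese sage in white robes bowing in gratitude")
print(image.prompt_used)
```

The point of the sketch is that the image model never sees the user's own words, only the rewritten prompt.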

We’ll soon see how this has worked out.

MidJourney

The second service I wanted to test was MidJourney, as I had heard good things about it. To give you a spoiler, though – as its name suggests – my journey with MidJourney ended prematurely, because the guys behind this service did practically everything they could to keep me from trying it.

For those interested in the details, I’ve included my experiences in the appendix at the end of this post.

Let’s move on.

Stability.ai

I read a nice review on Stability.ai so I decided to try it using their DreamStudio interface.

It provides enough free credits to test the technology, and the pricing once you exhaust them is very decent ($10 buys enough credits to produce thousands of additional images).

Muse AI

I stumbled upon Muse AI, possibly while searching for MidJourney. Their interface is user-friendly, but they provide only 10 free daily credits for testing. Thus, I used Muse AI mainly when other services did not meet my needs.

There are other services out there, but I’ll focus on these for this post.

What I’ve tried and what I got in return

By default, I used ChatGPT (hereafter referred to as ‘GPT’) as my starting point, occasionally trying other engines. Let’s review some scenarios.

One important comment though – all images were produced in high resolution by the designated service. I intentionally reduced their resolution and size for this article (so the page will load fast enough). 

The Chinese sage

I wanted to experiment with a new theme for my site featuring a Chinese sage character who would serve as a guide in several parts of my site. I opened with something simple:

“Create an image of an old Chinese sage in white robes who is bowing in gratitude”

GPT:

Chinese sage bowing in thank you gesture

That is a decent image that matched what I asked for. Still, I wanted to see what the other services would deliver, so I gave them the same prompt.

DreamStudio:

(Produced 4 images)

Muse AI:

I think that for this fairly straightforward request, Muse AI delivered the best result on the first attempt. What do you think?

Anyway – I then ran into a problem I had already hit about 6 months ago, when I was working on a (fantasy) book and wanted to give my characters a visual face: I just couldn’t get consistency across the generated images. I iterated with GPT until I got the character I wanted, but when I then asked for the same character in a different posture, I just couldn’t make it work.

Back to my Chinese sage – since I wanted this dude to be a character I’d use as a theme on my site, I needed him to look the same. But again, GPT failed to reproduce the exact same face, or even body.

Here we can see other variations of the supposedly same sage. The two images below resemble each other quite closely but are quite different from the original sage. And even those took me several iterations to get.

Welcoming sage

The two below indeed resemble the same person but have nothing to do with the previous sages. It’s safe to say that they are not even Chinese.

Now, even though I managed to get some characters to look the same after several iterations, when I came back a day later and the chat had been cleared, there was no way to reproduce them at all. I’ll summarize my conclusions on this limitation towards the end, so for now, just remember this and let’s move on to the next thing.

‘Stick to the plan!’

I was working on my webinar and wanted an image of a military officer shouting “Stick to the plan!” Here is my prompt:

“A military officer is pointing and shouting ‘stick to the plan’”

GPT

Bingo – exactly what I had in mind. Let’s see how the others were doing…

DreamStudio

Muse AI

Nope. Not good enough. Pointing? Yes… but towards where? And where is the bubble (or any other visual aid) that shows what he says? 

Moving on.

The treasure chest of data

I was working on a slide discussing data-driven processes and needed an image that shows the data as the real treasure. Here is the prompt I used:

“create an image for a presentation where a pirate is picking numbers and mathematical signs from a treasure chest in his both hands. Meaning – we’re replacing the classical image of a pirate being ecstatic when he collects with his both hands gold coins from a treasure chest with the same image, but instead of gold coins, the pirate holds numbers and mathematical signs”

GPT

DreamStudio

Muse AI

Ok… what do you think? 

While none of those is awesome, in my humble opinion, GPT won again because it nailed down both the emotion I was trying to capture and the treasure chest of data. I think it could have fine-tuned the numbers and mathematical signs and made it all clearer, but we’ll soon see that when it comes to typography – ‘Houston, we have a problem’ across the board. 

DreamStudio was quite disappointing as none of the images captured the message I was trying to convey. Muse AI is the most visually appealing one, but it totally failed to capture the emotion, and the chest… well, the lid of the chest is just floating in the air. What can I say…

In none of the images, by the way, is the pirate holding the numbers and signs in both hands as I asked.

The clueless product manager

Again, for my webinar about the top 5 mistakes when writing PRDs, I asked my new friends to create an image of a clueless product manager. Here is the prompt:

“Create an image of product manager shrugging as if he or she were asked why the feature is important but they have no idea why”

GPT

DreamStudio

Muse AI

To me, the winner is obvious – GPT nailed it again. It captured exactly the impression and gesture I was aiming for. Muse AI is ‘ok minus’, and DreamStudio is just bad.

Measuring success

This time, I wanted an image that conveys the message that we need to measure success. I thought it should be quite straightforward. Man… I was wrong…

Here is my prompt:

“An engineer is measuring the word ‘success’ with a measuring tape”

GPT

The first attempt by GPT was overall good, but the word ‘Success’ was misspelled. Here is my follow-up conversation with GPT:

Nope dude. You didn’t. It’s still misspelled. Saving you several iterations, eventually, it got it right:

DreamStudio

Muse AI

Aside from the fact that GPT is the clear winner again because the other services didn’t bother to draw the engineer, I think we’re on to something here: there seems to be a clear limitation of GenAI image generation tools when it comes to rendering words. All services misspelled the word ‘success’, and it took me several iterations until GPT nailed it, though given what I’ve seen, that looks like pure luck. We’ll talk about it more when I summarize.

Creating your own opportunities

We’ll wrap it up with an image I was trying to produce for a post about creating your own opportunities and thinking out of the box. Here is my original prompt:

“A square image in which there is a builder who constructs the word ‘opportunity’”

GPT

Nope. Let’s try again. Second attempt:

Still misspelled. Let’s try again:

Ok… clearly you’re not looking at what your buddy, Dall-E, is sending back from its studio…

Eventually I gave up. I went with a different prompt:

“Create an image of a woman who is shuttering a box she’s trapped in. It should deliver the message of ‘thinking outside of the box’”

It took several attempts, but eventually I got something that I really liked:

Let’s see how the others were doing with my original prompt:

DreamStudio

Note that none of these images got the word spelled correctly.

Muse AI

Aside from the fact that this engineer is not constructing anything and the word is just written floating in the air – it is, again, misspelled.

Summary

It was really fun spending time with all of these tools. I learned a lot. Here is my summary:

  1. Of the 3 services I’ve tested, GPT is the clear leader. It properly captures the emotion, the action and the overall message. It’s far from perfect, of course, as I often had to ask it to regenerate or take a different approach with my prompt, but still – it delivered the best overall results, by a clear margin over its competitors.
  2. In my opinion, two things make GPT great. First, superior technology (which makes sense given the huge funding and resources OpenAI has). Second, the mediation GPT-4 performs by ‘feeding’ Dall-E an input it can properly digest turns out to be a winning approach. I guess the engineers at OpenAI observed that users don’t know how to speak Dall-E’s language directly and worked around that, very cleverly, I must say.
  3. A very clear technology limitation that exists across the board – the challenge to render letters and symbols. It amazes me how these engines can render such realistic characters, and yet fail on writing words. I guess the engines are not optimized towards this, but I also guess that this is just a matter of time. I’m pretty sure that if I do a follow up on this post a year from now – this problem won’t exist. Let’s see.
  4. A related issue, following from the previous bullet, is that these services can’t properly assess the images they produce. GPT confidently declared that the word was now properly spelled (time after time) when in fact it wasn’t. This is another limitation that I guess time will fix. After all, GPT-4 should be able to analyze images, but it seems it’s not instructed to carefully examine the images Dall-E produces. (By the way – a really hilarious TikTok on the matter, which I accidentally stumbled upon yesterday, can be viewed here)
  5. Last – the inability to maintain true consistency when creating characters is a real problem for anyone who needs a series of images with the same characters. I think this one will take some time to solve. I did read on the OpenAI forums that users were able to generate a few consistent images when they reused the ‘seed’ of an image. The ‘seed’ is the number that initializes the random number generator behind each image; because a fresh seed is normally used every time, each generated image comes out unique and different. It means that, by design, consistency is not maintained. It also means that fixing this will require some redesign of the algorithm, which is why I think it’s gonna take time. Anyway – even if you wanted to take the seed approach, you can’t anymore. I asked ChatGPT about it and it told me the seed is no longer available to users… so tough luck.
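
To make the ‘seed’ idea from the last bullet a bit more concrete, here is a minimal sketch using Python's standard `random` module. It is not how Dall-E or any of the other services work internally; `fake_noise` is a hypothetical stand-in for the random starting point an image model begins from. It only shows the general property: the same seed reproduces the same random numbers, and a different seed gives different ones.

```python
import random


def fake_noise(seed: int, size: int = 5) -> list[float]:
    """Return the pseudo-random numbers produced by a generator seeded with `seed`.

    Hypothetical stand-in for the random starting noise of an image model.
    """
    rng = random.Random(seed)  # independent generator, initialized by the seed
    return [round(rng.random(), 3) for _ in range(size)]


same_a = fake_noise(seed=42)
same_b = fake_noise(seed=42)
other = fake_noise(seed=43)

print(same_a == same_b)  # same seed, same numbers
print(same_a == other)   # different seed, different numbers
```

In a real service the seed is only one ingredient (the prompt and the model version matter too), so reusing it is probably a help for consistency rather than a guarantee.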

So, at the end of the day – what have I learned? I’ve learned that for $20 a month I got access to enormous technological power. I believe I’ve abused OpenAI machines this month so much that they definitely lost money on me.

Seriously though – the new possibilities image generation provides us at such a reasonable price are a bit scary, to be honest. Yes, there are limitations, but I don’t see any real reason why these limitations won’t be completely destroyed within the next couple of years (probably much sooner).

That’s it!

I hope you enjoyed this post. 

If you did, share it with your friends and even write back to me so I’ll know. Of course – I encourage you to reach out to me with any feedback, and also with ideas for future posts.

Here are the usual links below, and the promised appendix.

The archive of all of my posts is here.

If you think your friends/peers can enjoy this newsletter as well – invite them to subscribe on this page.

Appendix – My experience with Midjourney

Here is my (short) journey with MidJourney.

The first challenge starts on Google’s homepage when you search for ‘midjourney’.

There are plenty of sponsored results named ‘MidJourney’, each leading to a different service, and none of these sponsored results actually leads to the real service. Disguising yourself as a different service is a very nasty technique, and I’m not sure why Google allows it.

But anyway – if you skip all these sponsored links you get to the real thing (“www.midjourney.com”) and then you understand why it’s so hard to find.

You see, on their homepage these guys don’t discuss any image generation technology or service. In fact – this is their opening text:

“Midjourney is an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species.

We are a small self-funded team focused on design, human infrastructure, and AI. We have 11 full-time staff and an incredible set of advisors.”

To be honest – on my first two attempts to find these guys – when I landed on their site I immediately hit ‘back’ and went back to the search results. I told myself there is no way this is the site I’m looking for.

It was only after I noticed that everyone else in the search results was a fraud, and did some reading on the web, that I became convinced that yes – those are my guys.

To make a long story short – this team is really trying to make you not test their service. 

Aside from being hard to find – you can only try the service on their Discord server. Now, I’m quite familiar with Discord since I am a gamer and I also established the EPM community there (link at the end). Still, it’s a super weird choice that just raises the barrier to the service, as many people are NOT familiar with Discord.

Now, once you join their server, you are engaged with their Discord bot which asks you to agree to their terms of service before moving on. Fine… I approve.

Once you do that, it asks you to choose a package. And guess what – there is no free trial or free package. You actually have to spend money just to try it.

At this point I gave up.

Again, I did read some posts on the web that claimed their technology is decent. Still, when it comes to productization, pricing and marketing of their service, these guys get -10 on a scale of 0 to 10.

Midjourney guys – if, by any chance, you are reading this – please take the money you’ve earned from your non-free tiers and hire a decent frontend developer. I’ll help you with the productization pro-bono… 

Seriously guys… this has been one of the lamest journeys I’ve been through to get to use a product, and I’ve tried hundreds of products in my life…
