
Midjourney Describe: generate descriptions for your images

Image captioning. DALL-E in Microsoft Edge. An LLM for commercial use.

We have already seen many good models for image generation. Typically, they take a text prompt as input and provide an image as output. But what if you have an image and want a description for it? For instance, you may want to write a description for each product in your online store.

This problem is called image captioning. Recently, Midjourney announced a "describe" feature that generates image captions. All you need to do is type the /describe command and upload your image.

I don't know exactly how Midjourney's describe feature works, but there are some general approaches to image captioning. Usually, the process is as follows:

  1. You have an image as input.

  2. You use a Convolutional Neural Network (CNN) to transform this image into a set of numbers called an embedding. For example, an image can be represented as a vector of 1,000 numbers, with each image getting its own vector (see the sketch after this list).

  3. This embedding is used as input to the second part of the network, which can be a transformer (as in GPT, Generative Pre-trained Transformer) or another language model.

  4. This second part now acts as a standard language model, generating word after word. I described this process in more detail here.
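
To make step 2 concrete, here is a minimal sketch of turning an image into an embedding with a pretrained CNN. It uses torchvision's ResNet-50 and, to match the example above, takes the 1,000-number output vector as the embedding; real captioning systems usually take features from an earlier layer, and the file name photo.jpg is just a placeholder.

    # Minimal sketch: image -> embedding with a pretrained CNN (torchvision).
    # The 1000-dimensional output matches the example in step 2 above;
    # captioning systems usually use features from an earlier layer instead.
    import torch
    from PIL import Image
    from torchvision import models, transforms

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.eval()

    # Standard ImageNet preprocessing for ResNet models
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        embedding = model(image)

    print(embedding.shape)  # torch.Size([1, 1000]) -- one vector per image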

If you are interested in open-source image captioning implementations, take a look at the LAVIS library.
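
For instance, here is a minimal end-to-end captioning sketch using LAVIS with a BLIP model. The model and checkpoint names follow the LAVIS README and may change between library versions, so treat them as assumptions.

    # Minimal sketch: end-to-end image captioning with LAVIS (BLIP model).
    # The name/model_type values follow the LAVIS README (assumptions here).
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption", model_type="base_coco", is_eval=True, device=device
    )

    raw_image = Image.open("photo.jpg").convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # The decoder generates the caption word after word, as in step 4 above.
    print(model.generate({"image": image}))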

Midjourney Describe Examples

Now, let's look at some examples of Midjourney's describe feature in action. I took a couple of photos, and here are the results.

Input image

1️⃣ pink cherry blossoms on sidewalks at durani park in cambridge, massachusetts, in the style of panasonic lumix s pro 50mm f/1.4, close up, delicate floral studies, close-up intensity, delicate flowers, hanging scroll, manapunk --ar 3:4
2️⃣ some flowers are blooming on a branch by a city, in the style of nikon d850, cherry blossoms --ar 3:4
3️⃣ pink cherry blossom branches in front of building, in the style of nikon d850, yankeecore, chen zhen, tokio aoyama --ar 3:4
4️⃣ white cherry blossomed in the trees, with skyscrapers, in the style of dark pink and light brown, nikon d850, hanging scroll, close-up, yankeecore, pink, happenings --ar 3:4

Input image

1️⃣ a brown dog is seen standing down from walking with two people that are wearing shoes, in the style of poodlepunk, dark amber and azure, stockphoto, site-specific, precisionist lines, #vfxfriday, urban and edgy --ar 3:4
2️⃣ a person wearing flipflops and a dog on a leash, in the style of poodlepunk, precisionist lines, precisionist, close-up intensity, earthy textures, 32k uhd, street-savvy --ar 3:4
3️⃣ the dog is looking out toward people, in the style of precisionist lines, durk and gritty --ar 3:4
4️⃣ a golden retriever dog laying on a sidewalk, in the style of poodlepunk, precisionist lines, dark red and light blue, 32k uhd, forced perspective, mountainous vistas --ar 3:4

As you can see, the descriptions read more like text prompts than natural descriptions. If you need more natural captions, take a look at the "Tool of the week" section.

How to use Image Captioning & Midjourney Describe

  • You can create a website that automatically generates descriptions for the products in your store.

  • You can do prompt-based image editing. First, Midjourney describes a given image (gives you a prompt). Then you edit this prompt and use it to generate a new image. This will not let you edit a particular part of the image; instead, you can modify the concept. For example, "Dog sitting on a street" can become "Dog sitting on a beach."

Tool of the week

SceneXplain is an AI-powered service that generates textual descriptions of images, i.e. image captions. It is available through a web interface and API.

News of the week

Microsoft brings a DALL-E-powered image generator to its Edge browser

Databricks open-sourced a new large language model (LLM) that is available for commercial use

Dolly 2.0 is a new instruction-following LLM and the first to be open-sourced and made available even for commercial use. Although it is not as good as ChatGPT, the model has been fine-tuned on a new, high-quality, human-generated dataset sourced from Databricks employees. It allows organizations to create, own, and customize powerful LLMs without paying for API access or sharing data with third parties.

The company has also released the databricks-dolly-15k dataset, an open-source collection of 15,000 high-quality, human-generated prompt-and-response pairs designed specifically for instruction tuning of large language models.
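
If you want to try Dolly 2.0 yourself, here is a minimal sketch using the Hugging Face transformers pipeline. The databricks/dolly-v2-12b checkpoint name and the trust_remote_code option follow the public model card and are assumptions here; note that the 12B model needs a large GPU, and smaller dolly-v2 variants also exist.

    # Minimal sketch: instruction-following generation with Dolly 2.0.
    # Checkpoint name and options follow the public model card (assumptions).
    import torch
    from transformers import pipeline

    generate_text = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # loads the model's custom instruct pipeline
        device_map="auto",
    )

    print(generate_text("Write a short product description for a ceramic coffee mug."))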

Findings of the week

Thank you for reading my newsletter. I would greatly appreciate any feedback you have: just reply to this email.

If you like my newsletter, feel free to share it on Twitter or just send a direct link to this post: https://syntha.beehiiv.com/p/midjouney-describe-generate-descriptions-for-your-images.