
How to generate video with AI: latest methods

Generate video from text: Tune-A-Video, Gen-1, and Synthesia

Recent image generation models (such as Midjourney, Stable Diffusion, and DALL-E 2) have demonstrated impressive results. However, there is another interesting research topic that I have not covered yet: video generation. Video is a much harder domain. Besides making every frame high quality, the model also needs to make sure that each frame connects smoothly to its neighbouring frames.

In today’s post we are going to look into new methods for video generation:

  • Tune-A-Video, which utilises an existing image model and fine-tunes it for video generation;

  • Gen-1, which is able to modify an existing video according to a text prompt or an image.

I will also tell you about a video generation startup whose product can be used for marketing or educational videos.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

The recent success of generative models for images leads to a natural question: can we use the same models for video generation? I mean not just the model architecture, but the trained model itself.

While this looks like a very hard problem, the authors of Tune-A-Video found an answer. Their method takes a large pre-trained text-to-image diffusion model (such as Stable Diffusion) and fine-tunes it on just one text-video pair.

Once fine-tuned, the model can generate videos from similar text prompts. For example, you can replace an object in the prompt (e.g. "a man" → "a dog") or the location (e.g. "on the beach" → "on the street").
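
For illustration, here is roughly what inference looks like with the official repository. This is a minimal sketch following the example code in the Tune-A-Video repo; the module paths, checkpoint directories, and argument names are taken from its example usage and may differ in the current version of the code.

```python
import torch
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid

# Placeholder paths: the base Stable Diffusion weights and a UNet
# fine-tuned on a single "A rabbit is eating a watermelon" video.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
finetuned_path = "./outputs/rabbit-watermelon"

unet = UNet3DConditionModel.from_pretrained(
    finetuned_path, subfolder="unet", torch_dtype=torch.float16
).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(
    pretrained_model_path, unet=unet, torch_dtype=torch.float16
).to("cuda")

# Edit the training prompt: swap the object while keeping the motion.
prompt = "A tiger is eating a watermelon"
video = pipe(
    prompt,
    video_length=24,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=12.5,
).videos

save_videos_grid(video, "./tiger-watermelon.gif")
```

The key point is that only the UNet is fine-tuned on the single video, so swapping the subject or location in the prompt keeps the original motion while changing the content.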

Training video. Prompt: "A rabbit is eating a watermelon". Image source: https://github.com/showlab/Tune-A-Video

Model’s output. Prompt: "A rabbit is eating a pizza". Image source: https://github.com/showlab/Tune-A-Video

Model’s output. Prompt: "A tiger is eating a watermelon". Image source: https://github.com/showlab/Tune-A-Video 

I refer you to their project website for more examples of generated videos. While the quality is not yet on par with image generation, this is an important step towards realistic video generation.

Gen-1: Structure and Content-Guided Video Synthesis with Diffusion Models

Gen-1 is a new model for video editing. It modifies an existing video based on an image or a text prompt, and it supports several modes.

Stylization mode

The stylization mode applies the style of a reference image to an existing video. It can be useful if you want to turn pre-recorded footage into a cartoon.

Storyboard mode

In the storyboard mode, the model turns mockup footage into a stylized video based on a prompt.

Mask mode

The mask mode is similar to image editing (inpainting) with Stable Diffusion, but designed specifically for videos. You draw a mask over an object and give the model a text prompt, and the model modifies the masked region of the video according to the prompt.
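
Gen-1 itself is only accessible through Runway's product rather than an open API, so you cannot script this mode directly. For intuition, here is the single-image analogue mentioned above: masked editing with Stable Diffusion inpainting via the diffusers library. The file paths and the prompt are placeholders.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load a pre-trained Stable Diffusion inpainting pipeline.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# One frame of the video and a mask where white marks the region to edit.
frame = Image.open("frame.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

edited = pipe(
    prompt="a black dog with a red collar",
    image=frame,
    mask_image=mask,
).images[0]
edited.save("frame_edited.png")
```

Gen-1's contribution is doing this consistently across all frames of a video, whereas running the image pipeline frame by frame would produce flickering results.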

Render mode

Render mode turns renders into realistic videos using an image or a text prompt.

Customization mode

The customization mode allows fine-tuning the model on a small set of images. This makes Gen-1 really powerful, since you can provide images of the same object from different views and angles, which can lead to better quality.

This is the best model for creating videos I have ever seen. However, it still produces some artefacts: for example, textures do not transition smoothly from frame to frame, and the model struggles when the camera is moving. Despite those limitations, I see great potential for it in real-world applications today.

Here is what can be created using the Gen-1 model (click on the tweet below and watch the video):

How to use video-generating models

Here are the potential ideas I see:

  • Create services for video editing. Such a service could change objects, the background, or the style of an existing video based on a text prompt.

  • Create services for video content creation. Make it easy to generate ad videos based on mockups. The ad copy could be generated by GPT-3 or ChatGPT, and a text-to-speech service could turn that copy into a voice-over (see the sketch after this list).

  • Create cartoons from self-recorded videos. Turn real footage into cartoons with the stylization mode of Gen-1.
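
As a sketch of the second idea, the text part of such a pipeline can already be automated today. The snippet below generates ad copy with the ChatGPT API and turns it into a voice-over with gTTS; the model name, prompts, and output paths are placeholders, and any text-to-speech service could be substituted.

```python
import os
import openai
from gtts import gTTS  # simple free text-to-speech; any TTS service would work

openai.api_key = os.environ["OPENAI_API_KEY"]

# 1. Generate the ad copy with ChatGPT (model name and prompt are illustrative).
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You write short, punchy video ad scripts."},
        {"role": "user", "content": "Write a 3-sentence ad script for a reusable water bottle."},
    ],
)
ad_script = response["choices"][0]["message"]["content"]

# 2. Turn the script into a voice-over track.
gTTS(text=ad_script, lang="en").save("voiceover.mp3")

# 3. The script and audio can then drive the video step, e.g. Gen-1's
#    storyboard mode on mockup footage or a talking-head tool like Synthesia.
print(ad_script)
```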

I should note that the current quality of video generation is sometimes not good enough. However, knowing these methods will help you keep up with the latest technology and be prepared for a powerful, high-quality video generation model; its arrival is just a matter of time.

Tool of the week

Synthesia allows you to create videos from text. The generated videos are mostly talking heads: an AI avatar pronounces your script. Such a service can be very useful in the following cases:

  • If you want to provide your clients with personal feedback videos.

  • If you want to create how-to videos without recording a human.

  • If you want to create marketing videos for your website.

Their technology includes AI for avatar generation and voice synthesis.

Thank you for reading my newsletter. I would greatly appreciate any feedback you have. Additionally, if there are any topics you would like to see covered, please let me know. You can reply to this email.