
How to generate visual novels with Stable Diffusion — prompt structure included!

Let's say you have a text, and you wish to enhance it with illustrations or even transform it into a short animated movie. The process of hiring illustrators and voice actors is both costly and time-consuming.

Furthermore, their individual styles might not align with your vision. Fortunately, thanks to Generative AI, you can swiftly generate and experiment with illustrations and text-to-speech processing.

In this guide, I will take you through the process of creating visual novels. We will delve into how to break down unstructured text into scenes, how to incorporate specific actors into a scene, and so on.

Let's commence our storytelling journey.

Compose the story prompt structure with ChatGPT

First, we must divide the unstructured text into scenes. This will result in each scene possessing a similar structure, making it possible to customize and visualize each individual scene.

To accomplish this, we will employ OpenAI's text models; GPT-3 should suffice. The most challenging aspect is supplying an effective prompt. For instance, we could use the following:

Please split the following story into three scenes while maintaining
its original meaning and intent as closely as possible, word for word.
Before each scene, put in square brackets a detailed description,
taken from the scene text, of the place where the scene happens,
for example [environment: room in the hotel, red carpet, white walls, raining outside].
Also before each scene, put in square brackets the main character
of the scene, taken from the scene text,
for example [main-character: woman]

The generated result can then be parsed using regular expressions:

[environment: hillside with sheep grazing][main-character: woman]
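
A quick sketch of how those tags can be pulled out with Python's re module (the tag names follow the prompt above):

import re

scene_header = "[environment: hillside with sheep grazing][main-character: woman]"

# Capture every "key: value" pair enclosed in square brackets
tags = dict(re.findall(r"\[([\w-]+):\s*([^\]]+)\]", scene_header))
print(tags)  # {'environment': 'hillside with sheep grazing', 'main-character': 'woman'}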

Alternatively, we can provide a prompt to generate a valid JSON response:

Please split the text into 3 scenes and provide output in valid JSON format:
{
  "scenes": [
    { "environment": "", "character": "", "emotion": "", "sceneText": "" },
    { "environment": "", "character": "", "emotion": "", "sceneText": "" },
    { "environment": "", "character": "", "emotion": "", "sceneText": "" }
  ]
}
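
Here is a sketch of the call and the parsing step using the OpenAI Python client; the model name and the exact prompt wording are assumptions:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SPLIT_PROMPT = (
    "Please split the text into 3 scenes and provide output in valid JSON format "
    '{"scenes": [{"environment": "", "character": "", "emotion": "", "sceneText": ""}]}'
)

def split_into_scenes(story: str) -> list:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{SPLIT_PROMPT}\n\n{story}"}],
    )
    # The model returns the JSON document as plain text
    return json.loads(response.choices[0].message.content)["scenes"]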

Once we have structured the scenes, we are prepared to visualize them.

Generate novel scenes with Stable Diffusion and Dreambooth

After we've structured the scenes, we need to establish a mechanism for image generation. The primary requirements for the image generation model include:

  • The model should provide an API for image creation. This rules out Midjourney, which lacked a public API at the time this article was written.
  • Consistency in character design and visual style is critical: characters should look the same from scene to scene. For this reason, the Stability.ai REST API is not a suitable choice; its hosted Stable Diffusion can change output dramatically even with minor adjustments to the prompt or seed.

In light of the requirements stated above, an ideal choice would be a custom open-sourced Stable Diffusion model with Dreambooth fine-tuning.

Generate consistent characters with fine-tuning

To maintain consistent characters in the Visual Novel, we should fine-tune the Stable Diffusion model. Dreambooth can assist in this process by enabling you to train a model using several images. You can experiment in Google Colab until you achieve satisfactory results. Please note that a unique model will need to be created for each character.

There are alternative approaches to fine-tuning your model. However, we have tested Dreambooth, and it has proven to be quite effective.
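
Once training is complete, you can sanity-check the character by loading the fine-tuned weights with the diffusers library. A minimal sketch, assuming a local model path and the common "sks" identifier token bound to the character during training:

import torch
from diffusers import StableDiffusionPipeline

# Load the Dreambooth fine-tuned checkpoint (path is illustrative)
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-woman", torch_dtype=torch.float16
).to("cuda")

# "sks woman" refers to the unique token the character was trained on
image = pipe("photo of sks woman, hillside with sheep grazing, soft light").images[0]
image.save("character_test.png")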

Once we have created a set of characters, it is time to place them in a scene.

Control poses of characters

Once you have a model that produces a consistent character, you can start posing them. To achieve character posing, try using the ControlNet openpose model. There isn't an open Google Colab notebook available to apply your custom Dreambooth model to ControlNet. You will need to write some Python code to handle this task.
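
A rough sketch of that code using the diffusers library and the public openpose ControlNet checkpoint; the Dreambooth model path, the "sks" token, and the pose reference image are assumptions:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "./dreambooth-woman",  # your fine-tuned Dreambooth model (illustrative path)
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose = load_image("pose_reference.png")  # an openpose skeleton image
image = pipe(
    "photo of sks woman, hillside with sheep grazing",
    image=pose,
    num_inference_steps=30,
).images[0]
image.save("scene_posed.png")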

Let's make our characters say a few words.

Speech generation

To generate a voice for the characters, we recommend using Elevenlabs. This service enables you to create a custom voice and generate speech audio from text input.
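
Elevenlabs exposes a simple REST endpoint for text-to-speech. A minimal sketch with the requests library, assuming you have already created a custom voice (the API key and voice ID below are placeholders):

import requests

ELEVENLABS_API_KEY = "your-api-key"
VOICE_ID = "your-custom-voice-id"

def synthesize(text: str, out_path: str) -> None:
    # Returns audio bytes (MPEG by default) for the given text
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        json={"text": text},
        timeout=60,
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

synthesize("It was a quiet morning on the hillside.", "scene_1.mp3")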

Visual novel effects

To make your static images feel more dynamic, consider animating them with a parallax effect. We found a solution that generates parallax motion; with some code modifications, you can control the animation's duration and type. The result is a short video file containing the parallax animation.
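
That solution required code changes; as a simpler stand-in, FFmpeg's zoompan filter can turn a static image into a short clip with a basic pan-and-zoom motion (not true depth-based parallax; the resolution and duration below are arbitrary):

import subprocess

# Render a 5-second clip that slowly zooms into a static scene image.
# zoompan emits 125 frames (d=125) at the default 25 fps; zoom caps at 1.5x.
subprocess.run([
    "ffmpeg", "-y", "-loop", "1", "-i", "scene_1.png",
    "-vf", "zoompan=z='min(zoom+0.0015,1.5)':d=125:s=1280x720,format=yuv420p",
    "-t", "5", "scene_1.mp4",
], check=True)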

Let’s put everything together.

Putting everything together

To assemble the short animation clips into a single movie, you can utilize the FFmpeg library.

Here's what we accomplished using FFmpeg (see the command sketch after this list):

  • We added speech to each clip.
  • Subtitles were integrated into the clips.
  • The individual clips were merged together.
  • We incorporated background music into the movie.
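
A sketch of the corresponding FFmpeg invocations, driven from Python via subprocess (all file names are placeholders):

import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 1. Mux the generated speech into a scene clip (keep video, encode audio as AAC)
run(["ffmpeg", "-y", "-i", "scene_1.mp4", "-i", "scene_1.mp3",
     "-c:v", "copy", "-c:a", "aac", "-shortest", "scene_1_voiced.mp4"])

# 2. Burn subtitles into the clip
run(["ffmpeg", "-y", "-i", "scene_1_voiced.mp4",
     "-vf", "subtitles=scene_1.srt", "scene_1_subbed.mp4"])

# 3. Concatenate the clips listed in clips.txt (one "file 'scene_N_subbed.mp4'" per line)
run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "clips.txt",
     "-c", "copy", "merged.mp4"])

# 4. Mix background music under the speech track
run(["ffmpeg", "-y", "-i", "merged.mp4", "-i", "music.mp3",
     "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[a]",
     "-map", "0:v", "-map", "[a]", "-c:v", "copy", "movie.mp4"])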


Deployment

As a result, we have several Dreambooth models for character generation and a model for parallax animation. For deployment, we selected the Replicate service. With Replicate, we only pay for generation time, making it more cost-effective than running a custom host.

However, creating a Replicate build is not entirely straightforward. It requires a compute instance with CUDA installed. To accommodate this requirement, we conducted deployments from specialized GPU instances in Google Cloud.
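
For reference, Replicate packages models with its Cog tool: you describe the environment in cog.yaml and implement a predictor in Python. A minimal predictor sketch, with the model path and inputs as assumptions:

import torch
from cog import BasePredictor, Input, Path
from diffusers import StableDiffusionPipeline

class Predictor(BasePredictor):
    def setup(self):
        # Load the fine-tuned Dreambooth checkpoint once per container start
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "./dreambooth-woman", torch_dtype=torch.float16
        ).to("cuda")

    def predict(self, prompt: str = Input(description="Scene prompt")) -> Path:
        image = self.pipe(prompt).images[0]
        out_path = "/tmp/out.png"
        image.save(out_path)
        return Path(out_path)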

Results

Thus, we have mastered the generation of all assets for a Visual Novel, including scenes, images, speech, and animation effects. Additionally, we've identified an optimal location for model deployment.

As a bonus, allow me to share some constraints and issues that we encountered:

  • Displaying more than one character in a scene is challenging, given that we have one Dreambooth model per character.
  • Elevenlabs has stringent rate limits. Network errors can occur with a large number of scenes.
  • Replicate places each query in a queue, making it challenging to scale. With a large user base, you will need to resort to a custom host.

Thank you for reading! I hope you found this article helpful.
