In notebook_02.ipynb, we tried multiple pipelines with different combinations of ControlNet conditioning, such as (a multi-ControlNet sketch follows this list):
- depth
- depth, canny
- depth, normal
- depth, canny, normal
- depth, normal, canny
- canny, depth, normal
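A minimal sketch of how such a multi-ControlNet pipeline can be assembled with diffusers is shown below; the checkpoint names, prompt, and conditioning-image file names are illustrative and not necessarily the exact ones used in the notebook.

```python
# Sketch of a multi-ControlNet pipeline (depth + canny); passing a list of
# ControlNets makes diffusers build a MultiControlNet under the hood.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

depth_cn = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
canny_cn = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[depth_cn, canny_cn],   # order here defines the conditioning order
    torch_dtype=torch.float16,
).to("cuda")

depth_img = load_image("depth.png")    # hypothetical conditioning images
canny_img = load_image("canny.png")

image = pipe(
    "a cozy living room, photorealistic",      # illustrative prompt
    image=[depth_img, canny_img],              # one image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.5],  # per-ControlNet weights
    num_inference_steps=50,
).images[0]
```

The depth+normal and depth+canny+normal variants follow the same pattern, with the corresponding ControlNet checkpoints and conditioning images appended to the lists.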
Some input depth images, such as 168x168 or 177x177, result in distorted outputs, sometimes even generating NSFW content. Additionally, a depth image with a resolution of 2668x2668 produces very noisy outputs. We therefore decided to resize all inputs to a standard size of 512x512.
We also normalized the input depth images to the range (0, 255).
- Normalize each depth image so that its pixel values are scaled between 0 and 255. This normalization ensures that all depth images are on a comparable scale for visualization and processing.
- Resize all depth images to a standard size of 512x512 pixels (per channel) to ensure uniform input to the model and to ease comparison across images. Using a fixed size simplifies the processing pipeline.
- Convert each depth image to a normal map, then use the ControlNet pipeline to apply depth conditioning followed by normal conditioning for improved generation (a preprocessing sketch follows this list).
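A short sketch of this preprocessing, assuming OpenCV/NumPy and a Sobel-gradient approximation for converting depth to surface normals; the function and file names are illustrative.

```python
# Normalize a depth map to 0-255, resize it to 512x512, and derive an
# approximate surface-normal map from its gradients.
import cv2
import numpy as np

def preprocess_depth(path, size=512):
    depth = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32)
    if depth.ndim == 3:
        depth = depth[..., 0]                 # keep a single channel
    # Scale pixel values to the 0-255 range.
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8) * 255.0
    # Resize to the standard 512x512 input size.
    depth = cv2.resize(depth, (size, size), interpolation=cv2.INTER_LINEAR)
    return depth.astype(np.uint8)

def depth_to_normals(depth):
    # Approximate surface normals from depth gradients (Sobel derivatives).
    d = depth.astype(np.float32)
    dx = cv2.Sobel(d, cv2.CV_32F, 1, 0, ksize=3)
    dy = cv2.Sobel(d, cv2.CV_32F, 0, 1, ksize=3)
    normals = np.dstack((-dx, -dy, np.ones_like(d)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    # Map from [-1, 1] to [0, 255] so the result can serve as a conditioning image.
    return ((normals * 0.5 + 0.5) * 255.0).astype(np.uint8)

depth = preprocess_depth("input_depth.png")   # hypothetical input file
normal = depth_to_normals(depth)
```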
Here, for each text prompt, three sets of images are provided:
- the input depth image, the canny edge image, and the normal map extracted from the given depth image
- the generated output image
- a comparison between the input depth image and the depth image extracted from the generated output
We also extracted depth from the generated output using monocular depth estimation and verified it visually; it is almost identical to the input depth image.
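A minimal sketch of this verification step, assuming the Hugging Face depth-estimation pipeline (e.g. Intel/dpt-large) as the monocular depth estimator; the exact model used in the notebook may differ.

```python
# Re-estimate depth from the generated image and compare it with the input depth map.
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth_estimator(generated_image)   # PIL image returned by the ControlNet pipeline
predicted_depth = result["depth"]           # PIL image of the estimated depth

# Display `predicted_depth` next to the input depth map for a visual side-by-side check.
```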
See final_pipeline.ipynb for the final output.
In aspect_ratio.ipynb notebook, it was demonstrated that for a given text prompt, we have two different depth images: one with an aspect ratio of 1:1 and the other with an aspect ratio close to 16:9.
Yes, we can generate images with different aspect ratios from Stable Diffusion.
This is the 1:1 aspect ratio image.
This is the 16:9 aspect ratio image.
The Stable Diffusion model is optimized for generating images with a 1:1 aspect ratio. In this case, a 1:1 aspect ratio depth image was created from the original 16:9 aspect ratio depth image. As a result, there is no distortion or strange output, since the image was only cropped, not stretched or compressed.
Cropping a depth image from 16:9 to 1:1 typically results in a more focused and condensed output, but it may reduce the richness and context of the generated image by removing peripheral details.
In this instance, the 1:1 aspect ratio output appears more detailed compared to the 16:9 aspect ratio output.
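A minimal sketch of the two aspect-ratio variants, assuming `pipe` is the depth-conditioned ControlNet pipeline from above, `prompt` is the text prompt, and `depth_16_9` is an approximately 16:9 PIL depth map; sizes and names are illustrative.

```python
# 16:9 generation: ask the pipeline for a wide output (dimensions must be multiples of 8).
wide = pipe(prompt, image=depth_16_9, width=912, height=512).images[0]

# 1:1 generation: center-crop the same depth map to a square first, then generate at 512x512.
w, h = depth_16_9.size
side = min(w, h)
left, top = (w - side) // 2, (h - side) // 2
depth_1_1 = depth_16_9.crop((left, top, left + side, top + side)).resize((512, 512))
square = pipe(prompt, image=depth_1_1, width=512, height=512).images[0]
```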
In the notebook on reducing generation latency, we demonstrate methods to reduce generation latency. The image below was generated with the default number of inference steps (i.e., 50) and the default resolution, taking 3.36 seconds to produce.
One way to reduce generation time is by decreasing the number of inference steps in the Stable Diffusion pipeline. In this case, the generation time was reduced to 2.1 seconds.
Another approach is to reduce the resolution of the generated output image, which resulted in a generation time of 2.7 seconds.
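A minimal sketch of the two latency knobs, assuming the same `pipe` and inputs as above; the step count and resolution used here are illustrative, and the timings quoted in this section come from the notebook runs, not from this snippet.

```python
import time

# Knob 1: fewer denoising steps.
start = time.time()
fewer_steps = pipe(prompt, image=depth_img, num_inference_steps=25).images[0]
print(f"fewer steps: {time.time() - start:.2f}s")

# Knob 2: lower output resolution (default step count).
start = time.time()
low_res = pipe(prompt, image=depth_img, width=384, height=384,
               num_inference_steps=50).images[0]
print(f"lower resolution: {time.time() - start:.2f}s")
```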
From visual inspection, we can conclude that these latency reductions also reduce the quality of the generated image.