This repository contains an implementation of the Stable Diffusion model for image generation. It is trained on the Flowers102 dataset. Everything is implemented from scratch using PyTorch.
Here are some samples generated by the model after training for 500 epochs. The model is able to generate realistic-looking flowers (and some that look like organic waste).
To get started with this project, follow these steps:

1. **Clone this repository:**

   ```bash
   git clone https://github.com/ProfessorNova/Stable-Diffusion.git
   cd Stable-Diffusion
   ```

2. **Set up a Python environment:** Make sure you have Python installed (tested with Python 3.10.11).

3. **Install PyTorch:** Visit the PyTorch website for the proper PyTorch installation based on your system configuration.

4. **Install additional dependencies:** There are two additional dependencies required for this project. `tqdm` is used for progress bars and `matplotlib` is used for plotting the results during inference:

   ```bash
   pip install tqdm matplotlib
   ```

5. **Run the pretrained model:** To generate images using the pretrained model, run:

   ```bash
   python sd_inference.py
   ```

   This will generate eight images and plot them using matplotlib.
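Conceptually, inference starts from pure Gaussian noise and iteratively denoises it with the U-Net. The sketch below illustrates that reverse process; the function names, the cosine-style noise schedule, the `max_noise` clamp, and the `model(noisy_images, noise_levels)` signature are assumptions for illustration, not the actual code in `sd_inference.py`:

```python
import torch

@torch.no_grad()
def sample(model, num_images=8, steps=20, size=128):
    """Hypothetical sketch of reverse diffusion sampling: start from pure
    noise and repeatedly (estimate clean image, re-noise to a lower level).
    The real sd_inference.py may differ in schedule and details."""
    max_noise = 0.98  # assumed clamp so the signal rate never hits exactly zero
    x = torch.randn(num_images, 3, size, size)  # start from pure Gaussian noise
    for step in range(steps, 0, -1):
        level = torch.full((x.size(0), 1, 1, 1), max_noise * step / steps)
        signal_rate = torch.cos(level * torch.pi / 2)   # assumed cosine-style schedule
        noise_rate = torch.sin(level * torch.pi / 2)
        pred_noise = model(x, level)                    # U-Net predicts the noise component
        pred_image = (x - noise_rate * pred_noise) / signal_rate  # estimate the clean image
        # re-noise the estimate down to the next (lower) noise level, DDIM-style
        next_level = torch.full_like(level, max_noise * (step - 1) / steps)
        x = (torch.cos(next_level * torch.pi / 2) * pred_image
             + torch.sin(next_level * torch.pi / 2) * pred_noise)
    return pred_image

# Tiny smoke test with a stand-in "model" that predicts zero noise:
imgs = sample(lambda x, t: torch.zeros_like(x), num_images=2, steps=5, size=32)
# imgs has shape (2, 3, 32, 32)
```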
To get a better understanding of how Stable Diffusion works and how it is implemented in this repository, I created a Jupyter notebook (notebook.ipynb) that explains the fundamentals of Stable Diffusion alongside the code. It also shows a creative way to generate images using a hand-drawn sketch of a flower.
Have a look at the unet.py file in the lib folder of this repository if you want to see the details of the model.
- **Noise Embedding**
  - We first map the scalar noise level (a single float) into a high-dimensional embedding (1×1×64).
  - This embedding is broadcast, upsampled, and fused with feature maps in the decoder, so the network "knows" how much noise to remove at each spatial location.
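A common way to build such an embedding is a sinusoidal frequency encoding, as used in the Keras DDIM example this project is inspired by. The sketch below is a hypothetical version of that idea; the function name, the frequency range, and the exact layout are assumptions, not the code from `lib/unet.py`:

```python
import math
import torch

def sinusoidal_embedding(noise_level: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Hypothetical sketch: map a scalar noise level (shape (B, 1, 1, 1))
    to a (B, 1, 1, dim) embedding using sin/cos at geometric frequencies."""
    half = dim // 2
    # geometrically spaced frequencies from 1 to 1000 (assumed range)
    freqs = torch.exp(torch.linspace(math.log(1.0), math.log(1000.0), half))
    angles = 2.0 * math.pi * noise_level * freqs  # broadcasts to (B, 1, 1, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

levels = torch.rand(4, 1, 1, 1)       # a batch of 4 scalar noise levels
emb = sinusoidal_embedding(levels)    # -> shape (4, 1, 1, 64)
```

Low frequencies let the network distinguish coarse noise regimes, high frequencies fine ones.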
- **Encoder (DownBlocks)**
  - The noisy image (128×128×3) is first processed by a `Conv2D` layer to lift it into a (128×128×64) feature map.
  - We then apply a series of DownBlocks, each of which:
    - Halves the spatial resolution (e.g. 128→64, 64→32, …)
    - Increases the number of channels (e.g. 64→128→256→512→1024)
    - Uses residual connections internally to ease gradient flow and preserve information.
  - At each stage we save the output feature map for later skip connections.
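A minimal DownBlock following this recipe could look like the sketch below. The layer choices (two 3×3 convolutions, a 1×1 skip projection, average pooling, SiLU) are assumptions for illustration; the actual implementation is in `lib/unet.py`:

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Hypothetical DownBlock sketch: a residual double-convolution,
    then 2x pooling. Returns both the pooled map and the pre-pool map
    so the latter can be used as a skip connection in the decoder."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # 1x1 conv to match channels for the residual add
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.pool = nn.AvgPool2d(2)  # halves the spatial resolution

    def forward(self, x: torch.Tensor):
        h = self.act(self.conv1(x))
        h = self.conv2(h) + self.skip(x)  # residual connection
        return self.pool(h), h            # (downsampled output, feature map saved for the skip)

block = DownBlock(64, 128)
y, skip = block(torch.randn(1, 64, 128, 128))
# y: (1, 128, 64, 64), skip: (1, 128, 128, 128)
```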
- **Bottleneck (ResidualBlock ×2)**
  - Once we reach the smallest spatial size (8×8), we apply two ResidualBlocks at a constant channel width (1024).
  - These deepen the network's representational power without further downsampling.
- **Decoder (UpBlocks)**
  - We then reverse the process with a series of UpBlocks, each of which:
    - Upsamples spatially (e.g. 8→16, 16→32, …)
    - Reduces channel width symmetrically to the encoder (e.g. 1024→512→256→128→64)
    - Concatenates with the corresponding encoder feature map (the skip connection) at the same resolution
    - Fuses via convolution and residual connections
  - This combination of coarse, high-level features with fine, low-level details allows precise reconstruction of the denoised image.
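The upsample-concatenate-fuse pattern can be sketched as below. Bilinear upsampling and the specific layer layout are assumptions; the real UpBlock lives in `lib/unet.py`:

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Hypothetical UpBlock sketch: upsample 2x, concatenate the matching
    encoder feature map (skip connection), then fuse with convolutions."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # after concatenation the input has in_ch + skip_ch channels
        self.conv1 = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                   # e.g. 8x8 -> 16x16
        x = torch.cat([x, skip], dim=1)  # fuse coarse features with fine encoder details
        x = self.act(self.conv1(x))
        return self.act(self.conv2(x))

up = UpBlock(in_ch=1024, skip_ch=512, out_ch=512)
out = up(torch.randn(1, 1024, 8, 8), torch.randn(1, 512, 16, 16))
# out: (1, 512, 16, 16)
```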
- **Final Convolution**
  - After the last UpBlock (back to 128×128×64), a simple `Conv2D` layer reduces the channels to 3, yielding a predicted noise map (128×128×3).
Visualization:
To train the model from scratch, run the following command:

```bash
python sd_train.py
```

This will start the training process. The model will generate samples after every epoch and save them in the `output_sd` folder by default.
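Since the U-Net predicts the noise that was added to an image, one training step boils down to: noise an image by a random amount, predict that noise, and minimize the mean-squared error. The sketch below illustrates this; the schedule, the `model(noisy_images, noise_levels)` signature, and the helper names are assumptions, not the actual loop in `sd_train.py`:

```python
import torch
import torch.nn.functional as F

def training_step(model, images, optimizer):
    """Hypothetical sketch of one diffusion training step."""
    noise_levels = torch.rand(images.size(0), 1, 1, 1)     # one random noise level per image
    noise = torch.randn_like(images)
    signal_rates = torch.cos(noise_levels * torch.pi / 2)  # assumed cosine-style schedule
    noise_rates = torch.sin(noise_levels * torch.pi / 2)
    noisy_images = signal_rates * images + noise_rates * noise
    pred_noise = model(noisy_images, noise_levels)         # the U-Net predicts the added noise
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Smoke test with a stand-in model (ignores the noise level for brevity):
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, x, noise_levels):
        return self.conv(x)

model = TinyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = training_step(model, torch.randn(2, 3, 32, 32), opt)
```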
Here are some images generated during training:

- **Epoch 1:** It is just pure noise at this point.
- **Epoch 10:** The model is starting to generate some larger blobs.
- **Epoch 50:** You can see some flower-like structures starting to form.
- **Epoch 100:** Colors are getting more vibrant and the shapes are more defined.
- **Epoch 300:** Now you can really spot the flowers, but some still look very weird.
- **Epoch 500:** Now almost all images look like flowers. Some are very realistic, some are not.
This project was heavily inspired by the Keras example "Denoising Diffusion Implicit Models" by András Béres.