ashok-arora/sdxl_naruto

Naruto-Style SDXL LoRA (Trained Under 16GB VRAM Constraint)

This project demonstrates how I fine-tuned Stable Diffusion XL (SDXL 1.0 Base) using LoRA to generate images in the Naruto anime style, all within a strict 16GB VRAM limit so it can run on the free-tier Colab instance.

⭐ Visual results are provided in the final section. ⭐

Link to Colab notebooks:

Directory Structure

ashok@penguin:~/sdxl_naruto$ tree -L 2
.
├── inference.ipynb                         <= Inference file
├── LICENSE
├── naruto_lora
│   ├── checkpoint-*                        <= Saved model checkpoint 
│   ├── logs                                <= Training logs
│   └── pytorch_lora_weights.safetensors    <= Final model checkpoint
├── output.png                              <= Base vs finetuned model
├── README.md
└── train.ipynb                             <= Training file

The naruto_lora folder is also available as a zip file on Google Drive.

Techniques used in the Approach

My goal was to adapt SDXL 1.0 Base to generate images in the Naruto anime style using LoRA fine-tuning.

SDXL has three main components. The UNet-XL performs denoising and is memory intensive due to large attention blocks. The dual text encoders, OpenAI CLIP ViT-L and OpenCLIP ViT-bigG, provide rich text embeddings. The improved VAE-XL handles high-resolution encoding and decoding and is also memory heavy during spatial convolutions.

The full model is large, with approximately 3.5 billion parameters, two text encoders, a high-resolution UNet and VAE, and a native 1024×1024 output resolution. On a 16GB GPU, loading the base model at full resolution causes out-of-memory errors. Using attention slicing and VAE tiling reduces memory usage and allows SDXL to load successfully for inference.

The following sections describe the techniques used for LoRA fine-tuning, with technical explanations in general and the motivation specific to this 16GB GPU setup.

Technique 1: Attention Slicing

Code used:

pipe.enable_attention_slicing()

Technical meaning: Attention is normally computed over the full hidden dimension $H$. With slicing, SDXL splits the attention operation into smaller chunks of size $\frac{H}{k}$. This reduces the peak memory requirement from

$$ O(H^2) \rightarrow O(H^2 / k) $$

Motivation behind using it: The UNet's cross-attention maps dominate peak VRAM at SDXL's scale. Slicing reduces peak memory enough for SDXL to load on a 16GB GPU.
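The memory trade-off can be sketched outside diffusers with plain NumPy. This toy version slices over query rows rather than reproducing the library's internals; the shapes and slice size are arbitrary illustrations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Materializes the full (N x N) score matrix at once.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def sliced_attention(q, k, v, slice_size):
    # Processes query rows in chunks, so the peak size of the score
    # matrix drops from N*N to slice_size*N.
    out = np.empty_like(v)
    for start in range(0, q.shape[0], slice_size):
        chunk = q[start:start + slice_size]
        scores = softmax(chunk @ k.T / np.sqrt(q.shape[-1]))
        out[start:start + slice_size] = scores @ v
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 64, 16))
assert np.allclose(full_attention(q, k, v), sliced_attention(q, k, v, 8))
```

Each iteration materializes only a slice_size × N score block instead of the full N × N matrix, and the outputs are numerically identical, which is why slicing trades a little speed for memory without changing results.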

Technique 2: VAE Tiling

Code used:

pipe.enable_vae_tiling()

Technical meaning: The SDXL VAE normally processes the entire spatial feature map at once, which creates large $O(H×W)$ memory spikes. Tiling breaks the image into smaller patches and processes them sequentially, keeping peak memory low.

Motivation behind using it: The VAE becomes very memory-heavy at 512px and above. Tiling prevents out-of-memory errors during both encoding and decoding, which is essential for running SDXL on a 16GB GPU.
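The same idea can be illustrated with a toy tiled pass in NumPy. The decode_like_op stand-in below is purely elementwise, so the tiles reassemble exactly; the real VAE decoder has convolutional receptive fields, which is why diffusers blends overlapping tiles rather than using a hard grid like this sketch:

```python
import numpy as np

def decode_like_op(latent):
    # Stand-in for the VAE decoder: any per-pixel (local) operation.
    return np.tanh(latent) * 2.0

def tiled_apply(latent, op, tile=32):
    # Process the spatial map one tile at a time instead of all at once,
    # so peak activation memory is bounded by the tile size.
    h, w = latent.shape
    out = np.empty_like(latent)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            out[y:y + tile, x:x + tile] = op(latent[y:y + tile, x:x + tile])
    return out

latent = np.random.default_rng(1).normal(size=(128, 128))
assert np.allclose(decode_like_op(latent), tiled_apply(latent, decode_like_op))
```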


Community benchmarks and training docs indicate that full SDXL fine-tuning requires more than 30GB of VRAM. My 16GB session crashed instantly when I attempted it, confirming the expected requirement. This made the following techniques necessary:

Technique 3: LoRA (Low-Rank Adaptation)

Code:

# L1225–L1232 in diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py
    unet_target_modules = ["to_k", "to_q", "to_v", "to_out.0"]
    unet_lora_config = get_lora_config(
        rank=args.rank,
        dropout=args.lora_dropout,
        use_dora=args.use_dora,
        target_modules=unet_target_modules,
    )
    unet.add_adapter(unet_lora_config)

Technical meaning:

The UNet is the main denoising model inside SDXL, loaded from the unet subfolder of the checkpoint. The DreamBooth LoRA training modifies only the UNet’s attention layers using low-rank adapters.

$$ W' = W + BA $$

During training, only the low-rank matrices A and B are updated, keeping SDXL's large pretrained weights frozen while allowing efficient adaptation. This design keeps memory usage low, avoids altering SDXL's massive UNet, enables strong style learning from small datasets, and allows fast, stable updates on limited VRAM.

Motivation:

Fully fine-tuning SDXL’s UNet is not possible on a 16GB GPU because the attention blocks and optimizer states exceed available VRAM. LoRA solves this by training only a small subset of parameters inside the attention projections. This lets SDXL learn new styles or concepts without touching the full model, making DreamBooth training feasible and efficient.
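The update rule $W' = W + BA$ can be made concrete with a small NumPy sketch; the hidden size and rank below are illustrative, not SDXL's actual dimensions:

```python
import numpy as np

d, r = 1024, 8                      # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (zero-init)

x = rng.normal(size=(d,))

# Adapted forward pass: W'x = Wx + B(Ax); W itself is never modified.
y = W @ x + B @ (A @ x)

# With B zero-initialized, the adapter starts as a no-op.
assert np.allclose(y, W @ x)

# Trainable parameters shrink from d*d to 2*r*d.
full, lora = d * d, 2 * r * d
print(f"trainable params: {lora} vs {full}")
```

At rank 8 the adapter trains under 2% of the parameters of the full projection it wraps, which is what makes optimizer states and gradients fit alongside the frozen model.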

Technique 4: Using limited dataset

Code:

# Load HF Naruto dataset
dataset = load_dataset("lambdalabs/naruto-blip-captions", split='train')

# Limit to first 75 items
dataset = dataset.select(range(75))

Technical meaning:

The full dataset contains more than 1000 Naruto-style images with BLIP captions, but I used only the first 75 images. For LoRA fine-tuning, the useful signal comes from style statistics such as line thickness, shading, and color palettes rather than identity variation. Since anime art has low intra-class variance, this small subset still captures the overall style distribution, allowing LoRA to learn the style effectively without the full dataset.

Motivation behind using it:

Using only 75 images reduces RAM and VRAM usage, which is good on Colab and other low-memory setups. Anime datasets are highly repetitive, so a small subset is sufficient for style learning. Since LoRA focuses on style rather than identity, larger datasets provide little additional benefit. Limiting the dataset also reduces CPU data pipeline load, keeps training lightweight, enables stable 1200-step fine-tuning within the 16GB VRAM limit, and minimizes the risk of overfitting on repeated frames.

Technique 5: Lower Resolution Training (512px)

Code:

  --resolution 512 \

Technical meaning:

In diffusion models, UNet activation memory scales with the spatial area, i.e., quadratically in the side length:

$$ \text{VRAM} \propto H \times W $$

So reducing the resolution from 1024 to 512 halves both height and width, cutting the activation area by 4×. Reducing the input resolution directly lowers UNet activations, attention map sizes, intermediate feature maps, and the overall batch memory footprint. SDXL’s UNet is extremely heavy, with multiple up/down-sampling blocks and large attention heads, so lowering the resolution reduces peak VRAM more than any other single change.
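The 4× figure follows directly from the latent geometry. Assuming SDXL's standard 8× VAE downsampling, a quick check:

```python
# SDXL's VAE downsamples by 8x, so pixel resolution maps to latent size.
def latent_tokens(resolution, vae_factor=8):
    side = resolution // vae_factor
    return side * side  # spatial tokens seen by the UNet's attention layers

t1024 = latent_tokens(1024)  # 128 * 128 = 16384 tokens
t512 = latent_tokens(512)    # 64 * 64 = 4096 tokens
assert t1024 // t512 == 4    # activations shrink 4x at 512px
```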

Motivation:

Using 512px instead of 1024px is essential for training on a 16GB GPU. Higher resolutions cause OOM even with memory-saving techniques. A smaller resolution also allows faster batch processing, more stable training loops, and efficient LoRA fine-tuning without compromising the quality of the Naruto anime style.

Technique 6: Mixed Precision: bf16

Code used:

--mixed_precision bf16

Technical meaning:

Bfloat16 (bf16) stores tensors in 16 bits while keeping an 8-bit exponent, providing a larger dynamic range than fp16. This reduces activation memory by roughly half compared to fp32, prevents NaN spikes and gradient underflow, and works seamlessly with PyTorch AMP on modern GPUs.
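The dynamic-range difference follows from the exponent widths. A quick computation of the largest finite value for each format, assuming the standard IEEE-style bit layouts:

```python
def max_finite(exp_bits, mantissa_bits):
    # Largest finite value for an IEEE-style float format.
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias

fp16 = max_finite(exp_bits=5, mantissa_bits=10)   # 65504.0
bf16 = max_finite(exp_bits=8, mantissa_bits=7)    # ~3.39e38
fp32 = max_finite(exp_bits=8, mantissa_bits=23)   # ~3.40e38

assert fp16 == 65504.0
# bf16 shares fp32's exponent range, so values that overflow fp16
# (e.g. large attention logits) stay finite in bf16.
assert bf16 > 1e38 and fp16 < 1e5
```

bf16 trades mantissa precision (7 bits vs fp16's 10) for fp32's full exponent range, which is the trade that keeps SDXL's large attention activations from overflowing.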

Motivation for using bf16:

Fp16 is unstable for SDXL’s UNet due to large attention layers. Using bf16 enables stable LoRA fine-tuning within 16GB VRAM, maintains valid gradients over 1200+ steps, and avoids crashes without increasing memory usage.

Technique 7: 8-bit Adam

Code used:

--use_8bit_adam

Technical meaning:

The standard Adam optimizer keeps two fp32 state tensors per parameter, the first moment (m) and the second moment (v), roughly tripling memory relative to the weights alone. 8-bit Adam compresses the m and v states into int8, shrinking optimizer state memory by roughly 4×, speeding up parameter updates, and lowering host-to-device transfer.
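A simplified absmax blockwise quantizer illustrates the idea; the real bitsandbytes implementation uses dynamic quantization and further refinements, so this is only a sketch of the compression step:

```python
import numpy as np

def quantize_blockwise(state, block=64):
    # Store optimizer state as int8 plus one fp32 scale per block,
    # cutting state memory roughly 4x versus fp32.
    blocks = state.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    q = np.round(blocks / np.where(scales == 0, 1, scales)).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

m = np.random.default_rng(2).normal(size=(4096,)).astype(np.float32)
q, scales = quantize_blockwise(m)
m_hat = dequantize_blockwise(q, scales, m.shape)

# Reconstruction error stays small relative to the state's magnitude,
# which is why Adam's adaptive behavior survives the compression.
assert np.abs(m - m_hat).max() <= np.abs(m).max() / 127 + 1e-6
```

Per-block scales are the key detail: a single global scale would let one large outlier crush the resolution of every other entry.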

Motivation for using 8-bit Adam:

Full Adam does not fit on a 16GB GPU with SDXL. Using 8-bit Adam allows LoRA to retain Adam’s adaptive updates while significantly reducing VRAM usage and maintaining stable training.

Technique 8: Batch Size + Gradient Accumulation

Code used:

--train_batch_size 2
--gradient_accumulation_steps 2

Technical meaning:

Gradient accumulation simulates a larger effective batch size without storing all samples in memory at once. Gradients are computed on smaller micro-batches, accumulated, and then used for a single optimizer step. This reduces memory usage while providing more stable gradient updates.
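The equivalence between accumulated micro-batch gradients and one large-batch gradient can be checked on a toy linear model:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 5))          # one "effective batch" of 4 samples
y = rng.normal(size=(4,))
w = rng.normal(size=(5,))

def grad(Xb, yb, w):
    # Mean-squared-error gradient for a linear model on one micro-batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# One big batch of 4:
g_full = grad(X, y, w)

# Two micro-batches of 2, gradients averaged before the optimizer step:
g_accum = (grad(X[:2], y[:2], w) + grad(X[2:], y[2:], w)) / 2

assert np.allclose(g_full, g_accum)
```

Only one micro-batch's activations live in memory at a time, yet the optimizer sees exactly the batch-of-4 gradient, which is the effective-batch argument used above.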

Motivation for these specific values:

For this training, a batch size of 2 and gradient accumulation of 2 were used. This creates an effective batch of 4 while keeping GPU memory within 16GB. It ensures stable gradients without causing out-of-memory errors and allows LoRA fine-tuning to complete 1200 steps efficiently.

Technique 9: Using a Fixed Instance Prompt

Code used:

  --instance_prompt "Naruto style anime character" \
  --class_prompt "anime character" \

Technical description:

The dataset contains BLIP-generated captions, but they describe content rather than style. For style LoRA, the model needs to learn line-art, color palette, and shading patterns. Content captions can introduce noise and unnecessary text-conditioning, and irrelevant captions may push training in the wrong direction.

Motivation for ignoring captions:

In this project, the goal was to learn the Naruto anime style, not a new character or object. Using captions could bias the model toward specific poses or backgrounds, add noise due to imperfect BLIP captions, and increase gradient variance. Ignoring captions simplifies training and ensures stable, style-focused fine-tuning.

Visual Results

Output Samples
