This repository contains the complete project for the Ayna ML Assignment. The goal is to train a conditional UNet model to color a polygon based on its shape (from an image) and a desired color (from a text prompt). The entire implementation, from data loading to training and inference, is contained within a single Google Colab notebook.
## Table of Contents

- Project Overview
- Key Features
- Dataset Strategy
- Model and Training Methodology
- Experiments Summary
- Key Learnings
- How to Run
## Project Overview

The core of this project is a deep learning model that performs conditional image-to-image translation. It takes two inputs:
- An image of a polygon outline.
- A text prompt specifying a color (e.g., "blue", "red").
The model's output is an image of the input polygon filled with the specified color. This is achieved using a UNet architecture conditioned on text embeddings generated by OpenAI's CLIP model.
## Key Features

- **Conditional UNet:** Uses `UNet2DConditionModel` from Hugging Face `diffusers`, which injects conditioning information via cross-attention layers.
- **CLIP-Powered Text Conditioning:** Employs the `openai/clip-vit-base-patch32` model to transform color names into rich semantic embeddings.
- **Advanced Loss Function:** A composite loss combining pixel-wise (MSE), perceptual (LPIPS), structural (SSIM), and a domain-specific Color Loss in the LAB color space to achieve high-fidelity results.
- **Sophisticated Training Schedule:** Uses an `AdamW` optimizer with a linear-warmup, cosine-decay learning-rate scheduler for stable and effective training.
- **Robust Data Pipeline:** Leverages datasets from the Hugging Face Hub, combining synthetic and augmented data with on-the-fly transformations to ensure model generalization.
- **Comprehensive Experiment Tracking:** All experiments are logged to Weights & Biases for detailed analysis and reproducibility.
## Dataset Strategy

A robust dataset was created by combining two sources from the Hugging Face Hub: `bhavya777/synthetic-colored-shapes` and `bhavya777/augmented-colored-shapes`.
To ensure the model generalizes well, an on-the-fly augmentation pipeline was used during training. This included paired geometric transforms (random flips and rotations) and color jitter. This strategy was proven critical after an early model trained only on white backgrounds failed completely when tested on a noisy black background.
## Model and Training Methodology

The generator is a `UNet2DConditionModel` conditioned on text embeddings from a `CLIPTextModel` (512-dimensional for `openai/clip-vit-base-patch32`). The text embeddings guide the UNet's decoding path via cross-attention, allowing the model to inject the correct color information at multiple feature-map scales.
- **Optimizer:** `AdamW`
- **Learning Rate:** $1 \times 10^{-4}$ (base)
- **Weight Decay:** $1 \times 10^{-4}$
- **Scheduler:** Linear warmup for the first 10% of steps, followed by a cosine decay schedule.
- **Gradient Clipping:** Max norm of 1.0 to prevent exploding gradients.
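A minimal sketch of this optimizer-and-scheduler setup (the step counts are placeholders; the notebook would derive them from the dataloader length and epoch count):

```python
import math
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

total_steps = 1000
warmup_steps = int(0.1 * total_steps)  # linear warmup over the first 10% of steps

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                 # linear ramp 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training step, gradients are clipped before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

print(1e-4 * lr_lambda(warmup_steps))  # 0.0001 (peak LR right after warmup)
```

The learning rate rises linearly from 0 to the base value of $1 \times 10^{-4}$, then decays smoothly to 0 over the remaining steps.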
The final loss was a weighted sum of four components, designed to balance different aspects of image quality:
$ \mathcal{L}_{total} = 1.0 \cdot L_{MSE} + 0.5 \cdot L_{LPIPS} + 0.2 \cdot L_{SSIM} + 0.3 \cdot L_{Color} $
The custom Color Loss converts both the prediction and the target to the LAB color space and penalizes the difference there, directly targeting the color accuracy that pixel-wise RGB losses tend to underweight.
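The README does not spell out the Color Loss implementation, so the following is a plausible pure-PyTorch sketch. The sRGB-to-LAB conversion constants (D65 white point) are standard; the notebook's actual implementation may differ, e.g. by using a library routine such as `kornia.color.rgb_to_lab`:

```python
import torch
import torch.nn.functional as F

def rgb_to_lab(rgb: torch.Tensor) -> torch.Tensor:
    """Convert an (N, 3, H, W) sRGB tensor in [0, 1] to CIELAB (D65 white point)."""
    # sRGB -> linear RGB (inverse gamma)
    lin = torch.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    # linear RGB -> XYZ
    m = torch.tensor([[0.4124, 0.3576, 0.1805],
                      [0.2126, 0.7152, 0.0722],
                      [0.0193, 0.1192, 0.9505]], dtype=rgb.dtype)
    xyz = torch.einsum("ij,njhw->nihw", m, lin)
    # normalize by the D65 reference white, then apply the LAB transfer function
    xyz = xyz / torch.tensor([0.95047, 1.0, 1.08883], dtype=rgb.dtype).view(1, 3, 1, 1)
    f = torch.where(xyz > 0.008856, xyz ** (1.0 / 3.0), 7.787 * xyz + 16.0 / 116.0)
    fx, fy, fz = f[:, 0], f[:, 1], f[:, 2]
    return torch.stack([116.0 * fy - 16.0,      # L: lightness
                        500.0 * (fx - fy),      # a: green-red axis
                        200.0 * (fy - fz)],     # b: blue-yellow axis
                       dim=1)

def color_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance in LAB space, emphasizing perceptual color differences."""
    return F.l1_loss(rgb_to_lab(pred), rgb_to_lab(target))
```

In the full objective, this term would be combined with the MSE, LPIPS, and SSIM terms using the weights in the formula above. Working in LAB separates lightness from chroma, so a wrong hue is penalized even when the RGB error is small.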
## Experiments Summary

A total of 8 models were trained to systematically find the optimal configuration. The detailed qualitative results for each epoch can be found in the final PDF report.
| Model | Key Change / Loss Function | Architecture Details | Parameters | Epochs | W&B Link | Hugging Face Hub |
|---|---|---|---|---|---|---|
| 1 | Baseline: MSE Only | (32, 64, 128), layers=1 | 68,148,747 | 5 | Link | Link |
| 2 | Added LPIPS | (32, 64, 128), layers=1 | 68,148,747 | 5 | Link | Link |
| 3 | Added SSIM | (32, 64, 128), layers=1 | 68,148,747 | 5 | Link | Link |
| 4 | Added Color Loss | (32, 64, 128), layers=1 | 68,148,747 | 5 | Link | Link |
| 5 | Deeper UNet | (64, 128, 256), layers=2 | 89,723,811 | 5 | Link | Link |
| 6 | Longer Training (10 Ep) | (32, 64, 128), head_dim=8 | 68,395,939 | 10 | Link | Link |
| 7 | Longer Training (15 Ep) | (32, 64, 128), head_dim=8 | 68,395,939 | 15 | Link | Link |
| 8 | Final Architecture | (64, 32, 64), head_dim=12 | 65,212,403 | 15 | Link | Link |
## Key Learnings

- **A Custom Loss is a Game-Changer:** The custom LAB-based Color Loss was the single most impactful change, directly addressing the core task of accurate color reproduction where other losses failed.
- **Augment for Generalization:** On-the-fly data augmentation is essential for building robust models that perform well on data outside their immediate training distribution.
- **Systematic Experimentation is Crucial:** The progression through the 8 models clearly shows how iterative improvements to the loss function and architecture lead to a superior final result.
- **Sophisticated Training Works:** The combination of a warmup-plus-cosine-decay scheduler, AdamW optimizer, and gradient clipping created a stable training environment that allowed models to converge effectively.
## How to Run

- Click the "Open in Colab" badge at the top of this README to launch the notebook in Google Colab.
- The notebook is self-contained. Run the cells in order from top to bottom.
- Dependencies will be installed, data will be downloaded, the model will be trained, and inference examples will be shown.