A deep learning project for colorizing polygon shapes from text descriptions, implemented as a conditioned U-Net using the diffusers `UNet2DConditionModel` and a CLIP text encoder.
This project implements a polygon colorization system using:
- Diffusers U-Net: For generating colored polygon outputs
- CLIP Text Encoder: For processing color text descriptions
- WandB: For experiment tracking and visualization
- PyTorch: For the deep learning framework
The model takes grayscale polygon images and color text descriptions as input, and generates colored polygon outputs.
```
polygon-colorizer/
├── main.py              # Main training script
├── README.md            # Readme file
└── dataset/             # Dataset directory
    ├── training/
    │   ├── inputs/      # Grayscale polygon images
    │   ├── outputs/     # Colored polygon images
    │   └── data.json    # Metadata with input/output mappings
    └── validation/
        ├── inputs/
        ├── outputs/
        └── data.json
```
Clone the repo (for running locally):

```bash
git clone https://github.com/sprnjt/polygon-colorizer.git
cd polygon-colorizer
```

Install the required dependencies:

```bash
pip install wandb diffusers transformers accelerate torch torchvision
```

For Kaggle Notebooks:
- Click "Add-ons" → "Secrets" → "Add a new secret"
- Label: `wandb_api_key`
- Value: Your actual WandB API key
For Local Development:
- Set your WandB API key as an environment variable or pass it as a command line argument
- Download the Ayna dataset from the provided source
- The dataset contains two main folders: `training` and `validation`
- Each folder has:
  - `inputs/`: Grayscale polygon shapes
  - `outputs/`: Completed colored polygons
  - `data.json`: Metadata with fields like `input_polygon`, `colour`, and `output_image`
- Download the dataset locally
- Organize it according to the structure shown above
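Once the dataset is organized as above, each split can be indexed from its `data.json`. A minimal sketch (assuming the layout shown; `load_metadata` is a hypothetical helper, not a function from `main.py`):

```python
import json
from pathlib import Path

def load_metadata(split_dir):
    """Read data.json for one split and resolve the image paths it references.

    Assumes the layout shown above: <split>/inputs/, <split>/outputs/, and a
    data.json listing input_polygon / colour / output_image entries.
    """
    split_dir = Path(split_dir)
    with open(split_dir / "data.json") as f:
        records = json.load(f)

    pairs = []
    for rec in records:
        pairs.append({
            "input": split_dir / "inputs" / Path(rec["input_polygon"]).name,
            "colour": rec["colour"],
            "output": split_dir / "outputs" / Path(rec["output_image"]).name,
        })
    return pairs
```

A PyTorch `Dataset` wrapping this list would then open each image pair and tokenize the colour string.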
Run the main training script:
```bash
python main.py --data_dir /path/to/your/dataset --output_dir ./models
```

Command-line arguments:
- `--data_dir`: Path to the root dataset directory (required)
- `--output_dir`: Directory to save the best model (default: `./best_unet_model`)
- `--learning_rate`: Optimizer learning rate (default: `1e-4`)
- `--batch_size`: Batch size for training (default: `8`)
- `--num_epochs`: Total number of training epochs (default: `130`)
- `--image_size`: Image resolution (default: `128`)
- `--device`: Device to use (default: `cuda`)
- `--text_encoder_model`: CLIP model ID (default: `openai/clip-vit-base-patch32`)
- `--wandb_project`: WandB project name (default: `polygon-colorizer-diffusers`)
- `--wandb_entity`: WandB entity/username
- `--wandb_api_key`: Your WandB API key
```bash
python main.py \
    --data_dir ./data \
    --output_dir ./models \
    --batch_size 16 \
    --num_epochs 100 \
    --learning_rate 5e-5 \
    --wandb_project "my-polygon-colorizer" \
    --wandb_api_key "your_api_key_here"
```

The model consists of:
- CLIP Text Encoder: Processes color text descriptions
- U-Net with Cross-Attention: Generates colored polygon outputs
- Conditional Generation: Uses text embeddings to condition the image generation
Download pre-trained model weights from: Google Drive
Model on Kaggle: Kaggle
Model on HF: Hugging Face
For complete training and inference examples, refer to the Kaggle notebook: Kaggle Notebook
Track training progress and view experiment details: Wandb Report Link
For detailed project documentation and analysis: Docs
The data.json file in each split folder contains metadata with the following structure:
```json
[
  {
    "input_polygon": "path/to/input/image.png",
    "colour": "red",
    "output_image": "path/to/output/image.png"
  }
]
```

- Data Loading: Loads polygon images and color text descriptions
- Text Encoding: Uses CLIP tokenizer and encoder to process color names
- Image Processing: Converts images to tensors and normalizes them
- Model Training: Trains the U-Net with cross-attention to the text embeddings
- Validation: Evaluates model performance and logs sample predictions
- Model Saving: Saves the best model based on validation loss
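The training step in this pipeline can be sketched as below. Whether `main.py` regresses the colored image directly or predicts a diffusion noise target is not specified here; this sketch assumes direct image regression with an MSE loss, and `training_step` is a hypothetical helper:

```python
import torch
import torch.nn.functional as F

def training_step(unet, text_encoder, batch, optimizer):
    """One optimisation step; `batch` holds grayscale images, token ids,
    and ground-truth colored targets (sketch, assuming direct regression)."""
    # Encode the colour description; embeddings condition the U-Net.
    text_emb = text_encoder(batch["input_ids"]).last_hidden_state
    # Predict the colored polygon from the grayscale input.
    pred = unet(batch["gray"], timestep=0,
                encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, batch["target"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Validation would run the same forward pass without the backward step, and the checkpoint with the lowest validation loss is the one saved.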
The training process is monitored through WandB, which tracks:
- Training and validation losses
- Sample predictions with input, ground truth, and generated outputs
- Model hyperparameters and configuration
- Python 3.7+
- PyTorch 1.8+
- CUDA-compatible GPU (recommended)
- WandB account for experiment tracking