Natural language search engine for fashion images using CLIP, VLMs, and hybrid retrieval.
- 5 Search Approaches: Fine-tuned CLIP, Hybrid, VLM, plus enhanced versions of the Hybrid and VLM approaches
- Multi-attribute queries: Colors, clothing types, environments
- Sub-100ms search latency on warm queries in the enhanced approaches, via caching and two-stage retrieval
- Gradio web interface for easy interaction
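To illustrate what a multi-attribute query means in practice, here is a minimal sketch that splits a query into colors, clothing types, and environments. The vocabularies, the `ParsedQuery` type, and `parse_query` are hypothetical and exist only for this example; the project's own parsing logic may work quite differently.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabularies for illustration only; the project's real
# attribute lists are not shown in this README.
COLORS = {"red", "blue", "green", "black", "white", "crimson"}
GARMENTS = {"dress", "blazer", "jacket", "gown", "shirt", "outfit"}
ENVIRONMENTS = {"park", "office", "beach", "street"}


@dataclass
class ParsedQuery:
    colors: list = field(default_factory=list)
    garments: list = field(default_factory=list)
    environments: list = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Assign each lowercase word in the query to an attribute bucket."""
    parsed = ParsedQuery()
    for token in re.findall(r"[a-z]+", query.lower()):
        if token in COLORS:
            parsed.colors.append(token)
        elif token in GARMENTS:
            parsed.garments.append(token)
        elif token in ENVIRONMENTS:
            parsed.environments.append(token)
    return parsed


print(parse_query("red dress in park"))
```

Words outside the vocabularies (such as "in") are simply ignored in this sketch; a real system would likely hand the full query to the embedding model as well.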
pip install -r requirements.txt

# Approach 2 (Fine-tuned CLIP - Recommended)
cd approach2_finetune_clip
python index_images.py --image-dir ../fashion-data
# Approach 3 (Hybrid) - from project root
python index_images.py --image-dir ./fashion-data
# Approach 4 (VLM) - from project root
python approach4_vlm/index_images.py --image-dir ./fashion-data
# Enhanced Approach 3 - requires base A3 indexed first
python approach3_enhanced/build_indices.py
# Enhanced Approach 4 - requires base A4 indexed first
python enhanced_approach4/build_indices.py

# ═══════════════════════════════════════════════════════════
# Approach 2 (Fine-tuned CLIP)
# ═══════════════════════════════════════════════════════════
cd approach2_finetune_clip
python search.py "crimson red blazer"
python search.py --interactive
python search.py --run-eval
# ═══════════════════════════════════════════════════════════
# Approach 3 (Hybrid) - from project root
# ═══════════════════════════════════════════════════════════
python search.py "red dress in park"
python search.py --interactive
# ═══════════════════════════════════════════════════════════
# Approach 4 (VLM) - from project root
# ═══════════════════════════════════════════════════════════
python approach4_vlm/search.py "elegant evening gown"
# ═══════════════════════════════════════════════════════════
# Enhanced Approach 3 - from project root
# ═══════════════════════════════════════════════════════════
python approach3_enhanced/search_enhanced.py "blue jacket in office"
python approach3_enhanced/search_enhanced.py --benchmark
# ═══════════════════════════════════════════════════════════
# Enhanced Approach 4 - from project root
# ═══════════════════════════════════════════════════════════
python enhanced_approach4/search_enhanced.py "casual summer outfit"
python enhanced_approach4/search_enhanced.py --benchmark
# ═══════════════════════════════════════════════════════════
# Web Demo (All approaches)
# ═══════════════════════════════════════════════════════════
python demo.py

The system uses the fashion-data dataset from Hugging Face, containing 617 fashion images generated using Google Whisk.
| Approach | Description | Best For |
|---|---|---|
| Approach 2 | Fine-tuned CLIP ViT-L/14 | Fashion-specific queries |
| Approach 3 | CLIP + Color + Scene hybrid | Balanced accuracy |
| Approach 4 | InternVL3 VLM captions | Interpretable results |
| Enhanced A3 | + Hierarchical index + Cache | High-scale search |
| Enhanced A4 | + Keyword index + Cache | Fast VLM search |
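One plausible way to combine the CLIP, color, and scene signals of the hybrid approach is a weighted sum over per-image scores. The sketch below assumes that design; the `Candidate` type, the `WEIGHTS` values, and the function names are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    clip_sim: float     # similarity between query and image embeddings
    color_score: float  # 0..1, how well the image matches the query colors
    scene_score: float  # 0..1, how well the image matches the query scene


# Hypothetical weights chosen for illustration; they sum to 1.0.
WEIGHTS = {"clip": 0.6, "color": 0.25, "scene": 0.15}


def hybrid_score(c: Candidate) -> float:
    """Weighted sum of the three signals."""
    return (
        WEIGHTS["clip"] * c.clip_sim
        + WEIGHTS["color"] * c.color_score
        + WEIGHTS["scene"] * c.scene_score
    )


def rank(candidates, top_k=3):
    """Return the top_k candidates, highest hybrid score first."""
    return sorted(candidates, key=hybrid_score, reverse=True)[:top_k]


candidates = [
    Candidate("a.jpg", clip_sim=0.8, color_score=0.0, scene_score=0.0),
    Candidate("b.jpg", clip_sim=0.7, color_score=1.0, scene_score=1.0),
]
for c in rank(candidates):
    print(c.image_id, round(hybrid_score(c), 3))
```

In this example the image with the slightly lower CLIP similarity ranks first because it matches both the color and the scene, which is the kind of trade-off a hybrid score is meant to capture.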
Approach 2: Fine-tuned CLIP
Approach 3: Hybrid Search
Approach 4: VLM Caption-based Search
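Caption-based search lends itself to a keyword index, as mentioned for Enhanced A4. The following is a rough sketch of an inverted index over captions, assuming captions are available as plain strings keyed by image ID. The captions below are invented for the example (they are not InternVL3 output), and `build_index` and `search` are hypothetical names.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(captions):
    """Map each word to the set of image IDs whose caption contains it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for token in tokenize(caption):
            index[token].add(image_id)
    return index


def search(index, query, top_k=5):
    """Rank images by how many distinct query words their caption contains."""
    hits = Counter()
    for token in set(tokenize(query)):
        for image_id in index.get(token, ()):
            hits[image_id] += 1
    # Sort by descending hit count, then by image ID for a stable order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [image_id for image_id, _ in ranked[:top_k]]


# Invented captions, for illustration only.
captions = {
    "img1": "A woman in an elegant red evening gown",
    "img2": "A man in a blue jacket in an office",
    "img3": "Casual summer outfit on a beach",
}
index = build_index(captions)
print(search(index, "elegant evening gown"))
```

This sketch does no stop-word filtering or stemming, so common words like "in" also count as matches; a production index would likely filter them or combine keyword hits with embedding similarity.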
| Approach | Cold Query | Warm Query | Warm Speedup vs. Base Approach |
|---|---|---|---|
| Approach 2 | 50-100ms | 50ms | - |
| Approach 3 | 200-300ms | 200ms | - |
| Approach 4 | 800ms | 800ms | - |
| Enhanced A3 | 220ms | 20ms | 10x |
| Enhanced A4 | 80ms | 20ms | 40x |
├── approach2_finetune_clip/ # Fine-tuned CLIP
│ ├── best_model/ # Fine-tuned model weights
│ ├── index_images.py # Index command
│ └── search.py # Search command
├── approach3_enhanced/ # Enhanced hybrid search
│ ├── build_indices.py # Build enhanced indices
│ └── search_enhanced.py # Search command
├── approach4_vlm/ # VLM caption-based search
│ ├── index_images.py # Index command
│ └── search.py # Search command
├── enhanced_approach4/ # Enhanced VLM search
│ ├── build_indices.py # Build enhanced indices
│ └── search_enhanced.py # Search command
├── indexer/ # Feature extraction (A3)
├── retriever/ # Search engine (A3)
├── demo.py # Gradio web interface
├── index_images.py # Index command (A3)
└── search.py # Search command (A3)
- Models: CLIP ViT-L/14, InternVL3-1B, sentence-transformers
- Vector DB: ChromaDB
- Frameworks: PyTorch, Transformers, Gradio



