BeautyScoreModel is a deep learning model designed to predict beauty scores (as integers from 1 to 9) for individuals based on up to 9 photos and corresponding face crops. The model treats the problem as a multi-class classification task, outputting probabilities over 9 classes. It is built using PyTorch and leverages pre-trained MobileNetV3 for feature extraction, transformer encoders for aggregating variable-length inputs, and an MLP head for final prediction.
Key features:
- Input Handling: Supports variable numbers of photos and faces (up to 9 each), with padding and masks to handle fewer inputs.
- Permutation Invariance: The order of photos/faces does not affect the output, making it robust to shuffling (a "hidden gem" for set-based inputs like photo collections).
- Efficiency: Uses lightweight components for faster training/inference.
- Adaptability: Fine-tuned for beauty scoring, with options for handling class imbalance and further fine-tuning on expanded datasets.
This repository includes the model implementation, preprocessing utilities, training/fine-tuning scripts, and inference examples.
The model processes photos and faces separately before fusing their representations. Here's a high-level breakdown:
- **Feature Extraction (MobileNetV3):**
  - A pre-trained MobileNetV3 (features only, no classifier) extracts spatial features from each image.
  - Input: RGB images resized to 224x224.
  - Output per image: a feature map `[960, 7, 7]`, pooled to `[960]` via adaptive average pooling.
  - Technical Choice: MobileNetV3 is lightweight (efficient depthwise-separable convolutions) and pre-trained on ImageNet, providing strong generic features.
  - Peculiarity: the extractor is shared between photos and faces, but separate downstream processing allows specialization.
- **Projection Layers:**
  - Linear layers map the 960-D features to an embedding dimension (default: 512).
  - Technical Choice: Dimensionality reduction improves efficiency (per-token transformer cost grows with embedding dimension) and acts as a learnable adapter for task-specific features (e.g., beauty-related traits vs. ImageNet objects).
  - Why Not Skip?: Direct 960-D input would increase compute; the projection creates a bottleneck that aids generalization.
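The projection itself is just a linear layer per branch; a minimal sketch (one projection per branch is an assumption inferred from the text, and the names are hypothetical):

```python
import torch
import torch.nn as nn

# Learnable adapter from pooled backbone features to the transformer width.
proj_photos = nn.Linear(960, 512)
feats = torch.randn(2, 9, 960)   # [B, 9, 960] pooled MobileNetV3 features
emb = proj_photos(feats)         # [B, 9, 512] transformer-ready embeddings
```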
- **Transformer Encoders:**
  - Two separate encoders (one for photos, one for faces) aggregate the embeddings.
  - Input: CLS token + sequence of embeddings `[batch, 10, 512]` (CLS + up to 9 photos/faces).
  - Uses self-attention (4 layers, 8 heads) with padding masks to ignore invalid inputs.
  - Output: aggregated vector from the encoded CLS token `[batch, 512]`.
  - Technical Choice: Transformers treat inputs as sets, enabling context-aware aggregation. Omitting positional encodings ensures permutation invariance, a key "hidden gem": shuffling photos (with their masks) yields identical outputs.
  - Peculiarity: Masks handle variable lengths (e.g., 3 photos + 6 faces); attention ignores padding, focusing on relevant data.
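The CLS-plus-mask aggregation can be sketched with PyTorch's built-in encoder (layer sizes taken from the text; variable names are illustrative, and in a real model the CLS token would be a learnable `nn.Parameter`):

```python
import torch
import torch.nn as nn

B, N, D = 2, 9, 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4,
).eval()
cls_token = torch.zeros(1, 1, D)  # learnable nn.Parameter in a real model

emb = torch.randn(B, N, D)                        # projected photo/face embeddings
valid = torch.tensor([[True] * 3 + [False] * 6,   # sample 0: 3 valid photos
                      [True] * 9])                # sample 1: all 9 valid
x = torch.cat([cls_token.expand(B, -1, -1), emb], dim=1)          # [B, 10, D]
keep = torch.cat([torch.ones(B, 1, dtype=torch.bool), valid], 1)  # CLS always kept
with torch.no_grad():
    out = encoder(x, src_key_padding_mask=~keep)  # True = position ignored
agg = out[:, 0]                                   # encoded CLS token, [B, 512]
```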
- **MLP Head:**
  - Concatenates the photo and face vectors into `[batch, 1024]` and maps them to 9-class logits via a 2-layer MLP (1024 → 512 → 9).
  - Technical Choice: Simple non-linear fusion; dropout (0.2) for regularization.
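A sketch of such a fusion head matching the sizes above (the exact layer ordering and activation are assumptions):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(1024, 512),  # concat of photo + face vectors -> hidden
    nn.ReLU(),
    nn.Dropout(0.2),       # regularization, as noted above
    nn.Linear(512, 9),     # 9-class logits (scores 1-9)
)
photo_vec, face_vec = torch.randn(2, 512), torch.randn(2, 512)
logits = head(torch.cat([photo_vec, face_vec], dim=1))  # [2, 9]
```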
Data Flow Summary:
- Inputs: `photos_tensor [B, 9, 3, 224, 224]`, `photos_mask [B, 9]`, `faces_tensor [B, 9, 3, 224, 224]`, `faces_mask [B, 9]`.
- Flatten & Extract: per-image features via MobileNetV3 + pooling → `[B, 9, 960]`.
- Project: → `[B, 9, 512]`.
- Aggregate: prepend CLS, encode with transformer (masked) → `[B, 512]` per branch.
- Fuse & Predict: concat → MLP → `[B, 9]` logits.
The model is permutation-invariant because the transformers use attention without positional encodings (no order bias) and the masks ensure that only valid inputs contribute.
Peculiarities and Hidden Gems
- Variable Inputs: Up to 9 photos/faces; fewer are padded with zeros and masked out (`False` in the mask). This allows flexibility without fixed-size assumptions.
- Permutation Invariance: No positional encodings in the transformers, so photos can be shuffled without changing outputs. Tested via randomization: outputs match in eval mode (dropout disabled).
- Separate Photo/Face Processing: Enables different contributions (e.g., photos for composition, faces for details), even if counts differ.
- Classification Over Regression: Switched to classification (scores 1-9 → classes 0-8) to better handle discrete scores; uses cross-entropy loss.
- Class Imbalance Handling: Computes inverse-frequency class weights, capped to avoid instability.
- Fine-Tuning Strategy: Unfreeze the later MobileNetV3 layers; use AdamW, a cosine-annealing scheduler, label smoothing, mixed precision, and early stopping for strong performance on expanded datasets.
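The randomization test behind the permutation-invariance claim can be reproduced on a toy encoder (not the actual BeautyScoreModel): with no positional encodings and eval mode, shuffling the non-CLS tokens leaves the CLS output unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=2,
).eval()  # eval mode disables dropout, so outputs are deterministic

cls = torch.zeros(1, 1, 32)
items = torch.randn(1, 5, 32)
perm = torch.randperm(5)

with torch.no_grad():
    out_a = enc(torch.cat([cls, items], dim=1))[:, 0]
    out_b = enc(torch.cat([cls, items[:, perm]], dim=1))[:, 0]

assert torch.allclose(out_a, out_b, atol=1e-5)  # order does not matter
```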
- Clone the repository:

  ```shell
  git clone https://github.com/alessiosavi/BeautyScorer.git
  cd BeautyScorer
  ```

- Install dependencies:

  ```shell
  pip install torch torchvision pandas tqdm scikit-learn
  ```
Use `load_data` to load and normalize images.

```python
CONF = utils.load_conf("conf.yaml")[0]
model = BeautyScoreModel(conf=CONF)
model.load_state_dict(
    torch.load("model_v8_classes_big_finetuned_state_dict.pt", weights_only=True)
)
score, probability = score_person(person_id, model)
print(
    f"Person: {person_id} -> Score: {score} | Prob: {probability} | RealScore: {batch['scores'][idx]}"
)
utils.show_person(person_id, 1)
```

See the `train()` function for the full script. Example:
```python
CONF = utils.load_conf("conf.yaml")[0]
new_data = [{"id": "person1", "score": 5, "photos": glob("path/to/person1/*")}, ...]
raw_ds = dataset.BeautyDataset(new_data)
raw_dl = DataLoader(
    raw_ds,
    batch_size=BATCH_SIZE,
    shuffle=True,
    pin_memory=True,
    collate_fn=raw_ds.collate_fn,
)
model = BeautyScoreModel(conf=CONF)
model_params = [p for p in model.parameters() if p.requires_grad]
class_weights = utils.compute_class_weights(df)
criterion = nn.CrossEntropyLoss(weight=class_weights.to(device), label_smoothing=0.1)
optimizer = optim.AdamW(model_params, lr=lr, weight_decay=1e-2)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
train(model, raw_dl)
```

- Dataset: a list of dicts with ID, score (1-9), and photo paths.
- Loss: CrossEntropyLoss with weights and label smoothing (0.1).
- Optimizer: AdamW (lr=1e-5, weight_decay=0.01).
- Scheduler: CosineAnnealingLR.
- Other: Mixed precision (AMP), early stopping (patience=3).
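The capped inverse-frequency weighting could look like the following sketch (the exact formula inside `utils.compute_class_weights` may differ; the function name here is hypothetical):

```python
import torch
from collections import Counter

def capped_inverse_freq_weights(labels, num_classes=9, cap=10.0):
    """Inverse-frequency class weights, capped to avoid instability."""
    counts = Counter(labels)
    freqs = torch.tensor(
        [counts.get(c, 0) for c in range(num_classes)], dtype=torch.float
    )
    weights = freqs.sum() / (num_classes * freqs.clamp(min=1))
    return weights.clamp(max=cap)  # cap dampens weights for very rare classes

labels = [4, 4, 4, 5, 5, 3, 8]  # classes 0-8 (scores 1-9 shifted down by 1)
w = capped_inverse_freq_weights(labels)
# rarer classes get larger weights, e.g. w[3] > w[4] here
```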
- Built with PyTorch and torchvision.
- Inspired by transformer-based set aggregation (e.g., BERT-like CLS token).
For issues, open a GitHub issue. Contributions welcome!