4 changes: 3 additions & 1 deletion clip_benchmark/models/__init__.py
@@ -2,11 +2,13 @@
 import torch
 from .open_clip import load_open_clip
 from .japanese_clip import load_japanese_clip
+from .transformers_clip import load_transformers_clip
 
 # loading function must return (model, transform, tokenizer)
 TYPE2FUNC = {
     "open_clip": load_open_clip,
-    "ja_clip": load_japanese_clip
+    "ja_clip": load_japanese_clip,
+    "transformers": load_transformers_clip,
 }
 MODEL_TYPES = list(TYPE2FUNC.keys())

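The registry in `__init__.py` dispatches on a model-type string, and every loader must honor the same `(model, transform, tokenizer)` return contract noted in the comment. A minimal self-contained sketch of that pattern, using stand-in loaders (the real ones require `open_clip` / `transformers` to be installed):

```python
# Sketch of the loader registry: stand-in loaders that mimic the required
# (model, transform, tokenizer) return contract from the diff above.
def load_open_clip():
    return "model", "transform", "tokenizer"

def load_transformers_clip():
    return "model", "transform", "tokenizer"

# loading function must return (model, transform, tokenizer)
TYPE2FUNC = {
    "open_clip": load_open_clip,
    "transformers": load_transformers_clip,
}
MODEL_TYPES = list(TYPE2FUNC.keys())

# dispatch by model type string
model, transform, tokenizer = TYPE2FUNC["transformers"]()
```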
24 changes: 24 additions & 0 deletions clip_benchmark/models/transformers_clip.py
@@ -0,0 +1,24 @@
from torch import nn
from transformers import AutoModel, AutoProcessor
from functools import partial

class TransformerWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def encode_text(self, text):
        return self.model.get_text_features(**text)

    def encode_image(self, image):
        return self.model.get_image_features(image["pixel_values"].squeeze(1))
Reviewer:
Out of interest, why do we need to remove the 1st dimension here?

Author:
I wanted to brainstorm on this bit, thanks for asking. The load function should return a model, a transform, and a tokenizer; the one I have written returns the model, the image processor, and the tokenizer.

The transform is used in the collation function while building the dataloader, and that is where the extra dimension gets added to the tensors: I noticed that images["pixel_values"].shape == (b, 1, c, h, w).

Is there a way I could extract the transform function from an image_processor?
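A quick way to see why the extra dimension appears (a self-contained sketch using random tensors, no transformers needed): HF image processors return a batched tensor even for a single image, so each dataset item has shape (1, c, h, w), and the default collate then stacks items into (b, 1, c, h, w), hence the squeeze(1).

```python
import torch

# Each processed image comes back batched from the processor: (1, c, h, w).
item = torch.randn(1, 3, 224, 224)

# Default collation stacks items along a new leading batch dim,
# producing (b, 1, c, h, w) instead of (b, c, h, w).
batch = torch.stack([item, item])
assert batch.shape == (2, 1, 3, 224, 224)

# squeeze(1) drops the singleton dim so the model sees (b, c, h, w).
assert batch.squeeze(1).shape == (2, 3, 224, 224)
```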


def load_transformers_clip(model_name, pretrained, cache_dir, device):
    ckpt = f"{model_name}/{pretrained}"
Comment on lines +18 to +19
Reviewer:
ckpt = f"{model_name}/{pretrained}" may be confusing; it's better to provide model_name as the checkpoint on the Hub and hardcode pretrained to True, IMO. Otherwise it ends up like:

model_name = "openai"
pretrained = "clip-..."
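For concreteness, this is the split under discussion: the Hub checkpoint id gets divided across the two existing CLI arguments (values below are hypothetical, picked only to illustrate):

```python
# Hypothetical values: the Hub checkpoint id is split across the two
# CLI arguments and rejoined inside the load function.
model_name = "openai"
pretrained = "clip-vit-base-patch32"
ckpt = f"{model_name}/{pretrained}"
assert ckpt == "openai/clip-vit-base-patch32"
```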

Author:

I chose this option for better verbosity:

print(f"Running '{task}' on '{dataset_name}' with the model '{args.pretrained}' on language '{args.language}'")

    model = AutoModel.from_pretrained(ckpt, cache_dir=cache_dir, device_map=device)
    model = TransformerWrapper(model)
    processor = AutoProcessor.from_pretrained(ckpt)

    transforms = partial(processor.image_processor, return_tensors="pt")
    tokenizer = partial(processor.tokenizer, return_tensors="pt", padding="max_length")
Reviewer:
AFAIR it might be good to be able to pass additional args to the tokenizer; e.g. for SigLIP we should specify padding="max_length", max_length=64.
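If the load function could receive such kwargs, functools.partial would bind them the same way the diff already binds padding="max_length". A sketch with a stand-in tokenizer (fake_tokenizer is hypothetical and just echoes its kwargs; the real callable would be processor.tokenizer):

```python
from functools import partial

# Stand-in for processor.tokenizer: echoes the kwargs it was called with.
def fake_tokenizer(texts, return_tensors="pt", padding=True, max_length=None):
    return {"padding": padding, "max_length": max_length}

# SigLIP-style binding suggested in the review.
tokenizer = partial(fake_tokenizer, return_tensors="pt",
                    padding="max_length", max_length=64)
out = tokenizer(["a photo of a cat"])
assert out == {"padding": "max_length", "max_length": 64}
```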

Author:
I would have loved to pass extra parameters too, but there is no way to pass them to the load function; the only way we pass parameters is through the arguments provided to the script.

    return model, transforms, tokenizer