Hi SuperCLIP team,
Thank you for the insightful work and for the additional comparisons in the NeurIPS 2025 rebuttal. I’m particularly interested in the table comparing SuperCLIP with SLIP, MaskCLIP, A-CLIP, and DetailCLIP under the DetailCLIP-style pretraining setup (15M samples, batch size 4K, 25 epochs).
To better understand and potentially reproduce these results, I have a few questions about the experimental details behind this table:
Training data
Is the 15M-sample pretraining set YFCC15M, or a different subset or filtering of YFCC100M?
Were the same images and captions used across all compared methods?
Optimization details
What learning rate (and schedule) was used for SuperCLIP in this setting (1e-3 or 3e-3)?
Were the other hyperparameters (optimizer type, weight decay, warmup strategy) aligned with DetailCLIP’s official setup? For reference, the configuration I am currently assuming is sketched below.
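For concreteness, here is a minimal sketch of the optimizer and schedule I am assuming for this setting: AdamW with linear warmup followed by cosine decay, which is the common CLIP-style recipe. The betas, eps, weight decay, and warmup length below are my guesses (as is reading "4K" as 4096), not values from the paper or rebuttal; please correct anything that differs from your actual runs.

```python
import math
import torch

# Toy stand-in for the actual SuperCLIP model (illustration only).
model = torch.nn.Linear(512, 512)

# AdamW with CLIP-style betas/eps -- these specific values are my
# assumptions, not numbers from the paper or rebuttal.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # or 3e-3 -- exactly the value I am asking about
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.2,
)

# 25 epochs over 15M samples, assuming batch size 4K means 4096.
total_steps = 25 * (15_000_000 // 4096)
warmup_steps = 2_000  # hypothetical warmup length

def lr_lambda(step: int) -> float:
    """Linear warmup then cosine decay -- the schedule I am assuming."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```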
These details would be very helpful for interpreting the reported gains fairly and for enabling future comparisons. Thanks again for the clear rebuttal and strong results.
Best regards,