2023-12-09 - SILC: Improving Vision Language Pretraining with Self-Distillation

CLIP + DINO
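
In rough terms, the objective combines CLIP-style image-text contrastive learning with DINO-style self-distillation from an EMA teacher; the weighting λ below is an assumed placeholder, not a value from the paper:

L_total = L_contrastive + λ · L_self-dist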

  1. Each global view used in L_self-dist needs to be aligned with text; otherwise the two objectives diverge. This is realized by computing the image-text contrastive loss for each global view while maintaining the same batch size (see the first sketch after this list).
  2. The EMA momentum schedule should not converge to 1.0; otherwise the teacher stops learning from the image-text loss because its update step becomes too small late in training. We therefore use a fixed momentum (see the EMA sketch below).
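
A minimal sketch of point 1 (PyTorch; the module interfaces `student`, `teacher`, `text_encoder` and the `lam` weighting are assumptions, and DINO's local crops are omitted for brevity): both global views go through the image-text contrastive loss against the same batch of B texts, and through the self-distillation loss against the EMA teacher's other view.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric InfoNCE over B paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dino_self_dist_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Cross-entropy between teacher and student prototype distributions."""
    teacher_probs = F.softmax(teacher_logits / t_t, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / t_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def silc_step(student, teacher, text_encoder, global_views, texts, lam=1.0):
    """global_views: two augmented crops of the same images, each (B, 3, H, W).
    Assumed interfaces: student(view) -> (image embedding, prototype logits),
    teacher(view) -> prototype logits. lam is an assumed loss weighting."""
    txt_emb = text_encoder(texts)                           # (B, D)

    with torch.no_grad():                                   # EMA teacher only provides targets
        teacher_logits = [teacher(v) for v in global_views]

    contrastive, self_dist = 0.0, 0.0
    for i, view in enumerate(global_views):
        img_emb, student_logits = student(view)
        # Each global view is contrasted against the same B texts,
        # so the contrastive batch size stays B rather than growing to 2B.
        contrastive = contrastive + clip_contrastive_loss(img_emb, txt_emb)
        # Distill from the teacher's *other* global view, DINO-style.
        self_dist = self_dist + dino_self_dist_loss(student_logits,
                                                    teacher_logits[1 - i])

    n = len(global_views)
    return contrastive / n + lam * self_dist / n
```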
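
And the EMA update from point 2, assuming a plain parameter-wise exponential moving average; the fixed momentum value is illustrative, not taken from the paper.

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """Blend a fraction of the student's weights into the EMA teacher each step."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# For contrast, DINO's cosine schedule (not used here) drives the momentum to 1.0,
#   m_t = 1 - (1 - m_base) * (cos(pi * t / T) + 1) / 2
# so the teacher's update step shrinks toward zero late in training and it stops
# absorbing what the student learns from the image-text loss.
```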

TODO

Tune from SigLIP

use tokens