Multimodal Learning to Rank as a replacement for CLIP
Pose multimodal learning as a listwise ranking problem in which candidates from all modalities are ranked together in one list.
Potentially use ColBERT-style late interaction for fine-grained alignment.
Train with hard negative mining + synthetic data.
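
Rough sketch of how the pieces above could fit together, assuming every modality is already encoded into token-level embeddings in one shared space (the encoders themselves are not shown). The function names, the temperature value, and the toy data are all illustrative assumptions, not a settled design. Scoring is a ColBERT-style MaxSim, the loss is a softmax cross-entropy over the whole candidate list (in the spirit of ListNet), and hard negatives are simply the highest-scoring non-relevant candidates.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def maxsim_score(query_tokens, cand_tokens):
    """Late interaction: each query token takes its best-matching candidate
    token (max cosine similarity), and the per-token maxima are summed."""
    q = l2_normalize(np.asarray(query_tokens, dtype=np.float64))
    c = l2_normalize(np.asarray(cand_tokens, dtype=np.float64))
    sim = q @ c.T  # shape: (n_query_tokens, n_candidate_tokens)
    return float(sim.max(axis=1).sum())


def listwise_softmax_loss(scores, relevance, temperature=0.1):
    """Cross-entropy between the normalized relevance labels and the softmax
    of the scores over the whole list. Requires at least one relevant item."""
    s = np.asarray(scores, dtype=np.float64) / temperature
    s = s - s.max()  # subtract the max for numerical stability
    log_probs = s - np.log(np.exp(s).sum())
    target = np.asarray(relevance, dtype=np.float64)
    target = target / target.sum()
    return float(-(target * log_probs).sum())


def mine_hard_negatives(scores, relevance, k):
    """Indices of the k non-relevant candidates the model currently scores highest."""
    scores = np.asarray(scores, dtype=np.float64)
    neg_idx = np.flatnonzero(np.asarray(relevance) == 0)
    order = np.argsort(-scores[neg_idx])
    return neg_idx[order[:k]]


# Toy data (assumption): one text query and a mixed list of candidates that
# could come from any modality, e.g. image patches, audio frames, text tokens.
rng = np.random.default_rng(0)
dim = 8
query = rng.normal(size=(4, dim))
candidates = [
    query + 0.1 * rng.normal(size=(4, dim)),  # relevant item, near the query
    rng.normal(size=(6, dim)),                # unrelated item
    rng.normal(size=(5, dim)),                # unrelated item
    rng.normal(size=(3, dim)),                # unrelated item
]
relevance = [1, 0, 0, 0]

scores = [maxsim_score(query, c) for c in candidates]
loss = listwise_softmax_loss(scores, relevance)
hard = mine_hard_negatives(scores, relevance, k=2)
print("scores:", scores)
print("loss:", loss)
print("hard negative indices:", hard.tolist())
```

With exactly one relevant item per list, this loss has the same form as one direction of CLIP's contrastive objective; the listwise framing differs in allowing graded relevance and candidates from several modalities in the same list. Synthetic data generation is not covered by this sketch.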