Multimodal Learning to Rank as a replacement for CLIP

pose multimodal learning as a listwise ranking problem: for each query, candidates from all modalities are scored and ranked together in a single list
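
A minimal sketch of what a listwise objective over a mixed-modality candidate list could look like. It assumes a ListNet-style top-one cross-entropy, which is only one possible choice of listwise loss; the note does not commit to one. The function name, the candidate tuples, and the scores are all hypothetical and stand in for the outputs of some scoring model.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


def listwise_softmax_loss(scores, relevance):
    """ListNet-style top-one loss (an assumed choice, not the note's stated loss).

    Cross-entropy between the distribution induced by the relevance labels
    and the distribution induced by the model scores, over one candidate list.
    """
    target = [math.exp(lp) for lp in log_softmax(relevance)]
    predicted_log = log_softmax(scores)
    return -sum(t * lp for t, lp in zip(target, predicted_log))


# One query with a single candidate list that mixes modalities.
# Each entry is (modality, model score, relevance label); values are made up.
candidates = [
    ("image", 2.1, 1.0),
    ("text", 0.3, 0.0),
    ("audio", 1.5, 1.0),
    ("image", -0.7, 0.0),
]
list_scores = [score for _, score, _ in candidates]
list_labels = [label for _, _, label in candidates]

loss_good = listwise_softmax_loss(list_scores, list_labels)
# Same scores assigned in reverse order, so relevant items now score lowest.
loss_bad = listwise_softmax_loss(list_scores[::-1], list_labels)
```

The modality tag is not used by the loss itself; that is the point of ranking everything in one list. A ranking that puts the relevant image and audio items on top yields a lower loss than the reversed one.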

potentially use ColBERT-style late interaction for fine-grained alignment (token- or patch-level similarities instead of one pooled embedding per item)
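
For reference, a small sketch of ColBERT-style MaxSim scoring: each query vector is matched to its most similar candidate vector and the maxima are summed. Treating image patches as the "document tokens" is an assumption about how this would carry over to other modalities, and the toy 2-d vectors and the names below are hypothetical.

```python
import math


def normalize(vec):
    """L2-normalize a vector; assumes the vector is non-zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, candidate_vecs):
    """Late-interaction score: sum over query vectors of the max cosine
    similarity against any candidate vector."""
    queries = [normalize(v) for v in query_vecs]
    cands = [normalize(v) for v in candidate_vecs]
    total = 0.0
    for q in queries:
        total += max(sum(a * b for a, b in zip(q, c)) for c in cands)
    return total


# Hypothetical toy embeddings: two text tokens as the query,
# and two candidate images represented by patch embeddings.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
patches_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match per token
patches_b = [[1.0, 1.0]]                          # one coarse patch only

score_a = maxsim_score(text_tokens, patches_a)
score_b = maxsim_score(text_tokens, patches_b)
```

Because every query token picks its own best match, a candidate whose parts each align with some token scores higher than one that only matches on average, which is the fine-grained behaviour a single pooled embedding cannot express.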

hard negative mining + synthetic data
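
A rough sketch of the mining half, assuming the simplest scheme: take the highest-scoring candidates that are not labelled positive under the current model. The candidate ids, scores, and function name are hypothetical, and the synthetic-data half is not covered here.

```python
def mine_hard_negatives(candidate_scores, positive_ids, k):
    """Return ids of the k highest-scoring candidates that are not positives.

    candidate_scores: dict mapping candidate id -> current model score.
    positive_ids: set of ids known to be relevant for the query.
    """
    negatives = [
        (cid, score)
        for cid, score in candidate_scores.items()
        if cid not in positive_ids
    ]
    # Highest score first: these are the negatives the model confuses most.
    negatives.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in negatives[:k]]


# Made-up scores for one query over candidates from several modalities.
query_scores = {
    "img_1": 0.92,
    "img_2": 0.88,
    "txt_7": 0.85,
    "img_9": 0.40,
    "aud_3": 0.10,
}
query_positives = {"img_1"}

hard_negatives = mine_hard_negatives(query_scores, query_positives, k=2)
```

Mining across the whole mixed-modality pool means a hard negative can come from any modality, which fits ranking all modalities in one list. In practice, top-scoring unlabelled items can be false negatives, so some filtering is usually needed.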