pose multimodal learning as a listwise ranking problem where all modalities are ranked together

potentially use late interaction colbert style for fine grained alignment

hard negative mining + synthetic data