video-generation

- 30B Transformer for text-to-image and text-to-video, trained on O(100M) videos and O(1B) images, tuned with Supervised Fine-Tuning
- 13B video-to-audio and text-to-audio model, trained on O(1M) hours
- Flow Matching [ ]
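
The notes name Flow Matching without saying which variant is used. As a rough illustration only, the sketch below shows one common formulation: a straight-line path from Gaussian noise `x0` to a data sample `x1`, with the model regressing the constant velocity `x1 - x0`. The path choice, the function names `sample_training_pair` and `flow_matching_loss`, and the `velocity_fn` callback are all assumptions made for this example and are not taken from the notes.

```python
import random

def sample_training_pair(x1, rng):
    """Build one training example for a linear-path flow matching objective.

    Assumption (not from the notes): the path is x_t = (1 - t) * x0 + t * x1
    with x0 drawn from a standard Gaussian and t drawn uniformly from [0, 1).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise endpoint
    t = rng.random()                                        # time in [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    target = [b - a for a, b in zip(x0, x1)]                # velocity along the path
    return x_t, t, target

def flow_matching_loss(velocity_fn, batch, rng):
    """Mean squared error between predicted and target velocities.

    velocity_fn is a hypothetical stand-in for the model: it takes (x_t, t)
    and returns a list of the same length as x_t.
    """
    total, count = 0.0, 0
    for x1 in batch:
        x_t, t, target = sample_training_pair(x1, rng)
        pred = velocity_fn(x_t, t)
        total += sum((p - v) ** 2 for p, v in zip(pred, target))
        count += len(x1)
    return total / count
```

In this formulation, moving from `x_t` along the target velocity for the remaining time `1 - t` lands exactly on `x1`, which is why a model that predicts the velocity well can be integrated from noise to a sample at inference time.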