MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset Paper • 2605.21272 • Published 16 days ago • 3
MONET - Massive Open Non-redundant, Enriched, Text-to-image Collection A curated, deduplicated and recaptioned open image–text dataset of 104.9M samples released under the Apache 2.0 licence. https://huggingface.co/blog/jasperai/ • 4 items • Updated 7 days ago • 10
Jagle Collection Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models • 5 items • Updated Apr 12 • 2
MobileCLIP2 Collection MobileCLIP2: Mobile-friendly image-text models with SOTA zero-shot capabilities trained on DFNDR-2B • 30 items • Updated Apr 23 • 62
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation Paper • 2605.08029 • Published 28 days ago • 12
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation Paper • 2605.06376 • Published 29 days ago • 26
SenseNova-U1 Collection SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-Unify Architecture • 9 items • Updated 8 days ago • 69
GenLIP Collection Model weights for the paper "Let ViT Speak: Generative Language-Image Pre-training" • 6 items • Updated about 1 month ago • 6
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation Paper • 2604.24764 • Published Apr 27 • 118
AVControl: Efficient Framework for Training Audio-Visual Controls Paper • 2603.24793 • Published Mar 25 • 28