Abstract
Goku, a state-of-the-art family of joint image-and-video generation models using rectified flow Transformers, sets new benchmarks in text-to-image and text-to-video tasks.
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
Community
Holy
We made a deep dive video for this paper: https://www.youtube.com/watch?v=mwXIWcOXu8g.
"Kamehameha! Transform text into video—just like that!"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models (2025)
- Generative Video Propagation (2024)
- Open-Sora: Democratizing Efficient Video Production for All (2024)
- Pushing the Boundaries of State Space Models for Image and Video Generation (2025)
- Efficient Scaling of Diffusion Transformers for Text-to-Image Generation (2024)
- SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner (2024)
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations (2025)
Weights wen? 👀
Mahabharat war
LFG!

[Task Type] Long-form video creation
[Aspect Ratio] 16:9
[Duration] 15 seconds
[Core Requirement]
Vehicle must strictly match a black Subaru Impreza WRX 2003 (GD chassis) from reference: exact body geometry, headlights, hood scoop, wheels, badges, paint finish, no alterations. No design changes, no “similar” car replacements. Authentic JDM styling only.
[Visual Quality]
4K Ultra HD, hyper-realistic, cinematic, sharp clean cuts, high dynamic range, physically accurate lighting, no stylization filters, no overexposure.
[Style]
Hardcore, aggressive, mechanical, raw street performance energy. Cold metallic tones with subtle warm highlights.
⸻
Shot Breakdown
0–2s
Black screen. Engine ignition builds.
Hard cut → extreme close-up: WRX headlights ignite sequentially (real halogen glow, slight flicker).
Hard cut → ultra-low angle: exhaust tip pulses heat distortion, subtle condensation in cold air.
Audio: ignition click → deep boxer rumble → soft exhaust pop.
⸻
2–4s
Rapid montage (fast cuts, rising tempo):
• Carbon fiber hood scoop detail with light reflections
• Morning dew droplets sliding across black paint (clearcoat reflections visible)
• Subaru badge micro shot (authentic logo, no distortion)
• Brake caliper + spinning wheel (accurate rim design)
• Rear wing slicing through thin fog
• Tire tread macro (dirt particles, rubber texture)
Audio: metallic hits, ticking rhythm, tightening tension.
⸻
4–5s
Wide static shot: car motionless on empty foggy bridge (inspired by Moscow atmosphere).
Dead silence.
Rear wheels suddenly spin → dense white smoke explosion.
Audio: silence → aggressive tire screech.
⸻
5–9s
Full acceleration sequence:
• Launch start (rear squat, front lift, suspension compression visible)
• Drone dive through fog toward car
• Road-level tracking shot (wet asphalt reflections)
• Interior POV: hands gripping steering wheel, gear shift movement
• Head-on rush toward camera (natural motion blur, no distortion)
Audio: boxer engine roar + gear shifts + wind pressure + percussive electronic beat.
⸻
9–12s
City run sequence:
• Side tracking shot at speed
• Wheel rotation close-up (brake heat shimmer)
• Ultra-low upward angle emphasizing aggression
• Wide aerial of city roads cutting through urban grid
Audio: engine + wind + distant city ambience.
⸻
12–15s
Final sequence:
• Slow-motion side drift
• Aerial wide shot gradually slowing
• Freeze frame: car stopped on the overpass, nose pointed toward the sunrise
• Skyline in golden light reflecting on black paint
Audio: engine fades to idle → metallic freeze impact → wind + silence.
⸻
[Tags]
8k, photorealistic, raw footage, HDR, cinematic lighting, Fujifilm XT4 style, physically accurate reflections, no CGI look, no stylization