arxiv:2502.04896

Goku: Flow Based Video Generative Foundation Models

Published on Feb 7, 2025

Submitted by AK on Feb 10, 2025

#2 Paper of the day

Authors: Shoufa Chen, Chongjian Ge, Fengda Zhu, Hao Yang, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Chuan Li, Peize Sun, Yi Jiang, Zehuan Yuan, Xiaobing Liu

Abstract

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.

AI-generated summary

Goku, a state-of-the-art family of joint image-and-video generation models using rectified flow Transformers, sets new benchmarks in text-to-image and text-to-video tasks.

Community

akhaliq

Paper submitter Feb 10, 2025

https://saiyan-world.github.io/goku/

ShoufaChen

Paper author Feb 10, 2025

mrfakename

Feb 10, 2025

Very cool! Any plans to open source?

surajssc1232

Feb 10, 2025

Holy

ribbitribbit365

Feb 10, 2025

We made a deep dive video for this paper: https://www.youtube.com/watch?v=mwXIWcOXu8g.
"Kamehameha! Transform text into video—just like that!"

librarian-bot

Feb 11, 2025

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Reza2kn

Feb 11, 2025

Weights wen? 👀

redfernstech

Feb 13, 2025

Mahabharat war

febryards

Feb 13, 2025

LFG!

tevg

Feb 18, 2025

edited Feb 18, 2025

Suba355

2 days ago

[Task Type] Long-form video creation
[Aspect Ratio] 16:9
[Duration] 15 seconds

[Core Requirement]
Vehicle must strictly match a black Subaru Impreza WRX 2003 (GD chassis) from reference: exact body geometry, headlights, hood scoop, wheels, badges, paint finish, no alterations. No design changes, no “similar” car replacements. Authentic JDM styling only.

[Visual Quality]
4K Ultra HD, hyper-realistic, cinematic, sharp clean cuts, high dynamic range, physically accurate lighting, no stylization filters, no overexposure.

[Style]
Hardcore, aggressive, mechanical, raw street performance energy. Cold metallic tones with subtle warm highlights.

⸻

Shot Breakdown

0–2s
Black screen. Engine ignition builds.
Hard cut → extreme close-up: WRX headlights ignite sequentially (real halogen glow, slight flicker).
Hard cut → ultra-low angle: exhaust tip pulses heat distortion, subtle condensation in cold air.
Audio: ignition click → deep boxer rumble → soft exhaust pop.

⸻

2–4s
Rapid montage (fast cuts, rising tempo):
• Carbon fiber hood scoop detail with light reflections
• Morning dew droplets sliding across black paint (clearcoat reflections visible)
• Subaru badge micro shot (authentic logo, no distortion)
• Brake caliper + spinning wheel (accurate rim design)
• Rear wing slicing through thin fog
• Tire tread macro (dirt particles, rubber texture)
Audio: metallic hits, ticking rhythm, tightening tension.

⸻

4–5s
Wide static shot: car motionless on empty foggy bridge (inspired by Moscow atmosphere).
Dead silence.
Rear wheels suddenly spin → dense white smoke explosion.
Audio: silence → aggressive tire screech.

⸻

5–9s
Full acceleration sequence:
• Launch start (rear squat, front lift, suspension compression visible)
• Drone dive through fog toward car
• Road-level tracking shot (wet asphalt reflections)
• Interior POV: hands gripping steering wheel, gear shift movement
• Head-on rush toward camera (natural motion blur, no distortion)
Audio: boxer engine roar + gear shifts + wind pressure + percussive electronic beat.

⸻

9–12s
City run sequence:
• Side tracking shot at speed
• Wheel rotation close-up (brake heat shimmer)
• Ultra-low upward angle emphasizing aggression
• Wide aerial of city roads cutting through urban grid
Audio: engine + wind + distant city ambience.

⸻

12–15s
Final sequence:
• Slow-motion side drift
• Aerial wide shot gradually slowing
• Freeze frame: car stopped on the overpass, nose pointed toward the sunrise
• Skyline in golden light reflecting on black paint
Audio: engine fades to idle → metallic freeze impact → wind + silence.

⸻

[Tags]
8k, photorealistic, raw footage, HDR, cinematic lighting, Fujifilm X-T4 style, physically accurate reflections, no CGI look, no stylization