Goku AI: Video Generator By ByteDance

What is Goku AI?

Goku AI is a family of models designed for joint image and video generation. It uses rectified flow Transformers to achieve high performance in visual generation tasks. Key features include a well-structured data curation process, efficient model architecture, and robust training infrastructure.

Goku supports text-to-image, text-to-video, and image-to-video generation. It scores 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video generation, which places it among the top-ranked text-to-video models on that benchmark.

Overview of Goku AI

Feature	Description
AI Tool	Goku AI - Flow Based Video Generative Foundation Models
Category	Visual Generation Framework
Function	Image and Video Generation
Generation Speed	Real-time Processing
Research Paper	arxiv.org/abs/2502.04896
Official Website	saiyan-world.github.io/goku/
GitHub Repository	github.com/Saiyan-World/goku
Benchmark Dataset	Hugging Face: Goku-MovieGenBench

Key Features of Goku AI

Joint Image and Video Generation
Handles both image and video tasks in a unified framework.
Rectified Flow Transformers
Enhances interaction between image and video tokens for improved output quality.
High-Quality Data Curation
Uses carefully prepared datasets to ensure detailed and accurate visual generation.
Efficient Training Infrastructure
Designed for scalable and robust large-scale model training.
Strong Benchmark Performance
Scores 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image, and 84.85 on VBench for text-to-video.
Versatile Task Support
Supports text-to-image, text-to-video, and image-to-video generation.
Research-Driven Design
Incorporates innovative techniques like rectified flow for better model performance.
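
To make the rectified flow idea above more concrete, here is a minimal sketch of the standard rectified-flow training target. It assumes the common textbook formulation (a straight line between a noise sample and a data sample, with the model regressing the constant velocity along that line). The article does not spell out Goku's exact formulation, so the function names and the toy "models" below are hypothetical and only illustrate the general technique.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity along that line, d/dt x_t = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def rf_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

# Toy usage: a 3-number vector stands in for an image or video latent.
rng = random.Random(0)
x1 = [1.0, -2.0, 0.5]                    # stand-in for a data sample
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # random time in [0, 1)

def oracle(x_t, t):
    # Hypothetical perfect model: returns the true velocity.
    return target_velocity(x0, x1)

def zero_model(x_t, t):
    # Hypothetical untrained model: always predicts zero velocity.
    return [0.0] * len(x_t)

print("oracle loss:", rf_loss(oracle, x0, x1, t))
print("zero-model loss:", rf_loss(zero_model, x0, x1, t))
```

In a real system the predictor would be a Transformer operating on image and video tokens, and the loss would be averaged over many samples and time steps.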

Examples of Goku AI

1. Stylish Woman in Tokyo

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

2. Thoughtful Man in Paris

An extreme close-up of a gray-haired man with a beard in his 60s, deep in thought pondering the history of the universe as he sits at a cafe in Paris. His eyes focus on people offscreen as they walk, while he sits mostly motionless. He is dressed in a wool coat suit coat with a button-down shirt, wearing a brown beret and glasses, with a very professorial appearance. At the end, he offers a subtle closed-mouth smile as if he found the answer to the mystery of life. The lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.

3. Chef in the Kitchen

Chef chopping onions in the kitchen for the preparation of the dish.

4. Venetian Canal at Dawn

Through a Venetian canal at dawn, glide as morning light paints ancient facades in pastel hues. Track a solitary gondola as it approaches a weathered stone bridge. Above, wooden shutters open with a creak as an unseen resident scatters crumbs to circling pigeons.

5. Candid Everyday Photo

A casual, everyday photo—candid, possibly taken secretly or spontaneously, without artistic posing, without perfect composition, and with no filters. The lighting is natural, and the overall feel is natural. The subject is a 21-year-old woman of European descent, fair-skinned with blonde hair and blue eyes, and she is quite attractive. She’s wearing a woolen dress with a small microphone pinned to it—perhaps she’s being interviewed? The setting is indoors, her hands aren’t visible in the frame, and she is looking at the viewer. It’s a half-length shot, taken in a casual, everyday manner.

Pros and Cons

Pros

High-quality data curation
Rectified flow improves interaction between image and video tokens
Strong benchmark results on GenEval, DPG-Bench, and VBench
Multiple generation tasks
Robust training infrastructure

Cons

High computational resources
Complex model architecture
Performance varies with data

How to Use Goku AI

Overview

Goku is a new family of joint image-and-video generation models based on rectified flow Transformers. It is designed to achieve industry-grade performance, integrating advanced techniques for high-quality visual generation, including meticulous data curation, model design, and flow formulation.

Key Contributions

📊 High-quality fine-grained image and video data curation.
🔄 The pioneering use of rectified flow for enhanced interaction among video and image tokens.
🌟 Superior qualitative and quantitative performance in both image and video generation tasks.
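
As a rough illustration of how a rectified-flow model generates a sample, the sketch below integrates a velocity field from noise (t=0) to data (t=1) with plain Euler steps. The velocity field here is a hypothetical toy with a single target point, not Goku's network, and the sampler is a generic ODE integrator rather than the authors' implementation. It shows why straight trajectories are attractive: even one Euler step lands on the target in this idealized case.

```python
def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical toy setup: the "dataset" is a single point.
TARGET = [1.0, -2.0, 0.5]

def toy_velocity(x, t):
    # For a single data point, the ideal field points straight at the
    # target, scaled so that the target is reached exactly at t = 1.
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]

start = [0.3, 0.7, -1.1]  # stand-in for a noise sample
for n in (1, 4, 16):
    out = euler_sample(toy_velocity, start, n)
    err = max(abs(a - b) for a, b in zip(out, TARGET))
    print(f"steps={n:2d} max_error={err:.2e}")
```

With a learned model and a real data distribution the trajectories are only approximately straight, so more steps are usually needed than in this toy case.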

Supported Generation Tasks

🎬 Text-to-Video Generation
🖼️ Image-to-Video Generation
🎨 Text-to-Image Generation

Performance Benchmarks

Goku reports the following scores on major benchmarks:

0.76 on GenEval (text-to-image generation)
83.65 on DPG-Bench (text-to-image generation)
84.85 on VBench (text-to-video generation)

VBench Performance

Goku-T2V achieves an impressive score of 84.85 in VBench, securing the No.2 position as of 2024-10-07, surpassing several leading commercial text-to-video models.

VBench Performance Comparison

Method	Total Score	Quality Score	Semantic Score	Subject Consistency	Background Consistency	Temporal Flickering	Motion Smoothness	Dynamic Degree	Aesthetic Quality	Imaging Quality	Object Class	Multiple Objects	Human Action	Color	Spatial Relationship	Scene	Appearance Style
AnimateDiff-V2	80.27	82.90	69.75	95.30	97.68	98.75	97.76	40.83	67.16	70.10	90.90	36.88	92.60	87.47	34.60	50.19	22.42
VideoCrafter-2.0	80.44	82.20	73.42	96.85	98.22	98.41	97.73	42.50	63.13	67.22	92.55	40.66	95.00	92.92	35.86	55.29	25.13
OpenSora V1.2	79.23	80.71	73.30	94.45	97.90	99.47	98.20	47.22	56.18	60.94	83.37	58.41	85.80	87.49	67.51	42.47	23.89
Show-1	78.93	80.42	72.98	95.53	98.02	99.12	98.24	44.44	57.35	58.66	93.07	45.47	95.60	86.35	53.50	47.03	23.06
Gen-3	82.32	84.11	75.17	97.10	96.62	98.61	99.23	60.14	63.34	66.82	87.81	53.64	96.40	80.90	65.09	54.57	24.31
Pika-1.0	80.69	82.92	71.77	96.94	97.36	99.74	99.50	47.50	62.04	61.87	88.72	43.08	86.20	90.57	61.03	49.83	22.26
CogVideoX-5B	81.61	82.75	77.04	96.23	96.52	98.66	96.92	70.97	61.98	62.90	85.23	62.11	99.40	82.81	66.35	53.20	24.91
Kling	81.85	83.39	75.68	98.33	97.60	99.30	99.40	46.94	61.21	65.62	87.24	68.05	93.40	89.90	73.03	50.86	19.62
Mira	71.87	78.78	44.21	96.23	96.92	98.29	97.54	60.33	42.51	60.16	52.06	12.52	63.80	42.24	27.83	16.34	21.89
CausVid	84.27	85.65	78.75	97.53	97.19	96.24	98.05	92.69	64.15	68.88	92.99	72.15	99.80	80.17	64.65	56.58	24.27
Luma	83.61	83.47	84.17	97.33	97.43	98.64	99.35	44.26	65.51	66.55	94.95	82.63	96.40	92.33	83.67	58.98	24.66
HunyuanVideo	83.24	85.09	75.82	97.37	97.76	99.44	98.99	70.83	60.36	67.56	86.10	68.55	94.40	91.60	68.68	53.88	19.80
Goku-T2V (ours)	84.85	85.60	81.87	95.55	96.67	97.71	98.50	-	-	-	-	-	-	-	-	-	-

Source: github.com/Saiyan-World/goku
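
The Total Score column in the table above appears to be a weighted average of the two scores that follow it. The sketch below checks that against the listed numbers and ranks the rows by total. It assumes VBench's 4:1 weighting of quality to semantic score, which the article does not state, and it treats the third numeric column as the semantic score. The values are copied from the table.

```python
# (method, total, quality, semantic) copied from the comparison table.
ROWS = [
    ("AnimateDiff-V2", 80.27, 82.90, 69.75),
    ("VideoCrafter-2.0", 80.44, 82.20, 73.42),
    ("OpenSora V1.2", 79.23, 80.71, 73.30),
    ("Show-1", 78.93, 80.42, 72.98),
    ("Gen-3", 82.32, 84.11, 75.17),
    ("Pika-1.0", 80.69, 82.92, 71.77),
    ("CogVideoX-5B", 81.61, 82.75, 77.04),
    ("Kling", 81.85, 83.39, 75.68),
    ("Mira", 71.87, 78.78, 44.21),
    ("CausVid", 84.27, 85.65, 78.75),
    ("Luma", 83.61, 83.47, 84.17),
    ("HunyuanVideo", 83.24, 85.09, 75.82),
    ("Goku-T2V", 84.85, 85.60, 81.87),
]

# Assumption: quality is weighted 4 and semantic 1 in the total.
QUALITY_WEIGHT = 4
SEMANTIC_WEIGHT = 1

def weighted_total(quality, semantic):
    """Weighted average of the quality and semantic scores."""
    weight_sum = QUALITY_WEIGHT + SEMANTIC_WEIGHT
    return (QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic) / weight_sum

# Rank the listed rows by their published total score, highest first.
ranked = sorted(ROWS, key=lambda row: row[1], reverse=True)
for name, total, quality, semantic in ranked:
    recomputed = weighted_total(quality, semantic)
    print(f"{name:18s} listed={total:.2f} recomputed={recomputed:.2f}")
```

Under that assumption, every recomputed total agrees with the listed total to within rounding, and Goku-T2V has the highest total among the rows listed here.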
