What is Goku AI?
Goku AI is a family of models designed for joint image and video generation. It uses rectified flow Transformers to achieve high performance in visual generation tasks. Key features include a well-structured data curation process, efficient model architecture, and robust training infrastructure.
Goku supports text-to-image, text-to-video, and image-to-video generation, delivering strong results on benchmarks like GenEval, DPG-Bench, and VBench. It ranks among the top models in text-to-video generation, demonstrating its effectiveness in creating high-quality visual content.
Overview of Goku AI
Feature | Description |
---|---|
AI Tool | Goku AI - Flow Based Video Generative Foundation Models |
Category | Visual Generation Framework |
Function | Image and Video Generation |
Generation Speed | Real-time Processing |
Research Paper | arxiv.org/abs/2502.04896 |
Official Website | saiyan-world.github.io/goku/ |
GitHub Repository | github.com/Saiyan-World/goku |
Benchmark Dataset | Goku-MovieGenBench on Hugging Face |
Key Features of Goku AI
Joint Image and Video Generation
Handles both image and video tasks in a unified framework.
Rectified Flow Transformers
Enhances interaction between image and video tokens for improved output quality.
High-Quality Data Curation
Uses carefully prepared datasets to ensure detailed and accurate visual generation.
Efficient Training Infrastructure
Designed for scalable and robust large-scale model training.
Strong Benchmark Performance
Scores 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video generation.
Versatile Task Support
Supports text-to-image, text-to-video, and image-to-video generation.
Research-Driven Design
Incorporates innovative techniques like rectified flow for better model performance.
Examples of Goku AI
1. Stylish Woman in Tokyo
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
2. Thoughtful Man in Paris
An extreme close-up of a gray-haired man with a beard in his 60s, deep in thought pondering the history of the universe as he sits at a cafe in Paris. His eyes focus on people offscreen as they walk, while he sits mostly motionless. He is dressed in a wool suit coat with a button-down shirt, wearing a brown beret and glasses, with a very professorial appearance. At the end, he offers a subtle closed-mouth smile as if he found the answer to the mystery of life. The lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.
3. Chef in the Kitchen
Chef chopping onions in the kitchen for the preparation of the dish.
4. Venetian Canal at Dawn
Through a Venetian canal at dawn, glide as morning light paints ancient facades in pastel hues. Track a solitary gondola as it approaches a weathered stone bridge. Above, wooden shutters open with a creak as an unseen resident scatters crumbs to circling pigeons.
5. Candid Everyday Photo
A casual, everyday photo—candid, possibly taken secretly or spontaneously, without artistic posing, without perfect composition, and with no filters. The lighting is natural, and the overall feel is natural. The subject is a 21-year-old woman of European descent, fair-skinned with blonde hair and blue eyes, and she is quite attractive. She’s wearing a woolen dress with a small microphone pinned to it—perhaps she’s being interviewed? The setting is indoors, her hands aren’t visible in the frame, and she is looking at the viewer. It’s a half-length shot, taken in a casual, everyday manner.
Pros and Cons
Pros
- High-quality data curation
- Rectified flow interaction
- Strong scores on GenEval, DPG-Bench, and VBench
- Multiple generation tasks
- Robust training infrastructure
Cons
- High computational resources
- Complex model architecture
- Performance varies with data
How to Use Goku AI
Overview
Goku is a new family of joint image-and-video generation models based on rectified flow Transformers. It is designed to achieve industry-grade performance, integrating advanced techniques for high-quality visual generation, including meticulous data curation, model design, and flow formulation.
Key Contributions
- 📊 High-quality fine-grained image and video data curation.
- 🔄 The pioneering use of rectified flow for enhanced interaction among video and image tokens.
- 🌟 Superior qualitative and quantitative performance in both image and video generation tasks.
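To make the rectified flow idea in the contributions above more concrete, here is a minimal NumPy sketch of the usual formulation: a sample is moved along a straight line from noise `x0` to data `x1`, and the model is trained to predict the constant velocity `x1 - x0`. This is an illustration under stated assumptions, not Goku's implementation: the function names, the interpolation convention `x_t = t * x1 + (1 - t) * x0`, and the toy "oracle" velocity function are all hypothetical stand-ins for the real Transformer and training setup.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return t * x1 + (1.0 - t) * x0


def velocity_target(x0, x1):
    """Regression target for rectified flow: constant velocity along the path."""
    return x1 - x0


def flow_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    predicted = predict_velocity(xt, t)
    return float(np.mean((predicted - velocity_target(x0, x1)) ** 2))


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * predict_velocity(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((4, 8))   # stand-in for latent noise
    data = rng.standard_normal((4, 8))    # stand-in for image/video latents

    # Hypothetical "perfect" model that already knows the true velocity.
    def oracle(x, t):
        return data - noise

    print("loss with oracle:", flow_loss(oracle, noise, data, t=0.3))
    print("sampling recovers data:", np.allclose(euler_sample(oracle, noise), data))
```

In a real system the oracle would be replaced by a learned network conditioned on text, and `t` would be sampled randomly per training example; the straight-line path is what lets sampling use relatively few integration steps.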
Supported Generation Tasks
- 🎬 Text-to-Video Generation
- 🖼️ Image-to-Video Generation
- 🎨 Text-to-Image Generation
Performance Benchmarks
Goku reports the following scores on major benchmarks:
- 0.76 on GenEval (text-to-image generation)
- 83.65 on DPG-Bench (text-to-image generation)
- 84.85 on VBench (text-to-video generation)
VBench Performance
Goku-T2V achieves an impressive score of 84.85 in VBench, securing the No.2 position as of 2024-10-07, surpassing several leading commercial text-to-video models.
VBench Performance Comparison
Method | Total Score | Quality Score | Semantic Score | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Object Class | Multiple Objects | Human Action | Color | Spatial Relationship | Scene | Appearance Style |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42 |
VideoCrafter-2.0 | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13 |
OpenSora V1.2 | 79.23 | 80.71 | 73.30 | 94.45 | 97.90 | 99.47 | 98.20 | 47.22 | 56.18 | 60.94 | 83.37 | 58.41 | 85.80 | 87.49 | 67.51 | 42.47 | 23.89 |
Show-1 | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06 |
Gen-3 | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31 |
Pika-1.0 | 80.69 | 82.92 | 71.77 | 96.94 | 97.36 | 99.74 | 99.50 | 47.50 | 62.04 | 61.87 | 88.72 | 43.08 | 86.20 | 90.57 | 61.03 | 49.83 | 22.26 |
CogVideoX-5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91 |
Kling | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62 |
Mira | 71.87 | 78.78 | 44.21 | 96.23 | 96.92 | 98.29 | 97.54 | 60.33 | 42.51 | 60.16 | 52.06 | 12.52 | 63.80 | 42.24 | 27.83 | 16.34 | 21.89 |
CausVid | 84.27 | 85.65 | 78.75 | 97.53 | 97.19 | 96.24 | 98.05 | 92.69 | 64.15 | 68.88 | 92.99 | 72.15 | 99.80 | 80.17 | 64.65 | 56.58 | 24.27 |
Luma | 83.61 | 83.47 | 84.17 | 97.33 | 97.43 | 98.64 | 99.35 | 44.26 | 65.51 | 66.55 | 94.95 | 82.63 | 96.40 | 92.33 | 83.67 | 58.98 | 24.66 |
HunyuanVideo | 83.24 | 85.09 | 75.82 | 97.37 | 97.76 | 99.44 | 98.99 | 70.83 | 60.36 | 67.56 | 86.10 | 68.55 | 94.40 | 91.60 | 68.68 | 53.88 | 19.80 |
Goku-T2V (ours) | 84.85 | 85.60 | 81.87 | 95.55 | 96.67 | 97.71 | 98.50 | - | - | - | - | - | - | - | - | - | - |
Source: github.com/Saiyan-World/goku
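The Total Score column in the table above appears to be a weighted average of the Quality Score and the Semantic Score. The short check below assumes the 4:1 quality-to-semantic weighting used by the VBench convention; that weighting is not stated on this page, so treat it as an assumption. The rows are copied from the table, and the helper names are illustrative only.

```python
# Assumed VBench weighting: quality counts four times as much as semantic.
QUALITY_WEIGHT = 4.0
SEMANTIC_WEIGHT = 1.0


def vbench_total(quality, semantic):
    """Weighted average of the quality and semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


# (total, quality, semantic) for a few rows copied from the table above.
ROWS = {
    "AnimateDiff-V2": (80.27, 82.90, 69.75),
    "CausVid": (84.27, 85.65, 78.75),
    "Luma": (83.61, 83.47, 84.17),
    "Goku-T2V": (84.85, 85.60, 81.87),
}


def max_abs_error(rows):
    """Largest gap between the listed total and the recomputed total."""
    return max(abs(vbench_total(q, s) - total) for total, q, s in rows.values())


def best_method(rows):
    """Method with the highest listed total score among the given rows."""
    return max(rows, key=lambda name: rows[name][0])


if __name__ == "__main__":
    for name, (total, quality, semantic) in ROWS.items():
        print(f"{name}: listed {total:.2f}, recomputed {vbench_total(quality, semantic):.2f}")
    print("largest discrepancy:", round(max_abs_error(ROWS), 3))
    print("highest total among these rows:", best_method(ROWS))
```

Under this assumption the recomputed totals agree with the listed ones to within rounding, which is a quick way to sanity-check rows when new models are added to the comparison.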