What is Goku AI?

Goku AI is a family of models designed for joint image and video generation. It uses rectified flow Transformers to achieve high performance in visual generation tasks. Key features include a well-structured data curation process, efficient model architecture, and robust training infrastructure.

Goku supports text-to-image, text-to-video, and image-to-video generation, and reports strong results on the GenEval, DPG-Bench, and VBench benchmarks. On VBench it ranks among the top text-to-video models.

Overview of Goku AI

Feature | Description
AI Tool | Goku AI - Flow Based Video Generative Foundation Models
Category | Visual Generation Framework
Function | Image and Video Generation
Generation Speed | Real-time Processing
Research Paper | arxiv.org/abs/2502.04896
Official Website | saiyan-world.github.io/goku/
GitHub Repository | github.com/Saiyan-World/goku
Benchmark Dataset | Huggingface Goku-MovieGenBench

Key Features of Goku AI

  • Joint Image and Video Generation

    Handles both image and video tasks in a unified framework.

  • Rectified Flow Transformers

    Enhances interaction between image and video tokens for improved output quality.

  • High-Quality Data Curation

    Uses carefully prepared datasets to ensure detailed and accurate visual generation.

  • Efficient Training Infrastructure

    Designed for scalable and robust large-scale model training.

  • Strong Benchmark Performance

    Scores 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video generation.

  • Versatile Task Support

    Supports text-to-image, text-to-video, and image-to-video generation.

  • Research-Driven Design

    Incorporates innovative techniques like rectified flow for better model performance.
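
To make the rectified flow idea concrete, here is a minimal sketch of the general formulation, not Goku's actual implementation. It assumes the common convention where a sample moves along a straight line from noise (t = 0) to data (t = 1), and the model is trained to predict the constant velocity along that line. All function names, the toy four-value "latent", and the `oracle` stand-in for a trained Transformer are hypothetical and used only for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Velocity along the straight path; constant in t, used as the regression target."""
    return [b - a for a, b in zip(x0, x1)]


def velocity_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = predict_velocity(x_t, t)
    target = velocity_target(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]
data = [0.5, -1.0, 2.0, 0.0]  # hypothetical stand-in for an image or video latent


def oracle(x_t, t):
    """Hypothetical perfect model: returns the true velocity for this one noise/data pair."""
    return velocity_target(noise, data)


sample = euler_sample(oracle, noise)
print(sample)
```

With the oracle velocity, the Euler loop carries the noise vector to the data vector (up to floating-point error). A real model replaces `oracle` with a network trained by minimizing `velocity_loss` over many noise/data pairs and random values of t.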

Examples of Goku AI

1. Stylish Woman in Tokyo

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

2. Thoughtful Man in Paris

An extreme close-up of a gray-haired man with a beard in his 60s, deep in thought pondering the history of the universe as he sits at a cafe in Paris. His eyes focus on people offscreen as they walk, while he sits mostly motionless. He is dressed in a wool coat suit coat with a button-down shirt, wearing a brown beret and glasses, with a very professorial appearance. At the end, he offers a subtle closed-mouth smile as if he found the answer to the mystery of life. The lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.

3. Chef in the Kitchen

Chef chopping onions in the kitchen for the preparation of the dish.

4. Venetian Canal at Dawn

Through a Venetian canal at dawn, glide as morning light paints ancient facades in pastel hues. Track a solitary gondola as it approaches a weathered stone bridge. Above, wooden shutters open with a creak as an unseen resident scatters crumbs to circling pigeons.

5. Candid Everyday Photo

A casual, everyday photo—candid, possibly taken secretly or spontaneously, without artistic posing, without perfect composition, and with no filters. The lighting is natural, and the overall feel is natural. The subject is a 21-year-old woman of European descent, fair-skinned with blonde hair and blue eyes, and she is quite attractive. She’s wearing a woolen dress with a small microphone pinned to it—perhaps she’s being interviewed? The setting is indoors, her hands aren’t visible in the frame, and she is looking at the viewer. It’s a half-length shot, taken in a casual, everyday manner.

Pros and Cons

Pros

  • High-quality data curation
  • Rectified flow interaction
  • Superior performance
  • Multiple generation tasks
  • Robust training infrastructure

Cons

  • High computational resources
  • Complex model architecture
  • Performance varies with data

How to Use Goku AI

Overview

Goku is a new family of joint image-and-video generation models based on rectified flow Transformers. It is designed to achieve industry-grade performance, integrating advanced techniques for high-quality visual generation, including meticulous data curation, model design, and flow formulation.

Key Contributions

  • 📊 High-quality fine-grained image and video data curation.
  • 🔄 The pioneering use of rectified flow for enhanced interaction among video and image tokens.
  • 🌟 Superior qualitative and quantitative performance in both image and video generation tasks.

Supported Generation Tasks

  • 🎬 Text-to-Video Generation
  • 🖼️ Image-to-Video Generation
  • 🎨 Text-to-Image Generation

Performance Benchmarks

Goku achieves top scores on major benchmarks:

  • 0.76 on GenEval (text-to-image generation)
  • 83.65 on DPG-Bench (text-to-image generation)
  • 84.85 on VBench (text-to-video generation)

VBench Performance

Goku-T2V scores 84.85 on VBench, securing the No. 2 position as of 2024-10-07 and surpassing several leading commercial text-to-video models.

VBench Performance Comparison

Method | Total Score | Quality Score | Sampling Score | Style Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Subject Quality | Imaging Quality | Object Class | Human Action | Object Relationship | Color | Scene | Prompt Style | Overall Consistency
AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42
VideoCrafter-2.0 | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13
OpenSora V1.2 | 79.23 | 80.71 | 73.30 | 94.45 | 97.90 | 99.47 | 98.20 | 47.22 | 56.18 | 60.94 | 83.37 | 58.41 | 85.80 | 87.49 | 67.51 | 42.47 | 23.89
Show-1 | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06
Gen-3 | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31
Pika-1.0 | 80.69 | 82.92 | 71.77 | 96.94 | 97.36 | 99.74 | 99.50 | 47.50 | 62.04 | 61.87 | 88.72 | 43.08 | 86.20 | 90.57 | 61.03 | 49.83 | 22.26
CogVideoX-5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91
Kling | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62
Mira | 71.87 | 78.78 | 44.21 | 96.23 | 96.92 | 98.29 | 97.54 | 60.33 | 42.51 | 60.16 | 52.06 | 12.52 | 63.80 | 42.24 | 27.83 | 16.34 | 21.89
CausVid | 84.27 | 85.65 | 78.75 | 97.53 | 97.19 | 96.24 | 98.05 | 92.69 | 64.15 | 68.88 | 92.99 | 72.15 | 99.80 | 80.17 | 64.65 | 56.58 | 24.27
Luma | 83.61 | 83.47 | 84.17 | 97.33 | 97.43 | 98.64 | 99.35 | 44.26 | 65.51 | 66.55 | 94.95 | 82.63 | 96.40 | 92.33 | 83.67 | 58.98 | 24.66
HunyuanVideo | 83.24 | 85.09 | 75.82 | 97.37 | 97.76 | 99.44 | 98.99 | 70.83 | 60.36 | 67.56 | 86.10 | 68.55 | 94.40 | 91.60 | 68.68 | 53.88 | 19.80
Goku-T2V (ours) | 84.85 | 85.60 | 81.87 | 95.55 | 96.67 | 97.71 | 98.50 | - | - | - | - | - | - | - | - | - | -
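
For readers who want to compare the rows directly, the short sketch below sorts the methods in the table above by their Total Score. The numbers are copied from the table; the variable names are illustrative. It only ranks the entries listed here, and the public VBench leaderboard may include models that this table omits.

```python
# Total Score column from the VBench comparison table above.
total_scores = {
    "AnimateDiff-V2": 80.27,
    "VideoCrafter-2.0": 80.44,
    "OpenSora V1.2": 79.23,
    "Show-1": 78.93,
    "Gen-3": 82.32,
    "Pika-1.0": 80.69,
    "CogVideoX-5B": 81.61,
    "Kling": 81.85,
    "Mira": 71.87,
    "CausVid": 84.27,
    "Luma": 83.61,
    "HunyuanVideo": 83.24,
    "Goku-T2V (ours)": 84.85,
}

# Sort from highest to lowest total score.
ranked = sorted(total_scores.items(), key=lambda item: item[1], reverse=True)

for rank, (method, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {method:<18} {score:.2f}")
```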

Source: github.com/Saiyan-World/goku

Goku AI FAQs