GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGB-D input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placement, and camera viewpoint. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

Data Samples

Comparison of training data usage across VLA models. GigaBrain-0 leverages a diverse set of data sources to enhance generalization and reduce dependency on real-world robot data.
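One way to picture this mix of sources is as a weighted sampler over heterogeneous data pools. The sketch below is illustrative only: the source categories follow the abstract, but the weights, the `pools` objects, and the `sample_batch` helper are hypothetical and not part of any released code.

```python
import random

# Hypothetical data pools; each would yield (observation, language, action) episodes.
# Source categories follow the abstract; the mixture weights are illustrative only.
SOURCE_WEIGHTS = {
    "real_robot":         0.30,
    "video_generation":   0.20,
    "real2real_transfer": 0.15,
    "human_transfer":     0.15,
    "view_transfer":      0.10,
    "sim2real_transfer":  0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to the mixture weights."""
    names, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]

def sample_batch(pools: dict, batch_size: int, seed: int = 0) -> list:
    """Draw a training batch by sampling a source, then an episode from that source's pool."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = sample_source(rng)
        batch.append(pools[source].sample())  # pools[source] is a hypothetical dataset object
    return batch
```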


Model Architecture

The framework of GigaBrain-0. GigaBrain-0 takes RGB-D input to enhance spatial perception and outputs Embodied Chain-of-Thought (Embodied CoT) as an intermediate representation to strengthen embodied reasoning for manipulation. During training, GigaBrain-0 employs Knowledge Insulation to prevent interference between the optimization processes of action prediction and Embodied CoT generation.
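One way to read the Knowledge Insulation described above is as a stop-gradient between the two objectives: the backbone is updated by the Embodied CoT (next-token) loss, while the action head is trained on detached backbone features so that action-prediction gradients do not perturb the CoT pathway. The module names, shapes, and the simple regression loss below are assumptions; this is a minimal sketch of that gradient-isolation idea, not the released architecture or its actual action objective.

```python
import torch
import torch.nn as nn

class InsulatedVLA(nn.Module):
    """Minimal sketch: the CoT loss trains the backbone; the action head sees detached features."""
    def __init__(self, backbone: nn.Module, cot_head: nn.Module, action_head: nn.Module):
        super().__init__()
        self.backbone = backbone        # hypothetical VLM over RGB-D + language tokens
        self.cot_head = cot_head        # predicts Embodied CoT tokens
        self.action_head = action_head  # predicts action outputs

    def forward(self, rgbd, text_tokens):
        feats = self.backbone(rgbd, text_tokens)
        cot_logits = self.cot_head(feats)           # gradients flow back into the backbone
        actions = self.action_head(feats.detach())  # knowledge insulation: stop-gradient here
        return cot_logits, actions

def training_loss(cot_logits, cot_targets, actions, action_targets,
                  w_cot: float = 1.0, w_act: float = 1.0) -> torch.Tensor:
    """Both terms are optimized jointly, but only the CoT term reaches backbone weights."""
    # cot_logits: (B, T, V); cot_targets: (B, T); a plain MSE stands in for the action objective.
    cot_loss = nn.functional.cross_entropy(cot_logits.flatten(0, 1), cot_targets.flatten())
    action_loss = nn.functional.mse_loss(actions, action_targets)
    return w_cot * cot_loss + w_act * action_loss
```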


Performance Comparison

Performance comparison with the state-of-the-art VLA model π0 across six tasks on the G1 and PiPER robot platforms. GigaBrain-0 outperforms π0 on dexterous manipulation tasks: (a) Laundry Folding and (b) Paper Towel Preparation; long-horizon tasks: (c) Juice Preparation and (d) Table Bussing; and mobile manipulation tasks: (e) Boxes Moving and (f) Laundry Baskets Moving.

Benchmarking Pre-training

Generalization Performance

Generalization performance of GigaBrain-0 under appearance, placement, and viewpoint shifts. The horizontal axis denotes the sampling probability α of world model–generated data used during training. As α increases from 0% to 90%, GigaBrain-0 exhibits a substantial improvement in generalization performance across novel appearance, object placement, and viewpoint conditions. This demonstrates that incorporating a higher proportion of synthetically generated training data significantly enhances the model’s robustness to real-world distribution shifts.
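For reference, the sampling probability α can be thought of as splitting probability mass between world model-generated sources and real robot data. The sketch below is an assumption about how such a knob could be wired up on top of the mixture sampler shown earlier; the even per-source split inside the generated portion is illustrative, and only the real/generated ratio α corresponds to the quantity varied in the study above.

```python
def mixture_with_alpha(alpha: float) -> dict:
    """Assign probability alpha to world model-generated sources and 1 - alpha to real robot data."""
    generated = ["video_generation", "real2real_transfer", "human_transfer",
                 "view_transfer", "sim2real_transfer"]
    weights = {name: alpha / len(generated) for name in generated}  # even split is illustrative
    weights["real_robot"] = 1.0 - alpha
    return weights

# Example: the sweep reported above ranges from alpha = 0.0 (real data only) up to alpha = 0.9.
for alpha in (0.0, 0.3, 0.6, 0.9):
    print(alpha, mixture_with_alpha(alpha))
```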

Few-shot Transfer

GigaBrain-0-Small

We present GigaBrain-0-Small, an optimized lightweight variant specifically designed for efficient inference on edge platforms such as the NVIDIA Jetson AGX Orin. Compared to π0, GigaBrain-0-Small achieves significantly higher inference efficiency on the Orin while maintaining comparable success rates on the table bussing task, demonstrating its effectiveness as a compact yet capable policy for real-world robotic deployment.
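A rough way to compare on-device inference efficiency between two policies is to time repeated forward passes under identical inputs. The harness below is a generic PyTorch timing sketch (the `policy` module, its input format, and the warmup/iteration counts are assumptions), not the deployment stack used on the Jetson.

```python
import time
import torch

@torch.inference_mode()
def mean_latency_ms(policy: torch.nn.Module, example_inputs: tuple,
                    warmup: int = 10, iters: int = 50) -> float:
    """Average per-call latency of a policy's forward pass, in milliseconds."""
    device = next(policy.parameters()).device
    for _ in range(warmup):                   # warm up kernels and caches
        policy(*example_inputs)
    if device.type == "cuda":
        torch.cuda.synchronize(device)        # make sure warmup work has finished
    start = time.perf_counter()
    for _ in range(iters):
        policy(*example_inputs)
    if device.type == "cuda":
        torch.cuda.synchronize(device)        # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters
```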

VLM instruction accuracy on benchmark tasks.