
NVIDIA’s Cosmos Policy: A Breakthrough in Robot Control Using World Foundation Models

NVIDIA has expanded its Cosmos world foundation models (WFMs) with the introduction of Cosmos Policy, a new approach that adapts pretrained video models for advanced robot control and planning. This innovation represents a significant step forward in robotics, autonomous vehicles, and industrial AI applications.

How Cosmos Policy Works

Cosmos Policy builds upon NVIDIA’s Cosmos Predict-2 world foundation model through post-training for manipulation tasks. Unlike traditional approaches that use separate neural networks for perception and control, Cosmos Policy encodes robot actions, physical states, and success scores as additional latent frames within the same model architecture.

This unified approach allows the model to leverage its pre-existing understanding of physics and temporal dynamics, enabling three key capabilities in a single system:

  • Predicting action sequences for robotic movement
  • Forecasting future robot observations for world modeling
  • Estimating expected returns for planning
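To make the unified design concrete, here is a minimal sketch of the idea that one shared latent sequence can be decoded into all three outputs. The names, frame layout, and dimensions below are hypothetical illustrations, not NVIDIA's actual API or architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 16  # hypothetical size of each latent frame

def unified_forward(latents: np.ndarray) -> dict:
    """Decode different slices of one shared latent sequence into
    actions, predicted observations, and a value estimate.

    latents: (num_frames, LATENT_DIM) -- a stand-in for the world
    foundation model's latent sequence; the slicing is illustrative.
    """
    return {
        # First 8 latent frames -> an 8-step action chunk for a 7-DoF arm
        "actions": latents[0:8] @ rng.standard_normal((LATENT_DIM, 7)),
        # Next 4 latent frames -> predicted future observation latents
        "future_obs": latents[8:12],
        # Final latent frame -> a scalar expected-return estimate
        "value": float(latents[12] @ rng.standard_normal(LATENT_DIM)),
    }

out = unified_forward(rng.standard_normal((13, LATENT_DIM)))
print(out["actions"].shape)  # (8, 7)
```

The point of the sketch is that nothing outside the shared latent sequence is needed: the same forward pass yields control, prediction, and evaluation.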

Advantages Over Traditional Approaches

Traditional robotic control systems often rely on vision-language models (VLMs) that can suggest high-level actions but lack the ability to execute them precisely. In contrast, Cosmos Policy’s world foundation model approach is specifically designed to understand how scenes evolve over time—a crucial capability for robot control.

The diffusion-based architecture enables Cosmos Policy to handle complex, high-dimensional distributions across long temporal horizons, making it particularly effective for tasks requiring precise coordination and multi-step execution.
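A diffusion policy of this kind denoises an entire multi-step action chunk jointly rather than predicting one step at a time. The following toy sampler illustrates only that pattern; the update rule, step count, and stand-in model are simplified assumptions, not Cosmos Policy's actual sampler:

```python
import numpy as np

def denoise_action_chunk(model, steps=10, horizon=16, action_dim=7, rng=None):
    """Diffusion-style sampling sketch: start from pure noise over the
    whole action horizon and iteratively refine it, so all timesteps
    are generated jointly instead of autoregressively."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal((horizon, action_dim))  # noisy action chunk
    for t in range(steps, 0, -1):
        eps_hat = model(x, t)       # model predicts the noise at step t
        x = x - eps_hat / steps     # simplified denoising update
    return x

# Toy stand-in "model" that nudges samples toward zero; the real
# denoiser would be the post-trained world foundation model.
chunk = denoise_action_chunk(lambda x, t: x * 0.5, steps=10)
print(chunk.shape)  # (16, 7)
```

Because the whole horizon is refined together, coordination constraints across timesteps can be enforced in every denoising pass, which is what makes the approach suited to multi-step execution.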

Performance Benchmarks

NVIDIA’s testing shows Cosmos Policy achieving state-of-the-art performance on standard benchmarks:

  • On the LIBERO benchmark, Cosmos Policy achieved a 98.5% average success rate, outperforming previous leading models
  • On RoboCasa, it reached a 67.1% success rate with only 50 training demonstrations per task, compared to competitors requiring hundreds or thousands of examples
  • In real-world bimanual manipulation tests using the ALOHA robot platform, model-based planning enhanced task completion rates by 12.5%

Implementation Flexibility

Cosmos Policy can be deployed in two ways:

  • As a direct policy, where only actions are generated at inference time
  • As a planning policy, where multiple candidate actions are evaluated by predicting their resulting future states and values
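The planning mode amounts to sample-and-score: generate several candidate action sequences, use the model's own predictions of the resulting future and return to rank them, and execute the best. This sketch shows that loop with hypothetical stand-ins for the model's action and value heads:

```python
import numpy as np

def plan(sample_actions, predict_value, num_candidates=8, rng=None):
    """Planning-mode sketch: draw candidate action sequences, score
    each by its predicted return, and pick the highest-scoring one.
    `sample_actions` and `predict_value` are hypothetical stand-ins
    for the model's action-generation and value-estimation heads."""
    rng = rng or np.random.default_rng(0)
    candidates = [sample_actions(rng) for _ in range(num_candidates)]
    values = [predict_value(a) for a in candidates]
    return candidates[int(np.argmax(values))]

# Toy example: 8-step, 7-DoF action chunks scored by negative effort.
best = plan(
    sample_actions=lambda rng: rng.standard_normal((8, 7)),
    predict_value=lambda a: -float(np.abs(a).sum()),
)
print(best.shape)  # (8, 7)
```

The direct mode skips the scoring loop entirely and runs a single `sample_actions` call, which is why it is cheaper at inference time.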

The model’s ability to inherit pretrained knowledge while remaining simple to train and deploy represents a significant advancement in making robots more adaptable and efficient across various tasks.


Written by Thomas Unise
