Wednesday, January 15, 2025

Previewing NVIDIA's Cosmos, a Generative AI Worldbuilding Tool

NVIDIA Cosmos is a new platform that uses generative AI for digital worldbuilding. In this post, I will demonstrate some possible outputs from Cosmos using NVIDIA's simulation tools. Afterward, I will discuss some of the functions and layers Cosmos uses under the hood.

To create your own digital worlds with Cosmos, follow this link to the Simulation Tools section of NVIDIA Explore: NVIDIA Explore - Simulation Tools. To start, we will select the cosmos-1.0-diffusion-7b model.

The Preview Image for cosmos-1.0-diffusion-7b

Once we've selected cosmos-1.0-diffusion-7b, we are presented with an option to input text, an image, or a video. The default example is a robot rolling through a chemical plant, with a video as the output.

An AI-Generated Image of a Robot Traversing a Chemical Plant

For this demonstration, I'm going to begin by entering the following text into the input box: "A crane at a dock lifting a large cargo crate from a ship onto the dock. photorealistic". After about 60 seconds, Cosmos produces a short 5-second video as output. Here is a frame from the video it generated for my first prompt:

An AI-Generated Image of a Dock with a Crane
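If you'd rather script this step than use the browser demo, the sketch below shows roughly what a programmatic request could look like. Note that the endpoint URL, payload fields, and response shape here are assumptions for illustration, not NVIDIA's documented API; consult the model's API reference on NVIDIA Explore for the actual invocation details.

```python
# Hypothetical sketch of submitting the same prompt programmatically.
import requests

API_KEY = "nvapi-..."  # your NVIDIA API key
INVOKE_URL = "https://ai.api.nvidia.com/v1/..."  # placeholder, not the real endpoint

payload = {
    "prompt": (
        "A crane at a dock lifting a large cargo crate from a ship "
        "onto the dock. photorealistic"
    ),
}

response = requests.post(
    INVOKE_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json",
    },
    json=payload,
    timeout=120,  # generation took roughly 60 seconds in the browser demo
)
response.raise_for_status()
print(response.json())  # inspect the response for the generated video asset
```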

In this case, we used the Cosmos-1.0-diffusion-7b-Text2World model, which takes an input of up to 300 words and produces an output video of 121 frames. As described in the linked documentation, it uses self-attention, cross-attention, and feedforward layers, along with adaptive layer normalization for denoising between each layer. Each layer serves a distinct purpose.

Starting with the self-attention layer: it determines which words in the input text are most relevant to the output video. For example, the word "crane" in our prompt is weighted more heavily than the word "at"; while both are relevant to the output, the crane is the object at the center of the video. Next, the cross-attention layer relates the information contained in each word to the relevant part of the generated image. In our case, this is shown by the word "crate" and the image of a brown crate. To clarify, the word "crate" is referred to as the source, and the image is referred to as the target.

The third layer, the feedforward layer, refines the representation of each word after the cross-attention layer determines its relevance. For example, the crate in our example is placed on the dock in the image because the feedforward layer related it to the phrase "onto the dock". Lastly, adaptive layer normalization stabilizes the output, which in this case could mean making the crane's motion smooth rather than jittery.
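To make these four pieces concrete, here is a minimal PyTorch sketch of a transformer block in this style: self-attention, cross-attention, and a feedforward layer, each preceded by adaptive layer normalization whose scale and shift are predicted from the denoising-timestep embedding. This is a generic illustration of the pattern the documentation describes, not Cosmos's actual implementation; all dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class DiffusionTransformerBlock(nn.Module):
    """Illustrative block: self-attention over video tokens, cross-attention
    to text tokens, a feedforward layer, and adaptive layer normalization.
    Dimensions and structure are assumptions, not Cosmos's actual code."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Plain LayerNorms without learned affine parameters; the affine part
        # is supplied adaptively from the timestep embedding below.
        self.norms = nn.ModuleList(
            [nn.LayerNorm(dim, elementwise_affine=False) for _ in range(3)]
        )
        # Adaptive layer norm: predict one (scale, shift) pair per sub-layer.
        self.ada_ln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text, t_emb):
        # x: video tokens (B, N, dim); text: text tokens (B, M, dim);
        # t_emb: denoising-timestep embedding (B, dim).
        scale_shift = self.ada_ln(t_emb).chunk(6, dim=-1)

        def modulate(h, i):
            scale, shift = scale_shift[2 * i], scale_shift[2 * i + 1]
            return self.norms[i](h) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

        # 1. Self-attention: video tokens attend to each other.
        h = modulate(x, 0)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # 2. Cross-attention: video tokens (target) attend to text (source).
        h = modulate(x, 1)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        # 3. Feedforward: refine each token's representation independently.
        x = x + self.ffn(modulate(x, 2))
        return x
```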

In addition to cosmos-1.0-diffusion-7b, which uses the Text2World model, there is also the cosmos-1.0-autoregressive-5b model.

The Preview Image for cosmos-1.0-autoregressive-5b

This model takes a picture as input and produces a 5-second video as output. The first frame of the output video is the input picture itself, and the model predicts what happens in the next 5 seconds of the scene to create the video. For this model, there are nine preselected images to choose from.

Sample Images for Video Generation
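Conceptually, the autoregressive generation loop is simple: the input image becomes frame zero, and each new frame is predicted from all the frames generated so far. The sketch below illustrates that idea; the `model` interface is a placeholder for illustration, not the real Cosmos API.

```python
import torch

def generate_video(model, first_frame: torch.Tensor, n_frames: int):
    """first_frame: (C, H, W) input image; returns (n_frames, C, H, W)."""
    frames = [first_frame]  # the output video starts with the exact input image
    for _ in range(n_frames - 1):
        context = torch.stack(frames)  # all frames generated so far
        next_frame = model(context)    # hypothetical call: predict the next frame
        frames.append(next_frame)
    return torch.stack(frames)
```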

Similar to the Text2World model, the autoregressive Video2World model employs self-attention, cross-attention, and feedforward layers. It should be noted that while this model is referred to as Video2World, it can accept text, images, or videos as input and produces a video from whichever input is given.
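From a caller's point of view, "accepts text, images, or videos" might look something like the hypothetical wrapper below. The function and parameter names are placeholders for illustration only, not NVIDIA's API.

```python
from pathlib import Path

def to_world_video(model, prompt_or_path):
    """Hypothetical dispatcher: routes text, image, or video input to the
    same Video2World model. Names and interface are illustrative only."""
    if isinstance(prompt_or_path, str) and not Path(prompt_or_path).exists():
        return model.generate(text=prompt_or_path)   # plain text prompt
    suffix = Path(prompt_or_path).suffix.lower()
    if suffix in {".png", ".jpg", ".jpeg"}:
        return model.generate(image=prompt_or_path)  # single image
    if suffix in {".mp4", ".mov"}:
        return model.generate(video=prompt_or_path)  # video clip
    raise ValueError(f"Unsupported input: {prompt_or_path}")
```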

Overall, NVIDIA Cosmos is a powerful worldbuilding tool for a variety of applications, including simulation software and game development. To learn more about the development tools NVIDIA has to offer, check out the following post: An Overview of NVIDIA's AI Tools for Developers.
