Wednesday, January 15, 2025

Previewing NVIDIA's Cosmos, a Generative AI Worldbuilding Tool

NVIDIA Cosmos is a new platform from NVIDIA that uses generative AI for digital worldbuilding. In this post, I will demonstrate some possible outputs from Cosmos using NVIDIA's simulation tools. Afterward, I will discuss some of the functions and libraries Cosmos uses.

To create your own digital worlds with Cosmos, follow this link to the Simulation Tools in NVIDIA Explore NVIDIA Explore - Simulation Tools. To start, we will select the cosmos-1.0-diffusion-7b

The Preview Image for NVIDIA Cosmos Diffusion Model
The Preview Image for cosmos-1.0-diffusion-7b

Once we've selected cosmos-1.0-diffusion-7b, we are presented with an option to input text, an image, or a video. The default example is a robot rolling through a chemical plant, with a video as the output.

AN AI-Generated Image of a Robot Traversing a Chemical Plant

For this demonstration, I'm going to begin by inputting the following text into the input box: "A crane at a dock lifting a large cargo crate from a ship onto the dock. photorealistic" After about 60 seconds, Cosmos produces a short 5-second video as output. Here is a frame from the one it generated from my first prompt:

An AI-Generated Image of a Dock with a Crane

In this case, we used the Cosmos-1.0-diffusion-7b-Text2World model, which takes an input of up 300 words and produces an output video of 121 frames.As described in the linked documentation, it uses self-attention, cross-attention, and feedforward layers. Additionally, it uses adaptive layer normalization for denoising between each layer. Each layer is necessary and serves a unique purpose.

Starting with the self-attention layer, it is used to determine what words in the input text will be most relevant to the output image. For example, the word "crane" in our prompt is weighted higher than the word "at". While both are relevant to the output, the object of the crane is in the center of the video. Next, the cross-attention layer relates the information contained in each word and assigns it to a relevant image as a result. In our case, this is shown by the word crate and the image of a brown crate. To clarify, the word "crate" is referred to as the source, and the image is referred to as the target.

The third layer, the feedforward layer, redefines each word after the the cross-attention layer finds it relevance. For example, the crate in our example is placed on the dock in our image, because the feedforward layer related it to the phrase "onto the dock". Lastly, the adaptive layer normalization stabilizes the output, which in this case could refer to making the crane move slowly and not too jittery.

In addition to the cosmos-1.0-7b-diffusion-7b which uses the text2world model, there is also the cosmos-1.0-autoregressive-5b model.

The Preview Image for cosmos-1.0-autoregressive-5b

This model takes a picture as input and produces a 5-second video as an output. The first frame of the output video is the exact picture, and the model predicts what happens in the next 5 seconds of the scene to create the video. For this model, there are a series of 9 preselected images to choose from.

Sample Images for Video Generation

Similar to the text2world model, the autoregressive video2world model employs self-attention, cross-attention, and feedforward layers. It should be noted that while this model is referred to as video2world, it can accept text, images, or videos as input, and outputs a video from whichever input was given.

Overall, NVIDIA Cosmos is a powerful worldbuilding tool for a variety of applications including simulation software as well as game development. To learn more about the development tools NVIDIA has to offer, check out the following post: An Overview of NVIDIA's AI Tools for Developers

Wednesday, January 8, 2025

NVIDIA's Jetson Orin Nano Super Developer Kit

NVIDIA recently unveiled their new Jetson Orin Nano Super Developer Kit, a powerful computer designed for using AI on edge devices.  Key features of the kit include a performance of up to 67 TOPS(Trillion Operations per Second), 102 GB/s of memory bandwidth, and a CPU frequency of up to 1.7 GHz. Other technical specifications found on the datasheet are a 6-core ARM Cortex-A78AE v8.2 64-bit CPU (arm-cortex-a78ae-product-brief), an NVIDIA Ampere GPU (NVIDIA Ampere Architecture) and 8GB of memory. The kit has a cost of $249 and sold out quickly. It is currently on backorder from retailers such as Sparkfun Electronics, and Seeed Studio.

NVIDIA Jetson Orin Nano Developer Kit
This new Super Developer Kit is an upgraded version of the previous Jetson Orin Nano Developer Kit. NVIDIA has provided the JetPack 6.1 SDK which can be used on existing Jetson Orin Nano Developer Kits to access features from the Super developer Kit. To unlock the super performance, users can select the latest release of JetPack, Jetpack 6.1 (rev. 1), when installing the latest SD Card image for the Kit. Detailed installation instructions can be found on the following page: JetPack SDK

The following table from NVIDIA highlights the main improvements of the Super Developer Kit, including the improved operating speed of up to 67 TOPS and memory bandwidth of 102 GB/s.
In this video, NVIDIA CEO Jensen Huang presents the Jetson Orin Nano Super Developer Kit.


If you are interested in learning about software compatible with the Jetson Orin Nano, check out this post in which I summarize the key features of NVIDIA's AI tools:  An Overview of NVIDIA's AI Tools For Developers