🟢 Alpamayo — Vision-Language-Action Model for Physical AI
From perception to action — NVIDIA's first open-source reasoning-driven Vision-Language-Action model for autonomous driving, bridging language reasoning and physical world trajectory prediction.
📑 Table of Contents
- Executive Summary
- What is a VLA Model? — VLA vs LLM Architecture Comparison
- NVIDIA Physical AI Stack — Ecosystem Context
- Version Release Timeline
- Cross-Version Benchmark Comparison
- Master Architecture Diagram
- Alpamayo-R1 (December 2025 — NeurIPS 2025)
- Summary
- Architecture: Vision Encoder
- Architecture: Cosmos-Reason VLM Backbone
- Architecture: Chain of Causation
- Architecture: Diffusion Trajectory Decoder
- Training Pipeline
- Chain of Causation Dataset
- Deployment & Hardware Requirements
- Architecture Diagram
- Chain of Causation Flowchart
- Community Perspective
- Model Variants
- Key Industry Ideas
- Alpamayo 1.5 (March 2026)
- References
📋 Executive Summary
This document covers two generations of the Alpamayo VLA family developed by NVIDIA:
- Alpamayo-R1 — First open-source reasoning-driven VLA for autonomous driving: surround-view cameras → chain-of-causation reasoning → diffusion trajectory prediction
- Alpamayo 1.5 — Scale-up with Cosmos-Reason2 (8.2B) + dedicated Diffusion Expert (2.3B), promptable text conditioning, AlpaSim closed-loop evaluation, 80K-hour training dataset
📝 Note on Model Type: Alpamayo is fundamentally different from pure large language models (LLMs) such as Qwen, Llama, or Gemma. It is a Vision-Language-Action (VLA) model — a new class of foundation model that perceives multi-camera video input, reasons over the physical world in natural language, and produces action outputs (vehicle trajectories) rather than text tokens. This hybrid architecture is explained in detail in the VLA vs LLM section.
📝 Note on Naming: “Alpamayo” is named after the Peruvian mountain peak Alpamayo, consistent with NVIDIA’s tradition of naming autonomous driving research projects after geographic landmarks. The “-R1” suffix denotes the first reasoning-enabled release, paralleling the “R1” naming convention popularized by DeepSeek-R1 to signal a reasoning-first design philosophy.
Key highlights across the Alpamayo family:
- Alpamayo-R1 (Dec 2025) — First open-source VLA with explicit Chain of Causation reasoning: vision → situation → rationale → trajectory; three model sizes (0.5B / 3B / 7B); diffusion-based trajectory decoder; 99ms inference on A100 (7B model)
- Alpamayo 1.5 (Mar 2026) — Upgraded to Cosmos-Reason2 (8.2B) + Diffusion Expert (2.3B); promptable text conditioning; trained on 80,000 hours / 1B+ images; AlpaSim closed-loop score 0.81; minADE 1.11m; supports 25+ countries of driving data; PhysicalAI Open Datasets released in parallel
- Dataset — Chain of Causation Dataset: 1,727 hours of annotated driving data spanning 2,500+ cities, open-sourced alongside model weights; expanded to 80,000 hours for 1.5 (PhysicalAI dataset, 25 countries)
- Hardware Integration — Targets NVIDIA DRIVE Thor, DRIVE Orin, H100 datacenter GPU; sits within the broader Cosmos world foundation model ecosystem
🔬 What is a VLA Model? — VLA vs LLM Architecture Comparison
Understanding Alpamayo requires first understanding how Vision-Language-Action models differ from conventional large language models. This is a fundamentally different computational paradigm.
The LLM Paradigm
A traditional large language model (LLM) such as GPT-4, Qwen, or LLaMA operates on a simple and elegant loop:
- Input: A sequence of discrete text tokens (integers indexing a vocabulary)
- Processing: A stack of Transformer decoder layers with self-attention and feed-forward networks
- Output: A probability distribution over the next text token (autoregressive generation)
The entire interface, input and output, is text. Multimodal LLMs (VLMs) such as LLaVA or Qwen-VL extend this paradigm by encoding images into token-like embeddings, but their output is still text tokens.
The VLA Paradigm
A Vision-Language-Action model such as Alpamayo operates on a fundamentally richer loop:
- Input: Raw sensor data from the physical world — specifically, multi-camera video frames from a surround-view camera rig
- Intermediate Processing (Language): A vision-language model reasons over the visual input and produces natural language descriptions of the scene and driving decisions
- Output: Not text tokens, but continuous action vectors — specifically, (x, y, heading) waypoints over time forming a vehicle trajectory
The key distinction is the action output modality. Alpamayo must bridge the gap between the discrete, symbolic world of language tokens and the continuous, physical world of vehicle kinematics. This is achieved through a diffusion-based trajectory decoder rather than a language model head.
Why Physical AI is Different from Language AI
Physical AI systems like Alpamayo face a set of constraints that are irrelevant to conversational LLMs:
- Hard real-time constraints: A vehicle traveling at 60 km/h moves ~1.7 meters in 100ms. Planning must complete within this window — there is no “thinking as long as needed” luxury.
- Multi-modal sensor fusion: Unlike text, camera data from 6 directions must be spatially coherent and temporally consistent across multiple frames before any reasoning begins.
- Continuous action spaces: LLMs output from a finite vocabulary (e.g., 150,000 tokens). A vehicle’s action space is continuous — any heading from 0–360°, any velocity from 0–130 km/h — requiring fundamentally different output representations.
- Causal safety requirements: A mistake in a language model produces incorrect text. A mistake in a VLA model can cause physical harm. This demands not just accuracy but interpretability — hence Chain of Causation reasoning.
- Distribution shift robustness: Language models can fail gracefully on unusual inputs. Autonomous vehicles must handle rare, safety-critical edge cases (emergency vehicles, unusual road conditions) where training data is scarce.
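The real-time constraint in the first bullet is simple arithmetic, and a minimal sketch makes the numbers concrete. The function name `distance_during_latency` is illustrative only and not part of any Alpamayo API.

```python
def distance_during_latency(speed_kmh: float, latency_ms: float) -> float:
    """Metres travelled at a constant speed while the planner is still computing."""
    speed_m_per_s = speed_kmh * 1000.0 / 3600.0   # km/h -> m/s
    return speed_m_per_s * (latency_ms / 1000.0)  # m/s * s -> m


# 60 km/h with a 100 ms planning cycle, the figures used in the text.
blind_distance_m = distance_during_latency(60.0, 100.0)
print(f"{blind_distance_m:.2f} m")  # prints "1.67 m"
```

The same function shows how quickly the margin shrinks at highway speeds: the distance grows linearly with both speed and latency.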
🌐 NVIDIA Physical AI Stack — Ecosystem Context
Alpamayo does not exist in isolation — it is a specialized component of NVIDIA’s broader physical AI foundation model ecosystem, the Cosmos platform.
[Diagram: NVIDIA hardware tiers. Surviving labels: Automotive SoC (automotive deployment); Current-gen SoC (254 TOPS); Datacenter GPU (development & training).]
The three pillars of the Cosmos ecosystem that directly support Alpamayo are:
- Cosmos World Foundation Model: A generative video model that can synthesize photo-realistic driving scenarios. This provides infinite synthetic training data for rare edge cases (e.g., debris on highway, emergency vehicles, unusual weather), dramatically reducing the need for expensive real-world data collection.
- Cosmos-Reason: A vision-language model specifically pre-trained on physical world data — not just internet text and images, but video of objects moving, interacting, and obeying physical laws. This provides Alpamayo’s VLM backbone with strong priors about how physical objects behave.
- NVIDIA DRIVE Platform: The end-to-end enterprise AV stack (software + hardware) into which Alpamayo is integrated. The DRIVE platform handles sensor interfaces, redundancy, safety monitoring, and integration with mapping, localization, and prediction modules.
📅 Version Release Timeline
📊 Cross-Version Benchmark Comparison
All numbers are for the Alpamayo-R1 and Alpamayo 1.5 releases. Open-loop: nuScenes planning benchmark. Closed-loop: AlpaSim. L2 Error = average displacement error. Sources: Alpamayo-R1 technical paper (NeurIPS 2025); Alpamayo 1.5 HF model card (Mar 2026).
| Benchmark / Metric | Alpamayo-0.5B | Alpamayo-3B | Alpamayo-7B | Alpamayo 1.5 (10B) |
|---|---|---|---|---|
| L2 Error @ 2s (m) | 0.42 | 0.31 | 0.21 | 0.18 |
| L2 Error @ 6s (m) | 1.87 | 1.43 | 0.98 | 0.82 |
| Collision Rate (%) | 2.1 | 1.4 | 0.8 | 0.5 |
| Reasoning Quality (ROUGE-L) | 0.41 | 0.53 | 0.67 | 0.74 |
| Inference Latency (ms) | 18 | 45 | 99 | ~120 |
| VRAM Required | ≥8 GB | ≥16 GB | ≥24 GB | ≥40 GB |
| Planning Frequency (Hz) | 10 | 10 | 10 | 10 |
| Trajectory Horizon (s) | 6 | 6 | 6 | 6 |
| AlpaSim Score | — | — | — | 0.81 ± 0.01 |
| minADE (m) | — | — | — | 1.11 |
| Training Data (hours) | 1,727 | 1,727 | 1,727 | 80,000 |
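
As a rough illustration of the displacement metrics in the table, the sketch below computes an average L2 displacement error up to a horizon and a minADE over several sampled trajectories. It assumes the common definitions (mean Euclidean distance per waypoint, and the minimum of that value over candidates). The exact evaluation protocol behind the numbers above is not specified here, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_l2_error(pred: Sequence[Point], truth: Sequence[Point],
                     horizon_s: float, hz: float = 10.0) -> float:
    """Mean Euclidean distance between predicted and true waypoints up to horizon_s."""
    n = min(int(round(horizon_s * hz)), len(pred), len(truth))
    if n == 0:
        raise ValueError("no waypoints inside the requested horizon")
    return sum(math.dist(p, t) for p, t in zip(pred[:n], truth[:n])) / n


def min_ade(candidates: Sequence[Sequence[Point]], truth: Sequence[Point],
            horizon_s: float, hz: float = 10.0) -> float:
    """Lowest average displacement error among several sampled trajectories."""
    return min(average_l2_error(c, truth, horizon_s, hz) for c in candidates)
```

A prediction that runs parallel to the ground truth with a constant 0.2 m lateral offset scores 0.2 m at every horizon, which is a useful sanity check when wiring up an evaluation script.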
Comparison with Prior Autonomous Driving Models
| Model | Architecture Type | Open Source | Reasoning | Latency | L2 @ 2s |
|---|---|---|---|---|---|
| UniAD | Pure end-to-end DL | Partial | No | 300ms | 0.57m |
| DriveLM | VLM-based | Yes | Yes | 850ms | 0.38m |
| DriveVLM | VLM-based | No | Yes | 1200ms | 0.31m |
| Alpamayo-R1 (7B) | VLA + Diffusion | Yes | Yes (CoC) | 99ms | 0.21m |
| Alpamayo 1.5 (10B) | VLA + Diffusion Expert | Yes | Yes (CoC v2) | ~120ms | 0.18m |
Latency measured on NVIDIA A100 80GB, except for Alpamayo 1.5, which was measured on H100. L2 error is open-loop planning error on the nuScenes validation set. "Partial" open source for UniAD means model weights are available without the full training code.
🏗️ Master Architecture Diagram
This diagram shows the full Alpamayo-R1 system architecture — from raw surround-view cameras to vehicle trajectory — with color-coded components indicating their role in the perception → reasoning → action pipeline.
🟢 Alpamayo-R1 — December 2025 — NeurIPS 2025
Summary
- World’s first open-source reasoning-driven Vision-Language-Action model for autonomous driving, released by NVIDIA Research at NeurIPS 2025 — establishing a new paradigm that combines the interpretability of language reasoning with the action-generation capability of end-to-end neural planners
- Introduces Chain of Causation (CoC) — a novel reasoning paradigm that explicitly links perceptual observations to situation assessments to action rationales to trajectory predictions, going beyond reactive end-to-end prediction to produce fully interpretable driving decisions
- Built on Cosmos-Reason, NVIDIA’s vision-language model pre-trained for physical world understanding, which provides a strong foundation for understanding physical causality — not just visual appearance — in driving scenarios
- Processes surround-view camera input from 6 directions (front, front-left, front-right, back, back-left, back-right) across multiple timesteps, giving the model complete 360° situational awareness and temporal context for understanding dynamic scenes
- Generates driving trajectories through a diffusion-based decoder that produces dynamically feasible 6-second waypoint sequences at 10Hz, handling multi-modal trajectory distributions (e.g., deciding whether to turn left or go straight at an ambiguous intersection) with principled probabilistic sampling
- Released at three parameter scales — 0.5B (edge deployment), 3B (efficient inference), 7B (best performance) — enabling deployment across the full range of automotive compute platforms from embedded DRIVE Orin to datacenter H100
- Achieves industry-leading performance on nuScenes autonomous driving benchmarks: 0.21m L2 error at 2 seconds and 0.8% collision rate for the 7B model, outperforming all prior open-source autonomous driving models while maintaining real-time 99ms latency
- Open-sources both model weights and the Chain of Causation Dataset — 1,727 hours of annotated driving data across 2,500+ cities worldwide — representing the largest openly released reasoning-annotated autonomous driving dataset to date
- Training pipeline incorporates three stages: (1) VLM pretraining on Cosmos-Reason for physical world priors, (2) supervised fine-tuning on Chain of Causation driving data, (3) reinforcement learning to align reasoning traces with trajectory quality — making it the first AV model to use RL for reasoning-action consistency
- Demonstrates 12× latency advantage over prior VLM-based planners (99ms vs 1200ms for DriveVLM) while matching or exceeding their accuracy, achieved through architectural innovations in the diffusion decoder and efficient visual tokenization
- The Chain of Causation reasoning traces are human-interpretable and have been validated as useful for post-incident analysis and safety auditing — a critical requirement for regulatory approval of autonomous vehicles in major markets
- Designed for integration with the NVIDIA DRIVE platform including DRIVE Orin and the next-generation DRIVE Thor automotive compute platforms, with a quantization-aware training path targeting INT8 inference on embedded hardware
Architecture Component 1: Multi-Camera, Multi-Timestep Vision Encoder
The vision encoder is the sensory front-end of Alpamayo, responsible for converting raw pixel data from a surround-view camera rig into a compact, semantically rich token sequence that the VLM backbone can reason over.
Camera Configuration:
A standard automotive sensor rig contains 6 cameras arranged to provide 360° coverage around the vehicle. Each camera covers a specific field of view and overlapping regions provide redundancy:
- Front camera: 120° FoV, primary forward perception, traffic light detection
- Front-left / Front-right cameras: ~70° FoV each, critical for intersection navigation and lane changes
- Back camera: 120° FoV, reversing and rear traffic awareness
- Back-left / Back-right cameras: ~70° FoV each, blind spot monitoring and parking
Temporal Encoding:
Unlike a single-frame vision encoder, Alpamayo processes multiple consecutive timesteps of camera data. The default configuration processes 4 timesteps at 250ms intervals (roughly 1 second of history; 750ms separate the first and last frame), giving the model crucial temporal context for:
- Estimating the velocity and acceleration of surrounding vehicles
- Understanding the trajectory of pedestrians and cyclists
- Detecting traffic light state changes
- Disambiguating stationary from slow-moving objects
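To make the value of temporal context concrete: with 4 frames spaced 250ms apart, even simple finite differences over an object's tracked positions yield velocity and acceleration estimates. This is only an illustration of why multiple timesteps matter. The model learns such cues implicitly, and the tracked positions below are made-up inputs.

```python
from typing import List, Tuple

FRAME_INTERVAL_S = 0.25  # 250 ms between consecutive frames (default configuration)


def finite_difference(values: List[float], dt: float) -> List[float]:
    """First-order differences divided by the time step."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def estimate_motion(positions_m: List[float]) -> Tuple[float, float]:
    """Return (latest velocity in m/s, latest acceleration in m/s^2) along one axis."""
    if len(positions_m) < 3:
        raise ValueError("need at least 3 frames to estimate acceleration")
    velocities = finite_difference(positions_m, FRAME_INTERVAL_S)
    accelerations = finite_difference(velocities, FRAME_INTERVAL_S)
    return velocities[-1], accelerations[-1]


# Hypothetical longitudinal positions (metres) of a lead vehicle over 4 frames.
velocity, acceleration = estimate_motion([20.0, 22.0, 23.5, 24.5])
```

With a single frame neither quantity is observable, and with two frames only velocity is, which is one reason a short history window is useful for telling a braking vehicle from a cruising one.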
Visual Tokenization Pipeline:
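- Each camera image (typically 1600×900 pixels) is divided into non-overlapping patches of 16×16 pixels. Because 900 is not a multiple of 16, this yields roughly 5,600 patches per camera image (100 × 56 full patches if the partial bottom row is cropped, or 5,700 if it is padded; the area ratio 1600×900/256 = 5,625 does not correspond to an integer grid)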
- A Vision Transformer (ViT) encoder processes each camera’s patches independently, producing per-patch feature embeddings
- A learnable cross-camera attention module allows features from different cameras to interact, enabling the model to reason about objects that appear in multiple camera views and about spatial relationships across the vehicle’s full surround field
- A temporal attention module aggregates features across timesteps, producing temporally-aware embeddings that encode how the scene has changed
- The resulting multi-camera, multi-timestep features are flattened and projected into the VLM’s token embedding space, producing a sequence of visual tokens consumed by the Cosmos-Reason backbone
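The patch arithmetic in the first step can be checked with a short sketch. Note that 900 is not a multiple of 16, so the per-image count depends on whether the partial bottom row is cropped or padded. Which option Alpamayo uses is not stated here, and the helper names are hypothetical.

```python
def patch_grid(width: int, height: int, patch: int = 16, pad: bool = False):
    """Number of patch columns and rows for one image.

    With pad=False, partial patches at the border are dropped (cropping).
    With pad=True, the image is padded up to the next multiple of `patch`.
    """
    if pad:
        cols = -(-width // patch)   # ceiling division
        rows = -(-height // patch)
    else:
        cols = width // patch
        rows = height // patch
    return cols, rows


def raw_patch_count(width: int, height: int, cameras: int = 6, frames: int = 4,
                    patch: int = 16, pad: bool = False) -> int:
    """Patch embeddings across all cameras and frames, before any pooling or compression."""
    cols, rows = patch_grid(width, height, patch, pad)
    return cols * rows * cameras * frames
```

For 1600×900 images this gives a 100 × 56 grid when cropping and 100 × 57 when padding. Across 6 cameras and 4 frames the raw count exceeds 130,000 patch embeddings, which suggests why the cross-camera and temporal aggregation steps, and some form of token reduction, matter for real-time inference.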
This tokenization approach is inspired by the BEV (Bird’s Eye View) representation paradigm popular in autonomous driving, but instead of explicitly constructing a spatial BEV map, the cross-camera attention implicitly learns to aggregate spatial information — a more flexible and scalable approach.
Architecture Component 2: Cosmos-Reason VLM Backbone
The Cosmos-Reason VLM backbone is the cognitive core of Alpamayo — the component that transforms raw visual tokens into structured reasoning about the driving scene.
Cosmos-Reason Pre-training:
Cosmos-Reason is a vision-language model pre-trained by NVIDIA on a curated dataset of physical world video paired with causal explanations. Unlike standard VLMs trained on image-caption pairs from the internet, Cosmos-Reason’s training data emphasizes:
- Physical causality: explanations of why objects move the way they do
- Temporal dynamics: understanding of motion, velocity, and trajectories
- Spatial reasoning: 3D layout understanding from 2D images
- Predictive reasoning: what will happen next given the current state
This pre-training produces a VLM with unusually strong priors about physical world behavior — a key advantage over adapting internet-trained LLMs (like LLaMA or Qwen) for driving applications.
Architecture Details:
The Cosmos-Reason backbone follows a standard transformer decoder architecture adapted for VLA:
- Normalization: RMSNorm with pre-normalization, consistent with modern LLM best practices
- Attention: Grouped Query Attention (GQA) for efficient KV cache usage during inference — critical for the 10Hz real-time planning requirement
- Position encoding: Rotary Positional Embeddings (RoPE) applied to text tokens; visual tokens use learned positional encodings with spatial bias
- Feed-forward: SwiGLU activation function, matching the configuration of Cosmos-Reason’s pre-training
- Token mixing: Visual tokens and text tokens (including system prompt, map information, and prior CoC trace) are concatenated and processed jointly through the transformer layers
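Two of the components listed above are compact enough to sketch in plain Python: RMSNorm and the head-sharing rule behind Grouped Query Attention. This follows the standard textbook definitions, not Alpamayo's actual code, and the head counts in the usage note are hypothetical.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Grouped Query Attention: consecutive query heads share one key/value head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size
```

With, say, 32 query heads and 8 key/value heads, every group of 4 query heads reads the same cached keys and values, so the KV cache is 4× smaller than with full multi-head attention. That saving is what makes GQA attractive for a latency-bound generation loop.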
Input Sequence Structure:
The VLM backbone receives a mixed token sequence:
[System Prompt] [Map Context Tokens] [Visual Tokens (6 cams × T frames)] [Prior CoC Trace] → [CoC Output Tokens]
The system prompt encodes static context such as the ego vehicle’s goal destination, speed limit, and current map lane graph. Map context tokens encode a compressed representation of the HD map around the vehicle, providing road topology information the model needs for planning.
Architecture Component 3: Chain of Causation Reasoning Paradigm
Chain of Causation (CoC) is Alpamayo’s most novel contribution — a structured reasoning paradigm specifically designed for physical AI systems that must make safety-critical decisions.
Motivation: Why Reasoning Matters for Physical AI
Traditional end-to-end autonomous driving models (like UniAD) make trajectory predictions directly from sensor inputs without any intermediate reasoning. This creates two critical problems:
- Black-box decisions: When an end-to-end model makes an error (e.g., fails to yield to a pedestrian), it is nearly impossible to determine why: was it a perception failure? A planning failure? A representation issue? This makes debugging and safety auditing extremely difficult.
- Poor generalization to rare events: End-to-end models struggle with long-tail scenarios (e.g., a mattress falling off a truck) because they can’t leverage common-sense knowledge about what the object is and how to respond. A model that can reason “I see an unusual object blocking the lane → this may be road debris → I should slow down and change lanes” can handle such scenarios without having seen them in training.
Chain of Causation addresses both problems by requiring the model to produce an explicit, structured natural language trace of its reasoning before generating a trajectory.
CoC Structure:
Each Chain of Causation trace consists of four mandatory components:
- Perception Summary: A natural language description of the current scene — what objects are present, where they are, and how they are moving.
- Example: “I observe: (1) a red traffic light 40m ahead, (2) a cyclist in the right lane 15m ahead moving at ~15 km/h, (3) three vehicles stopped at the intersection, (4) clear left lane.”
- Situation Assessment: An interpretation of the current scene in terms of driving context — what are the relevant constraints, risks, and opportunities?
- Example: “Situation: The red traffic light requires me to stop. The cyclist in the right lane is moving slower than me and may impede my path if I continue straight. The intersection is occupied and safe gap is unavailable.”
- Action Rationale: A causal explanation linking the situation assessment to the chosen driving action.
- Example: “Decision: I should decelerate smoothly due to the red light. Stopping 3m behind the cyclist provides safe following distance. Changing lanes is unnecessary as I will stop regardless.”
- Trajectory Intent: A high-level description of the intended trajectory that the diffusion decoder will then realize in continuous waypoints.
- Example: “Trajectory: Decelerate from 40 km/h to 0 km/h over 4 seconds, maintaining current lane, stopping 3m behind the cyclist.”
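The four-part structure maps naturally onto a small record type. The sketch below is one possible in-memory representation, assuming plain strings for each part. The class and field names are illustrative and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoCTrace:
    """Container for the four mandatory CoC components (field names are illustrative)."""
    perception_summary: str
    situation_assessment: str
    action_rationale: str
    trajectory_intent: str

    def validate(self) -> None:
        """All four parts are mandatory, so reject empty ones."""
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"missing CoC component: {name}")

    def render(self) -> str:
        """Serialize in the fixed P -> S -> A -> T order."""
        self.validate()
        return "\n".join([
            self.perception_summary,
            self.situation_assessment,
            self.action_rationale,
            self.trajectory_intent,
        ])
```

Keeping the order fixed in `render` mirrors the causal direction of the paradigm: each part is written only after the parts it depends on.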
CoC vs LLM Chain-of-Thought:
While superficially similar to LLM chain-of-thought (CoT) reasoning, CoC has important distinctions:
| Dimension | LLM Chain-of-Thought | Chain of Causation (CoC) |
|---|---|---|
| Purpose | Improve text answer accuracy | Ground trajectory prediction |
| Structure | Free-form | 4-part structured (P→S→A→T) |
| Grounding | Symbolic / abstract | Physical world causality |
| Output | Better text tokens | Better continuous waypoints |
| Evaluation | Answer correctness | Trajectory L2 error + collision rate |
| RL signal | Human preference | Trajectory quality + safety |
CoC Training with Reinforcement Learning:
A key insight in Alpamayo-R1 is that simply training the model to produce high-quality CoC traces (via SFT) is insufficient — the traces must also be consistent with the subsequent trajectory. A model could produce a plausible-sounding CoC trace but then generate a trajectory that doesn’t follow its stated intent.
To address this, Alpamayo uses RL to train the model to maximize the consistency between its CoC trace and its trajectory prediction. The RL reward has two components:
- Trajectory quality reward: L2 error and collision rate against ground-truth expert trajectories
- CoC-trajectory alignment reward: A learned discriminator that scores how well the trajectory matches the stated intent in the CoC trace
This RL training stage is the third and final stage of Alpamayo’s training pipeline and is critical for achieving the model’s strong real-world performance.
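[Flowchart: Chain of Causation reasoning. Stage labels: visual → scene understanding; scene → driving context; context → decision; intent → continuous action. Output waypoints: (x₁, y₁, θ₁), (x₂, y₂, θ₂), …, (x₅, y₅, θ₅), …, (x₆₀, y₆₀, θ₆₀).]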
Architecture Component 4: Diffusion-Based Trajectory Decoder
The diffusion-based trajectory decoder is the action output module that converts the VLM’s reasoning embeddings into continuous vehicle trajectories. This is one of the most technically novel aspects of Alpamayo.
Why Diffusion for Trajectory Prediction?
Traditional autonomous driving planners use one of three approaches for trajectory generation:
- Regression head: Directly predict waypoints as the mean of a Gaussian — fails to capture multi-modal distributions (e.g., at a fork in the road)
- Classification over anchors: Discretize the action space and classify over pre-defined trajectory templates — loses precision and fails for unseen road geometries
- Sampling from a learned distribution: The principled approach — model the full distribution over possible trajectories
Diffusion models are a natural way to implement the third approach. By learning to turn Gaussian noise into feasible trajectories through iterative denoising (conditioned on scene context and CoC reasoning), the diffusion decoder:
- Naturally captures multi-modal distributions — different modes correspond to different driving decisions (turn left vs. go straight)
- Generates dynamically feasible trajectories — because the training data consists of real vehicle trajectories, the diffusion model learns the kinematic constraints implicitly
- Provides uncertainty quantification — by running multiple denoising chains, the model can produce multiple trajectory samples, with the spread indicating uncertainty
- Allows conditioning on CoC reasoning — the CoC trace embeddings serve as a conditioning signal that steers the diffusion process toward trajectories consistent with the stated intent
Diffusion Decoder Architecture:
The trajectory diffusion decoder follows the DDPM (Denoising Diffusion Probabilistic Model) framework with modifications for real-time inference:
- Trajectory representation: A trajectory is represented as a sequence of 60 waypoints, each consisting of (x, y, heading θ) in ego-vehicle coordinates, covering 6 seconds at 10Hz
- Forward process: During training, clean ground-truth trajectories are corrupted by adding Gaussian noise over T=100 timesteps: $\tau_t = \sqrt{\bar{\alpha}_t}\,\tau_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
- Reverse process (denoising): A learned U-Net-style denoiser predicts the noise $\epsilon_\theta(\tau_t, t, c)$ at each diffusion step, where $c$ are the conditioning features from the CoC embeddings
- Conditioning: The CoC feature vector (extracted from the last hidden state of the VLM backbone’s final layer) is injected into every layer of the denoiser via cross-attention, ensuring trajectory generation is fully conditioned on the model’s reasoning
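The forward (noising) process described above can be sketched in a few lines of plain Python. The linear beta schedule and its endpoints are assumptions chosen for illustration, since the schedule Alpamayo actually uses is not given here. The trajectory is flattened to 60 × 3 = 180 numbers.

```python
import math
import random
from typing import List, Tuple

T = 100  # number of diffusion timesteps, as stated in the text


def alpha_bar_schedule(num_steps: int = T, beta_start: float = 1e-4,
                       beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(tau_0: List[float], t: int, alpha_bars: List[float],
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Sample tau_t = sqrt(abar_t) * tau_0 + sqrt(1 - abar_t) * eps and return (tau_t, eps)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in tau_0]
    tau_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(tau_0, eps)]
    return tau_t, eps


# A flattened trajectory: 60 waypoints x (x, y, heading) = 180 numbers.
clean = [0.0] * 180
noisy, eps = add_noise(clean, t=T - 1, alpha_bars=alpha_bar_schedule(), rng=random.Random(0))
```

During training, the denoiser receives the noisy trajectory, the timestep, and the conditioning features, and is asked to predict `eps`. The returned `eps` is therefore the regression target.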
Real-Time Inference via DDIM:
The standard DDPM reverse process requires 100 denoising steps, which would be far too slow for real-time planning. Alpamayo uses DDIM (Denoising Diffusion Implicit Models) sampling, which achieves comparable quality with only 10 denoising steps — reducing the diffusion decoder’s latency contribution from ~90ms to ~9ms.
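A minimal sketch of deterministic DDIM sampling (eta = 0) over a strided subset of timesteps is shown below. It assumes a linear beta schedule and takes the denoiser as a plain callable. Both are stand-ins for illustration, not Alpamayo's implementation.

```python
import math
from typing import Callable, List


def alpha_bar_schedule(num_steps: int = 100) -> List[float]:
    """Cumulative alpha products for a linear beta schedule (an assumed schedule)."""
    out, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (num_steps - 1))
        out.append(running)
    return out


def ddim_timesteps(num_train_steps: int = 100, num_sample_steps: int = 10) -> List[int]:
    """Evenly strided subset of training timesteps, visited from noisy to clean."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))


def ddim_sample(x_t: List[float],
                predict_noise: Callable[[List[float], int], List[float]],
                alpha_bars: List[float], steps: List[int]) -> List[float]:
    """Deterministic DDIM (eta = 0) reverse pass over the chosen timesteps."""
    x = list(x_t)
    for i, t in enumerate(steps):
        a_t = alpha_bars[t]
        # After the last step we land on the clean sample, where alpha_bar is 1.
        a_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = predict_noise(x, t)
        # Estimate the clean sample, then re-noise it to the previous timestep.
        x0 = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * e for x0i, e in zip(x0, eps)]
    return x
```

With 10 of 100 timesteps the denoiser is called 10 times instead of 100. If per-step cost is roughly constant, this matches the order-of-magnitude saving quoted above. A handy correctness check is an oracle denoiser that always returns the true noise: the sampler then recovers the clean trajectory up to floating-point error.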
The total 99ms inference budget for the 7B model breaks down approximately as:
- Vision encoder: ~25ms
- VLM backbone (autoregressive CoC generation): ~55ms
- Diffusion decoder (10 DDIM steps): ~9ms
- Pre/post processing: ~10ms
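The budget can be sanity-checked against the 10Hz planning rate with a few lines. The stage names below are just dictionary keys chosen for this sketch.

```python
# Approximate per-stage latencies for the 7B model, in milliseconds (from the list above).
LATENCY_BUDGET_MS = {
    "vision_encoder": 25,
    "vlm_backbone": 55,
    "diffusion_decoder": 9,
    "pre_post_processing": 10,
}


def total_latency_ms(budget: dict) -> int:
    """Sum of all stage latencies."""
    return sum(budget.values())


def meets_planning_rate(budget: dict, planning_hz: float = 10.0) -> bool:
    """True if one full inference fits inside a single planning period."""
    period_ms = 1000.0 / planning_hz
    return total_latency_ms(budget) <= period_ms
```

The stages sum to 99ms against a 100ms period, so the 7B model fits with about 1ms of headroom. The VLM backbone accounts for more than half of the total, which makes it the obvious target for further optimization.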
Trajectory Feasibility:
The diffusion decoder is trained with an auxiliary kinematic feasibility loss that penalizes trajectories violating vehicle physics constraints:
- Maximum acceleration and deceleration limits
- Maximum steering angle (curvature) constraints
- Maximum jerk (rate of change of acceleration) for passenger comfort
- Minimum stopping distance based on current velocity
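One way such a penalty could be computed is with finite differences over the waypoint sequence. The sketch below covers only the acceleration, deceleration, and jerk terms. Curvature and stopping distance are omitted for brevity. The numeric limits are hypothetical placeholders, not values from the paper.

```python
import math
from typing import List, Tuple

DT = 0.1  # seconds between waypoints at 10 Hz


def _diff(values: List[float], dt: float = DT) -> List[float]:
    """First-order finite differences."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def feasibility_penalty(waypoints: List[Tuple[float, float]],
                        max_accel: float = 3.0,   # m/s^2, hypothetical limit
                        max_decel: float = 6.0,   # m/s^2, hypothetical limit
                        max_jerk: float = 5.0     # m/s^3, hypothetical limit
                        ) -> float:
    """Sum of squared limit violations (0.0 means feasible under these limits)."""
    speeds = [math.dist(p, q) / DT for p, q in zip(waypoints, waypoints[1:])]
    accels = _diff(speeds)
    jerks = _diff(accels)
    penalty = 0.0
    for a in accels:
        penalty += max(0.0, a - max_accel) ** 2 + max(0.0, -a - max_decel) ** 2
    for j in jerks:
        penalty += max(0.0, abs(j) - max_jerk) ** 2
    return penalty
```

A constant-speed straight line scores 0.0, while a trajectory with a sudden positional jump is penalized heavily. In training, a differentiable version of this quantity would be added to the diffusion loss with some weight.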
This ensures that even under distribution shift (unusual road scenarios), the generated trajectories remain physically executable by the vehicle’s low-level controller.
Training Pipeline
Alpamayo-R1 is trained in three sequential stages, each building on the previous:
[Diagram: three-stage training pipeline. Stage 3 reward annotations: R_trajectory = -(L2_error) - α·(collision_rate); R_alignment = D(CoC_embedding, trajectory), computed by a learned discriminator D.]
Stage 1 — Cosmos-Reason VLM Pre-training:
Before any driving-specific training, the Cosmos-Reason backbone is pre-trained by NVIDIA on a massive dataset of physical world video paired with causal explanations. This stage is not performed by researchers reproducing Alpamayo — the Cosmos-Reason weights are provided as a pre-trained checkpoint. The pre-training emphasizes:
- Object permanence and physical continuity across video frames
- Causal relationships between events (“the ball fell because it was hit”)
- Motion prediction (“the car will turn left because its blinker is on”)
- Spatial layout reasoning from monocular images
Stage 2 — Supervised Fine-Tuning on CoC Driving Data:
In Stage 2, the full Alpamayo model (VLM backbone + vision encoder + diffusion decoder) is jointly trained on the Chain of Causation Dataset. For each training example:
- Input: 6-camera multi-frame image sequence + HD map context + system prompt
- Target (VLM): The 4-part Chain of Causation reasoning trace (annotated by expert human drivers with LLM assistance)
- Target (Diffusion): The ground-truth expert trajectory driven by the human
The vision encoder and diffusion decoder are trained from scratch in this stage; the VLM backbone is fine-tuned from the Cosmos-Reason checkpoint. Training uses a combined loss: cross-entropy on CoC tokens + diffusion DDPM loss on trajectory.
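The combined Stage 2 objective can be written down as a small sketch: token-level cross-entropy on the CoC trace plus a noise-prediction loss on the trajectory. The relative weight between the two terms is an assumption (the text does not state one), and the functions operate on plain lists for clarity.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of target token ids under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def ddpm_noise_loss(predicted_noise: List[float], true_noise: List[float]) -> float:
    """Mean squared error between predicted and actual noise (the usual DDPM objective)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)


def stage2_loss(logits: List[List[float]], targets: List[int],
                predicted_noise: List[float], true_noise: List[float],
                trajectory_weight: float = 1.0) -> float:
    """CoC token loss plus weighted trajectory diffusion loss (the weight is an assumption)."""
    return (cross_entropy(logits, targets)
            + trajectory_weight * ddpm_noise_loss(predicted_noise, true_noise))
```

Because both terms are computed from the same forward pass, gradients from the trajectory loss also reach the VLM backbone through the conditioning features. This is how joint training ties the two outputs together at this stage.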
Stage 3 — Reinforcement Learning for Reasoning-Action Consistency:
Stage 3 uses proximal policy optimization (PPO) to refine the model’s behavior based on reward signals that capture both trajectory quality and the alignment between stated reasoning and executed trajectory. The key insight motivating this stage is that SFT alone trains the model to produce plausible-looking reasoning and plausible trajectories separately — but doesn’t enforce that they must be consistent with each other.
The RL stage starts from the Stage 2 checkpoint and optimizes:
- Trajectory reward: Measured against withheld expert trajectories using L2 displacement error and collision rate
- Alignment reward: A separately trained discriminator model that takes (CoC text, trajectory) pairs and scores how consistent they are — high score when the trajectory matches the stated intent, low score when they contradict
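A minimal sketch of how the two reward terms might be combined is shown below. How Alpamayo actually weights them is not stated here, so `alpha` and `beta` are free parameters, the per-sample collision term is modeled as a 0/1 indicator, and `toy_discriminator` is a hand-written stand-in for the learned discriminator.

```python
from typing import Callable, List


def trajectory_reward(l2_error_m: float, collided: bool, alpha: float = 1.0) -> float:
    """R_trajectory = -(L2 error) - alpha * (collision indicator)."""
    return -l2_error_m - alpha * (1.0 if collided else 0.0)


def total_reward(l2_error_m: float, collided: bool, coc_text: str, speeds: List[float],
                 discriminator: Callable[[str, List[float]], float],
                 alpha: float = 1.0, beta: float = 1.0) -> float:
    """Trajectory reward plus beta times the discriminator's consistency score."""
    alignment = discriminator(coc_text, speeds)
    return trajectory_reward(l2_error_m, collided, alpha) + beta * alignment


def toy_discriminator(coc_text: str, speeds: List[float]) -> float:
    """Stand-in for the learned discriminator: compares a stated stop with the final speed."""
    says_stop = "stop" in coc_text.lower()
    ends_stopped = abs(speeds[-1]) < 0.1  # speeds in m/s along the planned trajectory
    return 1.0 if says_stop == ends_stopped else 0.0
```

A trace that says "stop behind the cyclist" paired with a speed profile ending at 0 m/s earns the alignment bonus. The same trace paired with a profile that keeps cruising does not, even if its L2 error happens to be small.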
Chain of Causation Dataset
The Chain of Causation Dataset is a major research contribution released alongside the Alpamayo-R1 model weights.
Scale and Coverage:
| Attribute | Value |
|---|---|
| Total Duration | 1,727 hours |
| Number of Cities | 2,500+ |
| Countries Covered | 40+ |
| Camera Views per Clip | 6 (surround-view) |
| Annotation Type | CoC reasoning trace + trajectory |
| Annotation Method | Expert annotation + LLM assistance + human review |
| Clip Duration | 8–30 seconds |
| Frame Rate | 10 Hz (annotations) |
| Ego Trajectory Format | (x, y, heading) in ego coordinates |
| License | Research non-commercial |
Geographic Diversity:
The dataset was specifically curated to cover a wide range of driving environments:
- Urban environments: Dense city centers, complex intersections, pedestrian zones
- Suburban environments: Residential streets, school zones, parking lots
- Highway environments: Merging, lane changes, high-speed following
- Rural environments: Country roads, gravel paths, unmarked intersections
- Adverse conditions: Rain, fog, night driving, glare, construction zones
- Geographic diversity: North America, Europe, Asia, Middle East — covering right-hand and left-hand traffic
Annotation Pipeline:
Raw driving data was collected by NVIDIA’s fleet of research vehicles equipped with 6-camera surround-view rigs. The annotation pipeline:
- Automated pre-annotation: A large VLM (GPT-4V) generates initial CoC traces for each clip based on the camera images and ego trajectory
- Expert review and correction: Human expert drivers review and correct the automated annotations, focusing on safety-critical scenarios where the automated pre-annotation may contain errors
- Quality filtering: Clips where human reviewers disagreed are escalated for consensus or discarded — only high-agreement annotations are included
- Diversity sampling: Systematic sampling ensures geographic, scenario-type, and weather-condition diversity — preventing over-representation of common easy scenarios
Rare Event Augmentation:
A known challenge in autonomous driving datasets is the underrepresentation of rare but safety-critical events (near-misses, emergency vehicle encounters, unusual road debris). The dataset includes a dedicated “long-tail” subset of 127 hours of such events, collected through targeted data collection campaigns and augmented using Cosmos World Foundation Model synthetic generation.
Deployment & Hardware Requirements
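Alpamayo-R1 is designed for deployment on NVIDIA hardware across a range of compute tiers, from embedded DRIVE Orin to datacenter H100 (see the Model Variants table for per-variant targets).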
Software Requirements:
CUDA >= 12.1
Python >= 3.10
PyTorch >= 2.2.0
transformers >= 4.40.0
diffusers >= 0.27.0
NVIDIA TensorRT >= 9.0 (for optimized inference)
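A small helper can compare installed versions against these minimums. The sketch below parses version strings by hand so that it needs nothing beyond the standard library. The installed version string could be obtained with `importlib.metadata.version("torch")`, for example. The helper names are hypothetical.

```python
from typing import Tuple

# Minimum versions copied from the requirements list above.
MINIMUM_VERSIONS = {
    "python": "3.10",
    "torch": "2.2.0",
    "transformers": "4.40.0",
    "diffusers": "0.27.0",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.2.0+cu121' into (2, 2, 0), ignoring local or pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """Numeric comparison, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(a) >= pad(b)
```

Comparing numerically matters here: as plain strings, "3.9" sorts after "3.10", which would wrongly accept Python 3.9.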
Inference Optimization:
For production deployment, NVIDIA provides TensorRT engine conversion scripts in the NVlabs/alpamayo repository. Key optimizations:
- Vision encoder: Compiled to TensorRT with static input shapes (per camera configuration)
- VLM backbone: Quantized to INT8 using TensorRT-LLM with KV cache management for the CoC generation loop
- Diffusion decoder: The 10 DDIM denoising steps are compiled as a single fused TensorRT graph, avoiding Python overhead per step
- End-to-end pipeline: CUDA graphs capture the full forward pass, minimizing CPU-GPU synchronization overhead
With all optimizations applied on NVIDIA DRIVE Thor, the 3B model achieves <50ms end-to-end latency, comfortably meeting the 100ms real-time planning requirement.
Architecture Diagram — Alpamayo-R1 Full System
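[Diagram: Alpamayo-R1 full system. Surviving annotations: the VLM backbone generates the 4-part reasoning trace autoregressively; the diffusion decoder runs 10 DDIM steps → 60 trajectory waypoints.]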
Chain of Causation Flowchart
See the Chain of Causation Reasoning Paradigm section above for the full reasoning flowchart diagram.
Community Perspective
- Landmark open-source release: The AV research community received Alpamayo-R1 as the most significant open-source release in autonomous driving since nuScenes, combining a strong model with a large annotated dataset — lowering the barrier to entry for academic VLA research dramatically
- Chain of Causation as a new standard: Many in the safety community argued that CoC-style explicit reasoning should be a regulatory requirement for autonomous vehicles, and Alpamayo-R1 was cited as the first production-viable demonstration that real-time explicit reasoning is achievable (99ms)
- Skepticism about generalization: Critics noted that nuScenes is a relatively controlled benchmark and that real-world deployment requires testing across a far wider distribution of scenarios, adverse weather conditions, and regulatory environments — the 2,500 cities dataset helps but doesn’t fully address this concern
- NVIDIA ecosystem lock-in debate: Some in the open-source community noted that full performance requires NVIDIA hardware (TensorRT optimization, DRIVE platform integration), raising questions about portability to other automotive silicon vendors
- Diffusion decoder as a breakthrough: Robotics researchers outside AV were particularly excited about the diffusion trajectory decoder as a technique applicable to robot manipulation, drone navigation, and other physical AI domains — Alpamayo’s approach was quickly adapted in subsequent robotics papers
- Dataset quality praised: Independent evaluators noted that the CoC dataset’s annotation methodology — combining LLM pre-annotation with human expert review — struck a good balance between scale and annotation quality, and the 127-hour long-tail subset was specifically called out as a major contribution
- RL training transparency: The paper’s detailed description of the RL training stage for CoC-trajectory alignment was praised as one of the first transparent accounts of how to train VLA models with RL — filling a significant gap in the literature
Model Variants
| Model | Parameters | VLM Backbone | Vision Encoder | VRAM (FP16) | Latency (A100) | Target Platform | License |
|---|---|---|---|---|---|---|---|
| Alpamayo-0.5B | 0.5B | Cosmos-Reason-0.5B | ViT-S/16 | 8 GB | 18ms | DRIVE Orin, edge | Non-commercial |
| Alpamayo-3B | 3B | Cosmos-Reason-3B | ViT-B/16 | 16 GB | 45ms | DRIVE Thor, RTX 4090 | Non-commercial |
| Alpamayo-7B | 7B | Cosmos-Reason-7B | ViT-L/16 | 24 GB | 99ms | A100 / H100 | Non-commercial |
All variants share the same diffusion trajectory decoder architecture and Chain of Causation reasoning structure. The differences are in the VLM backbone size and vision encoder capacity. All variants output 60-waypoint trajectories at 10Hz over a 6-second planning horizon.
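The waypoint count and the latency figures in the table can be checked with a few lines of arithmetic. The sketch below assumes the first waypoint sits one sample period ahead of the current time (the source does not say whether t = 0 is included) and treats one 10Hz period as the planning budget.

```python
# A100 latency figures copied from the Model Variants table, in milliseconds.
LATENCY_MS = {"Alpamayo-0.5B": 18, "Alpamayo-3B": 45, "Alpamayo-7B": 99}

def waypoint_times(horizon_s=6.0, rate_hz=10.0):
    """Timestamps (seconds ahead of now) for each predicted waypoint."""
    count = round(horizon_s * rate_hz)
    return [(i + 1) / rate_hz for i in range(count)]

def headroom_ms(latency_ms, rate_hz=10.0):
    """Time left in one planning period after inference finishes."""
    return 1000.0 / rate_hz - latency_ms

times = waypoint_times()
print(len(times), times[0], times[-1])
for name, latency in LATENCY_MS.items():
    print(name, headroom_ms(latency), "ms of headroom")
```

On these numbers the 7B variant leaves only about 1ms of slack inside a 100ms period on an A100, which is why the smaller variants are the ones targeted at in-vehicle platforms.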
Key Industry Ideas Incorporated
| Technique | Origin | How Alpamayo-R1 Used It |
|:----------|:-------|:------------------------|
| Diffusion Trajectory Prediction | Ho et al. DDPM (NeurIPS 2020); Song et al. DDIM (ICLR 2021) | DDPM-style denoiser for trajectory generation; DDIM for 10-step real-time inference |
| Vision-Language Models for Driving | DriveLM (CVPR 2024); DriveVLM (2024) | Extended VLM approach with diffusion action head rather than text trajectory output |
| Chain-of-Thought Reasoning | Wei et al. (NeurIPS 2022) | Adapted as Chain of Causation with physical causality structure for driving decisions |
| Reinforcement Learning for Reasoning | DeepSeek-R1 (2025); OpenAI o1 (2024) | RL to align VLM reasoning traces with trajectory prediction quality |
| Grouped Query Attention (GQA) | Ainslie et al. EMNLP 2023 | GQA in VLM backbone for efficient KV cache management during real-time CoC generation |
| RoPE | Su et al. RoFormer (2021) | Positional encoding for text tokens in the VLM backbone |
| SwiGLU | Dauphin et al. (2017); PaLM (2022) | Feed-forward activation in Cosmos-Reason backbone |
| RMSNorm | Zhang & Sennrich (2019) | Pre-normalization for stable training of the VLM backbone |
| Surround-View Camera Fusion | BEVFormer (ECCV 2022); Tesla FSD | Cross-camera attention for 360° scene understanding without explicit BEV projection |
| End-to-End Autonomous Driving | UniAD (CVPR 2023) | Inspiration for unified perception-to-planning pipeline; CoC adds interpretability layer |
| ViT Vision Encoder | Dosovitskiy et al. ICLR 2021 | Per-camera patch encoding in the vision encoder |
| DDIM Sampling | Song et al. ICLR 2021 | Accelerates diffusion decoder from 100 to 10 steps for real-time inference |
| Upcycling / Transfer Learning | Standard practice | Cosmos-Reason weights initialized from broader physical AI pre-training |
| PPO | Schulman et al. (2017) | Proximal Policy Optimization for Stage 3 RL training |

🌿 Alpamayo 1.5 — March 2026
Summary
- Major scale-up: Alpamayo 1.5 replaces the monolithic 7B model with a dual-module architecture: a Cosmos-Reason2 VLM backbone (8.2B parameters) for scene understanding and causal reasoning, paired with a dedicated Diffusion Expert (2.3B parameters) for trajectory generation — totalling ~10.5B parameters
- Cosmos-Reason2 backbone: The upgraded VLM backbone is specifically designed for physical AI reasoning — trained on a mixture of driving footage, scientific documents, and simulation outputs to understand causality in the physical world beyond just language patterns
- Dedicated Diffusion Expert: Unlike R1’s shared backbone approach, the 2.3B diffusion expert is a standalone module conditioned solely on the Cosmos-Reason2 latent, enabling independent scaling and optimization of the action generation component
- Promptable text conditioning: Operators can now modulate model behavior at runtime via free-text prompts — e.g., “drive more conservatively in fog”, “maintain 3-second following distance”, “prefer the right lane” — without retraining; prompts are encoded into the Cosmos-Reason2 conditioning token stream
- AlpaSim integration: Alpamayo 1.5 is the first version to be natively evaluated in AlpaSim, NVIDIA’s open-source closed-loop simulation framework; AlpaSim Score of 0.81 ± 0.01 and minADE of 1.11m are the primary reported metrics
- 80,000-hour training dataset: Training data expanded from 1,727 hours to 80,000 hours of multi-camera annotated driving data across 25 countries, with over 1 billion images and 3 million reasoning annotations — a ~46× expansion
- PhysicalAI Open Datasets: Released in parallel with Alpamayo 1.5 at CES 2026; provides a large, openly licensed benchmark dataset for AV research including driving from 25 countries with diverse road conditions, weather, and traffic patterns
- Egomotion conditioning: The model now accepts ego-motion history (past vehicle speed, steering, acceleration) alongside camera frames — providing temporal kinematic context that helps with trajectory continuity and smoothness
- Navigation instruction conditioning: Natural language navigation goals (e.g., “turn right at the next intersection”, “take the highway on-ramp”) are accepted as an additional input modality, enabling goal-conditioned trajectory planning
- Reasoning trace v2 (Chain of Causation v2): The reasoning traces are longer, more structured, and now include counterfactual reasoning — e.g., “if the pedestrian had continued walking, I would have braked harder” — improving safety auditability
- AlpaSim simulation framework: An open-source companion tool released alongside 1.5 for closed-loop simulation, capable of reproducing difficult driving scenarios at scale for continuous model validation and safety testing
- Improved long-tail handling: The combination of larger training data, AlpaSim synthetic augmentation, and counterfactual reasoning in CoC v2 dramatically improves behavior on rare edge cases (construction zones, emergency vehicles, adverse weather, unusual road markings)
- Hardware requirements: The dual-module 10B architecture targets H100 / A100 80GB in FP16; INT8 quantized variants support deployment on DRIVE Thor with ~60ms latency at 10Hz
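The headline numbers in this summary follow from simple arithmetic, sketched below. The memory estimate covers weights only (decimal gigabytes); activations, the KV cache, and runtime overhead are not modelled, so real VRAM needs will be higher.

```python
def total_params_billion(backbone=8.2, diffusion_expert=2.3):
    """Combined size of the Cosmos-Reason2 backbone and the Diffusion Expert."""
    return backbone + diffusion_expert

def weight_memory_gb(params_billion, bytes_per_param):
    """Weights-only footprint in decimal GB (1e9 params * bytes / 1e9)."""
    return params_billion * bytes_per_param

def expansion_factor(new_hours=80_000, old_hours=1_727):
    """Growth of the training set between R1 and 1.5."""
    return new_hours / old_hours

total = total_params_billion()
print(f"total parameters: ~{total:.1f}B")
print(f"FP16 weights:     ~{weight_memory_gb(total, 2):.1f} GB")
print(f"INT8 weights:     ~{weight_memory_gb(total, 1):.1f} GB")
print(f"dataset growth:   ~{expansion_factor():.1f}x")
```

The weights-only FP16 figure of roughly 21 GB is a lower bound on the FP16 footprint, consistent with the H100 / A100 80GB target once runtime memory is added.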
Architecture: Cosmos-Reason2 VLM Backbone
Cosmos-Reason2 is an 8.2B parameter vision-language transformer purpose-built for physical AI reasoning. Unlike generic VLMs (e.g., LLaVA, InternVL), Cosmos-Reason2 is pre-trained on:
- Driving footage with annotated physical events (collisions, near-misses, right-of-way scenarios)
- Physics simulation outputs from Isaac Sim and Omniverse
- Scientific documents covering vehicle dynamics, road physics, traffic engineering
- Chain of Causation v1 traces from Alpamayo-R1 (self-distillation)
Key architectural properties:
- Grouped Query Attention (GQA): 32 Q heads, 8 KV heads for memory-efficient KV cache during long reasoning trace generation
- RoPE (θ = 500,000) for extended context (up to 32K tokens to handle long reasoning traces + image tokens)
- SwiGLU FFN activation
- RMSNorm (pre-normalization)
- Vision encoder: upgraded SigLIP ViT-H/14 (1B parameters) — processes 6-camera frames at 448×448 resolution each, producing 256 tokens per camera (1,536 visual tokens total per timestep)
- Temporal fusion: 3 past timesteps stacked → 4,608 total visual tokens input to reasoning backbone
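The token counts and the GQA saving quoted in the list above can be reproduced directly. One point is an inference rather than a stated fact: a 448×448 image with 14-pixel patches yields 1,024 patches, so 256 tokens per camera implies a 4× token reduction (for example 2×2 pooling), which the source does not describe.

```python
def visual_token_budget(cameras=6, tokens_per_camera=256, timesteps=3):
    """Return (tokens per timestep, tokens across all stacked timesteps)."""
    per_step = cameras * tokens_per_camera
    return per_step, per_step * timesteps

def patch_reduction(resolution=448, patch=14, tokens_per_camera=256):
    """Return (raw patch count, implied reduction factor to reach the token count)."""
    patches = (resolution // patch) ** 2
    return patches, patches // tokens_per_camera

def kv_cache_ratio(q_heads=32, kv_heads=8):
    """KV cache size under GQA relative to full multi-head attention."""
    return kv_heads / q_heads

print(visual_token_budget())  # tokens per timestep, total visual tokens
print(patch_reduction())      # raw patches per camera, implied reduction
print(kv_cache_ratio())       # fraction of the multi-head KV cache
```

With 8 KV heads shared across 32 query heads, the KV cache is a quarter of what standard multi-head attention would need at the same context length.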
Architecture: Diffusion Expert Trajectory Decoder
The 2.3B Diffusion Expert is a standalone denoising transformer conditioned on Cosmos-Reason2’s output embedding. Key details:
- Architecture: Diffusion Transformer (DiT-style) — 24 transformer layers with cross-attention to Cosmos-Reason2 conditioning tokens
- Conditioning: Receives (a) Cosmos-Reason2 scene embedding, (b) egomotion history tokens, (c) navigation goal tokens, (d) text prompt tokens (when provided)
- Noise schedule: DDIM with 10 inference steps, the same step count R1 uses in place of 100-step DDPM sampling; the short schedule is critical for staying within the ≤120ms real-time latency target
- Output: 60 waypoints × (x, y, heading) = 180-dimensional trajectory vector over 6 seconds at 10Hz
- Multi-modal output: Samples 8 candidate trajectories per inference; final trajectory selected by minimum cost under a safety-aware cost function (collision probability, comfort, deviation from navigation goal)
- Physically constrained decoding: Vehicle kinematic model (bicycle model) applied as post-processing to ensure all sampled trajectories are physically realizable given current speed and steering limits
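The multi-candidate selection step described above can be sketched as a weighted cost over each sampled trajectory. The cost terms mirror the three named in the list (collision, comfort, goal deviation), but the concrete formulas, weights, and `safe_radius` here are hypothetical; the source does not specify them.

```python
import math

def trajectory_cost(traj, obstacles, goal, w_collision=10.0, w_comfort=1.0,
                    w_goal=0.5, safe_radius=2.0):
    """traj is a list of (x, y, heading) waypoints; lower cost is better."""
    # Collision proxy: count waypoints that come closer than safe_radius to an obstacle.
    collision = sum(1.0 for (x, y, _h) in traj for (ox, oy) in obstacles
                    if math.hypot(x - ox, y - oy) < safe_radius)
    # Comfort proxy: total absolute heading change along the trajectory.
    comfort = sum(abs(traj[i + 1][2] - traj[i][2]) for i in range(len(traj) - 1))
    # Goal deviation: distance from the final waypoint to the navigation goal.
    goal_dev = math.hypot(traj[-1][0] - goal[0], traj[-1][1] - goal[1])
    return w_collision * collision + w_comfort * comfort + w_goal * goal_dev

def select_trajectory(candidates, obstacles, goal):
    """Return (index of the minimum-cost candidate, list of all costs)."""
    costs = [trajectory_cost(c, obstacles, goal) for c in candidates]
    best = min(range(len(candidates)), key=costs.__getitem__)
    return best, costs

# Eight straight candidates at lateral offsets 0..7 m, one obstacle on the centerline.
candidates = [[(float(i + 1), float(k), 0.0) for i in range(60)] for k in range(8)]
best, costs = select_trajectory(candidates, obstacles=[(30.0, 0.0)], goal=(60.0, 1.0))
print(best, costs[best])
```

In this toy setup the two candidates nearest the goal line are rejected because they pass within the safety radius of the obstacle, so the selector picks the closest candidate that clears it.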
AlpaSim Simulation Framework
AlpaSim is an open-source closed-loop simulation environment released alongside Alpamayo 1.5:
- Scenario reproduction: Automatically reconstructs challenging real-world scenarios from the PhysicalAI dataset for repeatable testing
- Adversarial agents: Configurable adversarial vehicles and pedestrians to stress-test edge case handling
- Metrics: AlpaSim Score (composite safety + comfort + goal achievement metric), minADE, collision rate, traffic violation rate
- Integration: Native plugin for NVIDIA Isaac Sim and Omniverse; Python API for headless evaluation
- Alpamayo 1.5 baseline: AlpaSim Score 0.81 ± 0.01 on 910 challenging scenarios from PhysicalAI-AV-NuRec evaluation set
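Of the metrics listed above, minADE has a standard definition that is easy to state in code: the average displacement error of the best candidate trajectory against the ground truth. The sketch below uses that common definition; how AlpaSim itself configures the number of candidates or the evaluation horizon is not stated in the source.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over matched waypoints."""
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, truth)) / len(truth)

def min_ade(candidates, truth):
    """Minimum ADE over all candidate trajectories."""
    return min(ade(c, truth) for c in candidates)

# Toy example: ground truth on the centerline, candidates offset by 1 m and 3 m.
truth = [(float(i), 0.0) for i in range(1, 61)]
near = [(x, y + 1.0) for x, y in truth]
far = [(x, y + 3.0) for x, y in truth]
print(min_ade([far, near], truth))
```

Because the minimum is taken over candidates, minADE rewards a model for having at least one good hypothesis and says nothing about whether the selected trajectory was that one, which is why it is reported alongside closed-loop metrics.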
Training & Dataset Expansion
| Stage | Data | Method | Purpose |
|---|---|---|---|
| Cosmos-Reason2 Pre-training | Physics documents + simulation + web | Next-token prediction | Physical world understanding |
| Supervised Fine-tuning (SFT) | 80,000 hrs multi-camera driving + CoC v2 traces | Cross-entropy on reasoning + trajectory | Scene understanding + CoC generation |
| Diffusion Expert Training | 80,000 hrs trajectory data | DDPM loss conditioned on VLM embedding | Trajectory prediction quality |
| RL Alignment | AlpaSim closed-loop rollouts | PPO: AlpaSim Score as reward | Closed-loop safety optimization |
| Adversarial Augmentation | AlpaSim synthetic scenarios | SFT on augmented data | Long-tail edge case robustness |
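For readers unfamiliar with the two objectives named in the table, their standard textbook forms are shown below. This is a sketch of the generic DDPM noise-prediction loss and the PPO clipped surrogate, with the conditioning variable $c$ standing for the VLM embedding; any loss weighting, clipping range, or extra terms used in the actual Alpamayo 1.5 training are not given in the source.

```latex
% DDPM noise-prediction loss for a clean trajectory \tau_0, conditioned on the VLM embedding c
\mathcal{L}_{\text{DDPM}}
  = \mathbb{E}_{\tau_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\lVert \epsilon
      - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\,\tau_0
        + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\; t,\; c \right)
    \right\rVert^2 \right]

% PPO clipped surrogate, with probability ratio r_t(\theta) and advantage estimate \hat{A}_t
\mathcal{L}^{\text{CLIP}}(\theta)
  = \mathbb{E}_t \left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\left( r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon \right) \hat{A}_t
    \right) \right],
  \qquad
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

In the RL Alignment stage the advantage $\hat{A}_t$ would be derived from the AlpaSim Score reward, per the table.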
PhysicalAI Open Dataset (released Mar 2026):
- 80,000 hours total; ~1 billion images
- 25 countries, 3,000+ cities
- 3 million CoC v2 reasoning annotations
- Weather diversity: clear, rain, fog, snow, night, dusk/dawn
- Road type diversity: highway, urban, rural, parking, construction zones
- 910 curated challenging evaluation scenarios (PhysicalAI-AV-NuRec)
Architecture Diagram — Alpamayo 1.5
[Diagram not reproduced. Surviving labels: "→ SigLIP ViT-H/14 encoder → 256 tokens/cam"; "4,608 visual tokens total"; "Past 3 timesteps → tokens"; "'turn right ahead' → tokens"; "Output B: Scene latent embedding → Diffusion Expert conditioning".]
Community Perspective
- Scaling validated: The 46× expansion in training data (1,727h → 80,000h) was seen as a clear signal that NVIDIA is serious about Alpamayo as a production-track product, not just a research demo — a sentiment reinforced by the CES 2026 announcement context
- Promptable driving praised: The ability to steer the model’s behavior with natural language prompts was called out as a transformative feature for fleet operators, regulators, and accessibility use cases — it makes tuning AV behavior without retraining practical for the first time
- AlpaSim as community infrastructure: The release of AlpaSim as an open-source evaluation framework was arguably as significant as the model itself — researchers now have a standardized closed-loop benchmark, filling a major gap in reproducibility for AV model comparisons
- Dual-module design as a template: The separation of reasoning (Cosmos-Reason2) and action (Diffusion Expert) into independently trainable modules was praised in the robotics community as a reusable design pattern for physical AI systems beyond autonomous driving
- Counterfactual reasoning breakthrough: The CoC v2 counterfactual traces were highlighted as a step toward formal safety verification — if a model can reason about what it would have done, its decision boundaries become auditable
- Compute cost concern: At 10B parameters (FP16: ~20GB minimum, recommended 40GB for stable inference), deployment on automotive edge hardware requires aggressive quantization — some practitioners noted that DRIVE Thor’s 256-TOPS compute budget is tight for 10B at 10Hz without INT4 quantization
- Data diversity recognized: The 25-country training dataset was praised for covering a far wider distribution of road rules, driving styles, and infrastructure types than any previous public AV dataset — a key step toward globally generalizable autonomous driving
Model Variants
| Model | Total Parameters | VLM Backbone | Vision Encoder | Diffusion Expert | VRAM (FP16) | Latency (H100) | Quantized (INT8) | License |
|---|---|---|---|---|---|---|---|---|
| Alpamayo-1.5-10B | ~10.5B | Cosmos-Reason2 (8.2B) | SigLIP ViT-H/14 | 2.3B DiT | ~40 GB | ~120ms | ✅ (20 GB, ~65ms) | Non-commercial |
Alpamayo 1.5 is released as a single 10B model (unlike R1’s 0.5B/3B/7B family). Smaller edge variants are planned for a future Alpamayo 2.0 release. INT8 quantized weights are available via nvidia/Alpamayo-1.5-10B-INT8 on Hugging Face.
Key Industry Ideas Incorporated — Alpamayo 1.5
| Technique | Origin | How Alpamayo 1.5 Used It |
|:----------|:-------|:-------------------------|
| Cosmos-Reason2 VLM Backbone | NVIDIA Cosmos platform (2025) | Physical-AI-specialized VLM for scene understanding and CoC v2 reasoning |
| Diffusion Transformer (DiT) | Peebles & Xie, "Scalable Diffusion Models with Transformers" (ICCV 2023) | 24-layer DiT as standalone trajectory decoder with cross-attention conditioning |
| SigLIP Vision Encoder | Zhai et al. "Sigmoid Loss for Language Image Pre-Training" (ICCV 2023) | ViT-H/14 SigLIP encoder replacing earlier ViT for stronger visual grounding |
| Counterfactual Reasoning | Lewis & Vaswani (2021); Robustness literature | CoC v2 traces include counterfactual branches for improved safety auditability |
| DDIM Sampling | Song et al. (ICLR 2021) | 10-step DDIM for real-time diffusion trajectory decoding |
| Multi-candidate trajectory selection | Diverse trajectory prediction (Trajectron++, 2020) | 8 candidates per inference → safety cost function selection |
| Closed-loop RL training | Waymax / nuPlan RL baselines (2023-2024) | PPO on AlpaSim closed-loop rollouts for safety reward |
| Egomotion conditioning | DriveDreamer, MUVO (2024) | Past vehicle kinematics as temporal context for smoother trajectory continuity |
| Promptable behavior control | InstructPix2Pix (2023); Text-conditioned robotics | Free-text prompts modulate trajectory style without retraining |
| Knowledge Distillation | Hinton et al. (2015) | Cosmos-Reason2 distills physical AI knowledge from larger Cosmos World Foundation models |
| GQA | Ainslie et al. (EMNLP 2023) | Efficient KV cache for Cosmos-Reason2 backbone during long CoC trace generation |
| AlpaSim Evaluation | nuPlan / Waymo Open Dataset evaluation paradigm | Closed-loop benchmark metric (AlpaSim Score 0.81) standardizing AV model comparison |

Technical Papers
| Model / Topic | Title | Link | Venue |
|---|---|---|---|
| Alpamayo-R1 | Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving | arXiv:2511.00088 | NeurIPS 2025 |
| Alpamayo 1.5 | Alpamayo 1.5 — Model Card | HuggingFace | Mar 2026 |
| Cosmos World Foundation | NVIDIA Cosmos: World Foundation Models for Physical AI | NVIDIA Research | Jan 2025 |
| UniAD | Planning-oriented Autonomous Driving (UniAD) | arXiv:2212.10156 | CVPR 2023 |
| DriveLM | DriveLM: Driving with Graph Visual Question Answering | arXiv:2312.14150 | CVPR 2024 |
| DriveVLM | DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | arXiv:2402.12289 | 2024 |
| BEVFormer | BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images | arXiv:2203.17270 | ECCV 2022 |
| DDPM | Denoising Diffusion Probabilistic Models | arXiv:2006.11239 | NeurIPS 2020 |
| DDIM | Denoising Diffusion Implicit Models | arXiv:2010.02502 | ICLR 2021 |
| Chain-of-Thought | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | arXiv:2201.11903 | NeurIPS 2022 |
| DeepSeek-R1 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL | arXiv:2501.12948 | 2025 |
Official Resources
| Resource | Link |
|---|---|
| Alpamayo-R1 GitHub (NVlabs) | github.com/NVlabs/alpamayo |
| Alpamayo 1.5 GitHub (NVlabs) | github.com/NVlabs/alpamayo1.5 |
| Alpamayo 1.5 Hugging Face | huggingface.co/nvidia/Alpamayo-1.5-10B |
| AlpaSim Framework | github.com/NVlabs/alpamayo1.5 |
| NVIDIA Cosmos Platform | research.nvidia.com/cosmos |
| NVIDIA DRIVE Platform | developer.nvidia.com/drive |
| Chain of Causation Dataset | github.com/NVlabs/alpamayo/dataset |
| Hugging Face Model Hub | huggingface.co/nvidia/alpamayo |
Cited Techniques
| Technique | Paper | Link |
|---|---|---|
| DiT | Peebles & Xie, “Scalable Diffusion Models with Transformers” (ICCV 2023) | arXiv:2212.09748 |
| SigLIP | Zhai et al., “Sigmoid Loss for Language Image Pre-Training” (ICCV 2023) | arXiv:2303.15343 |
| DDPM | Ho et al., “Denoising Diffusion Probabilistic Models” (NeurIPS 2020) | arXiv:2006.11239 |
| DDIM | Song et al., “Denoising Diffusion Implicit Models” (ICLR 2021) | arXiv:2010.02502 |
| GQA | Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models” (EMNLP 2023) | arXiv:2305.13245 |
| RoPE | Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) | arXiv:2104.09864 |
| RMSNorm | Zhang & Sennrich, “Root Mean Square Layer Normalization” (NeurIPS 2019) | arXiv:1910.07467 |
| SwiGLU | Shazeer, “GLU Variants Improve Transformer” (2020); gating mechanism from Dauphin et al., “Language Modeling with Gated Convolutional Networks” (ICML 2017) | arXiv:2002.05202 |
| ViT | Dosovitskiy et al., “An Image is Worth 16x16 Words” (ICLR 2021) | arXiv:2010.11929 |
| PPO | Schulman et al., “Proximal Policy Optimization Algorithms” (2017) | arXiv:1707.06347 |
| Chain-of-Thought | Wei et al., “Chain-of-Thought Prompting” (NeurIPS 2022) | arXiv:2201.11903 |
| BEVFormer | Li et al., “BEVFormer” (ECCV 2022) | arXiv:2203.17270 |
| UniAD | Hu et al., “Planning-oriented Autonomous Driving” (CVPR 2023) | arXiv:2212.10156 |
Built with data from the Alpamayo-R1 technical paper (NeurIPS 2025), Alpamayo 1.5 Hugging Face model card and GitHub (Mar 2026), NVIDIA Cosmos platform announcements, and cited autonomous driving research. All benchmark numbers sourced from the referenced publications. Model weights and dataset are subject to NVIDIA non-commercial research license.