Llama Model Family: A Comprehensive Technical Reference
From a fully open-weight decoder-only Transformer to a native multimodal Mixture-of-Experts colossus: the complete evolutionary arc of Meta's Llama series.
Table of Contents
- Executive Summary
- Version Release Timeline
- Cross-Version Benchmark Comparison
- Master Architecture Diagram
- Llama 1 (February 2023)
- Llama 2 (July 2023)
- Llama 3 (April 2024)
- Llama 3.1 (July 2024)
- Llama 3.2 (September 2024)
- Llama 3.3 (December 2024)
- Llama 4 (April 2025)
- References
Executive Summary
The Llama (Large Language Model Meta AI) series is Meta's flagship family of open-weight language models, spanning from a pure research release in February 2023 to a production-scale multimodal Mixture-of-Experts system in April 2025.
- Llama 1 (Feb 2023): Decoder-only Transformer with RoPE, RMSNorm, and SwiGLU; 4 sizes (7B to 65B); sparked the open-source LLM revolution.
- Llama 2 (Jul 2023): Expanded context to 4K, added GQA for larger models, introduced RLHF-tuned Chat variants with Ghost Attention; commercial license.
- Llama 3 (Apr 2024): GQA universally applied, 128K vocabulary (tiktoken), 15T training tokens; 8B and 70B flagship sizes.
- Llama 3.1 (Jul 2024): 128K context via RoPE scaling (θ=500,000), new 405B flagship, multilingual capability, tool use, multi-stage post-training pipeline.
- Llama 3.2 (Sep 2024): Introduced vision models (11B, 90B with cross-attention adapters) and small edge models (1B, 3B via pruning/distillation).
- Llama 3.3 (Dec 2024): Single drop-in 70B improvement via enhanced post-training: better math, reasoning, and coding at lower compute cost.
- Llama 4 (Apr 2025): Native early-fusion multimodal MoE (Scout 17B active / Maverick 17B active / Behemoth 288B active); iRoPE positional encoding; up to 10M context.
Version Release Timeline
Cross-Version Benchmark Comparison
| Benchmark | Llama 1 (65B) | Llama 2 (70B) | Llama 3 (70B) | Llama 3.1 (70B) | Llama 3.3 (70B) |
|---|---|---|---|---|---|
| MMLU | 63.4 | 68.9 | 82.0 | 86.0 | 86.0 |
| HumanEval | 23.7 | 29.9 | 81.7 | 80.5 | 88.4 |
| MATH | 10.6 | 13.5 | 50.4 | 68.0 | 77.0 |
| GSM8K | 50.9 | 56.8 | 93.0 | 95.1 | ~95 |
| Context Window | 2K | 4K | 8K | 128K | 128K |
| Training Tokens | 1.4T | 2T | 15T | 15T | 15T |
Master Architecture Diagram
[Diagram not reproduced. Recoverable labels: RMSNorm applied before sub-layers; Llama 2 to 3.3: GQA (Q/KV split); Llama 4: GQA + iRoPE interleaving; Llama 4: SwiGLU MoE (sparse routing).]
Llama 1 (February 2023)
Summary
- Architecture: Pure decoder-only Transformer; no encoder cross-attention.
- Positional Encoding: Rotary Position Embeddings (RoPE) with θ=10,000 applied to every attention layer.
- Normalization: Pre-norm RMSNorm (replaces LayerNorm entirely); eliminates mean centering for efficiency.
- Activation: SwiGLU feed-forward with three weight matrices (W1, W2, W3); intermediate dim ≈ 2/3 × 4d.
- Attention: Multi-Head Attention (MHA) with no grouped-query sharing; full Q/K/V projections per head.
- Vocabulary: 32,000 tokens via SentencePiece BPE (byte-level fallback for unknown characters).
- Context Window: 2,048 tokens (causal attention mask; no sliding window).
- Training Data: Publicly available corpora (CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange); totaling ~1T (7B/13B) and 1.4T (33B/65B) tokens.
- Optimizer: AdamW, β₁=0.9, β₂=0.95, ε=10⁻⁵; cosine LR schedule; weight decay 0.1; gradient clipping 1.0.
- Training Infrastructure: 2,048 A100 80GB GPUs (65B model); efficient memory via FlashAttention v1 and activation checkpointing.
- Open Release: Weights released under a non-commercial research license; spawned hundreds of community fine-tunes (Alpaca, Vicuna, WizardLM, etc.).
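
The normalisation and positional-encoding bullets above can be sketched in plain Python. This is a minimal illustration, not Meta's implementation: the function names `rms_norm` and `rope`, the adjacent-pair rotation layout, and the toy vectors are assumptions made for the example.

```python
import math

def rms_norm(x, gain, eps=1e-5):
    """Scale x by the reciprocal of its root-mean-square (no mean subtraction)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]

def rope(x, position, theta=10_000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta**(-2i/d)."""
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(d // 2):
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are 2 apart, so the two scores should agree.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))
normed = rms_norm(q, gain=[1.0] * len(q))
```

The two scores agree up to floating-point error because rotating Q and K by position-dependent angles leaves their dot product a function of the relative offset only, which is the property that makes RoPE a relative positional encoding.
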
Architecture Diagram
Community Perspective
Llama 1 fundamentally democratised large language model research. Within weeks of its release (distributed to researchers on request, then leaked publicly), the community produced Alpaca (Stanford CRFM; fine-tuned on instruction data generated with OpenAI's text-davinci-003), Vicuna (LMSYS, ShareGPT conversations), WizardLM, and dozens of quantised variants via llama.cpp (initially in GGML format, later GGUF). The model demonstrated that training on publicly available data alone could rival much larger proprietary models (the paper reports LLaMA-13B outperforming the 175B GPT-3 on most benchmarks) at a fraction of the inference cost, validating scaling law predictions and opening a new era of community-driven alignment research.
Model Variants
| Model | Parameters | Layers | Heads (Q) | Hidden Dim | Context | Training Tokens |
|---|---|---|---|---|---|---|
| Llama 1 7B | 6.7B | 32 | 32 | 4,096 | 2,048 | 1T |
| Llama 1 13B | 13.0B | 40 | 40 | 5,120 | 2,048 | 1T |
| Llama 1 33B | 32.5B | 60 | 52 | 6,656 | 2,048 | 1.4T |
| Llama 1 65B | 65.2B | 80 | 64 | 8,192 | 2,048 | 1.4T |
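
As a rough cross-check of the table, the parameter counts can be approximated from the layer count and hidden dimension alone. The sketch below assumes the SwiGLU width is 2/3 × 4d rounded up to a multiple of 256 and that the output head is not tied to the input embedding; both are assumptions taken from common open implementations rather than from this document, and the function names are hypothetical.

```python
def swiglu_hidden_dim(d_model, multiple_of=256):
    """2/3 of the classic 4*d FFN width, rounded up to a multiple of 256 (assumed)."""
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def dense_mha_params(d_model, n_layers, vocab=32_000):
    """Approximate parameter count for a pre-norm MHA decoder with SwiGLU."""
    ffn = swiglu_hidden_dim(d_model)
    attention = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
    feed_forward = 3 * d_model * ffn    # W1, W2, W3
    norms = 2 * d_model                 # two RMSNorm gain vectors per layer
    embeddings = 2 * vocab * d_model    # input embedding + untied output head
    final_norm = d_model
    return n_layers * (attention + feed_forward + norms) + embeddings + final_norm

for name, d, layers in [("7B", 4096, 32), ("13B", 5120, 40),
                        ("33B", 6656, 60), ("65B", 8192, 80)]:
    total = dense_mha_params(d, layers)
    print(f"Llama 1 {name}: ffn={swiglu_hidden_dim(d)}, params={total / 1e9:.2f}B")
```

Under these assumptions the estimates land within roughly 0.1B of the Parameters column for all four sizes, which suggests the table's layer and width figures are mutually consistent.
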
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 1 Used It |
|---|---|---|
| Rotary Position Embeddings (RoPE) | Su et al., 2021 | Applied to Q and K projections at every layer; θ=10,000 |
| RMSNorm | Zhang & Sennrich, 2019 | Replaces LayerNorm pre-sub-layer; removes mean centering |
| SwiGLU Activation | Shazeer, 2020 | FFN uses gated linear unit with SiLU; three weight matrices |
| Pre-norm Transformer | Xiong et al., 2020 | Norm applied before attention and FFN (vs. post-norm) |
| FlashAttention | Dao et al., 2022 | Tiling-based attention for memory-efficient training on A100s |
| BPE Tokeniser | Sennrich et al., 2016 | 32K token vocab via SentencePiece byte-level BPE |
| Causal LM Objective | GPT series | Standard next-token prediction, no masked LM |
Llama 2 (July 2023)
Summary
- Context Expansion: Window doubled to 4,096 tokens; trained with a longer document mix.
- Grouped-Query Attention (GQA): Introduced for 34B and 70B models only; 7B and 13B retain full MHA.
- GQA Configuration (70B): 64 query heads, 8 key-value heads (8:1 ratio); dramatically reduces KV-cache memory.
- GQA Configuration (34B): 64 query heads, 8 key-value heads (8:1 ratio).
- Training Data: ~2 trillion tokens; a new mix of publicly available data that up-samples the most factual sources.
- Chat Variants: First official instruction-tuned release; SFT on >27,500 human-annotated examples.
- RLHF Pipeline: Two separate reward models trained β one for helpfulness, one for safety; Proximal Policy Optimisation (PPO) applied iteratively.
- Ghost Attention (GAtt): Synthetic technique to preserve system-prompt adherence across multi-turn dialogue; system instruction replicated at each user turn during training.
- Safety Measures: Red-teaming, safety reward model, context distillation, human evaluations; "responsible use guide" released alongside weights.
- Commercial License: Llama 2 Community License allows commercial use for organisations with fewer than 700M monthly active users (separate license for larger entities).
- Code Model: Code Llama released separately (Aug 2023) on top of Llama 2 base; 7B/13B/34B sizes, support for inputs up to 100K tokens, and code infilling.
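
The KV-cache saving from GQA is simple arithmetic. The sketch below uses the 70B shape from the variants table (80 layers, hidden dim 8,192, 64 query heads, 8 KV heads) at the full 4,096-token context; the 2-byte (FP16/BF16) cache precision and the helper name `kv_cache_bytes` are assumptions for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading factor 2 accounts for the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

head_dim = 8192 // 64  # hidden dim divided by the number of query heads

mha_bytes = kv_cache_bytes(80, 64, head_dim, 4096)  # hypothetical 70B with full MHA
gqa_bytes = kv_cache_bytes(80, 8, head_dim, 4096)   # 70B with 8 KV heads

print(f"MHA: {mha_bytes / 2**30:.2f} GiB per sequence")
print(f"GQA: {gqa_bytes / 2**30:.2f} GiB per sequence")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

The reduction equals the ratio of query heads to KV heads, so it scales directly with batch size and context length when serving.
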
Architecture Diagram
[Diagram not reproduced. Recoverable label: 7B/13B: MHA (unchanged).]
Community Perspective
Llama 2's commercial license transformed the ecosystem: enterprises could now legally deploy and fine-tune these weights in production. The 70B Chat model quickly became the reference open-weight instruction-tuned LLM, benchmarked extensively against GPT-3.5. The Ghost Attention mechanism addressed a key weakness in multi-turn instruction following that plagued early RLHF systems. Code Llama (built on Llama 2) became one of the most widely adopted open code generation models, directly influencing Copilot-alternative tooling.
Model Variants
| Model | Parameters | Layers | Heads (Q/KV) | Hidden Dim | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 2 7B | 6.7B | 32 | 32/32 (MHA) | 4,096 | 4,096 | Base + Chat |
| Llama 2 13B | 13.0B | 40 | 40/40 (MHA) | 5,120 | 4,096 | Base + Chat |
| Llama 2 34B | 34B | 48 | 64/8 (GQA) | 8,192 | 4,096 | Trained and reported in the paper, but not publicly released |
| Llama 2 70B | 68.9B | 80 | 64/8 (GQA) | 8,192 | 4,096 | Base + Chat |
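
The "Heads (Q/KV)" column describes how query heads share key-value heads. A minimal sketch of that mapping follows, assuming the contiguous-block grouping used by common open implementations; the function name `kv_head_for` is hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    assert n_query_heads % n_kv_heads == 0, "query heads must divide evenly"
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# 70B layout from the table: 64 query heads sharing 8 KV heads.
groups = {}
for q_head in range(64):
    groups.setdefault(kv_head_for(q_head, 64, 8), []).append(q_head)

for kv_head, members in groups.items():
    print(f"KV head {kv_head}: query heads {members[0]}..{members[-1]}")
```

With equal head counts (the 7B and 13B rows) the mapping is the identity, which is ordinary multi-head attention; with a single KV head it would reduce to multi-query attention.
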
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 2 Used It |
|---|---|---|
| Grouped-Query Attention (GQA) | Ainslie et al., 2023 | Applied to 34B and 70B; 64 query heads share 8 KV heads, shrinking the KV cache 8× |
| RLHF with PPO | Stiennon et al., 2020; InstructGPT | Two reward models (helpfulness + safety) trained on 1.4M comparisons |
| Ghost Attention (GAtt) | Meta internal | Synthetic multi-turn training to preserve system-prompt over long conversations |
| Rejection Sampling Fine-tuning | Meta internal | Used between SFT and PPO; sample K responses, keep highest reward |
| Safety Red-teaming | Anthropic, OpenAI tradition | Dedicated red team; safety-specific reward model; context distillation |
| Context Distillation | Askell et al., 2021 | Distilling safety behavior from system-prompted model to base model |
Llama 3 (April 2024)
Summary
- Architecture: Same decoder-only Transformer skeleton; GQA now applied universally across all model sizes.
- Vocabulary Jump: 128,256 tokens using tiktoken BPE (previously 32K SentencePiece); much better multilingual and code tokenisation.
- Model Sizes: Two released sizes, 8B and 70B; a 400B+ model was announced as still in training.
- Context Window: 8,192 tokens (2× Llama 2); RoPE base θ raised from 10,000 to 500,000, the value later kept in Llama 3.1.
- Training Scale: 15T tokens (7.5× Llama 2); quality-filtered CommonCrawl + code + multilingual sources; 95% English.
- 8B Config: 32 layers, 32 Q heads, 8 KV heads (4:1 GQA), hidden dim 4096.
- 70B Config: 80 layers, 64 Q heads, 8 KV heads (8:1 GQA), hidden dim 8192.
- Post-Training: A combination of SFT (high-quality human instructions), Rejection Sampling, PPO, and DPO.
- Safety Tooling: Llama Guard 2 and Code Shield released alongside; Prompt Guard for injection detection followed with Llama 3.1.
- FlashAttention 2: Used throughout training; significantly improves memory efficiency vs v1 used in Llama 1.
- Instruction-tuned Variants: Llama 3 Instruct models show substantial jump on HumanEval (81.7 for 70B-Instruct) and MMLU (82.0) over Llama 2 70B.
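
Two of the post-training ingredients named above, rejection sampling and DPO, reduce to short formulas. The sketch below is illustrative only: `best_of_k` and `dpo_loss` are hypothetical helper names, the reward function is a toy stand-in for a learned reward model, and β=0.1 is an assumed value rather than Meta's setting.

```python
import math

def best_of_k(candidates, reward_fn):
    """Rejection sampling: keep the candidate response with the highest reward."""
    return max(candidates, key=reward_fn)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    Computes -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen response than the frozen reference model does.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy reward (assumption): prefer longer answers.
picked = best_of_k(["ok", "a fuller answer", "short"], reward_fn=len)

# Policy identical to the reference: loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The DPO objective needs no separate reward model at training time, which is one reason it is often paired with, or substituted for, PPO in pipelines like the one described here.
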
Architecture Diagram
[Diagram not reproduced. Recoverable label: FlashAttention 2.]
Community Perspective
Llama 3 represented a watershed moment in open-weight capabilities. The 70B Instruct model surpassed GPT-3.5 on several established benchmarks and competed seriously with early GPT-4 variants on coding tasks. The tiktoken vocabulary change, while breaking compatibility with prior Llama tokenisers, dramatically improved multilingual efficiency. The Llama Guard safety tooling suite became a widely referenced framework for responsible deployment of open models, with organisations like Hugging Face and AI safety researchers publishing extensive evaluations.
Model Variants
| Model | Parameters | Layers | Heads (Q/KV) | Hidden Dim | FFN Dim | Context |
|---|---|---|---|---|---|---|
| Llama 3 8B | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 8,192 |
| Llama 3 8B Instruct | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 8,192 |
| Llama 3 70B | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 8,192 |
| Llama 3 70B Instruct | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 8,192 |
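
The Parameters column can be reproduced from the other columns plus the 128,256-token vocabulary. The sketch below assumes an untied output head and counts only weight matrices and RMSNorm gains; the function name `gqa_decoder_params` is hypothetical.

```python
def gqa_decoder_params(d_model, n_layers, n_q_heads, n_kv_heads, ffn_dim, vocab):
    """Parameter count for a pre-norm decoder with GQA and a SwiGLU FFN."""
    head_dim = d_model // n_q_heads
    kv_width = n_kv_heads * head_dim
    # W_q and W_o are d x d; W_k and W_v are narrowed to the KV-head width.
    attention = 2 * d_model * d_model + 2 * d_model * kv_width
    feed_forward = 3 * d_model * ffn_dim   # W1, W2, W3
    norms = 2 * d_model                    # two RMSNorm gain vectors per layer
    embeddings = 2 * vocab * d_model       # input embedding + untied output head
    final_norm = d_model
    return n_layers * (attention + feed_forward + norms) + embeddings + final_norm

llama3_8b = gqa_decoder_params(4096, 32, 32, 8, 14_336, 128_256)
llama3_70b = gqa_decoder_params(8192, 80, 64, 8, 28_672, 128_256)

print(f"8B:  {llama3_8b / 1e9:.2f}B, embeddings {2 * 128_256 * 4096 / 1e9:.2f}B")
print(f"70B: {llama3_70b / 1e9:.2f}B")
```

About 1.05B of the 8B model sits in the embedding and output matrices, which is a large part of why the smallest Llama 3 model is "8B" rather than "7B" despite keeping the same depth and width as its predecessor.
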
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 3 Used It |
|---|---|---|
| tiktoken BPE | OpenAI (GPT-3/4 tokeniser) | 128,256 vocab; byte-level fallback; improved multilingual/code efficiency |
| Universal GQA | Ainslie et al., 2023 | Applied to all model sizes (previously only 34B/70B in Llama 2) |
| FlashAttention 2 | Dao et al., 2023 | 2× speedup vs FA1; used throughout training |
| DPO (Direct Preference Optimisation) | Rafailov et al., 2023 | Added as final post-training stage after SFT + RS + PPO |
| Rejection Sampling Fine-tuning | Meta Llama 2 paper | Multi-stage: SFT → RS → PPO → DPO |
| Llama Guard | Meta, 2023 | Input/output safety classification model; open-sourced alongside Llama 3 |
| Scaling Laws | Hoffmann et al., 2022 (Chinchilla) | 15T tokens, deliberately over-training well beyond the Chinchilla-optimal budget |
Llama 3.1 (July 2024)
Summary
- Context Explosion: Context window extended to 131,072 tokens (128K) by rescaling RoPE frequencies on top of the base θ=500,000; trained with sequences up to 128K.
- New Flagship, 405B: 126 layers, 128 Q heads, 8 KV heads, hidden dim 16,384; largest open-weight dense model at time of release.
- Multilingual: Official support for 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai.
- Tool Use & Function Calling: First Llama to support structured JSON tool calls natively in the instruct model; enables agentic pipelines.
- Multi-Stage Post-Training: Supervised Fine-Tuning → Rejection Sampling (per-language) → DPO, repeated over several rounds; each round builds on the last.
- Synthetic Data at Scale: Meta generated large volumes of synthetic instruction and reasoning data (similar to Nemotron-4 approach); key to 405B performance.
- Distillation from 405B: The 8B and 70B Instruct models in 3.1 were explicitly improved via knowledge distillation from the 405B teacher.
- VRAM Requirements: 405B weights in BF16 occupy about 810 GB, more than a single 8 × H100 80GB node provides (640 GB), so single-node serving relies on FP8; 70B in BF16 (about 140 GB) needs at least 2 × H100 80GB.
- International Safety Standard: First open model released with a model card addressing EU AI Act risk categories.
- Benchmark: 405B scores 88.6 on MMLU and 51.1 on GPQA, competing with GPT-4o at time of release.
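
The hardware figures in the VRAM bullet follow from weights-only arithmetic. The sketch below counts parameter storage only and ignores KV cache, activations, and runtime overhead, so real deployments need headroom beyond these minimums; the helper names are hypothetical.

```python
def weight_gb(n_params, bytes_per_param):
    """Decimal gigabytes needed just to hold the weights."""
    return n_params * bytes_per_param / 1e9

def min_gpus(n_params, bytes_per_param, gpu_gb=80):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    total_bytes = n_params * bytes_per_param
    per_gpu_bytes = gpu_gb * 10**9
    return -(-total_bytes // per_gpu_bytes)  # ceiling division on integers

P_405B = 405 * 10**9
P_70B = 70 * 10**9

for label, params, width in [("405B BF16", P_405B, 2), ("405B FP8", P_405B, 1),
                             ("70B BF16", P_70B, 2)]:
    print(f"{label}: {weight_gb(params, width):.0f} GB, "
          f">= {min_gpus(params, width)} x 80 GB GPUs")
```

A BF16 copy of the 405B weights is about 810 GB, which exceeds the 640 GB available on eight 80 GB cards, so fitting the model on one 8-GPU node implies a lower-precision format such as FP8.
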
Architecture Diagram
Community Perspective
Llama 3.1 405B became a reference point for the entire open-source LLM ecosystem, arguably the first model to make GPT-4-level performance accessible without API dependence for well-resourced organisations. The 128K context window unlocked document-level reasoning tasks that were previously exclusive to commercial APIs. The officially sanctioned distillation policy (using 405B outputs to improve smaller Llama variants) enabled a new class of small-but-capable fine-tunes; community projects like Hermes 3, Nous-Capybara, and dozens of GGUF quantisations proliferated rapidly.
Model Variants
| Model | Parameters | Layers | Heads (Q/KV) | Hidden Dim | FFN Dim | Context |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 131,072 |
| Llama 3.1 8B Instruct | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 131,072 |
| Llama 3.1 70B | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 131,072 |
| Llama 3.1 70B Instruct | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 131,072 |
| Llama 3.1 405B | 405B | 126 | 128/8 | 16,384 | 53,248 | 131,072 |
| Llama 3.1 405B Instruct | 405B | 126 | 128/8 | 16,384 | 53,248 | 131,072 |
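
The jump from 8,192 to 131,072 tokens in the Context column comes from rescaling RoPE frequencies rather than changing θ. The sketch below follows the piecewise scheme as I understand it from the publicly released Llama 3.1 configuration: the constants (factor 8, low/high frequency factors 1 and 4, original context 8,192) are not stated in this document and should be treated as assumptions, as should the function names.

```python
import math

def rope_frequencies(head_dim, theta=500_000.0):
    """Base RoPE frequencies theta**(-2i/d) for i = 0 .. d/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def scale_frequencies(freqs, factor=8.0, low_freq_factor=1.0,
                      high_freq_factor=4.0, original_context=8192):
    """Slow down long-wavelength components; leave short-wavelength ones alone."""
    low_freq_wavelen = original_context / low_freq_factor
    high_freq_wavelen = original_context / high_freq_factor
    scaled = []
    for f in freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            scaled.append(f)            # local detail: unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(f / factor)   # long range: stretched by the factor
        else:
            # Smooth blend between the two regimes.
            smooth = ((original_context / wavelen - low_freq_factor)
                      / (high_freq_factor - low_freq_factor))
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled

freqs = rope_frequencies(head_dim=128)
scaled = scale_frequencies(freqs)
```

High-frequency components, which encode nearby-token order, are untouched, while the slowest components are stretched eightfold so that positions far beyond the original 8K window still map to distinguishable rotation angles.
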
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 3.1 Used It |
|---|---|---|
| RoPE θ Scaling (θ=500K) | Chen et al., 2023; Su et al., 2021 | Extends effective context from 8K to 128K via frequency interpolation |
| Knowledge Distillation | Hinton et al., 2015 | 405B used as teacher; 8B and 70B Instruct improved via distillation |
| Multi-Stage Post-Training | Meta internal | SFT → RS → DPO over several rounds; each round refines the previous |
| Synthetic Instruction Data | Gunasekar et al., 2023 (phi-1) | Large-scale synthetic generations for multilingual and tool-use alignment |
| Function/Tool Calling | OpenAI API design | Structured JSON schemas; models fine-tuned to follow tool schemas |
| Near-Optimal Scaling | Hoffmann et al., 2022 | 15T tokens for 405B; compute-optimal relative to model size |
Llama 3.2 (September 2024)
Summary
- Two Product Lines: Vision models (11B, 90B) and edge/mobile text models (1B, 3B).
- Vision Architecture: Cross-attention adapter (Flamingo-style) attaching a ViT image encoder to the Llama 3.1 text backbone, whose weights stay frozen during adapter training; cross-attention inserted at every fourth transformer layer.
- Image Encoder: ViT-H/14 backbone, image resolution 560 × 560 px; tiled to handle higher-resolution inputs.
- Vision-Language Alignment: Linear projection of ViT patch embeddings followed by cross-attention; text decoder treats image features as external memory.
- 11B Vision Model: Llama 3.1 8B text backbone + vision adapter layers; strong single-image understanding, OCR, charts.
- 90B Vision Model: Llama 3.1 70B text backbone + vision adapter; multi-image reasoning, document analysis, visual QA.
- 1B Edge Model: Obtained via structured pruning of Llama 3.1 8B followed by knowledge distillation; targets mobile NPUs and CPUs.
- 3B Edge Model: Also pruned from Llama 3.1 8B, then distilled using logits from the Llama 3.1 8B and 70B models; on-device performance near 7B-class.
- Quantisation: 1B and 3B released in INT4 SpinQuant and QAT quantised variants for mobile deployment (Apple silicon, Arm, Qualcomm).
- On-Device Context: Edge models support 128K context with quantised KV cache; designed for real-time streaming inference.
- Meta AI Connect: Announced at Meta Connect 2024 with live demo of on-device inference on Ray-Ban Meta glasses.
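
The distillation step used for the edge models can be illustrated with the classic temperature-softened KL objective of Hinton et al. This is a generic sketch, not Meta's training recipe: the temperature of 2.0, the toy logits, and the function names are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Toy next-token logits over a 3-token vocabulary (hypothetical values).
teacher = [2.0, 1.0, 0.0]
matched = distillation_loss(teacher, [2.0, 1.0, 0.0])
mismatched = distillation_loss(teacher, [0.0, 1.0, 2.0])
```

Training against the teacher's full distribution gives the student a richer signal per token than the one-hot next-token label, which is why a pruned model can recover much of the quality lost to pruning.
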
Architecture Diagram
Community Perspective
Llama 3.2 marked Meta's entry into the on-device AI race. The 1B and 3B models demonstrated that a well-distilled small model could punch significantly above its weight class, reviving interest in model compression research. The vision models provided a prominent open-weight multimodal alternative to GPT-4V, with the 90B competing on several VQA and document benchmarks. The Flamingo-style cross-attention adapter architecture became a popular blueprint for community vision extensions to other text-only models.
Model Variants
| Model | Parameters | Architecture | Input | Context | Notes |
|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | Pruned decoder | Text | 128K | Mobile/edge; INT4 quantised variants |
| Llama 3.2 3B | 3.21B | Pruned decoder | Text | 128K | On-device; QAT quantised variants |
| Llama 3.2 11B | 11B | Llama 3.1 8B + Vision Adapter | Text + Image | 128K | Cross-attention, ViT-H/14 |
| Llama 3.2 11B Vision Instruct | 11B | Llama 3.1 8B + Vision Adapter | Text + Image | 128K | Instruction-tuned; OCR, VQA |
| Llama 3.2 90B | 90B | Llama 3.1 70B + Vision Adapter | Text + Image | 128K | Multi-image, document analysis |
| Llama 3.2 90B Vision Instruct | 90B | Llama 3.1 70B + Vision Adapter | Text + Image | 128K | Instruction-tuned; chart/figure QA |
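
To make the "INT4 quantised variants" note concrete, the sketch below shows plain symmetric round-to-nearest 4-bit quantisation with one scale per group of weights. It is deliberately not SpinQuant or QAT, which add learned rotations and training-time simulation on top of this basic idea; the group size, function names, and sample weights are all assumptions.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantisation to integer codes in [-8, 7]."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest value to +/-7
        codes = [max(-8, min(7, round(w / scale))) for w in chunk]
        groups.append((scale, codes))
    return groups

def dequantize(groups):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]

# Hypothetical weight values.
weights = [0.12, -0.5, 0.33, 0.07, 1.5, -2.0, 0.0, 0.9]
packed = quantize_int4(weights)
restored = dequantize(packed)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in 4 bits plus a shared per-group scale, so the reconstruction error per weight is bounded by half of that group's scale; smaller groups cost more scale storage but track outliers better.
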
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 3.2 Used It |
|---|---|---|
| Cross-Attention Vision Adapter | Flamingo (Alayrac et al., 2022) | Image encoder cross-attended at every 4th layer; text backbone weights kept frozen |
| ViT Image Encoder | Dosovitskiy et al., 2020 | ViT-H/14; 560×560 input with tiling for high-res |
| Structured Pruning | Michel et al., 2019; various | Layer/head/width pruning applied to Llama 3.1 8B to initialise the 1B and 3B models |
| Knowledge Distillation | Hinton et al., 2015 | Teacher-student training with Llama 3.1 as teacher for edge models |
| SpinQuant / QAT | Meta internal, 2024 | 4-bit variants for mobile deployment: quantisation-aware training with LoRA adaptors, and SpinQuant post-training quantisation |
| LLM-in-a-glasses form factor | Meta Connect 2024 | Demonstrated on Ray-Ban Meta smart glasses for real-time inference |
Llama 3.3 (December 2024)
Summary
- Single Model Release: Only one model size, 70B; same architectural specification as Llama 3.1 70B.
- Enhanced Post-Training: Entirely revised SFT dataset, new DPO preference pairs, and extended PPO reward modelling; no architectural changes.
- Math & Reasoning Gains: MATH benchmark improvement to 77.0 (vs 68.0 for Llama 3.1 70B Instruct); structured chain-of-thought data incorporated.
- Coding Improvements: HumanEval rises to 88.4 (vs 80.5 for 3.1 70B Instruct); new code-specific SFT and DPO data.
- Instruction Following: Significantly improved IFEval scores; better adherence to format constraints (JSON, markdown, lists).
- Multilingual Retention: All 8 languages from 3.1 maintained; additional data for Hindi and Thai in post-training.
- Cost-Performance Target: Designed to offer Llama 3.1 405B-class performance at 70B inference cost; Meta claims parity on many benchmarks.
- Drop-in Replacement: Compatible with Llama 3.1 70B serving infrastructure; same tokeniser, same context length, same tool-call schema.
- Safety Updates: Updated Llama Guard 3 and Prompt Guard released simultaneously; improved refusal on newer jailbreak patterns.
- Research Insight: Demonstrates that high-quality post-training data can yield substantial gains without any pretraining compute.
Architecture Diagram
[Diagram not reproduced. Recoverable labels: (Math + Code + IF); (per capability); + PPO.]
Community Perspective
Llama 3.3 illustrated a key principle increasingly understood in the field: post-training data quality can deliver disproportionate gains relative to compute. By releasing a single improved 70B model, Meta offered organisations a free upgrade path from 3.1 70B with no infrastructure changes. The model quickly became the default recommendation for cost-sensitive deployments, with benchmarks showing it frequently matching or exceeding Llama 3.1 405B on reasoning and coding tasks at a fraction of the serving cost. It also sparked renewed interest in post-training research as an alternative to scaling pretraining compute.
Model Variants
| Model | Parameters | Architecture | Context | Key Improvements |
|---|---|---|---|---|
| Llama 3.3 70B | 70.6B | Identical to Llama 3.1 70B base | 128K | Better math/reasoning datasets |
| Llama 3.3 70B Instruct | 70.6B | Llama 3.1 70B base + new post-training | 128K | HumanEval 88.4, MATH ~77, IFEval gains |
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 3.3 Used It |
|---|---|---|
| High-Quality SFT Curation | LIMA (Zhou et al., 2023) | "Less is more" principle applied; curated math/code/IF examples over volume |
| Process Reward Models (PRMs) | Lightman et al., 2023 | Step-level math reward signals incorporated in rejection sampling |
| DPO with Fresh Preference Data | Rafailov et al., 2023 | New human preference labels targeting 3.1 70B failure modes |
| Chain-of-Thought Distillation | Wei et al., 2022 | CoT traces from 405B used to supervise 70B on reasoning |
| IFEval-targeted Training | Zhou et al., 2023 | Explicit instruction-following format tasks in SFT and DPO |
| Capability-specific Rejection Sampling | Meta internal | Separate RS pools for math, code, multilingual, and safety |
Llama 4 (April 2025)
Summary
- Architecture Paradigm Shift: Native early-fusion multimodal Mixture-of-Experts (MoE); abandons the cross-attention adapter approach of Llama 3.2.
- Early Fusion: Image and text tokens are processed together from the first layer of a unified backbone; visual tokens come from a MetaCLIP-based vision encoder rather than being attached through a cross-attention adapter.
- MoE FFN: Sparse expert routing replaces the dense SwiGLU FFN; only a subset of experts activated per token, dramatically reducing active parameters vs total parameters.
- iRoPE Positional Encoding: "interleaved RoPE": attention layers with no positional encoding (NoPE) are interleaved with layers that use standard RoPE; intended to enable extrapolation to very long sequences.
- Scout (109B total, 17B active): 16 experts; 10M token context window; single H100 deployment target (with Int4 quantisation); vision + text.
- Maverick (400B total, 17B active): 128 routed experts plus a shared expert, with each token sent to the shared expert and one routed expert; 1M context; full multimodal; strong benchmark performance; requires ~8× H100.
- Behemoth (~2T total, 288B active): 16 experts; training/teacher model; used for distillation; 16K+ context; not released in April 2025.
- MetaP Hyperparameter Tuning: New technique for systematic hyperparameter transfer across model scales; stabilises training of very large MoE models.
- Training Scale: 30T+ tokens on multimodal data; largest pretraining budget in the Llama series.
- Omni Capability: Scout and Maverick handle image, video, and text inputs; text-only output at release; voice planned.
- Benchmark: Maverick achieves 80.5 on MMLU-Pro and competitive results on GPQA Diamond; Scout's 10M context enables retrieval over entire codebases.
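
The sparse routing described in the MoE bullets can be sketched for a single token. This toy version combines an always-on shared expert with the top-k routed experts; the softmax gate over the selected experts, the dot-product router, and the scalar "experts" are simplifying assumptions for illustration and do not reproduce Llama 4's actual gating.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, routed_experts, shared_expert, top_k=1):
    """Shared expert output plus a gated sum over the top_k routed experts."""
    # Router: one score per routed expert (dot product with the token vector).
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    out = shared_expert(x)
    for gate, idx in zip(gates, chosen):
        expert_out = routed_experts[idx](x)   # only the chosen experts run
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen

# Four toy routed experts that scale the input by 0, 1, 2, 3 (hypothetical).
experts = [lambda v, k=k: [k * t for t in v] for k in range(4)]
shared = lambda v: list(v)
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, active = moe_layer([0.2, 0.9], router, experts, shared)
```

Only the selected expert's weights are touched for this token, which is how a model can hold hundreds of billions of parameters in total while spending roughly 17B per token.
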
Architecture Diagram
[Diagram not reproduced. Recoverable labels: Scout and Maverick GQA configurations; Maverick: 128 experts (top-k routing); Behemoth: 16 experts, 288B active.]
Community Perspective
Llama 4 represents the most architecturally ambitious Llama release to date. The shift from cross-attention vision adapters to native early-fusion MoE aligns Meta's approach with GPT-4o and Gemini Ultra in treating multimodality as a first-class concern rather than an add-on. Scout's 10M-token context window is unprecedented in an open-weight model and opens entirely new application categories: whole-repository code analysis, book-length summarisation, and long-horizon agentic tasks. The iRoPE position encoding scheme attracted significant academic interest as a practical solution to the context-length extrapolation problem. The MetaP hyperparameter transfer technique addresses one of the key engineering pain points in training very large MoE models reliably.
Model Variants
| Model | Total Params | Active Params | Experts | Context | Modality | Notes |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | 10M | Text + Image | Single H100 80GB deployment |
| Llama 4 Scout Instruct | 109B | 17B | 16 | 10M | Text + Image | Instruction-tuned |
| Llama 4 Maverick | 400B | 17B | 128 | 1M | Text + Image | ~8× H100 required |
| Llama 4 Maverick Instruct | 400B | 17B | 128 | 1M | Text + Image | Instruction-tuned; MMLU-Pro 80.5 |
| Llama 4 Behemoth | ~2T | 288B | 16 | 16K+ | Text | Teacher model; not released Apr 2025 |
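
The gap between the Total and Active columns, and the deployment notes, can be checked with a few lines of arithmetic. The sketch counts weights only (no KV cache or activations), treats Behemoth's "~2T" as 2,000B, and assumes 4-bit and 16-bit storage widths for comparison; the helper names are hypothetical.

```python
# (total parameters, active parameters) in billions, taken from the table.
MODELS = {"Scout": (109, 17), "Maverick": (400, 17), "Behemoth": (2000, 288)}

def active_fraction(total_b, active_b):
    """Share of the weights that participate in computing one token."""
    return active_b / total_b

def weight_gb(total_b, bits_per_param):
    """Decimal gigabytes needed to store all weights at a given precision."""
    return total_b * bits_per_param / 8

for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} active, "
          f"{weight_gb(total, 16):.0f} GB at 16-bit, "
          f"{weight_gb(total, 4):.1f} GB at 4-bit")
```

Compute per token tracks the active count, but memory tracks the total: every expert must be resident even though few are used at once, which is why Scout's single-80GB-GPU target depends on 4-bit weights.
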
Key Industry Ideas Incorporated
| Technique | Origin | How Llama 4 Used It |
|---|---|---|
| Sparse MoE (top-k routing) | Shazeer et al., 2017; Switch Transformer | Each token routes to top-k of N experts in FFN; ~17B active per token |
| Early Fusion Multimodal | Gemini (Google, 2023); Chameleon (Meta, 2024) | Visual and text tokens processed jointly from layer 0; no cross-attention adapter |
| iRoPE (interleaved NoPE + RoPE) | Meta internal; inspired by YaRN, LongRoPE | Alternating position-free and RoPE layers enable 10M+ context extrapolation |
| MetaP Hyperparameter Transfer | Meta internal, 2025 | μP-inspired framework for transferring LR/batch-size across MoE scales |
| Load Balancing Loss (MoE) | Lepikhin et al., 2020; Fedus et al., 2021 | Auxiliary loss to prevent expert collapse during training |
| 30T+ Token Pretraining | Meta internal | Multimodal web-scale corpus; largest single pretraining run in the series |
| Distillation from Behemoth | Hinton et al., 2015 | Scout and Maverick post-training improved via Behemoth teacher |
References
Technical Papers
Official Blog Posts
| Post | Date | Topic |
|---|---|---|
| Introducing LLaMA: A foundational, 65-billion-parameter language model | Feb 2023 | Llama 1 announcement |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Jul 2023 | Llama 2 release |
| Meta Llama 3 | Apr 2024 | Llama 3 announcement |
| Llama 3.1: Our most capable models to date | Jul 2024 | Llama 3.1 / 405B |
| Llama 3.2: Revolutionizing edge AI and vision | Sep 2024 | Vision + edge models |
| Llama 3.3: New 70B model with improved performance | Dec 2024 | Llama 3.3 70B |
| Llama 4: The next generation of open foundation models | Apr 2025 | Llama 4 Scout/Maverick |
GitHub Repositories
| Repository | Description |
|---|---|
| meta-llama/llama | Official Llama 1 weights and inference code |
| meta-llama/llama2 | Official Llama 2 repository |
| meta-llama/llama3 | Official Llama 3 tokeniser and model card |
| meta-llama/llama-models | Unified Llama 3.x model cards and configs |
| ggerganov/llama.cpp | C++ inference engine; GGUF quantised formats |
| huggingface/transformers | HF integration for all Llama variants |
| vllm-project/vllm | High-throughput inference with continuous batching |
| meta-llama/PurpleLlama | Llama Guard, Code Shield, Prompt Guard safety tools |
Cited Techniques
| Technique | Reference | Used In |
|---|---|---|
| Rotary Position Embeddings (RoPE) | Su et al., arXiv:2104.09864 | Llama 1 to 4 |
| RMSNorm | Zhang & Sennrich, NeurIPS 2019 | Llama 1 to 4 |
| SwiGLU | Shazeer, arXiv:2002.05202 | Llama 1 to 4 |
| Grouped-Query Attention | Ainslie et al., arXiv:2305.13245 | Llama 2 (34B/70B), 3 to 4 |
| FlashAttention 2 | Dao, arXiv:2307.08691 | Llama 3 to 4 |
| tiktoken BPE | OpenAI, 2022 | Llama 3 to 4 |
| RLHF / PPO | Stiennon et al.; Ouyang et al. | Llama 2 to 3.3 |
| DPO | Rafailov et al., arXiv:2305.18290 | Llama 3 to 3.3 |
| Knowledge Distillation | Hinton et al., arXiv:1503.02531 | Llama 3.1, 3.2, 4 |
| Ghost Attention | Touvron et al., arXiv:2307.09288 | Llama 2 Chat |
| Sparse MoE | Shazeer et al., arXiv:1701.06538 | Llama 4 |
| Early-Fusion Multimodal | Chameleon (Meta); Gemini | Llama 4 |
| iRoPE | Meta internal, 2025 | Llama 4 |
| SpinQuant / QAT | Meta internal, 2024 | Llama 3.2 edge |
| Structured Pruning | Michel et al., 2019 | Llama 3.2 1B/3B |
This document covers the Llama model family from Llama 1 (Feb 2023) through Llama 4 (Apr 2025).
All benchmark figures are reported as published; instruct vs base distinctions apply where noted.
Maintained for educational and research reference purposes.