🦙 Llama Model Family: A Comprehensive Technical Reference

From a fully open-weight decoder-only Transformer to a native multimodal Mixture-of-Experts colossus: the complete evolutionary arc of Meta's Llama series.


📋 Executive Summary

The Llama (Large Language Model Meta AI) series is Meta’s flagship family of open-weight language models, spanning from a pure research release in February 2023 to a production-scale multimodal Mixture-of-Experts system in April 2025.

  • Llama 1 (Feb 2023): Decoder-only Transformer with RoPE, RMSNorm, and SwiGLU; 4 sizes (7B–65B); sparked the open-source LLM revolution.
  • Llama 2 (Jul 2023): Expanded context to 4K, added GQA for larger models, introduced RLHF-tuned Chat variants with Ghost Attention; commercial license.
  • Llama 3 (Apr 2024): GQA universally applied, 128K vocabulary (tiktoken), 15T training tokens; 8B and 70B flagship sizes.
  • Llama 3.1 (Jul 2024): 128K context via RoPE scaling (θ=500,000), new 405B flagship, multilingual capability, tool use, multi-stage post-training pipeline.
  • Llama 3.2 (Sep 2024): Introduced vision models (11B, 90B with cross-attention adapters) and small edge models (1B, 3B via pruning/distillation).
  • Llama 3.3 (Dec 2024): Single drop-in 70B improvement via enhanced post-training: better math, reasoning, and coding at lower compute cost.
  • Llama 4 (Apr 2025): Native early-fusion multimodal MoE (Scout 17B active / Maverick 17B active / Behemoth 288B active); iRoPE positional encoding; up to 10M context.

📅 Version Release Timeline

| Version | Release | Key Milestone | Parameters | Context |
|:-------:|:-------:|:-------------|:----------:|:-------:|
| Llama 1 | Feb 2023 | First fully open-weight competitive LLM | 7B – 65B | 2K |
| Llama 2 | Jul 2023 | Commercial license, RLHF Chat, GQA (large models) | 7B – 70B | 4K |
| Llama 3 | Apr 2024 | 128K vocab, GQA universal, 15T tokens | 8B, 70B | 8K |
| Llama 3.1 | Jul 2024 | 128K context, 405B flagship, tool use | 8B – 405B | 128K |
| Llama 3.2 | Sep 2024 | Vision models + edge models (1B/3B) | 1B – 90B | 128K |
| Llama 3.3 | Dec 2024 | Improved post-training; 70B only | 70B | 128K |
| Llama 4 | Apr 2025 | Native multimodal MoE, iRoPE, 10M ctx | 109B–~2T total | 10M |

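📊 Cross-Version Benchmark Comparison

| Benchmark | Llama 1 (65B) | Llama 2 (70B) | Llama 3 (70B) | Llama 3.1 (70B) | Llama 3.3 (70B) |
|:----------|:-------------:|:-------------:|:-------------:|:---------------:|:---------------:|
| MMLU | 63.4 | 68.9 | 82.0 | 86.0 | 86.0 |
| HumanEval | 23.7 | 29.9 | 81.7 | 80.5 | 88.4 |
| MATH | 6.7 | 16.0 | 50.4 | 65.1 | ~77 |
| GSM8K | 50.9 | 56.8 | 93.0 | 95.1 | ~95 |
| Context Window | 2K | 4K | 8K | 128K | 128K |
| Training Tokens | 1.4T | 2T | 15T | 15T | 15T |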

🏗️ Master Architecture Diagram

Llama Series: Architectural Evolution Overview
Feb 2023 (Llama 1) → Apr 2025 (Llama 4)
Input Tokens
↓
Token Embedding + Positional Encoding
Llama 1–3.3: RoPE (θ=10K → 500K)  |  Llama 4: iRoPE (interleaved)
↓
× N Decoder Layers
Pre-Norm (RMSNorm)
LayerNorm replaced by RMSNorm
Applied before sub-layers
Self-Attention
Llama 1: MHA (32/40/52/64 heads)
Llama 2–3.3: GQA (Llama 2: 34B/70B only; all sizes from Llama 3)
Llama 4: GQA + iRoPE interleaving
Feed-Forward Network
Llama 1–3.3: Dense SwiGLU (W1, W2, W3)
Llama 4: SwiGLU MoE (sparse routing)
↓
Final RMSNorm
↓
Language Model Head (Linear + Softmax)
↓
Next-Token Probabilities
🦙 Shared Core: RoPE + RMSNorm + SwiGLU (all versions)
🔀 GQA from Llama 2 (large) / Llama 3 (all)
⚡ MoE + iRoPE: Llama 4 only

🟤 Llama 1 (February 2023)

📅 Feb 2023 | arXiv:2302.13971 | Meta AI / FAIR

Summary

  • Architecture: Pure decoder-only Transformer; no encoder cross-attention.
  • Positional Encoding: Rotary Position Embeddings (RoPE) with θ=10,000 applied to every attention layer.
  • Normalization: Pre-norm RMSNorm (replaces LayerNorm entirely); eliminates mean centering for efficiency.
  • Activation: SwiGLU feed-forward with three weight matrices (W1, W2, W3); intermediate dim ≈ 2/3 × 4d.
  • Attention: Multi-Head Attention (MHA) with no grouped-query sharing; full Q/K/V projections per head.
  • Vocabulary: 32,000 tokens via SentencePiece BPE (byte-level fallback for unknown characters).
  • Context Window: 2,048 tokens (causal attention mask; no sliding window).
  • Training Data: Publicly available corpora (CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange), totaling ~1T (7B/13B) and 1.4T (33B/65B) tokens.
  • Optimizer: AdamW, β₁=0.9, β₂=0.95, ε=10⁻⁵; cosine LR schedule; weight decay 0.1; gradient clipping 1.0.
  • Training Infrastructure: 2,048 A100 80GB GPUs (65B model); efficient memory via FlashAttention v1 and activation checkpointing.
  • Open Release: Weights released under a non-commercial research license; spawned hundreds of community fine-tunes (Alpaca, Vicuna, WizardLM, etc.).
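
The RoPE bullet above can be illustrated with a minimal, dependency-free sketch. It rotates adjacent pairs of a query or key vector by an angle proportional to the token position, using θ=10,000 as in Llama 1. The function names and the adjacent-pair layout are assumptions made for illustration (some implementations pair the first and second halves of the vector instead); this is not Meta's reference code.

```python
import math


def rope_rotate(vec, position, theta=10_000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        freq = theta ** (-i / dim)      # pair k = i/2 uses frequency theta^(-2k/dim)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional query and key (hypothetical values).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (2 tokens) at two different absolute positions.
s_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
s_far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
```

`s_near` and `s_far` agree up to floating-point error, which is the property that makes RoPE a relative position encoding: the attention score depends on the offset between the two positions, not on the positions themselves.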

Architecture Diagram

Llama 1 Architecture (Feb 2023)
7B / 13B / 33B / 65B: Decoder-only Transformer with RoPE + RMSNorm + SwiGLU
Input Tokens (vocab: 32K)
↓
Token Embedding + RoPE (θ=10,000)
↓
Repeat × N Layers (32/40/60/80)
RMSNorm (pre)
Pre-norm, no bias
Multi-Head Attention (MHA)
Causal · equal numbers of Q, K, and V heads · RoPE on Q, K
↓ residual add ↓
RMSNorm (pre)
Pre-norm, no bias
SwiGLU FFN
W2(SiLU(W1·x) ⊙ W3·x) · intermediate ≈ ⅔ × 4d
↓
Final RMSNorm
↓
LM Head β†’ Logits (32K vocab)
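
The two non-attention blocks in the diagram above can be sketched in plain Python. This is an illustrative sketch with hypothetical helper names and toy weights, not the production implementation. The rounding of the FFN width to a multiple of 256 is an assumption taken from the public reference code rather than from this document.

```python
import math


def rms_norm(x, gain, eps=1e-5):
    """Scale x by the reciprocal of its root mean square (no mean centering, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def silu(z):
    return z / (1.0 + math.exp(-z))


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_ffn(x, w1, w2, w3):
    """W2(SiLU(W1 x) * W3 x): gate path through W1, linear path through W3."""
    gate = [silu(g) for g in matvec(w1, x)]
    up = matvec(w3, x)
    return matvec(w2, [g * u for g, u in zip(gate, up)])


def ffn_hidden_dim(d_model, multiple_of=256):
    """Two thirds of 4*d_model, rounded up to a multiple of `multiple_of` (assumed)."""
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


# Toy example: model dim 2, FFN dim 3 (hypothetical weights).
x = [1.0, -2.0]
w1 = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w3 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w2 = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = swiglu_ffn(rms_norm(x, [1.0, 1.0]), w1, w2, w3)
```

With a hidden dimension of 4,096, `ffn_hidden_dim` gives 11,008, which is how the "≈ ⅔ × 4d" rule turns into a concrete layer width.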

Community Perspective

Llama 1 fundamentally democratised large language model research. Within weeks of its release (distributed to approved researchers, then leaked publicly), the community produced Alpaca (Stanford CRFM; a 7B fine-tune on 52K instruction examples generated with OpenAI's text-davinci-003), Vicuna (LMSYS; fine-tuned on ShareGPT conversations), WizardLM, and dozens of quantised variants via llama.cpp (GGML at first, GGUF later). The LLaMA paper reported that the 13B model outperformed GPT-3 (175B) on most benchmarks while training only on publicly available data, which supported the case for training smaller models on more tokens and opened a new era of community-driven alignment research.

Model Variants

| Model | Parameters | Layers | Heads (Q) | Hidden Dim | Context | Training Tokens |
|:------|:----------:|:------:|:---------:|:----------:|:-------:|:---------------:|
| Llama 1 7B | 6.7B | 32 | 32 | 4,096 | 2,048 | 1T |
| Llama 1 13B | 13.0B | 40 | 40 | 5,120 | 2,048 | 1T |
| Llama 1 33B | 32.5B | 60 | 52 | 6,656 | 2,048 | 1.4T |
| Llama 1 65B | 65.2B | 80 | 64 | 8,192 | 2,048 | 1.4T |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 1 Used It |
|:----------|:-------|:--------------------|
| Rotary Position Embeddings (RoPE) | Su et al., 2021 | Applied to Q and K projections at every layer; θ=10,000 |
| RMSNorm | Zhang & Sennrich, 2019 | Replaces LayerNorm pre-sub-layer; removes mean centering |
| SwiGLU Activation | Shazeer, 2020 | FFN uses gated linear unit with SiLU; three weight matrices |
| Pre-norm Transformer | Xiong et al., 2020 | Norm applied before attention and FFN (vs. post-norm) |
| FlashAttention | Dao et al., 2022 | Tiling-based attention for memory-efficient training on A100s |
| BPE Tokeniser | Sennrich et al., 2016 | 32K token vocab via SentencePiece byte-level BPE |
| Causal LM Objective | GPT series | Standard next-token prediction, no masked LM |

🟠 Llama 2 (July 2023)

📅 Jul 2023 | arXiv:2307.09288 | Meta AI

Summary

  • Context Expansion: Window doubled to 4,096 tokens; trained with a longer document mix.
  • Grouped-Query Attention (GQA): Introduced for 34B and 70B models only; 7B and 13B retain full MHA.
  • GQA Configuration (70B): 64 query heads, 8 key-value heads (8:1 ratio); cuts KV-cache memory roughly 8× relative to full MHA.
  • GQA Configuration (34B): 64 query heads, 8 key-value heads (8:1 ratio).
  • Training Data: ~2 trillion tokens; updated pretraining mix that up-samples the most factual sources.
  • Chat Variants: First official instruction-tuned release; SFT on >27,500 human-annotated examples.
  • RLHF Pipeline: Two separate reward models trained, one for helpfulness and one for safety; rejection sampling and Proximal Policy Optimisation (PPO) applied iteratively.
  • Ghost Attention (GAtt): Synthetic technique to preserve system-prompt adherence across multi-turn dialogue; the instruction is attached to every user turn when sampling synthetic dialogues, then kept only in the first turn of the fine-tuning data.
  • Safety Measures: Red-teaming, safety reward model, context distillation, human evaluations; a Responsible Use Guide released alongside weights.
  • Commercial License: Llama 2 Community License allows commercial use; organisations with more than 700M monthly active users must request a separate license from Meta.
  • Code Model: Code Llama released separately (Aug 2023) on top of Llama 2 base; 7B/13B/34B, with long-context fine-tuning for inputs up to 100K tokens and infilling support in the 7B and 13B models.
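
To make the KV-cache saving from GQA concrete, here is a back-of-the-envelope sketch. It uses the 70B shape from the Model Variants table (80 layers, 64 query heads, 8 KV heads, hidden dim 8,192, so a head dimension of 128) and assumes 16-bit cache entries and a single sequence at the full 4,096-token context. The function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


head_dim = 8192 // 64  # hidden dim / query heads = 128

# Hypothetical MHA version of the 70B model: one KV head per query head.
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=head_dim, seq_len=4096)

# GQA as listed in the variants table: 8 KV heads shared by 64 query heads.
gqa_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=head_dim, seq_len=4096)

print(f"MHA: {mha_bytes / 2**30:.2f} GiB, GQA: {gqa_bytes / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks from 10 GiB to 1.25 GiB per sequence, an 8× reduction that matches the ratio of query heads to KV heads.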

Architecture Diagram

Llama 2 Architecture (Jul 2023)
7B / 13B (MHA) · 34B / 70B (GQA) · Context: 4096 · RLHF Chat Variants
Input Tokens (32K vocab, SentencePiece)
↓
Token Embedding + RoPE (θ=10,000), 4096 ctx
↓
Repeat × N Layers
RMSNorm (pre)
NEW for 34B/70B
Grouped-Query Attention (GQA)
70B: 64 Q heads / 8 KV heads
7B/13B: MHA (unchanged)
↓ residual ↓
RMSNorm (pre)
SwiGLU FFN (Dense)
Identical to Llama 1
↓
Final RMSNorm → LM Head
↓
Base Model
NEW
Chat (SFT+RLHF+GAtt)

Community Perspective

Llama 2’s commercial license transformed the ecosystem: enterprises could now legally deploy and fine-tune these weights in production. The 70B Chat model quickly became the reference open-weight instruction-tuned LLM, benchmarked extensively against GPT-3.5. The Ghost Attention mechanism addressed a key weakness in multi-turn instruction following that plagued early RLHF systems. Code Llama (built on Llama 2) became one of the most widely adopted open code generation models, directly influencing Copilot-alternative tooling.

Model Variants

| Model | Parameters | Layers | Heads (Q/KV) | Hidden Dim | Context | Notes |
|:------|:----------:|:------:|:------------:|:----------:|:-------:|:------|
| Llama 2 7B | 6.7B | 32 | 32/32 (MHA) | 4,096 | 4,096 | Base + Chat |
| Llama 2 13B | 13.0B | 40 | 40/40 (MHA) | 5,120 | 4,096 | Base + Chat |
| Llama 2 34B | 34B | 48 | 64/8 (GQA) | 8,192 | 4,096 | Trained and evaluated in the paper; weights not released |
| Llama 2 70B | 68.9B | 80 | 64/8 (GQA) | 8,192 | 4,096 | Base + Chat |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 2 Used It |
|:----------|:-------|:--------------------|
| Grouped-Query Attention (GQA) | Ainslie et al., 2023 | Applied to 34B and 70B; reduces KV-cache ~8× |
| RLHF with PPO | Stiennon et al., 2020; InstructGPT | Two reward models (helpfulness + safety) trained on 1.4M comparisons |
| Ghost Attention (GAtt) | Meta internal | Synthetic multi-turn training to preserve system-prompt over long conversations |
| Rejection Sampling Fine-tuning | Meta internal | Used between SFT and PPO; sample K responses, keep highest reward |
| Safety Red-teaming | Anthropic, OpenAI tradition | Dedicated red team; safety-specific reward model; context distillation |
| Context Distillation | Askell et al., 2021 | Distilling safety behavior from system-prompted model to base model |
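
The "sample K responses, keep highest reward" step of rejection sampling can be sketched as a best-of-K selection. Everything below is a toy stand-in: `toy_sample` plays the role of the policy model and `toy_reward` the role of a learned reward model, and both are hypothetical helpers invented for this illustration.

```python
def best_of_k(prompt, sample_fn, reward_fn, k=4):
    """Draw k candidate responses and return the highest-reward one with its score."""
    candidates = [sample_fn(prompt, seed) for seed in range(k)]
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    best_reward, best = max(scored, key=lambda pair: pair[0])
    return best, best_reward


# Hypothetical canned "samples" from a policy.
CANNED = [
    "I don't know.",
    "Paris is the capital of France.",
    "France.",
    "The capital of France is Paris, a large city.",
]


def toy_sample(prompt, seed):
    return CANNED[seed % len(CANNED)]


def toy_reward(prompt, response):
    # Toy scoring rule: reward the correct entity, lightly penalise length.
    return (10.0 if "Paris" in response else 0.0) - 0.01 * len(response)


best, reward = best_of_k("What is the capital of France?", toy_sample, toy_reward)
```

In a fine-tuning pipeline of this kind, the selected responses would then serve as new supervised targets, so the model is trained towards its own best samples.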

🟡 Llama 3 (April 2024)

📅 Apr 2024 | Meta AI Blog; paper: arXiv:2407.21783 (Jul 2024) | Meta AI

Summary

  • Architecture: Same decoder-only Transformer skeleton; GQA now applied universally across all model sizes.
  • Vocabulary Jump: 128,256 tokens using tiktoken BPE (previously 32K SentencePiece); much better multilingual and code tokenisation.
  • Model Sizes: Two flagship sizes, 8B and 70B; a 400B+ model was still in training at launch.
  • Context Window: 8,192 tokens (2× Llama 2); RoPE base θ raised from 10,000 to 500,000, the same value Llama 3.1 later uses at 128K.
  • Training Scale: 15T tokens (7.5× Llama 2); quality-filtered CommonCrawl + code + multilingual sources; 95% English.
  • 8B Config: 32 layers, 32 Q heads, 8 KV heads (4:1 GQA), hidden dim 4096.
  • 70B Config: 80 layers, 64 Q heads, 8 KV heads (8:1 GQA), hidden dim 8192.
  • Post-Training: SFT (high-quality human instructions) → Rejection Sampling → PPO → DPO; four distinct stages.
  • Safety Tooling: Llama Guard 2 and Code Shield released alongside; Prompt Guard for injection detection followed with Llama 3.1.
  • FlashAttention 2: Used throughout training; significantly improves memory efficiency vs v1 used in Llama 1.
  • Instruction-tuned Variants: Llama 3 Instruct models show substantial jump on HumanEval (81.7 for 70B-Instruct) and MMLU (82.0) over Llama 2 70B.
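
The 8B and 70B configurations listed above are enough to reproduce the headline parameter counts. The sketch below is a rough tally that assumes untied input and output embeddings, no bias terms, and two RMSNorm gain vectors per layer plus a final one. The function name is hypothetical, and the FFN widths (14,336 and 28,672) come from the Model Variants table.

```python
def llama_dense_params(vocab, d_model, n_layers, n_q_heads, n_kv_heads, d_ffn):
    """Approximate parameter count of a dense Llama-style decoder."""
    head_dim = d_model // n_q_heads
    attn = 2 * d_model * (n_q_heads * head_dim)    # W_q and W_o
    attn += 2 * d_model * (n_kv_heads * head_dim)  # W_k and W_v (shrunk by GQA)
    ffn = 3 * d_model * d_ffn                      # W1, W2, W3 of SwiGLU
    norms = 2 * d_model                            # two RMSNorm gains per layer
    embeddings = 2 * vocab * d_model               # token embedding + LM head (untied)
    return embeddings + n_layers * (attn + ffn + norms) + d_model  # + final norm


params_8b = llama_dense_params(128_256, 4096, 32, 32, 8, 14_336)
params_70b = llama_dense_params(128_256, 8192, 80, 64, 8, 28_672)

print(f"8B:  {params_8b / 1e9:.2f}B")
print(f"70B: {params_70b / 1e9:.2f}B")
```

The totals land at about 8.03B and 70.55B, in line with the 8.0B and 70.6B figures in the variants table. Roughly 1.05B of the 8B model sits in the two 128,256-row embedding matrices, which is one visible cost of the larger vocabulary.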

Architecture Diagram

Llama 3 Architecture (Apr 2024)
8B / 70B · GQA Universal · 128K Vocab · 15T Training Tokens · 8K Context
NEW
Input Tokens (128,256 vocab, tiktoken BPE)
↓
Token Embedding + RoPE (θ=500,000; 8K ctx)
↓
Repeat × 32 (8B) / 80 (70B) Layers
RMSNorm (pre)
Universal GQA
Grouped-Query Attention (GQA)
8B: 32 Q / 8 KV · 70B: 64 Q / 8 KV
FlashAttention 2
↓ residual ↓
RMSNorm (pre)
SwiGLU FFN (Dense)
Same gated activation; scaled intermediate dim
↓
Final RMSNorm → LM Head (128,256)
↓
Next-Token Prediction

Community Perspective

Llama 3 represented a watershed moment in open-weight capabilities. The 70B Instruct model surpassed GPT-3.5 on several established benchmarks and competed seriously with early GPT-4 variants on coding tasks. The tiktoken vocabulary change, while breaking compatibility with prior Llama tokenisers, dramatically improved multilingual efficiency. The Llama Guard safety tooling suite became a widely referenced framework for responsible deployment of open models, with organisations like Hugging Face and AI safety researchers publishing extensive evaluations.

Model Variants

| Model | Parameters | Layers | Heads (Q/KV) | Hidden Dim | FFN Dim | Context |
|:------|:----------:|:------:|:------------:|:----------:|:-------:|:-------:|
| Llama 3 8B | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 8,192 |
| Llama 3 8B Instruct | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 8,192 |
| Llama 3 70B | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 8,192 |
| Llama 3 70B Instruct | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 8,192 |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 3 Used It |
|:----------|:-------|:--------------------|
| tiktoken BPE | OpenAI (GPT-3/4 tokeniser) | 128,256 vocab; byte-level fallback; improved multilingual/code efficiency |
| Universal GQA | Ainslie et al., 2023 | Applied to all model sizes (previously only 34B/70B in Llama 2) |
| FlashAttention 2 | Dao et al., 2023 | 2× speedup vs FA1; used throughout training |
| DPO (Direct Preference Optimisation) | Rafailov et al., 2023 | Added as final post-training stage after SFT + RS + PPO |
| Rejection Sampling Fine-tuning | Meta Llama 2 paper | Multi-stage: SFT → RS → PPO → DPO |
| Llama Guard | Meta, 2023 | Input/output safety classification model; open-sourced alongside Llama 3 |
| Scaling Laws | Hoffmann et al., 2022 (Chinchilla) | 15T tokens: deliberately trained far beyond the Chinchilla-optimal token count for 8B and 70B |

🔶 Llama 3.1 (July 2024)

📅 Jul 2024 | arXiv:2407.21783 | Meta AI

Summary

  • Context Explosion: Window extended to 131,072 tokens (128K) using RoPE with θ=500,000 and long-context continued pre-training on sequences up to 128K.
  • New Flagship (405B): 126 layers, 128 Q heads, 8 KV heads, hidden dim 16,384; largest open-weight dense model at time of release.
  • Multilingual: Official support for 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai.
  • Tool Use & Function Calling: First Llama to support structured JSON tool calls natively in the instruct model; enables agentic pipelines.
  • Multi-Stage Post-Training: Several rounds of reward modelling, rejection sampling, Supervised Fine-Tuning, and DPO; the Llama 3 paper reports preferring DPO over PPO because it needed less compute at this scale.
  • Synthetic Data at Scale: Meta generated large volumes of synthetic instruction and reasoning data (similar to Nemotron-4 approach); key to 405B performance.
  • Distillation from 405B: The 8B and 70B Instruct models in 3.1 were explicitly improved via knowledge distillation from the 405B teacher.
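  • VRAM Requirements: 405B weights in BF16 occupy ~810 GB, more than one 8 × H100 80GB node (640 GB), so BF16 serving needs multiple nodes while an FP8 version (~405 GB) fits on a single node; 70B in BF16 (~141 GB) needs at least 2 × H100.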
  • International Safety Standard: First open model released with a model card addressing EU AI Act risk categories.
  • Benchmark: 405B scores 88.6 on MMLU and 51.1 on GPQA (0-shot), competing with GPT-4o at time of release.
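
The memory figures for serving can be checked with simple weights-only arithmetic. The sketch below counts only the bytes needed to hold the parameters (no KV cache, activations, or framework overhead), treats 1 GB as 10⁹ bytes, and assumes 80 GB accelerators. The helper names are hypothetical.

```python
import math


def weight_gb(n_params, bytes_per_param):
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_gpus_for_weights(n_params, bytes_per_param, gpu_gb=80):
    """Smallest GPU count whose combined memory holds the weights (lower bound)."""
    return math.ceil(weight_gb(n_params, bytes_per_param) / gpu_gb)


PARAMS_405B = 405_000_000_000
PARAMS_70B = 70_600_000_000

bf16_405b = weight_gb(PARAMS_405B, 2)                 # 16-bit weights
fp8_405b = weight_gb(PARAMS_405B, 1)                  # 8-bit weights
gpus_405b_bf16 = min_gpus_for_weights(PARAMS_405B, 2)
gpus_405b_fp8 = min_gpus_for_weights(PARAMS_405B, 1)
gpus_70b_bf16 = min_gpus_for_weights(PARAMS_70B, 2)
```

Under these assumptions the 405B weights need 810 GB in BF16 (at least 11 GPUs of 80 GB) and 405 GB in FP8 (at least 6), while the 70B weights need about 141 GB (at least 2). Real deployments need headroom beyond these lower bounds for the KV cache and activations.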

Architecture Diagram

Llama 3.1 Architecture (Jul 2024)
8B / 70B / 405B · 128K Context · RoPE θ=500K · Multilingual · Tool Use
Input Tokens (128,256 vocab)
↓
NEW
RoPE (θ=500,000), 128K context window
↓
Repeat × 32 (8B) / 80 (70B) / 126 (405B) Layers
RMSNorm
GQA (RoPE θ=500K)
8B: 32Q/8KV · 70B: 64Q/8KV · 405B: 128Q/8KV
↓ residual ↓
RMSNorm
SwiGLU FFN (Dense)
↓
Final RMSNorm → LM Head
↓
Base
Instruct (multi-round SFT + DPO)
NEW
Tool-Use / JSON Calling
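
One way to build intuition for why the RoPE base matters at long context is to look at the slowest-rotating dimension pair. The sketch below computes that pair's wavelength (the number of tokens per full rotation) for θ=10,000 and θ=500,000, assuming a head dimension of 128. This is a common heuristic reading, offered as an illustration rather than as the reasoning Meta used, and the function name is hypothetical.

```python
import math


def longest_rope_wavelength(theta, head_dim=128):
    """Tokens needed for the slowest RoPE pair to complete one full rotation."""
    # The last pair has index i = head_dim - 2, so its frequency is theta^(-i/head_dim).
    slowest_freq = theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_freq


CONTEXT_128K = 131_072

wavelength_10k = longest_rope_wavelength(10_000.0)
wavelength_500k = longest_rope_wavelength(500_000.0)

print(f"theta=10,000:  ~{wavelength_10k:,.0f} tokens")
print(f"theta=500,000: ~{wavelength_500k:,.0f} tokens")
```

With θ=10,000 the slowest pair wraps around after roughly 54K tokens, well short of a 131,072-token window, so distant positions start to alias. With θ=500,000 the wavelength is about 2.6M tokens, so every position in a 128K context still maps to a distinct angle in that pair.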

Community Perspective

Llama 3.1 405B became a reference point for the entire open-source LLM ecosystem, arguably the first model to make GPT-4-level performance accessible without API dependence for well-resourced organisations. The 128K context window unlocked document-level reasoning tasks that were previously exclusive to commercial APIs. The updated license explicitly permits using 405B outputs to improve other models, which enabled a new class of small-but-capable fine-tunes; community projects such as Hermes 3 and dozens of GGUF quantisations followed rapidly.

Model Variants

| Model | Parameters | Layers | Heads (Q/KV) | Hidden Dim | FFN Dim | Context |
|:------|:----------:|:------:|:------------:|:----------:|:-------:|:-------:|
| Llama 3.1 8B | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 131,072 |
| Llama 3.1 8B Instruct | 8.0B | 32 | 32/8 | 4,096 | 14,336 | 131,072 |
| Llama 3.1 70B | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 131,072 |
| Llama 3.1 70B Instruct | 70.6B | 80 | 64/8 | 8,192 | 28,672 | 131,072 |
| Llama 3.1 405B | 405B | 126 | 128/8 | 16,384 | 53,248 | 131,072 |
| Llama 3.1 405B Instruct | 405B | 126 | 128/8 | 16,384 | 53,248 | 131,072 |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 3.1 Used It |
|:----------|:-------|:----------------------|
| RoPE θ Scaling (θ=500K) | Chen et al., 2023; Su et al., 2021 | Extends effective context from 8K to 128K via frequency interpolation |
| Knowledge Distillation | Hinton et al., 2015 | 405B used as teacher; 8B and 70B Instruct improved via distillation |
| Multi-Stage Post-Training | Meta internal | Repeated rounds of rejection sampling, SFT, and DPO; each round refines the previous |
| Synthetic Instruction Data | Gunasekar et al., 2023 (phi-1) | Large-scale synthetic generations for multilingual and tool-use alignment |
| Function/Tool Calling | OpenAI API design | Structured JSON schemas; models fine-tuned to follow tool schemas |
| Near-Optimal Scaling | Hoffmann et al., 2022 | 15T tokens for 405B; compute-optimal relative to model size |

🔷 Llama 3.2 (September 2024)

📅 Sep 2024 | Meta AI Blog | Meta AI

Summary

  • Two Product Lines: Vision models (11B, 90B) and edge/mobile text models (1B, 3B).
  • Vision Architecture: Cross-attention adapter (Flamingo-style) attaching a ViT image encoder to the Llama 3.1 text backbone; cross-attention layers inserted after every fourth self-attention layer. The text backbone stays frozen during adapter training, while the image encoder is updated.
  • Image Encoder: ViT-H/14 backbone, image resolution 560 × 560 px; tiled to handle higher-resolution inputs.
  • Vision–Language Alignment: Linear projection of ViT patch embeddings followed by cross-attention; text decoder treats image features as external memory.
  • 11B Vision Model: Llama 3.1 8B text backbone + vision adapter layers; strong single-image understanding, OCR, charts.
  • 90B Vision Model: Llama 3.1 70B text backbone + vision adapter; multi-image reasoning, document analysis, visual QA.
  • 1B Edge Model: Obtained via structured pruning of Llama 3.1 8B followed by knowledge distillation; targets mobile NPUs and CPUs.
  • 3B Edge Model: Also pruned from Llama 3.1 8B, with logits from the 8B and 70B models used as distillation targets; on-device performance near 7B-class.
  • Quantisation: 1B and 3B released in INT4 SpinQuant and QAT quantised variants for mobile deployment (Apple silicon, Arm, Qualcomm).
  • On-Device Context: Edge models support 128K context with quantised KV cache; designed for real-time streaming inference.
  • Meta AI Connect: Announced at Meta Connect 2024 with live demo of on-device inference on Ray-Ban Meta glasses.
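
The knowledge-distillation step used for the edge models can be illustrated with a generic logit-matching loss. The sketch below computes the KL divergence between temperature-softened teacher and student distributions for a single token position. The temperature value, function names, and toy logits are assumptions made for illustration, since the source does not specify the exact loss Meta used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy vocabulary of three tokens (hypothetical logits).
teacher = [2.0, 1.0, 0.1]
student_close = [1.9, 1.1, 0.0]
student_far = [0.1, 1.0, 2.0]

loss_close = distillation_kl(teacher, student_close)
loss_far = distillation_kl(teacher, student_far)
```

A student whose logits track the teacher's gets a loss near zero, while one that ranks the tokens in the opposite order is penalised heavily. Training on this signal passes on the teacher's full output distribution, not only its top choice.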

Architecture Diagram

Llama 3.2 Architecture (Sep 2024)
Vision (11B/90B): Cross-Attention Adapter · Edge (1B/3B): Pruned + Distilled
🖼️ Vision Models (11B / 90B)
Image Input (560×560)
↓
NEW
ViT-H/14 Image Encoder (updated during adapter training)
↓ patch embeddings ↓
Linear Projection
↓
NEW
Cross-Attention Layers (every 4th)
↓
Llama 3.1 Text Backbone (8B/70B)
📱 Edge Models (1B / 3B)
Llama 3.1 8B / 70B (teacher)
↓ structured pruning
NEW
Pruned Decoder (fewer layers/heads)
↓ knowledge distillation
1B / 3B Student Model
↓
INT4 QAT / SpinQuant

Community Perspective

Llama 3.2 marked Meta’s entry into the on-device AI race. The 1B and 3B models demonstrated that a well-distilled small model could punch significantly above its weight class, reviving interest in model compression research. The vision models were Meta’s first open-weight multimodal release and offered an open alternative to GPT-4V, with the 90B competing on several VQA and document benchmarks. The Flamingo-style cross-attention adapter architecture became a popular blueprint for community vision extensions to other text-only models.

Model Variants

| Model | Parameters | Architecture | Input | Context | Notes |
|:------|:----------:|:-------------|:------|:-------:|:------|
| Llama 3.2 1B | 1.24B | Pruned decoder | Text | 128K | Mobile/edge; INT4 quantised variants |
| Llama 3.2 3B | 3.21B | Pruned decoder | Text | 128K | On-device; QAT quantised variants |
| Llama 3.2 11B | 11B | Llama 3.1 8B + Vision Adapter | Text + Image | 128K | Cross-attention, ViT-H/14 |
| Llama 3.2 11B Vision Instruct | 11B | Llama 3.1 8B + Vision Adapter | Text + Image | 128K | Instruction-tuned; OCR, VQA |
| Llama 3.2 90B | 90B | Llama 3.1 70B + Vision Adapter | Text + Image | 128K | Multi-image, document analysis |
| Llama 3.2 90B Vision Instruct | 90B | Llama 3.1 70B + Vision Adapter | Text + Image | 128K | Instruction-tuned; chart/figure QA |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 3.2 Used It |
|:----------|:-------|:----------------------|
| Cross-Attention Vision Adapter | Flamingo (Alayrac et al., 2022) | ViT features cross-attended after every 4th self-attention layer; text backbone kept frozen |
| ViT Image Encoder | Dosovitskiy et al., 2020 | ViT-H/14; 560×560 input with tiling for high-res |
| Structured Pruning | Michel et al., 2019; various | Layer/head/width pruning applied to Llama 3.1 8B to produce the 1B and 3B models |
| Knowledge Distillation | Hinton et al., 2015 | Teacher–student training with Llama 3.1 as teacher for edge models |
| SpinQuant / QAT | Meta internal, 2024 | INT4 quantisation-aware training for mobile deployment |
| LLM-in-a-glasses form factor | Meta Connect 2024 | Demonstrated on Ray-Ban Meta smart glasses for real-time inference |

🟫 Llama 3.3 (December 2024)

📅 Dec 2024 | Meta AI Blog | Meta AI

Summary

  • Single Model Release: Only one model size (70B); same architectural specification as Llama 3.1 70B.
  • Enhanced Post-Training: Entirely revised SFT dataset, new DPO preference pairs, and extended PPO reward modelling; no architectural changes.
  • Math & Reasoning Gains: MATH benchmark improvement to ~77 (vs Llama 3.1 70B ~65); structured chain-of-thought data incorporated.
  • Coding Improvements: HumanEval rises to 88.4 (vs 80.5 for 3.1 70B Instruct); new code-specific SFT and DPO data.
  • Instruction Following: Significantly improved IFEval scores; better adherence to format constraints (JSON, markdown, lists).
  • Multilingual Retention: All 8 languages from 3.1 maintained; additional data for Hindi and Thai in post-training.
  • Cost–Performance Target: Designed to offer Llama 3.1 405B-class performance at 70B inference cost; Meta claims parity on many benchmarks.
  • Drop-in Replacement: Compatible with Llama 3.1 70B serving infrastructure; same tokeniser, same context length, same tool-call schema.
  • Safety Updates: Updated Llama Guard 3 and Prompt Guard released simultaneously; improved refusal on newer jailbreak patterns.
  • Research Insight: Demonstrates that high-quality post-training data can yield substantial gains without any pretraining compute.
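
Since DPO preference pairs are central to this release, a small sketch of the DPO objective for a single (chosen, rejected) pair may help. It follows the loss published by Rafailov et al. (2023). The log-probability values are made up for illustration, and β=0.1 is an assumed setting, not a value reported for Llama 3.3.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair of sequence log-probs."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    z = -margin
    # Stable softplus(z) = log(1 + exp(z)), which equals -log(sigmoid(margin)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical summed log-probabilities of each response under the two models.
loss_at_reference = dpo_loss(-10.0, -12.0, ref_chosen=-10.0, ref_rejected=-12.0)
loss_after_update = dpo_loss(-8.0, -14.0, ref_chosen=-10.0, ref_rejected=-12.0)
```

When the policy equals the reference model the loss is log 2 (about 0.693). It falls as the policy raises the chosen response's likelihood and lowers the rejected one's relative to the reference, which is how preference data shapes the model without a separate reward model at training time.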

Architecture Diagram

Llama 3.3 Architecture (Dec 2024)
70B only · Identical to Llama 3.1 70B · Enhanced Post-Training Pipeline
Llama 3.1 70B Base Architecture (unchanged)
80 layers · 64Q/8KV GQA · RoPE θ=500K · 128K context · 128K vocab
↓
Enhanced Post-Training Pipeline
Stage 1: Revised SFT (Math + Code + IF)
→ Stage 2: Rejection Sampling (per capability)
→ Stage 3: DPO (new pairs) + PPO
↓
IMPROVED
Llama 3.3 70B Instruct
Drop-in replacement for Llama 3.1 70B Instruct · Same tool schemas · Same serving infrastructure

Community Perspective

Llama 3.3 illustrated a key principle increasingly understood in the field: post-training data quality can deliver disproportionate gains relative to compute. By releasing a single improved 70B model, Meta offered organisations a free upgrade path from 3.1 70B with no infrastructure changes. The model quickly became the default recommendation for cost-sensitive deployments, with benchmarks showing it frequently matching or exceeding Llama 3.1 405B on reasoning and coding tasks at a fraction of the serving cost. It also sparked renewed interest in post-training research as an alternative to scaling pretraining compute.

Model Variants

| Model | Parameters | Architecture | Context | Key Improvements |
|:------|:----------:|:-------------|:-------:|:-----------------|
| Llama 3.3 70B | 70.6B | Identical to Llama 3.1 70B base | 128K | Better math/reasoning datasets |
| Llama 3.3 70B Instruct | 70.6B | Llama 3.1 70B base + new post-training | 128K | HumanEval 88.4, MATH ~77, IFEval gains |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 3.3 Used It |
|:----------|:-------|:----------------------|
| High-Quality SFT Curation | LIMA (Zhou et al., 2023) | "Less is more" principle applied; curated math/code/IF examples over volume |
| Process Reward Models (PRMs) | Lightman et al., 2023 | Step-level math reward signals incorporated in rejection sampling |
| DPO with Fresh Preference Data | Rafailov et al., 2023 | New human preference labels targeting 3.1 70B failure modes |
| Chain-of-Thought Distillation | Wei et al., 2022 | CoT traces from 405B used to supervise 70B on reasoning |
| IFEval-targeted Training | Zhou et al., 2023 | Explicit instruction-following format tasks in SFT and DPO |
| Capability-specific Rejection Sampling | Meta internal | Separate RS pools for math, code, multilingual, and safety |

🔴 Llama 4 (April 2025)

📅 Apr 2025 | Meta AI Blog | Meta AI

Summary

  • Architecture Paradigm Shift: Native early-fusion multimodal Mixture-of-Experts (MoE); abandons the cross-attention adapter approach of Llama 3.2.
  • Early Fusion: Image and text tokens are processed together from the first layer of a single backbone; a MetaCLIP-based vision encoder converts images into tokens that join the text sequence, instead of feeding a separate cross-attention pathway.
  • MoE FFN: Sparse expert routing replaces the dense SwiGLU FFN; only a subset of experts activated per token, dramatically reducing active parameters vs total parameters.
  • iRoPE Positional Encoding: "interleaved RoPE", in which some attention layers use no positional encoding (NoPE) and are interleaved with layers that use standard RoPE; the "i" also nods to the long-term goal of "infinite" context, and the design is intended to support extrapolation to very long sequences.
  • Scout (109B total, 17B active): 16 experts; 10M token context window; fits on a single H100 with Int4 quantisation; vision + text.
  • Maverick (400B total, 17B active): 128 routed experts plus a shared expert; 1M context; full multimodal; strong benchmark performance; fits on a single 8× H100 host.
  • Behemoth (~2T total, 288B active): 16 experts; training/teacher model; used for distillation; 16K+ context; not released in April 2025.
  • MetaP Hyperparameter Tuning: New technique for systematic hyperparameter transfer across model scales; stabilises training of very large MoE models.
  • Training Scale: 30T+ tokens on multimodal data; largest pretraining budget in the Llama series.
  • Omni Capability: Scout and Maverick handle image, video, and text inputs; text-only output at release; voice planned.
  • Benchmark: Maverick achieves 80.5 on MMLU-Pro and competitive results on GPQA Diamond; Scout’s 10M context enables retrieval over entire codebases.
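
The sparse-routing idea behind the MoE FFN can be sketched with a generic top-k router. This is the textbook mechanism only: it does not model Llama 4 specifics such as a shared expert or the exact gating function. The toy "experts" are simple scaling functions standing in for SwiGLU networks, and all names are hypothetical.

```python
import math


def top_k_route(router_logits, k=1):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_ffn(x, experts, router_logits, k=1):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k).items():
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Four toy experts (hypothetical): expert s multiplies its input by s.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 0.5]  # would come from a learned router layer

moe_out = moe_ffn([1.0, 2.0], experts, router_logits, k=1)

# Share of parameters active per token, from the totals quoted above.
scout_active = 17 / 109
maverick_active = 17 / 400
```

Only the selected experts execute for each token, which is why Scout and Maverick both run about 17B parameters per token despite holding 109B and 400B in total (roughly 16% and 4% active, respectively).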

Architecture Diagram

Llama 4 Architecture (Apr 2025)
Scout (17B active / 109B total) · Maverick (17B active / 400B total) · Early-Fusion MoE · iRoPE
Text Tokens
NEW
Visual Tokens (Early Fusion)
↓ unified tokenisation ↓
NEW
iRoPE Positional Encoding (interleaved)
Alternating: NoPE layers (no position) + RoPE layers (full position) → enables 10M context
↓
Repeat × N Layers
RMSNorm (pre)
GQA Self-Attention
iRoPE: NoPE layers interleaved with RoPE layers
Scout: 40Q/8KV · Maverick: larger config
↓ residual ↓
RMSNorm (pre)
NEW: MoE FFN
Sparse MoE (SwiGLU Experts)
Scout: 16 experts (top-k routing)
Maverick: 128 experts (top-k routing)
Behemoth: 16 experts, 288B active
↓
Final RMSNorm β†’ LM Head
↓
Scout (10M ctx, 1×H100)
Maverick (1M ctx, 8×H100)
Behemoth (Teacher, ~2T total)

Community Perspective

Llama 4 represents the most architecturally ambitious Llama release to date. The shift from cross-attention vision adapters to native early-fusion MoE aligns Meta’s approach with GPT-4o and Gemini Ultra in treating multimodality as a first-class concern rather than an add-on. Scout’s 10M-token context window is unprecedented in an open-weight model and opens entirely new application categories: whole-repository code analysis, book-length summarisation, and long-horizon agentic tasks. The iRoPE position encoding scheme attracted significant academic interest as a practical solution to the context-length extrapolation problem. The MetaP hyperparameter transfer technique addresses one of the key engineering pain points in training very large MoE models reliably.

Model Variants

| Model | Total Params | Active Params | Experts | Context | Modality | Notes |
|:------|:------------:|:-------------:|:-------:|:-------:|:---------|:------|
| Llama 4 Scout | 109B | 17B | 16 | 10M | Text + Image | Single H100 80GB deployment |
| Llama 4 Scout Instruct | 109B | 17B | 16 | 10M | Text + Image | Instruction-tuned |
| Llama 4 Maverick | 400B | 17B | 128 | 1M | Text + Image | ~8× H100 required |
| Llama 4 Maverick Instruct | 400B | 17B | 128 | 1M | Text + Image | Instruction-tuned; MMLU-Pro 80.5 |
| Llama 4 Behemoth | ~2T | 288B | 16 | 16K+ | Text | Teacher model; not released Apr 2025 |

Key Industry Ideas Incorporated

| Technique | Origin | How Llama 4 Used It |
|:----------|:-------|:--------------------|
| Sparse MoE (top-k routing) | Shazeer et al., 2017; Switch Transformer | Each token routes to top-k of N experts in FFN; ~17B active per token |
| Early Fusion Multimodal | Gemini (Google, 2023); Chameleon (Meta, 2024) | Visual and text tokens processed jointly from layer 0; no cross-attention adapter |
| iRoPE (interleaved NoPE + RoPE) | Meta internal; inspired by YaRN, LongRoPE | Alternating position-free and RoPE layers enable 10M+ context extrapolation |
| MetaP Hyperparameter Transfer | Meta internal, 2025 | μP-inspired framework for transferring LR/batch-size across MoE scales |
| Load Balancing Loss (MoE) | Lepikhin et al., 2020; Fedus et al., 2021 | Auxiliary loss to prevent expert collapse during training |
| 30T+ Token Pretraining | Meta internal | Multimodal web-scale corpus; largest single pretraining run in the series |
| Distillation from Behemoth | Hinton et al., 2015 | Scout and Maverick post-training improved via Behemoth teacher |

📚 References

Technical Papers

| Paper | Authors | Year | Topic |
|:------|:--------|:----:|:------|
| LLaMA: Open and Efficient Foundation Language Models | Touvron et al. | 2023 | Llama 1 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Touvron et al. | 2023 | Llama 2 |
| The Llama 3 Herd of Models | Meta AI | 2024 | Llama 3 / 3.1 |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | Su et al. | 2021 | RoPE |
| Root Mean Square Layer Normalization | Zhang & Sennrich | 2019 | RMSNorm |
| GLU Variants Improve Transformer | Shazeer | 2020 | SwiGLU |
| GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Ainslie et al. | 2023 | GQA |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | 2023 | FlashAttention 2 |
| Training Language Models to Follow Instructions with Human Feedback | Ouyang et al. | 2022 | RLHF / InstructGPT |
| Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Rafailov et al. | 2023 | DPO |
| Flamingo: a Visual Language Model for Few-Shot Learning | Alayrac et al. | 2022 | Cross-Attention Vision |
| Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | Shazeer et al. | 2017 | MoE |

Official Blog Posts

| Post | Date | Topic |
|:-----|:----:|:------|
| Introducing LLaMA: A foundational, 65-billion-parameter language model | Feb 2023 | Llama 1 announcement |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Jul 2023 | Llama 2 release |
| Meta Llama 3 | Apr 2024 | Llama 3 announcement |
| Llama 3.1: Our most capable models to date | Jul 2024 | Llama 3.1 / 405B |
| Llama 3.2: Revolutionizing edge AI and vision | Sep 2024 | Vision + edge models |
| Llama 3.3: New 70B model with improved performance | Dec 2024 | Llama 3.3 70B |
| Llama 4: The next generation of open foundation models | Apr 2025 | Llama 4 Scout/Maverick |

GitHub Repositories

| Repository | Description |
|:-----------|:------------|
| meta-llama/llama | Official Llama 1 weights and inference code |
| meta-llama/llama2 | Official Llama 2 repository |
| meta-llama/llama3 | Official Llama 3 tokeniser and model card |
| meta-llama/llama-models | Unified Llama 3.x model cards and configs |
| ggerganov/llama.cpp | C++ inference engine; GGUF quantised formats |
| huggingface/transformers | HF integration for all Llama variants |
| vllm-project/vllm | High-throughput inference with continuous batching |
| meta-llama/PurpleLlama | Llama Guard, Code Shield, Prompt Guard safety tools |

Cited Techniques

| Technique | Reference | Used In |
|:----------|:----------|:--------|
| Rotary Position Embeddings (RoPE) | Su et al., arXiv:2104.09864 | Llama 1–4 |
| RMSNorm | Zhang & Sennrich, NeurIPS 2019 | Llama 1–4 |
| SwiGLU | Shazeer, arXiv:2002.05202 | Llama 1–4 |
| Grouped-Query Attention | Ainslie et al., arXiv:2305.13245 | Llama 2 (34B/70B), 3–4 |
| FlashAttention 2 | Dao, arXiv:2307.08691 | Llama 3–4 |
| tiktoken BPE | OpenAI, 2022 | Llama 3–4 |
| RLHF / PPO | Stiennon et al.; Ouyang et al. | Llama 2–3.3 |
| DPO | Rafailov et al., arXiv:2305.18290 | Llama 3–3.3 |
| Knowledge Distillation | Hinton et al., arXiv:1503.02531 | Llama 3.1, 3.2, 4 |
| Ghost Attention | Touvron et al., arXiv:2307.09288 | Llama 2 Chat |
| Sparse MoE | Shazeer et al., arXiv:1701.06538 | Llama 4 |
| Early-Fusion Multimodal | Chameleon (Meta); Gemini | Llama 4 |
| iRoPE | Meta internal, 2025 | Llama 4 |
| SpinQuant / QAT | Meta internal, 2024 | Llama 3.2 edge |
| Structured Pruning | Michel et al., 2019 | Llama 3.2 1B/3B |

This document covers the Llama model family from Llama 1 (Feb 2023) through Llama 4 (Apr 2025).
All benchmark figures are reported as published; instruct vs base distinctions apply where noted.
Maintained for educational and research reference purposes.
