🟣 Qwen: Model Architecture Across Generations

Versions: 5 | Team: Alibaba Cloud | Updated: March 2026

From a competitive Chinese LLM to a global open-weight powerhouse: tracing 5 generations of architecture evolution.



📋 Executive Summary

This document covers five generations of the Qwen large language model family developed by Alibaba Cloud's Qwen Team:

  • Qwen 1: the foundation. SwiGLU, RoPE, RMSNorm, 151K BPE vocab, 3T tokens
  • Qwen 1.5: scale-out. 8 dense sizes (0.5B to 110B) + first MoE model, 32K context, HF transformers native
  • Qwen 2: architecture leap. GQA, DCA+YARN for 128K context, MoE 57B-A14B, 7T tokens, 30 languages
  • Qwen 2.5: data scaling. 18T tokens, new 3B/14B/32B sizes, structured output, 8K generation
  • Qwen 3: reasoning era. Hybrid think/non-think modes, 36T tokens, 119 languages, 4-stage post-training

📝 Note: The Qwen family also includes specialized variants, namely Qwen-VL (vision-language), Qwen-Audio, Qwen-Coder, and Qwen-Math, which are documented separately. This document focuses on the core LLM architecture.


📅 Version Release Timeline

| Version | Release Date | Paper / Blog | Flagship Size | Training Tokens | Context Length | Headline Feature |
|:-------:|:----------:|:------------|:------------:|:--------------:|:------------:|:----------------|
| Qwen 1 | Sep 28, 2023 | [arXiv:2309.16609](https://arxiv.org/abs/2309.16609) | 72B | 3T | 8K (32K ext.) | First competitive Alibaba LLM |
| Qwen 1.5 | Feb 4, 2024 | [Blog](https://qwenlm.github.io/blog/qwen1.5/) | 72B + MoE-A2.7B | ~3T | 32K | HF-native + first MoE |
| Qwen 2 | Jul 15, 2024 | [arXiv:2407.10671](https://arxiv.org/abs/2407.10671) | 72B + MoE-57B-A14B | 7T | 128K | GQA + DCA/YARN + MoE |
| Qwen 2.5 | Sep 19, 2024 | [arXiv:2412.15115](https://arxiv.org/abs/2412.15115) | 72B | 18T | 128K | Data scaling + structured output |
| Qwen 3 | Apr 29, 2025 | [arXiv:2505.09388](https://arxiv.org/abs/2505.09388) | 235B-A22B | 36T | 128K | Hybrid thinking + 119 languages |

📊 Cross-Version Benchmark Comparison

All numbers are for the flagship base model of each generation (largest dense model). Sources: official technical papers.

| Benchmark | Qwen 1 (72B) | Qwen 1.5 (72B) | Qwen 2 (72B) | Qwen 2.5 (72B) | Qwen 3 (32B) |
|:----------|:------------:|:--------------:|:------------:|:--------------:|:------------:|
| MMLU | 74.5 | 77.5 | 84.2 | 86.1 | ~83* |
| HumanEval | 37.2 | 41.5 | 64.6 | 59.1 | ~65* |
| MATH | 17.4 | 34.1 | 51.1 | 62.1 | ~68* |
| GSM8K | 78.9 | 79.5 | 89.5 | 91.5 | ~92* |
| BBH | 67.4 | 65.5 | 82.4 | 86.3 | ~85* |
| Context Length | 8K (32K ext.) | 32K | 128K | 128K | 128K |
| Languages | 2 (en/zh) | ~12 | ~30 | ~29 | 119 |
| Training Tokens | 3T | ~3T | 7T | 18T | 36T |
| Vocabulary | 151,646 | 151,646 | 151,646 | 151,646 | 151,646 |

*Qwen 3 figures are approximate values for the largest dense model (Qwen3-32B-Base), taken from the pre-training results in the Qwen 3 technical report. Exact numbers may vary.


๐Ÿ—๏ธ Master Architecture Diagram

This diagram shows the core Transformer decoder architecture shared across all Qwen versions, with color-coded annotations indicating which generation introduced each component.

Qwen Architecture โ€” Component Evolution
๐ŸŸฃ Qwen 1 ๐ŸŸช Qwen 1.5 ๐Ÿ”ต Qwen 2 ๐Ÿ”ท Qwen 2.5 ๐ŸŸข Qwen 3
Token Embedding โ€” 151,646 vocab, byte-level BPE
Introduced in Qwen 1 โ€ข Shared across all versions
โ†“
ร— N TRANSFORMER LAYERS
RMSNorm (Pre-Normalization) โ€” Qwen 1
Self-Attention Block
RoPE QKV Bias MHA โ†’ used until 1.5
GQA replaces MHA DCA + YARN (128K ctx)
+ Residual Connection
RMSNorm (Pre-Normalization) โ€” Qwen 1
Feed-Forward Block
SwiGLU Activation Dense FFN โ†’ replaced in MoE variants
MoE FFN (first in 1.5, 64 experts)
Fine-grained Experts + Shared Experts MoE 128 routed + 8 shared
+ Residual Connection
โ†“
LM Head โ€” Next-token prediction
Structured Output / JSON (2.5+) Think/No-Think modes (3)

🟣 Qwen 1 (September 2023)

📅 Released: September 28, 2023  |  📄 arXiv:2309.16609

Summary

  • First large-scale open-weight LLM from Alibaba Cloud, establishing the Qwen brand in the competitive Chinese LLM landscape alongside Yi, Baichuan, and ChatGLM
  • Built on a standard decoder-only Transformer architecture with causal attention masks, following the paradigm set by GPT and LLaMA
  • Adopted SwiGLU (Swish-Gated Linear Unit) activation in FFN layers, as in PaLM and LLaMA, replacing the conventional ReLU/GELU for smoother gradients and better performance
  • Used Rotary Positional Embeddings (RoPE) to encode position, which generalizes to longer sequences better than absolute positional embeddings
  • Applied RMSNorm with pre-normalization for improved training stability, a technique popularized by LLaMA and now standard practice
  • Added QKV bias in attention layers, an uncommon design choice at the time that has been reported to help RoPE-based length extrapolation
  • Trained on 3 trillion tokens of multilingual data (primarily English and Chinese) using a byte-level BPE tokenizer with a 151,646-entry vocabulary, one of the largest at the time and designed for strong multilingual compression
  • Released in four sizes (1.8B, 7B, 14B, and 72B parameters), using standard Multi-Head Attention (MHA) in all layers
  • Included specialized variants: Qwen-Chat (aligned via SFT + RLHF), Code-Qwen, and Math-Qwen, demonstrating a full-stack approach from day one
  • Supported 8K context natively, with NTK-aware interpolation (a community technique for dynamic RoPE length extrapolation) extending it to 32K
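
To make two of the building blocks above concrete, here is a minimal sketch of RMSNorm and a SwiGLU feed-forward layer in plain Python. The toy sizes and weight values are invented for illustration, and the function names are not taken from any Qwen code; real implementations operate on batched tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy sizes (assumed): model width 2, FFN width 3.
x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
out = swiglu_ffn(normed, w_gate, w_up, w_down)
```

Unlike LayerNorm, RMSNorm skips mean subtraction and the bias term, which is where the speed advantage comes from. The gated product in SwiGLU is why the layer carries three weight matrices instead of two.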

Architecture Diagram: Qwen 1

Qwen 1: Transformer Decoder Block
Input Embeddings (151,646 vocab) + RoPE
↓
Multi-Head Attention (MHA)
RoPE Positional | QKV Bias | Causal Mask | 72B: 64 heads, d=128
+ Residual → RMSNorm
↓
Feed-Forward Network (Dense)
SwiGLU Activation | FFN width 2/3 of the usual 4 × hidden size (8/3 × hidden) | 72B: 49,152 intermediate
+ Residual → RMSNorm
↓
LM Head → Next Token Prediction
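
The RoPE step in the attention block can be sketched as follows. This is an illustrative plain-Python version that rotates adjacent pairs of dimensions; production implementations often pair dimension i with i + d/2 instead, and the vectors here are invented toy values.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # frequency falls as i grows
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.4, 0.9, 0.1]
# The same relative offset (3) at different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

The two scores agree up to floating-point error, which is the property that makes RoPE a relative position encoding: the attention logit depends only on the distance between query and key.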

Community Perspective

  • Received strong reception in the Chinese AI community as a capable alternative to LLaMA for Chinese-English tasks
  • The 72B model demonstrated that Chinese AI labs could produce GPT-3.5-competitive models
  • The large 151K vocabulary was praised for efficient multilingual tokenization; many competitors used smaller vocabs
  • Tool-use and code interpreter capabilities in Qwen-Chat were ahead of most open-source alternatives at launch
  • Some concern about training data transparency compared to fully open models like LLaMA

Model Variants

| Model | Parameters | Layers | Heads (Q/KV) | Context | Embedding Tying |
|:------|:----------:|:------:|:------------:|:-------:|:---------------:|
| Qwen-7B | 7.7B | 32 | 32 / 32 | 8K (32K) | No |
| Qwen-72B | 72B | 80 | 64 / 64 | 8K (32K) | No |
| Qwen-7B-Chat | 7.7B | 32 | 32 / 32 | 8K (32K) | No |
| Qwen-72B-Chat | 72B | 80 | 64 / 64 | 8K (32K) | No |

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 1 Used It |
|:----------|:-------|:-------------------|
| SwiGLU | PaLM (Google, 2022) | FFN activation function replacing GELU |
| RoPE | Su et al. (RoFormer, 2021) | Positional encoding for all attention layers |
| RMSNorm | Zhang & Sennrich (2019); pre-norm variant per Jiang et al. (2023) | Replaced LayerNorm for faster, stabler training |
| BPE Tokenizer | Sennrich et al. (2015) | Byte-level BPE with 151K vocab for multilingual |
| NTK-aware Interpolation | Reddit/community (2023) | Dynamic RoPE scaling for context extension |
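
As a rough illustration of NTK-aware interpolation, the sketch below uses the commonly cited community formula, which enlarges the RoPE base by `scale ** (d / (d - 2))`. It applies a fixed scale factor; the dynamic variant mentioned in the table recomputes the factor from the current sequence length. The head dimension of 128 and the 4× factor (8K to 32K) are assumptions for the example, not values confirmed for any specific Qwen checkpoint.

```python
def ntk_scaled_base(base, scale, head_dim):
    """Enlarge the RoPE base so the lowest frequency stretches by `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))

def inv_frequencies(base, head_dim):
    """Per-pair rotation frequencies, highest first."""
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]

head_dim = 128   # assumed per-head width
scale = 4.0      # assumed extension factor, e.g. 8K -> 32K
orig = inv_frequencies(10000.0, head_dim)
scaled = inv_frequencies(ntk_scaled_base(10000.0, scale, head_dim), head_dim)
```

The highest frequency is left untouched while the lowest is divided by exactly the scale factor, so short-range resolution is preserved and only the slow-rotating dimensions are stretched to cover the longer window.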

🟪 Qwen 1.5 (February 2024)

📅 Released: February 4, 2024  |  📄 Blog Post

Summary

  • Incremental refinement rather than architecture overhaul, focused on improving base model quality and greatly improving the developer experience
  • Expanded to 8 dense model sizes: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B; the 110B was the first 100B+ model in the Qwen family
  • Introduced the first Qwen MoE model: Qwen1.5-MoE-A2.7B with 14.3B total parameters and 2.7B activated, reaching 7B-class performance with about 1/3 of the activated parameters
  • Architecture largely identical to Qwen 1 (SwiGLU, RoPE, RMSNorm, QKV bias, and MHA in most sizes; the later 32B and 110B releases adopted GQA); improvements came from better data, longer training, and alignment techniques
  • Uniform 32K context across all model sizes, up from the 8K default of Qwen 1, achieved through RoPE frequency adjustments
  • Native Hugging Face transformers integration: trust_remote_code=True is no longer needed with transformers>=4.37.0, which simplifies deployment
  • Alignment enhanced with DPO (Direct Preference Optimization) and PPO (Proximal Policy Optimization), producing significantly better chat models
  • Multilingual capabilities expanded to ~12 languages with structured evaluation on Arabic, Spanish, French, Japanese, Korean, Thai, Vietnamese, and more
  • MoE architecture used 64 fine-grained experts: 4 shared + 60 routed (4 activated per token), inspired by DeepSeek-MoE's fine-grained expert design
  • Upcycling initialization for MoE: started from Qwen-1.8B weights and transformed them into an MoE structure with randomized initialization for diversity, reducing training cost by 75% vs. training from scratch
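
The shared-plus-routed expert layout described above can be sketched as below. This is a toy illustration with scalar "experts", not the released implementation; in particular, whether the top-k weights are renormalized after selection is an assumption left out here.

```python
import math

NUM_ROUTED, NUM_SHARED, TOP_K = 60, 4, 4

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]

def moe_layer(x, router_logits, routed_experts, shared_experts):
    """Sum the always-on shared experts and the weighted top-k routed experts."""
    out = sum(expert(x) for expert in shared_experts)
    for idx, weight in route(router_logits):
        out += weight * routed_experts[idx](x)
    return out

# Toy scalar experts (hypothetical): routed expert i multiplies its input by i + 1.
routed = [lambda x, s=i + 1: s * x for i in range(NUM_ROUTED)]
shared = [lambda x: x for _ in range(NUM_SHARED)]
logits = [0.0] * NUM_ROUTED
for i in (3, 17, 42, 59):
    logits[i] = 5.0
chosen = route(logits)
y = moe_layer(1.0, logits, routed, shared)
```

Only 4 of the 60 routed experts run for a given token, which is why the activated parameter count (2.7B) is a small fraction of the total (14.3B).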

Architecture Diagram: Qwen 1.5 MoE

Qwen 1.5-MoE: First MoE Architecture
Input Embeddings (151,646 vocab) + RoPE (32K context)
↓
Multi-Head Attention (MHA), inherited
Same as Qwen 1: RoPE + QKV Bias + Causal Mask
+ Residual → RMSNorm
↓
✨ NEW: MoE Feed-Forward Network
🔀 Gated Router: softmax → top-4 selection from 60 routed experts
  • 4 Shared Experts: always activated
  • 60 Routed Experts (fine-grained): 4 activated per token → SwiGLU each
Total: 14.3B params | Active: 2.7B params | Active non-embedding: 2.0B params
+ Residual → RMSNorm
↓
LM Head → Next Token Prediction

Community Perspective

  • Widely praised for the developer experience overhaul; HF-native support was a game-changer for adoption
  • The MoE model (A2.7B) surprised many by matching Mistral-7B and Qwen1.5-7B while being 1/3 the activated size
  • The 110B model was seen as a statement of scale ambition, though it didn't get as much adoption as the 72B
  • Strong reception for the expanded size lineup โ€” the 0.5B and 1.8B models enabled edge/mobile deployment
  • Criticism: the architecture was largely unchanged from Qwen 1, so improvements felt incremental

Model Variants

| Model | Total Params | Active Params | Layers | Context | Notes |
|:------|:------------:|:-------------:|:------:|:-------:|:------|
| Qwen1.5-0.5B | 0.5B | 0.5B | 24 | 32K | Embedding tying |
| Qwen1.5-1.8B | 1.8B | 1.8B | 24 | 32K | Embedding tying |
| Qwen1.5-4B | 4B | 4B | 40 | 32K | - |
| Qwen1.5-7B | 7.7B | 7.7B | 32 | 32K | - |
| Qwen1.5-14B | 14B | 14B | 40 | 32K | - |
| Qwen1.5-32B | 32B | 32B | 64 | 32K | - |
| Qwen1.5-72B | 72B | 72B | 80 | 32K | - |
| Qwen1.5-110B | 110B | 110B | 80 | 32K | First 100B+ Qwen |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 24 | 32K | 64 experts, 4 shared |

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 1.5 Used It |
|:----------|:-------|:-------------------|
| Fine-grained MoE Experts | DeepSeek-MoE (Jan 2024) | 64 fine-grained experts instead of 8 coarse experts |
| Shared + Routed Experts | DeepSeek-MoE, Rajbhandari et al. (2022) | 4 shared experts always active alongside routed ones |
| Upcycling | Komatsuzaki et al. (2023) | Initialize MoE from dense model weights |
| DPO | Rafailov et al. (2023) | Direct preference optimization for alignment |
| PPO | Schulman et al. (2017) | Proximal policy optimization for RLHF |

🔵 Qwen 2 (July 2024)

📅 Released: July 15, 2024  |  📄 arXiv:2407.10671

Summary

  • Major architecture upgrade: the most significant changes since Qwen's inception, introducing new attention and positional mechanisms
  • Grouped Query Attention (GQA) replaced MHA across all models, cutting KV cache memory during inference by the query-to-KV head ratio (6× to 8×) while maintaining quality
  • Dual Chunk Attention (DCA) + YARN enabled 128K context by segmenting long sequences into manageable chunks and rescaling attention weights
  • Released in 5 model sizes: 0.5B, 1.5B, 7B, 72B (dense) + 57B-A14B (MoE); the MoE model had 57B total parameters with 14B active per token
  • MoE architecture advanced significantly: fine-grained experts with smaller expert size, shared + routed experts (8 shared + 64 routed, 8 activated), upcycled from Qwen2-7B
  • Training data scaled to 7 trillion tokens (from 3T) with substantially more code, math, and multilingual content, supporting ~30 languages
  • Smaller models (0.5B, 1.5B) used embedding tying and were trained on 12T and 7T tokens respectively, i.e. more tokens per parameter than the larger models
  • Post-training involved SFT with 500K+ examples followed by both offline DPO and online RLHF with a reward model, the most sophisticated alignment pipeline in the Qwen family at the time
  • Online Merging Optimizer was used to mitigate the alignment tax, reducing performance degradation from RLHF
  • RoPE base frequency increased from 10,000 to 1,000,000 in the long-context training phase, enabling much longer effective sequence lengths
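
One way to see why the larger RoPE base matters is to compute the wavelength of the slowest-rotating dimension pair, which bounds how far apart two positions can be before their encoding wraps around. The sketch below assumes a head dimension of 128 and the standard RoPE frequency schedule; it is a back-of-the-envelope check, not a statement about how the Qwen team chose the value.

```python
import math

def longest_wavelength(base, head_dim):
    """Wavelength (in tokens) of the slowest-rotating RoPE pair."""
    lowest_freq = base ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / lowest_freq

head_dim = 128  # assumed per-head width
short = longest_wavelength(10_000, head_dim)
long = longest_wavelength(1_000_000, head_dim)
target_context = 128 * 1024
```

With the original base, the longest wavelength falls well short of a 128K window, while the enlarged base comfortably exceeds it.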

Architecture Diagram: Qwen 2

Qwen 2: Key Architecture Changes from Qwen 1/1.5

| ❌ Removed (Qwen 1/1.5) | ✅ Added (Qwen 2) |
|:------------------------|:------------------|
| Multi-Head Attention | Grouped Query Attention |
| 8K/32K context | 128K context (DCA + YARN) |
| RoPE base freq = 10,000 | RoPE base freq = 1,000,000 |
| 3T training tokens | 7T training tokens |

✨ UPGRADED MoE
Qwen2-57B-A14B MoE
57B total → 14B active | 8 shared experts | 64 routed experts | top-8 routing
Upcycled from Qwen2-7B | Expert intermediate size: 2,560 | Shuffled + 50% re-init for diversity

GQA Head Configurations

| Model | Q Heads | KV Heads | GQA Ratio |
|:------|:-------:|:--------:|:---------:|
| 0.5B | 14 | 2 | 7:1 |
| 1.5B | 12 | 2 | 6:1 |
| 7B | 28 | 4 | 7:1 |
| 72B | 64 | 8 | 8:1 |
| 57B-A14B (MoE) | 28 | 4 | 7:1 |
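
The memory effect of these head ratios can be worked out directly. The sketch below uses the Qwen2-72B figures from this document (80 layers, 64 query heads, 8 KV heads, hidden size 8,192, hence a head dimension of 128); the 16-bit cache precision and the 32K sequence length are assumptions chosen for the example.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes for the keys and values of one sequence (leading 2 = K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for(query_head, q_heads, kv_heads):
    """Index of the KV head shared by this query head under GQA."""
    group = q_heads // kv_heads
    return query_head // group

seq_len = 32_768                              # assumed sequence length
mha = kv_cache_bytes(80, 64, 128, seq_len)    # hypothetical MHA variant
gqa = kv_cache_bytes(80, 8, 128, seq_len)     # GQA as listed in the table
gib = 2 ** 30
```

Under these assumptions the cache shrinks from 80 GiB to 10 GiB per sequence, matching the 8:1 ratio in the table. `kv_head_for` shows the grouping: query heads 0 to 7 read KV head 0, heads 8 to 15 read KV head 1, and so on.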

Official Paper Figure

Needle in a Haystack test results for Qwen2 instruction-tuned models showing capability across 128K context:

Qwen2 Needle in a Haystack

Source: Qwen2 Technical Report (arXiv:2407.10671), Figure 1

Community Perspective

  • GQA adoption was welcomed as overdue; competitors like LLaMA 2 (70B) had already adopted it for KV cache efficiency
  • The 128K context via DCA+YARN was a major selling point, though real-world performance degraded at extreme lengths
  • The 57B-A14B MoE model showed that Qwen's MoE expertise had matured; fine-grained experts were more efficient than Mixtral's coarse approach
  • Qwen2-72B's competitiveness with LLaMA-3-70B established Qwen as a top-tier global open-weight model, not just a Chinese alternative
  • The 7T token dataset with 30-language support marked Qwen's transition from a bilingual to a truly multilingual model family

Model Variants

| Model | Total Params | Hidden | Layers | Q Heads / KV Heads | Context | Tokens |
|:------|:------------:|:------:|:------:|:------------------:|:-------:|:------:|
| Qwen2-0.5B | 0.5B | 896 | 24 | 14 / 2 | 32K | 12T |
| Qwen2-1.5B | 1.5B | 1,536 | 28 | 12 / 2 | 32K | 7T |
| Qwen2-7B | 7B | 3,584 | 28 | 28 / 4 | 128K | 7T |
| Qwen2-72B | 72B | 8,192 | 80 | 64 / 8 | 128K | 7T |
| Qwen2-57B-A14B | 57B (14B active) | 3,584 | 28 | 28 / 4 | 64K | 4.5T |

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 2 Used It |
|:----------|:-------|:-------------------|
| GQA | Ainslie et al. (2023) | Replaced MHA for all Qwen 2 models |
| Dual Chunk Attention | An et al. (2024) | Long sequence handling for 128K |
| YARN | Peng et al. (2023) | Attention weight rescaling for length extrapolation |
| Fine-grained MoE | Dai et al. (DeepSeek, 2024) | Smaller experts with more activated simultaneously |
| Online Merging Optimizer | Lu et al. (2024) | Mitigating alignment tax during RLHF |
| DPO | Rafailov et al. (2023) | Offline preference optimization stage |
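
The attention rescaling attributed to YARN in the table can be illustrated with the temperature rule from the YaRN paper, where queries and keys are multiplied by `0.1 * ln(s) + 1` for a context extension factor `s`. This sketch covers only that scaling term, not YaRN's per-dimension frequency interpolation or DCA's chunking, and the 4× factor (32K to 128K) is an assumption for the example.

```python
import math

def yarn_attention_factor(scale):
    """sqrt(1/t) from the YaRN paper: 0.1 * ln(scale) + 1, or 1 with no extension."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

# Assumed extension: a 32K training window stretched to 128K.
factor = yarn_attention_factor(131_072 / 32_768)
# Scaling both q and k by `factor` multiplies every attention logit by factor ** 2.
logit_multiplier = factor ** 2
```

The factor grows only logarithmically with the extension ratio, so even large context extensions sharpen the softmax slightly rather than drastically.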

🔷 Qwen 2.5 (September 2024)

📅 Released: September 19, 2024  |  📄 arXiv:2412.15115

Summary

  • Data scaling landmark: the pre-training dataset expanded from 7T to 18 trillion tokens, one of the largest known training runs for open-weight models
  • Architecture identical to Qwen 2 at the model level (same GQA, DCA+YARN, SwiGLU, RoPE, 151K vocab); the improvements came from data quality, data scale, and post-training
  • Introduced three new model sizes: 3B (for mobile), 14B and 32B (for production), filling gaps that the community had been requesting
  • Knowledge improved: MMLU rose from 84.2 (Qwen 2) to 86.1 (Qwen 2.5) for the 72B base model, a meaningful gain near the top of the benchmark
  • Long text generation breakthrough: models could now generate up to 8K tokens per response (vs. ~1K in Qwen 2), enabled by post-training on long-form data
  • Structured output support added: models reliably produce JSON, tables, and formatted data, a critical feature for production agentic applications
  • Post-training evolved to over 1 million SFT samples plus multi-stage RL, incorporating techniques from the Qwen2.5-Math and Qwen2.5-Coder specialist models
  • Code performance surged thanks to Qwen2.5-Coder integration: LiveCodeBench jumped from 32.2 (Qwen 2) to 55.5 (Qwen 2.5) for the 72B instruct model
  • Math improved similarly via Qwen2.5-Math technology: the MATH benchmark went from 69.0 to 83.1 for the 72B instruct model
  • Qwen2.5-72B proved competitive with or superior to LLaMA-3.1-405B on many benchmarks despite being ~5x smaller
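
In an application, structured output is usually paired with a validation step on the caller's side. The following is a minimal sketch of such a check using only the standard library; the reply strings and key names are invented examples, not actual model output.

```python
import json

def parse_structured_reply(reply, required_keys):
    """Return the reply as a dict, or None if it is not a JSON object with the required keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    missing = set(required_keys) - set(data)
    return None if missing else data

# Hypothetical replies for illustration.
good = parse_structured_reply('{"city": "Hangzhou", "temp_c": 21}', {"city", "temp_c"})
bad = parse_structured_reply("Sure! The city is Hangzhou.", {"city"})
```

Returning `None` on failure lets the calling code retry or fall back, which is a common pattern in agentic pipelines that depend on machine-readable replies.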

Architecture Diagram: Qwen 2.5

Qwen 2.5: Same Architecture, Massive Data & Post-Training Upgrades
🔄 Architecture Unchanged from Qwen 2
GQA | DCA + YARN | SwiGLU | RoPE | RMSNorm | 128K context | 151K vocab
✨ NEW IN 2.5
Data & Training Improvements
  • Training Tokens: 18T (↑ from 7T, 2.6×)
  • SFT Samples: 1M+ (↑ from 500K, 2×)
  • Max Generation: 8K (↑ from ~1K, 8×)
3 new sizes: 3B, 14B, 32B | JSON/structured output | Multi-stage RL | Code + Math specialist fusion

Official Paper Figures

Qwen2.5-72B Instruct Performance

Source: Qwen2.5 Blog, 72B-Instruct benchmark comparison

Qwen2.5 Model Card

Source: Qwen2.5 Blog, model specifications overview

Community Perspective

  • The 18T token dataset was a headline number: more than Llama 3's 15T, and a signal of massive investment in data curation
  • Qwen2.5-32B outperforming Qwen2-72B demonstrated that data quality matters more than model size at this scale
  • The structured output capabilities made Qwen 2.5 the go-to choice for many agentic/tool-use applications
  • The code and math improvements were directly attributable to specialist model techniques, showing the value of the Qwen ecosystem approach
  • The community noted that same-architecture improvements have diminishing returns, and expectations built for an architecture refresh in Qwen 3

Model Variants

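| Model | Total Params | Non-Emb Params | Layers | Q Heads / KV Heads | Emb. Tying | Context | Gen. Length |
|:------|:------------:|:--------------:|:------:|:------------------:|:----------:|:-------:|:-----------:|
| Qwen2.5-0.5B | 0.49B | 0.36B | 24 | 14 / 2 | Yes | 32K | 8K |
| Qwen2.5-1.5B | 1.54B | 1.31B | 28 | 12 / 2 | Yes | 32K | 8K |
| Qwen2.5-3B | 3.09B | 2.77B | 36 | 16 / 2 | Yes | 32K | 8K |
| Qwen2.5-7B | 7.61B | 6.53B | 28 | 28 / 4 | No | 128K | 8K |
| Qwen2.5-14B | 14.7B | 13.1B | 48 | 40 / 8 | No | 128K | 8K |
| Qwen2.5-32B | 32.5B | 31.0B | 64 | 40 / 8 | No | 128K | 8K |
| Qwen2.5-72B | 72.7B | 70.0B | 80 | 64 / 8 | No | 128K | 8K |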
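
The gap between the total and non-embedding columns can be sanity-checked from the vocabulary size and hidden width. The sketch below assumes Qwen 2.5 keeps the Qwen 2 hidden sizes (896 for 0.5B, 3,584 for 7B) and uses the 151,646 vocabulary figure quoted in this document; checkpoints may pad the embedding matrix, so treat the result as approximate.

```python
VOCAB = 151_646  # vocabulary size quoted in this document

def embedding_params(hidden, tied):
    """Input embedding plus output head; tying shares one matrix between them."""
    matrices = 1 if tied else 2
    return matrices * VOCAB * hidden

# Hidden sizes are assumed to match the Qwen 2 table.
small = embedding_params(896, tied=True)     # Qwen2.5-0.5B
large = embedding_params(3584, tied=False)   # Qwen2.5-7B

small_gap_b = 0.49 - 0.36   # total minus non-embedding, in billions
large_gap_b = 7.61 - 6.53
```

The estimates (about 0.14B and 1.09B) line up with the table gaps of 0.13B and 1.08B. They also show why embedding tying matters most for the small models: without it, roughly a quarter of a billion parameters in the 0.5B model would sit in embedding matrices.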

API-only models: Qwen2.5-Turbo (MoE) and Qwen2.5-Plus (MoE) were also released through Alibaba Cloud Model Studio.

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 2.5 Used It |
|:----------|:-------|:-------------------|
| Specialist Model Distillation | Multi-task learning research | Fused Qwen2.5-Coder and Qwen2.5-Math capabilities into the general model |
| Multi-stage RL | DeepSeek, OpenAI o1 (2024) | Multiple RL stages for different capability domains |
| Structured Output Training | GPT-4 function calling (2023) | Reliable JSON/structured data generation |
| Long-form Generation SFT | - | Dedicated training for 8K+ token outputs |
| System Prompt Robustness | - | Training on diverse system prompts for better role-play |

🟢 Qwen 3 (April 2025)

📅 Released: April 29, 2025  |  📄 arXiv:2505.09388

Summary

  • Paradigm shift: introduced hybrid thinking modes, so a single model can switch between "Thinking" mode (step-by-step reasoning, like o1/QwQ) and "Non-Thinking" mode (fast direct responses)
  • Massive scale-up: the flagship Qwen3-235B-A22B has 235B total parameters with 22B activated, the largest Qwen MoE to date, alongside Qwen3-30B-A3B as an efficient smaller MoE
  • Released 8 models total: 6 dense (0.6B, 1.7B, 4B, 8B, 14B, 32B) + 2 MoE (30B-A3B, 235B-A22B), all open-weight under Apache 2.0
  • Training data doubled to 36 trillion tokens covering 119 languages and dialects, a dramatic jump from Qwen 2.5's 29 languages
  • Used Qwen2.5-VL to extract text from PDF-like documents and Qwen2.5-Math/Coder to generate synthetic training data: the "models training models" paradigm
  • 4-stage post-training pipeline: (1) Long CoT cold start, (2) Reasoning-based RL with rule-based rewards, (3) Thinking mode fusion, blending thinking and non-thinking data, (4) General RL across 20+ domains
  • Thinking budget mechanism lets users control how much reasoning compute to allocate per query, enabling smooth latency vs. quality tradeoffs
  • Three-stage pre-training: S1 (30T+ tokens, 4K context) → S2 (5T tokens, knowledge-intensive STEM/code/reasoning) → S3 (high-quality long-context data, extending to 32K)
  • Dense models match the performance of Qwen 2.5 models about 2× their size: e.g., Qwen3-8B ≈ Qwen2.5-14B, Qwen3-4B ≈ Qwen2.5-7B
  • MoE models achieve similar performance to Qwen 2.5 dense models at only ~10% of the active parameters; Qwen3-30B-A3B outperforms QwQ-32B with 10× fewer active params
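
Because thinking-mode output places the reasoning inside `<think>...</think>` tags, downstream code typically separates it from the final answer. Here is a minimal sketch of that split using a regular expression; the sample text is invented, and real chat templates may handle this step for you.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Return (reasoning, answer); reasoning is '' when no think block is present."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

# Hypothetical model output for illustration.
reasoning, answer = split_thinking("<think>2 + 2 is 4.</think>\n\nThe answer is 4.")
plain = split_thinking("Hello there.")
```

Keeping the reasoning separate makes it easy to log or hide it while showing only the answer to end users.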

Architecture Diagram: Qwen 3

Qwen 3: Hybrid Thinking + Scaled MoE
✨ PARADIGM SHIFT
Hybrid Thinking/Non-Thinking Mode
🧠 Thinking Mode
  • Step-by-step reasoning in <think>...</think>
  • Complex math, coding, logic
  • User controls thinking budget
⚡ Non-Thinking Mode
  • Direct, fast responses
  • Simple queries, chat, translation
  • Toggle via /think or /no_think
Single unified model: no need to switch between chat and reasoning model variants
4-Stage Post-Training Pipeline
Stage 1: Long CoT Cold Start → Stage 2: Reasoning RL → Stage 3: Think Mode Fusion → Stage 4: General RL (20+ tasks)
✨ LARGEST MoE
Qwen3-235B-A22B MoE Architecture
235B total → 22B active | 94 layers | 128 experts, 8 activated per token (top-8 routing) | no shared experts | GQA 64Q / 4KV
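
The /think and /no_think soft switches can be modeled on the client side as a simple "latest directive wins" rule. The sketch below is an assumption about how an application might track the mode across a multi-turn conversation; the function name and the default of thinking-on are hypothetical, and the model's own chat template remains the authority on actual behavior.

```python
def resolve_thinking(user_turns, default=True):
    """Return True for thinking mode, following the last /think or /no_think seen."""
    mode = default
    for turn in user_turns:
        for token in turn.split():
            if token == "/think":
                mode = True
            elif token == "/no_think":
                mode = False
    return mode

# Hypothetical conversation: the later directive overrides the earlier one.
turns = ["Translate this sentence /no_think", "Now prove the lemma /think"]
mode_after_both = resolve_thinking(turns)
mode_after_first = resolve_thinking(turns[:1])
```

Tracking the mode explicitly is useful for budgeting latency, since thinking-mode replies can be much longer than direct answers.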

Official Paper Figures

Qwen3-235B-A22B Benchmarks

Source: Qwen3 Blog, Qwen3-235B-A22B benchmark comparison against DeepSeek-R1, o1, o3-mini, Grok-3, Gemini-2.5-Pro

Qwen3-30B-A3B Benchmarks

Source: Qwen3 Blog, Qwen3-30B-A3B outperforming QwQ-32B with 10× fewer active parameters

Thinking Budget Scaling

Source: Qwen3 Blog, thinking budget mechanism showing smooth performance scaling with compute

4-Stage Post-Training

Source: Qwen3 Blog, 4-stage post-training pipeline overview

Community Perspective

  • The hybrid thinking mode was seen as a direct answer to OpenAI's o1/o3 and DeepSeek-R1, but more elegant because it is a single model rather than separate chat and reasoning models
  • Qwen3-30B-A3B outperforming QwQ-32B was a landmark result, demonstrating extreme MoE efficiency
  • 119-language support (up from 29) was a massive expansion, making Qwen 3 one of the most multilingual open-weight models available
  • The 4-stage post-training pipeline was praised as a well-engineered approach to combining reasoning and general capabilities
  • The open-source community quickly adopted the /think and /no_think toggles as an intuitive user interface for controlling reasoning depth

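Model Variants: Dense

| Model | Params | Layers | Q Heads / KV Heads | Emb. Tying | Context |
|:------|:------:|:------:|:------------------:|:----------:|:-------:|
| Qwen3-0.6B | 0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 4B | 36 | 32 / 8 | Yes | 32K |
| Qwen3-8B | 8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 32B | 64 | 64 / 8 | No | 128K |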

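Model Variants: MoE

| Model | Total Params | Active Params | Layers | Q/KV Heads | Experts (Total / Activated) | Context |
|:------|:------------:|:-------------:|:------:|:----------:|:---------------------------:|:-------:|
| Qwen3-30B-A3B | 30B | 3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 235B | 22B | 94 | 64 / 4 | 128 / 8 | 128K |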
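
The "~10% of active parameters" claim follows directly from the parameter counts in the MoE table, as this short arithmetic sketch shows. The figures are the rounded values listed in this document, so the results are approximate.

```python
def active_fraction(total_b, active_b):
    """Share of parameters used per token, given sizes in billions."""
    return active_b / total_b

# (total, active) in billions, as listed in the MoE table.
models = {"Qwen3-30B-A3B": (30, 3), "Qwen3-235B-A22B": (235, 22)}
fractions = {name: active_fraction(*sizes) for name, sizes in models.items()}
```

Both models activate roughly a tenth of their weights per token, which is what lets the 30B-A3B model run at the per-token compute cost of a much smaller dense model.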
Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 3 Used It |
|:----------|:-------|:-------------------|
| Hybrid Thinking/Non-Thinking | OpenAI o1 (2024), DeepSeek-R1 (2025) | Unified single model with switchable reasoning modes |
| Thinking Budget Control | - | User-configurable compute allocation per query |
| Rule-based RL Rewards | DeepSeek-R1 (2025) | Used in Stage 2 of post-training for reasoning RL |
| Synthetic Data from Models | Phi-series (Microsoft), Qwen2.5 | Training data generated by Qwen2.5-VL, Math, Coder |
| Multi-stage Pre-training | Industry practice (2024-2025) | S1 (general) → S2 (knowledge) → S3 (long-context) |
| MCP Tool Protocol | Anthropic (2024) | Enhanced agentic capabilities with MCP support |

📚 References

Technical Papers

| Version | Title | Link | Date |
|:--------|:------|:-----|:-----|
| Qwen 1 | Qwen Technical Report | arXiv:2309.16609 | Sep 2023 |
| Qwen 2 | Qwen2 Technical Report | arXiv:2407.10671 | Jul 2024 |
| Qwen 2.5 | Qwen2.5 Technical Report | arXiv:2412.15115 | Dec 2024 |
| Qwen 3 | Qwen3 Technical Report | arXiv:2505.09388 | May 2025 |

Official Blog Posts

| Title | Link |
|:------|:-----|
| Introducing Qwen1.5 | qwenlm.github.io/blog/qwen1.5 |
| Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters | qwenlm.github.io/blog/qwen-moe |
| Qwen1.5-110B: The First 100B+ Model of the Qwen1.5 Series | qwenlm.github.io/blog/qwen1.5-110b |
| Qwen2.5: A Party of Foundation Models! | qwenlm.github.io/blog/qwen2.5 |
| Qwen2.5-LLM: Extending the Boundary of LLMs | qwenlm.github.io/blog/qwen2.5-llm |
| Qwen3: Think Deeper, Act Faster | qwenlm.github.io/blog/qwen3 |

GitHub & Model Repositories

| Resource | Link |
|:---------|:-----|
| Qwen GitHub (Main) | github.com/QwenLM/Qwen |
| Qwen1.5 GitHub | github.com/QwenLM/Qwen1.5 |
| Qwen2.5 GitHub | github.com/QwenLM/Qwen2.5 |
| Qwen3 GitHub | github.com/QwenLM/Qwen3 |
| Hugging Face Collection | huggingface.co/Qwen |
| ModelScope Collection | modelscope.cn/organization/qwen |

Cited Techniques

| Technique | Paper | Link |
|:----------|:------|:-----|
| SwiGLU Activation | Shazeer, "GLU Variants Improve Transformer" (2020) | arXiv:2002.05202 |
| RoPE | Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021) | arXiv:2104.09864 |
| RMSNorm | Jiang et al., "Pre-RMSNorm and Pre-CRMSNorm Transformers" (2023) | arXiv:2305.14858 |
| GQA | Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models" (EMNLP 2023) | arXiv:2305.13245 |
| YARN | Peng et al., "YaRN: Efficient Context Window Extension" (2023) | arXiv:2309.00071 |
| DCA | An et al., "Training-Free Long-Context Scaling" (2024) | arXiv:2402.17463 |
| DeepSeek-MoE | Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization" (2024) | arXiv:2401.06066 |
| DPO | Rafailov et al., "Direct Preference Optimization" (NeurIPS 2023) | arXiv:2305.18290 |
| Upcycling | Komatsuzaki et al., "Sparse Upcycling: Training MoE from Dense Checkpoints" (ICLR 2023) | arXiv:2212.05055 |
| DeepSeek-R1 | DeepSeek Team, "DeepSeek-R1" (2025) | arXiv:2501.12948 |

Built with data from official Qwen technical papers and blog posts. All benchmark numbers sourced directly from the referenced publications.

โ† Back to Index