🟣 Qwen — Model Architecture Across Generations

From a competitive Chinese LLM to a global open-weight powerhouse — tracing 5 generations of architecture evolution.

📑 Table of Contents

Executive Summary
Version Release Timeline
Cross-Version Benchmark Comparison
Master Architecture Diagram
Qwen 1 (September 2023)
Qwen 1.5 (February 2024)
Qwen 2 (July 2024)
Qwen 2.5 (September 2024)
Qwen 3 (April 2025)
References

📋 Executive Summary

This document covers five generations of the Qwen large language model family developed by Alibaba Cloud’s Qwen Team:

Qwen 1: the foundation, with SwiGLU, RoPE, RMSNorm, a 151K BPE vocabulary, and 3T training tokens
Qwen 1.5: scale-out, with 6 dense sizes at launch (0.5B–72B; 32B and 110B followed later), the first MoE model, 32K context, and native HF transformers support
Qwen 2: architecture leap, with GQA, DCA+YARN for 128K context, the 57B-A14B MoE, 7T tokens, and about 30 languages
Qwen 2.5: data scaling, with 18T tokens, new 3B/14B/32B sizes, structured output, and 8K-token generation
Qwen 3: reasoning era, with hybrid think/non-think modes, 36T tokens, 119 languages, and 4-stage post-training

📝 Note: The Qwen family also includes specialized variants — Qwen-VL (vision-language), Qwen-Audio, Qwen-Coder, and Qwen-Math — which are documented separately. This document focuses on the core LLM architecture.

📅 Version Release Timeline

| Version | Release Date | Paper / Blog | Flagship Size | Training Tokens | Context Length | Headline Feature |
|:-------:|:----------:|:------------|:------------:|:--------------:|:------------:|:----------------|
| Qwen 1 | Sep 28, 2023 | [arXiv:2309.16609](https://arxiv.org/abs/2309.16609) | 72B | 3T | 8K (32K ext.) | First competitive Alibaba LLM |
| Qwen 1.5 | Feb 4, 2024 | [Blog](https://qwenlm.github.io/blog/qwen1.5/) | 72B + MoE-A2.7B | ~3T | 32K | HF-native + first MoE |
| Qwen 2 | Jul 15, 2024 | [arXiv:2407.10671](https://arxiv.org/abs/2407.10671) | 72B + MoE-57B-A14B | 7T | 128K | GQA + DCA/YARN + MoE |
| Qwen 2.5 | Sep 19, 2024 | [arXiv:2412.15115](https://arxiv.org/abs/2412.15115) | 72B | 18T | 128K | Data scaling + structured output |
| Qwen 3 | Apr 29, 2025 | [arXiv:2505.09388](https://arxiv.org/abs/2505.09388) | 235B-A22B | 36T | 128K | Hybrid thinking + 119 languages |

📊 Cross-Version Benchmark Comparison

All numbers are for the largest dense base model of each generation; for Qwen 3 that is Qwen3-32B-Base rather than the 235B-A22B MoE flagship. Sources: official technical papers.

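| Benchmark | Qwen 1 (72B) | Qwen 1.5 (72B) | Qwen 2 (72B) | Qwen 2.5 (72B) | Qwen 3 (32B) |
|:----------|:------------:|:--------------:|:------------:|:--------------:|:------------:|
| MMLU | 74.5 | 77.5 | 84.2 | 86.1 | ~83* |
| HumanEval | 37.2 | 41.5 | 64.6 | 59.1 | ~65* |
| MATH | 17.4 | 34.1 | 51.1 | 62.1 | ~68* |
| GSM8K | 78.9 | 79.5 | 89.5 | 91.5 | ~92* |
| BBH | 67.4 | 65.5 | 82.4 | 86.3 | ~85* |
| Context Length | 8K (32K ext.) | 32K | 128K | 128K | 128K |
| Languages | 2 (en/zh) | ~12 | ~30 | ~29 | 119 |
| Training Tokens | 3T | ~3T | 7T | 18T | 36T |
| Vocabulary | 151,646 | 151,646 | 151,646 | 151,646 | 151,646 |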

\*Qwen 3 base model benchmarks are approximate values for the largest dense model (Qwen3-32B-Base), read from the pre-training performance curves in the Qwen 3 technical report. Exact numbers may vary.

🏗️ Master Architecture Diagram

This diagram shows the core Transformer decoder architecture shared across all Qwen versions, with color-coded annotations indicating which generation introduced each component.

Qwen Architecture — Component Evolution

🟣 Qwen 1 🟪 Qwen 1.5 🔵 Qwen 2 🔷 Qwen 2.5 🟢 Qwen 3

Token Embedding — 151,646 vocab, byte-level BPE

Introduced in Qwen 1 • Shared across all versions

↓

× N TRANSFORMER LAYERS

RMSNorm (Pre-Normalization) — Qwen 1

Self-Attention Block

RoPE QKV Bias MHA → used until 1.5

GQA replaces MHA DCA + YARN (128K ctx)

+ Residual Connection

RMSNorm (Pre-Normalization) — Qwen 1

Feed-Forward Block

SwiGLU Activation Dense FFN → replaced in MoE variants

MoE FFN (first in 1.5, 64 experts)

Fine-grained Experts + Shared Experts MoE 128 routed experts, 8 active per token (Qwen 3, no shared experts)

+ Residual Connection

↓

LM Head — Next-token prediction

Structured Output / JSON (2.5+) Think/No-Think modes (3)

🟣 Qwen 1 — September 2023

📅 Released: September 28, 2023 | 📄 arXiv:2309.16609

Summary

First large-scale open-weight LLM from Alibaba Cloud, establishing the Qwen brand in the competitive Chinese LLM landscape alongside Yi, Baichuan, and ChatGLM
Built on a standard decoder-only Transformer architecture with causal attention masks, following the paradigm set by GPT and LLaMA
Adopted SwiGLU (Swish-Gated Linear Unit) activation in FFN layers — borrowed from PaLM/LLaMA, replacing the conventional ReLU/GELU, providing smoother gradients and better performance
Used Rotary Positional Embeddings (RoPE) for encoding position information, enabling better length generalization than absolute positional embeddings
Applied RMSNorm with pre-normalization for improved training stability — a technique popularized by LLaMA and now standard practice
Introduced QKV bias in attention layers — an uncommon design choice at the time that later papers showed helps with RoPE-based length extrapolation
Trained on 3 trillion tokens of multilingual data (primarily English and Chinese) using a byte-level BPE tokenizer with 151,646 vocabulary — one of the largest vocabs at the time, designed for strong multilingual compression
Released in four sizes (1.8B, 7B, 14B, and 72B parameters), using standard Multi-Head Attention (MHA) across all layers
Included specialized variants: Qwen-Chat (aligned via SFT + RLHF), Code-Qwen, and Math-Qwen — demonstrating a full-stack approach from day one
Supported 8K context natively, with NTK-aware interpolation (a community-developed method that rescales RoPE frequencies at inference time) extending it to 32K
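
To make the RoPE and NTK-aware points above concrete, here is a minimal NumPy sketch. It uses the widely circulated community formula for NTK-aware scaling, which is an assumption here and not necessarily the exact variant Qwen 1 ships; the scale factor of 4 (8K to 32K) and the interleaved pairing of features are also illustrative choices.

```python
import numpy as np

def rope_inverse_frequencies(head_dim, base=10000.0):
    """RoPE rotation frequency for each (even, odd) feature pair: base^(-2i/d)."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def ntk_scaled_base(base, scale, head_dim):
    """Community NTK-aware rule: grow the base so the slowest frequency shrinks by `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))

def apply_rope(x, positions, inv_freq):
    """Rotate each (even, odd) feature pair of x (seq, head_dim) by position * frequency."""
    angles = np.outer(positions, inv_freq)  # (seq, head_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

head_dim, scale = 128, 4.0  # scale 4 is an assumed 8K -> 32K extension factor
inv_freq = rope_inverse_frequencies(head_dim)
inv_freq_ntk = rope_inverse_frequencies(head_dim, ntk_scaled_base(10000.0, scale, head_dim))

# The fastest-rotating pair is untouched; the slowest pair rotates `scale` times more slowly.
print(inv_freq_ntk[0] / inv_freq[0], inv_freq[-1] / inv_freq_ntk[-1])
```

Because only the slow (long-wavelength) components are stretched, nearby tokens keep their fine-grained positional resolution while distant positions are compressed into the range seen during training.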

Architecture Diagram — Qwen 1

Qwen 1 — Transformer Decoder Block

Input Embeddings (151,646 vocab) + RoPE

↓

Multi-Head Attention (MHA)

RoPE Positional QKV Bias Causal Mask 72B: 64 heads, d=128

+ Residual → RMSNorm

↓

Feed-Forward Network (Dense)

SwiGLU Activation FFN width reduced from 4× to 8/3× hidden size 72B: 49,152 intermediate

+ Residual → RMSNorm

↓

LM Head → Next Token Prediction
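
The normalization and feed-forward pieces of the block above can be sketched in a few lines of NumPy. This is an illustrative toy, not Qwen's implementation: the sizes, the `eps` value, and the function names are assumptions, and the sketch follows the pre-normalization order described in the summary (normalize, apply the sublayer, then add the residual).

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide each vector by its root-mean-square, then apply a learned per-feature gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: the SiLU-activated gate multiplies the up projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def pre_norm_ffn_sublayer(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sublayer input, then add the residual."""
    return x + swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)

rng = np.random.default_rng(0)
hidden, intermediate = 16, 44  # toy sizes chosen for illustration only
x = rng.normal(size=(4, hidden))
w_gate = rng.normal(scale=0.1, size=(hidden, intermediate))
w_up = rng.normal(scale=0.1, size=(hidden, intermediate))
w_down = rng.normal(scale=0.1, size=(intermediate, hidden))

y = pre_norm_ffn_sublayer(x, np.ones(hidden), w_gate, w_up, w_down)
print(y.shape)  # (4, 16)
```

SwiGLU uses three weight matrices where a GELU FFN uses two, which is why models that adopt it usually shrink the intermediate width to keep the parameter count comparable.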

Community Perspective

Received strong reception in the Chinese AI community as a capable alternative to LLaMA for Chinese-English tasks
The 72B model demonstrated that Chinese AI labs could produce GPT-3.5-competitive models
The large 151K vocabulary was praised for efficient multilingual tokenization — many competitors used smaller vocabs
Tool-use and code interpreter capabilities in Qwen-Chat were ahead of most open-source alternatives at launch
Some concern about training data transparency compared to fully open models like LLaMA

Model Variants

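| Model | Parameters | Layers | Heads (Q/KV) | Context | Embedding Tying |
|:------|:----------:|:------:|:------------:|:-------:|:---------------:|
| Qwen-7B | 7.7B | 32 | 32 / 32 | 8K (32K) | No |
| Qwen-72B | 72B | 80 | 64 / 64 | 8K (32K) | No |
| Qwen-7B-Chat | 7.7B | 32 | 32 / 32 | 8K (32K) | No |
| Qwen-72B-Chat | 72B | 80 | 64 / 64 | 8K (32K) | No |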

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 1 Used It |
|:----------|:-------|:-------------------|
| SwiGLU | Shazeer (2020); adopted by PaLM (Google, 2022) | FFN activation function replacing GELU |
| RoPE | Su et al. (RoFormer, 2021) | Positional encoding for all attention layers |
| RMSNorm | Zhang & Sennrich (2019) | Replaced LayerNorm for faster, more stable training |
| BPE Tokenizer | Sennrich et al. (2015) | Byte-level BPE with 151K vocab for multilingual text |
| NTK-aware Interpolation | Reddit/community (2023) | Dynamic RoPE scaling for context extension |

🟪 Qwen 1.5 — February 2024

📅 Released: February 4, 2024 | 📄 Blog Post

Summary

Incremental refinement rather than architecture overhaul — focused on improving base model quality and massively expanding the developer experience
Expanded to 8 dense model sizes: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B. Six shipped at launch; the 32B and 110B followed in April 2024, and the 110B was the first 100B+ model in the Qwen family
Introduced the first Qwen MoE model: Qwen1.5-MoE-A2.7B with 14.3B total parameters and 2.7B activated, reaching 7B-class performance with about one third of the activated parameters
Architecture carried over from Qwen 1 (MHA, SwiGLU, RoPE, RMSNorm, QKV bias), except that the later 32B and 110B models switched to GQA; improvements came mainly from better data, longer training, and alignment techniques
Uniformly 32K context across all model sizes — up from the 8K default of Qwen 1 — achieved through RoPE frequency adjustments
Native Hugging Face transformers integration — no more trust_remote_code=True, making deployment frictionless with transformers>=4.37.0
Alignment enhanced with DPO (Direct Preference Optimization) and PPO (Proximal Policy Optimization) — producing significantly better chat models
Multilingual capabilities expanded to ~12 languages with structured evaluation on Arabic, Spanish, French, Japanese, Korean, Thai, Vietnamese, and more
MoE architecture used 64 fine-grained experts with 4 shared + 60 routed (4 activated per token) — inspired by DeepSeek-MoE’s fine-grained expert design
Upcycling initialization for MoE: started from Qwen-1.8B weights, transformed into an MoE structure with randomized initialization for diversity; the Qwen team reports a 75% lower training cost than Qwen1.5-7B at comparable quality
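
For readers unfamiliar with DPO, the per-example loss from Rafailov et al. (2023) is short enough to write out. The sketch below is generic: the `beta` value and the example log-probabilities are assumptions for illustration, and nothing here reflects Qwen's actual training configuration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin measures how much more the
    policy prefers the chosen response than the frozen reference model does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# A policy identical to the reference sits at log(2); preferring the chosen answer lowers the loss.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Unlike PPO, this objective needs no separately trained reward model or sampling loop, only log-probabilities from the policy and a frozen reference.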

Architecture Diagram — Qwen 1.5 MoE

Qwen 1.5-MoE — First MoE Architecture

Input Embeddings (151,646 vocab) + RoPE (32K context)

↓

Multi-Head Attention (MHA) inherited

Same as Qwen 1: RoPE + QKV Bias + Causal Mask

+ Residual → RMSNorm

↓

✨ NEW

MoE Feed-Forward Network

🔀 Gated Router — softmax → top-4 selection from 60 routed experts

4 Shared Experts
Always activated

60 Routed Experts (fine-grained)
4 activated per token → SwiGLU each

📐 Total: 14.3B params | Active: 2.7B params | Non-emb: 2.0B params

+ Residual → RMSNorm

↓

LM Head → Next Token Prediction
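
The routing scheme in the diagram above (4 always-on shared experts plus top-4 of 60 routed experts) can be sketched as follows. This is a toy under stated assumptions: experts are plain linear maps rather than SwiGLU FFNs, routed outputs are weighted by the raw softmax probability without renormalization, and shared outputs are added unweighted. The released model may differ on each of these points.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_w, routed_experts, shared_experts, top_k=4):
    """x: (tokens, hidden). Each expert is a callable mapping (hidden,) -> (hidden,)."""
    probs = softmax(x @ router_w)                     # (tokens, n_routed)
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]  # ids of the k most probable experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top_idx[t]:                          # routed experts: only top-k run
            out[t] += probs[t, e] * routed_experts[e](x[t])
        for expert in shared_experts:                 # shared experts: always run
            out[t] += expert(x[t])
    return out, top_idx

rng = np.random.default_rng(0)
hidden, n_routed, n_shared = 8, 60, 4

def make_expert():
    w = rng.normal(scale=0.1, size=(hidden, hidden))
    return lambda v: v @ w  # toy linear expert (hypothetical stand-in for a SwiGLU FFN)

routed = [make_expert() for _ in range(n_routed)]
shared = [make_expert() for _ in range(n_shared)]
router_w = rng.normal(size=(hidden, n_routed))
tokens = rng.normal(size=(3, hidden))

out, chosen = moe_forward(tokens, router_w, routed, shared, top_k=4)
print(chosen.shape)  # (3, 4)
```

Each token touches 8 of the 64 experts, which is how the model keeps activated parameters far below the total parameter count.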

Community Perspective

Widely praised for the developer experience overhaul — HF-native support was a game-changer for adoption
The MoE model (A2.7B) surprised many by matching Mistral-7B and Qwen1.5-7B while being 1/3 the activated size
The 110B model was seen as a statement of scale ambition, though it didn’t get as much adoption as the 72B
Strong reception for the expanded size lineup — the 0.5B and 1.8B models enabled edge/mobile deployment
Criticism: the architecture was largely unchanged from Qwen 1, so improvements felt incremental

Model Variants

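| Model | Total Params | Active Params | Layers | Context | Notes |
|:------|:------------:|:-------------:|:------:|:-------:|:------|
| Qwen1.5-0.5B | 0.5B | 0.5B | 24 | 32K | Embedding tying |
| Qwen1.5-1.8B | 1.8B | 1.8B | 24 | 32K | Embedding tying |
| Qwen1.5-4B | 4B | 4B | 40 | 32K | — |
| Qwen1.5-7B | 7.7B | 7.7B | 32 | 32K | — |
| Qwen1.5-14B | 14B | 14B | 40 | 32K | — |
| Qwen1.5-32B | 32B | 32B | 64 | 32K | — |
| Qwen1.5-72B | 72B | 72B | 80 | 32K | — |
| Qwen1.5-110B | 110B | 110B | 80 | 32K | First 100B+ Qwen |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 24 | 32K | 64 experts, 4 shared |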

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 1.5 Used It |
|:----------|:-------|:-------------------|
| Fine-grained MoE Experts | DeepSeek-MoE (Jan 2024) | 64 fine-grained experts instead of 8 coarse experts |
| Shared + Routed Experts | DeepSeek-MoE, Rajbhandari et al. (2022) | 4 shared experts always active alongside routed ones |
| Upcycling | Komatsuzaki et al. (2023) | Initialize MoE from dense model weights |
| DPO | Rafailov et al. (2023) | Direct preference optimization for alignment |
| PPO | Schulman et al. (2017) | Proximal policy optimization for RLHF |

🔵 Qwen 2 — July 2024

📅 Released: July 15, 2024 | 📄 arXiv:2407.10671

Summary

Major architecture upgrade — the most significant changes since Qwen’s inception, introducing multiple new attention and positional mechanisms
Grouped Query Attention (GQA) replaced MHA across all models — dramatically reducing KV cache memory during inference while maintaining quality
Dual Chunk Attention (DCA) + YARN enabled 128K context by segmenting long sequences into manageable chunks with rescaled attention weights
Released in 5 model sizes: 0.5B, 1.5B, 7B, and 72B (dense) plus 57B-A14B (MoE), which has 57B total parameters with 14B active per token
MoE architecture advanced significantly: fine-grained experts with smaller expert size, shared + routing experts (8 shared + 64 routed, 8 activated), and upcycled from Qwen2-7B
Training data scaled to 7 trillion tokens (from 3T) with dramatically expanded code, math, and multilingual content — supporting ~30 languages
Smaller models (0.5B, 1.5B) used embedding tying and were trained on 12T and 7T tokens respectively — more tokens per parameter than larger models
Post-training involved SFT with 500K+ examples followed by both offline DPO and online RLHF with a reward model — the most sophisticated alignment pipeline in the Qwen family at the time
Online Merging Optimizer was used to mitigate alignment tax — reducing performance degradation from RLHF
RoPE base frequency increased from 10,000 to 1,000,000 in the long-context training phase — enabling much longer effective sequence lengths
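
As an illustration of the GQA change listed above, the following NumPy sketch shares each key/value head across a group of query heads. It is a minimal single-sequence version with a causal mask; the head counts match the 28 / 4 layout this document lists for Qwen2-7B, while the sequence length and head dimension are toy values chosen for the example.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where groups of query heads share one key/value head.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), with n_q_heads % n_kv_heads == 0.
    """
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    k_rep = np.repeat(k, group, axis=0)  # query head h reads KV head h // group
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # block attention to later positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, seq_len, head_dim = 28, 4, 5, 8  # seq_len and head_dim are toy values
q = rng.normal(size=(n_q_heads, seq_len, head_dim))
k = rng.normal(size=(n_kv_heads, seq_len, head_dim))
v = rng.normal(size=(n_kv_heads, seq_len, head_dim))

attn_out = grouped_query_attention(q, k, v)
print(attn_out.shape)  # (28, 5, 8)
```

Only `k` and `v` need to be cached during generation, so the cache here holds 4 heads instead of 28; setting the two head counts equal recovers ordinary MHA.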

Architecture Diagram — Qwen 2

Qwen 2 — Key Architecture Changes from Qwen 1/1.5

❌ Removed (Qwen 1/1.5)

Multi-Head Attention

8K/32K context

RoPE base freq = 10,000

3T training tokens

→

✅ Added (Qwen 2)

Grouped Query Attention

128K context (DCA + YARN)

RoPE base freq = 1,000,000

7T training tokens

✨ UPGRADED MoE

Qwen2-57B-A14B MoE

57B total → 14B active 8 shared experts 64 routed experts top-8 routing

Upcycled from Qwen2-7B | Expert intermediate size: 2,560 | Shuffled + 50% re-init for diversity

GQA Head Configurations

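| Model | Q Heads | KV Heads | GQA Ratio |
|:------|:-------:|:--------:|:---------:|
| 0.5B | 14 | 2 | 7:1 |
| 1.5B | 12 | 2 | 6:1 |
| 7B | 28 | 4 | 7:1 |
| 72B | 64 | 8 | 8:1 |
| 57B-A14B (MoE) | 28 | 4 | 7:1 |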
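
A quick back-of-the-envelope calculation shows what the 8:1 ratio buys for the 72B model. The sketch assumes 80 layers and a head dimension of 128 (hidden size 8,192 divided by 64 query heads, per the Model Variants table in this section), 16-bit cache entries, and a single 128K-token sequence; real deployments vary with batch size, cache dtype, and paging.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

GIB = 1024 ** 3
tokens = 128 * 1024  # one full 128K-token context

mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, tokens=tokens)  # hypothetical MHA variant
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=tokens)   # Qwen2-72B head layout

print(f"MHA: {mha / GIB:.0f} GiB, GQA: {gqa / GIB:.0f} GiB")  # MHA: 320 GiB, GQA: 40 GiB
```

The cache shrinks in direct proportion to the number of KV heads, which is what makes long contexts practical to serve at this model size.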

Official Paper Figure

Needle in a Haystack test results for Qwen2 instruction-tuned models showing capability across 128K context:

[Figure: Qwen2 Needle in a Haystack]

_Source: Qwen2 Technical Report (arXiv:2407.10671), Figure 1_

Community Perspective

GQA adoption was welcomed as overdue — competitors like LLaMA 2 (70B) had already adopted it for KV cache efficiency
The 128K context via DCA+YARN was a major selling point, though real-world performance degraded at extreme lengths
57B-A14B MoE model showcased that Qwen’s MoE expertise had matured — fine-grained experts were more efficient than Mixtral’s coarse approach
Qwen2-72B’s competitiveness with LLaMA-3-70B established Qwen as a top-tier global open-weight model — not just a Chinese alternative
The 7T token dataset with 30 language support marked Qwen’s transition from a bilingual to a truly multilingual model family

Model Variants

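| Model | Total Params | Hidden | Layers | Q Heads / KV Heads | Context | Tokens |
|:------|:------------:|:------:|:------:|:------------------:|:-------:|:------:|
| Qwen2-0.5B | 0.5B | 896 | 24 | 14 / 2 | 32K | 12T |
| Qwen2-1.5B | 1.5B | 1,536 | 28 | 12 / 2 | 32K | 7T |
| Qwen2-7B | 7B | 3,584 | 28 | 28 / 4 | 128K | 7T |
| Qwen2-72B | 72B | 8,192 | 80 | 64 / 8 | 128K | 7T |
| Qwen2-57B-A14B | 57B (14B active) | 3,584 | 28 | 28 / 4 | 64K | 4.5T |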

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 2 Used It |
|:----------|:-------|:-------------------|
| GQA | Ainslie et al. (2023) | Replaced MHA for all Qwen 2 models |
| Dual Chunk Attention | An et al. (2024) | Long sequence handling for 128K |
| YARN | Peng et al. (2023) | Attention weight rescaling for length extrapolation |
| Fine-grained MoE | Dai et al. (DeepSeek, 2024) | Smaller experts with more activated simultaneously |
| Online Merging Optimizer | Lu et al. (2024) | Mitigating alignment tax during RLHF |
| DPO | Rafailov et al. (2023) | Offline preference optimization stage |

🔷 Qwen 2.5 — September 2024

📅 Released: September 19, 2024 | 📄 arXiv:2412.15115

Summary

Data scaling landmark — pre-training dataset expanded from 7T to 18 trillion tokens, representing one of the largest known training runs for open-weight models
Architecture identical to Qwen 2 at the model level — same GQA, DCA+YARN, SwiGLU, RoPE, 151K vocab — the improvements were entirely from data quality and scale
Added three sizes absent from Qwen 2: 3B (for mobile), 14B, and 32B (for production), filling gaps the community had been asking about
Knowledge dramatically improved: MMLU jumped from 84.2 (Qwen 2) to 86.1 (Qwen 2.5) for the 72B base model — a significant gain at the top of the benchmark
Long text generation breakthrough: models could now generate up to 8K tokens per response (vs. 2K in Qwen 2), enabled by post-training on long-form data
Structured output support added — models reliably produce JSON, tables, and formatted data — a critical feature for production agentic applications
Post-training evolved to over 1 million SFT samples plus multi-stage RL — incorporating techniques from Qwen2.5-Math and Qwen2.5-Coder specialist models
Code performance surged thanks to Qwen2.5-Coder integration: LiveCodeBench jumped from 32.2 (Qwen 2) to 55.5 (Qwen 2.5) for the 72B instruct model
Math equally improved via Qwen2.5-Math technology: MATH benchmark went from 69.0 to 83.1 for the 72B instruct model
Qwen2.5-72B proved competitive with or superior to LLaMA-3.1-405B on many benchmarks despite being ~5x smaller
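
On the consuming side, structured output still benefits from validation before it reaches downstream code. The sketch below is a generic pattern, not a Qwen API: the reply string, the field names, and the `parse_structured_reply` helper are all hypothetical.

```python
import json

def parse_structured_reply(reply, required):
    """Return the parsed object if `reply` is a JSON object with the required keys and types.

    `required` maps field names to expected Python types. Returns None on any mismatch
    so the caller can retry or fall back.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data

schema = {"city": str, "temperature_c": float}     # hypothetical tool-call schema
good_reply = '{"city": "Hangzhou", "temperature_c": 21.5}'
bad_reply = 'Sure! Here is the JSON: {"city": "Hangzhou"}'

print(parse_structured_reply(good_reply, schema))  # {'city': 'Hangzhou', 'temperature_c': 21.5}
print(parse_structured_reply(bad_reply, schema))   # None
```

A model that reliably emits bare, schema-conformant JSON makes the fallback branch rare, which is the practical benefit behind the structured-output claim.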

Architecture Diagram — Qwen 2.5

Qwen 2.5 — Same Architecture, Massive Data & Post-Training Upgrades

🔄 Architecture Unchanged from Qwen 2

GQA DCA + YARN SwiGLU RoPE RMSNorm 128K context 151K vocab

✨ NEW IN 2.5

Data & Training Improvements

18T

Training Tokens

↑ from 7T (2.6×)

1M+

SFT Samples

↑ from 500K (2×)

8K

Max Generation

↑ from 2K (4×)

3 new sizes: 3B, 14B, 32B JSON/structured output Multi-stage RL Code + Math specialist fusion

Official Paper Figures

[Figure: Qwen2.5-72B Instruct Performance]

_Source: Qwen2.5 Blog, 72B-Instruct benchmark comparison_

[Figure: Qwen2.5 Model Card]

_Source: Qwen2.5 Blog, model specifications overview_

Community Perspective

The 18T token dataset was a headline number — more than Llama 3’s 15T and signaling massive investment in data curation
Qwen2.5-32B outperforming Qwen2-72B demonstrated that data quality matters more than model size at this scale
The structured output capabilities made Qwen 2.5 the go-to choice for many agentic/tool-use applications
The code and math improvements were directly attributable to specialist model techniques — showing the value of the Qwen ecosystem approach
Community noted that same-architecture improvements have diminishing returns — expectations built for an architecture refresh in Qwen 3

Model Variants

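| Model | Total Params | Non-Emb Params | Layers | Q Heads / KV Heads | Emb. Tying | Context | Gen. Length |
|:------|:------------:|:--------------:|:------:|:------------------:|:----------:|:-------:|:-----------:|
| Qwen2.5-0.5B | 0.49B | 0.36B | 24 | 14 / 2 | Yes | 32K | 8K |
| Qwen2.5-1.5B | 1.54B | 1.31B | 28 | 12 / 2 | Yes | 32K | 8K |
| Qwen2.5-3B | 3.09B | 2.77B | 36 | 16 / 2 | Yes | 32K | 8K |
| Qwen2.5-7B | 7.61B | 6.53B | 28 | 28 / 4 | No | 128K | 8K |
| Qwen2.5-14B | 14.7B | 13.1B | 48 | 40 / 8 | No | 128K | 8K |
| Qwen2.5-32B | 32.5B | 31.0B | 64 | 40 / 8 | No | 128K | 8K |
| Qwen2.5-72B | 72.7B | 70.0B | 80 | 64 / 8 | No | 128K | 8K |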

API-only models: Qwen2.5-Turbo (MoE) and Qwen2.5-Plus (MoE) were also released through Alibaba Cloud Model Studio.

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 2.5 Used It |
|:----------|:-------|:-------------------|
| Specialist Model Distillation | Multi-task learning research | Fused Qwen2.5-Coder and Qwen2.5-Math capabilities into the general model |
| Multi-stage RL | DeepSeek, OpenAI o1 (2024) | Multiple RL stages for different capability domains |
| Structured Output Training | GPT-4 function calling (2023) | Reliable JSON/structured data generation |
| Long-form Generation SFT | — | Dedicated training for 8K+ token outputs |
| System Prompt Robustness | — | Training on diverse system prompts for better role-play |

🟢 Qwen 3 — April 2025

📅 Released: April 29, 2025 | 📄 arXiv:2505.09388

Summary

Paradigm shift: introduced hybrid thinking modes — models can seamlessly switch between “Thinking” mode (step-by-step reasoning, like o1/QwQ) and “Non-Thinking” mode (fast direct responses) within a single model
Massive scale-up: flagship Qwen3-235B-A22B has 235B total parameters with 22B activated — the largest Qwen MoE to date, plus Qwen3-30B-A3B as an efficient smaller MoE
Released 8 models total: 6 dense (0.6B, 1.7B, 4B, 8B, 14B, 32B) + 2 MoE (30B-A3B, 235B-A22B) — all open-weighted under Apache 2.0
Training data doubled to 36 trillion tokens covering 119 languages and dialects, a dramatic jump from Qwen 2.5’s 29 languages
Used Qwen2.5-VL to extract text from PDF-like documents and Qwen2.5-Math/Coder to generate synthetic training data — the “models training models” paradigm
4-stage post-training pipeline: (1) Long CoT cold start, (2) Reasoning-based RL with rule-based rewards, (3) Thinking mode fusion — blending thinking and non-thinking data, (4) General RL across 20+ domains
Thinking budget mechanism allows users to control how much reasoning compute to allocate per query — enabling smooth latency vs. quality tradeoffs
Three-stage pre-training: S1 (30T+ tokens, 4K context) → S2 (5T tokens, knowledge-intensive STEM/code/reasoning) → S3 (high-quality long-context data, extend to 32K)
Dense models match performance of Qwen 2.5 models 2× their size: e.g., Qwen3-8B ≈ Qwen2.5-14B, Qwen3-4B ≈ Qwen2.5-7B
MoE models achieve similar performance to Qwen 2.5 dense models at only ~10% of active parameters — Qwen3-30B-A3B outperforms QwQ-32B with 10× fewer active params

Architecture Diagram — Qwen 3

Qwen 3 — Hybrid Thinking + Scaled MoE

✨ PARADIGM SHIFT

Hybrid Thinking/Non-Thinking Mode

🧠 Thinking Mode

Step-by-step reasoning in <think>...</think>
Complex math, coding, logic
User controls thinking budget

⚡ Non-Thinking Mode

Direct, fast responses
Simple queries, chat, translation
Toggle via /think or /no_think

Single unified model — no need to switch between chat and reasoning model variants
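
Client code typically needs to separate the reasoning trace from the final answer. Here is a minimal sketch that assumes the `<think>...</think>` delimiters shown above appear at most once per response; the sample text and the `split_reasoning` helper are hypothetical.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Split a response into (reasoning, answer); reasoning is empty in non-thinking mode."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

thinking_reply = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
direct_reply = "The answer is 4."

print(split_reasoning(thinking_reply))  # ('2 + 2 = 4', 'The answer is 4.')
print(split_reasoning(direct_reply))    # ('', 'The answer is 4.')
```

Keeping the two parts separate lets an application log or hide the reasoning while showing only the answer to end users.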

4-Stage Post-Training Pipeline

Stage 1
Long CoT
Cold Start

→

Stage 2
Reasoning
RL

→

Stage 3
Think Mode
Fusion

→

Stage 4
General RL
(20+ tasks)

✨ LARGEST MoE

Qwen3-235B-A22B MoE Architecture

235B total → 22B active 94 layers 128 routed experts top-8 routing no shared experts GQA 64Q / 4KV

Official Paper Figures

[Figure: Qwen3-235B-A22B Benchmarks]

_Source: Qwen3 Blog, Qwen3-235B-A22B benchmark comparison against DeepSeek-R1, o1, o3-mini, Grok-3, Gemini-2.5-Pro_

[Figure: Qwen3-30B-A3B Benchmarks]

_Source: Qwen3 Blog, Qwen3-30B-A3B outperforming QwQ-32B with 10× fewer active parameters_

[Figure: Thinking Budget Scaling]

_Source: Qwen3 Blog, thinking budget mechanism showing smooth performance scaling with compute_

[Figure: 4-Stage Post-Training]

_Source: Qwen3 Blog, 4-stage post-training pipeline overview_

Community Perspective

The hybrid thinking mode was seen as a direct answer to OpenAI’s o1/o3 and DeepSeek-R1 — but more elegant because it’s a single model rather than separate chat vs. reasoning models
Qwen3-30B-A3B outperforming QwQ-32B was a landmark result — demonstrating extreme MoE efficiency
119 language support (from 29) was a massive expansion — making Qwen 3 one of the most multilingual open-weight models available
The 4-stage post-training pipeline was praised as a well-engineered approach to combining reasoning and general capabilities
Open-source community quickly adopted the /think and /no_think toggles as an intuitive user interface for controlling reasoning depth
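
The toggle semantics can be illustrated with a small helper. In practice the model itself interprets these tags; the sketch below only mirrors the behavior on the client side, under the assumption that the most recent directive in the conversation wins. The `resolve_thinking_mode` name and the default of thinking-on are choices made for this example.

```python
def resolve_thinking_mode(user_messages, default=True):
    """Return True if thinking mode applies after reading all user turns in order.

    Assumes the latest /think or /no_think directive overrides earlier ones.
    """
    mode = default
    for message in user_messages:
        # Scan whitespace-separated tokens so a later tag in the same turn also wins.
        for token in message.split():
            if token == "/think":
                mode = True
            elif token == "/no_think":
                mode = False
    return mode

conversation = [
    "Prove that the square root of 2 is irrational. /think",
    "Thanks. Now translate 'hello' into French. /no_think",
]
print(resolve_thinking_mode(conversation))  # False
```

A helper like this is mainly useful for UI state, for example deciding whether to show a reasoning pane for the next reply.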

Model Variants — Dense

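| Model | Params | Layers | Q Heads / KV Heads | Emb. Tying | Context |
|:------|:------:|:------:|:------------------:|:----------:|:-------:|
| Qwen3-0.6B | 0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 4B | 36 | 32 / 8 | Yes | 32K |
| Qwen3-8B | 8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 32B | 64 | 64 / 8 | No | 128K |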

Model Variants — MoE

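| Model | Total Params | Active Params | Layers | Q/KV Heads | Experts (Total / Active) | Context |
|:------|:------------:|:-------------:|:------:|:----------:|:------------------------:|:-------:|
| Qwen3-30B-A3B | 30B | 3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 235B | 22B | 94 | 64 / 4 | 128 / 8 | 128K |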
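
The "about 10% of active parameters" figure quoted in the summary follows directly from the table. A short calculation, using the rounded parameter counts above (the exact published counts differ slightly):

```python
def active_fraction(total_billion, active_billion):
    """Share of parameters that participate in each token's forward pass."""
    return active_billion / total_billion

moe_models = {"Qwen3-30B-A3B": (30, 3), "Qwen3-235B-A22B": (235, 22)}
for name, (total, active) in moe_models.items():
    print(f"{name}: {active_fraction(total, active):.1%} active per token")
# Qwen3-30B-A3B: 10.0% active per token
# Qwen3-235B-A22B: 9.4% active per token
```

Per-token compute scales with the active count, while memory footprint still scales with the total, so these models trade extra memory for cheaper inference steps.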

Key Industry Ideas Incorporated

| Technique | Origin | How Qwen 3 Used It |
|:----------|:-------|:-------------------|
| Hybrid Thinking/Non-Thinking | OpenAI o1 (2024), DeepSeek-R1 (2025) | Unified single model with switchable reasoning modes |
| Thinking Budget Control | — | User-configurable compute allocation per query |
| Rule-based RL Rewards | DeepSeek-R1 (2025) | Used in Stage 2 of post-training for reasoning RL |
| Synthetic Data from Models | Phi-series (Microsoft), Qwen2.5 | Training data generated by Qwen2.5-VL, Math, Coder |
| Multi-stage Pre-training | Industry practice (2024-2025) | S1 (general) → S2 (knowledge) → S3 (long-context) |
| MCP Tool Protocol | Anthropic (2024) | Enhanced agentic capabilities with MCP support |

📚 References

Technical Papers

| Version | Title | Link | Date |
|:--------|:------|:-----|:-----|
| Qwen 1 | Qwen Technical Report | arXiv:2309.16609 | Sep 2023 |
| Qwen 2 | Qwen2 Technical Report | arXiv:2407.10671 | Jul 2024 |
| Qwen 2.5 | Qwen2.5 Technical Report | arXiv:2412.15115 | Dec 2024 |
| Qwen 3 | Qwen3 Technical Report | arXiv:2505.09388 | May 2025 |

Official Blog Posts

| Title | Link |
|:------|:-----|
| Introducing Qwen1.5 | qwenlm.github.io/blog/qwen1.5 |
| Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters | qwenlm.github.io/blog/qwen-moe |
| Qwen1.5-110B: The First 100B+ Model of the Qwen1.5 Series | qwenlm.github.io/blog/qwen1.5-110b |
| Qwen2.5: A Party of Foundation Models! | qwenlm.github.io/blog/qwen2.5 |
| Qwen2.5-LLM: Extending the Boundary of LLMs | qwenlm.github.io/blog/qwen2.5-llm |
| Qwen3: Think Deeper, Act Faster | qwenlm.github.io/blog/qwen3 |

GitHub & Model Repositories

| Resource | Link |
|:---------|:-----|
| Qwen GitHub (Main) | github.com/QwenLM/Qwen |
| Qwen1.5 GitHub | github.com/QwenLM/Qwen1.5 |
| Qwen2.5 GitHub | github.com/QwenLM/Qwen2.5 |
| Qwen3 GitHub | github.com/QwenLM/Qwen3 |
| Hugging Face Collection | huggingface.co/Qwen |
| ModelScope Collection | modelscope.cn/organization/qwen |

Cited Techniques

| Technique | Paper | Link |
|:----------|:------|:-----|
| SwiGLU Activation | Shazeer, “GLU Variants Improve Transformer” (2020) | arXiv:2002.05202 |
| RoPE | Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) | arXiv:2104.09864 |
| RMSNorm | Zhang & Sennrich, “Root Mean Square Layer Normalization” (NeurIPS 2019) | arXiv:1910.07467 |
| GQA | Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models” (EMNLP 2023) | arXiv:2305.13245 |
| YARN | Peng et al., “YaRN: Efficient Context Window Extension” (2023) | arXiv:2309.00071 |
| DCA | An et al., “Training-Free Long-Context Scaling” (2024) | arXiv:2402.17463 |
| DeepSeek-MoE | Dai et al., “DeepSeekMoE: Towards Ultimate Expert Specialization” (2024) | arXiv:2401.06066 |
| DPO | Rafailov et al., “Direct Preference Optimization” (NeurIPS 2023) | arXiv:2305.18290 |
| Upcycling | Komatsuzaki et al., “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints” (ICLR 2023) | arXiv:2212.05055 |
| DeepSeek-R1 | DeepSeek Team, “DeepSeek-R1” (2025) | arXiv:2501.12948 |

_Built with data from official Qwen technical papers and blog posts. All benchmark numbers sourced directly from the referenced publications._
