Qwen: Model Architecture Across Generations
From a competitive Chinese LLM to a global open-weight powerhouse: tracing 5 generations of architecture evolution.
Table of Contents
- Executive Summary
- Version Release Timeline
- Cross-Version Benchmark Comparison
- Master Architecture Diagram
- Qwen 1 (September 2023)
- Qwen 1.5 (February 2024)
- Qwen 2 (July 2024)
- Qwen 2.5 (September 2024)
- Qwen 3 (April 2025)
- References
Executive Summary
This document covers five generations of the Qwen large language model family developed by Alibaba Cloud's Qwen Team:
- Qwen 1 - The foundation: SwiGLU, RoPE, RMSNorm, 151K BPE vocab, 3T tokens
- Qwen 1.5 - Scale-out: 6 dense sizes at launch (0.5B to 72B; 32B and 110B followed) + first MoE model, 32K context, HF transformers native
- Qwen 2 - Architecture leap: GQA, DCA+YARN for 128K context, MoE 57B-A14B, 7T tokens, 30 languages
- Qwen 2.5 - Data scaling: 18T tokens, new 3B/14B/32B sizes, structured output, 8K generation
- Qwen 3 - Reasoning era: Hybrid think/non-think modes, 36T tokens, 119 languages, 4-stage RL
Note: The Qwen family also includes specialized variants (Qwen-VL for vision-language, Qwen-Audio, Qwen-Coder, and Qwen-Math), which are documented separately. This document focuses on the core LLM architecture.
Version Release Timeline
Cross-Version Benchmark Comparison
All numbers are for the flagship base model of each generation (largest dense model). Sources: official technical papers.
| Benchmark | Qwen 1 (72B) | Qwen 1.5 (72B) | Qwen 2 (72B) | Qwen 2.5 (72B) | Qwen 3 (32B) |
|---|---|---|---|---|---|
| MMLU | 74.5 | 77.5 | 84.2 | 86.1 | ~83* |
| HumanEval | 37.2 | 41.5 | 64.6 | 59.1 | ~65* |
| MATH | 17.4 | 34.1 | 51.1 | 62.1 | ~68* |
| GSM8K | 78.9 | 79.5 | 89.5 | 91.5 | ~92* |
| BBH | 67.4 | 65.5 | 82.4 | 86.3 | ~85* |
| Context Length | 8K (32K ext.) | 32K | 128K | 128K | 128K |
| Languages | 2 (en/zh) | ~12 | ~30 | ~29 | 119 |
| Training Tokens | 3T | ~3T | 7T | 18T | 36T |
| Vocabulary | 151,646 | 151,646 | 151,646 | 151,646 | 151,646 |
*Qwen 3 base model benchmarks are for the largest dense model (Qwen3-32B-Base) from the Qwen 3 technical report pre-training performance curves. Exact numbers may vary.
Master Architecture Diagram
This diagram shows the core Transformer decoder architecture shared across all Qwen versions, with color-coded annotations indicating which generation introduced each component.
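As a rough illustration of the shared building blocks, the following minimal Python sketch shows RMSNorm, a SwiGLU feed-forward layer, and the pre-normalization residual wiring. The helper names (`rms_norm`, `swiglu_ffn`, `ffn_sublayer`), the `eps` value, and the use of plain Python lists are illustrative assumptions, not the actual Qwen implementation.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def ffn_sublayer(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization residual sub-layer: x + FFN(RMSNorm(x))."""
    normed = rms_norm(x, norm_weight)
    out = swiglu_ffn(normed, w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, out)]
```

The attention sub-layer follows the same pre-norm residual pattern, with the normalization applied before the sub-layer rather than after it.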
Qwen 1 (September 2023)
Summary
- First large-scale open-weight LLM from Alibaba Cloud, establishing the Qwen brand in the competitive Chinese LLM landscape alongside Yi, Baichuan, and ChatGLM
- Built on a standard decoder-only Transformer architecture with causal attention masks, following the paradigm set by GPT and LLaMA
- Adopted SwiGLU (Swish-Gated Linear Unit) activation in FFN layers, borrowed from PaLM/LLaMA, replacing the conventional ReLU/GELU and providing smoother gradients and better performance
- Used Rotary Positional Embeddings (RoPE) for encoding position information, enabling better length generalization than absolute positional embeddings
- Applied RMSNorm with pre-normalization for improved training stability, a technique popularized by LLaMA and now standard practice
- Added QKV bias in attention layers, an uncommon design choice at the time that has been reported to help RoPE-based length extrapolation
- Trained on 3 trillion tokens of multilingual data (primarily English and Chinese) using a byte-level BPE tokenizer with a 151,646-token vocabulary, one of the largest at the time and designed for strong multilingual compression
- Released in several sizes, including the 7B and 72B models covered here, using standard Multi-Head Attention (MHA) across all layers
- Included specialized variants: Qwen-Chat (aligned via SFT + RLHF), Code-Qwen, and Math-Qwen, demonstrating a full-stack approach from day one
- Supported 8K context natively, with NTK-aware interpolation (a community technique for dynamic RoPE length extrapolation) extending it to 32K
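To make the RoPE and NTK-aware points concrete, here is a minimal sketch assuming the commonly cited community formula, in which the RoPE base is multiplied by `scale ** (d / (d - 2))` for head dimension `d`. The function names, the interleaved pairing of dimensions, and the 8K to 32K scale factor of 4 are assumptions for illustration; the dynamic variant actually shipped with Qwen may differ in detail.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair rotation frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, head_dim, scale):
    """NTK-aware scaling: enlarge the base so the lowest frequency slows by `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))

def apply_rope(vec, position, inv_freq):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * frequency."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

# Extending an 8K-trained model to 32K corresponds to scale = 4.
original = rope_inv_freq(128)
extended = rope_inv_freq(128, ntk_scaled_base(10000.0, 128, 4.0))
```

With this choice the highest frequency is left untouched while the lowest one becomes four times slower, which is why the method is described as interpolating long-range positions without disturbing short-range ones.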
Architecture Diagram: Qwen 1
Community Perspective
- Received strong reception in the Chinese AI community as a capable alternative to LLaMA for Chinese-English tasks
- The 72B model demonstrated that Chinese AI labs could produce GPT-3.5-competitive models
- The large 151K vocabulary was praised for efficient multilingual tokenization; many competitors used smaller vocabs
- Tool-use and code interpreter capabilities in Qwen-Chat were ahead of most open-source alternatives at launch
- Some concern about training data transparency compared to fully open models like LLaMA
Model Variants
| Model | Parameters | Layers | Heads (Q/KV) | Context | Embedding Tying |
|---|---|---|---|---|---|
| Qwen-7B | 7.7B | 32 | 32 / 32 | 8K (32K) | No |
| Qwen-72B | 72B | 80 | 64 / 64 | 8K (32K) | No |
| Qwen-7B-Chat | 7.7B | 32 | 32 / 32 | 8K (32K) | No |
| Qwen-72B-Chat | 72B | 80 | 64 / 64 | 8K (32K) | No |
Key Industry Ideas Incorporated
| Technique | Origin | How Qwen 1 Used It |
|---|---|---|
| SwiGLU | Shazeer (2020); adopted by PaLM (Google, 2022) | FFN activation function replacing GELU |
| RoPE | Su et al. (RoFormer, 2021) | Positional encoding for all attention layers |
| RMSNorm | Zhang & Sennrich (2019); Jiang et al. (2023) | Replaced LayerNorm for faster, stabler training |
| BPE Tokenizer | Sennrich et al. (2015) | Byte-level BPE with 151K vocab for multilingual |
| NTK-aware Interpolation | Reddit/community (2023) | Dynamic RoPE scaling for context extension |
Qwen 1.5 (February 2024)
Summary
- Incremental refinement rather than architecture overhaul, focused on improving base model quality and massively expanding the developer experience
- Expanded to 8 dense model sizes: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B; the 110B was the first 100B+ model in the Qwen family
- Introduced the first Qwen MoE model: Qwen1.5-MoE-A2.7B with 14.3B total parameters and 2.7B activated, achieving 7B-class performance with about 1/3 of the activated parameters
- Architecture identical to Qwen 1 (MHA, SwiGLU, RoPE, RMSNorm, QKV bias); improvements came from better data, longer training, and alignment techniques
- Uniform 32K context across all model sizes, up from the 8K default of Qwen 1, achieved through RoPE frequency adjustments
- Native Hugging Face transformers integration: no more `trust_remote_code=True`, making deployment frictionless with `transformers>=4.37.0`
- Alignment enhanced with DPO (Direct Preference Optimization) and PPO (Proximal Policy Optimization), producing significantly better chat models
- Multilingual capabilities expanded to ~12 languages with structured evaluation on Arabic, Spanish, French, Japanese, Korean, Thai, Vietnamese, and more
- MoE architecture used 64 fine-grained experts: 4 shared + 60 routed (4 activated per token), inspired by DeepSeek-MoE's fine-grained expert design
- Upcycling initialization for MoE: started from Qwen-1.8B weights, transformed into MoE structure with randomized initialization for diversity; the Qwen1.5-MoE blog reports a 75% reduction in training cost relative to Qwen1.5-7B
Architecture Diagram: Qwen 1.5 MoE
[Diagram: MoE layer with shared experts (always activated) and routed experts (4 activated per token), each expert a SwiGLU FFN]
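The shared-plus-routed expert layout can be sketched as follows. This is a minimal sketch under stated assumptions: experts are plain Python callables, the router logits are passed in directly (in a real model they come from a learned linear layer applied to the token), and the gate weights are the softmax probabilities without renormalization over the selected experts. The name `moe_forward` is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, shared_experts, routed_experts, router_logits, top_k=4):
    """Sum of all shared experts plus the gate-weighted top-k routed experts."""
    probs = softmax(router_logits)
    # Indices of the top_k routed experts with the highest gate probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for expert in shared_experts:          # always activated
        out = [o + y for o, y in zip(out, expert(x))]
    for i in chosen:                       # activated per token
        y = routed_experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, chosen
```

For the configuration described above, `shared_experts` would hold 4 entries, `routed_experts` 60 entries, and `top_k` would be 4, so only 8 of the 64 expert FFNs run for any given token.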
Community Perspective
- Widely praised for the developer experience overhaul; HF-native support was a game-changer for adoption
- The MoE model (A2.7B) surprised many by matching Mistral-7B and Qwen1.5-7B while being 1/3 the activated size
- The 110B model was seen as a statement of scale ambition, though it didn't get as much adoption as the 72B
- Strong reception for the expanded size lineup: the 0.5B and 1.8B models enabled edge/mobile deployment
- Criticism: the architecture was largely unchanged from Qwen 1, so improvements felt incremental
Model Variants
| Model | Total Params | Active Params | Layers | Context | Notes |
|---|---|---|---|---|---|
| Qwen1.5-0.5B | 0.5B | 0.5B | 24 | 32K | Embedding tying |
| Qwen1.5-1.8B | 1.8B | 1.8B | 24 | 32K | Embedding tying |
| Qwen1.5-4B | 4B | 4B | 40 | 32K | - |
| Qwen1.5-7B | 7.7B | 7.7B | 32 | 32K | - |
| Qwen1.5-14B | 14B | 14B | 40 | 32K | - |
| Qwen1.5-32B | 32B | 32B | 64 | 32K | - |
| Qwen1.5-72B | 72B | 72B | 80 | 32K | - |
| Qwen1.5-110B | 110B | 110B | 80 | 32K | First 100B+ Qwen |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | 24 | 32K | 64 experts, 4 shared |
Key Industry Ideas Incorporated
| Technique | Origin | How Qwen 1.5 Used It |
|---|---|---|
| Fine-grained MoE Experts | DeepSeek-MoE (Jan 2024) | 64 fine-grained experts instead of 8 coarse experts |
| Shared + Routed Experts | DeepSeek-MoE, Rajbhandari et al. (2022) | 4 shared experts always active alongside routed ones |
| Upcycling | Komatsuzaki et al. (2023) | Initialize MoE from dense model weights |
| DPO | Rafailov et al. (2023) | Direct preference optimization for alignment |
| PPO | Schulman et al. (2017) | Proximal policy optimization for RLHF |
Qwen 2 (July 2024)
Summary
- Major architecture upgrade: the most significant changes since Qwen's inception, introducing multiple new attention and positional mechanisms
- Grouped Query Attention (GQA) replaced MHA across all models, dramatically reducing KV cache memory during inference while maintaining quality
- Dual Chunk Attention (DCA) + YARN enabled 128K context by segmenting long sequences into manageable chunks with rescaled attention weights
- Released in 5 model sizes: 0.5B, 1.5B, 7B, 72B (dense) + 57B-A14B (MoE); the MoE model had 57B total parameters with 14B active per token
- MoE architecture advanced significantly: fine-grained experts with smaller expert size, shared + routing experts (8 shared + 64 routed, 8 activated), and upcycled from Qwen2-7B
- Training data scaled to 7 trillion tokens (from 3T) with dramatically expanded code, math, and multilingual content, supporting ~30 languages
- Smaller models (0.5B, 1.5B) used embedding tying and were trained on 12T and 7T tokens respectively, which is more tokens per parameter than the larger models
- Post-training involved SFT with 500K+ examples followed by both offline DPO and online RLHF with a reward model, the most sophisticated alignment pipeline in the Qwen family at the time
- Online Merging Optimizer was used to mitigate alignment tax, reducing performance degradation from RLHF
- RoPE base frequency increased from 10,000 to 1,000,000 in the long-context training phase, enabling much longer effective sequence lengths
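The GQA idea from the list above can be illustrated with a minimal single-token decoding sketch: several query heads share one cached key/value head. The function name `gqa_attention`, the list-based tensor layout, and the omission of masking, RoPE, and projections are simplifying assumptions, not the Qwen 2 implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gqa_attention(queries, keys, values):
    """Attention for one new token.

    queries: [n_q_heads][head_dim]
    keys, values: [n_kv_heads][seq_len][head_dim] (the KV cache)
    """
    n_q, n_kv = len(queries), len(keys)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group_size = n_q // n_kv
    outputs = []
    for h, q in enumerate(queries):
        kv = h // group_size               # query head h reads KV head kv
        scale = 1.0 / math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys[kv]]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values[kv]))
               for d in range(len(q))]
        outputs.append(out)
    return outputs
```

Setting `n_kv_heads` equal to `n_q_heads` recovers standard MHA, and setting it to 1 gives multi-query attention; GQA sits between the two.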
Architecture Diagram: Qwen 2
| Model | Q Heads | KV Heads | GQA Ratio |
|---|---|---|---|
| 0.5B | 14 | 2 | 7:1 |
| 1.5B | 12 | 2 | 6:1 |
| 7B | 28 | 4 | 7:1 |
| 72B | 64 | 8 | 8:1 |
| 57B-A14B (MoE) | 28 | 4 | 7:1 |
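The memory effect of these ratios can be estimated with simple arithmetic. The sketch below assumes 16-bit cache entries and a head dimension equal to hidden size divided by the number of query heads (8,192 / 64 = 128 for the 72B model, using the 80 layers listed for it in this document); the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (factor 2) for every layer and KV head, for one token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Qwen2-72B shape: 80 layers, 64 query heads, 8 KV heads, head_dim 128 (assumed).
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)

tokens = 128 * 1024                      # a full 128K context
gqa_gib = gqa * tokens / 1024**3         # 40 GiB
mha_gib = mha * tokens / 1024**3         # 320 GiB
```

Under these assumptions the 8:1 ratio translates directly into an 8x smaller cache, roughly 40 GiB instead of 320 GiB for a single 128K-token sequence.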
Official Paper Figure
Needle in a Haystack test results for Qwen2 instruction-tuned models showing capability across 128K context:
Source: Qwen2 Technical Report (arXiv:2407.10671), Figure 1
Community Perspective
- GQA adoption was welcomed as overdue; competitors like LLaMA 2 (70B) had already adopted it for KV cache efficiency
- The 128K context via DCA+YARN was a major selling point, though real-world performance degraded at extreme lengths
- 57B-A14B MoE model showcased that Qwen's MoE expertise had matured: fine-grained experts were more efficient than Mixtral's coarse approach
- Qwen2-72B's competitiveness with LLaMA-3-70B established Qwen as a top-tier global open-weight model, not just a Chinese alternative
- The 7T token dataset with 30 language support marked Qwen's transition from a bilingual to a truly multilingual model family
Model Variants
| Model | Total Params | Hidden | Layers | Q Heads / KV Heads | Context | Tokens |
|---|---|---|---|---|---|---|
| Qwen2-0.5B | 0.5B | 896 | 24 | 14 / 2 | 32K | 12T |
| Qwen2-1.5B | 1.5B | 1,536 | 28 | 12 / 2 | 32K | 7T |
| Qwen2-7B | 7B | 3,584 | 28 | 28 / 4 | 128K | 7T |
| Qwen2-72B | 72B | 8,192 | 80 | 64 / 8 | 128K | 7T |
| Qwen2-57B-A14B | 57B (14B active) | 3,584 | 28 | 28 / 4 | 64K | 4.5T |
Key Industry Ideas Incorporated
| Technique | Origin | How Qwen 2 Used It |
|---|---|---|
| GQA | Ainslie et al. (2023) | Replaced MHA for all Qwen 2 models |
| Dual Chunk Attention | An et al. (2024) | Long sequence handling for 128K |
| YARN | Peng et al. (2023) | Attention weight rescaling for length extrapolation |
| Fine-grained MoE | Dai et al. (DeepSeek, 2024) | Smaller experts with more activated simultaneously |
| Online Merging Optimizer | Lu et al. (2024) | Mitigating alignment tax during RLHF |
| DPO | Rafailov et al. (2023) | Offline preference optimization stage |
Qwen 2.5 (September 2024)
Summary
- Data scaling landmark: pre-training dataset expanded from 7T to 18 trillion tokens, representing one of the largest known training runs for open-weight models
- Architecture identical to Qwen 2 at the model level (same GQA, DCA+YARN, SwiGLU, RoPE, 151K vocab); the improvements came from data quality and scale
- Introduced three new model sizes: 3B (for mobile), 14B and 32B (for production), filling gaps that the community had been requesting
- Knowledge improved: MMLU rose from 84.2 (Qwen 2) to 86.1 (Qwen 2.5) for the 72B base model, a notable gain at the top of the benchmark
- Long text generation breakthrough: models could now generate up to 8K tokens per response (vs. ~2K in Qwen 2), enabled by post-training on long-form data
- Structured output support added: models reliably produce JSON, tables, and formatted data, a critical feature for production agentic applications
- Post-training evolved to over 1 million SFT samples plus multi-stage RL, incorporating techniques from Qwen2.5-Math and Qwen2.5-Coder specialist models
- Code performance surged thanks to Qwen2.5-Coder integration: LiveCodeBench jumped from 32.2 (Qwen 2) to 55.5 (Qwen 2.5) for the 72B instruct model
- Math equally improved via Qwen2.5-Math technology: MATH benchmark went from 69.0 to 83.1 for the 72B instruct model
- Qwen2.5-72B proved competitive with or superior to LLaMA-3.1-405B on many benchmarks despite being ~5x smaller
Architecture Diagram: Qwen 2.5
Official Paper Figures
Source: Qwen2.5 Blog, 72B-Instruct benchmark comparison
Source: Qwen2.5 Blog, model specifications overview
Community Perspective
- The 18T token dataset was a headline number: more than Llama 3's 15T and a signal of massive investment in data curation
- Qwen2.5-32B outperforming Qwen2-72B demonstrated that data quality matters more than model size at this scale
- The structured output capabilities made Qwen 2.5 the go-to choice for many agentic/tool-use applications
- The code and math improvements were directly attributable to specialist model techniques, showing the value of the Qwen ecosystem approach
- Community noted that same-architecture improvements have diminishing returns; expectations built for an architecture refresh in Qwen 3
Model Variants
| Model | Total Params | Non-Emb Params | Layers | Q Heads / KV Heads | Emb. Tying | Context | Gen. Length |
|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | 0.49B | 0.36B | 24 | 14 / 2 | Yes | 32K | 8K |
| Qwen2.5-1.5B | 1.54B | 1.31B | 28 | 12 / 2 | Yes | 32K | 8K |
| Qwen2.5-3B | 3.09B | 2.77B | 36 | 16 / 2 | Yes | 32K | 8K |
| Qwen2.5-7B | 7.61B | 6.53B | 28 | 28 / 4 | No | 128K | 8K |
| Qwen2.5-14B | 14.7B | 13.1B | 48 | 40 / 8 | No | 128K | 8K |
| Qwen2.5-32B | 32.5B | 31.0B | 64 | 40 / 8 | No | 128K | 8K |
| Qwen2.5-72B | 72.7B | 70.0B | 80 | 64 / 8 | No | 128K | 8K |
API-only models: Qwen2.5-Turbo (MoE) and Qwen2.5-Plus (MoE) were also released through Alibaba Cloud Model Studio.
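The gap between the Total and Non-Emb columns can be sanity-checked with a little arithmetic. This sketch assumes the 151,646-entry vocabulary and the hidden sizes listed for the corresponding Qwen 2 models in this document (896 for 0.5B, 3,584 for 7B); real checkpoints may pad the embedding matrix, so the figures are approximate. The function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, tied):
    """Input embedding plus output projection; tying shares one matrix for both."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size

VOCAB = 151_646  # vocabulary size quoted in this document

small = embedding_params(VOCAB, 896, tied=True)     # about 0.136B
medium = embedding_params(VOCAB, 3584, tied=False)  # about 1.087B

# Compare with the table: 0.49B - 0.36B = 0.13B and 7.61B - 6.53B = 1.08B.
```

The comparison also shows why tying matters most for small models: at 0.5B, the embedding matrix alone accounts for over a quarter of all parameters.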
Key Industry Ideas Incorporated
| Technique | Origin | How Qwen 2.5 Used It |
|---|---|---|
| Specialist Model Distillation | Multi-task learning research | Fused Qwen2.5-Coder and Qwen2.5-Math capabilities into the general model |
| Multi-stage RL | DeepSeek, OpenAI o1 (2024) | Multiple RL stages for different capability domains |
| Structured Output Training | GPT-4 function calling (2023) | Reliable JSON/structured data generation |
| Long-form Generation SFT | - | Dedicated training for 8K+ token outputs |
| System Prompt Robustness | - | Training on diverse system prompts for better role-play |
Qwen 3 (April 2025)
Summary
- Paradigm shift: introduced hybrid thinking modes, so models can seamlessly switch between "Thinking" mode (step-by-step reasoning, like o1/QwQ) and "Non-Thinking" mode (fast direct responses) within a single model
- Massive scale-up: flagship Qwen3-235B-A22B has 235B total parameters with 22B activated, the largest Qwen MoE to date, plus Qwen3-30B-A3B as an efficient smaller MoE
- Released 8 models total: 6 dense (0.6B, 1.7B, 4B, 8B, 14B, 32B) + 2 MoE (30B-A3B, 235B-A22B), all open-weight under Apache 2.0
- Training data doubled to 36 trillion tokens covering 119 languages and dialects, a dramatic jump from Qwen 2.5's 29 languages
- Used Qwen2.5-VL to extract text from PDF-like documents and Qwen2.5-Math/Coder to generate synthetic training data (the "models training models" paradigm)
- 4-stage post-training pipeline: (1) Long CoT cold start, (2) Reasoning-based RL with rule-based rewards, (3) Thinking mode fusion, blending thinking and non-thinking data, (4) General RL across 20+ domains
- Thinking budget mechanism allows users to control how much reasoning compute to allocate per query, enabling smooth latency vs. quality tradeoffs
- Three-stage pre-training: S1 (30T+ tokens, 4K context), then S2 (5T tokens, knowledge-intensive STEM/code/reasoning), then S3 (high-quality long-context data, extending to 32K)
- Dense models match the performance of Qwen 2.5 models about 2x their size: e.g., Qwen3-8B ≈ Qwen2.5-14B, Qwen3-4B ≈ Qwen2.5-7B
- MoE models achieve similar performance to Qwen 2.5 dense models at only ~10% of active parameters; Qwen3-30B-A3B outperforms QwQ-32B with 10x fewer active params
Architecture Diagram: Qwen 3
Thinking mode:
- Step-by-step reasoning in `<think>...</think>`
- Complex math, coding, logic
- User controls thinking budget
Non-thinking mode:
- Direct, fast responses
- Simple queries, chat, translation
- Toggle via `/think` or `/no_think`
Post-training pipeline: Long CoT Cold Start, then Reasoning RL, then Think Mode Fusion, then General RL (20+ tasks)
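On the client side, handling the two modes mostly comes down to appending a soft switch to the prompt and separating the reasoning block from the final answer. The sketch below is a minimal illustration using only the standard library; the helper names `with_mode` and `split_thinking` are hypothetical, and it assumes at most one `<think>...</think>` block per response.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def with_mode(prompt, thinking):
    """Append the soft switch that requests thinking or non-thinking mode."""
    return f"{prompt} {'/think' if thinking else '/no_think'}"

def split_thinking(text):
    """Return (reasoning, answer); reasoning is empty if no think block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2 + 2 = 4</think>\nThe answer is 4.")
```

Keeping the reasoning separate lets an application log or hide it while showing only the answer to the user.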
Official Paper Figures
Source: Qwen3 Blog, Qwen3-235B-A22B benchmark comparison against DeepSeek-R1, o1, o3-mini, Grok-3, Gemini-2.5-Pro
Source: Qwen3 Blog, Qwen3-30B-A3B outperforming QwQ-32B with 10x fewer active parameters
Source: Qwen3 Blog, thinking budget mechanism showing smooth performance scaling with compute
Source: Qwen3 Blog, 4-stage post-training pipeline overview
Community Perspective
- The hybrid thinking mode was seen as a direct answer to OpenAI's o1/o3 and DeepSeek-R1, but more elegant because it's a single model rather than separate chat and reasoning models
- Qwen3-30B-A3B outperforming QwQ-32B was a landmark result, demonstrating extreme MoE efficiency
- 119-language support (up from 29) was a massive expansion, making Qwen 3 one of the most multilingual open-weight models available
- The 4-stage post-training pipeline was praised as a well-engineered approach to combining reasoning and general capabilities
- Open-source community quickly adopted the `/think` and `/no_think` toggles as an intuitive user interface for controlling reasoning depth
Model Variants: Dense
| Model | Params | Layers | Q Heads / KV Heads | Emb. Tying | Context |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 4B | 36 | 32 / 8 | Yes | 32K |
| Qwen3-8B | 8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 32B | 64 | 64 / 8 | No | 128K |
Model Variants: MoE
| Model | Total Params | Active Params | Layers | Q/KV Heads | Experts (Total / Activated) | Context |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 235B | 22B | 94 | 64 / 4 | 128 / 8 | 128K |
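As a quick check on the "about 10% of active parameters" claim, the active fraction of each MoE model follows directly from the Total Params and Active Params columns above. The function name is hypothetical, and the inputs are the rounded figures from the table.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token, from sizes given in billions."""
    return active_params_b / total_params_b

flagship = active_fraction(235, 22)   # roughly 0.094
compact = active_fraction(30, 3)      # roughly 0.10
```

Per-token compute scales with the active count, while memory footprint still scales with the total, which is the usual trade-off for sparse models.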
Key Industry Ideas Incorporated
| Technique | Origin | How Qwen 3 Used It |
|---|---|---|
| Hybrid Thinking/Non-Thinking | OpenAI o1 (2024), DeepSeek-R1 (2025) | Unified single model with switchable reasoning modes |
| Thinking Budget Control | - | User-configurable compute allocation per query |
| Rule-based RL Rewards | DeepSeek-R1 (2025) | Used in Stage 2 of post-training for reasoning RL |
| Synthetic Data from Models | Phi-series (Microsoft), Qwen2.5 | Training data generated by Qwen2.5-VL, Math, Coder |
| Multi-stage Pre-training | Industry practice (2024-2025) | S1 (general) -> S2 (knowledge) -> S3 (long-context) |
| MCP Tool Protocol | Anthropic (2024) | Enhanced agentic capabilities with MCP support |
References
Technical Papers
| Version | Title | Link | Date |
|---|---|---|---|
| Qwen 1 | Qwen Technical Report | arXiv:2309.16609 | Sep 2023 |
| Qwen 2 | Qwen2 Technical Report | arXiv:2407.10671 | Jul 2024 |
| Qwen 2.5 | Qwen2.5 Technical Report | arXiv:2412.15115 | Dec 2024 |
| Qwen 3 | Qwen3 Technical Report | arXiv:2505.09388 | May 2025 |
Official Blog Posts
| Title | Link |
|---|---|
| Introducing Qwen1.5 | qwenlm.github.io/blog/qwen1.5 |
| Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters | qwenlm.github.io/blog/qwen-moe |
| Qwen1.5-110B: The First 100B+ Model of the Qwen1.5 Series | qwenlm.github.io/blog/qwen1.5-110b |
| Qwen2.5: A Party of Foundation Models! | qwenlm.github.io/blog/qwen2.5 |
| Qwen2.5-LLM: Extending the Boundary of LLMs | qwenlm.github.io/blog/qwen2.5-llm |
| Qwen3: Think Deeper, Act Faster | qwenlm.github.io/blog/qwen3 |
GitHub & Model Repositories
| Resource | Link |
|---|---|
| Qwen GitHub (Main) | github.com/QwenLM/Qwen |
| Qwen1.5 GitHub | github.com/QwenLM/Qwen1.5 |
| Qwen2.5 GitHub | github.com/QwenLM/Qwen2.5 |
| Qwen3 GitHub | github.com/QwenLM/Qwen3 |
| Hugging Face Collection | huggingface.co/Qwen |
| ModelScope Collection | modelscope.cn/organization/qwen |
Cited Techniques
| Technique | Paper | Link |
|---|---|---|
| SwiGLU Activation | Shazeer, "GLU Variants Improve Transformer" (2020), building on the GLU of Dauphin et al., "Language Modeling with Gated Convolutional Networks" (ICML 2017) | arXiv:2002.05202 |
| RoPE | Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021) | arXiv:2104.09864 |
| RMSNorm | Jiang et al., "Pre-RMSNorm and Pre-CRMSNorm Transformers" (2023) | arXiv:2305.14858 |
| GQA | Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models" (EMNLP 2023) | arXiv:2305.13245 |
| YARN | Peng et al., "YaRN: Efficient Context Window Extension" (2023) | arXiv:2309.00071 |
| DCA | An et al., "Training-Free Long-Context Scaling" (2024) | arXiv:2402.17463 |
| DeepSeek-MoE | Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization" (2024) | arXiv:2401.06066 |
| DPO | Rafailov et al., "Direct Preference Optimization" (NeurIPS 2023) | arXiv:2305.18290 |
| Upcycling | Komatsuzaki et al., "Sparse Upcycling: Training MoE from Dense Checkpoints" (ICLR 2023) | - |
| DeepSeek-R1 | DeepSeek Team, "DeepSeek-R1" (2025) | arXiv:2501.12948 |
Built with data from official Qwen technical papers and blog posts. All benchmark numbers sourced directly from the referenced publications.