Qwen (Chinese: 通义千问; pinyin: Tongyi Qianwen) represents the advanced large language and multimodal model series developed by the Qwen Team at Alibaba Group. These models excel in a variety of tasks, including natural language understanding, text generation, visual and audio comprehension, tool utilization, role-playing, and acting as AI agents. The Qwen models are pre-trained on extensive multilingual and multimodal datasets, followed by post-training on high-quality data to align them with human preferences.
Qwen is available in both a proprietary version, hosted exclusively on Alibaba Cloud, and an open-weight version accessible to the public.
The open-weight models span the following series:
- Qwen: the language models
- Qwen-VL: the vision-language models
- Qwen-Audio: the audio-language models
  - Qwen-Audio: 7B-based model
  - Qwen2-Audio: 7B-based models
- CodeQwen/Qwen-Coder: the language models for coding
  - CodeQwen1.5: 7B models
  - Qwen2.5-Coder: 7B models
- Qwen-Math: the language models for mathematics
  - Qwen2-Math: 1.5B, 7B, and 72B models
  - Qwen2.5-Math: 1.5B, 7B, and 72B models
Causal Language Models
Causal language models, also called autoregressive or decoder-only models, predict the next token in a sequence using only the preceding tokens. This means they generate text token by token, relying solely on the context of previously generated tokens, without considering any future information. The term “causal” refers to this unidirectional focus, making the models suitable for tasks like text generation and completion, where generating coherent and contextually accurate responses is crucial.
Key Takeaway: Qwen models are causal language models ideal for generating coherent text and completing sequences.
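To make the token-by-token behavior concrete, here is a minimal sketch of autoregressive generation, assuming the Hugging Face transformers library; the small Qwen/Qwen2.5-0.5B checkpoint is used purely for illustration.

```python
# Minimal sketch of autoregressive (causal) generation with a Qwen model,
# assuming the Hugging Face transformers library and a small public checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # chosen only to keep the example lightweight
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is", return_tensors="pt")
# Each new token is predicted only from the tokens that precede it; no future context is used.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```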
Pre-training & Base Models
Base models serve as foundational language models trained on large text corpora to learn language patterns and structures. These models generate fluent and contextually appropriate text, making them versatile for various language processing tasks when fine-tuned. However, they may require in-context examples or additional training to follow specific instructions or to perform complex reasoning well. For Qwen, base models lack the “-Instruct” suffix, such as Qwen2.5-7B and Qwen2.5-72B.
Key Takeaway: Base models are best suited for in-context learning and further fine-tuning.
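As an illustration of in-context learning, the sketch below prompts a base checkpoint with a few-shot, continuation-style prompt rather than chat turns; it assumes the Hugging Face transformers library, and the prompt itself is only an example.

```python
# Sketch of few-shot, in-context prompting with a *base* (non-Instruct) Qwen model,
# assuming the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # base model: no "-Instruct" suffix
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Base models are steered by continuation of a pattern, not by chat messages.
prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese -> "
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```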
Post-training & Instruction-Tuned Models
Instruction-tuned models are specialized to understand and follow user instructions in a conversational format. Through additional training using datasets with specific instructions and outcomes, these models are particularly effective at tasks like summarization, translation, and question answering. They maintain the fluency of base models while improving accuracy in executing commands. Qwen models with the “-Instruct” suffix, such as Qwen2.5-7B-Instruct and Qwen2.5-72B-Instruct, belong to this category.
Key Takeaway: Instruction-tuned models are ideal for interactive tasks and precise command execution.
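A minimal single-turn chat with an instruction-tuned checkpoint might look like the sketch below, assuming the Hugging Face transformers library and its apply_chat_template API.

```python
# Sketch of a single-turn chat with an instruction-tuned Qwen model,
# assuming the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize: The quick brown fox jumps over the lazy dog."},
]
# The chat template converts the message list into the model's expected conversational format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```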
Tokens & Tokenization
Tokens are the core units processed by models, representing either human language text (regular tokens) or specialized functions (control tokens). Tokenizers break down text into tokens—words, subwords, or characters—to create input sequences for the model. A model’s vocabulary size influences its adaptability and performance, with larger vocabularies supporting greater linguistic diversity. Qwen models utilize a vocabulary of 151,646 tokens, balancing size and linguistic range effectively.
Key Takeaway: Tokenization methods and vocabulary size impact model performance.
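The following sketch inspects the tokenizer to show how text becomes token ids and how large the vocabulary is; the Hugging Face transformers library is assumed.

```python
# Sketch: inspecting the Qwen tokenizer, assuming the Hugging Face transformers library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

print(len(tokenizer))  # total vocabulary size, regular plus control tokens
tokens = tokenizer.tokenize("Qwen models tokenize text into subword units.")
print(tokens)                                   # the subword pieces
print(tokenizer.convert_tokens_to_ids(tokens))  # the integer ids fed to the model
```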
Byte-Level Byte Pair Encoding
Qwen uses byte-level Byte Pair Encoding (BPE) for subword tokenization, which splits words into smaller pieces to keep token counts low. Because the encoding operates on bytes, there are no unknown words: any text input can be converted into a token sequence. Of Qwen’s vocabulary, 151,643 tokens are regular tokens learned through BPE (the remainder are control tokens), optimizing for diverse language support. For instance, one token typically represents 3-4 English characters or 1.5-1.8 Chinese characters.
Key Takeaway: Qwen efficiently processes text into subword tokens with no unknown words.
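To illustrate that byte-level BPE leaves no text unrepresentable, the sketch below tokenizes inputs in several scripts; it assumes the Hugging Face transformers library, and the exact token counts will vary.

```python
# Sketch: byte-level BPE can tokenize any input, so no <unk> token is needed.
# Assumes the Hugging Face transformers library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

for text in ["hello world", "你好，世界", "naïve café 🚀"]:
    ids = tokenizer.encode(text)
    # Every byte maps to some token, so rare characters and emoji are still covered.
    print(text, "->", len(ids), "tokens:", ids)

print(tokenizer.unk_token)  # expected to be None for a byte-level BPE vocabulary
```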
Control Tokens & Chat Template
Control tokens and chat templates are tools for guiding model responses. Control tokens, like “<|endoftext|>,” mark meta information, such as document boundaries during pre-training. Chat templates format conversation turns using placeholders, ensuring responses align with desired dialogue structures. Qwen models use the ChatML format, which includes specific tokens like “<|im_start|>” and “<|im_end|>” to manage user-assistant interactions.
Key Takeaway: Qwen uses ChatML and control tokens for structured conversation flow.
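The sketch below renders the chat template as plain text so the ChatML control tokens are visible; the Hugging Face transformers library is assumed.

```python
# Sketch: rendering the ChatML chat template as text to reveal the control tokens,
# assuming the Hugging Face transformers library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# Expected shape of the rendered prompt (ChatML):
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant
```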
Length Limit
Because Qwen models are causal language models, the practical length limit depends on how they were trained and how they are used. For Qwen2.5, sequences are packed to a maximum length of 32,768 tokens during pre-training, while assistant responses can extend up to 8,192 tokens during interactions.
Key Takeaway: Qwen2.5 models can handle sequences up to 32K tokens, with 8K tokens for assistant outputs.
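As a quick check of these limits, the sketch below reads the configured context window from the model config, assuming the Hugging Face transformers library; the 8K output budget is then enforced separately at generation time via max_new_tokens.

```python
# Sketch: inspecting the configured context window, assuming the Hugging Face
# transformers library; the reply length is capped separately via max_new_tokens.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(config.max_position_embeddings)  # context window the checkpoint is configured for

# At generation time, the assistant reply is limited separately, e.g.:
#   model.generate(input_ids, max_new_tokens=8192)
# so that prompt tokens plus new tokens stay within the context window.
```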