Qwen2 Concepts

Qwen2 (Chinese: 通义千问; pinyin: Tongyi Qianwen) is a series of large language and multimodal models developed by the Qwen Team at Alibaba Group. These models excel in natural language understanding, text generation, vision and audio comprehension, tool usage, role-playing, and acting as AI agents. They are pre-trained on extensive multilingual and multimodal data and further refined with high-quality data to align with human preferences.

There are both proprietary versions hosted exclusively on Alibaba Cloud and open-weight versions available. The open-weight models include:

  • Qwen: Language models available in 1.8B, 7B, 14B, and 72B sizes.
  • Qwen1.5: Models ranging from 0.5B to 110B parameters.
  • Qwen2: Language models in 0.5B, 1.5B, 7B, 57B-A14B (MoE), and 72B sizes.
  • Qwen-VL: Vision-language models based on the 7B language model.
  • Qwen-Audio: Audio-language models, including Qwen-Audio and Qwen2-Audio, both 7B-based.
  • CodeQwen: Coding-focused language models, with CodeQwen1.5 models at 7B parameters.

This document focuses on Qwen, the language models.

Causal Qwen Language Models

Causal language models, also known as autoregressive or decoder-only models, are designed to predict the next token in a sequence by using the preceding tokens as context. These models generate text one token at a time, relying solely on the previously generated tokens, without considering future ones. This “causal” approach ensures that the model only uses past information when predicting the next token.

These models are fundamental in natural language processing, particularly for tasks like text completion and generation. Their ability to produce coherent and contextually appropriate text has made them integral to modern natural language understanding and generation systems.

Takeaway: Qwen models are causal language models ideal for text completion tasks.
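
To make the autoregressive loop concrete, here is a minimal sketch of greedy next-token decoding, assuming the Hugging Face transformers library and the Qwen/Qwen2-0.5B checkpoint (illustrative choices, not requirements of the concept). In practice, model.generate performs this loop for you.

# A minimal sketch of causal (autoregressive) decoding: each new token is
# predicted from the tokens generated so far, then appended to the context.
# Assumes the Hugging Face `transformers` library and the "Qwen/Qwen2-0.5B" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(16):                                          # generate 16 tokens, one at a time
        logits = model(input_ids).logits                         # scores over the vocabulary at each position
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick from the last position only
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # the new token joins the context
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))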

Pre-training & Base Models

Base language models are foundational models trained on vast text corpora to predict the next word in a sequence. Their primary purpose is to capture the statistical patterns and structures of language, enabling them to generate coherent and contextually relevant text. These models are versatile and can be fine-tuned for various natural language processing tasks. While they excel at producing fluent text, they may require in-context learning or additional training to follow specific instructions or handle complex reasoning tasks effectively. In the Qwen model series, base models are those without the “-Instruct” suffix, such as Qwen2-7B and Qwen2-72B.

Takeaway: Use base models for in-context learning and downstream fine-tuning.
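
As an illustration of in-context learning with a base checkpoint, the sketch below teaches a small translation task purely through examples in the prompt. It assumes the transformers library and the Qwen/Qwen2-7B base model; the few-shot pattern itself is generic.

# Few-shot in-context learning: the prompt demonstrates the task, and the
# base model continues the pattern. Assumes `transformers` and "Qwen/Qwen2-7B".
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B", device_map="auto")

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "butterfly =>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8)
# Only the newly generated tokens are decoded; the base model is not tuned to
# follow chat-style instructions, so the few-shot format does the steering.
print(tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))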

Post-training & Instruction-tuned Models

Instruction-tuned language models are specialized models fine-tuned to understand and execute specific instructions in a conversational style. These models are trained to interpret user commands accurately and perform tasks such as summarization, translation, and question answering with enhanced accuracy and consistency. Unlike base models, which are trained on large text corpora, instruction-tuned models undergo additional training with datasets that include examples of instructions and their desired outcomes, often in multiple turns. This additional training makes them ideal for applications requiring targeted functionalities while maintaining the ability to generate fluent and coherent text. In the Qwen model series, instruction-tuned models have the “-Instruct” suffix, such as Qwen2-7B-Instruct and Qwen2-72B-Instruct.

Takeaway: Use instruction-tuned models for task-oriented conversations and downstream fine-tuning.
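
For comparison, a typical way to call an instruction-tuned checkpoint is through the tokenizer's chat template, sketched below with the transformers library and the Qwen/Qwen2-7B-Instruct checkpoint (assumed names; the same pattern applies to the other Instruct sizes).

# Task-oriented conversation with an instruction-tuned model.
# Assumes `transformers` and the "Qwen/Qwen2-7B-Instruct" checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize in one sentence: Qwen2 is a family of open-weight language models."},
]
# The chat template formats the messages for the model and appends the assistant prefix.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))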

Tokens & Tokenization

Tokens are the fundamental units processed and generated by language models. They can represent text in human languages (regular tokens) or specific functionality, such as keywords in programming languages (control tokens). A tokenizer splits text into regular tokens, which may be words, subwords, or characters, depending on the tokenization scheme. The tokenization process may also include control tokens as needed. The size of the vocabulary, or the total number of unique tokens a model recognizes, significantly impacts its performance and versatility. Larger models often use sophisticated tokenization methods to handle the vast diversity of human language while keeping vocabulary size manageable. Qwen uses a large vocabulary of 151,646 tokens.

Takeaway: Tokenization method and vocabulary size are crucial to model performance.
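
A quick way to see tokens in practice is to round-trip a string through the tokenizer, as in the sketch below. It assumes the transformers library; only the Qwen2 tokenizer files are needed, not the model weights.

# Turning text into token ids and back again.
# Assumes `transformers` and the Qwen2 tokenizer ("Qwen/Qwen2-7B-Instruct").
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

ids = tokenizer.encode("Hello, world!")
print(ids)                                    # the integer ids the model actually processes
print(tokenizer.convert_ids_to_tokens(ids))   # the corresponding subword pieces
print(tokenizer.decode(ids))                  # lossless round trip back to "Hello, world!"
print(len(tokenizer))                         # total vocabulary size, regular plus control tokens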

Byte-level Byte Pair Encoding (BPE)

Qwen employs a subword tokenization method called Byte Pair Encoding (BPE), which learns how to compose subword tokens so that text can be represented with as few tokens as possible. For instance, the word “tokenization” is split into “ token” and “ization” (with the space included as part of the first token). Qwen’s tokenization ensures that there are no unknown words, meaning all text can be transformed into token sequences. The BPE vocabulary of Qwen contains 151,643 tokens, a size that is efficient for diverse languages. Typically, 1 token represents 3-4 characters in English and 1.5-1.8 characters in Chinese.

Takeaway: Qwen processes text at the subword level, ensuring that no words are unknown.
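
The subword behaviour described above can be checked directly with the tokenizer. The sketch below assumes the transformers library and the Qwen2 tokenizer; the splits shown in the comments are illustrative.

# Byte-level BPE in action: words are decomposed into subword tokens, and raw
# bytes guarantee that no input is ever mapped to an "unknown" token.
# Assumes `transformers` and the Qwen2 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

print(tokenizer.tokenize(" tokenization"))   # e.g. ['Ġtoken', 'ization']; 'Ġ' marks the leading space
print(tokenizer.tokenize("深度学习"))          # Chinese text splits into byte-level subword units
print(tokenizer.tokenize("🤖"))              # unseen symbols fall back to byte pieces, never an unknown token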

Control Tokens & Chat Templates

Control tokens and chat templates are essential tools for guiding a model’s behavior and outputs.

  • Control Tokens: These are special tokens embedded within a sequence to convey meta information. For instance, when pre-training involves multiple documents in a single sequence, Qwen uses the control token “<|endoftext|>” to mark the end of one document and the start of another.
  • Chat Templates: These provide a structured format for conversations, using predefined placeholders or prompts to ensure that responses align with the desired dialogue flow or context. Different models may use different chat templates, and using the correct one is crucial for precise control over the LLM’s generation process.

Qwen utilizes the ChatML format, incorporating control tokens to structure each turn in conversations:

<|im_start|>{{role}}
{{content}}<|im_end|>

In this format, user inputs are labeled as “user,” and model outputs are labeled as “assistant.” Qwen also supports meta messages, which can instruct the model to perform specific actions or generate text with certain characteristics, such as adjusting tone, style, or content. These meta messages are given the role of “system,” with the default content being “You are a helpful assistant.”
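
To see exactly what the model receives, you can render the chat template to text instead of token ids. The sketch below assumes the transformers library and the Qwen2-7B-Instruct tokenizer, which ships with the ChatML template shown above.

# Rendering the ChatML chat template to plain text to inspect the control tokens.
# Assumes `transformers` and the Qwen2 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Expected shape of the output: one <|im_start|>...<|im_end|> block per turn,
# plus an opened assistant turn because add_generation_prompt=True:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant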

Control Tokens in Qwen

Qwen’s vocabulary includes three control tokens, bringing the total vocabulary size to 151,646 tokens. These control tokens are integral to the ChatML format used by Qwen for structuring conversational interactions.

Takeaway: Qwen employs ChatML with three control tokens to manage its chat template structure.
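
The three control tokens can be located in the vocabulary directly; the sketch below assumes the transformers library and the Qwen2 tokenizer.

# Looking up Qwen2's control tokens and confirming the vocabulary arithmetic:
# 151,643 regular BPE tokens + 3 control tokens = 151,646 in total.
# Assumes `transformers` and the Qwen2 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

for token in ("<|endoftext|>", "<|im_start|>", "<|im_end|>"):
    print(token, tokenizer.convert_tokens_to_ids(token))
print(len(tokenizer))   # total vocabulary size including the control tokens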

Length Limitations

Qwen models, as causal language models, have a single length limit that applies to the whole sequence, that is, the input and the output combined. However, the effective length can vary depending on the use case, because sequences are packed during training and each packed sequence may contain multiple individual text pieces.

For Qwen2 models, the training sequence length is capped at 32,768 tokens, which is also the maximum document length during pre-training. In post-training, the length of individual messages varies: assistant messages typically reach up to 2,048 tokens, and for specialized tasks such as converting tables to HTML they can extend to 6-8K tokens. In addition, some instruction-tuned variants, such as Qwen2-7B-Instruct and Qwen2-72B-Instruct, support extended context lengths of up to 128K tokens using long-context techniques such as YaRN.

Takeaway: Qwen2 models can handle sequences of up to 32K tokens, or up to 128K for long-context instruct variants, but the full length cannot be generated as output in one pass.
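
One practical consequence is that generation length must be budgeted against the prompt length. The sketch below assumes the transformers library and the Qwen/Qwen2-7B-Instruct checkpoint, and treats 32,768 tokens as the limit on input plus output.

# Budgeting tokens so that prompt + generated output stay within the sequence limit.
# Assumes `transformers` and the "Qwen/Qwen2-7B-Instruct" checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_SEQUENCE_LENGTH = 32768   # one limit for input and output combined

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", device_map="auto")

messages = [{"role": "user", "content": "Write a short poem about autumn."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_length = inputs.input_ids.shape[-1]

# Whatever the prompt does not consume is the budget left for generation.
budget = MAX_SEQUENCE_LENGTH - prompt_length
output = model.generate(**inputs, max_new_tokens=min(budget, 2048))
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))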