Following the release of Qwen2.5, the Alibaba development team responded to the community’s requests for models capable of handling longer contexts. Over the past few months, numerous optimizations have been made to enhance the model’s capabilities and inference performance for extremely long contexts. Today, the team proudly introduces Qwen2.5-Turbo, which brings the following advancements:
- Extended Context Length: The context length has been significantly increased from 128k to 1M tokens—approximately equivalent to 1 million English words or 1.5 million Chinese characters. This capacity can accommodate 10 full-length novels, 150 hours of speech transcripts, or 30,000 lines of code. The model achieves 100% accuracy in the 1M-token Passkey Retrieval task and scores 93.1 on the RULER long-text evaluation benchmark, outperforming GPT-4’s 91.6 and GLM4-9B-1M’s 89.9. Despite these improvements, the model retains exceptional short-sequence capabilities, comparable to GPT-4o-mini.
- Faster Inference Speed: By integrating sparse attention mechanisms, the team has reduced the time to generate the first token for a 1M-token context from 4.9 minutes to just 68 seconds, resulting in a 4.3x speedup.
- Lower Costs: The processing cost remains ¥0.3 per 1M tokens. At this price, Qwen2.5-Turbo processes 3.6 times as many tokens as GPT-4o-mini for the same cost, offering a more cost-efficient solution (see the back-of-the-envelope sketch below).
Qwen2.5-Turbo sets a new standard for handling long contexts efficiently and affordably while maintaining high performance across various use cases.
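For the curious, the 3.6x figure can be sanity-checked with a rough calculation. The GPT-4o-mini input price and the CNY/USD exchange rate below are assumptions for illustration, not figures from this announcement:

```python
# Back-of-the-envelope token-per-cost comparison (illustrative assumptions).
qwen_cny_per_1m = 0.3         # Qwen2.5-Turbo price from this announcement
cny_per_usd = 7.2             # assumed exchange rate at the time of writing
gpt4o_mini_usd_per_1m = 0.15  # assumed GPT-4o-mini input price per 1M tokens

qwen_usd_per_1m = qwen_cny_per_1m / cny_per_usd
print(f"~{gpt4o_mini_usd_per_1m / qwen_usd_per_1m:.1f}x tokens per unit cost")
# -> ~3.6x
```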
Now, you can use it through the API service of Alibaba Cloud Model Studio (Chinese), or try it out in the HuggingFace Demo or the ModelScope Demo.
How to Use the API
The Qwen2.5-Turbo model, now supporting 1M tokens, is fully compatible with both the standard Qwen API and the OpenAI API. Below is a simple Python example demonstrating how to use it. Make sure the environment variable YOUR_API_KEY is set to your API key. For additional details, visit the Quick Start of Alibaba Cloud Model Studio (Chinese).
```python
import os
from openai import OpenAI

# Load a long text file to use as the (potentially 1M-token) context.
with open("example.txt", "r", encoding="utf-8") as f:
    text = f.read()

user_input = text + "\n\nSummarize the above text."

# The Qwen API is OpenAI-compatible; point the client at the DashScope endpoint.
client = OpenAI(
    api_key=os.getenv("YOUR_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-turbo-latest",
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': user_input},
    ],
)

print(completion.choices[0].message)
```
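Because the first token for a 1M-token input can take around a minute to arrive, streaming the response is often more convenient. A minimal sketch, assuming the OpenAI-compatible endpoint supports the standard `stream=True` option (client and `user_input` as above):

```python
# Stream the completion so output appears as soon as the first token is ready.
stream = client.chat.completions.create(
    model="qwen-turbo-latest",
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': user_input},
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```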
Qwen2.5-Turbo Model Performance
Qwen2.5-Turbo’s capabilities were evaluated across various benchmarks, showcasing its remarkable advancements in handling both long and short context tasks.
Passkey Retrieval
In the 1M-token Passkey Retrieval task, Qwen2.5-Turbo demonstrated its ability to identify detailed information embedded in ultra-long contexts with 100% accuracy.
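To reproduce this style of test yourself, one common construction is to bury a random passkey inside long filler text and ask the model to recover it. Below is a minimal sketch; the filler sentence, prompt wording, and lengths are illustrative assumptions rather than the benchmark's exact setup:

```python
import random

def build_passkey_prompt(total_words: int = 700_000) -> tuple[str, str]:
    """Hide a random passkey in filler text and ask the model to find it."""
    passkey = str(random.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. The sun is bright. "
    needle = f" The passkey is {passkey}. Remember it. "
    # Repeat filler until we reach roughly the desired length, then insert
    # the needle at a random position.
    n_repeats = total_words // len(filler.split())
    chunks = [filler] * n_repeats
    chunks.insert(random.randint(0, len(chunks)), needle)
    prompt = "".join(chunks) + "\n\nWhat is the passkey mentioned above?"
    return prompt, passkey

prompt, expected = build_passkey_prompt()
# Send `prompt` via the API call shown earlier and check whether the
# model's answer contains `expected`.
```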
Long Text Understanding
Several datasets were used to evaluate the model’s long-context comprehension:
- RULER: Focused on tasks like finding “needles” in irrelevant text, answering multiple questions, or analyzing word frequency, this benchmark includes contexts up to 128K tokens. Qwen2.5-Turbo scored 93.1, outperforming both GPT-4o-mini and GPT-4.
- LV-Eval: Tests comprehension of numerous evidence fragments across contexts up to 256K tokens. Its metrics were adjusted to prevent false negatives and ensure a fair evaluation.
- LongBench-Chat: Evaluates human preference alignment for tasks requiring context lengths up to 100K tokens.
In these tests, Qwen2.5-Turbo consistently surpassed GPT-4o-mini on tasks exceeding 128K tokens, demonstrating its strength in long-context understanding.
Short Text Tasks
Unlike many extended-context models that sacrifice short-text performance, Qwen2.5-Turbo maintains strong results for shorter inputs. Benchmark results show that it outperforms other 1M-token models and matches the performance of GPT-4o-mini and Qwen2.5-14B-Instruct, while supporting 8x longer contexts.
Inference Speed
Qwen2.5-Turbo leverages sparse attention mechanisms to optimize inference speed (a schematic sketch of the idea follows this list):
- For inputs of 1M tokens, sparse attention compresses computation by approximately 12.5x, resulting in a 3.2x to 4.3x speedup across various hardware setups.
- The time to first token (TTFT) for 1M-token sequences was reduced from 4.9 minutes to just 68 seconds, significantly improving efficiency.
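The announcement does not spell out the exact sparsity pattern, so the following is only a schematic illustration of the general idea behind block-sparse attention: each query block attends to a small subset of key blocks rather than all of them, which is where a roughly 12.5x reduction in attention computation can come from. All shapes, the keep ratio, and the block-selection rule are assumptions for illustration, not Qwen2.5-Turbo's actual mechanism:

```python
import numpy as np

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.08):
    """Schematic block-sparse attention: each query block attends only to
    the highest-scoring ~8% of key blocks (plus its own local block),
    instead of all keys. A ~8% keep ratio corresponds to roughly a 12.5x
    reduction in attention compute."""
    n, d = q.shape
    nb = n // block
    out = np.zeros_like(v)
    # Cheap proxy scores between block means, used to pick which key blocks
    # each query block should attend to.
    qb = q[: nb * block].reshape(nb, block, d).mean(axis=1)  # (nb, d)
    kb = k[: nb * block].reshape(nb, block, d).mean(axis=1)  # (nb, d)
    block_scores = qb @ kb.T                                 # (nb, nb)
    n_keep = max(1, int(keep_ratio * nb))
    for i in range(nb):
        # Always keep the diagonal (local) block, plus the top-scoring ones.
        keep = set(np.argsort(block_scores[i])[-n_keep:]) | {i}
        qi = q[i * block : (i + 1) * block]                  # (block, d)
        ks = np.concatenate([k[j * block : (j + 1) * block] for j in keep])
        vs = np.concatenate([v[j * block : (j + 1) * block] for j in keep])
        scores = qi @ ks.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[i * block : (i + 1) * block] = weights @ vs
    return out

# Toy usage: 4096 tokens, 64-dim heads.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4096, 64)) for _ in range(3))
y = block_sparse_attention(q, k, v)
```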
Looking Ahead
The expansion of Qwen2.5-Turbo to support 1M-token contexts marks a significant milestone, but challenges remain. While the model excels in many benchmarks, there are areas for improvement:
- Long-sequence tasks: Performance can be less stable in real-world scenarios.
- Inference costs: Larger models require optimization to reduce computational expense.
The team is committed to addressing these challenges by refining human preference alignment for long sequences, enhancing inference efficiency, and exploring even larger, more capable long-context models. Stay tuned for updates on the next advancements in the Qwen long-context series!