Introduction to Qwen2

Alibaba Group recently released Qwen2, a new family of language models that surpasses LLaMA 3 in several respects. This article delves into the specifics of Qwen2, comparing performance across its versions and exploring its unique capabilities. We will test both the smallest and largest variants of Qwen2 to assess their speed and output quality, and discuss the model’s performance on tasks ranging from basic coding to complex reasoning.

Variants of Qwen2: A Closer Look

Qwen2 comes in several versions, ranging from a 0.5 billion parameter model to a 72 billion parameter model. These models are pre-trained and instruction-tuned, catering to different use cases.

Notably, the 7B and 72B instruct versions support an extended context length of up to 128k tokens. This extended context length significantly enhances the model’s ability to process and generate complex outputs.

Our focus today will be on testing the smallest (0.5B) and largest (72B) variants to evaluate their performance across various tasks.
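The instruct variants respond to prompts in the ChatML format (turns delimited by `<|im_start|>` and `<|im_end|>` markers). As a minimal sketch of how such a prompt is assembled by hand — the helper name and the sample messages are illustrative, and in practice the tokenizer's chat template does this for you:

```python
def build_chatml_prompt(messages):
    """Assemble a ChatML-style prompt from (role, content) pairs.

    Illustrative only: with Hugging Face tokenizers you would normally
    call tokenizer.apply_chat_template() instead of building this by hand.
    """
    parts = []
    for role, content in messages:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "Write a function that reverses a string."),
])
print(prompt)
```

The same prompt string works for any Qwen2 size, which makes it easy to send identical tasks to the 0.5B and 72B variants and compare their answers.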

Performance Comparison: Qwen2 vs. LLaMA 3 and Others

In head-to-head evaluations, Qwen2 consistently outperforms its competitors, including LLaMA 3 and Mixtral. The 72B version of Qwen2, in particular, excels at code and math tasks, demonstrating superior performance in both its base and instruction-tuned versions.

Qwen2 vs. LLaMA 3 and Others Benchmark

While the model shows some limitations in specific tasks, such as the MBPP and GSM8K benchmarks, it generally provides better results across the board. This makes Qwen2 a formidable contender in the AI landscape.

We also present the performance of the BF16, Int8, and Int4 quantized models on these benchmarks, demonstrating that the quantized models maintain their performance without significant degradation.

The results are shown below:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
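To put "without significant degradation" in concrete terms, we can compute each quantized model's score drop relative to its BF16 baseline directly from the table above — a quick sketch:

```python
# Scores taken from the table above: (MMLU, C-Eval val, GSM8K, HumanEval).
scores = {
    "Qwen-7B-Chat (BF16)": (55.8, 59.7, 50.3, 37.2),
    "Qwen-7B-Chat (Int4)": (55.1, 59.2, 49.7, 29.9),
}

def relative_drop(baseline, quantized):
    """Per-benchmark score drop of the quantized model, in percent of baseline."""
    return [round(100 * (b - q) / b, 1) for b, q in zip(baseline, quantized)]

drop_7b = relative_drop(scores["Qwen-7B-Chat (BF16)"], scores["Qwen-7B-Chat (Int4)"])
print(drop_7b)  # MMLU, C-Eval, and GSM8K barely move; HumanEval drops the most
```

For the 7B Int4 model, MMLU, C-Eval, and GSM8K each drop by under 2% of baseline, while HumanEval falls by roughly a fifth — the one benchmark where quantization visibly bites.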

For additional experimental results, including detailed model performance on more benchmark datasets, please refer to the Qwen technical report.

Qwen-14B and Qwen-7B surpass baseline models of comparable sizes across various benchmark datasets, such as MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and others.

These benchmarks assess the models’ abilities in natural language understanding, mathematical problem-solving, coding, and more. However, Qwen-14B still significantly lags behind GPT-3.5, not to mention GPT-4.

The results are shown below:

Qwen2 HumanEval Benchmark

Testing Qwen2: Practical Applications

To put Qwen2 to the test, we ran several practical tasks using both the smallest and largest models. For simple coding tasks, the 0.5B model performed adequately, although it struggled with more complex requests.

Qwen2 Code and Math Capabilities

On the other hand, the 72B model handled intricate tasks with ease, producing accurate and well-structured code.

Additionally, the large model displayed better reasoning capabilities, particularly in logic problems, where it provided well-thought-out explanations.
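One way to score the kind of simple coding task described above is to execute the model's generated code against a few assertions, HumanEval-style. A minimal sketch — the sample completion below is a stand-in for actual model output, and the helper name is our own:

```python
def passes_tests(generated_code: str, test_snippet: str) -> bool:
    """Run model-generated code, then the test assertions, in a shared namespace."""
    namespace = {}
    try:
        exec(generated_code, namespace)
        exec(test_snippet, namespace)
        return True
    except Exception:
        return False

# Stand-in for a completion a Qwen2 model might return for the prompt
# "Write a function that reverses a string."
completion = """
def reverse_string(s):
    return s[::-1]
"""

ok = passes_tests(completion, "assert reverse_string('qwen') == 'newq'")
print(ok)  # True
```

Running the same harness over both variants makes the quality gap measurable rather than anecdotal: the 0.5B model passes the easy tasks, while the 72B model also clears the intricate ones.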

Conclusion

Qwen2 by Alibaba Group represents a significant advancement in AI technology. Its ability to outperform models like LLaMA 3 across various tasks highlights its potential for a wide range of applications. Whether for basic tasks or more complex operations, Qwen2 offers a robust solution, particularly in its larger variants.

As AI continues to evolve, models like Qwen2 will undoubtedly play a crucial role in shaping the future of technology.

Read other articles about the Qwen2 model suite on our blog.