The evaluation of the Qwen2 family of language models primarily emphasizes performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capabilities.
The evaluation encompasses a diverse set of datasets spanning the following tasks:
English tasks include MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), and ARC-C (25-shot).
For coding, EvalPlus (0-shot) covers HumanEval, MBPP, HumanEval+, and MBPP+, while MultiPL-E (0-shot) evaluates models on languages such as Python, C++, Java, PHP, TypeScript, C#, Bash, and JavaScript.
Math tasks are assessed using GSM8K (4-shot) and MATH (4-shot). Chinese tasks include C-Eval (5-shot) and CMMLU (5-shot).
For multilingual tasks, the evaluation includes Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), and Multi-Translation (Flores-101 5-shot).
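These benchmarks follow standard few-shot protocols: the prompt contains the indicated number of solved exemplars before the test item, and the model's answer (or its highest-likelihood choice) is compared with the reference. The sketch below illustrates this for an MMLU-style multiple-choice task; it is not the Qwen Dev Team's actual harness, and `score_continuation` is a hypothetical placeholder for whatever log-likelihood scoring a real evaluation framework provides.

```python
# Minimal sketch of an n-shot multiple-choice evaluation (MMLU-style).
# Not the official Qwen2 harness; `score_continuation` is a hypothetical stand-in.
from dataclasses import dataclass
from typing import List

@dataclass
class MCQItem:
    question: str
    choices: List[str]   # e.g. the four answer options
    answer: int          # index of the correct choice

def build_prompt(shots: List[MCQItem], item: MCQItem) -> str:
    """Concatenate k solved exemplars followed by the unsolved question."""
    parts = [f"Question: {s.question}\nAnswer: {s.choices[s.answer]}\n" for s in shots]
    parts.append(f"Question: {item.question}\nAnswer:")
    return "\n".join(parts)

def score_continuation(prompt: str, continuation: str) -> float:
    """Hypothetical placeholder: a real harness would return the model's
    log-probability of `continuation` given `prompt`."""
    return -float(len(continuation))  # dummy value so the sketch runs

def evaluate(dataset: List[MCQItem], shots: List[MCQItem]) -> float:
    """Accuracy (%) when the highest-scoring choice is taken as the prediction."""
    correct = 0
    for item in dataset:
        prompt = build_prompt(shots, item)
        scores = [score_continuation(prompt, " " + c) for c in item.choices]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == item.answer)
    return 100.0 * correct / len(dataset)
```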
Qwen2-72B
Datasets | DeepSeek-V2 | Mixtral-8x22B | Llama-3-70B | Qwen1.5-72B | Qwen1.5-110B | Qwen2-72B |
---|---|---|---|---|---|---|
Architecture | MoE | MoE | Dense | Dense | Dense | Dense |
#Activated Params | 21B | 39B | 70B | 72B | 110B | 72B |
#Params | 236B | 140B | 70B | 72B | 110B | 72B |
English | ||||||
MMLU | 78.5 | 77.8 | 79.5 | 77.5 | 80.4 | 84.2 |
MMLU-Pro | – | 49.5 | 52.8 | 45.8 | 49.4 | 55.6 |
GPQA | – | 34.3 | 36.3 | 36.3 | 35.9 | 37.9 |
Theorem QA | – | 35.9 | 32.3 | 29.3 | 34.9 | 43.1 |
BBH | 78.9 | 78.9 | 81.0 | 65.5 | 74.8 | 82.4 |
HellaSwag | 87.8 | 88.7 | 88.0 | 86.0 | 87.5 | 87.6 |
Winogrande | 84.8 | 85.0 | 85.3 | 83.0 | 83.5 | 85.1 |
ARC-C | 70.0 | 70.7 | 68.8 | 65.9 | 69.6 | 68.9 |
TruthfulQA | 42.2 | 51.0 | 45.6 | 59.6 | 49.6 | 54.8 |
Coding | ||||||
HumanEval | 45.7 | 46.3 | 48.2 | 46.3 | 54.3 | 64.6 |
MBPP | 73.9 | 71.7 | 70.4 | 66.9 | 70.9 | 76.9 |
EvalPlus | 55.0 | 54.1 | 54.8 | 52.9 | 57.7 | 65.4 |
MultiPL-E | 44.4 | 46.7 | 46.3 | 41.8 | 52.7 | 59.6 |
Mathematics | ||||||
GSM8K | 79.2 | 83.7 | 83.0 | 79.5 | 85.4 | 89.5 |
MATH | 43.6 | 41.7 | 42.5 | 34.1 | 49.6 | 51.1 |
Chinese | ||||||
C-Eval | 81.7 | 54.6 | 65.2 | 84.1 | 89.1 | 91.0 |
CMMLU | 84.0 | 53.4 | 67.2 | 83.5 | 88.3 | 90.1 |
Multilingual | ||||||
Multi-Exam | 67.5 | 63.5 | 70.0 | 66.4 | 75.6 | 76.6 |
Multi-Understanding | 77.0 | 77.7 | 79.9 | 78.2 | 78.2 | 80.7 |
Multi-Mathematics | 58.8 | 62.9 | 67.1 | 61.7 | 64.4 | 76.0 |
Multi-Translation | 36.0 | 23.3 | 38.0 | 35.6 | 36.2 | 37.8 |
Qwen2-57B-A14B
Datasets | Jamba | Mixtral-8x7B | Yi-1.5-34B | Qwen1.5-32B | Qwen2-57B-A14B |
---|---|---|---|---|---|
Architecture | MoE | MoE | Dense | Dense | MoE |
#Activated Params | 12B | 12B | 34B | 32B | 14B |
#Params | 52B | 47B | 34B | 32B | 57B |
English | |||||
MMLU | 67.4 | 71.8 | 77.1 | 74.3 | 76.5 |
MMLU-Pro | – | 41.0 | 48.3 | 44.0 | 43.0 |
GPQA | – | 29.2 | – | 30.8 | 34.3 |
Theorem QA | – | 23.2 | – | 28.8 | 33.5 |
BBH | 45.4 | 50.3 | 76.4 | 66.8 | 67.0 |
HellaSwag | 87.1 | 86.5 | 85.9 | 85.0 | 85.2 |
Winogrande | 82.5 | 81.9 | 84.9 | 81.5 | 79.5 |
ARC-C | 64.4 | 66.0 | 65.6 | 63.6 | 64.1 |
TruthfulQA | 46.4 | 51.1 | 53.9 | 57.4 | 57.7 |
Coding | |||||
HumanEval | 29.3 | 37.2 | 46.3 | 43.3 | 53.0 |
MBPP | – | 63.9 | 65.5 | 64.2 | 71.9 |
EvalPlus | – | 46.4 | 51.9 | 50.4 | 57.2 |
MultiPL-E | – | 39.0 | 39.5 | 38.5 | 49.8 |
Mathematics | |||||
GSM8K | 59.9 | 62.5 | 82.7 | 76.8 | 80.7 |
MATH | – | 30.8 | 41.7 | 36.1 | 43.0 |
Chinese | |||||
C-Eval | – | – | – | 83.5 | 87.7 |
CMMLU | – | – | 84.8 | 82.3 | 88.5 |
Multilingual | |||||
Multi-Exam | – | 56.1 | 58.3 | 61.6 | 65.5 |
Multi-Understanding | – | 70.7 | 73.9 | 76.5 | 77.0 |
Multi-Mathematics | – | 45.0 | 49.3 | 56.1 | 62.3 |
Multi-Translation | – | 29.8 | 30.0 | 33.5 | 34.5 |
Qwen2-7B
Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
---|---|---|---|---|---|
# Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
# Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
English | |||||
MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
Coding | |||||
HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
Mathematics | |||||
GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
Chinese | |||||
C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
CMMLU | – | – | 50.8 | 73.1 | 83.9 |
Multilingual | |||||
Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |
Qwen2-0.5B & Qwen2-1.5B
Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
---|---|---|---|---|---|---|
#Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
MMLU-Pro | – | 15.9 | – | – | 14.7 | 21.8 |
Theorem QA | – | – | – | – | 8.9 | 15.0 |
HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
Winogrande | 74.4 | 66.8 | – | 60.3 | 56.8 | 66.2 |
ARC-C | 61.1 | 48.5 | – | 37.9 | 31.5 | 43.9 |
TruthfulQA | 44.5 | 33.1 | – | 39.4 | 39.7 | 45.9 |
C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
CMMLU | 24.2 | – | 51.1 | 57.8 | 55.1 | 70.3 |
Instruction-tuned Model Evaluation
Qwen2-72B-Instruct
Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct |
---|---|---|---|
English | |||
MMLU | 82.0 | 75.6 | 82.3 |
MMLU-Pro | 56.2 | 51.7 | 64.4 |
GPQA | 41.9 | 39.4 | 42.4 |
TheoremQA | 42.5 | 28.8 | 44.4 |
MT-Bench | 8.95 | 8.61 | 9.12 |
Arena-Hard | 41.1 | 36.1 | 48.1 |
IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | 77.6 |
Coding | |||
HumanEval | 81.7 | 71.3 | 86.0 |
MBPP | 82.3 | 71.9 | 80.2 |
MultiPL-E | 63.4 | 48.1 | 69.2 |
EvalPlus | 75.2 | 66.9 | 79.0 |
LiveCodeBench | 29.3 | 17.9 | 35.7 |
Mathematics | |||
GSM8K | 93.0 | 82.7 | 91.1 |
MATH | 50.4 | 42.5 | 59.7 |
Chinese | |||
C-Eval | 61.6 | 76.1 | 83.8 |
AlignBench | 7.42 | 7.28 | 8.27 |
Qwen2-57B-A14B-Instruct
Datasets | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat | Qwen2-57B-A14B-Instruct |
---|---|---|---|---|
Architecture | MoE | Dense | Dense | MoE |
#Activated Params | 12B | 34B | 32B | 14B |
#Params | 47B | 34B | 32B | 57B |
English | ||||
MMLU | 71.4 | 76.8 | 74.8 | 75.4 |
MMLU-Pro | 43.3 | 52.3 | 46.4 | 52.8 |
GPQA | – | – | 30.8 | 34.3 |
TheoremQA | – | – | 30.9 | 33.1 |
MT-Bench | 8.30 | 8.50 | 8.30 | 8.55 |
Coding | ||||
HumanEval | 45.1 | 75.2 | 68.3 | 79.9 |
MBPP | 59.5 | 74.6 | 67.9 | 70.9 |
MultiPL-E | – | – | 50.7 | 66.4 |
EvalPlus | 48.5 | – | 63.6 | 71.6 |
LiveCodeBench | 12.3 | – | 15.2 | 25.5 |
Mathematics | ||||
GSM8K | 65.7 | 90.2 | 83.6 | 79.6 |
MATH | 30.7 | 50.1 | 42.4 | 49.1 |
Chinese | ||||
C-Eval | – | – | 76.7 | 80.5 |
AlignBench | 5.70 | 7.20 | 7.19 | 7.36 |
Qwen2-7B-Instruct
Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
---|---|---|---|---|---|
English | |||||
MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
MMLU-Pro | 41.0 | – | – | 29.1 | 44.1 |
GPQA | 34.2 | – | – | 27.8 | 25.3 |
TheoremQA | 23.0 | – | – | 14.1 | 25.3 |
MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
Coding | |||||
HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
MBPP | 67.9 | – | – | 48.9 | 67.2 |
MultiPL-E | 48.5 | – | – | 27.2 | 59.1 |
EvalPlus | 60.9 | – | – | 44.8 | 70.3 |
LiveCodeBench | 17.3 | – | – | 6.0 | 26.6 |
Mathematics | |||||
GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
Chinese | |||||
C-Eval | 45.9 | – | 75.6 | 67.3 | 77.2 |
AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |
Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct
Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
---|---|---|---|---|
MMLU | 35.0 | 37.9 | 43.7 | 52.4 |
HumanEval | 9.1 | 17.1 | 25.0 | 37.8 |
GSM8K | 11.3 | 40.1 | 35.3 | 61.6 |
C-Eval | 37.2 | 45.2 | 55.3 | 63.8 |
IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |
Multilingual Capability of Instruction-Tuned Models
The Qwen Dev Team assessed the Qwen2 instruction-tuned models against other recent large language models (LLMs) using several cross-lingual open benchmarks as well as human evaluation. For benchmarking, they focused on two evaluation datasets:
- M-MMLU from Okapi: A multilingual commonsense evaluation, tested on a subset of languages including Arabic (ar), German (de), Spanish (es), French (fr), Italian (it), Dutch (nl), Russian (ru), Ukrainian (uk), Vietnamese (vi), and Chinese (zh).
- MGSM: A math evaluation across languages such as German (de), English (en), Spanish (es), French (fr), Japanese (ja), Russian (ru), Thai (th), Chinese (zh), and Bengali (bn).
The results, averaged over languages for each benchmark, are presented in the table below; a short scoring sketch follows it.
Models | M-MMLU (5-shot) | MGSM (0-shot, CoT) |
---|---|---|
Proprietary LLMs | ||
GPT-4-0613 | 78.0 | 87.0 |
GPT-4-Turbo-0409 | 79.3 | 90.5 |
GPT-4o-0513 | 83.2 | 89.6 |
Claude-3-Opus-20240229 | 80.1 | 91.0 |
Claude-3-Sonnet-20240229 | 71.0 | 85.6 |
Open-source LLMs | ||
command-r-plus-110b | 65.5 | 63.5 |
Qwen1.5-7B-Chat | 50.0 | 37.0 |
Qwen1.5-32B-Chat | 65.0 | 65.0 |
Qwen1.5-72B-Chat | 68.4 | 71.7 |
Qwen2-7B-Instruct | 60.0 | 57.0 |
Qwen2-57B-A14B-Instruct | 68.0 | 74.0 |
Qwen2-72B-Instruct | 78.0 | 86.6 |
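As a concrete illustration of the MGSM (0-shot, CoT) column above: a typical scoring recipe lets the model produce a chain-of-thought completion, extracts the final number, checks it against the reference, and macro-averages per-language accuracies. The sketch below follows that recipe under the assumption of exact-match on the extracted number; the official evaluation may differ in its details.

```python
# Sketch of MGSM-style scoring: extract the last number from a chain-of-thought
# completion, compare it to the reference, and macro-average per-language
# accuracies. Assumes exact-match on the final number (an assumption, not the
# documented official procedure).
import re
from typing import Dict, List, Optional, Tuple

_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(completion: str) -> Optional[float]:
    """Return the last number mentioned in the completion, or None."""
    matches = _NUM.findall(completion)
    return float(matches[-1].replace(",", "")) if matches else None

def language_accuracy(samples: List[Tuple[str, float]]) -> float:
    """samples: list of (model_completion, reference_answer) pairs."""
    hits = sum(
        1 for completion, ref in samples
        if (pred := extract_final_number(completion)) is not None
        and abs(pred - ref) < 1e-6
    )
    return 100.0 * hits / len(samples)

def mgsm_score(per_language: Dict[str, List[Tuple[str, float]]]) -> float:
    """Macro-average accuracy over the MGSM languages (de, en, es, fr, ja, ru, th, zh, bn)."""
    accs = [language_accuracy(samples) for samples in per_language.values()]
    return sum(accs) / len(accs)
```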
For human evaluation, they compared Qwen2-72B-Instruct with GPT-3.5, GPT-4, and Claude-3-Opus on an in-house evaluation set covering 10 languages: Arabic (ar), Spanish (es), French (fr), Korean (ko), Thai (th), Vietnamese (vi), Portuguese (pt), Indonesian (id), Japanese (ja), and Russian (ru). Scores range from 1 to 5; an averaging check follows the table:
Models | ar | es | fr | ko | th | vi | pt | id | ja | ru | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
Claude-3-Opus-20240229 | 4.15 | 4.31 | 4.23 | 4.23 | 4.01 | 3.98 | 4.09 | 4.40 | 3.85 | 4.25 | 4.15 |
GPT-4o-0513 | 3.55 | 4.26 | 4.16 | 4.40 | 4.09 | 4.14 | 3.89 | 4.39 | 3.72 | 4.32 | 4.09 |
GPT-4-Turbo-0409 | 3.44 | 4.08 | 4.19 | 4.24 | 4.11 | 3.84 | 3.86 | 4.09 | 3.68 | 4.27 | 3.98 |
Qwen2-72B-Instruct | 3.86 | 4.10 | 4.01 | 4.14 | 3.75 | 3.91 | 3.97 | 3.83 | 3.63 | 4.15 | 3.93 |
GPT-4-0613 | 3.55 | 3.92 | 3.94 | 3.87 | 3.83 | 3.95 | 3.55 | 3.77 | 3.06 | 3.63 | 3.71 |
GPT-3.5-Turbo-1106 | 2.52 | 4.07 | 3.47 | 2.37 | 3.38 | 2.90 | 3.37 | 3.56 | 2.75 | 3.24 | 3.16 |
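The Average column appears to be the unweighted mean of the ten per-language scores; as a quick sanity check on the Claude-3-Opus row:

```python
# Sanity check (assumption: Average = unweighted mean over the ten languages).
claude_3_opus = [4.15, 4.31, 4.23, 4.23, 4.01, 3.98, 4.09, 4.40, 3.85, 4.25]
print(round(sum(claude_3_opus) / len(claude_3_opus), 2))  # 4.15, matching the table
```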
Grouped by task type, the results are as follows:
Models | Knowledge | Understanding | Creation | Math |
---|---|---|---|---|
Claude-3-Opus-20240229 | 3.64 | 4.45 | 4.42 | 3.81 |
GPT-4o-0513 | 3.76 | 4.35 | 4.45 | 3.53 |
GPT-4-Turbo-0409 | 3.42 | 4.29 | 4.35 | 3.58 |
Qwen2-72B-Instruct | 3.41 | 4.07 | 4.36 | 3.61 |
GPT-4-0613 | 3.42 | 4.09 | 4.10 | 3.32 |
GPT-3.5-Turbo-1106 | 3.37 | 3.67 | 3.89 | 2.97 |
These results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.