Qwen2 Language Model Evaluation

The evaluation of the Qwen2 language model family primarily emphasizes performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capabilities.

The Qwen2 Language Model evaluation encompasses a diverse set of datasets across various tasks:

English tasks include MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), and ARC-C (25-shot).

For coding, EvalPlus (0-shot) covers HumanEval, MBPP, HumanEval+, and MBPP+, while MultiPL-E (0-shot) evaluates models on languages such as Python, C++, JAVA, PHP, TypeScript, C#, Bash, and JavaScript.

Math tasks are assessed using GSM8K (4-shot) and MATH (4-shot). Chinese tasks include C-Eval (5-shot) and CMMLU (5-shot).

For multilingual tasks, the evaluation includes Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), and Multi-Translation (Flores-101 5-shot).
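The shot counts above indicate how many solved exemplars are prepended to each test question. As an illustration only (the evaluation code is not published here), the sketch below shows how a few-shot multiple-choice benchmark such as MMLU is commonly scored with a base model: each candidate answer letter is ranked by its log-likelihood. The checkpoint name, prompt template, and helper names are assumptions.

```python
# Illustrative sketch only -- not the Qwen team's evaluation harness.
# A k-shot multiple-choice item is scored by prepending k solved exemplars
# and picking the answer letter with the highest model log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-7B"  # assumed Hugging Face Hub name for the base model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")
model.eval()

def build_prompt(fewshot, question, choices):
    """fewshot: list of (question, choices, answer_letter) exemplars."""
    blocks = []
    for q, ch, ans in fewshot:
        opts = "\n".join(f"{l}. {c}" for l, c in zip("ABCD", ch))
        blocks.append(f"Question: {q}\n{opts}\nAnswer: {ans}")
    opts = "\n".join(f"{l}. {c}" for l, c in zip("ABCD", choices))
    blocks.append(f"Question: {question}\n{opts}\nAnswer:")
    return "\n\n".join(blocks)

@torch.no_grad()
def predict(fewshot, question, choices):
    prompt = build_prompt(fewshot, question, choices)
    scores = []
    for letter in "ABCD":
        ids = tok(prompt + " " + letter, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -2]           # logits predicting the last token
        logp = torch.log_softmax(logits, dim=-1)[ids[0, -1]]
        scores.append(logp.item())                  # assumes " A" etc. is a single token
    return "ABCD"[scores.index(max(scores))]
```

Accuracy is then simply the fraction of test items whose predicted letter matches the reference answer.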

Qwen2-72B performance

| Datasets | DeepSeek-V2 | Mixtral-8x22B | Llama-3-70B | Qwen1.5-72B | Qwen1.5-110B | Qwen2-72B |
|---|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | Dense | Dense |
| #Activated Params | 21B | 39B | 70B | 72B | 110B | 72B |
| #Params | 236B | 140B | 70B | 72B | 110B | 72B |
| **English** | | | | | | |
| MMLU | 78.5 | 77.8 | 79.5 | 77.5 | 80.4 | 84.2 |
| MMLU-Pro | - | 49.5 | 52.8 | 45.8 | 49.4 | 55.6 |
| GPQA | - | 34.3 | 36.3 | 36.3 | 35.9 | 37.9 |
| Theorem QA | - | 35.9 | 32.3 | 29.3 | 34.9 | 43.1 |
| BBH | 78.9 | 78.9 | 81.0 | 65.5 | 74.8 | 82.4 |
| HellaSwag | 87.8 | 88.7 | 88.0 | 86.0 | 87.5 | 87.6 |
| Winogrande | 84.8 | 85.0 | 85.3 | 83.0 | 83.5 | 85.1 |
| ARC-C | 70.0 | 70.7 | 68.8 | 65.9 | 69.6 | 68.9 |
| TruthfulQA | 42.2 | 51.0 | 45.6 | 59.6 | 49.6 | 54.8 |
| **Coding** | | | | | | |
| HumanEval | 45.7 | 46.3 | 48.2 | 46.3 | 54.3 | 64.6 |
| MBPP | 73.9 | 71.7 | 70.4 | 66.9 | 70.9 | 76.9 |
| EvalPlus | 55.0 | 54.1 | 54.8 | 52.9 | 57.7 | 65.4 |
| MultiPL-E | 44.4 | 46.7 | 46.3 | 41.8 | 52.7 | 59.6 |
| **Mathematics** | | | | | | |
| GSM8K | 79.2 | 83.7 | 83.0 | 79.5 | 85.4 | 89.5 |
| MATH | 43.6 | 41.7 | 42.5 | 34.1 | 49.6 | 51.1 |
| **Chinese** | | | | | | |
| C-Eval | 81.7 | 54.6 | 65.2 | 84.1 | 89.1 | 91.0 |
| CMMLU | 84.0 | 53.4 | 67.2 | 83.5 | 88.3 | 90.1 |
| **Multilingual** | | | | | | |
| Multi-Exam | 67.5 | 63.5 | 70.0 | 66.4 | 75.6 | 76.6 |
| Multi-Understanding | 77.0 | 77.7 | 79.9 | 78.2 | 78.2 | 80.7 |
| Multi-Mathematics | 58.8 | 62.9 | 67.1 | 61.7 | 64.4 | 76.0 |
| Multi-Translation | 36.0 | 23.3 | 38.0 | 35.6 | 36.2 | 37.8 |

Qwen2-57B-A14B

| Datasets | Jamba | Mixtral-8x7B | Yi-1.5-34B | Qwen1.5-32B | Qwen2-57B-A14B |
|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 12B | 34B | 32B | 14B |
| #Params | 52B | 47B | 34B | 32B | 57B |
| **English** | | | | | |
| MMLU | 67.4 | 71.8 | 77.1 | 74.3 | 76.5 |
| MMLU-Pro | - | 41.0 | 48.3 | 44.0 | 43.0 |
| GPQA | - | 29.2 | - | 30.8 | 34.3 |
| Theorem QA | - | 23.2 | - | 28.8 | 33.5 |
| BBH | 45.4 | 50.3 | 76.4 | 66.8 | 67.0 |
| HellaSwag | 87.1 | 86.5 | 85.9 | 85.0 | 85.2 |
| Winogrande | 82.5 | 81.9 | 84.9 | 81.5 | 79.5 |
| ARC-C | 64.4 | 66.0 | 65.6 | 63.6 | 64.1 |
| TruthfulQA | 46.4 | 51.1 | 53.9 | 57.4 | 57.7 |
| **Coding** | | | | | |
| HumanEval | 29.3 | 37.2 | 46.3 | 43.3 | 53.0 |
| MBPP | - | 63.9 | 65.5 | 64.2 | 71.9 |
| EvalPlus | - | 46.4 | 51.9 | 50.4 | 57.2 |
| MultiPL-E | - | 39.0 | 39.5 | 38.5 | 49.8 |
| **Mathematics** | | | | | |
| GSM8K | 59.9 | 62.5 | 82.7 | 76.8 | 80.7 |
| MATH | - | 30.8 | 41.7 | 36.1 | 43.0 |
| **Chinese** | | | | | |
| C-Eval | - | - | - | 83.5 | 87.7 |
| CMMLU | - | - | 84.8 | 82.3 | 88.5 |
| **Multilingual** | | | | | |
| Multi-Exam | - | 56.1 | 58.3 | 61.6 | 65.5 |
| Multi-Understanding | - | 70.7 | 73.9 | 76.5 | 77.0 |
| Multi-Mathematics | - | 45.0 | 49.3 | 56.1 | 62.3 |
| Multi-Translation | - | 29.8 | 30.0 | 33.5 | 34.5 |

Qwen2-7B

| Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
|---|---|---|---|---|---|
| # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
| # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
| **English** | | | | | |
| MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
| MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
| GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
| Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
| BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
| HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
| Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
| ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
| TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
| **Coding** | | | | | |
| HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
| MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
| EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
| MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
| **Mathematics** | | | | | |
| GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
| MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
| **Chinese** | | | | | |
| C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
| CMMLU | - | - | 50.8 | 73.1 | 83.9 |
| **Multilingual** | | | | | |
| Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
| Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
| Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
| Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |

Qwen2-0.5B & Qwen2-1.5B

| Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
|---|---|---|---|---|---|---|
| #Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
| MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
| MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
| Theorem QA | - | - | - | - | 8.9 | 15.0 |
| HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
| MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
| GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
| MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
| BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
| HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
| Winogrande | 74.4 | 66.8 | - | 60.3 | 56.8 | 66.2 |
| ARC-C | 61.1 | 48.5 | - | 37.9 | 31.5 | 43.9 |
| TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | 45.9 |
| C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
| CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | 70.3 |

Instruction-tuned Model Evaluation

Qwen2-72B-Instruct

| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct |
|---|---|---|---|
| **English** | | | |
| MMLU | 82.0 | 75.6 | 82.3 |
| MMLU-Pro | 56.2 | 51.7 | 64.4 |
| GPQA | 41.9 | 39.4 | 42.4 |
| Theorem QA | 42.5 | 28.8 | 44.4 |
| MT-Bench | 8.95 | 8.61 | 9.12 |
| Arena-Hard | 41.1 | 36.1 | 48.1 |
| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | 77.6 |
| **Coding** | | | |
| HumanEval | 81.7 | 71.3 | 86.0 |
| MBPP | 82.3 | 71.9 | 80.2 |
| MultiPL-E | 63.4 | 48.1 | 69.2 |
| EvalPlus | 75.2 | 66.9 | 79.0 |
| LiveCodeBench | 29.3 | 17.9 | 35.7 |
| **Mathematics** | | | |
| GSM8K | 93.0 | 82.7 | 91.1 |
| MATH | 50.4 | 42.5 | 59.7 |
| **Chinese** | | | |
| C-Eval | 61.6 | 76.1 | 83.8 |
| AlignBench | 7.42 | 7.28 | 8.27 |

Qwen2-57B-A14B-Instruct

| Datasets | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat | Qwen2-57B-A14B-Instruct |
|---|---|---|---|---|
| Architecture | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 34B | 32B | 14B |
| #Params | 47B | 34B | 32B | 57B |
| **English** | | | | |
| MMLU | 71.4 | 76.8 | 74.8 | 75.4 |
| MMLU-Pro | 43.3 | 52.3 | 46.4 | 52.8 |
| GPQA | - | - | 30.8 | 34.3 |
| Theorem QA | - | - | 30.9 | 33.1 |
| MT-Bench | 8.30 | 8.50 | 8.30 | 8.55 |
| **Coding** | | | | |
| HumanEval | 45.1 | 75.2 | 68.3 | 79.9 |
| MBPP | 59.5 | 74.6 | 67.9 | 70.9 |
| MultiPL-E | - | - | 50.7 | 66.4 |
| EvalPlus | 48.5 | - | 63.6 | 71.6 |
| LiveCodeBench | 12.3 | - | 15.2 | 25.5 |
| **Mathematics** | | | | |
| GSM8K | 65.7 | 90.2 | 83.6 | 79.6 |
| MATH | 30.7 | 50.1 | 42.4 | 49.1 |
| **Chinese** | | | | |
| C-Eval | - | - | 76.7 | 80.5 |
| AlignBench | 5.70 | 7.20 | 7.19 | 7.36 |

Qwen2-7B-Instruct

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| **English** | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| Theorem QA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| **Coding** | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| EvalPlus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| **Mathematics** | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| **Chinese** | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |

Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
|---|---|---|---|---|
| MMLU | 35.0 | 37.9 | 43.7 | 52.4 |
| HumanEval | 9.1 | 17.1 | 25.0 | 37.8 |
| GSM8K | 11.3 | 40.1 | 35.3 | 61.6 |
| C-Eval | 37.2 | 45.2 | 55.3 | 63.8 |
| IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |

Multilingual Capability of Instruction-Tuned Models

The Qwen Dev Team assessed Qwen2 instruction-tuned models against other recent large language models (LLMs) using several cross-lingual open benchmarks and human evaluations. For benchmarking, they focused on two evaluation datasets:

  • M-MMLU from Okapi: A multilingual commonsense evaluation, tested on a subset of languages including Arabic (ar), German (de), Spanish (es), French (fr), Italian (it), Dutch (nl), Russian (ru), Ukrainian (uk), Vietnamese (vi), and Chinese (zh).
  • MGSM: A math evaluation across languages such as German (de), English (en), Spanish (es), French (fr), Japanese (ja), Russian (ru), Thai (th), Chinese (zh), and Bengali (bn).

The results were averaged over languages for each benchmark and are presented as follows:

| Models | M-MMLU (5-shot) | MGSM (0-shot, CoT) |
|---|---|---|
| **Proprietary LLMs** | | |
| GPT-4-0613 | 78.0 | 87.0 |
| GPT-4-Turbo-0409 | 79.3 | 90.5 |
| GPT-4o-0513 | 83.2 | 89.6 |
| Claude-3-Opus-20240229 | 80.1 | 91.0 |
| Claude-3-Sonnet-20240229 | 71.0 | 85.6 |
| **Open-source LLMs** | | |
| command-r-plus-110b | 65.5 | 63.5 |
| Qwen1.5-7B-Chat | 50.0 | 37.0 |
| Qwen1.5-32B-Chat | 65.0 | 65.0 |
| Qwen1.5-72B-Chat | 68.4 | 71.7 |
| Qwen2-7B-Instruct | 60.0 | 57.0 |
| Qwen2-57B-A14B-Instruct | 68.0 | 74.0 |
| Qwen2-72B-Instruct | 78.0 | 86.6 |
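The MGSM column above is scored with 0-shot chain-of-thought prompting. As a hedged illustration (not the team's actual pipeline), the sketch below asks an instruction-tuned model to reason step by step and extracts the final number from its reply; the checkpoint name and prompt wording are assumptions.

```python
# Illustrative 0-shot CoT scoring in the MGSM style: generate a step-by-step
# answer greedily, then compare the last number in the reply to the reference.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-7B-Instruct"  # assumed Hugging Face Hub name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")

def solve(question: str) -> str:
    """Return the model's final numeric answer as a string."""
    messages = [{"role": "user",
                 "content": f"{question}\nPlease reason step by step, "
                            f"then give the final numeric answer."}]
    prompt = tok.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    reply = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    return numbers[-1] if numbers else ""

# Per-language accuracy is computed this way, and the benchmark score is the
# mean over the evaluated languages.
```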

For human evaluation, they compared Qwen2-72B-Instruct with GPT-3.5, GPT-4, and Claude-3-Opus using an in-house evaluation set covering 10 languages: ar, es, fr, ko, th, vi, pt, id, ja, and ru (scores range from 1 to 5):

| Models | ar | es | fr | ko | th | vi | pt | id | ja | ru | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-3-Opus-20240229 | 4.15 | 4.31 | 4.23 | 4.23 | 4.01 | 3.98 | 4.09 | 4.40 | 3.85 | 4.25 | 4.15 |
| GPT-4o-0513 | 3.55 | 4.26 | 4.16 | 4.40 | 4.09 | 4.14 | 3.89 | 4.39 | 3.72 | 4.32 | 4.09 |
| GPT-4-Turbo-0409 | 3.44 | 4.08 | 4.19 | 4.24 | 4.11 | 3.84 | 3.86 | 4.09 | 3.68 | 4.27 | 3.98 |
| Qwen2-72B-Instruct | 3.86 | 4.10 | 4.01 | 4.14 | 3.75 | 3.91 | 3.97 | 3.83 | 3.63 | 4.15 | 3.93 |
| GPT-4-0613 | 3.55 | 3.92 | 3.94 | 3.87 | 3.83 | 3.95 | 3.55 | 3.77 | 3.06 | 3.63 | 3.71 |
| GPT-3.5-Turbo-1106 | 2.52 | 4.07 | 3.47 | 2.37 | 3.38 | 2.90 | 3.37 | 3.56 | 2.75 | 3.24 | 3.16 |
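The Average column is consistent with the unweighted mean of the ten per-language scores; a quick illustrative check using the Claude-3-Opus row from the table:

```python
# Arithmetic check (illustrative): averaging the ten per-language scores
# reproduces the reported "Average" value for Claude-3-Opus-20240229.
scores = [4.15, 4.31, 4.23, 4.23, 4.01, 3.98, 4.09, 4.40, 3.85, 4.25]
print(f"{sum(scores) / len(scores):.2f}")  # -> 4.15, matching the table
```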

Grouped by task types, the results are shown as follows:

| Models | Knowledge | Understanding | Creation | Math |
|---|---|---|---|---|
| Claude-3-Opus-20240229 | 3.64 | 4.45 | 4.42 | 3.81 |
| GPT-4o-0513 | 3.76 | 4.35 | 4.45 | 3.53 |
| GPT-4-Turbo-0409 | 3.42 | 4.29 | 4.35 | 3.58 |
| Qwen2-72B-Instruct | 3.41 | 4.07 | 4.36 | 3.61 |
| GPT-4-0613 | 3.42 | 4.09 | 4.10 | 3.32 |
| GPT-3.5-Turbo-1106 | 3.37 | 3.67 | 3.89 | 2.97 |

These results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.
