The Alibaba Cloud Qwen team has open-sourced the Qwen2.5-Math series, which includes the base models Qwen2.5-Math-1.5B/7B/72B, the instruction-tuned models Qwen2.5-Math-1.5B/7B/72B-Instruct, and the Qwen2.5-Math-RM-72B reward model.
In contrast to the Qwen2-Math series, which only utilized Chain-of-Thought (CoT) reasoning for solving English math problems, the Qwen2.5-Math series has been expanded to support both Chain-of-Thought and Tool-integrated Reasoning (TIR) for math problem-solving in both Chinese and English. The Qwen2.5-Math series demonstrates marked performance improvements over the Qwen2-Math models across Chinese and English math benchmarks with CoT reasoning.
Qwen2.5-Math mainly supports solving English and Chinese math problems through CoT and TIR. We do not recommend using this series of models for other tasks.
While Chain-of-Thought reasoning significantly enhances the mathematical abilities of large language models, it struggles with computational accuracy and with complex mathematical or algorithmic tasks, such as finding the roots of a quadratic equation or computing matrix eigenvalues.
Tool-integrated Reasoning addresses these challenges by letting the model offload precise computation, symbolic manipulation, and algorithmic processing to an external tool such as a Python interpreter. With TIR, the Qwen2.5-Math-1.5B/7B/72B-Instruct models achieve scores of 79.7, 85.3, and 87.8 on the MATH benchmark, respectively.
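To illustrate the difference between the two modes, here is a minimal sketch of prompting Qwen2.5-Math-7B-Instruct in CoT or TIR mode via Hugging Face transformers. The system-prompt wording follows the model card but should be treated as an assumption; note that in TIR mode the code blocks the model emits still need to be executed and fed back (see the demo section below).

```python
# Minimal sketch: switching Qwen2.5-Math-7B-Instruct between CoT and TIR
# prompting with Hugging Face transformers. System prompts follow the
# model card; treat exact wording as an assumption if you adapt this.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

PROMPTS = {
    # CoT: the model reasons in natural language only.
    "cot": "Please reason step by step, and put your final answer within \\boxed{}.",
    # TIR: the model interleaves reasoning with Python code; the host is
    # expected to execute the code and append its output to the context.
    "tir": "Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.",
}

def solve(problem: str, mode: str = "tir") -> str:
    messages = [
        {"role": "system", "content": PROMPTS[mode]},
        {"role": "user", "content": problem},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens.
    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

print(solve("Find the eigenvalues of the matrix [[2, 1], [1, 2]]."))
```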
Qwen2.5-Math: Base Models
The specialization pipelines for both Qwen2-Math and Qwen2.5-Math are illustrated above. Building on the trained Qwen2-Math base models, the Qwen2.5-Math models were developed through three key improvements:
- Expanded Mathematical Data: Using Qwen2-Math-72B-Instruct models, additional high-quality mathematical data was synthesized for enhanced pre-training.
- Broader Data Aggregation: Aggregated more high-quality mathematical data, especially in Chinese, sourced from web content, books, and coding data across multiple data collection cycles.
- Enhanced Initialization: Leveraged the Qwen2.5 base model for parameter initialization, boosting the models’ capabilities in language understanding, code generation, and text-based reasoning.
As a result, the Qwen Math Corpus v2 was created for pre-training Qwen2.5-Math-1.5B/7B/72B models, supporting a 4K context length. Compared to Qwen Math Corpus v1, which was used for Qwen2-Math training, Corpus v2’s token count increased from 700B to over 1T.
To assess the performance of the Qwen2.5-Math base models, evaluations were conducted on three prominent English math benchmarks (GSM8K, MATH, and MMLU-STEM) and three Chinese benchmarks (CMATH, GaoKao Math Cloze, and GaoKao Math QA), all with few-shot chain-of-thought prompting.
The Qwen2.5-Math-1.5B/7B/72B models show substantial gains over Qwen2-Math-1.5B/7B/72B across all benchmarks. On the MATH benchmark, for instance, they improve by 5.4, 5.0, and 6.3 points, respectively; on GaoKao Math QA, the gains reach 3.4, 12.2, and 19.8 points, with the larger models improving most sharply.
Qwen2.5-Math-Instruct: Instruction-Tuned Models
Qwen2.5-Math-Instruct is trained on top of the Qwen2.5-Math base models, guided by a math-specific reward model, Qwen2.5-Math-RM-72B, itself built on Qwen2.5-Math-72B. The reward model drives Rejection Sampling during SFT data construction and reinforcement learning with Group Relative Policy Optimization (GRPO) after SFT. An additional refinement phase, combining the Qwen2-Math-Instruct models with Qwen2.5-Math-RM-72B during Rejection Sampling, further improves response quality.
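For context, GRPO (introduced in the DeepSeekMath work) forgoes a learned value function: it samples a group of responses per query, scores each with the reward model, and normalizes the rewards within the group. A sketch of the group-relative advantage, following that formulation:

```latex
% Group-relative advantage in GRPO: sample G responses per query,
% score each with the reward model, and normalize within the group
% (no learned value function / critic is needed).
\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_1, \ldots, r_G\}\right)}
                 {\operatorname{std}\left(\{r_1, \ldots, r_G\}\right)}
```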
Compared to Qwen2-Math, Qwen2.5-Math-Instruct incorporates TIR data and expanded SFT data in both Chinese and English. Evaluations span benchmarks in both languages: GSM8K, MATH, OlympiadBench, CollegeMath, GaoKao, AIME 2024, and AMC 2023 for English; CMATH, GaoKao 2024, and CN Middle School 24 for Chinese. The Qwen2.5-Math-72B-Instruct model demonstrates an average improvement of 4.4 points in English and 6.1 points in Chinese over Qwen2-Math-72B-Instruct, making it the leading open-source mathematical model. It even surpasses proprietary models such as GPT-4o and Gemini Math-Specialized 1.5 Pro, reaching 92.9 on MATH under the RM@8 TIR setting.
Further, the 7B version of Qwen2.5-Math-Instruct outperforms Qwen2-Math-72B-Instruct, scoring 83.6 under CoT and 85.3 under TIR on MATH. Even the smallest 1.5B model delivers impressive performance, achieving nearly 80 on MATH with a Python interpreter and outperforming many existing models.
In advanced mathematical competition benchmarks like AIME 2024 and AMC 2023, Qwen2.5-Math-Instruct achieves impressive results across a range of evaluation settings, including Greedy, Maj@64, RM@64, and RM@256. With the support of the Qwen2.5-Math-RM-72B, Qwen2.5-Math-1.5B-Instruct solves 29 out of 40 AMC 2023 problems in CoT mode using RM@256.
In TIR mode, the Qwen2.5-Math-72B-Instruct model achieves near-perfect scores, solving almost every problem on AMC 2023. On the challenging AIME 2024 benchmark, where top models such as Claude 3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro solve only 1 or 2 of the 30 problems, Qwen2.5-Math-72B-Instruct solves 9 problems in greedy-decoding CoT mode and 12 in TIR mode. With RM support, Qwen2.5-Math-7B-Instruct goes further still, solving up to 21 AIME 2024 problems, underscoring its advanced mathematical problem-solving strengths.
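For readers unfamiliar with these evaluation settings, the sketch below shows how Maj@N (majority voting over N sampled solutions) and RM@N (reward-model best-of-N selection) are typically computed. The helper callables are illustrative stand-ins, not an official evaluation harness.

```python
# Hedged sketch of the sampling-based evaluation settings named above.
# Maj@N: sample N solutions and take the most common final answer.
# RM@N:  sample N solutions and keep the one the reward model scores highest.
from collections import Counter

def maj_at_n(problem, sample_solution, extract_answer, n=64):
    # Majority voting over extracted final answers.
    answers = [extract_answer(sample_solution(problem)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def rm_at_n(problem, sample_solution, rm_score, n=64):
    # Best-of-N selection by reward-model score.
    solutions = [sample_solution(problem) for _ in range(n)]
    return max(solutions, key=lambda s: rm_score(problem, s))
```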
Decontamination Process
Decontamination is essential for unbiased model performance evaluation. Building on previous methods from Qwen2, we exclude potentially contaminated training samples through a 13-gram matching technique, enhanced with text normalization to remove irrelevant punctuation and symbols for greater precision.
To reduce false positives, especially spurious matches on common mathematical expressions, we apply an additional criterion: a sample is marked as contaminated only if its longest common subsequence ratio with an evaluation sample also exceeds 0.6. For pre-training data, we filter out potentially contaminated samples against datasets like GSM8K and MATH. In post-training data, including SFT data, RM training data, and RL query sets, we exclude any potentially contaminated problems or solutions across all evaluation datasets: GSM8K, MATH, Minerva Math, Gaokao 2023 En, Olympiad Bench, College Math, MMLU STEM, GaoKao, CMATH, CN Middle School 24, AIME 24, and AMC 23.
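The following sketch illustrates the two-stage check described above: a 13-gram overlap filter refined by an LCS-ratio threshold. The normalization and tokenization details are assumptions, not the exact production pipeline.

```python
# Illustrative two-stage decontamination check: a 13-gram overlap filter
# refined by a longest-common-subsequence (LCS) ratio threshold.
import re

def normalize(text: str) -> list[str]:
    # Strip punctuation/symbols and lowercase before matching.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def ngrams(tokens: list[str], n: int = 13) -> set[tuple]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lcs_ratio(a: list[str], b: list[str]) -> float:
    # Token-level LCS length via standard DP, normalized by the
    # shorter sequence's length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / max(1, min(len(a), len(b)))

def is_contaminated(sample: str, eval_item: str) -> bool:
    s, e = normalize(sample), normalize(eval_item)
    # Stage 1: any shared 13-gram flags the pair for closer inspection.
    if not ngrams(s) & ngrams(e):
        return False
    # Stage 2: require substantial global overlap to avoid false
    # positives from common mathematical boilerplate.
    return lcs_ratio(s, e) > 0.6
```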
Analyzing contaminated samples reveals that some training datasets, such as MATH, include numerous problems with similar concepts or structures to those in test datasets. While these are not identical duplicates, their presence may affect evaluation integrity. As a result, we continue to exclude such samples from the training corpus.
Demo
The Qwen team provides a demo built on Qwen-Agent with TIR mode, enabling users to experience the Tool-Integrated Reasoning capabilities of Qwen2.5-Math locally.
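A rough approximation of the loop such a TIR demo runs: generate until the model emits a Python block, execute it, append the interpreter output to the context, and resume generation. The function names here are illustrative, not the Qwen-Agent API.

```python
# Sketch of a tool-integrated reasoning loop: the model emits Python
# snippets, the host executes them, and the interpreter output is
# appended to the transcript before generation resumes.
import contextlib
import io
import re

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_python(code: str) -> str:
    # Execute the model's snippet and capture stdout.
    # (A real demo would sandbox this.)
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def tir_loop(generate, problem: str, max_rounds: int = 5) -> str:
    # `generate` is any callable mapping a transcript to a continuation,
    # e.g. a wrapper around model.generate from the earlier sketch.
    transcript = problem
    for _ in range(max_rounds):
        continuation = generate(transcript)
        transcript += continuation
        match = CODE_BLOCK.search(continuation)
        if match is None:  # no more code: the model has answered
            break
        output = run_python(match.group(1))
        transcript += f"\n```output\n{output}```\n"
    return transcript
```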
Multi-Modal Mathematics Demo
The team also offers a multi-modal mathematics demo on Hugging Face and ModelScope. This WebUI combines Qwen2-VL for optical character recognition (OCR) with Qwen2-Math for mathematical reasoning, letting users input images, text, or sketches of mathematical or arithmetic problems.
Summary
Qwen2.5-Math introduces several major technical advancements:
- Extensive use of synthesized mathematical data from Qwen2-Math during the pre-training phase.
- Iterative fine-tuning data generation and reinforcement training, guided by the reward model, in the post-training phase.
- Support for bilingual (English and Chinese) queries with chain-of-thought and tool-integrated reasoning capabilities.
As a result, Qwen2.5-Math stands as the most advanced open-source math model series to date. The Qwen2.5-Math-1.5B-Instruct model outperforms most previous 70B math models, while Qwen2.5-Math-7B-Instruct matches the performance of Qwen2-Math-72B-Instruct. The flagship model, Qwen2.5-Math-72B-Instruct, achieves an average score increase of 4.7 points across seven tasks compared to its predecessor.
We believe the advancements in specialized models like Qwen2.5-Math will continue to enhance the overall capabilities of the Qwen series, bringing us closer to achieving artificial general intelligence.