Qwen2.5-Coder-32B-Instruct

The time has come: November 11! The perfect day to release Alibaba Qwen's best coder model ever, Qwen2.5-Coder-32B-Instruct! But wait, it's more than one big coder: it's a whole family of coder models! Alongside the 32B flagship, there are coders at 0.5B, 1.5B, 3B, 7B, and 14B! As usual, the dev team shares not only base and instruct models, but also quantized variants in GPTQ, AWQ, and the popular GGUF formats!

Qwen2.5-Coder-32B-Instruct model overview

The flagship model, Qwen2.5-Coder-32B-Instruct, delivers top-tier performance, highly competitive with (and in some cases surpassing) proprietary models like GPT-4o across a series of benchmark evaluations, including HumanEval, MBPP, LiveCodeBench, BigCodeBench, McEval, and Aider. It scores 92.7 on HumanEval, 90.2 on MBPP, 31.4 on LiveCodeBench, 73.7 on Aider, 85.1 on Spider, and 68.9 on CodeArena!

Introduction to the Qwen2.5-Coder-32B-Instruct Model

The Alibaba Qwen team is thrilled to open-source the Qwen2.5-Coder series, designed as a “Powerful,” “Diverse,” and “Practical” suite of models to drive the progress of Open CodeLLMs.

Powerful: The Qwen2.5-Coder-32B-Instruct model stands as the current state-of-the-art (SOTA) open-source code model, rivaling GPT-4o in coding capability. Not only does it exhibit exceptional and versatile coding performance, but it also demonstrates robust general and mathematical abilities.

Diverse: Expanding on the previously released 1.5B and 7B models, this release introduces four additional sizes: 0.5B, 3B, 14B, and 32B. With these options, Qwen2.5-Coder now covers six key model sizes, allowing developers to select the ideal model to suit their specific project requirements.

Practical: The Qwen2.5-Coder series has been tested for practicality across scenarios such as code assistants and Artifacts, with examples that highlight how these models can be applied effectively in real-world development contexts.

With this release, we’re eager to see the innovative ways developers will leverage the Qwen2.5-Coder models.
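
To make that concrete, here is a minimal sketch of chatting with the flagship model through Hugging Face transformers. It assumes the public model ID Qwen/Qwen2.5-Coder-32B-Instruct and enough GPU memory for a 32B model; on smaller machines, the smaller sizes or the GPTQ/AWQ/GGUF quantizations are drop-in alternatives.

```python
# Minimal sketch: chat with Qwen2.5-Coder-32B-Instruct via Hugging Face transformers.
# Assumes a GPU with enough memory (or substitute a smaller / quantized variant).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding so only the reply is printed.
reply = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(reply)
```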

Code capabilities reach SOTA for open-source models

Code Generation: As the flagship model of this release, Qwen2.5-Coder-32B-Instruct excels in code generation, achieving top performance among open-source models across multiple popular benchmarks, including EvalPlus, LiveCodeBench, and BigCodeBench. Its capabilities in generating efficient and accurate code rival those of proprietary models like GPT-4o, making it a strong choice for code generation tasks.

Code Repair: Error correction is a fundamental programming skill, and Qwen2.5-Coder-32B-Instruct is designed to assist users in swiftly identifying and fixing code errors. On the popular Aider benchmark for code repair, Qwen2.5-Coder-32B-Instruct achieved a score of 73.7, showcasing performance comparable to GPT-4o's. This makes it a valuable tool for improving code reliability and reducing debugging time.

Code Reasoning: Code reasoning, or the ability to understand the execution flow of code and accurately predict input-output relationships, is another strength of the Qwen2.5-Coder series. While the 7B model has already demonstrated notable performance in code reasoning tasks, the 32B model builds on these capabilities, offering even more accurate and insightful predictions.

These advancements highlight Qwen2.5-Coder-32B-Instruct as a powerful resource in both code generation and error resolution, contributing meaningfully to more efficient programming workflows.

[Figure: code generation and error-resolution capabilities]
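
As a hedged illustration of the code-repair use case, the snippet below builds a repair-style chat prompt. The buggy function and the instruction wording are invented for the example; the messages would be fed through the same apply_chat_template/generate pipeline as in the quick-start sketch above.

```python
# Illustrative code-repair prompt; the buggy snippet and wording are examples,
# not a fixed interface.
buggy = '''def average(xs):
    return sum(xs) / len(xs) + 1  # bug: stray "+ 1" skews the mean
'''

messages = [
    {"role": "system", "content": "You are a code-repair assistant. Return only the fixed code."},
    {"role": "user", "content": f"This function should return the arithmetic mean but does not. Fix it:\n\n{buggy}"},
]
# Pass `messages` through tokenizer.apply_chat_template(...) and model.generate(...)
# exactly as in the quick-start sketch above.
```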

Multiple Programming Languages: A highly capable programming assistant must be proficient across a wide range of programming languages, and Qwen2.5-Coder-32B-Instruct rises to the challenge. This model performs exceptionally well in over 40 languages, including niche languages like Haskell and Racket. It achieved an impressive score of 65.9 on the McEval benchmark, underscoring its versatility and adaptability to diverse coding environments. This success can be attributed to meticulous data cleaning and balancing strategies employed during pre-training, which ensure that the model’s capabilities are robust and well-rounded across various programming languages.

With these strengths, Qwen2.5-Coder-32B-Instruct is positioned as a powerful, language-agnostic code assistant that meets the needs of developers working across different programming paradigms.

[Figure: multiple programming languages support]

Enhanced Multi-Language Code Repair: Qwen2.5-Coder-32B-Instruct also excels in multi-language code repair, enabling developers to understand and refine code across a wide array of programming languages. This feature significantly lowers the learning curve for unfamiliar languages, making it easier for users to troubleshoot and modify code in languages they may not know deeply. In the multi-language code repair benchmark MdEval, Qwen2.5-Coder-32B-Instruct achieved an impressive score of 75.2, securing the top position among all open-source models. This high performance not only highlights the model’s adaptability but also underscores its practical value for developers working in diverse coding environments.

[Figure: enhanced multi-language code repair capabilities]

Human Preference Alignment in Qwen2.5-Coder-32B-Instruct: To assess how well Qwen2.5-Coder-32B-Instruct aligns with human coding preferences, the development team designed an annotated internal benchmark named Code Arena. This benchmark mirrors the structure of Arena Hard and provides a structured evaluation of code alignment. Using GPT-4o as the evaluation model, the assessment follows an “A vs. B win” approach, where the test set records the percentage of instances in which model A’s code solutions are rated higher than model B’s.

This approach reveals valuable insights into the alignment capabilities of Qwen2.5-Coder-32B-Instruct, with results that underscore its advantages in human preference alignment. The strong performance on Code Arena confirms the model’s ability to produce code that resonates well with user expectations and preferred solutions.

[Figure: human preference alignment in Qwen2.5-Coder-32B-Instruct]
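
Since Code Arena itself is internal, the following is only a sketch of the win-rate bookkeeping that the “A vs. B win” protocol implies; the judge here is a stand-in callable, not the actual GPT-4o judging prompt.

```python
# Hedged sketch of the "A vs. B win" tally described above. judge_prefers_a is
# a placeholder for the real judge (GPT-4o in Code Arena).
def win_rate(pairs, judge_prefers_a):
    """pairs: list of (solution_a, solution_b) answering the same prompt.
    judge_prefers_a: callable returning True when the judge rates A's solution higher."""
    wins = sum(1 for a, b in pairs if judge_prefers_a(a, b))
    return wins / len(pairs)

# Example with a trivial stand-in judge: prefer the shorter solution.
pairs = [("def f(): return 1", "def f():\n    x = 1\n    return x")]
print(win_rate(pairs, lambda a, b: len(a) < len(b)))  # 1.0
```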

Diverse Range of Model Sizes in Qwen2.5-Coder

In the latest release, Qwen2.5-Coder offers a comprehensive selection of model sizes: 0.5B, 1.5B, 3B, 7B, 14B, and 32B parameters. This diverse range not only accommodates developers with varying computational resources but also provides a valuable experimental platform for the research community to further explore advancements in code language models. Below is a table summarizing key specifications for each model:

| Model | Params | Non-Emb Params | Layers | Heads (KV) | Tie Embedding | Context Length | License |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-0.5B | 0.49B | 0.36B | 24 | 14 / 2 | Yes | 32K | Apache 2.0 |
| Qwen2.5-Coder-1.5B | 1.54B | 1.31B | 28 | 12 / 2 | Yes | 32K | Apache 2.0 |
| Qwen2.5-Coder-3B | 3.09B | 2.77B | 36 | 16 / 2 | Yes | 32K | Qwen Research |
| Qwen2.5-Coder-7B | 7.61B | 6.53B | 28 | 28 / 4 | No | 128K | Apache 2.0 |
| Qwen2.5-Coder-14B | 14.7B | 13.1B | 48 | 40 / 8 | No | 128K | Apache 2.0 |
| Qwen2.5-Coder-32B | 32.5B | 31.0B | 64 | 40 / 8 | No | 128K | Apache 2.0 |
[Figure: diverse range of model sizes in the Qwen2.5-Coder series]
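
When choosing a size, a rough weights-only memory estimate helps. The sketch below multiplies the parameter counts from the table by bytes per weight for a few common formats; it ignores KV cache and activation memory, so treat the numbers as lower bounds.

```python
# Back-of-envelope weight-memory estimate (weights only; KV cache and
# activations add more). Parameter counts come from the table above.
SIZES_B = {"0.5B": 0.49, "1.5B": 1.54, "3B": 3.09, "7B": 7.61, "14B": 14.7, "32B": 32.5}
BYTES = {"fp16/bf16": 2.0, "int8": 1.0, "int4 (GPTQ/AWQ)": 0.5}

for name, params in SIZES_B.items():
    line = ", ".join(f"{fmt}: {params * 1e9 * b / 2**30:.1f} GiB" for fmt, b in BYTES.items())
    print(f"Qwen2.5-Coder-{name}: {line}")
```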

Following the principles of Scaling Laws, Qwen2.5-Coder models have been rigorously evaluated across multiple datasets, affirming that model performance improves consistently with scale. Both Base and Instruct versions of each model size have been made available. The Base model offers developers a versatile foundation for custom fine-tuning, while the Instruct model is optimized for aligned, conversational interactions out of the box.

This structured, open-source approach reaffirms the Qwen project’s commitment to advancing accessible, scalable, and powerful tools for code language modeling.

Here is the performance of the Base models at each size: [Figure: Base model benchmark results]

Here is the performance of the Instruct models at each size: [Figure: Instruct model benchmark results]

Qwen2.5-Coder Model Performance Comparison

In this release, we compare various sizes of Qwen2.5-Coder with other open-source models across key datasets, demonstrating its strong performance. Here’s a breakdown of the evaluations for the Base and Instruct models:

  • Base Model Evaluation: For assessing base model capabilities, we used the MBPP-3shot metric, chosen for its strong correlation with actual model performance in coding tasks. Extensive testing has shown that MBPP-3shot is especially suitable for evaluating base models, offering a reliable benchmark for general code generation abilities; a prompt-construction sketch follows this list.
  • Instruct Model Evaluation: To gauge out-of-distribution (OOD) capabilities, we evaluated the Instruct model using LiveCodeBench questions from the most recent four months (July to November 2024). These questions were selected as they are new and, therefore, highly unlikely to have influenced the training data, giving a fresh perspective on the model’s ability to handle unseen challenges.
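
For intuition, here is an illustrative sketch of how a 3-shot prompt is typically assembled for MBPP-style evaluation. The three exemplar tasks are invented for the example; a real harness draws them from the benchmark itself, and exact formatting varies between harnesses.

```python
# Illustrative 3-shot prompt construction in the style of MBPP few-shot
# evaluation. The exemplar problems are made up for this sketch.
SHOTS = [
    ("Write a function to add two numbers.", "def add(a, b):\n    return a + b"),
    ("Write a function to reverse a string.", "def rev(s):\n    return s[::-1]"),
    ("Write a function to square a number.", "def square(x):\n    return x * x"),
]

def build_3shot_prompt(task: str) -> str:
    # Each shot shows a task followed by its solution; the final task is left
    # open for the model to complete.
    parts = [f"# Task: {q}\n{a}\n" for q, a in SHOTS]
    parts.append(f"# Task: {task}\n")
    return "\n".join(parts)

print(build_3shot_prompt("Write a function to compute the factorial of n."))
```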

Our results reveal a consistent positive correlation between model size and performance, with Qwen2.5-Coder models achieving state-of-the-art (SOTA) performance across all sizes. This robust performance across the spectrum from small to large models motivates us to continue scaling up Qwen2.5-Coder, exploring even larger model sizes to push the boundaries of open-source code language models.

These findings underscore Qwen2.5-Coder's versatility and scalability, offering powerful tools for developers across diverse computational environments.

[Figure: Qwen2.5-Coder model performance comparison]

Practical: Encountering Cursor and Artifacts

The vision behind Qwen2.5-Coder has always been to create a practical, versatile tool that developers can use in real-world scenarios. With this in mind, we explored its performance in real-life coding environments, specifically focusing on code assistants and artifacts.

Qwen2.5-Coder & Cursor

Code assistants have become an essential tool for developers, helping streamline coding tasks and improve productivity. However, many current options rely on closed-source models, limiting accessibility and transparency. With the release of Qwen2.5-Coder, we aim to provide developers with a powerful and open-source alternative.

A key use case for Qwen2.5-Coder is its integration with Cursor, a popular code editor. By leveraging Qwen2.5-Coder, Cursor becomes an even more effective tool for developers, offering intelligent code suggestions, debugging assistance, and even auto-completion in a variety of programming languages. This collaboration ensures that developers have access to a flexible, cost-effective coding assistant that aligns with the ethos of open-source software.

Here’s a glimpse into how Qwen2.5-Coder enhances the coding experience within the Cursor editor, providing a seamless, powerful environment for developers to write, optimize, and debug code. With these advancements, we hope to empower developers by combining robust functionality with the freedom of open-source tools.

[Example: Qwen2.5-Coder & Cursor]
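
One way to reproduce this setup yourself is to serve the model behind an OpenAI-compatible endpoint (for example with an inference server such as vLLM) and point Cursor's custom API settings at it. The sketch below shows only the client side; the base URL, port, and served model name are assumptions for illustration.

```python
# Sketch: talk to a locally served Qwen2.5-Coder through an OpenAI-compatible
# endpoint. The base_url, port, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Suggest a docstring for: def gcd(a, b): ..."}],
)
print(resp.choices[0].message.content)
```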

Additionally, Qwen2.5-Coder-32B has shown exceptional code completion capabilities, setting a new standard in performance across several key benchmarks. It has achieved SOTA (state-of-the-art) results on five prominent code completion benchmarks: Humaneval-Infilling, CrossCodeEval, CrossCodeLongEval, RepoEval, and SAFIM.

To ensure a fair and consistent evaluation, we capped the maximum sequence length at 8k tokens and used Fill-in-the-Middle (FIM) mode during testing. This allowed for a focused assessment of the model’s ability to complete code snippets efficiently and accurately.
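
To illustrate what Fill-in-the-Middle mode looks like in practice, here is a small sketch using the FIM control tokens documented for the Qwen2.5-Coder base models; the surrounding code is an invented example.

```python
# Fill-in-the-Middle prompt construction: the model is asked to generate the
# code that belongs between the prefix and the suffix.
prefix = "def fibonacci(n):\n    if n <= 1:\n        return n\n"
suffix = "\nprint(fibonacci(10))"
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Feed fim_prompt to a Qwen2.5-Coder *base* model (e.g. Qwen/Qwen2.5-Coder-32B)
# with a plain generate() call; the generated text is the missing middle span.
print(fim_prompt)
```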

In terms of evaluating the generated content, we used Exact Match (EM) as the metric for four benchmarks: CrossCodeEval, CrossCodeLongEval, RepoEval, and Humaneval-Infilling. This measure evaluates whether the output generated by the model exactly matches the ground truth labels. For SAFIM, a Pass@1 metric was used, which measures the model’s success rate in achieving a correct solution on its first attempt.
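
For intuition, minimal versions of the two metrics might look like the sketch below; real harnesses add normalization and per-benchmark details.

```python
# Minimal versions of the two metrics named above, for intuition only.
def exact_match(preds, golds):
    """Fraction of predictions identical to the reference (after whitespace strip)."""
    return sum(p.strip() == g.strip() for p, g in zip(preds, golds)) / len(golds)

def pass_at_1(first_attempt_passed):
    """Fraction of problems whose first sampled solution passes its tests."""
    return sum(first_attempt_passed) / len(first_attempt_passed)

print(exact_match(["return x + 1"], ["return x + 1"]))  # 1.0
print(pass_at_1([True, False, True, True]))             # 0.75
```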

The impressive performance of Qwen2.5-Coder-32B across these benchmarks highlights its effectiveness in code completion tasks, establishing it as a top contender in the open-source space for developers seeking accurate and efficient code generation solutions.

[Figure: Qwen2.5-Coder-32B code completion results]

Qwen2.5-Coder & Artifacts

Artifacts play a crucial role in the world of code generation, particularly in aiding users to create visually compelling works through code. As part of exploring Qwen2.5-Coder’s capabilities in this area, we selected Open WebUI to delve into how the model can be leveraged for generating artifacts.

In the Artifacts scenario, Qwen2.5-Coder has shown impressive potential by assisting in the creation of visual projects, ranging from graphical user interfaces (GUIs) to interactive websites and even visual elements for larger applications. Here are some specific examples of how Qwen2.5-Coder can be applied:

  1. Automated UI Design: Using Qwen2.5-Coder, developers can input requirements for a website or application interface, and the model will generate the underlying code for the layout and design. This includes HTML, CSS, and JavaScript, making it easier to translate a visual concept into functional code.
  2. Data Visualizations: Qwen2.5-Coder is also capable of helping users generate code to produce complex data visualizations. Whether it’s graphs, charts, or interactive dashboards, the model can write the necessary code to bring data to life.
  3. Game Assets: In the realm of game development, Qwen2.5-Coder can help generate code for interactive environments and even assist in the procedural generation of visual assets such as textures, animations, and more.

These examples demonstrate how Qwen2.5-Coder isn’t just limited to traditional coding tasks but is also adept at assisting in the creation of artifacts that have real-world applications in design, development, and creative industries. As this technology continues to evolve, we can expect even more powerful integrations in various creative workflows.

[Example: Three-Body Problem simulation]
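
As a hedged illustration of an Artifacts-style request like the one above, the prompt below asks for a single self-contained HTML file so the result can be rendered directly (for example in Open WebUI); the wording is an example, not a fixed interface.

```python
# Illustrative Artifacts-style prompt: request one self-contained HTML file so
# the reply can be saved and opened directly in a browser.
messages = [
    {"role": "user", "content": (
        "Create a single-file HTML page (inline CSS and JavaScript, no external "
        "dependencies) that animates a simple three-body gravity simulation on a <canvas>."
    )},
]
# Send `messages` through the chat pipeline from the quick-start sketch and save
# the model's reply as index.html.
```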

We will soon be launching the code mode on the official Tongyi website at https://tongyi.aliyun.com, which will support one-click generation of websites, mini-games, data charts, and other visual applications. We invite everyone to try it out!

Model License

Qwen2.5-Coder models with 0.5B, 1.5B, 7B, 14B, and 32B parameters are licensed under Apache 2.0, while the 3B model is licensed under the Qwen-Research license.

What’s Next for Qwen-Coder?

We believe this release will significantly assist developers and help explore more innovative application scenarios with the community. Furthermore, we are working on advanced reasoning models focused on code, and we look forward to sharing more exciting updates with you soon!
