QwQ-32B-Preview

QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations:

  1. Language Mixing and Code-Switching: The model may mix languages or switch between them unexpectedly, affecting response clarity.
  2. Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
  3. Safety and Ethical Considerations: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
  4. Performance and Benchmark Limitations: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.

Specification

  • Type: Causal Language Model
  • Training Stage: Pretraining & Post-training
  • Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  • Number of Parameters: 32.5B
  • Number of Parameters (Non-Embedding): 31.0B
  • Number of Layers: 64
  • Number of Attention Heads (GQA): 40 for Q and 8 for KV
  • Context Length: Full 32,768 tokens
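
The grouped-query attention (GQA) layout listed above — 40 query heads sharing 8 KV heads — means each KV head serves a group of 5 query heads. The sketch below illustrates that sharing with NumPy; the head counts come from the spec, but the head dimension and sequence length are small illustrative values, not the model's actual sizes.

```python
import numpy as np

# Head counts from the spec; head_dim and seq are illustrative only.
num_q_heads, num_kv_heads, head_dim, seq = 40, 8, 8, 4
group = num_q_heads // num_kv_heads  # 5 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_q_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand each KV head across its group of query heads.
k_exp = np.repeat(k, group, axis=0)
v_exp = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per head.
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp
print(out.shape)  # (40, 4, 8)
```

Compared with full multi-head attention, this layout stores only 8 KV heads in the cache instead of 40, which is what makes GQA attractive at 32K-token context lengths.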

For more details, please refer to our blog. You can also check the Qwen2.5 GitHub repository and Documentation.

QwQ-32B-Preview Model Performance

Through extensive exploration and countless experiments, a profound realization emerged: when the model is given the opportunity to reflect, question, and deliberate, its understanding of mathematics and programming deepens significantly. Much like a diligent student who learns through careful analysis and self-correction, the model achieves remarkable insights through thoughtful introspection. This reflective process unlocks its potential to solve even the most intricate challenges.

Our journey of refinement has demonstrated the model’s exceptional capabilities in addressing complex problems in both mathematics and programming, as evidenced by its performance on the following benchmarks:

  • GPQA (Graduate-Level Google-Proof Q&A): A rigorous benchmark designed to evaluate scientific problem-solving skills using questions at a graduate-school level.
  • AIME (American Invitational Mathematics Examination): A highly challenging test covering topics such as arithmetic, algebra, geometry, number theory, and probability, rooted in secondary school mathematics.
  • MATH-500: A subset of 500 test cases from the comprehensive MATH benchmark, assessing advanced mathematical problem-solving abilities.
  • LiveCodeBench: A real-world programming benchmark that evaluates the model’s code generation and problem-solving performance in practical scenarios.

By fostering a process of deliberate reflection, the model has unlocked new levels of precision and capability, proving its excellence in tackling some of the most demanding problems across both domains.

Specifically, QwQ delivers outstanding results across these benchmarks, achieving:

  • 65.2% on GPQA, reflecting its graduate-level scientific reasoning skills.
  • 50.0% on AIME, highlighting its robust mathematical problem-solving abilities across complex topics.
  • 90.6% on MATH-500, demonstrating exceptional comprehension and versatility in advanced mathematics.
  • 50.0% on LiveCodeBench, validating its strong programming capabilities in real-world coding scenarios.

These impressive scores illustrate QwQ’s remarkable advancements in analytical and problem-solving capabilities, especially in technical fields demanding deep reasoning and precision.

Links