Best AI model for Math Proofs (2026)
Formal proof construction and verification. Ranked from 346 live models on the OpenRouter catalog, weighted for reasoning quality and context window.
| # | Model | Model ID | Score | Input / 1M tokens | Output / 1M tokens | Context (tokens) |
|---|---|---|---|---|---|---|
| 1 | Qwen: Qwen3.6 Plus | qwen/qwen3.6-plus | 128 | $0.33 | $1.95 | 1,000,000 |
| 2 | xAI: Grok 4.20 | x-ai/grok-4.20 | 128 | $2.00 | $6.00 | 2,000,000 |
| 3 | OpenAI: GPT-5.4 Nano | openai/gpt-5.4-nano | 128 | $0.20 | $1.25 | 400,000 |
| 4 | OpenAI: GPT-5.4 Mini | openai/gpt-5.4-mini | 128 | $0.75 | $4.50 | 400,000 |
| 5 | OpenAI: GPT-5.4 | openai/gpt-5.4 | 128 | $2.50 | $15.00 | 1,050,000 |
| 6 | Google: Gemini 3.1 Flash Lite Preview | google/gemini-3.1-flash-lite-preview | 128 | $0.25 | $1.50 | 1,048,576 |
| 7 | Qwen: Qwen3.5-Flash | qwen/qwen3.5-flash-02-23 | 128 | $0.07 | $0.26 | 1,000,000 |
| 8 | Google: Gemini 3.1 Pro Preview Custom Tools | google/gemini-3.1-pro-preview-customtools | 128 | $2.00 | $12.00 | 1,048,576 |
| 9 | OpenAI: GPT-5.3-Codex | openai/gpt-5.3-codex | 128 | $1.75 | $14.00 | 400,000 |
| 10 | Google: Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | 128 | $2.00 | $12.00 | 1,048,576 |
| 11 | Qwen: Qwen3.5 Plus 2026-02-15 | qwen/qwen3.5-plus-02-15 | 128 | $0.26 | $1.56 | 1,000,000 |
| 12 | Google: Gemini 3 Flash Preview | google/gemini-3-flash-preview | 128 | $0.50 | $3.00 | 1,048,576 |
| 13 | OpenAI: GPT-5.2 | openai/gpt-5.2 | 128 | $1.75 | $14.00 | 400,000 |
| 14 | Amazon: Nova 2 Lite | amazon/nova-2-lite-v1 | 128 | $0.30 | $2.50 | 1,000,000 |
| 15 | xAI: Grok 4.1 Fast | x-ai/grok-4.1-fast | 128 | $0.20 | $0.50 | 2,000,000 |
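All fifteen entries share a score of 128, so price is one practical way to tell them apart. The sketch below estimates the cost of a single request from the per-million-token prices in the table. The three model IDs and their prices are copied from the table; the function name `request_cost` and the token counts (20,000 in, 5,000 out) are assumptions made up for illustration, not figures from this page.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "qwen/qwen3.6-plus": (0.33, 1.95),
    "openai/gpt-5.4": (2.50, 15.00),
    "qwen/qwen3.5-flash-02-23": (0.07, 0.26),
}


def request_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    price_in, price_out = PRICES[model_id]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a 20k-token prompt and a 5k-token proof attempt.
    for model_id in PRICES:
        cost = request_cost(model_id, 20_000, 5_000)
        print(f"{model_id}: ${cost:.4f}")
```

Proof attempts tend to be output-heavy, so the output price is likely to dominate the estimate for long derivations.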
How we ranked these
For Math Proofs, we weight models on reasoning quality and context window; a higher score is better. Scores combine OpenRouter's model metadata (context length, modality support, tool calling, structured output, reasoning capability) with public pricing.
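The exact formula is not given here. As a rough illustration of how a weighted score over metadata signals might be put together, here is a minimal sketch. Everything in it is an assumption: the `ModelMeta` fields, the weights, the context cap, and the omission of pricing are all hypothetical, and it is not expected to reproduce the score of 128 shown in the table.

```python
from dataclasses import dataclass


@dataclass
class ModelMeta:
    """Hypothetical subset of catalog metadata for one model."""
    context_length: int
    supports_reasoning: bool
    supports_tools: bool
    supports_structured_output: bool


# Hypothetical weights; reasoning and context dominate, as the text describes.
WEIGHTS = {"reasoning": 50.0, "context": 30.0, "tools": 10.0, "structured": 10.0}

# Assumed cap so very large context windows do not dominate the score.
CONTEXT_CAP = 1_000_000


def score(meta: ModelMeta) -> float:
    """Weighted sum of metadata signals, each scaled to the range [0, 1]."""
    context_signal = min(meta.context_length, CONTEXT_CAP) / CONTEXT_CAP
    return (
        WEIGHTS["reasoning"] * float(meta.supports_reasoning)
        + WEIGHTS["context"] * context_signal
        + WEIGHTS["tools"] * float(meta.supports_tools)
        + WEIGHTS["structured"] * float(meta.supports_structured_output)
    )


if __name__ == "__main__":
    full = ModelMeta(1_000_000, True, True, True)
    small = ModelMeta(400_000, True, False, False)
    print(score(full), score(small))
```

With a cap like this, any two models at or above the cap that support the same features would tie, which is one way identical scores can arise across differently priced models.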
Related tasks (Research)
Best for Scientific Coding: research-grade code in NumPy, JAX, and PyTorch.
Best for Literature Review: synthesizing across many academic papers.
Best for Experiment Design: designing rigorous A/B and lab experiments.
Best for Dataset Annotation: annotating training data at scale.