Best AI model for Experiment Design (2026)
Designing rigorous A/B and lab experiments. Ranked from 346 live models on the OpenRouter catalog, weighted for reasoning quality and structured output.
| # | Model | Model ID | Score | Input / 1M tokens | Output / 1M tokens | Context (tokens) |
|---|---|---|---|---|---|---|
| 1 | MoonshotAI: Kimi K2.6 | moonshotai/kimi-k2.6 | 120 | $0.80 | $3.50 | 262,144 |
| 2 | Google: Gemma 4 26B A4B (free) | google/gemma-4-26b-a4b-it:free | 120 | Free | Free | 262,144 |
| 3 | Google: Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | 120 | $0.07 | $0.35 | 262,144 |
| 4 | Google: Gemma 4 31B (free) | google/gemma-4-31b-it:free | 120 | Free | Free | 262,144 |
| 5 | Google: Gemma 4 31B | google/gemma-4-31b-it | 120 | $0.13 | $0.38 | 262,144 |
| 6 | Qwen: Qwen3.6 Plus | qwen/qwen3.6-plus | 120 | $0.33 | $1.95 | 1,000,000 |
| 7 | Z.ai: GLM 5V Turbo | z-ai/glm-5v-turbo | 120 | $1.20 | $4.00 | 202,752 |
| 8 | xAI: Grok 4.20 | x-ai/grok-4.20 | 120 | $2.00 | $6.00 | 2,000,000 |
| 9 | Xiaomi: MiMo-V2-Omni | xiaomi/mimo-v2-omni | 120 | $0.40 | $2.00 | 262,144 |
| 10 | OpenAI: GPT-5.4 Nano | openai/gpt-5.4-nano | 120 | $0.20 | $1.25 | 400,000 |
| 11 | OpenAI: GPT-5.4 Mini | openai/gpt-5.4-mini | 120 | $0.75 | $4.50 | 400,000 |
| 12 | Mistral: Mistral Small 4 | mistralai/mistral-small-2603 | 120 | $0.15 | $0.60 | 262,144 |
| 13 | ByteDance Seed: Seed-2.0-Lite | bytedance-seed/seed-2.0-lite | 120 | $0.25 | $2.00 | 262,144 |
| 14 | Qwen: Qwen3.5-9B | qwen/qwen3.5-9b | 120 | $0.10 | $0.15 | 262,144 |
| 15 | OpenAI: GPT-5.4 | openai/gpt-5.4 | 120 | $2.50 | $15.00 | 1,050,000 |
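Since every model in the table shares the same score, the per-token prices are the practical differentiator. The sketch below turns the two price columns into a rough per-request cost. The prices are copied from the table; the token counts (2,000 in, 1,000 out) and the function name `request_cost_usd` are illustrative assumptions, not figures from this page.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES_PER_M = {
    "moonshotai/kimi-k2.6": (0.80, 3.50),
    "qwen/qwen3.5-9b": (0.10, 0.15),
    "openai/gpt-5.4": (2.50, 15.00),
}


def request_cost_usd(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from per-million-token prices."""
    in_price, out_price = PRICES_PER_M[model_id]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: a 2,000-token experiment brief and a 1,000-token design plan.
    for model_id in PRICES_PER_M:
        cost = request_cost_usd(model_id, input_tokens=2_000, output_tokens=1_000)
        print(f"{model_id}: ${cost:.5f} per request")
```

Under those assumed token counts, the gap between the cheapest and the most expensive of these three models is well over an order of magnitude per request, which may matter more than the tied score when running many design iterations.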
How we ranked these
For Experiment Design, we weight models on reasoning quality and structured output. A higher score is better. Scores combine OpenRouter's model metadata (context length, modality support, tool calling, structured output, reasoning capability) with public pricing.
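To make the idea of a metadata-weighted score concrete, here is a minimal sketch of one way such a score could be computed. It is not the formula used for this ranking: the weights, the thresholds, the `ModelInfo` fields, and the example models are all hypothetical, and modality support is left out for brevity. The only thing taken from the text above is that reasoning and structured output carry the most weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical subset of catalog metadata for one model."""
    model_id: str
    context_length: int
    supports_tools: bool
    supports_structured_output: bool
    supports_reasoning: bool
    output_price_per_m: float  # USD per 1M output tokens


# Assumed weights: reasoning and structured output dominate, as the task requires.
WEIGHTS = {
    "reasoning": 3,
    "structured_output": 3,
    "tools": 1,
    "long_context": 1,
    "low_price": 1,
}
LONG_CONTEXT_TOKENS = 200_000  # assumed threshold for "long context"
LOW_PRICE_USD = 1.00           # assumed threshold for "cheap output"


def score(model: ModelInfo) -> int:
    """Sum the weights of every criterion the model satisfies."""
    points = 0
    if model.supports_reasoning:
        points += WEIGHTS["reasoning"]
    if model.supports_structured_output:
        points += WEIGHTS["structured_output"]
    if model.supports_tools:
        points += WEIGHTS["tools"]
    if model.context_length >= LONG_CONTEXT_TOKENS:
        points += WEIGHTS["long_context"]
    if model.output_price_per_m <= LOW_PRICE_USD:
        points += WEIGHTS["low_price"]
    return points


def rank(models: list[ModelInfo]) -> list[ModelInfo]:
    """Order models from highest to lowest score (ties keep input order)."""
    return sorted(models, key=score, reverse=True)


# Two made-up models, used only to exercise the functions.
model_a = ModelInfo("example/model-a", 262_144, True, True, True, 0.15)
model_b = ModelInfo("example/model-b", 32_000, False, True, False, 5.00)

if __name__ == "__main__":
    for m in rank([model_b, model_a]):
        print(m.model_id, score(m))
```

A capability-checklist score like this saturates quickly: any model that ticks every box gets the maximum, which is one plausible reason a ranking built mostly on metadata flags can show many identical scores.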
Related tasks (Research)
Best for Math Proofs: Formal proof construction and verification.
Best for Scientific Coding: Research-grade code in NumPy, JAX, and PyTorch.
Best for Literature Review: Synthesizing across many academic papers.
Best for Dataset Annotation: Annotating training data at scale.