Google: Gemma 3 12B
Gemma 3 12B is a Google model that accepts both text and image inputs, making it usable for multimodal tasks without requiring a separate vision model. It supports a 131K-token context window, which is sufficient for long documents or extended conversations, and it supports tool use. It does not offer native reasoning mode, and structured output support is unconfirmed based on available data. At $0.05 per million input tokens and $0.15 per million output tokens, Gemma 3 12B sits at the budget end of the pricing spectrum. Its blended benchmark score of 3.9 comes from a single benchmark, so performance claims should be treated as preliminary rather than well-established. Developers running high-volume, cost-sensitive workloads who also need image understanding may find it worth testing, but buyers who require strong benchmark validation before committing should wait for broader coverage.
- Model ID
- google/gemma-3-12b-it
- Vendor
- Tokenizer
- Gemini
- Input Modalities
- text, image
- Output Modalities
- text
- Max Output
- 16,384 tokens
- Tool Calling
- ✓ supported
- Structured Output
- ✓ supported
- Reasoning Mode
- not supported
- Vision
- ✓ accepts images
- Audio
- no
- Moderated
- no