Z.ai: GLM 4.6V
GLM 4.6V is a multimodal model from Z.ai that accepts text, images, and video as input, with a 131,072-token context window and a maximum of 32,768 output tokens. It supports tool use and reasoning, which makes it capable of agentic and multi-step workflows. Structured output support is unconfirmed, so developers who depend on guaranteed JSON schemas should verify that independently before committing. At $0.30 per million input tokens and $0.90 per million output tokens, the pricing is competitive for a model handling video alongside text and images. However, its blended benchmark score of 16.8 across only one independent benchmark offers a limited basis for quality comparison, so performance claims should be treated as provisional. Teams processing multimodal content on a moderate budget may find it worth evaluating, but those prioritizing well-documented quality should wait for broader benchmark coverage before relying on it for critical workloads.
- Model ID
- z-ai/glm-4.6v
- Vendor
- z-ai
- Tokenizer
- Other
- Input Modalities
- image, text, video
- Output Modalities
- text
- Max Output
- 32,768 tokens
- Tool Calling
- ✓ supported
- Structured Output
- ✓ supported
- Reasoning Mode
- ✓ supported
- Vision
- ✓ accepts images
- Audio
- no
- Moderated
- no
Category rankings
Where Z.ai: GLM 4.6V places across the 3 categories it ranks in. How we rank →
| # | Category | Score |
|---|---|---|
| #20 | Social Media PostsWriting · of 25 ranked | 119 |
| #20 | Voice Assistant BackendVoice · of 25 ranked | 123 |
| #21 | Real-Time ChatLatency · of 25 ranked | 117 |