Z.ai: GLM 4.6V

GLM 4.6V is a multimodal model from Z.ai that accepts text, images, and video as input and produces text, with a 131,072-token context window and a maximum of 32,768 output tokens. It supports tool calling and a reasoning mode, which suits it to agentic and multi-step workflows. The specifications below list structured output as supported, but developers who depend on guaranteed JSON schemas should still verify that behavior independently before committing. At $0.30 per million input tokens and $0.90 per million output tokens, the pricing is competitive for a model that handles video alongside text and images. However, its blended benchmark score of 16.8 rests on a single independent benchmark, which offers a limited basis for quality comparison, so performance claims should be treated as provisional. Teams processing multimodal content on a moderate budget may find it worth evaluating, but those prioritizing well-documented quality should wait for broader benchmark coverage before relying on it for critical workloads.
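To make the pricing concrete, the per-request cost can be estimated from the two listed rates. The following is a minimal sketch: the function name `estimate_cost` and the example token counts are illustrative assumptions, and it assumes billing is strictly linear in token count with no separate charges for images, video, or reasoning tokens, which the listing does not confirm.

```python
# Listed rates in USD per 1M tokens (from the pricing above).
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 0.90


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request.

    Hypothetical helper: assumes cost scales linearly with token counts
    and that no other fees apply.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Example figures chosen for illustration only.
    cost = estimate_cost(input_tokens=100_000, output_tokens=10_000)
    print(f"${cost:.4f}")  # prints "$0.0390"
```

Because output tokens cost three times as much as input tokens, workloads that generate long responses will be priced mostly by the output side.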

Quality Score: 100/100 (price + capability + benchmarks)
Input Price: $0.30 per 1M tokens
Output Price: $0.90 per 1M tokens
Context Window: 131,072 tokens
Model ID: z-ai/glm-4.6v
Vendor: z-ai
Tokenizer: Other
Input Modalities: image, text, video
Output Modalities: text
Max Output: 32,768 tokens
Tool Calling: ✓ supported
Structured Output: ✓ supported
Reasoning Mode: ✓ supported
Vision: ✓ accepts images
Audio: no
Moderated: no
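The context window and max output limits above can be combined into a simple pre-flight check before sending a request. This is a sketch under stated assumptions: the helper name `clamp_max_tokens` is hypothetical, and it assumes the prompt and the generated output share the 131,072-token context window, which is common for models of this kind but is not stated in the listing.

```python
# Limits taken from the specification list above.
CONTEXT_WINDOW = 131_072
MAX_OUTPUT = 32_768


def clamp_max_tokens(prompt_tokens: int, requested_output: int) -> int:
    """Return an output-token limit that fits the listed constraints.

    Hypothetical helper: assumes prompt and output tokens both count
    against the same context window.
    """
    if prompt_tokens < 0 or requested_output < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    # The tightest of the three bounds wins.
    return min(requested_output, MAX_OUTPUT, remaining)


if __name__ == "__main__":
    print(clamp_max_tokens(1_000, 50_000))     # capped by MAX_OUTPUT
    print(clamp_max_tokens(120_000, 32_768))   # capped by remaining context
```

Prompt token counts would need to come from the provider's own usage reporting, since the tokenizer is listed only as "Other".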

Category rankings

Where Z.ai: GLM 4.6V places across the 3 categories it ranks in.

#            Category                             Score
#20 of 25    Social Media Posts (Writing)         119
#20 of 25    Voice Assistant Backend (Voice)      123
#21 of 25    Real-Time Chat (Latency)             117

Similar models