OpenAI: GPT Audio
GPT Audio is OpenAI's model for workflows that combine text and audio, accepting both modalities as input within a 128,000-token context window. It supports tool calling, which makes it usable in agentic pipelines, but it does not offer a reasoning mode, and its listed structured output support has not been confirmed. Completions are capped at 16,384 tokens per response. At $2.50 per million input tokens and $10.00 per million output tokens, it sits in a mid-to-upper price range, and there is currently no independent benchmark coverage to show where it stands against alternatives. Buyers who need native audio comprehension alongside text in a single API call have limited options, so GPT Audio may be worth shortlisting on capability fit alone. That said, without benchmark data its performance claims are unverified, and teams with tight budgets or strict quality thresholds should treat it as an early-stage choice until independent evaluations are available.
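To make the pricing concrete, here is a minimal sketch of a per-request cost estimate built from the figures quoted above. The function name `estimate_cost` and the constants are illustrative, not part of any API. It assumes the listed rates apply uniformly to every input and output token (audio tokens may be billed differently in practice) and that the input alone must fit in the context window.

```python
from decimal import Decimal

# Figures quoted in the description above.
INPUT_PRICE_PER_MILLION = Decimal("2.50")    # USD per 1M input tokens
OUTPUT_PRICE_PER_MILLION = Decimal("10.00")  # USD per 1M output tokens
CONTEXT_WINDOW = 128_000                     # tokens
MAX_OUTPUT_TOKENS = 16_384                   # tokens per response

ONE_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: the listed per-million rates apply to all tokens,
    text and audio alike.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > CONTEXT_WINDOW:
        raise ValueError("input exceeds the 128,000-token context window")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the 16,384-token completion cap")

    input_cost = Decimal(input_tokens) * INPUT_PRICE_PER_MILLION / ONE_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_PRICE_PER_MILLION / ONE_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # A large request: 100,000 input tokens and a maximum-length completion.
    print(estimate_cost(100_000, 16_384))
```

Under these assumptions, a request with 100,000 input tokens and a full 16,384-token completion comes to roughly $0.41, with the output accounting for about 40% of the total despite being a much smaller share of the tokens.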
- Model ID: openai/gpt-audio
- Vendor: openai
- Tokenizer: GPT
- Input Modalities: text, audio
- Output Modalities: text, audio
- Max Output: 16,384 tokens
- Tool Calling: ✓ supported
- Structured Output: ✓ supported
- Reasoning Mode: not supported
- Vision: text only
- Audio: ✓ accepts audio
- Moderated: yes
Category rankings
Where OpenAI: GPT Audio places across the 3 categories it ranks in.
| Rank | Category | Group | Score |
|---|---|---|---|
| #18 of 19 | Audio Summarization | Voice | 104 |
| #18 of 19 | TTS Replacement | Voice | 99 |
| #19 of 19 | Transcription | Voice | 99 |