OpenAI: GPT Audio

GPT Audio is OpenAI's model for workflows that combine text and audio: it accepts both modalities as input, can return both as output, and works within a 128,000-token context window. It supports tool calling, which makes it usable in agentic pipelines, and the specification listing marks structured output as supported; it does not offer a reasoning mode. Completions are capped at 16,384 tokens per response. At $2.50 per million input tokens and $10.00 per million output tokens, it sits in the mid-to-upper price range, and there is currently no independent benchmark coverage to show where it stands against alternatives. Buyers who need native audio comprehension alongside text in a single API call have few options, so GPT Audio may be worth shortlisting on capability fit alone. Because performance claims remain unverified, teams with tight budgets or strict quality thresholds should treat it as an early-stage choice until independent evaluations are available.
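
To make the pricing concrete, here is a minimal cost-estimation sketch using only the figures listed on this page ($2.50 and $10.00 per million tokens, a 16,384-token output cap, a 128,000-token context window). The function name `estimate_cost` and the constants are hypothetical helpers, not part of any OpenAI API. The sketch also assumes the listed rates apply uniformly to every token; the page does not say whether audio tokens are billed differently, so treat the result as a rough estimate.

```python
from decimal import Decimal

# Figures copied from the listing on this page.
INPUT_PRICE_PER_M = Decimal("2.50")    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = Decimal("10.00")  # USD per 1M output tokens
CONTEXT_WINDOW = 128_000               # tokens
MAX_OUTPUT_TOKENS = 16_384             # tokens per response


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: the listed per-token rates apply to all tokens alike,
    including audio tokens. The source does not confirm this.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > CONTEXT_WINDOW:
        raise ValueError("input exceeds the listed context window")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the listed per-response cap")

    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * INPUT_PRICE_PER_M / million
    output_cost = Decimal(output_tokens) * OUTPUT_PRICE_PER_M / million
    return input_cost + output_cost


if __name__ == "__main__":
    # A large prompt with a maximum-length completion.
    print(estimate_cost(100_000, 16_384))
```

Under these assumptions, a 100,000-token prompt with a full 16,384-token completion costs roughly $0.41, with the input side contributing $0.25 of that.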

Quality Score
84/100
price + capability + benchmarks
Input Price
$2.50
per 1M tokens
Output Price
$10.00
per 1M tokens
Context Window
128,000
tokens
Model ID
openai/gpt-audio
Vendor
openai
Tokenizer
GPT
Input Modalities
text, audio
Output Modalities
text, audio
Max Output
16,384 tokens
Tool Calling
✓ supported
Structured Output
✓ supported
Reasoning Mode
not supported
Vision
not supported
Audio
✓ accepts audio
Moderated
yes

Category rankings

Where OpenAI: GPT Audio places across the 3 categories it ranks in.

Rank        Category              Group   Score
#18 of 19   Audio Summarization   Voice   104
#18 of 19   TTS Replacement       Voice   99
#19 of 19   Transcription         Voice   99
