Top picks for Real-Time Chat (2026)

Models tuned for sub-second response. Ranked from 334 live models in the OpenRouter catalog, weighted for low latency and low cost.

What this is: a ranking by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models first need the right specs for Real-Time Chat; benchmark performance then refines the order.

#	Model	Model ID	Score	In / 1M	Out / 1M	Context
1	NVIDIA: Nemotron 3 Nano Omni (free)	nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free	118	Free	Free	256,000
2	Xiaomi: MiMo-V2.5	xiaomi/mimo-v2.5	118	$0.14	$0.28	1,048,576
3	Google: Gemma 4 26B A4B (free)	google/gemma-4-26b-a4b-it:free	118	Free	Free	262,144
4	Google: Gemma 4 26B A4B	google/gemma-4-26b-a4b-it	118	$0.06	$0.33	262,144
5	Google: Gemma 4 31B (free)	google/gemma-4-31b-it:free	118	Free	Free	262,144
6	Google: Gemma 4 31B	google/gemma-4-31b-it	118	$0.12	$0.35	262,144
7	Qwen: Qwen3.5-9B	qwen/qwen3.5-9b	118	$0.10	$0.15	262,144
8	ByteDance Seed: Seed-2.0-Mini	bytedance-seed/seed-2.0-mini	118	$0.10	$0.40	262,144
9	Qwen: Qwen3.5-Flash	qwen/qwen3.5-flash-02-23	118	$0.07	$0.26	1,000,000
10	ByteDance Seed: Seed 1.6 Flash	bytedance-seed/seed-1.6-flash	118	$0.07	$0.30	262,144
11	Google: Gemini 2.5 Flash Lite Preview 09-2025	google/gemini-2.5-flash-lite-preview-09-2025	118	$0.10	$0.40	1,048,576
12	OpenAI: GPT-5 Nano	openai/gpt-5-nano	118	$0.05	$0.40	400,000
13	Google: Gemini 2.5 Flash Lite	google/gemini-2.5-flash-lite	118	$0.10	$0.40	1,048,576
14	OpenAI: GPT-4.1 Nano	openai/gpt-4.1-nano	118	$0.10	$0.40	1,047,576
15	StepFun: Step 3.7 Flash	stepfun/step-3.7-flash	117	$0.20	$1.15	256,000
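
To make the "In / 1M" and "Out / 1M" columns concrete, here is a minimal sketch that converts per-million-token prices into a cost per chat turn. The prices are the ones listed for openai/gpt-5-nano in the table; the token counts (500 in, 150 out) and the function name `cost_per_turn` are hypothetical, chosen only for illustration.

```python
from decimal import Decimal

def cost_per_turn(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Return the USD cost of one chat turn, given prices per 1M tokens."""
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * Decimal(in_price_per_m)
    output_cost = Decimal(output_tokens) * Decimal(out_price_per_m)
    return (input_cost + output_cost) / million

# Prices from the table row for openai/gpt-5-nano: $0.05 in, $0.40 out.
# The token counts are assumed values for a short chat turn.
turn = cost_per_turn(500, 150, "0.05", "0.40")
print(f"${turn:.6f} per turn")  # prints "$0.000085 per turn"
```

Multiplying the per-turn figure by expected daily turns gives a rough budget; actual bills depend on real token counts and on prices at the time of the request.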

How we ranked these

For Real-Time Chat, we weight models on low latency and low cost. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.

About Real-Time Chat

Real-Time Chat is the task of generating conversational responses in under one second, typically 200-800ms per turn. You need this when users expect immediate feedback during dialogue, such as customer support bots, in-app assistants, or voice interfaces where latency breaks the illusion of conversation. A good model for this task combines a low parameter count with efficient inference: smaller fine-tuned models like Llama 2 7B or Mistral 7B typically respond faster than larger ones. Poor fits are either too large (requiring batching that adds delay) or poorly quantized (trading coherence for speed). The practical tradeoff: sub-second response often means accepting slightly lower reasoning depth or restricting the context window to 2K-4K tokens. Inference cost grows with model size and context length, so a 70B-parameter model will rarely hit sub-second latency on commodity hardware.
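
One way to check a candidate model against the 200-800ms target is to time several turns and compare the median to a budget. The sketch below assumes a caller-supplied `respond` function; `fake_respond` is a hypothetical stand-in that only sleeps, so swap in your real client call. The 800ms default is the upper end of the range quoted above, not a standard.

```python
import statistics
import time

def measure_latency_ms(respond, prompt, runs=5):
    """Call respond(prompt) `runs` times; return each call's latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        respond(prompt)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def within_budget(samples_ms, budget_ms=800.0):
    """Return True if the median latency is at or under the budget."""
    return statistics.median(samples_ms) <= budget_ms

def fake_respond(prompt):
    # Hypothetical stand-in for a real model call; it only simulates delay.
    time.sleep(0.01)
    return "ok"

samples = measure_latency_ms(fake_respond, "hello")
print(f"median: {statistics.median(samples):.1f} ms, "
      f"within budget: {within_budget(samples)}")
```

The median is used here because a single slow call can skew the mean; for user-facing targets you may also want to track a high percentile over a larger sample.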

When to use: Use this when you're building a chatbot, voice assistant, or live support tool where users notice delays longer than a second and will perceive the system as slow or unresponsive.

Common questions

Which AI models can actually respond in under one second for chat?

Llama 2 7B, Mistral 7B, and Phi-2 are production-proven choices, especially when quantized to 4-bit and deployed on GPU hardware with batch size 1. GPT-3.5-turbo via OpenAI's API typically hits 300-600ms end-to-end, though that includes network latency.

How much does real-time chat latency cost compared to batch processing?

Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.