Speech-to-Text
Speech-to-Text
Speech-to-Text
OpenRouter supports speech-to-text (STT) via a dedicated /api/v1/audio/transcriptions endpoint. Send base64-encoded audio and receive a JSON response with the transcribed text and usage statistics.
You can find STT models in several ways:
Use the output_modalities query parameter on the Models API to discover STT models:
Visit the Models page and filter by output modalities to find models capable of audio transcription. You can also browse the Speech-to-Text collection for a curated list.
Send a POST request to /api/v1/audio/transcriptions with a JSON body containing base64-encoded audio. The response is JSON with the transcribed text and optional usage statistics.
You can pass provider-specific options using the provider parameter. Options are keyed by provider slug, and only the options for the matched provider are forwarded:
The STT endpoint returns a JSON response with the transcribed text:
Supported audio formats vary by provider. Common formats include:
STT models use different pricing strategies depending on the provider:
You can check the cost for each model on the Models page or via the Models API. The usage.cost field in the response shows the actual cost for each request.
STT supports BYOK, allowing you to use your own provider API keys. When configured, requests are routed directly to the provider using your key, and OpenRouter charges only its platform fee rather than the per-usage model cost.
You can test STT models directly in the browser using the OpenRouter Playground. Navigate to any STT model’s page and use the playground tab to upload an audio file and see the transcription result.
OpenRouter supports two ways to process audio:
Speech-to-Text (this page): A dedicated /api/v1/audio/transcriptions endpoint optimized for transcription. Returns structured JSON with the transcribed text and usage data. Best for converting audio to text.
Audio input via Chat Completions (Audio docs): Send audio as part of a /api/v1/chat/completions request using the input_audio content type. The model processes the audio alongside text and responds conversationally. Best for audio analysis, question answering about audio content, or combining audio with other modalities.
Empty or incorrect transcription?
format field in your requestRequest timing out?
Model not found?
output_modalities=transcription to find available STT modelsopenai/whisper-1, not whisper-1)Authentication error?