Speech-to-Text API

Transcribe audio to text or translate audio to English. Compatible with the OpenAI Audio API.

Base URL

https://api.getkawai.com/v1

Authentication

When authentication is enabled, include your token in the Authorization header:

Authorization: Bearer API_KEY

Transcriptions

Transcribe audio to text in the original language.

POST /audio/transcriptions

Transcribes audio into text in the language spoken in the recording. Supports several response formats, including verbose JSON with timestamps.

Authentication: Required when auth is enabled. Token must have 'audio-transcriptions' endpoint access.

Headers

Header         Required              Description
Authorization  When auth is enabled  Bearer token for authentication
Content-Type   Yes                   Must be multipart/form-data

Request Body

Content-Type: multipart/form-data

Field            Type    Required  Description
file             binary  Yes       Audio file (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm)
model            string  Yes       Transcription model (e.g., 'tiny', 'base', 'small', 'medium', 'large')
language         string  No        Language code (ISO-639-1). Auto-detected if not provided.
prompt           string  No        Optional text to guide style or continue previous segment
response_format  string  No        Format: json, text, srt, vtt, verbose_json (default: json)
temperature      number  No        Sampling temperature 0-1 (default: 0)
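
The constraints in the field table can be checked on the client before a file is uploaded. The following is a minimal Python sketch, assuming the server enforces the same limits the table lists. `build_transcription_fields` is a hypothetical helper name used only for illustration and is not part of the API.

```python
# Hypothetical client-side check of the documented form fields.
# Assumption: the limits mirror the field table; the server may be stricter or looser.
RESPONSE_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}


def build_transcription_fields(model, language=None, prompt=None,
                               response_format="json", temperature=0.0):
    """Return the non-file form fields as strings, raising ValueError on bad input."""
    if not model:
        raise ValueError("model is required")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    if language is not None:
        # ISO-639-1 codes are two ASCII letters, for example "en" or "es".
        if not (len(language) == 2 and language.isascii() and language.isalpha()):
            raise ValueError("language must be a two-letter ISO-639-1 code")

    fields = {
        "model": model,
        "response_format": response_format,
        "temperature": str(temperature),
    }
    # Optional fields are only included when the caller supplies them.
    if language is not None:
        fields["language"] = language.lower()
    if prompt:
        fields["prompt"] = prompt
    return fields
```

Calling `build_transcription_fields("base", language="en")` yields a dictionary that can be sent alongside the `file` part of the multipart body.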

Response

Returns the transcription text. The verbose_json format also includes segments, timestamps, and the detected language.

Content-Type: application/json (json, verbose_json) or text (text, srt, vtt)

Examples

Basic transcription:

curl -X POST https://api.getkawai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.mp3" \
  -F "model=base" \
  -F "language=en"

Verbose JSON with timestamps:

curl -X POST https://api.getkawai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.mp3" \
  -F "model=base" \
  -F "response_format=verbose_json"
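
The curl commands above can be approximated with the Python standard library. This sketch only builds the request object and does not send it. The helper names `encode_multipart` and `build_transcription_request` are hypothetical, and labelling the file part as `application/octet-stream` is an assumption (curl may pick a type from the file extension instead).

```python
import urllib.request
import uuid

BASE_URL = "https://api.getkawai.com/v1"


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one part named 'file' as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The audio payload goes in a part named "file", matching -F "file=@...".
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(api_key, filename, file_bytes, model, **options):
    """Build (but do not send) a POST request for /audio/transcriptions."""
    fields = {"model": model}
    fields.update({key: str(value) for key, value in options.items()})
    body, content_type = encode_multipart(fields, filename, file_bytes)
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the upload; the response body can then be read and decoded according to the chosen `response_format`.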

Translations

Translate audio from any language to English.

POST /audio/translations

Translates audio into English. The source language is automatically detected.

Authentication: Required when auth is enabled. Token must have 'audio-translations' endpoint access.

Headers

Header         Required              Description
Authorization  When auth is enabled  Bearer token for authentication
Content-Type   Yes                   Must be multipart/form-data

Request Body

Content-Type: multipart/form-data

Field            Type    Required  Description
file             binary  Yes       Audio file (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm)
model            string  Yes       Translation model (e.g., 'tiny', 'base', 'small', 'medium', 'large')
prompt           string  No        Optional text to guide style
response_format  string  No        Format: json, text, srt, vtt, verbose_json (default: json)
temperature      number  No        Sampling temperature 0-1 (default: 0)
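
The translation endpoint documents the same fields as the transcription endpoint except for `language`. A small Python sketch of that difference follows. The `resolve_endpoint` helper and the `transcribe`/`translate` task names are hypothetical, and the page does not say whether the server rejects or silently ignores a `language` field on translations; this sketch simply flags it on the client.

```python
BASE_URL = "https://api.getkawai.com/v1"

# Fields documented for both endpoints.
COMMON_FIELDS = {"file", "model", "prompt", "response_format", "temperature"}

# Task name -> (path, documented field names). Only transcriptions list "language".
ENDPOINTS = {
    "transcribe": ("/audio/transcriptions", COMMON_FIELDS | {"language"}),
    "translate": ("/audio/translations", COMMON_FIELDS),
}


def resolve_endpoint(task, field_names):
    """Return the full URL for a task after checking field names against the docs."""
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task: {task!r}")
    path, allowed = ENDPOINTS[task]
    names = set(field_names)
    unknown = names - allowed
    if unknown:
        raise ValueError(f"fields not documented for {task}: {sorted(unknown)}")
    missing = {"file", "model"} - names
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return BASE_URL + path
```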

Response

Returns the English translation text. The verbose_json format also includes segments and timestamps.

Content-Type: application/json (json, verbose_json) or text (text, srt, vtt)

Examples

Translate Spanish audio to English:

curl -X POST https://api.getkawai.com/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@spanish-audio.mp3" \
  -F "model=base"

Translate with verbose output:

curl -X POST https://api.getkawai.com/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@french-audio.mp3" \
  -F "model=base" \
  -F "response_format=verbose_json"

Response Formats

Available response formats for transcription and translation.

JSON (default)

Simple JSON response with only the transcribed/translated text.

Examples

{
  "text": "Hello, this is the transcribed text."
}
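
A client usually has to handle both the JSON shape above and the plain-text formats. The sketch below shows one way to pull the transcript out of a raw response body in Python. `extract_text` is a hypothetical helper, and it assumes that the `text`, `srt`, and `vtt` formats arrive as UTF-8 plain text.

```python
import json

# Formats whose body is a JSON object with a top-level "text" key.
JSON_FORMATS = {"json", "verbose_json"}


def extract_text(body, response_format="json"):
    """Return the transcript or subtitle text from a raw response body (bytes)."""
    decoded = body.decode("utf-8")
    if response_format in JSON_FORMATS:
        return json.loads(decoded)["text"]
    # Assumption: text, srt, and vtt bodies are plain text and need no parsing.
    return decoded.strip()
```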

Verbose JSON

Detailed JSON response with language detection, duration, segments, and word-level timestamps.

Examples

{
  "task": "transcribe",
  "language": "en",
  "duration": 5.2,
  "text": "Hello, this is the transcribed text.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is",
      "tokens": [123, 456, 789]
    }
  ],
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5}
  ]
}
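
To illustrate how the `segments` array can be consumed, here is a short Python sketch that turns a verbose JSON response into SRT-style cues. The function names are hypothetical, and the sketch assumes `start` and `end` are seconds, as the sample above suggests. Requesting `response_format=srt` directly is the simpler route when only subtitles are needed.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(verbose):
    """Build SRT text from the 'segments' list of a verbose JSON response."""
    cues = []
    for index, segment in enumerate(verbose.get("segments", []), start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        cues.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```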

Supported Models

Whisper models available for transcription and translation.

Whisper Models

OpenAI Whisper models for speech recognition.

Examples

Available models:

tiny    - 39M parameters, fastest
base    - 74M parameters, good balance
small   - 244M parameters, better accuracy
medium  - 769M parameters, high accuracy
large   - 1550M parameters, best accuracy