Realtime TTS API
A dedicated WebSocket endpoint for real-time text-to-speech synthesis. Send text incrementally — from an LLM token stream or user input — and receive synthesised audio as base64-encoded chunks with minimal latency.
Connection
The base URL is returned in the ws_tts_url field of the Create Streaming Session response — use that value rather than hardcoding the URL. Append the session's publisher JWT as the token query parameter.
{ws_tts_url}?token={JWT}
Authentication uses a LiveKit-compatible JWT passed as a query parameter. New connections are limited to 20 per minute per token.
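A minimal sketch of building the connection URL from the two values described above. The helper name build_tts_url and the example host are hypothetical; the only behaviour taken from this page is that the JWT goes in the token query parameter. Using urllib.parse rather than string concatenation keeps the URL valid if the returned ws_tts_url already carries a query string, which is an assumption about what the server might return, not something this page states.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_tts_url(ws_tts_url: str, publisher_jwt: str) -> str:
    """Return ws_tts_url with the JWT set as the `token` query parameter."""
    parts = urlsplit(ws_tts_url)
    # Keep any query parameters already present, but replace an existing token.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "token"
    ]
    query.append(("token", publisher_jwt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical host and token, for illustration only.
print(build_tts_url("wss://tts.example.com/v1/stream", "abc.def.ghi"))
```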
TTS Session lifecycle
- Connect to the WebSocket endpoint
- Send an init message to establish voice and output settings
- Send text messages of up to 256 characters each, marking the last chunk of each sentence with is_eos: true
- Receive audio_chunk messages with base64-encoded audio as chunks arrive
- Send cancel at any time to stop the current synthesis
The session persists until you disconnect. Settings from init apply to the entire session; voice options can be overridden in a text message.
Supported TTS Languages
The table below lists the language codes accepted by the language field of the init message. Codes are lowercase.
| Code | Language |
|---|---|
| en | English (generic) |
| en-us | English (US) |
| en-gb | English (UK) |
| de | German |
| es | Spanish |
| fr | French |
| it | Italian |
| nl | Dutch |
| pt | Portuguese |
| ru | Russian |
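Because the codes are lowercase, a client that receives locale strings such as en-US from elsewhere may want to normalise them before sending init. The sketch below is one way to do that. The set is copied from the table above and may fall out of date, the helper name normalize_language is hypothetical, and whether the server itself tolerates mixed case is not stated here.

```python
# Copied from the table above; update if the supported list changes.
SUPPORTED_LANGUAGES = {
    "en", "en-us", "en-gb", "de", "es", "fr", "it", "nl", "pt", "ru",
}


def normalize_language(code: str) -> str:
    """Lowercase a locale string and check it against the supported codes."""
    normalized = code.strip().lower().replace("_", "-")
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported TTS language: {code!r}")
    return normalized


print(normalize_language("en-US"))
```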
Client → Server Messages
init
Must be sent once, immediately after connecting.
{
  "type": "init",
  "language": "en",
  "model": "auto",
  "voice_options": {
    "voice_id": "default_low",
    "speed": 0.5,
    "deaccent_strength": 0.0
  },
  "output": {
    "format": "pcm",
    "sample_rate": 24000
  }
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| language | string | ✓ | — | BCP-47 language code, e.g. en, ru, de, fr. |
| model | string | — | auto | TTS model ID. Use auto to select the best model for the language. |
| voice_options | object | ✓ | — | Voice configuration. See fields below. |
| output | object | — | — | Output audio settings. See fields below. |
voice_options
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| voice_id | string | ✓ | — | Voice identifier. Use default_low or default_high for the language default. |
| speed | float | — | 0.5 | Speed multiplier. 0.5 is normal speed. Range: 0.0–1.0. |
| deaccent_strength | float | — | 0.0 | Reduces foreign accent. Range: 0.0–1.0. |
output
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| format | string | — | mp3 | Output audio format. One of pcm, mp3, wav. |
| sample_rate | integer | — | 24000 | Output sample rate in Hz. Range: 8000–48000. |
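The ranges and defaults in the three tables above can be checked on the client before anything is sent, which avoids a round trip that would end in VALIDATION_ERROR. This is a sketch only: build_init is a hypothetical helper, and the server remains the authority on what it accepts.

```python
AUDIO_FORMATS = {"pcm", "mp3", "wav"}


def build_init(language: str, voice_id: str, *, model: str = "auto",
               speed: float = 0.5, deaccent_strength: float = 0.0,
               audio_format: str = "mp3", sample_rate: int = 24000) -> dict:
    """Build an init message, rejecting values outside the documented ranges."""
    if not 0.0 <= speed <= 1.0:
        raise ValueError("speed must be within 0.0-1.0")
    if not 0.0 <= deaccent_strength <= 1.0:
        raise ValueError("deaccent_strength must be within 0.0-1.0")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"format must be one of {sorted(AUDIO_FORMATS)}")
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be within 8000-48000 Hz")
    return {
        "type": "init",
        "language": language,
        "model": model,
        "voice_options": {
            "voice_id": voice_id,
            "speed": speed,
            "deaccent_strength": deaccent_strength,
        },
        "output": {"format": audio_format, "sample_rate": sample_rate},
    }
```

The result can be passed to json.dumps and sent as the first message on the socket, for example build_init("en", "default_low", audio_format="pcm").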
text
Send a chunk of text to synthesise. Each message must be 256 characters or fewer. Send multiple messages to stream longer text — mark the last chunk of each sentence with is_eos: true.
Voice options can be overridden per message — only the fields you supply are changed.
{
  "type": "text",
  "text": "Hello, how can I help you today?",
  "is_eos": true
}

{
  "type": "text",
  "text": "This part is spoken faster.",
  "is_eos": false,
  "voice_options": {
    "speed": 0.9
  }
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | ✓ | — | Text to synthesise. Max 256 characters per message. |
| generation_id | string | — | — | Optional client-supplied ID (4–256 chars). If set, it's returned as generation_id on the resulting audio_chunk messages so you can correlate output with this request. If omitted, the server assigns one. |
| is_eos | boolean | — | false | End of sentence. When true, finalizes speech synthesis for the sentence sent so far. |
| voice_options | object | — | — | Override of session voice options for this and subsequent messages. |
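To respect the 256-character limit, longer sentences have to be split on the client. The sketch below splits at the last space that fits and sets is_eos only on the final chunk. Several things here are assumptions rather than documented behaviour: the helper names are hypothetical, the limit is counted with Python's len() (the server may count characters differently), and the space at each split point is dropped on the assumption that the server joins chunks of a sentence sensibly.

```python
def split_text(text: str, limit: int = 256) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring spaces."""
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        # Last space such that the piece before it still fits within the limit.
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: fall back to a hard split
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks


def text_messages(sentence: str, limit: int = 256) -> list[dict]:
    """Build text messages for one sentence; only the last one has is_eos true."""
    chunks = split_text(sentence, limit)
    return [
        {"type": "text", "text": chunk, "is_eos": index == len(chunks) - 1}
        for index, chunk in enumerate(chunks)
    ]
```

Each returned dict can be serialised with json.dumps and sent in order.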
cancel
Stop the current synthesis immediately. The session stays open — you can send new text messages right after.
{
  "type": "cancel"
}
Server → Client Messages
audio_chunk
Sent as audio is generated. Each message contains a base64-encoded audio chunk in the format specified in init.output.format.
{
  "message_type": "audio_chunk",
  "data": {
    "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZQ==",
    "size": 9600,
    "generation_id": "1a2b3c4d",
    "last_chunk": false,
    "chunk_generation_delta": 120,
    "audio_len": 0.2
  }
}
| Field | Type | Description |
|---|---|---|
| data.audio | string | Base64-encoded audio data. |
| data.size | integer | Size of the decoded audio in bytes. |
| data.generation_id | string | ID of the generation this chunk belongs to. Echoes the generation_id you sent in the text message, or a server-assigned value if you didn't. Use it to correlate chunks with a specific request. |
| data.last_chunk | boolean | true on the final chunk of a generation, meaning synthesis for that sentence is complete. Lets you detect the end without guessing. |
| data.chunk_generation_delta | integer or null | Latency metric: milliseconds elapsed from the start of the generation to the moment this chunk was produced. null when the backend reports no timing for the chunk, in particular on the final end-of-generation marker (see last_chunk). Informational only. |
| data.audio_len | number or null | Duration of this chunk's audio in seconds. Present on every chunk that carries audio; null on the final end-of-generation marker, which carries no audio. |
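The nullable fields above mean a parser should not assume every audio_chunk carries audio or timing. The sketch below turns a raw message into a small record and tolerates the end-of-generation marker. The AudioChunk class and parse_audio_chunk are hypothetical names, and treating a missing, null, or empty audio field on the final marker as zero bytes is an assumption, since the exact shape of that marker is not shown here.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioChunk:
    generation_id: str
    audio: bytes                 # decoded audio; empty on the final marker
    last_chunk: bool
    audio_len: Optional[float]   # seconds, or None on the final marker
    delta_ms: Optional[int]      # chunk_generation_delta, informational only


def parse_audio_chunk(raw: str) -> Optional[AudioChunk]:
    """Parse one server message; return None if it is not an audio_chunk."""
    message = json.loads(raw)
    if message.get("message_type") != "audio_chunk":
        return None
    data = message["data"]
    return AudioChunk(
        generation_id=data["generation_id"],
        # Assumption: the end marker may send null or omit the audio field.
        audio=base64.b64decode(data.get("audio") or ""),
        last_chunk=bool(data.get("last_chunk", False)),
        audio_len=data.get("audio_len"),
        delta_ms=data.get("chunk_generation_delta"),
    )
```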
error
Sent on session-level or synthesis errors.
{
  "message_type": "error",
  "data": {
    "code": "SERVICE_UNAVAILABLE",
    "desc": "Speech synthesis timed out. Please try again."
  }
}
| Code | Retryable | Description |
|---|---|---|
| SERVICE_UNAVAILABLE | ✓ | Synthesis service issue. Wait and retry. |
| SERVER_ERROR | ✗ | Server-side synthesis failure that is not retryable. |
| UNKNOWN_ERROR | ✗ | Unexpected, unhandled internal error. Contact support if it persists. |
| BAD_REQUEST | ✗ | Incoming message is not valid JSON or not valid UTF-8. The session stays open. |
| VALIDATION_ERROR | ✗ | Invalid message field. See desc for details. |
| CONFLICT | ✗ | Session already initialised. Reconnect to change settings. |
| SESSION_NOT_FOUND | ✗ | Send init before sending text. |
| UNAUTHORIZED | ✗ | JWT token missing, invalid, or expired. |
| RATE_LIMIT_EXCEEDED | ✓ | Text-message rate limit exceeded (max 50 messages per second). The session stays open. Wait briefly and retry. |
Two rate limits apply, and they are reported differently:
- Connections: max 20 new connections per minute per token. Exceeding this does not produce a RATE_LIMIT_EXCEEDED message: the server closes the WebSocket with code 1008 (policy violation) and reason "Connection rate limit exceeded". Wait before reconnecting.
- Text messages: max 50 text messages per second. Exceeding this returns an error with code RATE_LIMIT_EXCEEDED and keeps the session open.
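A client can use the Retryable column to decide whether to resend after an error. The sketch below encodes that column as a set and pairs it with an exponential backoff. The retryable codes and the 1008 close code come from this page; the helper names and the backoff values (0.5 s base, 8 s cap) are illustrative assumptions, not recommended settings.

```python
import json

# Codes marked retryable in the error table.
RETRYABLE_CODES = {"SERVICE_UNAVAILABLE", "RATE_LIMIT_EXCEEDED"}

# Close code used when the connection rate limit is exceeded.
CONNECTION_RATE_LIMIT_CLOSE_CODE = 1008


def is_retryable(raw: str) -> bool:
    """Return True if raw is an error message whose code is marked retryable."""
    message = json.loads(raw)
    if message.get("message_type") != "error":
        return False
    return message.get("data", {}).get("code") in RETRYABLE_CODES


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling up to cap."""
    return min(cap, base * 2 ** attempt)
```

A receive loop might call is_retryable on each error and sleep for backoff_delay(attempt) before resending, while treating a close with code 1008 as a signal to delay the next connection attempt.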
Output formats
| Format | Description |
|---|---|
| pcm | Raw 16-bit signed PCM, little-endian. No container. Best for real-time playback: schedule each chunk immediately as it arrives. |
| mp3 | MPEG Layer 3 compressed audio. Collect all chunks and decode together. |
| wav | PCM with RIFF/WAVE container. Collect all chunks and decode together. |
For PCM playback in a browser with the Web Audio API, decode each data.audio value as base64 → Uint8Array → Int16Array (little-endian) → Float32Array, create an AudioBuffer, and schedule it with AudioBufferSourceNode.start(nextTime). Advance nextTime by audioBuffer.duration after each chunk for gapless playback.
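The same conversion can be sketched outside the browser. The Python below turns 16-bit little-endian PCM into floats in the range [-1.0, 1.0) and computes a chunk's duration from its byte count. It assumes mono audio, which this page does not state explicitly (the ffplay command in the example uses a single channel). Under that assumption, the sample audio_chunk is self-consistent: 9600 bytes at 24000 Hz is 4800 samples, or 0.2 seconds, matching its audio_len. The function names are hypothetical.

```python
import array
import sys


def pcm16le_to_float(pcm: bytes) -> list[float]:
    """Convert 16-bit signed little-endian PCM to floats in [-1.0, 1.0)."""
    samples = array.array("h")  # signed 16-bit integers
    # Simplification: ignore a trailing odd byte instead of buffering it.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # the wire format is little-endian
    return [sample / 32768.0 for sample in samples]


def pcm_duration(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Duration in seconds of 16-bit PCM with the given size and sample rate."""
    return num_bytes / (2 * channels * sample_rate)


print(pcm_duration(9600, 24000))  # 0.2
```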
Example — streaming from an LLM
This example shows how sentences split into chunks can be streamed to the TTS API. The voice speed is raised at the start of each sentence to demonstrate per-message overrides.
import asyncio
import base64
import json

import requests
import websockets

SENTENCES = [
    (
        "The sun was setting over the mountains,",
        "casting long golden shadows across the valley below.",
    ),
    (
        "Birds were returning to their nests,",
        "filling the air with their evening songs.",
    ),
    (
        "A gentle breeze moved through the tall grass,",
        "creating waves that rippled toward the horizon.",
    ),
]


def create_session():
    # The V2 endpoint returns the `ws_tts_url` and `publisher` token.
    resp = requests.post(
        "https://api.palabra.ai/session-storage/session",
        headers={
            "ClientID": "<your-client-id>",
            "ClientSecret": "<your-client-secret>",
            "Content-Type": "application/json",
        },
        json={"data": {"subscriber_count": 0}},
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    return data["ws_tts_url"], data["publisher"]


async def main():
    ws_tts_url, publisher_jwt = create_session()
    url = f"{ws_tts_url}?token={publisher_jwt}"

    async with websockets.connect(url) as ws:
        # 1. Initialise the session
        await ws.send(json.dumps({
            "type": "init",
            "language": "en",
            "model": "auto",
            "voice_options": {
                "voice_id": "default_low",
                "speed": 0.5,
                "deaccent_strength": 0.0,
            },
            "output": {
                "format": "pcm",
                "sample_rate": 24000,
            },
        }))

        # 2. Send each sentence chunk by chunk, is_eos=True on the last chunk of each sentence
        sent = 0
        for i, sentence in enumerate(SENTENCES):
            for k, chunk in enumerate(sentence):
                msg = {
                    "type": "text",
                    "text": chunk,
                    "is_eos": k == len(sentence) - 1,
                }
                if k == 0:  # override speed only at the start of each sentence
                    msg["voice_options"] = {"speed": min(0.5 + i / 10, 1.0)}
                await ws.send(json.dumps(msg))
            # Count sentences, not messages: last_chunk is expected once per
            # completed sentence, so this is what `done` is compared against.
            sent += 1

        # 3. Collect audio until every generation has signalled last_chunk
        audio = bytearray()
        done = 0
        async for raw in ws:
            data = json.loads(raw)
            if data.get("message_type") == "audio_chunk":
                chunk = data["data"]
                audio.extend(base64.b64decode(chunk["audio"]))
                print(f"Chunk: {chunk['size']} bytes, total: {len(audio)}")
                if chunk.get("last_chunk"):
                    done += 1
                    if done == sent:  # all sentences finished
                        break
            elif data.get("message_type") == "error":
                print(f"Error: {data['data']['code']}: {data['data']['desc']}")
                break

    print(f"Total: {len(audio)} bytes")
    with open("output.pcm", "wb") as f:
        f.write(audio)
    # Play: ffplay -f s16le -ar 24000 -ac 1 output.pcm


asyncio.run(main())
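Raw PCM has no header, so most players need the sample format spelled out, as in the ffplay command. If a self-describing file is more convenient, the collected bytes can be wrapped in a WAV container with Python's standard wave module. This is a sketch assuming mono 16-bit audio at the sample rate requested in init; pcm_to_wav is a hypothetical helper, not part of the API.

```python
import wave


def pcm_to_wav(pcm: bytes, path: str, sample_rate: int = 24000,
               channels: int = 1) -> None:
    """Write raw 16-bit PCM bytes to `path` as a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)          # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


# Usage, after the example has written output.pcm:
#     with open("output.pcm", "rb") as f:
#         pcm_to_wav(f.read(), "output.wav")
```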