Realtime TTS API

A dedicated WebSocket endpoint for real-time text-to-speech synthesis. Send text incrementally — from an LLM token stream or user input — and receive synthesised audio as base64-encoded chunks with minimal latency.

Connection

The base URL is returned in the ws_tts_url field of the Create Streaming Session response — use that value rather than hardcoding the URL. Append the session's publisher JWT as the token query parameter.

{ws_tts_url}?token={JWT}

Authentication uses a LiveKit-compatible JWT passed as a query parameter. The connection is rate-limited to 20 new connections per minute per token.
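
A minimal sketch of assembling that URL in Python. The helper name build_tts_url is hypothetical; it uses only the standard library to URL-encode the token. It also preserves any query parameters that ws_tts_url might already carry, which is an assumption, since the documentation does not say whether the returned URL has any.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_tts_url(ws_tts_url: str, jwt: str) -> str:
    """Append the publisher JWT as the `token` query parameter (hypothetical helper)."""
    parts = urlsplit(ws_tts_url)
    # Keep any query parameters already present on the URL (assumption: there may be some).
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", jwt))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For a URL without an existing query string this produces the same result as the f-string form {ws_tts_url}?token={JWT}.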

TTS Session lifecycle

  1. Connect to the WebSocket endpoint
  2. Send an init message to establish voice and output settings
  3. Send text messages of up to 256 characters each, marking the last chunk of each sentence with is_eos: true
  4. Receive audio_chunk messages with base64-encoded audio as chunks arrive
  5. Send cancel at any time to stop the current synthesis

The session persists until you disconnect. Settings from init apply to the entire session; voice options can be overridden in any text message.


Supported TTS Languages

The table below lists the language codes accepted by the language field in the init message. Codes are lowercase.

| Code | Language |
| --- | --- |
| en | English (generic) |
| en-us | English (US) |
| en-gb | English (UK) |
| de | German |
| es | Spanish |
| fr | French |
| it | Italian |
| nl | Dutch |
| pt | Portuguese |
| ru | Russian |
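
Because the codes are lowercase, it can help to normalise and check a code on the client before sending init. The sketch below is an assumption-laden convenience: the function name normalize_language is hypothetical, the set simply mirrors the table above, and how the server treats mixed-case or unlisted codes is not documented here.

```python
# Mirrors the table of supported codes above.
SUPPORTED_TTS_LANGUAGES = {
    "en", "en-us", "en-gb", "de", "es", "fr", "it", "nl", "pt", "ru",
}


def normalize_language(code: str) -> str:
    """Lowercase a language code and verify it is in the supported list."""
    normalized = code.strip().lower().replace("_", "-")
    if normalized not in SUPPORTED_TTS_LANGUAGES:
        raise ValueError(f"unsupported TTS language: {code!r}")
    return normalized
```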

Client → Server Messages

init

Must be sent once, immediately after connecting.

{
  "type": "init",
  "language": "en",
  "model": "auto",
  "voice_options": {
    "voice_id": "default_low",
    "speed": 0.5,
    "deaccent_strength": 0.0
  },
  "output": {
    "format": "pcm",
    "sample_rate": 24000
  }
}

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| language | string | | | BCP-47 language code, e.g. en, ru, de, fr. |
| model | string | | auto | TTS model ID. Use auto to select the best model for the language. |
| voice_options | object | | | Voice configuration. See fields below. |
| output | object | | | Output audio settings. See fields below. |

voice_options

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| voice_id | string | | | Voice identifier. Use default_low or default_high for the language default. |
| speed | float | | 0.5 | Speed multiplier. 0.5 is normal speed. Range: 0.0 to 1.0. |
| deaccent_strength | float | | 0.0 | Reduces foreign accent. Range: 0.0 to 1.0. |

output

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| format | string | | mp3 | Output audio format. One of pcm, mp3, wav. |
| sample_rate | integer | | 24000 | Output sample rate in Hz. Range: 8000 to 48000. |
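
The field tables above can be turned into a small client-side builder that rejects out-of-range values before they reach the server. This is a sketch only: build_init is a hypothetical helper, the defaults are copied from the tables, and the ranges (0.0 to 1.0 for speed and deaccent_strength, 8000 to 48000 Hz for sample_rate) are as read from the descriptions.

```python
def build_init(language: str, voice_id: str, *, model: str = "auto",
               speed: float = 0.5, deaccent_strength: float = 0.0,
               audio_format: str = "mp3", sample_rate: int = 24000) -> dict:
    """Build an init message, validating ranges on the client (hypothetical helper)."""
    if not 0.0 <= speed <= 1.0:
        raise ValueError("speed must be within 0.0 to 1.0")
    if not 0.0 <= deaccent_strength <= 1.0:
        raise ValueError("deaccent_strength must be within 0.0 to 1.0")
    if audio_format not in ("pcm", "mp3", "wav"):
        raise ValueError("format must be one of pcm, mp3, wav")
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be within 8000 to 48000 Hz")
    return {
        "type": "init",
        "language": language,
        "model": model,
        "voice_options": {
            "voice_id": voice_id,
            "speed": speed,
            "deaccent_strength": deaccent_strength,
        },
        "output": {"format": audio_format, "sample_rate": sample_rate},
    }
```

Serialise the result with json.dumps and send it as the first message after connecting.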

text

Send a chunk of text to synthesise. Each message must be 256 characters or fewer. Send multiple messages to stream longer text — mark the last chunk of each sentence with is_eos: true.

Voice options can be overridden per message — only the fields you supply are changed.

{
  "type": "text",
  "text": "Hello, how can I help you today?",
  "is_eos": true
}

{
  "type": "text",
  "text": "This part is spoken faster.",
  "is_eos": false,
  "voice_options": {
    "speed": 0.9
  }
}

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | | | Text to synthesise. Max 256 characters per message. |
| generation_id | string | | | Optional client-supplied ID (4–256 chars). If set, it's returned as generation_id on the resulting audio_chunk messages so you can correlate output with this request. If omitted, the server assigns one. |
| is_eos | boolean | | false | End of sentence. When true, finalizes speech synthesis for the sentence sent so far. |
| voice_options | object | | | Overrides the session voice options for this and subsequent messages. |
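
One way to respect the 256-character limit is to split each sentence on whitespace and flag only the final piece with is_eos. The sketch below is illustrative: sentence_to_messages is a hypothetical helper, it counts characters with Python's len (whether the server counts code points or bytes is not stated), and it assumes the server tolerates chunks that do not carry their own leading or trailing space.

```python
MAX_TEXT_CHARS = 256  # per-message limit stated above


def sentence_to_messages(sentence: str, limit: int = MAX_TEXT_CHARS) -> list[dict]:
    """Split one sentence into text messages of at most `limit` characters."""
    chunks: list[str] = []
    current = ""
    for word in sentence.split():
        # A single word longer than the limit has to be hard-split.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    # Only the last chunk of the sentence carries is_eos: true.
    return [
        {"type": "text", "text": chunk, "is_eos": i == len(chunks) - 1}
        for i, chunk in enumerate(chunks)
    ]
```

An empty or whitespace-only sentence yields an empty list, so nothing is sent for it.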

cancel

Stop the current synthesis immediately. The session stays open — you can send new text messages right after.

{
  "type": "cancel"
}

Server → Client Messages

audio_chunk

Sent as audio is generated. Each message contains a base64-encoded audio chunk in the format specified in init.output.format.

{
  "message_type": "audio_chunk",
  "data": {
    "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZQ==",
    "size": 9600,
    "generation_id": "1a2b3c4d",
    "last_chunk": false,
    "chunk_generation_delta": 120,
    "audio_len": 0.2
  }
}

| Field | Type | Description |
| --- | --- | --- |
| data.audio | string | Base64-encoded audio data. |
| data.size | integer | Size of the decoded audio in bytes. |
| data.generation_id | string | ID of the generation this chunk belongs to. Echoes the generation_id you sent in the text message, or a server-assigned value if you didn't. Use it to correlate chunks with a specific request. |
| data.last_chunk | boolean | true on the final chunk of a generation, meaning synthesis for that sentence is complete. Lets you detect the end without guessing. |
| data.chunk_generation_delta | integer or null | Latency metric: milliseconds elapsed from the start of the generation to the moment this chunk was produced. null when the backend reports no timing for the chunk, in particular on the final end-of-generation marker (see last_chunk). Informational only. |
| data.audio_len | number or null | Duration of this chunk's audio in seconds. Present on every chunk that carries audio; null on the final end-of-generation marker, which carries no audio. |
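
A sketch of parsing an audio_chunk message into a typed value. The AudioChunk dataclass and parse_audio_chunk are hypothetical names. Since the end-of-generation marker carries no audio, the code treats a missing, null, or empty data.audio as zero bytes; the exact representation the server uses for that case is an assumption.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioChunk:
    audio: bytes
    generation_id: str
    last_chunk: bool
    delta_ms: Optional[int]      # chunk_generation_delta, may be null
    audio_len: Optional[float]   # seconds, null on the end marker


def parse_audio_chunk(message: dict) -> AudioChunk:
    """Decode one already-JSON-parsed audio_chunk message."""
    if message.get("message_type") != "audio_chunk":
        raise ValueError("not an audio_chunk message")
    data = message["data"]
    # Assumption: the end marker may send null or "" instead of audio.
    audio = base64.b64decode(data.get("audio") or "")
    return AudioChunk(
        audio=audio,
        generation_id=data["generation_id"],
        last_chunk=bool(data.get("last_chunk")),
        delta_ms=data.get("chunk_generation_delta"),
        audio_len=data.get("audio_len"),
    )
```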

error

Sent on session-level or synthesis errors.

{
  "message_type": "error",
  "data": {
    "code": "SERVICE_UNAVAILABLE",
    "desc": "Speech synthesis timed out. Please try again."
  }
}

| Code | Retryable | Description |
| --- | --- | --- |
| SERVICE_UNAVAILABLE | | Synthesis service issue. Wait and retry. |
| SERVER_ERROR | | Server-side synthesis failure that is not retryable. |
| UNKNOWN_ERROR | | Unexpected, unhandled internal error. Contact support if it persists. |
| BAD_REQUEST | | Incoming message is not valid JSON or not valid UTF-8. The session stays open. |
| VALIDATION_ERROR | | Invalid message field. See desc for details. |
| CONFLICT | | Session already initialised. Reconnect to change settings. |
| SESSION_NOT_FOUND | | Send init before sending text. |
| UNAUTHORIZED | | JWT token missing, invalid, or expired. |
| RATE_LIMIT_EXCEEDED | | Text-message rate limit exceeded (max 50 messages per second). The session stays open; wait briefly and retry. |

Two independent limits
  • Connections — max 20 new connections per minute per token. Exceeding this does not produce a RATE_LIMIT_EXCEEDED message: the server closes the WebSocket with code 1008 (policy violation) and reason Connection rate limit exceeded. Wait before reconnecting.
  • Text messages — max 50 text messages per second. Exceeding this returns an error with code RATE_LIMIT_EXCEEDED and keeps the session open.
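
To stay under the text-message limit rather than react to RATE_LIMIT_EXCEEDED, a client can throttle its own sends. The sliding-window limiter below is a sketch, not part of the API: TextRateLimiter and its methods are hypothetical, and it assumes the server measures the limit over a rolling one-second window, which the documentation does not specify.

```python
import time
from collections import deque


class TextRateLimiter:
    """Sliding-window limiter: at most `max_messages` sends per `window` seconds."""

    def __init__(self, max_messages: int = 50, window: float = 1.0,
                 clock=time.monotonic):
        self.max_messages = max_messages
        self.window = window
        self.clock = clock          # injectable for testing
        self._sent = deque()        # timestamps of recent sends

    def delay(self) -> float:
        """Seconds to wait before the next send is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget sends that have left the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.max_messages:
            return 0.0
        return self.window - (now - self._sent[0])

    def record(self) -> None:
        """Call right after each text message is sent."""
        self._sent.append(self.clock())
```

In an asyncio client, call delay() before each text message, await asyncio.sleep() for the returned value if it is non-zero, send, then call record().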

Output formats

| Format | Description |
| --- | --- |
| pcm | Raw 16-bit signed PCM, little-endian. No container. Best for real-time playback: schedule each chunk immediately as it arrives. |
| mp3 | MPEG Layer 3 compressed audio. Collect all chunks and decode together. |
| wav | PCM with RIFF/WAVE container. Collect all chunks and decode together. |
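
Because pcm output has no container, most players cannot open it directly. The sketch below wraps collected PCM bytes in a WAV container using Python's standard wave module. The function name pcm_to_wav_bytes is hypothetical, and mono output is an assumption consistent with the single-channel ffplay command in the example at the end of this page.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a RIFF/WAVE container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)          # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Pass the same sample_rate that was requested in init.output.sample_rate, otherwise playback speed and pitch will be wrong.
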
Real-time PCM playback

Decode each data.audio as base64 → Uint8Array → Int16Array (little-endian) → Float32Array, create an AudioBuffer, and schedule it with AudioBufferSourceNode.start(nextTime), accumulating nextTime += audioBuffer.duration after each chunk for gapless playback.
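
The same conversion can be sketched outside the browser. The Python below turns 16-bit little-endian PCM bytes into floats in the range -1.0 to 1.0 and computes a chunk's duration, which is the quantity accumulated into nextTime. Both function names are hypothetical, and mono audio is assumed.

```python
import array
import sys


def pcm16le_to_float(pcm: bytes) -> list[float]:
    """Convert 16-bit signed little-endian PCM to floats in [-1.0, 1.0)."""
    samples = array.array("h")  # signed 16-bit integers
    # Drop a trailing odd byte; a streaming client would carry it into the next chunk.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder == "big":
        samples.byteswap()      # the wire format is little-endian
    return [s / 32768.0 for s in samples]


def chunk_duration(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> float:
    """Duration in seconds of a 16-bit PCM chunk."""
    return len(pcm) / (2 * channels * sample_rate)
```

With the values from the audio_chunk sample above, a 9600-byte chunk at 24000 Hz lasts 0.2 seconds, matching the audio_len field.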


Example — streaming from an LLM

This example shows how sentences split into chunks can be streamed to the TTS API. Voice speed is increased mid-stream to demonstrate per-message overrides.

import asyncio, json, base64, requests, websockets

SENTENCES = [
    (
        "The sun was setting over the mountains,",
        "casting long golden shadows across the valley below.",
    ),
    (
        "Birds were returning to their nests,",
        "filling the air with their evening songs.",
    ),
    (
        "A gentle breeze moved through the tall grass,",
        "creating waves that rippled toward the horizon.",
    ),
]


def create_session():
    # The V2 endpoint returns the `ws_tts_url` and `publisher` token.
    resp = requests.post(
        "https://api.palabra.ai/session-storage/session",
        headers={
            "ClientID": "<your-client-id>",
            "ClientSecret": "<your-client-secret>",
            "Content-Type": "application/json",
        },
        json={"data": {"subscriber_count": 0}},
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    return data["ws_tts_url"], data["publisher"]


async def main():
    ws_tts_url, publisher_jwt = create_session()
    url = f"{ws_tts_url}?token={publisher_jwt}"

    async with websockets.connect(url) as ws:
        # 1. Initialise the session
        await ws.send(json.dumps({
            "type": "init",
            "language": "en",
            "model": "auto",
            "voice_options": {
                "voice_id": "default_low",
                "speed": 0.5,
                "deaccent_strength": 0.0,
            },
            "output": {
                "format": "pcm",
                "sample_rate": 24000,
            }
        }))

        # 2. Send each sentence chunk by chunk, is_eos=True on the last chunk of each sentence
        sent = 0
        for i, sentence in enumerate(SENTENCES):
            for k, chunk in enumerate(sentence):
                msg = {
                    "type": "text",
                    "text": chunk,
                    "is_eos": k == len(sentence) - 1,
                }
                if k == 0:  # override speed only at the start of each sentence
                    msg["voice_options"] = {"speed": min(0.5 + i / 10, 1.0)}
                await ws.send(json.dumps(msg))
                sent += 1

        # 3. Collect audio until every generation has signalled last_chunk
        audio = bytearray()
        done = 0
        async for raw in ws:
            data = json.loads(raw)
            if data.get("message_type") == "audio_chunk":
                chunk = data["data"]
                # The end-of-generation marker carries no audio, so tolerate a null or empty value.
                audio.extend(base64.b64decode(chunk.get("audio") or ""))
                print(f"Chunk: {chunk['size']} bytes, total: {len(audio)}")
                if chunk.get("last_chunk"):
                    done += 1
                    if done == sent:  # all sentences finished
                        break
            elif data.get("message_type") == "error":
                print(f"Error: {data['data']['code']}: {data['data']['desc']}")
                break

    print(f"Total: {len(audio)} bytes")
    with open("output.pcm", "wb") as f:
        f.write(audio)
    # Play: ffplay -f s16le -ar 24000 -ac 1 output.pcm


asyncio.run(main())