Realtime TTS API

A dedicated WebSocket endpoint for real-time text-to-speech synthesis. Send text incrementally — from an LLM token stream or user input — and receive synthesised audio as base64-encoded chunks with minimal latency.

Connection

The base URL is returned in the ws_tts_url field of the Create Streaming Session response — use that value rather than hardcoding the URL. Append the session's publisher JWT as the token query parameter.

{ws_tts_url}?token={JWT}

Authentication uses a LiveKit-compatible JWT passed as a query parameter. The connection is rate-limited to 20 new connections per minute per token.
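
As a minimal sketch of the URL construction described above, the helper below appends the JWT as the token query parameter while preserving any query string already present in ws_tts_url. The function name build_tts_url and the example host are hypothetical; whether ws_tts_url ever carries its own query string is an assumption, handled here only as a precaution.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_tts_url(ws_tts_url: str, publisher_jwt: str) -> str:
    """Return ws_tts_url with the JWT set as the `token` query parameter."""
    parts = urlsplit(ws_tts_url)
    # Keep existing parameters, but drop any stale `token` value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "token"
    ]
    query.append(("token", publisher_jwt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder values for illustration only, not a real endpoint or token.
print(build_tts_url("wss://tts.example.invalid/ws", "abc.def.ghi"))
```

Using urlencode rather than string concatenation keeps the URL valid if the token ever contains characters that need escaping.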

TTS Session lifecycle

Connect to the WebSocket endpoint
Send an init message to establish voice and output settings
Send text messages of up to 256 characters each, marking the last chunk of each sentence with is_eos: true
Receive audio_chunk messages with base64-encoded audio as chunks arrive
Send cancel at any time to stop the current synthesis

The session persists until you disconnect. Settings from init apply to the entire session; voice options can be overridden within a text message.

Supported TTS Languages

The table below lists the language codes accepted by the language field in the init message. Codes are lowercase.

Code	Language
`en`	English (generic)
`en-us`	English (US)
`en-gb`	English (UK)
`de`	German
`es`	Spanish
`fr`	French
`it`	Italian
`nl`	Dutch
`pt`	Portuguese
`ru`	Russian
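
Because the codes are lowercase, a client may want to normalise user-supplied values before sending init. The sketch below checks a code against the table above. The helper name normalize_language is hypothetical, and how the server treats mixed-case or underscore-separated codes is not documented here, so the normalisation is a client-side assumption.

```python
SUPPORTED_LANGUAGES = {
    "en", "en-us", "en-gb", "de", "es", "fr", "it", "nl", "pt", "ru",
}


def normalize_language(code: str) -> str:
    """Lowercase a language code and verify it appears in the table above."""
    normalized = code.strip().lower().replace("_", "-")
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported TTS language: {code!r}")
    return normalized


print(normalize_language("en-US"))
```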

Client → Server Messages

`init`

Must be sent once, immediately after connecting.

{
  "type": "init",
  "language": "en",
  "model": "auto",
  "voice_options": {
    "voice_id": "default_low",
    "speed": 0.5,
    "deaccent_strength": 0.0
  },
  "output": {
    "format": "pcm",
    "sample_rate": 24000
  }
}

Field	Type	Required	Default	Description
`language`	string	✓	—	BCP-47 language code, e.g. `en`, `ru`, `de`, `fr`.
`model`	string	—	`auto`	TTS model ID. Use `auto` to select the best model for the language.
`voice_options`	object	✓	—	Voice configuration. See fields below.
`output`	object	—	—	Output audio settings. See fields below.

`voice_options`

Field	Type	Required	Default	Description
`voice_id`	string	✓	—	Voice identifier. Use `default_low` or `default_high` for the language default.
`speed`	float	—	`0.5`	Speed multiplier. `0.5` is normal speed. Range: `0.0`–`1.0`.
`deaccent_strength`	float	—	`0.0`	Reduces foreign accent. Range: `0.0`–`1.0`.

`output`

Field	Type	Required	Default	Description
`format`	string	—	`mp3`	Output audio format. One of `pcm`, `mp3`, `wav`.
`sample_rate`	integer	—	`24000`	Output sample rate in Hz. Range: `8000`–`48000`.
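
The field tables above can be turned into a small client-side builder that rejects out-of-range values before they reach the server. This is a sketch only: build_init and its parameter names are hypothetical, and the checks simply mirror the documented ranges rather than the server's actual validation.

```python
def build_init(
    language: str,
    voice_id: str = "default_low",
    *,
    speed: float = 0.5,
    deaccent_strength: float = 0.0,
    audio_format: str = "pcm",
    sample_rate: int = 24000,
    model: str = "auto",
) -> dict:
    """Build an init message, checking the documented ranges first."""
    if not 0.0 <= speed <= 1.0:
        raise ValueError("speed must be within 0.0-1.0")
    if not 0.0 <= deaccent_strength <= 1.0:
        raise ValueError("deaccent_strength must be within 0.0-1.0")
    if audio_format not in {"pcm", "mp3", "wav"}:
        raise ValueError("format must be one of pcm, mp3, wav")
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be within 8000-48000 Hz")
    return {
        "type": "init",
        "language": language,
        "model": model,
        "voice_options": {
            "voice_id": voice_id,
            "speed": speed,
            "deaccent_strength": deaccent_strength,
        },
        "output": {"format": audio_format, "sample_rate": sample_rate},
    }


print(build_init("en"))
```

The result can be serialised with json.dumps and sent as the first message after connecting.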

`text`

Send a chunk of text to synthesise. Each message must be 256 characters or fewer. Send multiple messages to stream longer text — mark the last chunk of each sentence with is_eos: true.

Voice options can be overridden per message — only the fields you supply are changed.

{
  "type": "text",
  "text": "Hello, how can I help you today?",
  "is_eos": true
}

{
  "type": "text",
  "text": "This part is spoken faster.",
  "is_eos": false,
  "voice_options": {
    "speed": 0.9
  }
}

Field	Type	Required	Default	Description
`text`	string	✓	—	Text to synthesise. Max 256 characters per message.
`generation_id`	string	—	—	Optional client-supplied ID (4–256 chars). If set, it's returned as `generation_id` on the resulting `audio_chunk` messages so you can correlate output with this request. If omitted, the server assigns one.
`is_eos`	boolean	—	`false`	End of sentence. When `true`, finalizes speech synthesis for the current sentence.
`voice_options`	object	—	—	Override of session voice options for this and subsequent messages.
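
To illustrate the 256-character limit and the is_eos flag together, the sketch below splits one sentence into text messages on word boundaries and sets is_eos only on the last one. The helper sentence_to_messages is hypothetical. It assumes the limit counts Python string characters and that chunks may be sent without leading or trailing spaces; neither is confirmed by this reference.

```python
from typing import Dict, List

MAX_CHARS = 256  # documented per-message limit


def sentence_to_messages(sentence: str, max_chars: int = MAX_CHARS) -> List[Dict]:
    """Split one sentence into text messages; only the last has is_eos=True."""
    chunks: List[str] = []
    current = ""
    for word in sentence.split():
        # A single word longer than the limit is cut into hard slices.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    last = len(chunks) - 1
    return [
        {"type": "text", "text": chunk, "is_eos": index == last}
        for index, chunk in enumerate(chunks)
    ]


for message in sentence_to_messages("Hello, how can I help you today?"):
    print(message)
```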

`cancel`

Stop the current synthesis immediately. The session stays open — you can send new text messages right after.

{
  "type": "cancel"
}

Server → Client Messages

`audio_chunk`

Sent as audio is generated. Each message contains a base64-encoded audio chunk in the format set by output.format in the init message.

{
  "message_type": "audio_chunk",
  "data": {
    "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZQ==",
    "size": 9600,
    "generation_id": "1a2b3c4d",
    "last_chunk": false,
    "chunk_generation_delta": 120,
    "audio_len": 0.2
  }
}

Field	Type	Description
`data.audio`	string	Base64-encoded audio data.
`data.size`	integer	Size of the decoded audio in bytes.
`data.generation_id`	string	ID of the generation this chunk belongs to. Echoes the `generation_id` you sent in the `text` message, or a server-assigned value if you didn't. Use it to correlate chunks with a specific request.
`data.last_chunk`	boolean	`true` on the final chunk of a generation — synthesis for that sentence is complete. Lets you detect the end without guessing.
`data.chunk_generation_delta`	integer \| null	Latency metric: milliseconds elapsed from the start of the generation to the moment this chunk was produced. `null` when the backend reports no timing for the chunk — in particular on the final end-of-generation marker (see `last_chunk`). Informational only.
`data.audio_len`	number \| null	Duration of this chunk's audio in seconds. Present on every chunk that carries audio; `null` on the final end-of-generation marker, which carries no audio.
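
One way to use generation_id and last_chunk together is to accumulate decoded audio per generation and mark a generation finished when its final chunk arrives. The class below is an illustrative sketch, not part of the API. It assumes the end-of-generation marker may arrive with a null or empty audio field, which the description of audio_len suggests but does not state outright.

```python
import base64
from collections import defaultdict


class GenerationCollector:
    """Group decoded audio by generation_id and track completed generations."""

    def __init__(self) -> None:
        self.audio = defaultdict(bytearray)
        self.finished = set()

    def feed(self, message: dict) -> bool:
        """Consume one server message; return True if it closed a generation."""
        if message.get("message_type") != "audio_chunk":
            return False
        data = message["data"]
        generation_id = data["generation_id"]
        payload = data.get("audio")
        if payload:  # skip markers that carry no audio
            self.audio[generation_id].extend(base64.b64decode(payload))
        if data.get("last_chunk"):
            self.finished.add(generation_id)
            return True
        return False


collector = GenerationCollector()
collector.feed({
    "message_type": "audio_chunk",
    "data": {"audio": base64.b64encode(b"\x00\x01").decode(),
             "generation_id": "1a2b3c4d", "last_chunk": False},
})
print(len(collector.audio["1a2b3c4d"]))
```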

`error`

Sent on session-level or synthesis errors.

{
  "message_type": "error",
  "data": {
    "code": "SERVICE_UNAVAILABLE",
    "desc": "Speech synthesis timed out. Please try again."
  }
}

Code	Retryable	Description
`SERVICE_UNAVAILABLE`	✓	Synthesis service issue. Wait and retry.
`SERVER_ERROR`	✗	Server-side synthesis failure that is not retryable.
`UNKNOWN_ERROR`	✗	Unexpected, unhandled internal error. Contact support if it persists.
`BAD_REQUEST`	✗	Incoming message is not valid JSON or not valid UTF-8. The session stays open.
`VALIDATION_ERROR`	✗	Invalid message field. See `desc` for details.
`CONFLICT`	✗	Session already initialised. Reconnect to change settings.
`SESSION_NOT_FOUND`	✗	Send `init` before sending text.
`UNAUTHORIZED`	✗	JWT token missing, invalid, or expired.
`RATE_LIMIT_EXCEEDED`	✓	Text-message rate limit exceeded (max 50 messages per second). The session stays open — wait briefly and retry.

Two independent limits

Connections — max 20 new connections per minute per token. Exceeding this does not produce a RATE_LIMIT_EXCEEDED message: the server closes the WebSocket with code 1008 (policy violation) and reason Connection rate limit exceeded. Wait before reconnecting.
Text messages — max 50 text messages per second. Exceeding this returns an error with code RATE_LIMIT_EXCEEDED and keeps the session open.
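
The error table and the text-message limit can be handled on the client with a retry check and a simple pacing helper, sketched below. Both is_retryable and TextRateLimiter are hypothetical names. The limiter assumes the server counts messages over a sliding one-second window, which this reference does not specify.

```python
import time
from collections import deque

# Codes marked retryable in the error table above.
RETRYABLE_CODES = {"SERVICE_UNAVAILABLE", "RATE_LIMIT_EXCEEDED"}


def is_retryable(error_message: dict) -> bool:
    """Return True if an error message carries a retryable code."""
    return error_message.get("data", {}).get("code") in RETRYABLE_CODES


class TextRateLimiter:
    """Client-side pacing that stays under a per-second message budget."""

    def __init__(self, max_per_second: int = 50, clock=time.monotonic) -> None:
        self.max_per_second = max_per_second
        self.clock = clock
        self.sent = deque()  # timestamps of recently sent messages

    def delay_needed(self) -> float:
        """Seconds to wait before the next text message may be sent."""
        now = self.clock()
        while self.sent and now - self.sent[0] >= 1.0:
            self.sent.popleft()
        if len(self.sent) < self.max_per_second:
            return 0.0
        return 1.0 - (now - self.sent[0])

    def record(self) -> None:
        """Note that a text message was just sent."""
        self.sent.append(self.clock())
```

In an asyncio client, awaiting asyncio.sleep(limiter.delay_needed()) before each send and calling limiter.record() afterwards keeps traffic under the limit. A connection closed with code 1008 is a separate case and calls for waiting before reconnecting.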

Output formats

Format	Description
`pcm`	Raw 16-bit signed PCM, little-endian. No container. Best for real-time playback — schedule each chunk immediately as it arrives.
`mp3`	MPEG Layer 3 compressed audio. Collect all chunks and decode together.
`wav`	PCM with RIFF/WAVE container. Collect all chunks and decode together.

Real-time PCM playback

Decode each data.audio as base64 → Uint8Array → Int16Array (little-endian) → Float32Array, create an AudioBuffer and schedule with AudioBufferSourceNode.start(nextTime) — accumulating nextTime += audioBuffer.duration after each chunk for gapless playback.
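
The same conversion can be sketched in Python for clients that do not run in a browser. The helper names are hypothetical, and mono output is an assumption based on the sample chunk above (9600 bytes for 0.2 s at 24000 Hz); a real player would also carry a trailing odd byte over to the next chunk instead of dropping it.

```python
import struct
from typing import List


def pcm16le_to_float(pcm: bytes) -> List[float]:
    """Convert 16-bit signed little-endian PCM to floats in [-1.0, 1.0)."""
    count = len(pcm) // 2  # ignore a trailing odd byte in this sketch
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [sample / 32768.0 for sample in samples]


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 24000,
                         channels: int = 1) -> float:
    """Duration of a raw 16-bit PCM chunk, used to advance the play cursor."""
    return num_bytes / (2 * channels * sample_rate)


# Accumulate the start time of each chunk for gapless scheduling.
next_time = 0.0
for chunk_size in (9600, 9600, 4800):
    print(f"schedule chunk at {next_time:.3f} s")
    next_time += pcm_duration_seconds(chunk_size)
```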

Example — streaming from an LLM

This example shows how sentences split into chunks can be streamed to the TTS API. Voice speed is increased mid-stream to demonstrate per-message overrides.

import asyncio, json, base64, requests, websockets

SENTENCES = [
    (
        "The sun was setting over the mountains,",
        "casting long golden shadows across the valley below.",
    ),
    (
        "Birds were returning to their nests,",
        "filling the air with their evening songs.",
    ),
    (
        "A gentle breeze moved through the tall grass,",
        "creating waves that rippled toward the horizon.",
    ),
]

def create_session():
    # The V2 endpoint returns the `ws_tts_url` and `publisher` token.
    resp = requests.post(
        "https://api.palabra.ai/session-storage/session",
        headers={
            "ClientID": "<your-client-id>",
            "ClientSecret": "<your-client-secret>",
            "Content-Type": "application/json",
        },
        json={"data": {"subscriber_count": 0}},
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    return data["ws_tts_url"], data["publisher"]

async def main():
    ws_tts_url, publisher_jwt = create_session()
    url = f"{ws_tts_url}?token={publisher_jwt}"

    async with websockets.connect(url) as ws:
        # 1. Initialise the session
        await ws.send(json.dumps({
            "type": "init",
            "language": "en",
            "model": "auto",
            "voice_options": {
                "voice_id": "default_low",
                "speed": 0.5,
                "deaccent_strength": 0.0,
            },
            "output": {
                "format": "pcm",
                "sample_rate": 24000,
            }
        }))

        # 2. Send each sentence chunk by chunk, is_eos=True on the last chunk of each sentence
        sent = 0
        for i, sentence in enumerate(SENTENCES):
            for k, chunk in enumerate(sentence):
                msg = {
                    "type": "text",
                    "text": chunk,
                    "is_eos": k == len(sentence) - 1,
                }
                if k == 0:  # override speed only at the start of each sentence
                    msg["voice_options"] = {"speed": min(0.5 + i / 10, 1.0)}
                await ws.send(json.dumps(msg))
                sent += 1

        # 3. Collect audio until every generation has signalled last_chunk
        audio = bytearray()
        done = 0
        async for raw in ws:
            data = json.loads(raw)
            if data.get("message_type") == "audio_chunk":
                chunk = data["data"]
                # The end-of-generation marker carries no audio, so tolerate a null or missing payload.
                audio.extend(base64.b64decode(chunk.get("audio") or ""))
                print(f"Chunk: {chunk['size']} bytes, total: {len(audio)}")
                if chunk.get("last_chunk"):
                    done += 1
                    if done == sent:   # all sentences finished
                        break
            elif data.get("message_type") == "error":
                print(f"Error: {data['data']['code']} — {data['data']['desc']}")
                break

        print(f"Total: {len(audio)} bytes")
        with open("output.pcm", "wb") as f:
            f.write(audio)
        # Play: ffplay -f s16le -ar 24000 -ac 1 output.pcm

asyncio.run(main())
