Introduction
The Realtime API provides low-latency, real-time text and audio conversation over a long-lived WebSocket connection using an event-stream interaction model. It supports both text and audio input/output, enabling real-time voice conversations, text conversations, and more.
Endpoint
wss://llm.ai-nebula.com/v1/realtime?model={model}
Authentication
Bearer Token, e.g. Bearer sk-xxxxxxxxxx
Connection Parameters
model - model name, supported models:
- gpt-realtime - GPT Realtime Standard
- gpt-realtime-mini - GPT Realtime Mini
Basic Information
| Item | Content |
|---|---|
| Base URL | wss://llm.ai-nebula.com |
| Endpoint | /v1/realtime?model={model} |
| Protocol | WebSocket (JSON event stream) |
| Audio Format | PCM16 mono, 24000Hz sample rate |
Event Types
Client-Sent Events
| Event Type | Description | Required Fields |
|---|---|---|
| session.update | Set/update session configuration (send first after connecting) | session |
| conversation.item.create | Send a conversation message (text or audio) | item |
| input_audio_buffer.append | Stream audio data (Base64-encoded) | audio |
| input_audio_buffer.commit | Commit the audio buffer and trigger processing | None |
| response.create | Request response generation (call after sending a message) | None |
Server-Returned Events
| Event Type | Description | Included Fields |
|---|---|---|
| session.created | Session created | session.id |
| session.updated | Session configuration updated | session |
| conversation.item.created | Conversation item created | item |
| response.created | Response created | response.id |
| response.text.delta | Incremental text output | delta |
| response.text.done | Text output completed | None |
| response.audio.delta | Incremental audio output | delta |
| response.audio.done | Audio output completed | None |
| response.audio_transcript.delta | Incremental audio transcription | delta |
| response.function_call_arguments.delta | Incremental function-call arguments | delta |
| response.function_call_arguments.done | Function-call arguments completed | None |
| response.done | Response round completed, includes usage statistics | response.usage |
| error | Error event | error |
Session Configuration
After establishing a WebSocket connection, you must first send a session.update event to configure session parameters.
Session Configuration Example
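A minimal sketch of a session.update payload. The field names inside session (modalities, instructions, voice, input_audio_format, output_audio_format, temperature) are assumed to follow the common Realtime API schema and are not confirmed by this document:

```json
{
  "type": "session.update",
  "event_id": "evt_001",
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful assistant.",
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "temperature": 0.8
  }
}
```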
Session Parameters
- Interaction modes: optional values "text" (text mode) and "audio" (audio mode); multiple modes can be combined, e.g. ["text", "audio"]
- System prompt: sets the assistant's behavior and role
- Voice type: optional values alloy, echo, fable, onyx, nova, shimmer
- Temperature: controls output randomness, range 0.0 - 2.0
- Input audio format: currently only pcm16 is supported
- Output audio format: currently only pcm16 is supported
- Tool list: tool function definitions, supports function calling
- Tool choice strategy: auto, required, none
Sending Messages
Text Message Example
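A sketch of a text message sent via conversation.item.create; the item content structure (role, input_text) is assumed from the common Realtime API schema:

```json
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      { "type": "input_text", "text": "Hello!" }
    ]
  }
}
```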
Audio Message Example
Audio messages require first pushing audio data via input_audio_buffer.append, then calling input_audio_buffer.commit to submit:
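A sketch of the two-event audio sequence (the audio value is a placeholder for your Base64-encoded PCM16 data):

```json
{ "type": "input_audio_buffer.append", "audio": "<Base64-encoded PCM16 chunk>" }
{ "type": "input_audio_buffer.commit" }
```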
Request Response Generation
After sending a message, you need to call response.create to trigger generation:
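The trigger event itself carries no required fields, so a minimal payload sketch is:

```json
{ "type": "response.create", "event_id": "evt_002" }
```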
Complete Examples
Python Example
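A minimal end-to-end sketch using the websocket-client package named in the dependencies below. The event payload shapes (modalities, input_text, etc.) are assumptions following the common Realtime API schema, not confirmed by this document:

```python
import base64
import json

BASE_URL = "wss://llm.ai-nebula.com/v1/realtime"


def session_update_event(instructions, modalities=("text",)):
    """Build the session.update event that must be sent first."""
    return {
        "type": "session.update",
        "session": {
            "modalities": list(modalities),
            "instructions": instructions,
        },
    }


def text_message_event(text):
    """Build a conversation.item.create event carrying a user text message."""
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    }


def audio_append_event(pcm16_bytes):
    """Base64-encode a raw PCM16 chunk for input_audio_buffer.append."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }


def main(api_key, model="gpt-realtime-mini"):
    # Requires: pip install websocket-client
    import websocket

    ws = websocket.create_connection(
        f"{BASE_URL}?model={model}",
        header={"Authorization": f"Bearer {api_key}"},
    )
    # Configure the session first, then send a message and trigger a response.
    ws.send(json.dumps(session_update_event("You are a helpful assistant.")))
    ws.send(json.dumps(text_message_event("Hello!")))
    ws.send(json.dumps({"type": "response.create"}))

    # Consume the event stream until the response round completes.
    while True:
        event = json.loads(ws.recv())
        if event.get("type") == "response.text.delta":
            print(event.get("delta", ""), end="", flush=True)
        elif event.get("type") == "response.done":
            print("\nusage:", event.get("response", {}).get("usage"))
            break
        elif event.get("type") == "error":
            print("error:", event.get("error"))
            break
    ws.close()


# Example invocation (requires a valid API key and network access):
# main("sk-xxxxxxxxxx")
```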
JavaScript Example
Response Examples
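A sketch of the incremental events a client might receive; the exact field layout (response_id, the nesting of usage) is assumed, only delta and response.usage are named by the tables above:

```json
{ "type": "response.text.delta", "response_id": "resp_001", "delta": "Hello" }
{ "type": "response.text.done" }
{ "type": "response.done", "response": { "usage": { "input_tokens": 10, "output_tokens": 15, "total_tokens": 25 } } }
```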
Error Handling
Error Event Format
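A sketch of an error event; the error sub-fields (type, code, message) are assumed from common API conventions, the document itself only names the top-level error field:

```json
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "...",
    "message": "..."
  }
}
```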
Common Errors
| Error Type | Trigger Scenario | Solution |
|---|---|---|
| authentication_error | Invalid or unauthorized API Key | Verify the API Key in the Authorization header is valid |
| invalid_request_error | Malformed request | Check event format and required fields |
| model_not_found | Incorrect model name | Only gpt-realtime and gpt-realtime-mini are supported |
| audio_decode_error | Incorrect audio format | Ensure PCM16 mono at a 24000 Hz sample rate |
| rate_limit_error | Request rate too high | Reduce request frequency or retry after waiting |
| server_error | Internal server error | Retry later or contact technical support |
Audio Format Requirements
Input Audio
- Format: PCM16 (16-bit PCM)
- Channels: Mono
- Sample Rate: 24000 Hz
- Encoding: Base64-encoded, sent via input_audio_buffer.append
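Splitting raw PCM16 input into append events can be sketched as follows; the 4800-byte chunk size is an illustrative choice (100 ms of 16-bit mono audio at 24000 Hz), not a value mandated by the API:

```python
import base64


def pcm16_append_events(pcm_bytes, chunk_size=4800):
    """Split raw PCM16 mono 24000 Hz audio into input_audio_buffer.append
    events, followed by the commit that triggers processing.

    4800 bytes = 2400 samples = 100 ms at 24000 Hz, 16-bit mono."""
    events = []
    for i in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[i:i + chunk_size]
        events.append({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # Close the stream with a commit so the server processes the buffer.
    events.append({"type": "input_audio_buffer.commit"})
    return events
```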
Output Audio
- Format: PCM16 (16-bit PCM)
- Channels: Mono
- Sample Rate: 24000 Hz
- Encoding: Base64-encoded, returned via response.audio.delta events
Usage Flow
- Establish Connection: connect via WebSocket to wss://llm.ai-nebula.com/v1/realtime?model={model}
- Configure Session: send a session.update event to configure session parameters
- Send Message:
  - Text mode: send a conversation.item.create event
  - Audio mode: first send input_audio_buffer.append to push audio, then input_audio_buffer.commit to submit, and finally send conversation.item.create
- Request Response: send a response.create event to trigger generation
- Receive Response: listen for response.text.delta or response.audio.delta events to receive incremental output
- Handle Completion: after receiving the response.done event, check the usage statistics
Notes
- Required Step: after establishing the connection, you must first send session.update to configure the session
- Trigger Response: after sending a message, you must call response.create to trigger generation
- Audio Format: audio must be PCM16, mono, 24000 Hz, Base64-encoded
- Event ID: it is recommended to set a unique event_id on each event for tracking and debugging
- Connection Management: keep the WebSocket connection active; avoid frequent disconnects and reconnects
- Error Handling: listen for error events and implement appropriate error-handling logic
- Dependencies:
  - Python: pip install websocket-client
  - JavaScript: use the native WebSocket API or the ws library
Best Practices
- Connection Reuse: reuse the same WebSocket connection across multiple conversation rounds to reduce connection overhead
- Error Retry: implement an exponential-backoff retry mechanism for network errors
- Audio Buffering: send audio data in chunks rather than in one oversized payload
- Usage Statistics: monitor the usage information in response.done to keep costs under control
- Timeout Handling: set reasonable timeouts to avoid long waits
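The exponential-backoff recommendation above can be sketched as follows; the delay parameters and the retried exception types are illustrative choices, not prescribed by the API:

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: attempt 0 waits up to 0.5 s,
    attempt 1 up to 1 s, doubling each time, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def with_retries(connect, max_attempts=5):
    """Call `connect` until it succeeds, sleeping a jittered exponential
    delay between attempts; re-raise after the final failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except (ConnectionError, TimeoutError, OSError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

A caller would wrap its WebSocket connection setup in with_retries so transient network failures are retried with increasing, jittered delays instead of hammering the server.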
