Introduction
Gemini Live API supports low-latency, real-time voice and video interactions with Gemini. It can process continuous audio, video, or text streams to provide instant, natural, and realistic voice responses. Key Features:- ✅ High-quality audio: Provides natural, realistic voices in multiple languages
- ✅ Multi-language support: Supports conversations in 24 languages
- ✅ Interruption capability: Users can interrupt the model at any time for responsive interactions
- ✅ Empathetic conversations: Adjusts response style and tone based on the emotional expression of user input
- ✅ Tool usage: Integrates function calling and Google Search
- ✅ Audio transcription: Provides text transcription of user input and model output
- ✅ Proactive audio: Controls when and in which contexts the model responds
API Endpoint
Endpoint:wss://llm.ai-nebula.com/v1beta/models/{model}/liveStream
Features:
- Uses Gemini Live API native format
- Direct passthrough, no protocol conversion
- Supports all native Gemini features
Authentication
Bearer Token, e.g.,
Bearer sk-xxxxxxxxxxSupported Models
The following models support Gemini Live API:| Model ID | Availability | Use Case | Key Features |
|---|---|---|---|
gemini-live-2.5-flash-native-audio | Generally Available | Recommended. Low-latency voice agent. Supports seamless multi-language switching and emotional tone. | Native audio, audio transcription, voice activity detection, empathetic conversations, proactive audio, tool usage |
Voice and Language Configuration
Voice Configuration
Gemini Live API supports 30 different preset voices, each with unique expression characteristics:| Voice Name | Style | Voice Name | Style | Voice Name | Style |
|---|---|---|---|---|---|
| Zephyr | Bright | Puck | Cheerful | Charon | Informative |
| Kore | Firm | Fenrir | Excited | Leda | Youthful |
| Orus | Firm | Aoede | Light | Callirrhoe | Easy-going |
| Autonoe | Bright | Enceladus | Breathy | Iapetus | Clear |
| Umbriel | Relaxed | Algieba | Smooth | Despina | Natural |
| Erinome | Clear | Algenib | Hoarse | Rasalgethi | Informative |
| Laomedeia | Cheerful | Achernar | Soft | Alnilam | Strong |
| Schedar | Steady | Gacrux | Mature | Pulcherrima | Positive |
| Achird | Friendly | Zubenelgenubi | Casual | Vindemiatrix | Gentle |
| Sadachbia | Lively | Sadaltager | Learned | Sulafat | Warm |
Language Configuration
Supports 24 languages, specified via BCP-47 language codes:| Language | Code | Language | Code |
|---|---|---|---|
| Arabic (Egypt) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian | ro-RO |
| Ukrainian | uk-UA | Bengali | bn-BD |
| English (India) | en-IN | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
| Chinese (Simplified) | zh-CN |
Usage Examples
JavaScript Example
Python Example
Configuration Examples
Example 1: Audio Only Mode
Example 2: Audio + Text Transcription Mode (Recommended)
responseModalities: Response modality, choose one of the following:["AUDIO"]- Audio output only["AUDIO", "TEXT"]- Audio + text transcription (recommended, get both audio and text)
voiceName: Voice name, supports 30 preset voices (see voice configuration table above)languageCode: Language code, supports 24 languages (see language configuration table above)googleSearch: Enable Google Search functionalityproactiveAudio: Proactive audio, model can choose not to respond to irrelevant audioempatheticMode: Empathetic conversations, adjusts response style based on emotionsoutputAudioTranscription: Enable output audio-to-text transcription (requires"TEXT"inresponseModalitiesto see transcription text)automaticActivityDetection: Voice activity detection configuration
Message Types
Client Messages
| Message Type | Description |
|---|---|
setup | Session configuration |
clientContent | Client content (text/audio) |
realtimeInput | Realtime audio input |
toolResponse | Tool response |
Server Messages
| Message Type | Description |
|---|---|
setupComplete | Setup completion confirmation |
serverContent | Server content (text/audio/transcription) |
toolCall | Tool call |
toolCallCancellation | Tool call cancellation |
usageMetadata | Usage statistics |
Token Statistics
The system separately tracks:- Text Tokens (input/output)
- Audio Tokens (input/output)
- Total Token Count
usageMetadata messages:
Pricing
Important Note: Model prices may change. Please refer to the latest prices displayed in the model marketplace. Gemini Live API is billed by token, separately tracking text and audio tokens:- Text Tokens: Used for input text content and output text transcription
- Audio Tokens: Used for input audio and output audio content
usageMetadata messages, including input/output token counts for both text and audio.
Technical Specifications
Audio Format
Input Audio:- Format: 16-bit PCM
- Sample Rate: 16kHz
- Byte Order: Little-endian
- Encoding: Base64
- Format: 16-bit PCM
- Sample Rate: 24kHz
- Byte Order: Little-endian
- Encoding: Base64
FAQ
How to select a voice?
How to select a voice?
Specify the voice name in
speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName of the setup message. Supports 30 preset voices. See the Voice Configuration section above for the complete list. Default voice is Zephyr.How to enable audio-to-text transcription?
How to enable audio-to-text transcription?
Two conditions must be met:
- Include
"TEXT"ingenerationConfig.responseModalities(e.g.,["AUDIO", "TEXT"]) - Add
outputAudioTranscription: {}field in the setup message
serverContent.outputTranscription.How to enable Google Search?
How to enable Google Search?
Add
tools: { googleSearch: {} } field in the setup message. Once enabled, the model can search for the latest web information when answering questions.How to enable tool calling?
How to enable tool calling?
Add tool definitions in the setup message:
How to interrupt model response?
How to interrupt model response?
Sending a new
realtimeInput or clientContent message will interrupt the current response.Does it support video input?
Does it support video input?
Yes, Gemini Live API supports video input. Video data (JPEG format, 1 FPS) can be included in
clientContent.How to get usage statistics?
How to get usage statistics?
The system sends
usageMetadata messages during or after response completion, containing detailed usage statistics.How to configure speech recognition sensitivity?
How to configure speech recognition sensitivity?
Configure in
realtimeInputConfig.automaticActivityDetection of the setup message: