
Introduction

Gemini Live API supports low-latency, real-time voice and video interactions with Gemini. It can process continuous audio, video, or text streams to provide instant, natural, and realistic voice responses.

Key Features:
  • ✅ High-quality audio: Provides natural, realistic voices in multiple languages
  • ✅ Multi-language support: Supports conversations in 24 languages
  • ✅ Interruption capability: Users can interrupt the model at any time for responsive interactions
  • ✅ Empathetic conversations: Adjusts response style and tone based on the emotional expression of user input
  • ✅ Tool usage: Integrates function calling and Google Search
  • ✅ Audio transcription: Provides text transcription of user input and model output
  • ✅ Proactive audio: Controls when and in which contexts the model responds

API Endpoint

Endpoint: wss://llm.ai-nebula.com/v1beta/models/{model}/liveStream

Features:
  • Uses Gemini Live API native format
  • Direct passthrough, no protocol conversion
  • Supports all native Gemini features
Example:
// Node.js example using the 'ws' package; the browser WebSocket constructor does not accept custom headers
const WebSocket = require('ws');

const ws = new WebSocket('wss://llm.ai-nebula.com/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream', {
  headers: {
    'Authorization': 'Bearer sk-xxxx'
  }
});

Authentication

Authorization
string
required
Bearer Token, e.g., Bearer sk-xxxxxxxxxx

Supported Models

The following models support Gemini Live API:
| Model ID | Availability | Use Case | Key Features |
| --- | --- | --- | --- |
| gemini-live-2.5-flash-native-audio | Generally Available | Recommended. Low-latency voice agent. Supports seamless multi-language switching and emotional tone. | Native audio, audio transcription, voice activity detection, empathetic conversations, proactive audio, tool usage |

Voice and Language Configuration

Voice Configuration

Gemini Live API supports 30 different preset voices, each with unique expression characteristics:
| Voice Name | Style | Voice Name | Style | Voice Name | Style |
| --- | --- | --- | --- | --- | --- |
| Zephyr | Bright | Puck | Cheerful | Charon | Informative |
| Kore | Firm | Fenrir | Excited | Leda | Youthful |
| Orus | Firm | Aoede | Light | Callirrhoe | Easy-going |
| Autonoe | Bright | Enceladus | Breathy | Iapetus | Clear |
| Umbriel | Relaxed | Algieba | Smooth | Despina | Natural |
| Erinome | Clear | Algenib | Hoarse | Rasalgethi | Informative |
| Laomedeia | Cheerful | Achernar | Soft | Alnilam | Strong |
| Schedar | Steady | Gacrux | Mature | Pulcherrima | Positive |
| Achird | Friendly | Zubenelgenubi | Casual | Vindemiatrix | Gentle |
| Sadachbia | Lively | Sadaltager | Learned | Sulafat | Warm |
Default Voice: Zephyr (Bright)

Language Configuration

Supports 24 languages, specified via BCP-47 language codes:
| Language | Code | Language | Code |
| --- | --- | --- | --- |
| Arabic (Egypt) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian | ro-RO |
| Ukrainian | uk-UA | Bengali | bn-BD |
| English (India) | en-IN | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
| Chinese (Simplified) | zh-CN | | |
Default Language: Automatically inferred from the language in system instructions

Usage Examples

JavaScript Example

// Node.js example using the 'ws' package; the browser WebSocket constructor does not accept custom headers
const WebSocket = require('ws');

const ws = new WebSocket('wss://llm.ai-nebula.com/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream', {
  headers: {
    'Authorization': 'Bearer sk-xxxx'
  }
});

ws.onopen = () => {
  console.log('WebSocket connected');
  
  // Send setup message
  ws.send(JSON.stringify({
    setup: {
      model: "gemini-live-2.5-flash-native-audio",
      generationConfig: {
        temperature: 0.7,
        responseModalities: ["AUDIO"],
        speechConfig: {
          voiceConfig: {
            prebuiltVoiceConfig: {
              voiceName: "Puck"
            }
          }
        }
      },
      systemInstruction: {
        parts: [
          { text: "You are a helpful assistant. Speak naturally and conversationally." }
        ]
      }
    }
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('Received:', message);
  
  if (message.serverContent) {
    // Handle output transcription (audio to text)
    if (message.serverContent.outputTranscription) {
      const text = message.serverContent.outputTranscription.text;
      if (text) {
        console.log('[Transcription]', text);
      }
    }
    
    if (message.serverContent.modelTurn) {
      // Handle model output
      message.serverContent.modelTurn.parts.forEach(part => {
        if (part.text) {
          console.log('Text:', part.text);
        }
        if (part.inlineData && part.inlineData.mimeType === "audio/pcm") {
          // Handle audio data
          const audioData = part.inlineData.data;
          // audioData is base64-encoded PCM audio
        }
      });
    }
    if (message.serverContent.turnComplete) {
      console.log('Turn complete');
    }
  }
  
  if (message.setupComplete) {
    console.log('Setup complete');
  }
};

// Send realtime audio input
function sendRealtimeAudio(audioBuffer) {
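  // Note: spreading a large buffer into String.fromCharCode can overflow the
  // call stack for long recordings; convert in chunks if needed.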
  const base64Audio = btoa(
    String.fromCharCode(...new Uint8Array(audioBuffer))
  );
  
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [
        {
          mimeType: "audio/pcm;rate=16000",
          data: base64Audio
        }
      ]
    }
  }));
}

// Send text message
function sendText(text) {
  ws.send(JSON.stringify({
    clientContent: {
      turns: [
        {
          role: "user",
          parts: [
            { text: text }
          ]
        }
      ],
      turnComplete: true
    }
  }));
}

Python Example

# Python example using the websocket-client package (pip install websocket-client)
import websocket
import json
import base64

def on_message(ws, message):
    data = json.loads(message)
    print(f"Received: {data}")
    
    # Handle output transcription
    if "serverContent" in data:
        server_content = data["serverContent"]
        
        if "outputTranscription" in server_content:
            transcription = server_content["outputTranscription"]
            text = transcription.get("text", "")
            if text:
                print(f"[Transcription] {text}")
        
        if "modelTurn" in server_content:
            model_turn = server_content["modelTurn"]
            if "parts" in model_turn:
                for part in model_turn["parts"]:
                    if "text" in part:
                        print(f"Text: {part['text']}")
                    elif "inlineData" in part:
                        inline_data = part["inlineData"]
                        if inline_data.get("mimeType") == "audio/pcm":
                            audio_b64 = inline_data.get("data", "")
                            if audio_b64:
                                audio_data = base64.b64decode(audio_b64)
                                # Handle audio data

def on_error(ws, error):
    print(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    print("Connection closed")

def on_open(ws):
    print("WebSocket connected")
    
    # Send setup message
    setup_message = {
        "setup": {
            "model": "gemini-live-2.5-flash-native-audio",
            "generationConfig": {
                "temperature": 0.7,
                "responseModalities": ["AUDIO"]
            },
            "systemInstruction": {
                "parts": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {
                        "voiceName": "Puck"
                    }
                }
            }
        }
    }
    ws.send(json.dumps(setup_message))

# Connect WebSocket
ws_url = "wss://llm.ai-nebula.com/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream"
ws = websocket.WebSocketApp(
    ws_url,
    header={"Authorization": "Bearer sk-xxxx"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws.run_forever()

Configuration Examples

Example 1: Audio Only Mode

{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generationConfig": {
      "temperature": 0.7,
      "responseModalities": ["AUDIO"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {
            "voiceName": "Zephyr"
          }
        },
        "languageCode": "zh-CN"
      }
    },
    "systemInstruction": {
      "parts": [
        {"text": "你是一个友好的助手,请用自然、对话式的方式回答问题。"}
      ]
    }
  }
}

Example 2: Audio + Text Mode (Full Features)

{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generationConfig": {
      "temperature": 0.7,
      "responseModalities": ["AUDIO", "TEXT"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {
            "voiceName": "Zephyr"
          }
        },
        "languageCode": "zh-CN"
      }
    },
    "systemInstruction": {
      "parts": [
        {"text": "你是一个友好的助手,请用自然、对话式的方式回答问题。"}
      ]
    },
    "tools": {
      "googleSearch": {}
    },
    "proactivity": {
      "proactiveAudio": false,
      "empatheticMode": true
    },
    "outputAudioTranscription": {},
    "realtimeInputConfig": {
      "automaticActivityDetection": {
        "disabled": false,
        "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
        "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
        "prefixPaddingMs": 0,
        "silenceDurationMs": 0
      }
    }
  }
}
Configuration Notes:
  • responseModalities: Response modalities; choose one of the following:
    • ["AUDIO"] - Audio output only
    • ["AUDIO", "TEXT"] - Audio plus text transcription (recommended; returns both audio and text)
  • voiceName: Voice name; 30 preset voices are supported (see the voice configuration table above)
  • languageCode: Language code; 24 languages are supported (see the language configuration table above)
  • googleSearch: Enables Google Search
  • proactiveAudio: Proactive audio; the model can choose not to respond to irrelevant audio
  • empatheticMode: Empathetic conversations; adjusts response style based on detected emotion
  • outputAudioTranscription: Enables output audio-to-text transcription (requires "TEXT" in responseModalities for the transcription text to appear)
  • automaticActivityDetection: Voice activity detection configuration

Message Types

Client Messages

| Message Type | Description |
| --- | --- |
| setup | Session configuration |
| clientContent | Client content (text/audio) |
| realtimeInput | Realtime audio input |
| toolResponse | Tool response |

Server Messages

| Message Type | Description |
| --- | --- |
| setupComplete | Setup completion confirmation |
| serverContent | Server content (text/audio/transcription) |
| toolCall | Tool call |
| toolCallCancellation | Tool call cancellation |
| usageMetadata | Usage statistics |

Token Statistics

The system separately tracks:
  • Text Tokens (input/output)
  • Audio Tokens (input/output)
  • Total Token Count
Usage information is returned in usageMetadata messages:
{
  "usageMetadata": {
    "totalTokenCount": 100,
    "inputTokenCount": 50,
    "outputTokenCount": 50,
    "inputTokenDetails": {
      "textTokens": 30,
      "audioTokens": 20
    },
    "outputTokenDetails": {
      "textTokens": 25,
      "audioTokens": 25
    }
  }
}
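
A minimal sketch that records the usage figures from incoming messages, assuming the usageMetadata shape shown above (the document does not specify whether counts are cumulative or per turn, so this simply keeps the most recent values):

// Track the most recent usage figures reported by the server
let usage = { total: 0, textOut: 0, audioOut: 0 };

function handleUsage(message) {
  const meta = message.usageMetadata;
  if (!meta) return;
  usage.total = meta.totalTokenCount ?? usage.total;
  if (meta.outputTokenDetails) {
    usage.textOut = meta.outputTokenDetails.textTokens ?? usage.textOut;
    usage.audioOut = meta.outputTokenDetails.audioTokens ?? usage.audioOut;
  }
  console.log(`Tokens: ${usage.total} total, ${usage.textOut} text out, ${usage.audioOut} audio out`);
}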

Pricing

Important Note: Model prices may change; please refer to the latest prices displayed in the model marketplace.

Gemini Live API is billed by token, with text and audio tokens tracked separately:
  • Text Tokens: Used for input text content and output text transcription
  • Audio Tokens: Used for input audio and output audio content
The system returns detailed usage statistics in usageMetadata messages, including input/output token counts for both text and audio.

Technical Specifications

Audio Format

Input Audio:
  • Format: 16-bit PCM
  • Sample Rate: 16kHz
  • Byte Order: Little-endian
  • Encoding: Base64
Output Audio:
  • Format: 16-bit PCM
  • Sample Rate: 24kHz
  • Byte Order: Little-endian
  • Encoding: Base64
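
As a reference for the input side, here is a minimal sketch (the function name is illustrative) that converts Float32 samples from the Web Audio API, already resampled to 16 kHz, into the base64-encoded 16-bit little-endian PCM expected above:

// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM,
// then base64-encode for use as the data field of a mediaChunk
// (see sendRealtimeAudio in the JavaScript example).
function float32ToPcm16Base64(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true); // true = little-endian
  }
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}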

FAQ

How do I specify the voice?
Specify the voice name in generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName of the setup message. 30 preset voices are supported; see the Voice Configuration section above for the complete list. The default voice is Zephyr.

How do I enable output audio transcription?
Two conditions must be met:
  1. Include "TEXT" in generationConfig.responseModalities (e.g., ["AUDIO", "TEXT"])
  2. Add outputAudioTranscription: {} field in the setup message
Once enabled, the server returns the audio transcription text in serverContent.outputTranscription.
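
Putting both conditions together, a minimal setup fragment (abbreviated; other fields as in the examples above):

{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generationConfig": {
      "responseModalities": ["AUDIO", "TEXT"]
    },
    "outputAudioTranscription": {}
  }
}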

How do I use function calling?
Add tool definitions in the setup message:
{
  "setup": {
    "tools": {
      "functionDeclarations": [
        {
          "name": "get_weather",
          "description": "Get the weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string"
              }
            }
          }
        }
      ]
    }
  }
}
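
When the model decides to call the tool, the server sends a toolCall message; the client runs the function and replies with a toolResponse. A minimal sketch (getWeather stands in for a hypothetical local implementation of the declared function):

// Reply to a server toolCall with a toolResponse
function handleToolCall(message) {
  if (!message.toolCall) return;
  const functionResponses = message.toolCall.functionCalls.map(call => ({
    id: call.id,
    name: call.name,
    // getWeather is a hypothetical implementation of the function declared above
    response: { result: getWeather(call.args.location) }
  }));
  ws.send(JSON.stringify({ toolResponse: { functionResponses } }));
}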

How do I interrupt the model?
Sending a new realtimeInput or clientContent message interrupts the current response.

Does the API support video input?
Yes. Gemini Live API supports video input: video data (JPEG format, 1 FPS) can be included in clientContent, as sketched below.
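
For example, a captured frame could be attached as an inline image part, following the clientContent shape used earlier (a sketch; sendVideoFrame is an illustrative name, and jpegBase64 is assumed to be a base64-encoded JPEG):

// Send one JPEG video frame as an inline image part
function sendVideoFrame(jpegBase64) {
  ws.send(JSON.stringify({
    clientContent: {
      turns: [
        {
          role: "user",
          parts: [
            { inlineData: { mimeType: "image/jpeg", data: jpegBase64 } }
          ]
        }
      ],
      turnComplete: false // keep the turn open while streaming frames
    }
  }));
}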

How do I get usage statistics?
The system sends usageMetadata messages during or after response completion, containing detailed usage statistics.

How do I configure voice activity detection?
Configure realtimeInputConfig.automaticActivityDetection in the setup message:
{
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
      "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
      "prefixPaddingMs": 0,
      "silenceDurationMs": 0
    }
  }
}
