Introduction
A universal text chat API for generating conversational responses, compatible with the OpenAI interface. Through a single unified API you can call multiple mainstream large models, including OpenAI, Claude, DeepSeek, Grok, and Tongyi Qianwen (Qwen).
Authentication
Bearer token in the Authorization header, e.g. Authorization: Bearer sk-xxxxxxxxxx
Request Parameters
model: Model identifier. Supported models include:
- OpenAI series: o4-mini, o3-mini, gpt-5.2, gpt-5.1, gpt-4o, gpt-4o-mini, etc.
- Claude series: claude-opus-4-6, claude-sonnet-4-5-20250929, claude-haiku-4-5-20251001, etc.
- DeepSeek series: deepseek-v3-1-250821, deepseek-v3, deepseek-r1, etc.
- Grok series: grok-4, grok-4-fast-reasoning, grok-3, etc.
- Gemini series: gemini-3-pro-preview, gemini-3-flash-preview, nano-banana-pro, etc., plus -thinking / -nothinking / -thinking-<budget> / -thinking-low / -thinking-high variants
- Domestic models: glm-4.7, qwen3-coder-plus, kimi-k2.5, etc.
messages: Conversation message list; each element contains a role (user/system/assistant) and content.
temperature: Randomness control, range 0-2; higher values produce more random responses.
stream: Whether to enable streaming output; when enabled, the response is returned as SSE-format chunks.
max_tokens: Maximum number of tokens to generate; controls response length.
top_p: Nucleus sampling parameter, range 0-1; controls generation diversity.
Basic Examples
Non-Streaming Request
curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-opus-4-6",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "Briefly introduce artificial intelligence"}
    ],
    "temperature": 0.7
  }'
Streaming Request (SSE)
curl -N -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-opus-4-6",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "Briefly introduce artificial intelligence"}
    ]
  }'
Python Example
from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com/v1"
)

# Non-streaming
completion = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Briefly introduce artificial intelligence"}
    ],
    temperature=0.7
)
print(completion.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[
        {"role": "user", "content": "Briefly introduce artificial intelligence"}
    ],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "claude-opus-4-6",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Artificial intelligence is a branch of computer science that aims to create intelligent machines..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
Advanced Features
Tool Calling (Functions / Tools)
Supports the OpenAI-compatible tool calling format, applicable to GPT, Claude, DeepSeek, Grok, Tongyi Qianwen, and other models.
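A minimal sketch of the flow using the openai SDK; the get_weather tool here is hypothetical and stands in for whatever function you expose:

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com/v1"
)

# Hypothetical tool definition for illustration; replace with your own tools.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

completion = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
    tools=tools
)

msg = completion.choices[0].message
if msg.tool_calls:
    # Execute the tool yourself, then append a "tool" message whose
    # tool_call_id matches, so the model can produce the final answer.
    print(msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)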
Structured Output (JSON Schema)
Supports controlling output format through response_format parameter, applicable to GPT, Claude, Grok, and other models.
curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-opus-4-6",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "Answer",
        "schema": {
          "type": "object",
          "properties": {
            "summary": {"type": "string"}
          },
          "required": ["summary"]
        }
      }
    },
    "messages": [
      {"role": "user", "content": "Return a JSON containing a summary field"}
    ]
  }'
For strict structured output, it is recommended to lower the temperature value (e.g., 0.1-0.3) and set an appropriate max_tokens to improve consistency.
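The same request as a Python sketch; the schema name and fields mirror the curl example above, and the output is parsed as JSON:

import json
from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

completion = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Return a JSON containing a summary field"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Answer",
            "schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"]
            }
        }
    },
    temperature=0.2  # lower temperature for more consistent structured output
)

# The content should be a JSON string matching the schema.
data = json.loads(completion.choices[0].message.content)
print(data["summary"])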
Thinking Capability
Some models support a thinking (reasoning) capability that can expose the reasoning process while generating responses. Different models implement this differently:
DeepSeek
DeepSeek models support enabling the thinking capability through the thinking field:

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "deepseek-v3-1-250821",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "Give a medium-difficulty geometry problem and solve it step by step"}
    ],
    "thinking": {"type": "enabled"}
  }'
- By default, thinking.type is "disabled"; it must be explicitly set to "enabled"
- The output form of the thinking capability may vary by model version
- It is recommended to use this together with stream: true for a better interactive experience (see the sketch below)
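A minimal streaming sketch, assuming the gateway accepts the thinking field via the SDK's extra_body and streams the reasoning in a reasoning_content delta field (as the reasoning_content notes elsewhere in this page suggest):

from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

stream = client.chat.completions.create(
    model="deepseek-v3-1-250821",
    messages=[{"role": "user", "content": "Give a medium-difficulty geometry problem and solve it step by step"}],
    stream=True,
    extra_body={"thinking": {"type": "enabled"}}  # non-standard field, passed via extra_body
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content is an assumption; not all SDK versions type this field.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="")      # the model's reasoning process
    if delta.content:
        print(delta.content, end="")  # the final answer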
Tongyi Qianwen
Tongyi Qianwen supports a deep thinking feature, which requires streaming output:

curl -N -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "qwen3-omni-flash",
    "stream": true,
    "enable_thinking": true,
    "parameters": {
      "incremental_output": true
    },
    "messages": [
      {"role": "system", "content": "You are an excellent mathematician"},
      {"role": "user", "content": "What is the formula for Tower of Hanoi"}
    ]
  }'
Inlining the reasoning process into content: if the client does not display reasoning_content, you can set nebula_thinking_to_content: true to inline the reasoning into content:

curl -N -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "qwen3-omni-flash",
    "stream": true,
    "enable_thinking": true,
    "nebula_thinking_to_content": true,
    "parameters": {
      "incremental_output": true
    },
    "messages": [
      {"role": "user", "content": "What is the formula for Tower of Hanoi"}
    ]
  }'
Tongyi Qianwen’s deep thinking functionality must be used with stream: true. If enable_thinking: true is set but stream: false, the system will automatically disable deep thinking to avoid upstream errors.
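A Python sketch of the inlined-reasoning request above; enable_thinking, nebula_thinking_to_content, and parameters are non-standard fields, so they go through the SDK's extra_body:

from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

# nebula_thinking_to_content inlines the reasoning into content, so a plain
# content loop is enough even for clients that ignore reasoning_content.
stream = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[{"role": "user", "content": "What is the formula for Tower of Hanoi"}],
    stream=True,
    extra_body={
        "enable_thinking": True,
        "nebula_thinking_to_content": True,
        "parameters": {"incremental_output": True}
    }
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")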
Gemini
Refer to the Gemini thinking mode guide. Main ways to enable it:
- Model suffix: -thinking (auto budget); -thinking-<number> for a precise budget (e.g., gemini-2.5-flash-thinking-8192); -nothinking to disable; gemini-3-pro-preview-thinking-low/high to specify the level directly
- extra_body config: extra_body.google.thinking_config.thinking_budget plus include_thoughts; special values: -1 auto-enable, 0 disable, >0 specific budget; requires stream: true
- reasoning_effort: usable with -thinking when max_tokens is not set (low/medium/high ≈ 20%/50%/80% of the budget)
- Gemini 3 Pro Preview: uses thinking_level (LOW/HIGH, default HIGH); can be combined with search
- Enabling search: the recommended OpenAI-compatible tool is "tools":[{"type":"function","function":{"name":"googleSearch"}}]; alternatively, pass through extra_body.google.tools:[{"googleSearch":{}}]
- Notes: the thinking adapter must be enabled server-side; the thinking budget counts toward output tokens; use stream: true to view reasoning_content
Example (specific thinking budget):

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "gemini-3-flash-preview",
    "messages": [
      {"role":"user","content":"Give a medium-difficulty geometry problem and analyze it step by step."}
    ],
    "extra_body": {
      "google": {
        "thinking_config": { "thinking_budget": 6000, "include_thoughts": true }
      }
    },
    "stream": true
  }'
Example (3 Pro Preview thinking + search):

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "gemini-3-pro-preview",
    "messages": [
      {"role":"user","content":"Google search the weather in Guangzhou today"}
    ],
    "generationConfig": {
      "thinkingConfig": { "thinkingLevel": "LOW" }
    },
    "tools": [
      { "type": "function", "function": { "name": "googleSearch" } }
    ],
    "stream": true
  }'
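The budgeted request as a Python sketch; the extra_body wrapper here reproduces the raw JSON body of the curl example (the SDK merges the dict passed to extra_body into the request body):

from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

stream = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Give a medium-difficulty geometry problem and analyze it step by step."}],
    stream=True,
    extra_body={
        "extra_body": {  # matches the "extra_body" key in the curl body above
            "google": {
                "thinking_config": {"thinking_budget": 6000, "include_thoughts": True}
            }
        }
    }
)

for chunk in stream:
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)  # thoughts stream, if included
    if reasoning:
        print(reasoning, end="")
    if delta.content:
        print(delta.content, end="")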
Tongyi Qianwen Extended Features
Tongyi Qianwen models support extended features such as search, speech recognition, etc. All extended parameters need to be placed in the parameters object.
Search Feature
curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "qwen3-omni-flash",
    "messages": [
      {"role": "user", "content": "Please first search for recent common misconceptions about Fermat'\''s Last Theorem, then answer"}
    ],
    "stream": true,
    "enable_thinking": true,
    "parameters": {
      "enable_search": true,
      "search_options": {
        "region": "CN",
        "recency_days": 30
      },
      "incremental_output": true
    }
  }'
Speech Recognition

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "qwen3-omni-flash",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "parameters": {
      "asr_options": {
        "language": "zh"
      }
    }
  }'
All extended parameters for Tongyi Qianwen (such as enable_search, search_options, asr_options, temperature, top_p, etc.) need to be placed in the parameters object, not at the top level of the request body.
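A Python sketch of the search request; all Qwen extended parameters sit inside parameters, which is a non-standard field and therefore passed via the SDK's extra_body:

from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

stream = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[{"role": "user", "content": "Please first search for recent common misconceptions about Fermat's Last Theorem, then answer"}],
    stream=True,
    extra_body={
        "enable_thinking": True,
        "parameters": {  # Qwen extended parameters live here, not at the top level
            "enable_search": True,
            "search_options": {"region": "CN", "recency_days": 30},
            "incremental_output": True
        }
    }
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")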
Web Search Features
Some models support real-time web search, allowing access to the latest information and including citation sources in responses.
Claude Web Search
Claude models do not support enabling web search through the web_search_options parameter alone; it can only be implemented through tool calls, which may be unstable due to network conditions and prompt sensitivity. For details, see Tool Calling (Functions / Tools) above.

Basic example (showing the tool call flow):

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-opus-4-6",
    "messages": [
      {"role": "user", "content": "What are the latest news about artificial intelligence?"},
      {
        "role": "assistant",
        "content": "I'\''ll help you search for the latest news about artificial intelligence.",
        "tool_calls": [
          {
            "id": "toolu_xxx",
            "type": "function",
            "function": {
              "name": "WebSearch",
              "arguments": "{\"query\": \"artificial intelligence latest news 2025\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "toolu_xxx",
        "name": "WebSearch",
        "content": "Web search results for query: \"artificial intelligence latest news 2025\"..."
      }
    ],
    "web_search_options": {
      "search_context_size": "medium"
    }
  }'
Example with location information (showing the tool call flow):

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-opus-4-6",
    "messages": [
      {"role": "user", "content": "What'\''s the weather in Shanghai today?"},
      {
        "role": "assistant",
        "content": "I'\''ll help you search for today'\''s weather in Shanghai.",
        "tool_calls": [
          {
            "id": "toolu_xxx",
            "type": "function",
            "function": {
              "name": "WebSearch",
              "arguments": "{\"query\": \"Shanghai today weather\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "toolu_xxx",
        "name": "WebSearch",
        "content": "Web search results for query: \"Shanghai today weather\"..."
      }
    ],
    "web_search_options": {
      "search_context_size": "medium",
      "user_location": {
        "approximate": {
          "timezone": "Asia/Shanghai",
          "country": "CN",
          "region": "Shanghai",
          "city": "Shanghai"
        }
      }
    }
  }'
- Search functionality will increase response time and token consumption (including search result content)
- Search results will automatically include citation sources in the response
- Supported models include Claude Sonnet 4, Claude 3 Opus, etc.
- In multi-turn conversations, tool calls and results will be visible in message history, and the model can continue the conversation based on previous search results
Stability Notice:
- Web search functionality depends on upstream proxy services and external search services, and may have the following instabilities:
- Network fluctuations: Network connection issues may cause search requests to timeout or fail
- Service limitations: Search services may have rate limits, timeout limits, or temporary unavailability
- Search result quality: Some queries may not find relevant information, or search results may be of poor quality
- Model judgment: The model will automatically determine whether a search is needed based on the question, and in some cases may not trigger a search
- This is an inherent characteristic of web search functionality. It is recommended to:
- Implement retry mechanisms in critical scenarios
- Handle search failures with graceful degradation (e.g., falling back to the model's built-in knowledge; see the sketch after this list)
- Avoid relying entirely on web search in scenarios with extremely high real-time requirements
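A minimal retry-with-fallback sketch for these recommendations; the retry count and backoff are illustrative choices, and the request is shown with the Grok live-search parameters described in the next tab:

import time
from openai import OpenAI, APIConnectionError, APIError

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

def ask_with_search(question, retries=2):
    messages = [{"role": "user", "content": question}]
    for attempt in range(retries):
        try:
            # Search-enabled request; may fail or time out upstream.
            return client.chat.completions.create(
                model="grok-4",
                messages=messages,
                extra_body={"search_parameters": {"mode": "auto"}},
            )
        except (APIConnectionError, APIError):
            time.sleep(2 ** attempt)  # simple exponential backoff
    # Graceful degradation: answer from the model's built-in knowledge, no search.
    return client.chat.completions.create(model="grok-4", messages=messages)

print(ask_with_search("What are the latest developments in AI?").choices[0].message.content)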
Grok Live Search
Grok models support real-time search through the search_parameters parameter.

Search parameter configuration:
- mode (optional): Search mode. Options: "off" (disable search), "auto" (the model decides whether a search is needed; recommended), "on" (force search)
- return_citations (optional): Whether to return citation links in the response; defaults to true
Basic example:

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "grok-4",
    "messages": [
      {"role": "user", "content": "What are the latest developments in AI in 2026?"}
    ],
    "search_parameters": {
      "mode": "auto"
    }
  }'
Force search example:

curl -X POST "https://llm.ai-nebula.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "grok-4",
    "messages": [
      {"role": "user", "content": "What are the latest tech news?"}
    ],
    "search_parameters": {
      "mode": "on",
      "return_citations": true
    }
  }'
Python example:

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com/v1"
)

completion = client.chat.completions.create(
    model="grok-4",
    messages=[
        {"role": "user", "content": "What are the latest developments in AI in 2026?"}
    ],
    extra_body={
        "search_parameters": {
            "mode": "auto"
        }
    }
)
print(completion.choices[0].message.content)
- It is recommended to use "auto" mode and let the model decide whether a search is needed
- Search increases response time but provides access to the latest real-time information
- Supported models include grok-4, the grok-3 series, etc.
- Search results include citation sources in the response
File Input (PDF)
GPT-5 and other models support file input, which must be called through the /v1/responses endpoint rather than /v1/chat/completions.
You can pass a PDF file by linking to an external URL:

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com/v1"
)

response = client.responses.create(
    model="gpt-5.2",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Analyze this letter and summarize its key points"
                },
                {
                    "type": "input_file",
                    "file_url": "https://www.example.com/document.pdf"
                }
            ]
        }
    ]
)
print(response.output_text)
Or send the file as Base64-encoded input:

import base64
from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com/v1"
)

with open("document.pdf", "rb") as f:
    data = f.read()
base64_string = base64.b64encode(data).decode("utf-8")

response = client.responses.create(
    model="gpt-5.2",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "filename": "document.pdf",
                    "file_data": f"data:application/pdf;base64,{base64_string}"
                },
                {
                    "type": "input_text",
                    "text": "What is the main content of this document?"
                }
            ]
        }
    ]
)
print(response.output_text)
- File size limit: a single file must not exceed 50 MB, and the total size of all files in one request must not exceed 50 MB
- Supported models: gpt-4o, gpt-4o-mini, gpt-5-chat, and other models that support text and image input
- Reasoning models (o1, o3-mini, o4-mini) should also use the /v1/responses endpoint when they need reasoning capability
Grok Reasoning Capability
Grok models (especially grok-4-fast-reasoning) support reasoning capability. The usage in the response distinguishes between completion_tokens and reasoning_tokens:
{
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 500,
    "total_tokens": 600,
    "completion_tokens_details": {
      "reasoning_tokens": 300
    }
  }
}
Actual output text token count = completion_tokens - reasoning_tokens
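A small helper for reading these counters from an SDK response; completion_tokens_details may be absent on models without reasoning, so it is read defensively:

def visible_output_tokens(usage):
    # completion_tokens includes reasoning tokens; subtract them to get the
    # token count of the visible answer text.
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) if details else 0
    return usage.completion_tokens - (reasoning or 0)

# For the sample usage above: 500 - 300 = 200 visible output tokens.
# `completion` stands for any client.chat.completions.create(...) response.
print(visible_output_tokens(completion.usage))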
Non-Streaming Response
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "claude-opus-4-6",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Response content..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
Streaming Response
Streaming responses are returned in SSE (Server-Sent Events) format; each chunk contains partial content:

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"claude-opus-4-6","choices":[{"index":0,"delta":{"content":"Re"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"claude-opus-4-6","choices":[{"index":0,"delta":{"content":"ply"},"finish_reason":null}]}

data: [DONE]
The last chunk usually contains usage statistics.
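One way to collect the streamed text and the trailing usage chunk, assuming the gateway forwards OpenAI's stream_options (with include_usage, the final chunk carries usage and an empty choices list):

from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

stream = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Briefly introduce artificial intelligence"}],
    stream=True,
    stream_options={"include_usage": True},  # assumption: supported upstream
)

text, usage = [], None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        text.append(chunk.choices[0].delta.content)
    if chunk.usage:  # present only on the final chunk
        usage = chunk.usage

print("".join(text))
if usage:
    print(f"\n[tokens] prompt={usage.prompt_tokens} completion={usage.completion_tokens}")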
Error Handling
| Exception Type | Trigger Scenario | Return Message |
|---|---|---|
| AuthenticationError | Invalid or unauthorized API key | Error: Invalid or unauthorized API key |
| NotFoundError | Model does not exist or is not supported | Error: Model [model] does not exist or is not supported |
| APIConnectionError | Network interruption or server not responding | Error: Cannot connect to API server |
| APIError | Request format error or other server-side exceptions | API request failed: [error details] |
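These exception names match the openai Python SDK, so handling them looks like this (the printed messages mirror the table above):

from openai import OpenAI, AuthenticationError, NotFoundError, APIConnectionError, APIError

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

try:
    completion = client.chat.completions.create(
        model="claude-opus-4-6",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(completion.choices[0].message.content)
except AuthenticationError:
    print("Error: Invalid or unauthorized API key")
except NotFoundError:
    print("Error: Model does not exist or is not supported")
except APIConnectionError:
    print("Error: Cannot connect to API server")
except APIError as e:  # catch-all; must come after the specific subclasses
    print(f"API request failed: {e}")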
Supported Model Series
OpenAI Series
- GPT-4.1, GPT-4o, GPT-4o Mini, GPT-3.5-turbo
- Reasoning models: o3-mini, o4-mini (must use the /v1/responses endpoint)
Claude Series (Anthropic)
- Claude Sonnet 4, Claude 3 Opus, Claude 3 Haiku
DeepSeek Series
- deepseek-v3-1-250821, deepseek-v3, deepseek-r1
Grok Series (xAI)
- Grok-4, Grok-3, Grok-3-fast, Grok-4-fast-reasoning
Tongyi Qianwen Series (Qwen)
- qwen3-coder-plus, qwen3-omni-flash, etc.
Other Models
- Gemini series, GLM series, Kimi series, etc.
For the complete model list, please see the Model Information Page.
Notes
- In the messages list, the system role sets model behavior and the user role carries the user's questions
- Multi-turn conversations require appending the history (including assistant role responses); see the multi-turn sketch below
- The Python examples require the openai library: pip install openai
- Different models may support certain features to different degrees; check the specific model's documentation before use
- Streaming output improves time to first token and the interactive experience
- Tool calling requires proper timeout and retry mechanisms to avoid blocking model responses
- Tongyi Qianwen extended parameters must be placed in the parameters object
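A minimal multi-turn sketch, appending each assistant reply to the history before the next user turn:

from openai import OpenAI

client = OpenAI(api_key="sk-xxxxxxxxxx", base_url="https://llm.ai-nebula.com/v1")

messages = [{"role": "system", "content": "You are a helpful assistant"}]

for question in ["Briefly introduce artificial intelligence", "Give one concrete example"]:
    messages.append({"role": "user", "content": question})
    completion = client.chat.completions.create(
        model="claude-opus-4-6",
        messages=messages,
    )
    answer = completion.choices[0].message.content
    # Append the assistant reply so the next turn has the full context.
    messages.append({"role": "assistant", "content": answer})
    print(answer)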