Introduction
The /v1/messages endpoint exposes Claude's native Messages API and is suited to Anthropic-native clients such as Claude Code. The API follows Anthropic's specification and provides full access to Claude's model capabilities, including Extended Thinking, tool calling, and other advanced features.
If you're using an OpenAI-compatible client (such as the OpenAI SDK), we recommend the /v1/chat/completions endpoint instead.
Authentication
Bearer token authentication: pass your API key in the Authorization header, e.g. Authorization: Bearer sk-xxxxxxxxxx
Request Parameters
model: Claude model identifier. Supported models include:
- claude-opus-4-5-20251101 - Claude Opus 4.5 (latest, strongest reasoning)
- claude-haiku-4-5-20251001 - Claude Haiku 4.5 (latest, fastest)
- claude-sonnet-4-5-20250929 - Claude Sonnet 4.5 (latest, balanced)
- claude-opus-4-1-20250805 - Claude Opus 4.1
- claude-sonnet-4-20250514 - Claude Sonnet 4
- Other Claude series models
messages: List of conversation messages, each containing role (user/assistant) and content. content can be a string or an array of media content.
max_tokens: Maximum number of tokens to generate. Must be greater than 0.
system: System prompt; can be a string or an array of media content. Used to set the model's behavior and role.
temperature: Randomness control, 0-1. Higher values make responses more random. Recommended to set to 1.0 when using extended thinking.
top_p: Nucleus sampling parameter, 0-1; controls generation diversity. Recommended to set to 0 when using extended thinking.
top_k: Top-K sampling parameter, only supported by some models.
stream: Whether to enable streaming output; returns SSE-format data chunks. Recommended to enable when using extended thinking.
stop_sequences: List of stop sequences. Generation stops when the model produces any of these sequences.
tools: Tool definitions list; supports function tools and web search tools.
tool_choice: Tool selection strategy; controls how the model uses tools.
thinking: Extended thinking configuration; enables Claude's deep reasoning capability.
metadata: Request metadata for tracking and debugging.
mcp_servers: MCP (Model Context Protocol) server configuration.
context_management: Context management configuration; controls how conversation context is handled.
Prompt Caching
Prompt Caching allows you to cache frequently used context content, significantly reducing costs and improving response speed. It is enabled via the cache_control parameter in system and messages.
Cache Control Parameters
Cache control configuration; can be used on system array elements and on content array elements within messages.
type: Cache type
- "ephemeral": 5-minute cache (default, most cost-effective)
- "persistent": 1-hour cache (suitable for long-term stable context)
Caching Mechanism
- Cache Position: The last content block marked with cache_control will be cached
- Cache Threshold: Content needs at least 1024 tokens (Claude Sonnet 4.5) or 2048 tokens (Claude 3 Haiku)
- Cache Duration: ephemeral is valid for 5 minutes; persistent is valid for 1 hour
- Cost Savings: Cache reads are 90% cheaper than regular inputs
Use Cases
- Long Document Analysis: Cache large documents in system and ask multiple questions
- Codebase Understanding: Cache code context for multi-turn code analysis
- Knowledge Base Q&A: Cache knowledge base content for fast queries
- Multi-turn Conversations: Cache conversation history to maintain context coherence
Basic Examples
Non-streaming request:
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxx" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Please briefly introduce artificial intelligence"}
]
}'
Streaming request:
curl -N -X POST "https://llm.ai-nebula.com/v1/messages" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxx" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 1024,
"stream": true,
"messages": [
{"role": "user", "content": "Please briefly introduce artificial intelligence"}
]
}'
Python SDK Example:
from anthropic import Anthropic
client = Anthropic(
api_key="sk-xxxxxxxxxx",
base_url="https://llm.ai-nebula.com"
)
# Non-streaming
message = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
messages=[
{"role": "user", "content": "Please briefly introduce artificial intelligence"}
]
)
print(message.content[0].text)
# Streaming
with client.messages.stream(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
messages=[
{"role": "user", "content": "Please briefly introduce artificial intelligence"}
]
) as stream:
for text_block in stream.text_stream:
print(text_block, end="")
Response example:
{
"id": "msg_xxx",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Artificial intelligence is a branch of computer science that focuses on creating intelligent machines capable of performing tasks that typically require human intelligence..."
}
],
"model": "claude-sonnet-4-5-20250929",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 25,
"output_tokens": 100
}
}
Advanced Features
System Prompt
System prompts can be set as a string or an array of media content:
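For example, both of the following forms are accepted (the prompt text is illustrative); the array form also supports cache_control, as shown in the Prompt Caching section:
{
  "system": "You are a professional translation assistant"
}

{
  "system": [
    {
      "type": "text",
      "text": "You are a professional translation assistant",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}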
Extended Thinking
Claude supports extended thinking, allowing the model to perform deep reasoning. When enabled, the model will think internally before generating the final answer.
Basic Usage
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxx" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 4096,
"temperature": 1.0,
"top_p": 0,
"stream": true,
"messages": [
{"role": "user", "content": "Give a medium difficulty geometry problem and solve it step by step"}
]
}'
Python Example
from anthropic import Anthropic
client = Anthropic(
api_key="sk-xxxxxxxxxx",
base_url="https://llm.ai-nebula.com"
)
with client.messages.stream(
model="claude-sonnet-4-5-20250929",
max_tokens=4096,
thinking={
"type": "enabled",
"budget_tokens": 4096
},
temperature=1.0,
top_p=0,
messages=[
{"role": "user", "content": "Give a medium difficulty geometry problem and solve it step by step"}
]
) as stream:
for event in stream:
if event.type == "content_block_delta":
if hasattr(event.delta, "thinking"):
# Thinking process
print(f"[Thinking] {event.delta.thinking}", end="")
elif hasattr(event.delta, "text"):
# Final answer
print(event.delta.text, end="")
- budget_tokens must be greater than 1024
- When using extended thinking, it's recommended to set temperature: 1.0 and top_p: 0
- Streaming output (stream: true) must be enabled to see the thinking process
Tool Calling
Supports function tools and web search tools:
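For example, a function tool is defined by a name, a description, and a JSON Schema for its input; the get_weather tool below is a hypothetical illustration:
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather for a given city",
      "input_schema": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
      }
    }
  ]
}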
tool_choice controls how the model uses tools:
| Value | Description |
| --- | --- |
| {"type": "auto"} | Automatically decide whether to use tools (default) |
| {"type": "any"} | Must use at least one tool |
| {"type": "none"} | Don't use any tools |
| {"type": "tool", "name": "tool_name"} | Must use the specified tool |
Example:
{
"tool_choice": {
"type": "auto",
"disable_parallel_tool_use": false
}
}
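As the Notes section below points out, tool calling takes two rounds: the model first returns a tool_use block, and you send back a tool_result block with the tool's output. A minimal sketch of that flow with the Python SDK, reusing the hypothetical get_weather tool from above:
from anthropic import Anthropic

client = Anthropic(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com"
)

# Hypothetical function tool (same definition as the example above)
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    }
]

question = {"role": "user", "content": "What's the weather in Shanghai today?"}

# Round 1: the model responds with a tool_use block instead of a final answer
first = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=tools,
    messages=[question],
)
tool_use = next(block for block in first.content if block.type == "tool_use")

# Execute the tool yourself; the result below is a placeholder
tool_output = "Sunny, 22°C"

# Round 2: return the tool result so the model can produce the final answer
second = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=tools,
    messages=[
        question,
        {"role": "assistant", "content": first.content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": tool_output,
                }
            ],
        },
    ],
)
print(second.content[0].text)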
Image Input
Supports including images in messages:
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxx" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="
}
},
{
"type": "text",
"text": "What is in this image?"
}
]
}
]
}'
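The same request with the Python SDK, assuming a local PNG file (the file path is illustrative):
import base64
from anthropic import Anthropic

client = Anthropic(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com"
)

# Read and base64-encode a local image
with open("example.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
)
print(message.content[0].text)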
Prompt Caching
Caching frequently used context content can significantly reduce costs and improve response speed.
System Cache (5 minutes)
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxx" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": "You are a professional technical documentation analyst. Here is the complete AWS Lambda technical documentation:\n\nAWS Lambda is a serverless computing service...[large documentation content, at least 1024 tokens]",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": "What is Lambda's pricing model?"}
]
}'
First Request Response:
{
"usage": {
"input_tokens": 50,
"cache_creation_input_tokens": 1200,
"cache_read_input_tokens": 0,
"output_tokens": 150
}
}
Second Request within 5 minutes (different question, same system):
{
"usage": {
"input_tokens": 45,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 1200,
"output_tokens": 100
}
}
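As a rough illustration of the savings based on the 90% figure above: in the second request, the 1,200 cache-read tokens are billed at roughly one tenth of the regular input rate, so the effective input size is about 45 + 1200 × 0.1 = 165 token-equivalents instead of 1,245.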
Messages Cache (1 hour)
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-xxxxxxxxxx" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 1024,
"system": "You are a Python programming assistant",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Analyze this code:\n```python\n[large code snippet, at least 1024 tokens]\n```",
"cache_control": {"type": "persistent"}
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "The main functionality of this code is...[detailed analysis]",
"cache_control": {"type": "persistent"}
}
]
},
{
"role": "user",
"content": "How can I optimize the performance of this code?"
}
]
}'
Advantages of persistent cache:
- 1-hour cache duration, suitable for long sessions
- Ideal for code reviews, document analysis, etc.
- Faster subsequent requests after cache hit
Python SDK Example
from anthropic import Anthropic
client = Anthropic(
api_key="sk-xxxxxxxxxx",
base_url="https://llm.ai-nebula.com"
)
# First request: Create cache
message1 = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a professional document analyst...[long text content]",
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": "First question"}
]
)
print(f"Cache created: {message1.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {message1.usage.cache_read_input_tokens} tokens")
# Second request within 5 minutes: Use cache
message2 = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a professional document analyst...[same long text]",
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": "Second question"}
]
)
print(f"Cache created: {message2.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {message2.usage.cache_read_input_tokens} tokens")
Cache Key Points:
- Content must be ≥ 1024 tokens (Claude Sonnet 4.5) to trigger caching
- ephemeral cache is valid for 5 minutes
- persistent cache is valid for 1 hour
- Cache reads cost 90% less than regular inputs
- The last block with cache_control will be cached
- Cache is based on exact content match; any change invalidates the cache
Best Practices:
- Place unchanging long context (documents, codebases, etc.) in system with caching enabled
- Use persistent cache (1 hour) for long-term stable content
- Use ephemeral cache (5 minutes) for frequently changing content
- Cache conversation history in multi-turn dialogues
- Monitor cache_creation_input_tokens and cache_read_input_tokens to optimize costs
Non-streaming Response
{
"id": "msg_xxx",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Response content..."
}
],
"model": "claude-sonnet-4-5-20250929",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 25,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 0,
"output_tokens": 100
}
}
Usage fields when using cache:
- input_tokens: Non-cached input tokens for the current request
- cache_creation_input_tokens: Tokens written to the cache (non-zero only on the request that creates the cache)
- cache_read_input_tokens: Tokens read from the cache (non-zero on cache hits)
- output_tokens: Generated output tokens
Streaming Response
Streaming responses are returned in SSE (Server-Sent Events) format and contain the following event types:
- message_start: Message start
- content_block_start: Content block start
- content_block_delta: Content delta (contains text or thinking)
- content_block_stop: Content block end
- message_delta: Message delta (contains usage info)
- message_stop: Message end
event: message_start
data: {"type":"message_start","message":{"id":"msg_xxx","type":"message","role":"assistant","content":[],"model":"claude-sonnet-4-5-20250929","stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":25,"output_tokens":0}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Response"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" content"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":100}}
event: message_stop
data: {"type":"message_stop"}
When using extended thinking, content_block_delta may contain a thinking field:
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":"Let me think about this problem..."}}
Error Handling
The system processes upstream Claude API errors and returns standardized error response formats.
| Error Type | HTTP Status Code | Description |
| --- | --- | --- |
| invalid_request | 400 | Request parameter error (e.g., missing required fields) |
| authentication_error | 401 | Invalid or unauthorized API key |
| rate_limit_error | 429 | Request rate limit exceeded |
| upstream_error | 500 | Upstream service error |
| nebula_api_error | 500 | System internal error |
Error response example:
{
"error": {
"type": "invalid_request",
"message": "field messages is required"
}
}
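With the Python SDK, these errors surface as exceptions from the anthropic package; a minimal handling sketch:
from anthropic import Anthropic, APIStatusError, RateLimitError

client = Anthropic(
    api_key="sk-xxxxxxxxxx",
    base_url="https://llm.ai-nebula.com"
)

try:
    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(message.content[0].text)
except RateLimitError:
    # 429: back off and retry later
    print("Rate limit exceeded, retry later")
except APIStatusError as e:
    # Any other non-2xx response; the body follows the error format above
    print(f"API error {e.status_code}: {e}")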
Comparison with /v1/chat/completions
| Feature | /v1/messages | /v1/chat/completions |
| --- | --- | --- |
| Authentication | Authorization: Bearer | Authorization: Bearer |
| Response Format | Anthropic native format | OpenAI compatible format |
| Extended Thinking | Native thinking parameter | Via reasoning_effort or reasoning parameter |
| Tool Calling | Native tools and tool_choice | OpenAI compatible format |
| Suitable Clients | Anthropic SDK, Claude Code | OpenAI SDK, compatible clients |
- If you're using Claude Code or other Anthropic-native clients, we recommend the /v1/messages endpoint
- If you're using the OpenAI SDK or need OpenAI-format compatibility, we recommend the /v1/chat/completions endpoint
- The two endpoints offer essentially the same functionality; the main difference is the request/response format
Notes
- max_tokens is a required parameter and must be greater than 0
- The messages array cannot be empty
- When using extended thinking, budget_tokens must be greater than 1024
- Extended thinking requires streaming output to see the thinking process
- Tool calling requires multiple rounds of interaction: the first round returns the tool call request, the second round returns the tool execution result
- Image input requires base64 encoding
- Using streaming output can improve first-token response time and the interaction experience
- Tool calling should have proper timeout and retry mechanisms to avoid blocking model responses
- Extended thinking can significantly improve reasoning quality for complex problems