POST https://llm.ai-nebula.com/v1/messages
Create Message Request (Claude)
curl --request POST \
  --url https://llm.ai-nebula.com/v1/messages \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "messages": [
    {}
  ],
  "max_tokens": 123,
  "system": {},
  "temperature": 123,
  "top_p": 123,
  "top_k": 123,
  "stream": true,
  "stop_sequences": [
    {}
  ],
  "tools": [
    {}
  ],
  "tool_choice": {},
  "thinking": {},
  "metadata": {},
  "mcp_servers": [
    {}
  ],
  "context_management": {},
  "cache_control": {}
}
'

Introduction

This is Claude’s native Messages API, suitable for native Anthropic clients such as Claude Code. It follows Anthropic’s specification and provides full Claude model capabilities, including extended thinking, tool calling, and other advanced features.
If you’re using an OpenAI-compatible client (like OpenAI SDK), we recommend using the /v1/chat/completions endpoint instead.

Authentication

Authorization
string
required
Bearer Token, e.g., Bearer sk-xxxxxxxxxx

Request Parameters

model
string
required
Claude model identifier, supported models include:
  • claude-opus-4-5-20251101 - Claude Opus 4.5 (Latest, strongest reasoning)
  • claude-haiku-4-5-20251001 - Claude Haiku 4.5 (Latest, fastest)
  • claude-sonnet-4-5-20250929 - Claude Sonnet 4.5 (Latest, balanced)
  • claude-opus-4-1-20250805 - Claude Opus 4.1
  • claude-sonnet-4-20250514 - Claude Sonnet 4
  • Other Claude series models
messages
array
required
List of conversation messages, each containing role (user/assistant) and content. content can be a string or an array of media content.
max_tokens
number
required
Maximum number of tokens to generate. Must be greater than 0.
system
string|array
System prompt; can be a string or an array of media content. Used to set the model’s behavior and role.
temperature
number
default:"1.0"
Randomness control, 0-1. Higher values make responses more random. Recommended to set to 1.0 when using extended thinking.
top_p
number
default:"1.0"
Nucleus sampling parameter, 0-1, controls generation diversity. Recommended to set to 0 when using extended thinking.
top_k
number
Top-K sampling parameter, only supported by some models.
stream
boolean
default:"false"
Whether to enable streaming output, returns SSE format data chunks. Recommended to enable when using extended thinking.
stop_sequences
array
List of stop sequences. Generation stops when the model produces these sequences.
tools
array
Tool definitions list, supports function tools and web search tools.
tool_choice
object
Tool selection strategy, controls how the model uses tools.
thinking
object
Extended thinking configuration, enables Claude’s deep reasoning capability.
metadata
object
Request metadata for tracking and debugging.
mcp_servers
array
MCP (Model Context Protocol) server configuration.
context_management
object
Context management configuration, controls how conversation context is handled.
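
As a sketch, the required and optional parameters above can be assembled and validated client-side before sending. The `build_request` helper below is illustrative, not part of any SDK; it only enforces the constraints this page states (non-empty messages, positive max_tokens):

```python
import json

def build_request(model: str, messages: list, max_tokens: int, **optional) -> dict:
    """Assemble a /v1/messages request body, checking the required fields."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be greater than 0")
    if not messages:
        raise ValueError("messages array cannot be empty")
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    body.update(optional)  # e.g. system, temperature, top_p, stream, tools, ...
    return body

payload = build_request(
    "claude-sonnet-4-5-20250929",
    [{"role": "user", "content": "Hello"}],
    max_tokens=1024,
    temperature=1.0,
)
body_json = json.dumps(payload)  # this string is what curl sends with --data
```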

Prompt Caching

Prompt Caching allows you to cache frequently used context content, significantly reducing costs and improving response speed. Supports using the cache_control parameter in system and messages.

Cache Control Parameters

cache_control
object
Cache control configuration, can be used in system array elements and content array elements in messages.
  • type: Cache type
    • "ephemeral": 5-minute cache (default, most cost-effective)
    • "persistent": 1-hour cache (suitable for long-term stable context)

Caching Mechanism

  • Cache Position: The last content block marked with cache_control will be cached
  • Cache Threshold: Content needs at least 1024 tokens (Claude Sonnet 4.5) or 2048 tokens (Claude 3 Haiku)
  • Cache Duration:
    • ephemeral: Valid for 5 minutes
    • persistent: Valid for 1 hour
  • Cost Savings: Cache reads are 90% cheaper than regular inputs
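
Given the 90% discount on cache reads stated above, the savings from a cache hit can be estimated from the usage fields. This is a back-of-the-envelope sketch: the price is a placeholder (substitute your actual per-token rate), and it assumes cache writes bill at the normal input rate, which simplifies real pricing:

```python
def input_cost(usage: dict, price_per_token: float) -> float:
    """Estimate input cost: regular tokens and cache writes at the full rate
    (a simplifying assumption), cache reads at 10% of it."""
    regular = usage.get("input_tokens", 0) + usage.get("cache_creation_input_tokens", 0)
    cached = usage.get("cache_read_input_tokens", 0)
    return (regular + 0.1 * cached) * price_per_token

# Usage objects mirroring the first-request / second-request examples below:
first = {"input_tokens": 50, "cache_creation_input_tokens": 1200, "cache_read_input_tokens": 0}
second = {"input_tokens": 45, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 1200}
price = 3e-6  # placeholder: $3 per million input tokens
cost_first, cost_second = input_cost(first, price), input_cost(second, price)
```

On the second request, 1200 of the 1245 effective input tokens bill at one tenth of the normal rate, which is where the bulk of the savings comes from.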

Use Cases

  1. Long Document Analysis: Cache large documents in system, ask multiple questions
  2. Codebase Understanding: Cache code context for multi-turn code analysis
  3. Knowledge Base Q&A: Cache knowledge base content for fast queries
  4. Multi-turn Conversations: Cache conversation history to maintain context coherence

Basic Examples

curl -X POST "https://llm.ai-nebula.com/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Please briefly introduce artificial intelligence"}
    ]
  }'
{
  "id": "msg_xxx",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Artificial intelligence is a branch of computer science that focuses on creating intelligent machines capable of performing tasks that typically require human intelligence..."
    }
  ],
  "model": "claude-sonnet-4-5-20250929",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 25,
    "output_tokens": 100
  }
}

Advanced Features

System Prompt

System prompts can be set as a string or an array of media content:
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "system": "You are a helpful assistant that excels at answering questions.",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

Extended Thinking

Claude supports extended thinking, allowing the model to perform deep reasoning. When enabled, the model will think internally before generating the final answer.
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 4096,
    "temperature": 1.0,
    "top_p": 0,
    "stream": true,
    "thinking": {
      "type": "enabled",
      "budget_tokens": 2048
    },
    "messages": [
      {"role": "user", "content": "Give a medium difficulty geometry problem and solve it step by step"}
    ]
  }'
  • budget_tokens must be at least 1024
  • When using extended thinking, it’s recommended to set temperature: 1.0 and top_p: 0
  • Streaming output (stream: true) must be enabled to see the thinking process
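
When streaming is enabled, thinking content arrives as thinking_delta events, separate from the text_delta events that carry the final answer. A sketch of splitting the two follows; the sample events are synthetic, and real streams also carry message_start, content_block_start, and other event types that this sketch simply skips:

```python
def split_stream(events) -> tuple:
    """Collect thinking text and answer text from a stream of message events."""
    thinking, answer = [], []
    for ev in events:
        if ev.get("type") != "content_block_delta":
            continue  # ignore message_start, content_block_start, etc.
        delta = ev.get("delta", {})
        if delta.get("type") == "thinking_delta":
            thinking.append(delta.get("thinking", ""))
        elif delta.get("type") == "text_delta":
            answer.append(delta.get("text", ""))
    return "".join(thinking), "".join(answer)

# Synthetic events for illustration:
events = [
    {"type": "content_block_delta",
     "delta": {"type": "thinking_delta", "thinking": "Let me set up the triangle..."}},
    {"type": "content_block_delta",
     "delta": {"type": "text_delta", "text": "The area is 24."}},
]
thinking_text, answer_text = split_stream(events)
```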

Tool Calling

Supports function tools and web search tools:
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "get_weather",
        "description": "Get weather information for a city",
        "input_schema": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "City name"
            }
          },
          "required": ["city"]
        }
      }
    ],
    "tool_choice": {
      "type": "auto"
    },
    "messages": [
      {"role": "user", "content": "What is the weather in Shanghai?"}
    ]
  }'

tool_choice Parameter Details

tool_choice controls how the model uses tools:
  • {"type": "auto"}: Automatically decide whether to use tools (default)
  • {"type": "any"}: Must use at least one tool
  • {"type": "none"}: Don’t use any tools
  • {"type": "tool", "name": "tool_name"}: Must use the specified tool
Example:
{
  "tool_choice": {
    "type": "auto",
    "disable_parallel_tool_use": false
  }
}
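
Tool calling is a two-round exchange: the first response stops with a tool_use content block, and the client executes the tool and sends back a tool_result block in a follow-up user message. A minimal sketch of that second round (the `get_weather` implementation and the reply object are stand-ins for illustration):

```python
def run_tool_round(response: dict, tools: dict) -> list:
    """Build the follow-up user message carrying tool results.
    `tools` maps tool name -> callable; `response` is a /v1/messages reply
    whose stop_reason is "tool_use"."""
    results = []
    for block in response.get("content", []):
        if block.get("type") == "tool_use":
            output = tools[block["name"]](**block["input"])
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],  # must echo the id from the reply
                "content": str(output),
            })
    return [{"role": "user", "content": results}]

# Illustrative first-round reply and a stand-in tool implementation:
reply = {"stop_reason": "tool_use", "content": [
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
     "input": {"city": "Shanghai"}},
]}
followup = run_tool_round(reply, {"get_weather": lambda city: f"Sunny in {city}"})
```

The `followup` messages are appended to the original conversation and sent in a second request; the model then answers using the tool output.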

Multimodal Input (Images)

Supports including images in messages:
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "source": {
              "type": "base64",
              "media_type": "image/png",
              "data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="
            }
          },
          {
            "type": "text",
            "text": "What is in this image?"
          }
        ]
      }
    ]
  }'
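
The base64 payload in the example can be produced with the standard library. The `image_block` helper is illustrative, and the bytes below are placeholder data, not a valid image:

```python
import base64

def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw image bytes as a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }

# Placeholder bytes (a PNG signature), not a renderable image:
block = image_block(b"\x89PNG\r\n\x1a\n")
```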

Prompt Caching

Caching frequently used context content can significantly reduce costs and improve response speed.
curl -X POST "https://llm.ai-nebula.com/v1/messages" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxxxxxx" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a professional technical documentation analyst. Here is the complete AWS Lambda technical documentation:\n\nAWS Lambda is a serverless computing service...[large documentation content, at least 1024 tokens]",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "What is Lambda's pricing model?"}
    ]
  }'
First Request Response:
{
  "usage": {
    "input_tokens": 50,
    "cache_creation_input_tokens": 1200,
    "cache_read_input_tokens": 0,
    "output_tokens": 150
  }
}
Second Request within 5 minutes (different question, same system):
{
  "usage": {
    "input_tokens": 45,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1200,
    "output_tokens": 100
  }
}
Cache Key Points:
  • Content must be ≥ 1024 tokens (Claude Sonnet 4.5) to trigger caching
  • ephemeral cache is valid for 5 minutes
  • persistent cache is valid for 1 hour
  • Cache reads cost 90% less than regular inputs
  • The last block with cache_control will be cached
  • Cache is based on exact content match; any changes invalidate the cache
Best Practices:
  • Place unchanging long context (documents, codebases, etc.) in system with caching enabled
  • Use persistent cache (1 hour) for long-term stable content
  • Use ephemeral cache (5 minutes) for frequently changing content
  • Cache conversation history in multi-turn dialogues
  • Monitor cache_creation_input_tokens and cache_read_input_tokens to optimize costs
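
The last point can be automated: the two cache fields in usage tell you whether a request wrote to, read from, or missed the cache. A small illustrative classifier (the helper name is not from any SDK):

```python
def cache_summary(usage: dict) -> str:
    """Classify a response's cache behavior from its usage fields."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"    # billed at the discounted cache-read rate
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"  # cache entry created on this request
    return "miss"       # no caching involved

status = cache_summary({"input_tokens": 45, "cache_read_input_tokens": 1200})
```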

Response Format

{
  "id": "msg_xxx",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Response content..."
    }
  ],
  "model": "claude-sonnet-4-5-20250929",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 25,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0,
    "output_tokens": 100
  }
}
Usage fields when using cache:
  • input_tokens: Non-cached input tokens for the current request
  • cache_creation_input_tokens: Tokens written to the cache (nonzero only on requests that create a cache entry)
  • cache_read_input_tokens: Tokens read from the cache (nonzero on cache hits)
  • output_tokens: Generated output tokens

Error Handling

The system processes upstream Claude API errors and returns standardized error response formats.
  • invalid_request (HTTP 400): Request parameter error (e.g., missing required fields)
  • authentication_error (HTTP 401): Invalid or unauthorized API key
  • rate_limit_error (HTTP 429): Request rate limit exceeded
  • upstream_error (HTTP 500): Upstream service error
  • nebula_api_error (HTTP 500): System internal error
Error response example:
{
  "error": {
    "type": "invalid_request",
    "message": "field messages is required"
  }
}
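
A client-side sketch of handling these errors: parse the standardized body, then retry only transient failures. The retry policy shown is a common pattern, not something the API mandates:

```python
import json

def parse_error(body: str) -> tuple:
    """Extract (type, message) from a standardized error response body."""
    err = json.loads(body).get("error", {})
    return err.get("type", "unknown"), err.get("message", "")

def should_retry(status: int, error_type: str) -> bool:
    """Retry rate limits and server-side errors; surface client errors as-is."""
    return status == 429 or error_type == "rate_limit_error" or status >= 500

etype, msg = parse_error(
    '{"error": {"type": "invalid_request", "message": "field messages is required"}}'
)
```

In practice the retry would also back off (e.g. exponentially) before resending, so repeated 429s do not hammer the endpoint.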

Comparison with /v1/chat/completions

  • Authentication: both endpoints use Authorization: Bearer
  • Response format: Anthropic native format vs. OpenAI-compatible format
  • Extended thinking: native thinking parameter vs. reasoning_effort or reasoning parameter
  • Tool calling: native tools and tool_choice vs. OpenAI-compatible format
  • Suitable clients: Anthropic SDK and Claude Code vs. OpenAI SDK and compatible clients
  • If you’re using Claude Code or other Anthropic native clients, we recommend using the /v1/messages endpoint
  • If you’re using OpenAI SDK or need OpenAI format compatibility, we recommend using the /v1/chat/completions endpoint
  • Both endpoints offer essentially the same functionality; the main difference is the request/response format

Notes

  • max_tokens is a required parameter and must be greater than 0
  • messages array cannot be empty
  • When using extended thinking, budget_tokens must be at least 1024
  • Extended thinking requires streaming output to see the thinking process
  • Tool calling requires multiple rounds of interaction: first round returns tool call request, second round returns tool execution result
  • Image input requires base64 encoding
  • Using streaming output can improve first token response time and interaction experience
  • Tool calling should have proper timeout and retry mechanisms to avoid blocking model responses
  • Extended thinking can significantly improve reasoning quality for complex problems