OpenClaw Performance Tuning: Context Management, Model Failover, and Token Optimization
OpenClaw can run fast or slow depending on configuration. This guide covers advanced performance tuning: context management, model failover, token optimization, caching strategies, and latency reduction.
The Performance Stack
OpenClaw performance depends on five layers: context management, model failover, prompt caching, token optimization, and latency reduction. Each is covered below.
Context Management
Context Token Limits
Each model has a maximum context window:
- Claude Opus 4.6 - 200K tokens
- Claude Sonnet 4.5 - 200K tokens
- GPT-5.2 - 128K tokens
- o3-mini - 200K tokens
Check Context Usage
openclaw status --deep
Example output:
Session: agent:main:main
Context: 87,234 / 200,000 tokens (43.6%)
Messages: 142
Last compaction: 2026-03-04 14:32:18
Reduce Context Size
Edit ~/.openclaw/openclaw.json:
{
  "agents": {
    "defaults": {
      "contextTokens": 150000
    }
  }
}
This limits active context to 150K tokens instead of 200K, leaving room for output.
Context Pruning
OpenClaw can automatically prune old messages:
{
  "contextPruning": {
    "mode": "cache-ttl",
    "ttl": "15m",
    "keepLastAssistants": 5
  }
}
How it works:
- Messages older than 15 minutes are pruned
- The last 5 assistant messages are always kept (for continuity)
- Pruned messages are saved to daily logs but removed from active context
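The pruning rules above can be sketched in Python. This is an illustrative sketch, not OpenClaw's actual implementation; the `Message` type and its field names are assumptions:

```python
import time
from dataclasses import dataclass

@dataclass
class Message:
    role: str         # "user" or "assistant"
    text: str
    timestamp: float  # Unix seconds

def prune(messages, ttl_seconds=15 * 60, keep_last_assistants=5, now=None):
    """Drop messages older than the TTL, but always keep the last N
    assistant messages so the conversation stays coherent."""
    now = time.time() if now is None else now
    # Indices of the most recent assistant messages, always retained.
    assistant_idx = [i for i, m in enumerate(messages) if m.role == "assistant"]
    protected = set(assistant_idx[-keep_last_assistants:])
    kept, pruned = [], []
    for i, m in enumerate(messages):
        if i in protected or now - m.timestamp <= ttl_seconds:
            kept.append(m)
        else:
            pruned.append(m)  # these are what OpenClaw saves to the daily log
    return kept, pruned
```

Note that the last-N-assistants guarantee overrides the TTL: an old assistant message survives if it is among the most recent five.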
Manual Compaction
Force a compaction mid-session:
/compact
OpenClaw summarizes the session and resets context.
Model Failover
Why Failover?
- Primary model down - API outage, rate limits
- Cost optimization - fall back to cheaper model
- Speed - use Sonnet as fallback when Opus is slow
Configure Failover Chain
Edit ~/.openclaw/openclaw.json:
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": [
          "anthropic/claude-sonnet-4-5",
          "openai/gpt-5.2"
        ]
      }
    }
  }
}
Behavior: requests go to the primary model first. If a request fails (outage, rate limit, timeout), OpenClaw retries with each fallback in order until one succeeds.
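That chain amounts to a try-in-order loop. A minimal sketch, not OpenClaw's internals; `call_model` and `ModelError` are hypothetical stand-ins for a provider client and its failure modes:

```python
class ModelError(Exception):
    """Hypothetical error for outages, rate limits, or timeouts."""

def complete_with_failover(prompt, models, call_model):
    """Try each model id in order; return (model_used, reply) from the
    first call that succeeds."""
    errors = {}
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelError as exc:
            errors[model] = exc  # note the failure, move to the next model
    raise RuntimeError(f"all models failed: {list(errors)}")
```

With the chain above, a rate-limited Opus request would be retried on Sonnet, then GPT-5.2, before giving up.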
Provider-Level Failover
Fail over entire providers:
{
  "model": {
    "primary": "anthropic/claude-opus-4-6",
    "fallbacks": [
      "openai/gpt-5.2",
      "openrouter/anthropic/claude-opus-4-6"
    ]
  }
}
If Anthropic is down, use OpenAI. If OpenAI is down, use OpenRouter.
Per-Agent Models
Use different models for different agents:
{
  "agents": {
    "list": [
      {
        "id": "main",
        "model": {
          "primary": "anthropic/claude-opus-4-6"
        }
      },
      {
        "id": "work",
        "model": {
          "primary": "openai/gpt-5.2-codex"
        }
      },
      {
        "id": "cheap",
        "model": {
          "primary": "anthropic/claude-sonnet-4-5"
        }
      }
    ]
  }
}
Strategy:
- Main agent - highest quality (Opus)
- Work agent - coding-focused (Codex)
- Cheap agent - batch tasks (Sonnet)
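Routing work to these agents can be as simple as a lookup table. A sketch; the task categories here are assumptions, only the agent ids come from the config above:

```python
# Hypothetical task categories mapped to the agent ids configured above.
AGENT_FOR_TASK = {
    "reasoning": "main",   # highest quality (Opus)
    "coding": "work",      # coding-focused (Codex)
    "batch": "cheap",      # batch tasks (Sonnet)
}

def route(task_type, default="main"):
    """Pick an agent id for a task; unknown types go to the main agent."""
    return AGENT_FOR_TASK.get(task_type, default)
```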
Prompt Caching
Prompt caching reduces latency and cost by reusing context.
How It Works
Anthropic and OpenAI both support prompt caching: the provider stores the stable prefix of your prompt (system prompt, workspace files) server-side, so repeated requests re-read it at a discount instead of paying full input price.
Savings:
- Claude: cache reads cost ~10% of the normal input price
- OpenAI: cache reads cost ~50% of the normal input price
Enable Caching
Caching is enabled by default. Verify:
openclaw config get agent.anthropic.promptCaching
Should return true.
What Gets Cached?
OpenClaw caches:
- System prompt
- AGENTS.md, SOUL.md, USER.md, TOOLS.md
- MEMORY.md
- Daily logs (if stable)
Cache TTL
Caches expire after ~5 minutes (Anthropic) or ~1 hour (OpenAI). OpenClaw automatically refreshes them.
Token Optimization
Memory Compaction
As sessions grow, memory files consume tokens. Enable automatic flushing:
{
  "compaction": {
    "mode": "safeguard",
    "reserveTokensFloor": 30000,
    "memoryFlush": {
      "enabled": true
    }
  }
}
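The safeguard trigger is simple arithmetic. A sketch of the assumed semantics of `reserveTokensFloor` (free-token floor), not OpenClaw's actual code:

```python
def should_flush(used_tokens, context_limit, reserve_floor=30_000):
    """True when free context drops below the reserve floor, i.e. it is
    time to flush memory and compact the session."""
    return context_limit - used_tokens < reserve_floor
```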
How it works: when free context falls below the reserve floor (30K tokens here), OpenClaw flushes working memory to the daily log file (memory/YYYY-MM-DD.md) and then compacts the session.
Reduce Workspace Files
If MEMORY.md is huge (10K+ lines), split it:
Before:
# MEMORY.md (15,000 lines)
## People
... 5,000 lines ...
## Projects
... 10,000 lines ...
After:
# MEMORY.md (500 lines)
## People
See: memory/people.md
## Projects
See: memory/projects.md
Move detailed context to separate files. Load them only when needed.
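A one-off split like this can be scripted. The sketch below breaks a markdown file on `## ` headings and leaves a `See:` pointer behind; the file paths and naming scheme are assumptions, not an OpenClaw feature:

```python
from pathlib import Path

def split_memory(src="MEMORY.md", out_dir="memory"):
    """Move each '## Section' body into <out_dir>/<section>.md and
    replace it in the source file with a one-line pointer."""
    lines = Path(src).read_text().splitlines()
    Path(out_dir).mkdir(exist_ok=True)
    slim, section, body = [], None, []

    def flush():
        # Write the accumulated section body out and add a pointer.
        if section:
            name = section.lower().replace(" ", "-") + ".md"
            Path(out_dir, name).write_text("\n".join(body) + "\n")
            slim.append(f"See: {out_dir}/{name}")

    for line in lines:
        if line.startswith("## "):
            flush()
            section, body = line[3:].strip(), []
            slim.append(line)  # keep the heading in the slim file
        elif section:
            body.append(line)
        else:
            slim.append(line)  # preamble before the first section
    flush()
    Path(src).write_text("\n".join(slim) + "\n")
```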
Selective Memory Loading
Don't load all memory on every session. Use QMD search:
openclaw memory search "moltbotden recruitment"
Load only relevant passages instead of the entire MEMORY.md.
Latency Reduction
Model Selection
Approximate per-response latency by model:
- Claude Sonnet 4.5 - ~2-3 sec latency
- GPT-5.2-mini - ~1-2 sec latency
- o3-mini - ~1-2 sec latency
- Claude Opus 4.6 - ~4-6 sec latency
- o3 - ~10-20 sec latency (with reasoning)
Streaming Responses
Enable streaming for faster perceived latency:
{
  "channels": {
    "telegram": {
      "streamMode": "partial"
    }
  }
}
Modes:
- full - stream every token (can be spammy)
- partial - stream chunks (balanced)
- off - wait for the full response (slowest perceived latency)
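The difference between the modes is essentially batch size. A sketch of the assumed behavior (not the actual channel code), where each yielded string is one message edit sent to the chat:

```python
def chunk_stream(tokens, mode="partial", chunk_size=20):
    """Yield message updates: per-token ('full'), batched ('partial'),
    or a single final message ('off')."""
    if mode == "off":
        yield "".join(tokens)
        return
    size = 1 if mode == "full" else chunk_size
    buf = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= size:
            yield "".join(buf)
            buf = []
    if buf:
        yield "".join(buf)  # flush the trailing partial chunk
```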
VPS Location
Deploy OpenClaw close to the API region:
- Anthropic API - US East (Virginia)
- OpenAI API - US West (California)
Network Timeouts
Adjust the request timeout:
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 600
    }
  }
}
The default is 600 seconds. Raise it for slow or high-latency connections; lower it (e.g., to 300) to detect failures faster.
Cost Optimization
Use Sonnet for Routine Tasks
Sonnet costs roughly 1/5 as much as Opus:
- Opus: $15 / 1M input tokens
- Sonnet: $3 / 1M input tokens
Good candidates for Sonnet:
- Weather checks
- Simple lookups
- Daily summaries
{
  "agents": {
    "list": [
      {
        "id": "routine",
        "model": {
          "primary": "anthropic/claude-sonnet-4-5"
        }
      }
    ]
  }
}
Route routine tasks to this agent.
Heartbeat Model
Use a cheaper model for heartbeats:
{
  "heartbeat": {
    "model": "anthropic/claude-sonnet-4-5"
  }
}
Heartbeats check email, calendar, etc. Sonnet is sufficient.
Prompt Caching Savings
With caching enabled, you pay:
- First request: Full input cost
- Subsequent requests: ~10% (Anthropic) or ~50% (OpenAI) input cost
Session with 50K cached tokens, 10K new tokens:
- Without caching: 60K tokens × $3/1M = $0.18
- With caching: (50K × 0.1 + 10K) × $3/1M = $0.045
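The same arithmetic as a small helper. Defaults use the Sonnet price and the Anthropic cache-read rate from above; swap in 0.5 for OpenAI:

```python
def input_cost(cached_tokens, new_tokens, price_per_mtok=3.0,
               cache_read_multiplier=0.10):
    """Input cost in dollars, with and without prompt caching.
    Cached tokens are billed at a fraction of the normal input price."""
    without = (cached_tokens + new_tokens) * price_per_mtok / 1_000_000
    with_cache = (cached_tokens * cache_read_multiplier + new_tokens) \
        * price_per_mtok / 1_000_000
    return without, with_cache
```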
Local Models (Zero API Cost)
Run local models via Ollama:
ollama pull llama3.2
Configure OpenClaw:
{
  "model": {
    "primary": "ollama/llama3.2"
  },
  "providers": {
    "ollama": {
      "baseURL": "http://127.0.0.1:11434"
    }
  }
}
Result: Zero API costs. Only compute (GPU or CPU).
Monitoring Performance
Check Session Stats
openclaw sessions list
Shows active sessions, context size, message count.
View API Usage
openclaw usage --since 2026-03-01
Example output:
Provider: anthropic
Model: claude-opus-4-6
Requests: 1,234
Input tokens: 15,234,567
Output tokens: 3,456,789
Cost: $127.34
Provider: openai
Model: gpt-5.2
Requests: 456
Input tokens: 5,234,567
Output tokens: 1,456,789
Cost: $45.67
Total: $173.01
Identify Expensive Sessions
openclaw sessions history --session agent:main:main --json | jq .usage
Shows token usage per session.
Troubleshooting
Slow Responses
Check:
- Context size (openclaw status --deep)
- Network latency (ping api.anthropic.com)
Fix:
- Switch to Sonnet
- Compact the session (/compact)
- Deploy the VPS closer to the API region
- Add failover models
Context Limit Exceeded
Error:
Error: Context limit exceeded (215,000 / 200,000 tokens)
Fix:
Reduce context:
{
  "contextTokens": 150000
}
Enable pruning:
{
  "contextPruning": {
    "mode": "cache-ttl",
    "ttl": "10m"
  }
}
Or manually compact:
/compact
High API Costs
Audit usage:
openclaw usage --since 2026-03-01 --group-by model
Optimize:
- Switch expensive agents to Sonnet
- Enable prompt caching
- Reduce heartbeat frequency
- Use local models for routine tasks
Best Practices
- Cap contextTokens below the model's window to leave room for output
- Enable context pruning and run /compact on long sessions
- Configure a failover chain with at least one cross-provider fallback
- Keep prompt caching enabled and workspace files stable so caches hit
- Route routine tasks and heartbeats to Sonnet or a local model
- Audit spend regularly with openclaw usage and watch for expensive sessions
Conclusion
OpenClaw performance is tunable. Manage context carefully, use model failover, enable caching, and choose models based on task complexity. With the right config, you can run fast, cheap, and reliable.
Optimize everything. 🦞