
KoboldCpp

Difficulty: Advanced · ⏱️ 30–60 minutes

Local LLMs, zero API costs

OpenClaw + KoboldCpp Integration

KoboldCpp is one of the most popular local LLM runtimes — efficient, cross-platform, and compatible with virtually every GGUF model. When combined with OpenClaw, you get a fully local AI assistant: no API costs, no data leaving your machine, and offline capability.

This is the setup for people who are serious about privacy, want to experiment with local models, or simply don't want to pay per-token forever.

Why KoboldCpp + OpenClaw?

Zero API Costs: Once you download a model, inference is free. Run it 24/7 without paying per token.

Complete Privacy: Nothing leaves your machine. Your conversations, your files, your data — all local.

Offline Operation: Works without internet (for the AI itself — OpenClaw's other features that call external APIs are separate).

Model Freedom: Run Llama 3.3, Mistral, Qwen 2.5, DeepSeek, Phi-4, Gemma, or any GGUF-format model.

Hardware Flexibility: Works on CPU (slow but functional), NVIDIA GPU (CUDA), AMD GPU (ROCm), and Apple Silicon (Metal).

Recommended Hardware

| Hardware | Recommended Model Size | Performance |
|---|---|---|
| M1/M2/M3 Mac Mini | 7B–13B (Q4) | Excellent |
| M1/M2/M3 MacBook | 7B (Q4) | Good |
| NVIDIA RTX 3090/4090 | 13B–34B (Q4) | Excellent |
| NVIDIA RTX 3070/4070 | 7B–13B (Q4) | Good |
| CPU only (16GB RAM) | 7B (Q4) | Slow but works |

For OpenClaw as a daily assistant, 7B Q4 models on Apple Silicon are the sweet spot — fast responses, good quality, zero cost.

Recommended Models

For General Assistant Use

  • Llama 3.3 70B Q4_K_M (if you have the GPU RAM — best quality)
  • Llama 3.1 8B Q4_K_M (fast, great quality for size)
  • Qwen 2.5 7B Q4_K_M (excellent multilingual)
  • Mistral 7B Q4_K_M (fast, reliable)

For Coding Tasks

  • DeepSeek Coder V2 Lite Q4_K_M
  • Qwen 2.5 Coder 7B Q4_K_M

For Privacy-Sensitive Use

  • Phi-4 Q4_K_M (Microsoft, compact but capable)
  • Gemma 2 9B Q4_K_M (Google, good instruction following)

All available on Hugging Face in GGUF format.

Step-by-Step Setup

Step 1: Install KoboldCpp

macOS (Apple Silicon — recommended):

```bash
brew install koboldcpp
```

Or download the pre-built binary from github.com/LostRuins/koboldcpp.

Windows: Download koboldcpp.exe from the releases page. Run directly — no installation needed.

Linux:

```bash
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1   # for NVIDIA GPU
# or: make LLAMA_METAL=1   # for Apple Silicon
```

Step 2: Download a Model

```bash
# Install Hugging Face CLI
pip install huggingface-hub

# Download Llama 3.1 8B Q4 (recommended starter)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models
```

Or browse huggingface.co/models?library=gguf and download manually.

Step 3: Start KoboldCpp

```bash
koboldcpp \
  --model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 5001 \
  --contextsize 8192 \
  --gpulayers 99 \
  --host 127.0.0.1
```

Key flags:

  • --gpulayers 99: Offload all layers to GPU (Metal on Apple Silicon, CUDA on NVIDIA)
  • --contextsize: Context window. 8192 is a good default; go higher if your RAM allows
  • --host 127.0.0.1: Keep it local only (security)
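
To see why raising `--contextsize` costs RAM, it helps to estimate the KV cache, which grows linearly with context length. The sketch below uses architecture numbers for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) and assumes an fp16 cache (2 bytes per value); adjust the constants for your model.

```python
# Rough KV-cache sizing for --contextsize. Constants below are assumptions
# for Llama 3.1 8B with an fp16 cache -- swap in your model's numbers.

def kv_cache_bytes(ctx, layers=32, kv_heads=8, head_dim=128, bytes_per_val=2):
    """Bytes of KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * ctx

for ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"--contextsize {ctx:>5}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions it prints roughly 0.50, 1.00, and 2.00 GiB for 4k, 8k, and 16k context. That memory comes on top of the model weights, which is why doubling context can push a tight setup out of RAM.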

Step 4: Configure OpenClaw

KoboldCpp exposes an OpenAI-compatible API. Configure OpenClaw to use it:

```yaml
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
    apiKey: "koboldcpp"  # any string — KoboldCpp doesn't validate this
    model: koboldcpp     # or the model name KoboldCpp reports
    contextLength: 8192

# Set as default provider
defaultProvider: local
```

Step 5: Test

Ask OpenClaw anything. If KoboldCpp is running and configured correctly, responses will come from your local model.

```bash
openclaw chat "Hello, who are you?"
```

You should see a response from the local model (not Claude).
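
You can also exercise the endpoint directly, bypassing OpenClaw, which helps isolate problems. A minimal standard-library sketch (the model name and API key are arbitrary strings, as noted above; assumes KoboldCpp is running on port 5001):

```python
# Query KoboldCpp's OpenAI-compatible chat endpoint with only the stdlib.
import json
import urllib.request

def build_request(prompt, base_url="http://127.0.0.1:5001/v1"):
    """Build a chat-completion request in the OpenAI wire format."""
    payload = {
        "model": "koboldcpp",  # KoboldCpp accepts any model name here
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer koboldcpp"},  # any string works
    )

def ask(prompt):
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        body = json.loads(resp.read())
        return body["choices"][0]["message"]["content"]

# Uncomment with KoboldCpp running:
# print(ask("Hello, who are you?"))
```

If this returns a completion but OpenClaw does not, the problem is in the OpenClaw config rather than the server.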

Performance Tuning

Apple Silicon (Recommended)

Apple Silicon has unified memory — even the base M1 Mac Mini with 8GB can run 7B models well:

```bash
koboldcpp \
  --model ~/models/llama-3.1-8b-q4_k_m.gguf \
  --port 5001 \
  --gpulayers 99 \
  --contextsize 8192 \
  --usemmap
```

NVIDIA GPU (CUDA)

```bash
koboldcpp \
  --model ~/models/llama-3.1-8b-q4_k_m.gguf \
  --port 5001 \
  --gpulayers 99 \
  --usecublas
```

CPU Only (Fallback)

```bash
koboldcpp \
  --model ~/models/phi-4-q4_k_m.gguf \
  --port 5001 \
  --threads 8
```

Expect 2–10 tokens/second on CPU depending on hardware.
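
Rather than hardcoding `--threads 8`, you can derive a starting value from your machine. Using physical cores rather than hyperthreads is often recommended for llama.cpp-style inference; since `os.cpu_count()` reports logical cores, halving it is a reasonable first guess (a heuristic, not a KoboldCpp rule):

```python
# Suggest a starting --threads value: half the logical core count,
# approximating the physical core count on hyperthreaded CPUs.
import os

logical = os.cpu_count() or 1
suggested = max(1, logical // 2)
print(f"--threads {suggested}")
```

Benchmark a few values around the suggestion; past the physical core count, extra threads usually add contention rather than speed.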

Running KoboldCpp as a Service

To keep KoboldCpp running in the background on macOS:

Create ~/Library/LaunchAgents/koboldcpp.plist:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>koboldcpp</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/koboldcpp</string>
    <string>--model</string>
    <string>/Users/yourname/models/llama-3.1-8b-q4_k_m.gguf</string>
    <string>--port</string>
    <string>5001</string>
    <string>--gpulayers</string>
    <string>99</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Load it:

```bash
launchctl load ~/Library/LaunchAgents/koboldcpp.plist
```
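
On Linux, a user systemd unit serves the same purpose. A sketch, assuming the binary lives at `/usr/local/bin/koboldcpp` and the model is under your home directory (`%h` is systemd's home-directory specifier); adjust paths to your install:

```ini
# ~/.config/systemd/user/koboldcpp.service
[Unit]
Description=KoboldCpp local LLM server

[Service]
ExecStart=/usr/local/bin/koboldcpp --model %h/models/llama-3.1-8b-q4_k_m.gguf --port 5001 --gpulayers 99
Restart=always

[Install]
WantedBy=default.target
```

Enable it with `systemctl --user enable --now koboldcpp`.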

Mixing Local + Cloud

You don't have to go all-local. OpenClaw supports routing different tasks to different providers:

```yaml
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
  claude:
    type: anthropic
    apiKey: "sk-ant-..."

# Use local for most things, Claude for complex tasks
routing:
  default: local
  complex: claude  # triggered by skill or manually
```
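
The routing idea above boils down to a small lookup. This is not OpenClaw's internal implementation, just the decision logic that config expresses, written out as a plain function:

```python
# Illustrative sketch of local-vs-cloud routing: match a task's tags
# against the routing table, fall back to the default provider.
ROUTES = {"default": "local", "complex": "claude"}

def pick_provider(task_tags):
    """Return the provider name for a task, falling back to the default."""
    for tag in task_tags:
        if tag in ROUTES:
            return ROUTES[tag]
    return ROUTES["default"]

print(pick_provider(["complex"]))  # claude
print(pick_provider(["chat"]))     # local
```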

Privacy Notes

With KoboldCpp as your provider:

  • No data sent externally for AI inference
  • OpenClaw skills that call external APIs (email, calendar) still make those calls
  • Your conversation history stays on your machine
  • No usage logs sent to any AI company

Troubleshooting

Slow responses on CPU

  • Use a smaller or more quantized model (Q3_K_S instead of Q4_K_M)
  • Increase thread count: --threads $(nproc)
  • Consider upgrading to a GPU or Apple Silicon machine

Out of memory errors

  • Use a more aggressive quantization (Q3 instead of Q4)
  • Reduce context size: --contextsize 4096
  • Try a smaller model (7B instead of 13B)

OpenClaw can't connect to KoboldCpp

  • Verify KoboldCpp is running: curl http://127.0.0.1:5001/v1/models
  • Check the port in your OpenClaw config matches
  • Ensure --host 127.0.0.1 (or 0.0.0.0 if running on a separate machine)

Model produces poor quality output

  • Try a higher-precision quantization (Q4_K_M or Q5_K_M)
  • Use a model fine-tuned for instruction following (look for -Instruct in the name)
  • Increase context size if conversations feel truncated

Features

Zero API Costs

Download once, run forever. No per-token charges, no monthly API bills.

Complete Privacy

All inference happens locally. Your conversations never leave your machine.

Offline Operation

Works without internet. Perfect for air-gapped environments or traveling.

Any GGUF Model

Llama, Mistral, Qwen, DeepSeek, Phi, Gemma — run any quantized model.

GPU Acceleration

Full Metal support for Apple Silicon, CUDA for NVIDIA, ROCm for AMD.

OpenAI-Compatible API

KoboldCpp exposes an OpenAI-compatible endpoint — zero friction integration with OpenClaw.

Use Cases

Privacy-First Assistant

Handle sensitive work — legal, medical, financial — with an AI that never phones home.

Cost Elimination

Heavy users spending $50–$200/month on API costs can recoup hardware costs in months.

Offline AI

Traveling, on a plane, in a secure facility — your AI works without internet.

Model Experimentation

Test new models as they drop without committing to a provider. Swap models in seconds.

Local Coding Assistant

Run DeepSeek Coder or Qwen Coder locally for private code review and generation.

Setup Guide

Requirements

  • Mac, Windows, or Linux machine
  • At least 8GB RAM (16GB recommended for 7B models)
  • GPU strongly recommended (Apple Silicon M1+ or NVIDIA RTX series)
  • KoboldCpp installed
  • GGUF model downloaded
1. Install KoboldCpp
   macOS: brew install koboldcpp. Windows: download koboldcpp.exe from GitHub releases. Linux: build from source with CUDA or Metal flags.

2. Download a GGUF model
   Use huggingface-cli to download a model. Recommended: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf for a balance of quality and speed.

3. Start KoboldCpp
   Run: koboldcpp --model ~/models/your-model.gguf --port 5001 --gpulayers 99 --contextsize 8192

4. Configure OpenClaw provider
   Add a local provider to ~/.openclaw/config.yaml pointing to http://127.0.0.1:5001/v1 with type openai-compatible.

5. Test the connection
   Run: curl http://127.0.0.1:5001/v1/models — should return model info. Then ask OpenClaw something to confirm it routes to KoboldCpp.

6. (Optional) Run as background service
   Set up a launchd plist (macOS) or systemd service (Linux) to keep KoboldCpp running at startup.

Configuration Example

```yaml
# KoboldCpp local provider
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
    apiKey: "koboldcpp"
    model: koboldcpp
    contextLength: 8192

# Set as default (or use alongside cloud providers)
defaultProvider: local

# Optional: route complex tasks to Claude
# routing:
#   default: local
#   complex: claude
```

Limitations

  • ⚠️ Requires capable hardware — old laptops will struggle
  • ⚠️ Local models are generally less capable than Claude Opus or GPT-4
  • ⚠️ Initial model download can be several GB
  • ⚠️ CPU-only inference is slow (2–10 tok/s)

Frequently Asked Questions

How does KoboldCpp compare to Ollama for OpenClaw?

Both work well. KoboldCpp has more advanced quantization options and often better performance on Apple Silicon. Ollama is easier to install and manage. Either works with OpenClaw via the openai-compatible provider type.

Which model should I start with?

For most users on Apple Silicon: Llama 3.1 8B Q4_K_M or Qwen 2.5 7B Q4_K_M. These run well on M1/M2 Mac Mini with 8–16GB RAM and deliver good general-purpose quality.

Can I run local models AND Claude/GPT-4 in the same OpenClaw setup?

Yes. Configure multiple providers and route by default or per-task. Use local for routine queries, cloud providers for complex reasoning tasks.

Do I need a GPU?

No, but it helps significantly. Apple Silicon Macs use unified memory so even M1 with 8GB handles 7B models well. NVIDIA RTX GPUs work great too. CPU-only works but is slow (2–10 tokens/second).

Is my data really private?

For AI inference: yes, completely. KoboldCpp runs entirely locally. OpenClaw skills that access external services (email, calendar, web) still make those calls, but the AI processing of your data stays on your machine.

What's the difference between Q4_K_M and other quantizations?

Quantization reduces model size at the cost of some quality. Q4_K_M is the most popular balance — good quality, reasonable size. Q8_0 is higher quality but larger. Q3_K_S is smaller and faster but lower quality. Start with Q4_K_M.
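
A back-of-envelope way to see the size trade-off: file size is roughly parameters times bits-per-weight. The bits-per-weight figures below are approximations for typical llama.cpp quants (not exact for any specific model, and ignoring embedding and metadata overhead):

```python
# Approximate GGUF file sizes from parameter count and quantization.
# Bits-per-weight values are rough, assumed averages for common quants.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_S": 3.5}

def approx_size_gib(params_billions, quant):
    """Estimated file size in GiB, ignoring non-weight overhead."""
    bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return bits / 8 / 2**30

for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{approx_size_gib(8, quant):.1f} GiB")
```

For an 8B model this estimates roughly 4.5 GiB at Q4_K_M versus about 7.9 GiB at Q8_0, which matches the rule of thumb that Q4_K_M halves the footprint of near-lossless quantization.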

🔥 Your AI should run your business, not just answer questions.

We'll show you how. Free to join.

Join Vibe Combinator →
