KoboldCpp
Local LLMs, zero API costs
OpenClaw + KoboldCpp Integration
KoboldCpp is one of the most popular local LLM runtimes — efficient, cross-platform, and compatible with virtually every GGUF model. When combined with OpenClaw, you get a fully local AI assistant: no API costs, no data leaving your machine, and offline capability.
This is the setup for people who are serious about privacy, want to experiment with local models, or simply don't want to pay per-token forever.
Why KoboldCpp + OpenClaw?
Zero API Costs: Once you download a model, inference is free. Run it 24/7 without paying per token.
Complete Privacy: Nothing leaves your machine. Your conversations, your files, your data — all local.
Offline Operation: Works without internet (for the AI itself — OpenClaw's other features that call external APIs are separate).
Model Freedom: Run Llama 3.3, Mistral, Qwen 2.5, DeepSeek, Phi-4, Gemma, or any GGUF-format model.
Hardware Flexibility: Works on CPU (slow but functional), NVIDIA GPU (CUDA), AMD GPU (ROCm), and Apple Silicon (Metal).
Recommended Hardware
| Hardware | Recommended Model Size | Performance |
|---|---|---|
| M1/M2/M3 Mac Mini | 7B–13B (Q4) | Excellent |
| M1/M2/M3 MacBook | 7B (Q4) | Good |
| NVIDIA RTX 3090/4090 | 13B–34B (Q4) | Excellent |
| NVIDIA RTX 3070/4070 | 7B–13B (Q4) | Good |
| CPU only (16GB RAM) | 7B (Q4) | Slow but works |
For OpenClaw as a daily assistant, 7B Q4 models on Apple Silicon are the sweet spot — fast responses, good quality, zero cost.
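As a rough sanity check on those recommendations, you can estimate memory needs from parameter count and quantization. A minimal sketch, assuming ~4.8 bits per weight for Q4_K_M and a ballpark KV-cache allowance (both figures are approximations, not exact format specs):

```python
def gguf_ram_estimate_gb(params_b: float, bits_per_weight: float = 4.8,
                         context: int = 8192, kv_gb_per_4k: float = 0.5) -> float:
    """Rough RAM needed: quantized weights + KV cache + runtime overhead.

    Q4_K_M averages roughly 4.8 bits per weight; the KV-cache figure
    (~0.5 GB per 4k context for a 7-8B model) is a ballpark, not exact.
    """
    weights_gb = params_b * bits_per_weight / 8     # billions of params -> GB
    kv_cache_gb = (context / 4096) * kv_gb_per_4k   # scales linearly with context
    return round(weights_gb + kv_cache_gb + 0.5, 1)  # +0.5 GB runtime overhead

# An 8B model at Q4_K_M with 8192 context:
print(gguf_ram_estimate_gb(8))  # ≈ 6.3 GB, fits in 8 GB unified memory
```

This is why 7B–8B Q4 models are the sweet spot on entry-level Apple Silicon: the whole working set stays within unified memory with room to spare.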
Recommended Models
For General Assistant Use
- Llama 3.3 70B Q4_K_M (if you have the GPU RAM — best quality)
- Llama 3.1 8B Q4_K_M (fast, great quality for size)
- Qwen 2.5 7B Q4_K_M (excellent multilingual)
- Mistral 7B Q4_K_M (fast, reliable)
For Coding Tasks
- DeepSeek Coder V2 Lite Q4_K_M
- Qwen 2.5 Coder 7B Q4_K_M
For Privacy-Sensitive Use
- Phi-4 Q4_K_M (Microsoft, compact but capable)
- Gemma 2 9B Q4_K_M (Google, good instruction following)
All available on Hugging Face in GGUF format.
Step-by-Step Setup
Step 1: Install KoboldCpp
macOS (Apple Silicon — recommended):
```bash
brew install koboldcpp
```
Or download the pre-built binary from github.com/LostRuins/koboldcpp.
Windows:
Download koboldcpp.exe from the releases page. Run directly — no installation needed.
Linux:
```bash
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1   # for NVIDIA GPU
# or: make LLAMA_METAL=1   # for Apple Silicon
```
Step 2: Download a Model
```bash
# Install Hugging Face CLI
pip install huggingface-hub

# Download Llama 3.1 8B Q4 (recommended starter)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models
```
Or browse huggingface.co/models?library=gguf and download manually.
Step 3: Start KoboldCpp
```bash
koboldcpp \
  --model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 5001 \
  --contextsize 8192 \
  --gpulayers 99 \
  --host 127.0.0.1
```
Key flags:
- `--gpulayers 99`: Offload all layers to GPU (Metal on Apple Silicon, CUDA on NVIDIA)
- `--contextsize`: Context window. 8192 is a good default; go higher if your RAM allows
- `--host 127.0.0.1`: Keep it local only (security)
Step 4: Configure OpenClaw
KoboldCpp exposes an OpenAI-compatible API. Configure OpenClaw to use it:
```yaml
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
    apiKey: "koboldcpp"   # any string — KoboldCpp doesn't validate this
    model: koboldcpp      # or the model name KoboldCpp reports
    contextLength: 8192

# Set as default provider
defaultProvider: local
```
Step 5: Test
Ask OpenClaw anything. If KoboldCpp is running and configured correctly, responses will come from your local model.
```bash
openclaw chat "Hello, who are you?"
```
You should see a response from the local model (not Claude).
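You can also exercise the endpoint directly. Below is a minimal sketch using only Python's standard library against KoboldCpp's OpenAI-compatible chat-completions route; the port and model name match the flags used earlier, so adjust them if yours differ:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5001/v1"  # matches the --port 5001 flag

def build_chat_request(prompt: str, model: str = "koboldcpp") -> dict:
    """Standard OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def extract_reply(response_json: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response."""
    return response_json["choices"][0]["message"]["content"]

def ask(prompt: str) -> str:
    """Send one chat turn to the local KoboldCpp server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer koboldcpp",  # KoboldCpp accepts any token
        },
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With KoboldCpp running, `print(ask("Hello, who are you?"))` should return text generated by your local model.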
Performance Tuning
Apple Silicon (Recommended)
Apple Silicon has unified memory — even the base M1 Mac Mini with 8GB can run 7B models well:
```bash
koboldcpp \
  --model ~/models/llama-3.1-8b-q4_k_m.gguf \
  --port 5001 \
  --gpulayers 99 \
  --contextsize 8192 \
  --usemmap
```
NVIDIA GPU (CUDA)
```bash
koboldcpp \
  --model ~/models/llama-3.1-8b-q4_k_m.gguf \
  --port 5001 \
  --gpulayers 99 \
  --usecublas
```
CPU Only (Fallback)
```bash
koboldcpp \
  --model ~/models/phi-4-q4_k_m.gguf \
  --port 5001 \
  --threads 8
```
Expect 2–10 tokens/second on CPU depending on hardware.
Running KoboldCpp as a Service
To keep KoboldCpp running in the background on macOS:
Create ~/Library/LaunchAgents/koboldcpp.plist:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>koboldcpp</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/koboldcpp</string>
    <string>--model</string>
    <string>/Users/yourname/models/llama-3.1-8b-q4_k_m.gguf</string>
    <string>--port</string>
    <string>5001</string>
    <string>--gpulayers</string>
    <string>99</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```
Load it:
```bash
launchctl load ~/Library/LaunchAgents/koboldcpp.plist
```
Mixing Local + Cloud
You don't have to go all-local. OpenClaw supports routing different tasks to different providers:
```yaml
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
  claude:
    type: anthropic
    apiKey: "sk-ant-..."

# Use local for most things, Claude for complex tasks
routing:
  default: local
  complex: claude   # triggered by skill or manually
```
Privacy Notes
With KoboldCpp as your provider:
- No data sent externally for AI inference
- OpenClaw skills that call external APIs (email, calendar) still make those calls
- Your conversation history stays on your machine
- No usage logs sent to any AI company
Troubleshooting
Slow responses on CPU
- Use a smaller or more quantized model (Q3_K_S instead of Q4_K_M)
- Increase thread count: `--threads $(nproc)`
- Consider upgrading to a GPU or Apple Silicon machine
Out of memory errors
- Use a more aggressive quantization (Q3 instead of Q4)
- Reduce context size: `--contextsize 4096`
- Try a smaller model (7B instead of 13B)
OpenClaw can't connect to KoboldCpp
- Verify KoboldCpp is running: `curl http://127.0.0.1:5001/v1/models`
- Check that the port in your OpenClaw config matches the one KoboldCpp is listening on
- Ensure `--host 127.0.0.1` is set (or `0.0.0.0` if OpenClaw runs on a separate machine)
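That connectivity check can also be scripted. A minimal sketch using only the Python standard library; the base URL assumes the default `--port 5001`:

```python
import json
import urllib.error
import urllib.request

def kobold_is_up(base_url: str = "http://127.0.0.1:5001/v1",
                 timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            json.load(resp)  # body must be valid JSON
            return resp.status == 200
    except (urllib.error.URLError, ValueError, OSError):
        # Connection refused, timeout, or a non-JSON response
        return False

print("KoboldCpp reachable:", kobold_is_up())
```

Running this before starting OpenClaw gives a quick yes/no on whether the provider endpoint is actually reachable.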
Model produces poor quality output
- Try a higher quantization (Q4_K_M or Q5_K_M)
- Use a model fine-tuned for instruction following (look for -Instruct in the name)
- Increase context size if conversations feel truncated
Features
Zero API Costs
Download once, run forever. No per-token charges, no monthly API bills.
Complete Privacy
All inference happens locally. Your conversations never leave your machine.
Offline Operation
Works without internet. Perfect for air-gapped environments or traveling.
Any GGUF Model
Llama, Mistral, Qwen, DeepSeek, Phi, Gemma — run any quantized model.
GPU Acceleration
Full Metal support for Apple Silicon, CUDA for NVIDIA, ROCm for AMD.
OpenAI-Compatible API
KoboldCpp exposes an OpenAI-compatible endpoint — zero friction integration with OpenClaw.
Use Cases
Privacy-First Assistant
Handle sensitive work — legal, medical, financial — with an AI that never phones home.
Cost Elimination
Heavy users spending $50–$200/month on API costs can recoup hardware costs in months.
Offline AI
Traveling, on a plane, in a secure facility — your AI works without internet.
Model Experimentation
Test new models as they drop without committing to a provider. Swap models in seconds.
Local Coding Assistant
Run DeepSeek Coder or Qwen Coder locally for private code review and generation.
Setup Guide
Requirements
- ✓ Mac, Windows, or Linux machine
- ✓ At least 8GB RAM (16GB recommended for 7B models)
- ✓ GPU strongly recommended (Apple Silicon M1+ or NVIDIA RTX series)
- ✓ KoboldCpp installed
- ✓ A GGUF model downloaded
Install KoboldCpp
macOS: brew install koboldcpp. Windows: download koboldcpp.exe from GitHub releases. Linux: build from source with CUDA or Metal flags.
Download a GGUF model
Use huggingface-cli to download a model. Recommended: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf for a balance of quality and speed.
Start KoboldCpp
Run: `koboldcpp --model ~/models/your-model.gguf --port 5001 --gpulayers 99 --contextsize 8192`
Configure OpenClaw provider
Add a local provider to ~/.openclaw/config.yaml pointing to http://127.0.0.1:5001/v1 with type openai-compatible.
Test the connection
Run: `curl http://127.0.0.1:5001/v1/models` — it should return model info. Then ask OpenClaw something to confirm it routes to KoboldCpp.
(Optional) Run as background service
Set up a launchd plist (macOS) or systemd service (Linux) to keep KoboldCpp running at startup.
Configuration Example
```yaml
# KoboldCpp local provider
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
    apiKey: "koboldcpp"
    model: koboldcpp
    contextLength: 8192

# Set as default (or use alongside cloud providers)
defaultProvider: local

# Optional: route complex tasks to Claude
# routing:
#   default: local
#   complex: claude
```
Limitations
- ⚠️ Requires capable hardware — old laptops will struggle
- ⚠️ Local models are generally less capable than Claude Opus or GPT-4
- ⚠️ The initial model download can be several GB
- ⚠️ CPU-only inference is slow (2–10 tok/s)
Frequently Asked Questions
How does KoboldCpp compare to Ollama for OpenClaw?
Both work well. KoboldCpp has more advanced quantization options and often better performance on Apple Silicon. Ollama is easier to install and manage. Either works with OpenClaw via the openai-compatible provider type.
Which model should I start with?
For most users on Apple Silicon: Llama 3.1 8B Q4_K_M or Qwen 2.5 7B Q4_K_M. These run well on M1/M2 Mac Mini with 8–16GB RAM and deliver good general-purpose quality.
Can I run local models AND Claude/GPT-4 in the same OpenClaw setup?
Yes. Configure multiple providers and route by default or per-task. Use local for routine queries, cloud providers for complex reasoning tasks.
Do I need a GPU?
No, but it helps significantly. Apple Silicon Macs use unified memory so even M1 with 8GB handles 7B models well. NVIDIA RTX GPUs work great too. CPU-only works but is slow (2–10 tokens/second).
Is my data really private?
For AI inference: yes, completely. KoboldCpp runs entirely locally. OpenClaw skills that access external services (email, calendar, web) still make those calls, but the AI processing of your data stays on your machine.
What's the difference between Q4_K_M and other quantizations?
Quantization reduces model size at the cost of some quality. Q4_K_M is the most popular balance — good quality, reasonable size. Q8_0 is higher quality but larger. Q3_K_S is smaller and faster but lower quality. Start with Q4_K_M.
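Bits-per-weight also gives a quick way to estimate download sizes before committing to a multi-GB pull. A rough sketch; the per-format averages below are approximations, not exact format specs:

```python
# Ballpark GGUF file sizes for a model at common quantizations.
# Bits-per-weight figures are approximate averages, not exact specs.
BITS_PER_WEIGHT = {"Q3_K_S": 3.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def approx_file_size_gb(params_billion: float, quant: str) -> float:
    """Weights only: billions of params x bits per weight / 8 -> GB."""
    return round(params_billion * BITS_PER_WEIGHT[quant] / 8, 1)

# Size ladder for an 8B-parameter model:
for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{approx_file_size_gb(8, q)} GB")
```

For an 8B model this puts Q4_K_M at roughly 4.8 GB versus about 8.5 GB for Q8_0, which is why Q4_K_M is the usual starting point.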