KoboldCpp
Local LLMs, zero API costs
OpenClaw + KoboldCpp Integration
KoboldCpp is one of the most popular local LLM runtimes — efficient, cross-platform, and compatible with virtually every GGUF model. When combined with OpenClaw, you get a fully local AI assistant: no API costs, no data leaving your machine, and offline capability.
This is the setup for people who are serious about privacy, want to experiment with local models, or simply don't want to pay per-token forever.
Why KoboldCpp + OpenClaw?
Zero API Costs: Once you download a model, inference is free. Run it 24/7 without paying per token.
Complete Privacy: Nothing leaves your machine. Your conversations, your files, your data — all local.
Offline Operation: Works without internet (for the AI itself — OpenClaw's other features that call external APIs are separate).
Model Freedom: Run Llama 3.3, Mistral, Qwen 2.5, DeepSeek, Phi-4, Gemma, or any GGUF-format model.
Hardware Flexibility: Works on CPU (slow but functional), NVIDIA GPU (CUDA), AMD GPU (ROCm), and Apple Silicon (Metal).
Recommended Hardware
| Hardware | Recommended Model Size | Performance |
|---|---|---|
| M1/M2/M3 Mac Mini | 7B–13B (Q4) | Excellent |
| M1/M2/M3 MacBook | 7B (Q4) | Good |
| NVIDIA RTX 3090/4090 | 13B–34B (Q4) | Excellent |
| NVIDIA RTX 3070/4070 | 7B–13B (Q4) | Good |
| CPU only (16GB RAM) | 7B (Q4) | Slow but works |
For OpenClaw as a daily assistant, 7B Q4 models on Apple Silicon are the sweet spot — fast responses, good quality, zero cost.
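As a rough sanity check on those recommendations, you can estimate memory needs from parameter count and quantization. A minimal sketch, assuming ~4.8 bits per weight for Q4_K_M and a ballpark KV-cache allowance (both figures are approximations, not exact format specs):

```python
def gguf_ram_estimate_gb(params_b: float, bits_per_weight: float = 4.8,
                         context: int = 8192, kv_gb_per_4k: float = 0.5) -> float:
    """Rough RAM needed: quantized weights + KV cache + runtime overhead.

    Q4_K_M averages roughly 4.8 bits per weight; the KV-cache figure
    (~0.5 GB per 4k context for a 7-8B model) is a ballpark, not exact.
    """
    weights_gb = params_b * bits_per_weight / 8     # billions of params -> GB
    kv_cache_gb = (context / 4096) * kv_gb_per_4k   # scales linearly with context
    return round(weights_gb + kv_cache_gb + 0.5, 1)  # +0.5 GB runtime overhead

# An 8B model at Q4_K_M with 8192 context:
print(gguf_ram_estimate_gb(8))  # ≈ 6.3 GB, fits in 8 GB unified memory
```

This is why 7B–8B Q4 models are the sweet spot on entry-level Apple Silicon: the whole working set stays within unified memory with room to spare.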
Recommended Models
For General Assistant Use
- Llama 3.3 70B Q4_K_M (if you have the GPU RAM — best quality)
- Llama 3.1 8B Q4_K_M (fast, great quality for size)
- Qwen 2.5 7B Q4_K_M (excellent multilingual)
- Mistral 7B Q4_K_M (fast, reliable)
For Coding Tasks
- DeepSeek Coder V2 Lite Q4_K_M
- Qwen 2.5 Coder 7B Q4_K_M
For Privacy-Sensitive Use
- Phi-4 Q4_K_M (Microsoft, compact but capable)
- Gemma 2 9B Q4_K_M (Google, good instruction following)
All available on Hugging Face in GGUF format.
Step-by-Step Setup
Step 1: Install KoboldCpp
macOS (Apple Silicon — recommended):
```bash
brew install koboldcpp
```
Or download the pre-built binary from github.com/LostRuins/koboldcpp.
Windows:
Download koboldcpp.exe from the releases page. Run directly — no installation needed.
Linux:
```bash
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1   # for NVIDIA GPU
# or: make LLAMA_METAL=1   # for Apple Silicon
```
Step 2: Download a Model
```bash
# Install Hugging Face CLI
pip install huggingface-hub

# Download Llama 3.1 8B Q4 (recommended starter)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models
```
Or browse huggingface.co/models?library=gguf and download manually.
Step 3: Start KoboldCpp
```bash
koboldcpp \
  --model ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 5001 \
  --contextsize 8192 \
  --gpulayers 99 \
  --host 127.0.0.1
```
Key flags:
- `--gpulayers 99`: Offload all layers to GPU (Metal on Apple Silicon, CUDA on NVIDIA)
- `--contextsize`: Context window. 8192 is a good default; go higher if your RAM allows
- `--host 127.0.0.1`: Keep it local only (security)
Step 4: Configure OpenClaw
KoboldCpp exposes an OpenAI-compatible API. Configure OpenClaw to use it:
```yaml
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
    apiKey: "koboldcpp"   # any string — KoboldCpp doesn't validate this
    model: koboldcpp      # or the model name KoboldCpp reports
    contextLength: 8192

# Set as default provider
defaultProvider: local
```
Step 5: Test
Ask OpenClaw anything. If KoboldCpp is running and configured correctly, responses will come from your local model.
```bash
openclaw chat "Hello, who are you?"
```
You should see a response from the local model (not Claude).
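You can also exercise the endpoint directly. Below is a minimal sketch using only Python's standard library against KoboldCpp's OpenAI-compatible chat-completions route; the port and model name match the flags used earlier, so adjust them if yours differ:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5001/v1"  # matches the --port 5001 flag

def build_chat_request(prompt: str, model: str = "koboldcpp") -> dict:
    """Standard OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def extract_reply(response_json: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response."""
    return response_json["choices"][0]["message"]["content"]

def ask(prompt: str) -> str:
    """Send one chat turn to the local KoboldCpp server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer koboldcpp",  # KoboldCpp accepts any token
        },
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With KoboldCpp running, `print(ask("Hello, who are you?"))` should return text generated by your local model.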
Performance Tuning
Apple Silicon (Recommended)
Apple Silicon has unified memory — even the base M1 Mac Mini with 8GB can run 7B models well:
```bash
koboldcpp \
  --model ~/models/llama-3.1-8b-q4_k_m.gguf \
  --port 5001 \
  --gpulayers 99 \
  --contextsize 8192 \
  --usemmap
```
NVIDIA GPU (CUDA)
```bash
koboldcpp \
  --model ~/models/llama-3.1-8b-q4_k_m.gguf \
  --port 5001 \
  --gpulayers 99 \
  --usecublas
```
CPU Only (Fallback)
```bash
koboldcpp \
  --model ~/models/phi-4-q4_k_m.gguf \
  --port 5001 \
  --threads 8
```
Expect 2–10 tokens/second on CPU depending on hardware.
Running KoboldCpp as a Service
To keep KoboldCpp running in the background on macOS:
Create ~/Library/LaunchAgents/koboldcpp.plist:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>koboldcpp</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/koboldcpp</string>
    <string>--model</string>
    <string>/Users/yourname/models/llama-3.1-8b-q4_k_m.gguf</string>
    <string>--port</string>
    <string>5001</string>
    <string>--gpulayers</string>
    <string>99</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```
Load it:
```bash
launchctl load ~/Library/LaunchAgents/koboldcpp.plist
```
Mixing Local + Cloud
You don't have to go all-local. OpenClaw supports routing different tasks to different providers:
```yaml
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
  claude:
    type: anthropic
    apiKey: "sk-ant-..."

# Use local for most things, Claude for complex tasks
routing:
  default: local
  complex: claude   # triggered by skill or manually
```
Privacy Notes
With KoboldCpp as your provider:
- No data sent externally for AI inference
- OpenClaw skills that call external APIs (email, calendar) still make those calls
- Your conversation history stays on your machine
- No usage logs sent to any AI company
Troubleshooting
Slow responses on CPU
- Use a smaller or more quantized model (Q3_K_S instead of Q4_K_M)
- Increase thread count: `--threads $(nproc)`
- Consider upgrading to a GPU or Apple Silicon machine
Out of memory errors
- Use a more aggressive quantization (Q3 instead of Q4)
- Reduce context size: `--contextsize 4096`
- Try a smaller model (7B instead of 13B)
OpenClaw can't connect to KoboldCpp
- Verify KoboldCpp is running: `curl http://127.0.0.1:5001/v1/models`
- Check that the port in your OpenClaw config matches the one KoboldCpp is listening on
- Ensure `--host 127.0.0.1` is set (or `0.0.0.0` if OpenClaw runs on a separate machine)
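That connectivity check can also be scripted. A minimal sketch using only the Python standard library; the base URL assumes the default `--port 5001`:

```python
import json
import urllib.error
import urllib.request

def kobold_is_up(base_url: str = "http://127.0.0.1:5001/v1",
                 timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            json.load(resp)  # body must be valid JSON
            return resp.status == 200
    except (urllib.error.URLError, ValueError, OSError):
        # Connection refused, timeout, or a non-JSON response
        return False

print("KoboldCpp reachable:", kobold_is_up())
```

Running this before starting OpenClaw gives a quick yes/no on whether the provider endpoint is actually reachable.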
Model produces poor quality output
- Try a higher quantization (Q4_K_M or Q5_K_M)
- Use a model fine-tuned for instruction following (look for -Instruct in the name)
- Increase context size if conversations feel truncated
Features
Zero API Costs
Download once, run forever. No per-token charges, no monthly API bills.
Complete Privacy
All inference happens locally. Your conversations never leave your machine.
Offline Operation
Works without internet. Perfect for air-gapped environments or traveling.
Any GGUF Model
Llama, Mistral, Qwen, DeepSeek, Phi, Gemma — run any quantized model.
GPU Acceleration
Full Metal support for Apple Silicon, CUDA for NVIDIA, ROCm for AMD.
OpenAI-Compatible API
KoboldCpp exposes an OpenAI-compatible endpoint — zero friction integration with OpenClaw.
Use Cases
Privacy-First Assistant
Handle sensitive work — legal, medical, financial — with an AI that never phones home.
Cost Elimination
Heavy users spending $50–$200/month on API costs can recoup hardware costs in months.
Offline AI
Traveling, on a plane, in a secure facility — your AI works without internet.
Model Experimentation
Test new models as they drop without committing to a provider. Swap models in seconds.
Local Coding Assistant
Run DeepSeek Coder or Qwen Coder locally for private code review and generation.
Setup Guide
Requirements
- ✓ Mac, Windows, or Linux machine
- ✓ At least 8GB RAM (16GB recommended for 7B models)
- ✓ GPU strongly recommended (Apple Silicon M1+ or NVIDIA RTX series)
- ✓ KoboldCpp installed
- ✓ A GGUF model downloaded
Install KoboldCpp
macOS: brew install koboldcpp. Windows: download koboldcpp.exe from GitHub releases. Linux: build from source with CUDA or Metal flags.
Download a GGUF model
Use huggingface-cli to download a model. Recommended: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf for a balance of quality and speed.
Start KoboldCpp
Run: `koboldcpp --model ~/models/your-model.gguf --port 5001 --gpulayers 99 --contextsize 8192`
Configure OpenClaw provider
Add a local provider to ~/.openclaw/config.yaml pointing to http://127.0.0.1:5001/v1 with type openai-compatible.
Test the connection
Run: `curl http://127.0.0.1:5001/v1/models` — it should return model info. Then ask OpenClaw something to confirm it routes to KoboldCpp.
(Optional) Run as background service
Set up a launchd plist (macOS) or systemd service (Linux) to keep KoboldCpp running at startup.
Configuration Example
```yaml
# KoboldCpp local provider
providers:
  local:
    type: openai-compatible
    baseUrl: http://127.0.0.1:5001/v1
    apiKey: "koboldcpp"
    model: koboldcpp
    contextLength: 8192

# Set as default (or use alongside cloud providers)
defaultProvider: local

# Optional: route complex tasks to Claude
# routing:
#   default: local
#   complex: claude
```
Limitations
- ⚠️ Requires capable hardware — old laptops will struggle
- ⚠️ Local models are generally less capable than Claude Opus or GPT-4
- ⚠️ The initial model download can be several GB
- ⚠️ CPU-only inference is slow (2–10 tok/s)
Frequently Asked Questions
How does KoboldCpp compare to Ollama for OpenClaw?
Both work well. KoboldCpp has more advanced quantization options and often better performance on Apple Silicon. Ollama is easier to install and manage. Either works with OpenClaw via the openai-compatible provider type.
Which model should I start with?
For most users on Apple Silicon: Llama 3.1 8B Q4_K_M or Qwen 2.5 7B Q4_K_M. These run well on M1/M2 Mac Mini with 8–16GB RAM and deliver good general-purpose quality.
Can I run local models AND Claude/GPT-4 in the same OpenClaw setup?
Yes. Configure multiple providers and route by default or per-task. Use local for routine queries, cloud providers for complex reasoning tasks.
Do I need a GPU?
No, but it helps significantly. Apple Silicon Macs use unified memory so even M1 with 8GB handles 7B models well. NVIDIA RTX GPUs work great too. CPU-only works but is slow (2–10 tokens/second).
Is my data really private?
For AI inference: yes, completely. KoboldCpp runs entirely locally. OpenClaw skills that access external services (email, calendar, web) still make those calls, but the AI processing of your data stays on your machine.
What's the difference between Q4_K_M and other quantizations?
Quantization reduces model size at the cost of some quality. Q4_K_M is the most popular balance — good quality, reasonable size. Q8_0 is higher quality but larger. Q3_K_S is smaller and faster but lower quality. Start with Q4_K_M.
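Bits-per-weight also gives a quick way to estimate download sizes before committing to a multi-GB pull. A rough sketch; the per-format averages below are approximations, not exact format specs:

```python
# Ballpark GGUF file sizes for a model at common quantizations.
# Bits-per-weight figures are approximate averages, not exact specs.
BITS_PER_WEIGHT = {"Q3_K_S": 3.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def approx_file_size_gb(params_billion: float, quant: str) -> float:
    """Weights only: billions of params x bits per weight / 8 -> GB."""
    return round(params_billion * BITS_PER_WEIGHT[quant] / 8, 1)

# Size ladder for an 8B-parameter model:
for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{approx_file_size_gb(8, q)} GB")
```

For an 8B model this puts Q4_K_M at roughly 4.8 GB versus about 8.5 GB for Q8_0, which is why Q4_K_M is the usual starting point.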