
How to Run AI Locally: Complete Step-by-Step Guide

2026-02-01 · 14 min read

Want to run AI on your own computer — no cloud, no subscriptions, no data leaving your machine? This tutorial walks you through everything from Ollama to full assistant setup.

Why Run AI Locally?

Running AI locally means:

  • Complete privacy — Your conversations never leave your machine
  • Zero API costs — No per-token charges after initial setup
  • Offline access — Works without internet
  • Full control — Your AI, your rules

The trade-off: Local models are good, but not quite at Claude/GPT-4 level. For many tasks, that's fine. For others, you might want a hybrid approach.

Hardware Requirements

Minimum (7B Models)

  • 16GB RAM
  • Modern CPU (2020+)
  • 10GB free disk space

This runs Llama 3 8B, Mistral 7B, and similar models. Responses take 5-15 seconds.

Recommended (13-34B Models)

  • 32GB RAM
  • Dedicated GPU (8GB+ VRAM)
  • 50GB free disk space

This runs larger models smoothly with 1-5 second responses.

Power Setup (70B+ Models)

  • 64GB+ RAM
  • High-end GPU (24GB+ VRAM) or multiple GPUs
  • 100GB+ free disk space

This runs the largest open models at reasonable speeds.

Apple Silicon Note

Apple Silicon Macs (M1/M2/M3) with 16GB+ of unified memory run local models very well: because the CPU and GPU share a single memory pool, models aren't constrained by a separate VRAM budget.
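
If you're not sure which tier your machine falls into, a rough back-of-the-envelope estimate helps. This is an illustrative sketch, not an exact formula: it assumes roughly 4-bit quantization (about 0.5 bytes per weight) and ignores the context window, which adds a few more GB on top.

# weights ≈ parameters × bytes per weight
# 8B model at ~0.5 bytes/weight ≈ 4 GB; 70B model ≈ 35 GB
awk 'BEGIN { printf "8B at Q4: ~%.0f GB\n70B at Q4: ~%.0f GB\n", 8*0.5, 70*0.5 }'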

Step 1: Install Ollama

Ollama is the standard runtime for local AI. It handles model downloads, optimization, and serving.

Mac

brew install ollama

Or download from ollama.com

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com

Verify Installation

ollama --version

You should see a version number.

Step 2: Download Your First Model

ollama pull llama3:8b

This downloads Llama 3 8B (~4GB). First download takes a few minutes.

Other Recommended Models

# Fast and capable general-purpose
ollama pull mistral

# Great for coding
ollama pull deepseek-coder

# Largest common model (needs 64GB+ RAM)
ollama pull llama3:70b
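
To confirm what's downloaded and how much disk each model uses:

ollama list

This prints each model's name, size on disk, and when it was last updated.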

Step 3: Test the Model

ollama run llama3:8b

You'll get an interactive prompt. Try:

>>> What is the capital of France?

If you get a reasonable response, local AI is working.
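
Ollama also serves a local HTTP API on port 11434, which is what OpenClaw will talk to in the next step. You can test it non-interactively with curl (this assumes the default port and the llama3:8b model pulled above):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "What is the capital of France?",
  "stream": false
}'

With "stream": false the answer comes back as a single JSON object instead of a token-by-token stream.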

Step 4: Connect to OpenClaw

Now let's turn this into a full AI assistant.

Install OpenClaw

npm install -g openclaw

Configure for Local Model

Create or edit ~/.openclaw/openclaw.json:

{
  "model": {
    "provider": "ollama",
    "model": "llama3:8b",
    "baseUrl": "http://localhost:11434"
  }
}
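
Before launching OpenClaw, it's worth checking that something is actually listening at that baseUrl. Ollama's /api/tags endpoint returns the models it can serve:

curl http://localhost:11434/api/tags

If the response includes llama3:8b, the config above should line up.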

Test the Connection

openclaw

Try chatting. If you get responses, OpenClaw is using your local model.

Step 5: Optimize Performance

Model Quantization

Quantized models trade a small amount of accuracy for lower memory use and noticeably faster responses:

# Q4 quantization (faster, slightly less accurate)
ollama pull llama3:8b-q4_0

# Q8 quantization (balanced)
ollama pull llama3:8b-q8_0

For most tasks, Q4 is indistinguishable from full precision.

GPU Acceleration

If you have an NVIDIA GPU, Ollama uses it automatically. Verify with:

nvidia-smi

For AMD, check Ollama's ROCm support documentation.

Memory Settings

For better performance on limited RAM:

OLLAMA_NUM_PARALLEL=1 ollama serve

This limits concurrent requests but reduces memory usage.
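
A couple of other environment variables can also help on memory-constrained machines. These are standard Ollama settings, but names and defaults can change between versions, so double-check against ollama serve --help or the docs:

# Keep at most one model resident in memory at a time
OLLAMA_MAX_LOADED_MODELS=1 ollama serve

# Unload idle models after one minute to free RAM sooner
OLLAMA_KEEP_ALIVE=1m ollama serve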

Step 6: Set Up as a Service

For 24/7 operation, run Ollama as a background service.

Mac (launchd)

Ollama installs as a background service automatically on macOS. To check whether the server is already up:

ollama serve
# If the server is already running, this fails because port 11434 is already in use
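
If you installed with Homebrew, you can also manage Ollama through brew services (assuming your formula version includes a service definition):

brew services start ollama
brew services list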

Linux (systemd)

sudo systemctl enable ollama
sudo systemctl start ollama
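
To confirm the service is healthy and follow its logs (the Linux install script creates the ollama unit by default):

systemctl status ollama
journalctl -u ollama -f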

Windows

Use the Ollama Windows service or Task Scheduler.

Advanced: Multiple Models

You can run different models for different tasks:

{
  "models": {
    "fast": {
      "provider": "ollama",
      "model": "mistral"
    },
    "coding": {
      "provider": "ollama",
      "model": "deepseek-coder"
    },
    "complex": {
      "provider": "ollama",
      "model": "llama3:70b"
    }
  },
  "routing": {
    "default": "fast",
    "coding": "coding",
    "analysis": "complex"
  }
}

Simple queries use fast models. Complex ones use larger models.

Advanced: Hybrid Local + Cloud

Get the best of both worlds:

{
  "models": {
    "local": {
      "provider": "ollama",
      "model": "llama3:8b"
    },
    "cloud": {
      "provider": "anthropic",
      "apiKey": "your-key"
    }
  },
  "routing": {
    "default": "local",
    "complex": "cloud"
  }
}

Most queries run locally (free, private). Complex queries use Claude.

Troubleshooting

Slow Responses

  • Check if you have enough RAM (monitor with htop)
  • Try a smaller model
  • Ensure the GPU is actually being used if you have one (see the check below)
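
ollama ps shows which models are currently loaded and whether they're running on the GPU or on the CPU:

ollama ps

If a model you expected on the GPU shows up as CPU, it may not fit in your VRAM; try a smaller or more aggressively quantized variant.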

Model Won't Load

  • Insufficient RAM — try a smaller model
  • Corrupt download — ollama pull again
  • Check disk space (commands below)
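
A quick way to work through those checks, using llama3:8b as the example model:

df -h                  # check free disk space
ollama rm llama3:8b    # remove the possibly corrupt copy
ollama pull llama3:8b  # re-download it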

Ollama Won't Start

  • Port conflict — check whether something else is using port 11434 (see below)
  • Permissions — run as a regular user, not root
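
To see what's holding port 11434 (Ollama's default):

lsof -i :11434

If the process listed is Ollama itself, the server is already running as a service and you don't need to start it again.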

Poor Quality Responses

Local models handle:

  • General questions well
  • Basic coding decently
  • Simple tasks reliably

They struggle with:

  • Complex reasoning
  • Nuanced creative writing
  • Difficult technical problems

For these, consider a hybrid approach with a cloud fallback.

Next Steps

  1. Experiment with models — Find what works for your tasks
  2. Set up OpenClaw channels — Telegram, WhatsApp
  3. Configure memory — USER.md, MEMORY.md
  4. Build habits — Use it daily

Need help? OpenClaw Cloud offers managed hosting with local model options — best of both worlds.
