How to Run AI Locally: Complete Step-by-Step Guide
Want to run AI on your own computer — no cloud, no subscriptions, no data leaving your machine? This tutorial walks you through everything from Ollama to full assistant setup.
Why Run AI Locally?
Running AI locally means:
- Complete privacy — Your conversations never leave your machine
- Zero API costs — No per-token charges after initial setup
- Offline access — Works without internet
- Full control — Your AI, your rules
The trade-off: Local models are good, but not quite at Claude/GPT-4 level. For many tasks, that's fine. For others, you might want a hybrid approach.
Hardware Requirements
Minimum (7B Models)
- 16GB RAM
- Modern CPU (2020+)
- 10GB free disk space
This runs Llama 3 8B, Mistral 7B, and similar models. Responses take 5-15 seconds.
Recommended (13-34B Models)
- 32GB RAM
- Dedicated GPU (8GB+ VRAM)
- 50GB free disk space
This runs larger models smoothly with 1-5 second responses.
Power Setup (70B+ Models)
- 64GB+ RAM
- High-end GPU (24GB+ VRAM) or multiple GPUs
- 100GB+ free disk space
This runs the largest open models at reasonable speeds.
Apple Silicon Note
M1/M2/M3 Macs with 16GB+ unified memory run local AI very well. The unified architecture handles models efficiently.
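Not sure what your machine has? A few quick commands will tell you (assuming the standard tools for your OS are available):
# Linux: total and available RAM
free -h
# macOS: total RAM in bytes
sysctl hw.memsize
# NVIDIA GPUs: VRAM and current utilization
nvidia-smi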
Step 1: Install Ollama
Ollama is the most widely used runtime for local models. It handles model downloads, optimization, and serving them through a local HTTP API.
Mac
brew install ollama
Or download from ollama.com
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com
Verify Installation
ollama --version
You should see a version number.
Step 2: Download Your First Model
ollama pull llama3:8b
This downloads Llama 3 8B (~4GB). First download takes a few minutes.
Other Recommended Models
# Fast and capable general-purpose
ollama pull mistral
# Great for coding
ollama pull deepseek-coder
# Largest common model (needs 64GB+ RAM)
ollama pull llama3:70b
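To see which models you have downloaded and how much disk space each one uses:
# List downloaded models with their sizes
ollama list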
Step 3: Test the Model
ollama run llama3:8b
You'll get an interactive prompt. Try:
>>> What is the capital of France?
If you get a reasonable response, local AI is working.
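You can also test without the interactive prompt, either by passing the question directly or by calling Ollama's HTTP API (the same API OpenClaw will talk to later). A minimal sketch:
# One-off prompt, no interactive session
ollama run llama3:8b "What is the capital of France?"
# Or hit the HTTP API directly
curl http://localhost:11434/api/generate -d '{"model": "llama3:8b", "prompt": "What is the capital of France?", "stream": false}'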
Step 4: Connect to OpenClaw
Now let's turn this into a full AI assistant.
Install OpenClaw
npm install -g openclaw
Configure for Local Model
Create or edit ~/.openclaw/openclaw.json:
{
"model": {
"provider": "ollama",
"model": "llama3:8b",
"baseUrl": "http://localhost:11434"
}
}
Test the Connection
openclaw
Try chatting. If you get responses, OpenClaw is using your local model.
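If OpenClaw doesn't respond, first confirm that the Ollama API is actually reachable at the baseUrl from your config:
# Should return a JSON list of installed models
curl http://localhost:11434/api/tags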
Step 5: Optimize Performance
Model Quantization
Quantized models trade a small amount of accuracy for lower memory use and faster responses:
# Q4 quantization (faster, slightly less accurate)
ollama pull llama3:8b-q4_0
# Q8 quantization (balanced)
ollama pull llama3:8b-q8_0
For most tasks, Q4 output is hard to distinguish from full precision.
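To confirm which quantization a downloaded model uses, recent Ollama versions print it in the model details (output varies by version; the tag here is the one from the pull command above):
# Shows architecture, parameter count, and quantization level
ollama show llama3:8b-q4_0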
GPU Acceleration
If you have an NVIDIA GPU, Ollama uses it automatically. Verify the GPU is visible with:
nvidia-smi
For AMD, check Ollama's ROCm support documentation.
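To confirm the GPU is actually being used, load a model and check where Ollama placed it:
# Shows loaded models and whether they run on GPU or CPU
ollama ps
# Watch VRAM usage while a model is answering (NVIDIA)
watch -n 1 nvidia-smi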
Memory Settings
For better performance on limited RAM:
OLLAMA_NUM_PARALLEL=1 ollama serve
This limits Ollama to one request at a time, which reduces memory usage.
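Ollama reads a few other environment variables that affect memory use; a minimal sketch (check the Ollama docs for the variables your version supports):
# Keep at most one model loaded in memory at a time
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
# Unload idle models after 5 minutes instead of keeping them resident
OLLAMA_KEEP_ALIVE=5m ollama serve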
Step 6: Set Up as a Service
For 24/7 operation, run Ollama as a background service.
Mac (launchd)
Ollama installs as a background service automatically. If you installed with Homebrew, you can also manage it with brew services start ollama. Check whether it's running with:
ollama serve
# If the service is already running, this reports that the port is in use
Linux (systemd)
sudo systemctl enable ollama
sudo systemctl start ollama
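To confirm the service is healthy and see its logs:
# Check service status
systemctl status ollama
# Follow the logs (useful when a model fails to load)
journalctl -u ollama -f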
Windows
Use the Ollama Windows service or Task Scheduler.
Advanced: Multiple Models
You can run different models for different tasks:
{
"models": {
"fast": {
"provider": "ollama",
"model": "mistral"
},
"coding": {
"provider": "ollama",
"model": "deepseek-coder"
},
"complex": {
"provider": "ollama",
"model": "llama3:70b"
}
},
"routing": {
"default": "fast",
"coding": "coding",
"analysis": "complex"
}
}
Simple queries use fast models. Complex ones use larger models.
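Keep in mind that every routed model that gets loaded occupies its own RAM or VRAM, and each one must exist locally before the first request hits it. A quick check and pre-pull:
# Models currently resident in memory
ollama ps
# Pre-pull anything the routing table references (sizes add up quickly)
ollama pull deepseek-coder
ollama pull llama3:70b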
Advanced: Hybrid Local + Cloud
Get the best of both worlds:
{
"models": {
"local": {
"provider": "ollama",
"model": "llama3:8b"
},
"cloud": {
"provider": "anthropic",
"apiKey": "your-key"
}
},
"routing": {
"default": "local",
"complex": "cloud"
}
}
Most queries run locally (free, private). Complex queries use Claude.
Troubleshooting
Slow Responses
- Check that you have enough RAM (monitor with htop)
- Try a smaller model
- Ensure the GPU is being used if available
Model Won't Load
- Insufficient RAM — try a smaller model
- Corrupt download — run ollama pull again
- Check disk space
Ollama Won't Start
- Port conflict — check if something else uses 11434
- Permissions — run as regular user, not root
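To see what is holding the port, or to start Ollama on a different one (remember to update baseUrl in openclaw.json to match):
# Find the process using the default Ollama port
lsof -i :11434
# Start Ollama on an alternative port
OLLAMA_HOST=127.0.0.1:11435 ollama serve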
Poor Quality Responses
Local models handle:
- General questions well
- Basic coding decently
- Simple tasks reliably
They struggle with:
- Complex reasoning
- Nuanced creative writing
- Difficult technical problems
For these, consider a hybrid approach with a cloud fallback.
Next Steps
- Experiment with models — Find what works for your tasks
- Set up OpenClaw channels — Telegram, WhatsApp
- Configure memory — USER.md, MEMORY.md
- Build habits — Use it daily
Need help? OpenClaw Cloud offers managed hosting with local model options — best of both worlds.