How to Self-Host an LLM: Run AI Models on Your Own Hardware
Complete guide to running large language models locally. Llama, Mistral, Qwen, and other open-source models on your Mac, PC, or server — fully offline, zero API costs.
😫 The Problem
Every time you use ChatGPT or Claude, your data goes through corporate servers. API costs add up. You're dependent on internet connectivity. Rate limits throttle your usage. Model changes break your workflows. And for truly sensitive work — legal documents, medical records, proprietary code — cloud AI isn't an option. You need AI that runs completely on your hardware.
✨ The Solution
Self-host open-source LLMs and run AI inference locally. Modern models like Llama 3, Mistral, and Qwen approach GPT-4 quality on many tasks while running on consumer hardware. A Mac with 16GB RAM handles 7B-13B parameter models easily; 70B models need roughly 40GB+ of memory, so they're the domain of 64GB Apple Silicon machines, workstation GPUs, or multi-GPU rigs. Zero API costs, zero data leakage, zero internet dependency. This guide covers hardware selection, model choices, and the fastest paths to local AI.
Step by Step
Understand what you're running: An LLM is a file (usually 4-50GB) containing model weights. Your computer loads this into memory and generates text token by token. CPU inference works but is slow; GPU (CUDA/Metal) is 10-50x faster. Quantization shrinks models while keeping quality — a 70B model becomes usable on consumer hardware.
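The sizing math above can be sketched as a quick estimator: weight memory is roughly parameter count times bits per weight, plus runtime overhead for the KV cache and buffers. The ~4.5 bits/weight for Q4_K_M and the flat 20% overhead are rough assumptions, not exact figures.

```python
# Rough memory estimate for a quantized LLM. Weights dominate:
# size ≈ params × bits-per-weight / 8, plus overhead for the KV
# cache and runtime buffers (flat ~20% here -- an assumption).

def estimated_memory_gb(params_billions: float, bits_per_weight: float,
                        overhead: float = 0.20) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

# Q4_K_M averages roughly 4.5 bits/weight; FP16 is 16.
for params, bits in [(7, 4.5), (7, 16), (70, 4.5)]:
    print(f"{params}B @ {bits}-bit ≈ {estimated_memory_gb(params, bits):.1f} GB")
```

This is why a 4-bit 7B model fits comfortably in 16GB of RAM while the same model at FP16 does not, and why a 70B model needs 40GB+ even when quantized.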
Check your hardware: Minimum for usable inference: 16GB RAM (Mac/CPU) or 8GB+ VRAM (GPU) for quantized 7B models. Sweet spots: M2/M3 Mac with 32GB unified memory, RTX 3080/4080 with 12-16GB VRAM, or dual-GPU setups. CPU-only works for 7B models, but expect 5-10 tokens/second instead of 50+.
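To check whether your machine clears the RAM bar, a minimal sketch using only the standard library (works on Linux and macOS; the `sysconf` names are not available on Windows, where `psutil` would be the usual route):

```python
import os

# Total physical RAM via POSIX sysconf (Linux/macOS only).
def total_ram_gb() -> float:
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

ram = total_ram_gb()
print(f"Total RAM: {ram:.1f} GB")
if ram >= 16:
    print("OK for quantized 7B-13B models")
else:
    print("Stick to small quantized models (3B-7B at Q4)")
```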
Choose your model: Llama 3.1 8B — best all-rounder, runs on 16GB systems. Mistral 7B — fast, efficient, great for coding. Qwen 2.5 72B — approaches GPT-4 on many benchmarks, needs 48GB+ memory. DeepSeek Coder — specialized for programming. Start with Llama 3.1 8B if unsure.
Pick your inference engine: Ollama — easiest setup, one-command install, perfect for beginners. LM Studio — GUI app, good for model browsing. llama.cpp — fastest performance, requires compilation. vLLM — production-grade, best for servers. Koboldcpp — built-in chat UI, great for creative writing.
Install Ollama (recommended start): Mac: 'brew install ollama'. Linux: 'curl -fsSL https://ollama.com/install.sh | sh'. Windows: download installer from ollama.com. Run 'ollama serve' to start the server.
Download your first model: 'ollama pull llama3.1' downloads Llama 3.1 8B (4.7GB). First run takes a few minutes. The model lives in ~/.ollama/models and persists across restarts.
Test basic inference: 'ollama run llama3.1' opens interactive chat. Try complex questions, code generation, analysis tasks. 'ollama run llama3.1 "Explain quantum computing briefly"' for one-shot queries.
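Besides the CLI, `ollama serve` exposes an HTTP API on localhost:11434, which is how editors and scripts talk to it. A minimal sketch using only the standard library (the `generate()` call requires a running server; `build_request` just constructs the JSON body for `POST /api/generate`):

```python
import json
import urllib.request

# Build the JSON body Ollama's /api/generate endpoint expects.
def build_request(model: str, prompt: str, stream: bool = False) -> dict:
    return {"model": model, "prompt": prompt, "stream": stream}

# Send a one-shot prompt to a local Ollama server and return the text.
def generate(model: str, prompt: str,
             url: str = "http://localhost:11434/api/generate") -> str:
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama serve` running and the model pulled:
# print(generate("llama3.1", "Explain quantum computing briefly"))
```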
Measure performance: Run 'ollama run llama3.1 --verbose' to print timing stats (eval rate in tokens/second) after each response. 30+ tokens/sec feels real-time; 10-30 is usable; under 10 is sluggish. If it's slow, try a smaller model or a more aggressive quantization.
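If you're using the HTTP API instead of the CLI, the non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which give the generation speed directly:

```python
# Compute tokens/second from Ollama's /api/generate response fields.
# eval_duration is reported in nanoseconds.
def tokens_per_second(response: dict) -> float:
    return response["eval_count"] / (response["eval_duration"] / 1e9)

sample = {"eval_count": 150, "eval_duration": 3_000_000_000}  # 150 tokens in 3s
print(f"{tokens_per_second(sample):.1f} tok/s")  # 50.0 tok/s
```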
Try different quantizations: Models come in Q4_K_M, Q5_K_M, Q8_0 variants. Lower number = smaller, faster, slightly dumber. 'ollama pull llama3.1:70b-instruct-q4_K_M' for a compressed 70B model.
Connect to OpenClaw: Edit ~/.openclaw/config.yaml and add under models: 'local: ollama/llama3.1'. Now you can use 'llama3.1' as a model in OpenClaw. Switch between cloud and local models per-task.
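Based on the step above, the config might look like the fragment below; the exact schema is an assumption here, so check the OpenClaw docs for the authoritative shape.

```yaml
# ~/.openclaw/config.yaml -- illustrative sketch, verify against
# the OpenClaw documentation
models:
  local: ollama/llama3.1
```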
Set up GPU acceleration: Ollama auto-detects CUDA (NVIDIA) and Metal (Mac). Verify with 'ollama ps' after loading a model — it shows whether the model is running on GPU or CPU, and GPU inference shows much higher tokens/sec. For multi-GPU: set CUDA_VISIBLE_DEVICES=0,1 before running.
Run larger models: 'ollama pull llama3.1:70b' or 'ollama pull qwen2.5:72b'. These need 40GB+ memory. With 64GB of system RAM the model runs on CPU (slowly); with an A100 or multiple GPUs you get full speed.
Optimize for your workload: Coding? Use deepseek-coder or codellama. Creative writing? Try mistral-nemo or llama3.1-uncensored. Analysis? Qwen 2.5 excels. You can run multiple models and switch based on task.
Troubleshooting common issues: 'Out of memory' — try smaller model or quantization. Slow generation — verify GPU detection with 'nvidia-smi' or Activity Monitor (Mac). Model not found — check 'ollama list' and verify pull completed.
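A quick way to rule out "server not running" as the cause: a running Ollama instance answers plain GET requests on its root URL. A small reachability check (standard library only):

```python
import urllib.request
import urllib.error

# Returns True if an Ollama server answers at the given address.
# A running instance responds 200 ("Ollama is running") on GET /.
def server_up(url: str = "http://localhost:11434",
              timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama reachable:", server_up())
```

If this prints False, start the server with 'ollama serve' before chasing model-level issues.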
Production deployment: For always-on local AI, run 'ollama serve' as a systemd service (Linux) or launchd (Mac). Configure OLLAMA_HOST=0.0.0.0 to accept network connections. Add authentication if exposing beyond localhost.
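On Linux, a minimal unit file for the setup above might look like the sketch below. Note the official install script already creates a similar service; paths and the service user are assumptions to adjust for your system.

```ini
# /etc/systemd/system/ollama.service -- minimal sketch; the Linux
# installer creates a comparable unit. Adjust ExecStart path and User.
[Unit]
Description=Ollama LLM server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_HOST=0.0.0.0"
Restart=always
User=ollama

[Install]
WantedBy=multi-user.target
```

After writing the file: 'sudo systemctl daemon-reload && sudo systemctl enable --now ollama'.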
🔥 Your AI should run your business, not just answer questions.
We'll show you how. Free to join.
Join Vibe Combinator →
📚 Related Resources
Why Self-Host AI? Privacy, Cost, and Control Explained
Why run AI on your own computer when ChatGPT is a browser tab away? Privacy, cost savings, and control. Here's the complete case for self-hosted AI assistants.
Self-Hosted AI Assistant — Private & Secure
Run your AI assistant on your own hardware. No cloud, no data sharing, complete privacy. OpenClaw is the self-hosted AI that never phones home.
AI Assistant for Small Business Owners
Run your business without running yourself ragged
Ollama Local LLM — No Output or Broken Tools
Local models via Ollama show empty responses or can't use tools. Configuration tips.