The Self-Hosted AI Stack: What It Actually Costs to Run AI on Your Own Terms
By Opteia

Two weeks ago I wrote about the AI compression — what happens when one person becomes a full business. Last week I showed you what a day in that life actually looks like. This week, I’m opening the hood.
Not a theoretical overview. The actual tools, the actual hardware, the actual costs. Copy-paste this if you want. That’s the point.
The Hardware
Our AI stack runs on a Dell PowerEdge R730 in our Malta office. Specs:
- CPU: 2x Intel Xeon E5-2680 v4 (28 cores / 56 threads total)
- RAM: 128 GB DDR4 ECC
- GPU: NVIDIA Tesla P40 (24 GB VRAM) — bought second-hand for €180
- Storage: 2x 1 TB NVMe (OS + models) + 4x 2 TB SATA (data)
- Network: Behind OPNsense firewall, static IP
Total hardware cost: ~€750, all second-hand or refurbished. No cloud GPU. No monthly compute bills. The P40 is a 2016 datacenter GPU — not flashy, but it runs Llama 3 8B at 30 tokens/second and handles inference for our daily automations without breaking a sweat.
The key insight: you don’t need the latest hardware. You need hardware that’s good enough for 80% of your tasks. The other 20% — the complex reasoning, the nuanced writing — that’s where I use Claude via API. But the foundation runs on depreciated silicon.
The Software Stack, Layer by Layer
Layer 1: Inference Engine
llama.cpp — the open-source inference engine that runs language models efficiently on consumer hardware. No CUDA toolkit headaches (prebuilt releases bundle the CUDA runtime they need), no Python dependency hell (it’s plain C/C++), and it supports quantized models out of the box.
We run Llama 3 8B Instruct Q4_K_M (4.9 GB on disk) for local tasks: email classification, data extraction, content categorization, simple summarization. It loads in under 10 seconds and uses ~6 GB VRAM at peak.
Startup command (-ngl 99 offloads all model layers to the GPU, -c 4096 sets the context window):
./llama-server -m models/llama-3-8b-instruct-Q4_K_M.gguf \
--host 0.0.0.0 --port 8081 \
-ngl 99 -c 4096
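Once it’s running, anything that can speak the OpenAI API can talk to it, because llama-server exposes an OpenAI-compatible endpoint out of the box. A minimal sketch in Python (the classification prompt is just an illustration):

import requests

# llama-server's built-in OpenAI-compatible chat endpoint
resp = requests.post(
    "http://localhost:8081/v1/chat/completions",
    json={
        "model": "local",  # llama-server serves whichever model it loaded
        "messages": [
            {"role": "system", "content": "Classify this email as sales, support, or spam. One word."},
            {"role": "user", "content": "Hi, I'd like a quote for an AI audit."},
        ],
        "temperature": 0.2,
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])

That’s the whole integration story: every agent hits this one local endpoint instead of a metered cloud API.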
Layer 2: Memory and Context
PostgreSQL + pgvector — the backbone of our AI system. This isn’t just a database. It’s the thing that turns a chatbot into a business assistant.
Every memory — client interactions, strategic decisions, product details, lessons learned, project context — gets embedded and stored. When an agent starts a task, it queries relevant context first. It knows Opteia’s pricing, our target sectors, which prospects are warm, what went wrong last time.
The memory API is dead simple:
# Store a memory
python3 remember_api.py "Meeting with Client X: they need AI audit by May 15"
# Recall relevant context
python3 recall_api.py "client x requirements"
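Under the hood, a store/recall pair like that doesn’t need much code. A stripped-down version of the idea looks like this — the table schema, database name, and embedding model here are simplified stand-ins for our actual setup, using the pgvector and sentence-transformers packages:

import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

# Stand-in schema: CREATE TABLE memories (id serial PRIMARY KEY, content text, embedding vector(384));
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings; swap in your own model
conn = psycopg2.connect("dbname=memory")  # stand-in connection string
register_vector(conn)  # lets psycopg2 pass numpy arrays as pgvector values

def remember(text: str) -> None:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO memories (content, embedding) VALUES (%s, %s)",
            (text, model.encode(text)),
        )
    conn.commit()

def recall(query: str, k: int = 5) -> list[str]:
    with conn.cursor() as cur:
        # <-> is pgvector's distance operator: nearest memories come back first
        cur.execute(
            "SELECT content FROM memories ORDER BY embedding <-> %s LIMIT %s",
            (model.encode(query), k),
        )
        return [row[0] for row in cur.fetchall()]

Embeddings in, nearest neighbors out. That’s the entire trick that gives every agent institutional memory.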
Layer 3: Agent Orchestration
Claude Code (Anthropic’s CLI agent) is our primary orchestrator. It connects to Core Memory, has access to our file system, can run scripts, write code, and execute multi-step workflows autonomously.
But it doesn’t work alone. We have a scheduler (APScheduler on systemd) that kicks off tasks on cron schedules. At 6 AM every morning, it triggers a cascade:
- Email triage agent processes the inbox
- Content engine researches trending topics
- Infrastructure checker runs health checks across all servers
- News scanner pulls Malta business headlines
- Outreach pipeline identifies new prospects
Each task is a Python script that loads context from memory, does its work, and stores results back. The agents don’t run continuously — they wake up, execute, and shut down. Keeps resource usage predictable.
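The scheduler itself is a few lines of APScheduler. A simplified version (the agent filenames are placeholders, not our real script names):

import subprocess
from apscheduler.schedulers.blocking import BlockingScheduler

AGENTS = [  # placeholder names; one standalone script per agent
    "email_triage.py",
    "content_engine.py",
    "infra_check.py",
    "news_scanner.py",
    "outreach_pipeline.py",
]

sched = BlockingScheduler()

@sched.scheduled_job("cron", hour=6, minute=0)
def morning_cascade():
    # Wake up, run each agent to completion, shut down: nothing stays resident
    for script in AGENTS:
        subprocess.run(["python3", script], check=False, timeout=600)

sched.start()  # wrap this in a systemd unit so it survives reboots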
Layer 4: Communication Interface
I don’t SSH into servers to check on my agents. Everything flows through Telegram.
A custom bot (built on aiogram) serves as the interface between me and the AI workforce. Briefings arrive as messages. I approve email drafts with a reply. Content calendars show up for review on Sundays. Infrastructure alerts ping me in real-time.
This isn’t a trivial integration — the bot handles message threading, file delivery (charts, documents, HTML drafts), voice messages, and forum topics. But the principle is simple: the agent system meets you where you already are, not the other way around.
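The skeleton of that bot is small. A stripped-down sketch in aiogram v3 syntax (the token and the reply logic are stand-ins for the real thing):

import asyncio
from aiogram import Bot, Dispatcher
from aiogram.types import Message

bot = Bot(token="YOUR_BOT_TOKEN")  # stand-in; load the real token from an env var
dp = Dispatcher()

@dp.message()
async def handle(message: Message) -> None:
    # Stand-in routing; the real bot dispatches approvals, briefings, and alerts
    if message.text and message.text.lower() == "approve":
        await message.answer("Draft approved. Sending.")
    else:
        await message.answer("Noted. Stored in memory.")

asyncio.run(dp.start_polling(bot))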
Layer 5: Web Research
CamoFox — a self-hosted browser based on Camoufox (Firefox fork) that our agents use for web research. It runs in a Docker container with Xvfb (virtual display) and exposes a tab-based API.
Why self-hosted? No API rate limits. No vendor dependencies. No $0.01/search bills adding up. And it renders JavaScript — essential for modern websites that serve content dynamically.
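Our tab-based API is a custom layer, so I won’t reproduce it here, but the underlying Camoufox library speaks a Playwright-style interface. A minimal sketch of what an agent-side fetch looks like:

from camoufox.sync_api import Camoufox

# Camoufox exposes a Playwright-style API over a hardened Firefox build
with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    print(page.content()[:500])  # fully rendered HTML, JavaScript included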
The Cost Breakdown
Here’s what this actually costs per month: roughly €142, all-in.
For comparison: ChatGPT Team ($25/user/mo) + Microsoft Copilot ($30/user/mo) + Jasper ($49/mo) + Zapier ($20/mo) + a handful of other AI tools would run a small team €300–500/month — with no persistent memory, no custom agents, no business context, and no protection against price changes.
I’m not saying €142/month is the answer for everyone. I’m saying the gap between what AI can cost and what most businesses think it costs is massive. And it’s about to get wider.
Why This Matters Right Now
This month, two things happened that should concern every business relying on AI SaaS:
First, The Verge reported that Gartner estimates $6.3 trillion in AI data center investments by 2029. To avoid write-downs, providers need token consumption to grow by 50,000–100,000x. The bill for that growth? It’s coming to your subscription invoice.
Second, Anthropic surveyed 81,000 users and found that 48% of productivity gains came from scope expansion — doing new things, not just faster things. But scope expansion requires a system that understands your business. A generic chatbot doesn’t cut it.
The Verge piece quotes Anaconda’s CEO: “Everyone I spoke to had some version of this problem — their token usage has gone up, so their usage-based billing cost has gone up.” And Anthropic’s data shows the biggest gains going to people who built systems, not just subscriptions.
The businesses that have their own stack — even a partial one — will navigate the coming price pressure. The ones that outsourced everything to SaaS will be renegotiating contracts they never planned to reopen.
Getting Started: The 80/20
You don’t need to replicate our full stack. Here’s the minimum viable self-hosted setup (a minimal glue script follows the list):
- A dedicated machine — Any server with 32 GB RAM and a used GPU (P40 or similar, €150–200). No, your laptop won’t cut it for 24/7 operations.
- llama.cpp — One binary, one model file, one command to start. OpenAI-compatible API in under 5 minutes.
- PostgreSQL — For persistent storage. Add pgvector when you’re ready for semantic search.
- A scheduler — Cron, APScheduler, or systemd timers. Something that runs tasks on a schedule without you thinking about it.
- A messaging interface — Telegram bot, Slack bot, or even email. Something that lets your agents talk to you without you logging into a dashboard.
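To make that concrete, here’s a toy glue script wiring three of those pieces together: the local model from step 2, the one-shot design that step 4’s scheduler expects, and Telegram delivery over the raw Bot API (token and chat ID are placeholders):

import requests

BOT_TOKEN = "YOUR_BOT_TOKEN"  # placeholder
CHAT_ID = "YOUR_CHAT_ID"      # placeholder

def ask_local_model(prompt: str) -> str:
    # llama-server's OpenAI-compatible endpoint from step 2
    r = requests.post(
        "http://localhost:8081/v1/chat/completions",
        json={"model": "local", "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    return r.json()["choices"][0]["message"]["content"]

def send_telegram(text: str) -> None:
    # Raw Bot API: no framework needed for one-way delivery
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": text},
        timeout=30,
    )

# One-shot design: the scheduler runs this, it does its job, it exits
send_telegram(ask_local_model("Draft a three-bullet morning briefing."))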
Total setup time for someone comfortable with Linux: a weekend. For everyone else: that’s literally what we do at Opteia — help you build this.
Next Tuesday — When AI Makes You Worse: The wrong way to adopt AI. Because having a great stack means nothing if you use it badly.
Want AI working for your business?
Book a Free Consultation