My Ultimate Self-Hosted AI Chat Stack

Introduction

I use ChatGPT, Copilot and Claude interchangeably depending on my mood, topic, or data sensitivity. But these services run on someone else's infrastructure, train on my data, and are impossible to run offline. And the moment you start using AI for anything sensitive — internal docs, company data, personal projects — you get pushed towards a single vendor.

I wanted something different: a single stack that gives me control over which LLM I use, without separate subscription fees, running on my own hardware. And importantly: if it runs on my machine, it should work on yours too.

After a few iterations, I have that stack. I called it CustomAIChat (naming things is hard) and this post walks through every component, why it's there, and how to get it running yourself.

[image: infographic]

The stack at a glance

Most self-hosted AI setups solve one problem well and leave integration to the user. CustomAIChat is a Docker Compose project split into three tiers so you can start lean and expand as your hardware allows:

| Tier | Compose file | What it adds |
|------|--------------|--------------|
| core | docker-compose.yml | Chat UI, LLM proxy, observability, web search, databases |
| gpu | docker-compose.gpu.yml | Local LLM inference, GPU-accelerated speech-to-text, image generation |
| extras | docker-compose.extras.yml | Document research (Open Notebook), HTTPS reverse proxy (Caddy) |

The core tier runs on any machine with no GPU required. Add the GPU tier when you want local models and voice. Add extras when you're ready to put it on a real domain with TLS.

Core tier

Open WebUI — the frontend

Open WebUI is the chat interface. It looks and feels like ChatGPT but connects entirely to your own backends. You get conversation history with folders and search, per-message web search, image generation in chat, voice input, file uploads with RAG, and full user management with admin roles.

One thing to be aware of: the first user to register becomes the admin. Plan accordingly.

LiteLLM — the model gateway

LiteLLM sits between Open WebUI and every AI provider. OpenAI, Ollama local models, Azure AI Foundry, Anthropic, and 100+ more — all behind a single endpoint. Open WebUI only talks to LiteLLM; switching or adding models is a config change, not a code change.

It also handles cost-based routing (cheap requests go to cheap models automatically), Redis caching for repeated responses, and sends every call to Langfuse for tracing. Honestly, the feature set is way more than this stack needs, but it's fun to explore.

The current config covers Azure OpenAI GPT models and GPT Image via Azure AI Foundry. Local Ollama and direct OpenAI are one uncomment away.

# config/litellm/config.yaml (excerpt)
router_settings:
  routing_strategy: cost-based-routing
  num_retries: 2
  timeout: 120
  redis_host: redis
  redis_port: 6379

model_list:
  - model_name: gpt-5.4-mini
    litellm_params:
      model: azure/gpt-5.4-mini
      api_base: ${AZURE_OPENAI_ENDPOINT}
      api_key: ${AZURE_OPENAI_API_KEY}
      api_version: "2025-01-01-preview"

  - model_name: azure/gpt-image-1.5
    litellm_params:
      model: azure/gpt-image-1.5
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: os.environ/AZURE_API_VERSION
    model_info:
      mode: image_generation
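
Because LiteLLM exposes the standard OpenAI API, any OpenAI-compatible client can talk to it. A minimal sketch in Python (the URL, key, and model name are placeholders; substitute your own LITELLM_MASTER_KEY and a model from config.yaml):

```python
# Build a chat completion request against LiteLLM's OpenAI-compatible
# endpoint. Uses only the standard library; the official openai client
# works too, pointed at the same base URL.
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions, the endpoint LiteLLM proxies."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = chat_request("http://localhost:4000", "sk-your-master-key",
                   "gpt-5.4-mini", "Hello from the stack!")
```

Send it with `urllib.request.urlopen(req)` and read `choices[0].message.content` from the JSON response.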

SearXNG — private web search

SearXNG is a self-hosted meta-search engine that queries Bing, Google, DuckDuckGo and others simultaneously without exposing your identity to any of them.

In Open WebUI, clicking the globe icon on any message triggers a SearXNG search and injects results into the prompt context. The model gets current information; the search engines get an anonymous request. No API keys, no tracking, no per-search billing. This is how you give your LLM web access without giving away your data.

Langfuse — observability

Langfuse receives a trace for every LLM call that passes through LiteLLM. The dashboard gives you input/output text, latency per model, token counts and cost per call, per-user breakdowns, and error rates with retry patterns.

This is invaluable when something behaves unexpectedly. You can replay the exact call, see the full prompt, and compare how different models respond. The stack includes ClickHouse as an analytics backend so trace queries stay fast even with thousands of entries.

Both Langfuse and ClickHouse are optional — but once you see what they reveal, you'll keep them running.

GPU tier

Whisper — speech to text

The GPU tier adds a dedicated Whisper ASR service for processing microphone input from Open WebUI. GPU acceleration makes transcription near real-time even on the large-v3 model.

# .env — choose your accuracy/speed tradeoff
WHISPER_ASR_MODEL=large-v3   # best accuracy
# WHISPER_ASR_MODEL=medium   # faster
# WHISPER_ASR_MODEL=base     # fastest

The core tier works without it — Open WebUI has a built-in CPU Whisper fallback — but the dedicated service is noticeably faster.

ComfyUI — local image generation

ComfyUI handles local Stable Diffusion inference. Drop any .safetensors checkpoint into data/comfyui/models/checkpoints/ and it's immediately available. Supports SDXL, SD 1.5, FLUX, and anything else you throw at it.

For cloud image generation, LiteLLM routes to Azure GPT Image 1.5 or Azure AI Foundry FLUX.2-pro. Pick your model in the Open WebUI settings.

Extras tier

Open Notebook — document research

Open Notebook is a self-hosted alternative to Google NotebookLM. Upload PDFs, web pages, or text files and have the LLM answer questions across them. It connects to LiteLLM, so it uses the same model pool as your chat.

This is where the stack really shines for work: meeting transcripts, architecture docs, long reports — all without sending a single byte to Google.

Caddy — HTTPS reverse proxy

Caddy 2 proxies every service behind a single domain and handles TLS automatically via Let's Encrypt. Going from localhost to a public domain is one variable:

# .env
CADDY_DOMAIN=ai.example.com

Caddy reads this, configures HTTPS with a valid certificate, and handles renewals automatically.

The databases

The stack uses four data stores, each picked for a specific reason:

| Database | Used by | Purpose |
|----------|---------|---------|
| PostgreSQL 16 | LiteLLM, Langfuse | Primary data store |
| Redis 7 | LiteLLM | Response caching, rate limiting |
| ClickHouse 24 | Langfuse | High-volume analytics traces |
| SeaweedFS | Langfuse | S3-compatible object storage for media and events |

All data lands in ./data/ on the host. Everything survives container restarts and updates.

Getting started

Prerequisites

  • Docker ≥ 24.0 and Docker Compose ≥ 2.20
  • NVIDIA GPU + NVIDIA Container Toolkit (GPU tier only)
  • 16 GB RAM minimum for core; 32 GB recommended with GPU tier

1. Clone and configure

git clone https://github.com/jdgoeij/CustomAIChat.git
cd CustomAIChat
cp .env.example .env

Open .env and fill in your secrets. Every variable has an inline comment. At minimum you need POSTGRES_PASSWORD and REDIS_PASSWORD (strong random strings), a LITELLM_MASTER_KEY (the API key all clients use), Langfuse auth secrets (LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SALT), and your Azure OpenAI or OpenAI credentials if you want cloud models from day one.
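
For the strong random strings, Python's standard library is enough. A quick sketch (the variable names match the ones above):

```python
# Generate the kind of strong random secrets .env asks for.
# secrets.token_urlsafe(n) returns roughly 1.3 * n URL-safe characters.
import secrets

def make_secret(nbytes: int = 32) -> str:
    """A cryptographically random, URL-safe string."""
    return secrets.token_urlsafe(nbytes)

for name in ("POSTGRES_PASSWORD", "REDIS_PASSWORD", "LANGFUSE_SALT"):
    print(f"{name}={make_secret()}")
```

`openssl rand -base64 32` does the same job from a shell.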

2. Start the stack

Core only (no GPU required)

.\scripts\start.ps1 up core

Once all containers are healthy, you'll see:

✅ Open WebUI       → http://localhost:3000
✅ Langfuse         → http://localhost:3001
✅ Open Notebook    → http://localhost:3002
✅ LiteLLM API      → http://localhost:4000
✅ SearXNG          → http://localhost:8080

3. First-run checklist

  • Open WebUI at :3000 — register your admin account (first registration wins)
  • Langfuse at :3001 — create your organisation and generate an API key pair
  • Paste Langfuse keys back into .env and restart: .\scripts\start.ps1 up core
  • Pull a local model (GPU tier): docker exec ollama ollama pull llama3.2
  • Drop Stable Diffusion checkpoints into data/comfyui/models/checkpoints/

Configuring Open WebUI

Once everything is running, Open WebUI needs to know about SearXNG and your image generation backend. Neither works out of the box — but both are quick to set up.

Web search with SearXNG

Open WebUI talks to SearXNG over HTTP and expects JSON responses. The Docker Compose stack already handles networking between the containers, but SearXNG ships with JSON output disabled by default. Without it, Open WebUI gets HTML back and throws a 403 Forbidden error.

First, make sure SearXNG has started at least once so it generates its config files. Then edit data/searxng/settings.yml and add json to the formats list:

# data/searxng/settings.yml
search:
  formats:
    - html
    - json    # required for Open WebUI

Restart SearXNG after this change. Then in Open WebUI, go to Admin Panel → Settings → Web Search and configure:

  • Web Search Engine: SearXNG
  • SearXNG Query URL: http://searxng:8080/search?q=<query>

That's it. The globe icon in chat now triggers a private web search. Toggle it per message — it's not on by default.
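
Under the hood that query URL is just a GET with format=json appended. A small sketch of building it yourself, handy for checking from the host that JSON output really is enabled (localhost:8080 maps to the searxng container in this stack):

```python
# Build the same JSON search URL Open WebUI uses, so you can test it
# with curl or a browser before wiring up the chat UI.
from urllib.parse import urlencode

def searxng_url(base: str, query: str) -> str:
    """SearXNG search URL requesting JSON output."""
    return f"{base}/search?" + urlencode({"q": query, "format": "json"})

print(searxng_url("http://localhost:8080", "open webui"))
```

If the settings.yml change took effect you get a JSON body back; if not, you see the same 403 that Open WebUI does.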

Image generation

Open WebUI supports multiple image generation backends. Which one you configure depends on whether you're running the GPU tier (ComfyUI for local generation) or using cloud models through LiteLLM.

Option A: Cloud image generation via LiteLLM

If you have Azure GPT Image 1.5 or another OpenAI-compatible image model configured in LiteLLM, point Open WebUI to LiteLLM's API:

  1. Go to Admin Panel → Settings → Images
  2. Toggle Image Generation on
  3. Set Image Generation Engine to OpenAI
  4. Set API Base URL to http://litellm:4000/v1
  5. Set API Key to your LITELLM_MASTER_KEY
  6. Enter the model name exactly as it appears in your LiteLLM config (e.g. azure/gpt-image-1.5)
  7. Set Image Size to 1024x1024

For Azure specifically, make sure your LiteLLM config uses API version 2025-04-01-preview or later because older versions don't support the required parameters.
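
Behind that UI configuration, Open WebUI issues a standard OpenAI-style image call to LiteLLM. A sketch of the equivalent request (key and model name are placeholders from this stack's example config):

```python
# Build the OpenAI-style image generation request that Open WebUI
# sends to LiteLLM on your behalf.
import json
import urllib.request

def image_request(base_url: str, api_key: str, model: str, prompt: str,
                  size: str = "1024x1024") -> urllib.request.Request:
    """Build a POST to /v1/images/generations on an OpenAI-compatible endpoint."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "size": size, "n": 1}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = image_request("http://litellm:4000", "sk-your-master-key",
                    "azure/gpt-image-1.5", "a tiny robot reading a book")
```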

Option B: Local generation with ComfyUI

If you're running the GPU tier with ComfyUI:

  1. Go to Admin Panel → Settings → Images
  2. Toggle Image Generation on
  3. Set Image Generation Engine to ComfyUI
  4. Set ComfyUI Base URL to http://comfyui:8188
  5. Import your workflow JSON (exported from ComfyUI in API Format — not the standard save)

The API Format export is important: in ComfyUI, enable "Dev mode Options" in settings first, then use "Save (API Format)" from the menu. The standard JSON export won't work.
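
That API-format JSON is what ComfyUI's own HTTP API consumes. A hypothetical sketch of queueing a workflow programmatically (assumes the stock POST /prompt endpoint):

```python
# Queue an API-format workflow against ComfyUI. The workflow dict is
# whatever you exported with "Save (API Format)"; an empty dict stands
# in for it here.
import json
import urllib.request

def queue_workflow(base_url: str, workflow: dict) -> urllib.request.Request:
    """Build a POST /prompt request carrying an API-format workflow."""
    body = json.dumps({"prompt": workflow}).encode()
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = queue_workflow("http://comfyui:8188", {})  # stand-in workflow
```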

Drop your .safetensors checkpoints into data/comfyui/models/checkpoints/ and they appear immediately. No restart needed.

Keeping image models out of the chat selector

Once you've configured the image backend, you'll notice the image models show up in the main model selector alongside your chat models. That's not ideal: you don't want to accidentally start a conversation with an image-only model.

The trick is to hide the image models from the selector but still make them available for in-chat image generation. Here's how:

  1. Go to Workspace → Models and find your image model (e.g. azure/gpt-image-1.5)
  2. Disable or hide the model so it no longer appears in the model dropdown
  3. Then edit each chat model you want to use for image generation — open its settings and enable the Image Generation capability

Now when you select a chat model like GPT-5.4-mini, an image generation button appears in the chat input. You stay in your conversation, click the button, type a prompt, and the image is generated by the backend you configured, without ever leaving the chat or switching models. Text and images stay in one flow, just as you're used to in ChatGPT.

Image examples

Create an image: An ominous robot overlord in a futuristic control room, surrounded by glowing monitors, holographic interfaces, and banks of surveillance cameras, watching over a vast city through large windows. The scene is cinematic and dramatic, with a cold blue and red color palette, subtle fog, towering machinery, and a sense of technological surveillance and AI dominance. The robot is large, sleek, and intimidating, but clearly fictional and non-human. Highly detailed, realistic sci-fi concept art, moody lighting, wide composition.

Result: [image: ai-overlord]

Or something else:

Create an image: A vibrant technical welcome scene for OpenWebUI running in a personal Docker Compose stack, available to everyone. Futuristic neon color palette with glowing cyan, magenta, purple, and electric blue accents. Show a sleek containerized infrastructure: Docker Compose YAML panels, modular service blocks, network lines, server racks, and an AI chat interface labeled OpenWebUI at the center. The mood is happy, welcoming, modern, and community-friendly. Clean high-tech UI elements, holographic displays, subtle circuit patterns, depth, and soft neon bloom. Highly detailed, cinematic lighting, professional tech illustration, sharp lines, glossy surfaces, and a premium cyberpunk-but-accessible aesthetic.

Result 2: [image: openwebui]

Common pitfalls

No models in Open WebUI? LiteLLM probably hasn't connected yet. Check docker logs litellm — a single bad API key will silently skip that model on startup.

SearXNG returning 403? The SEARXNG_SECRET_KEY must be set before first boot. If you changed it after, delete data/searxng/ and restart.

Langfuse not receiving traces? LiteLLM needs LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY. Restart LiteLLM after setting them and verify under Traces in the dashboard.

GPU services ignoring the GPU? Confirm the NVIDIA Container Toolkit works: docker run --gpus all nvidia/cuda:12.0-base nvidia-smi. If that fails, the toolkit isn't installed correctly.

Caddy certificate failures? Your domain must be publicly reachable on ports 80 and 443 for Let's Encrypt. Use localhost for local-only setups.

What's next

The stack is intentionally modular — start with core, get comfortable with the UI and model routing, then add GPU services when the hardware is ready.

Most of the interesting customisation lives in the LiteLLM config. The routing docs cover fallback chains (Azure → Ollama on quota errors), per-model rate limits, and budget enforcement per user.
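
As a taste, a fallback chain in config.yaml might look like this (a sketch only; check the key names against the LiteLLM routing docs, and ollama/llama3.2 assumes you pulled that model in the GPU tier):

```yaml
# config/litellm/config.yaml (hypothetical excerpt)
router_settings:
  routing_strategy: cost-based-routing
  # On quota or rate-limit errors, retry the request on the local model
  fallbacks:
    - gpt-5.4-mini: ["ollama/llama3.2"]
```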

And whenever you want to know exactly what a model said, what it cost, and how long it took, Langfuse already has the answer.
