<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ai on Home</title><link>/tags/ai/</link><description>Recent content in Ai on Home</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 02 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="/tags/ai/" rel="self" type="application/rss+xml"/><item><title>Hermes Agent: A Personal AI That Gets More Useful Over Time</title><link>/2026/hermes-agent-a-personal-ai-that-gets-more-useful-over-time/</link><pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate><guid>/2026/hermes-agent-a-personal-ai-that-gets-more-useful-over-time/</guid><description>&lt;figure&gt;&lt;img src="/images/posts/post_28/overview.png"data-src="/images/posts/post_28/overview.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;How Hermes Agent Works: From Closed-Loop Learning to Multi-Platform Deployment - AI generated&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I came across the &lt;a href="https://github.com/nousresearch/hermes-agent"&gt;&lt;em&gt;Hermes Agent project&lt;/em&gt;&lt;/a&gt; in early March 2026 and deployed it a couple of days later. A couple of weeks in, I am still using it daily, and the use cases keep expanding rather than converging. Most tools settle into a narrow routine or fall out of use altogether. What keeps this one going is that the agent gets more useful the longer you run it. The project is young and moving fast, with new releases every few days. The initial setup requires patience: getting the configuration to a point where it actually saves time takes effort, and the frequent updates occasionally introduce breaking changes. That said, it is genuinely fun to use, and you learn a fair amount along the way.&lt;/p&gt;
&lt;p&gt;Hermes Agent is an open-source, self-hosted AI agent framework built by &lt;a href="https://nousresearch.com/"&gt;&lt;em&gt;Nous Research&lt;/em&gt;&lt;/a&gt;, an independent AI research lab based in Austin, Texas. Nous Research is best known for the Hermes model family, a series of open-weight models fine-tuned on Llama that are used widely in the open-source AI community. The agent framework shares the name but is a separate project. It is MIT-licensed, model-agnostic, and runs on your own infrastructure, either as a self-hosted Python service or as a containerized deployment.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How It Works&lt;/h2&gt;
&lt;p&gt;The part that makes Hermes Agent different from most agent frameworks is the skill system. The agent ships with a set of preconfigured skills covering common tasks. Beyond that, you can ask it to create a skill from something it just did: it writes a structured Markdown document capturing the approach, what worked, and possible edge cases. The next time a similar task appears, the agent loads the relevant skill rather than starting from scratch. Skills can be triggered directly by asking Hermes to run one, or set on a schedule and executed automatically at defined intervals. Over time this turns completed work into a growing library of reusable operating knowledge. Version 0.12.0 added an Autonomous Curator to keep that library from growing unwieldy. It runs on a seven-day cycle by default, grades skills by usage, consolidates overlapping ones, and removes those that have stopped being useful. A short report is written after each run, so you can see what changed and why.&lt;/p&gt;
&lt;p&gt;Alongside the skill system, the agent maintains three layers of memory: a persistent store for completed tasks and notes, a full-text search index across prior sessions, and a user model that accumulates preferences (coding style, communication tone, timezone, tools) over time. The idea is that the agent gets more useful the longer you run it, not just better at individual tasks in isolation.&lt;/p&gt;
&lt;h2 id="my-setup"&gt;My Setup&lt;/h2&gt;
&lt;p&gt;Hermes Agent runs in my &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;homelab&lt;/a&gt; as a service on a dedicated Linux host. Keeping it on a separate machine gives me direct control over what the agent has access to. Incoming traffic is routed through Traefik. I access it through three entry points depending on where I am and what I am doing. The primary interface is the &lt;a href="https://matrix.org/"&gt;&lt;em&gt;Matrix&lt;/em&gt;&lt;/a&gt; chat protocol, which means I can reach the agent from any Matrix client on any device. I also connected it to a dedicated email inbox, so it can handle certain tasks asynchronously. For longer sessions at my desk I use &lt;em&gt;Open WebUI&lt;/em&gt;, which gives a more comfortable interface for extended conversations.&lt;/p&gt;
&lt;p&gt;The model configuration is versatile: the agent supports various AI services and model providers.&lt;/p&gt;
&lt;h2 id="what-i-gave-it-access-to"&gt;What I Gave It Access To&lt;/h2&gt;
&lt;p&gt;I gave the agent access to three local knowledge sources: my bookmarks, a structured knowledge base, and a local mirror of Red Hat&amp;rsquo;s product documentation.&lt;/p&gt;
&lt;p&gt;The first is my bookmarks folder. I have been saving links as Markdown files in Obsidian for several years. The agent can search and cross-reference that collection when doing research, which means it draws on context I actually care about rather than training data alone.&lt;/p&gt;
&lt;p&gt;The second is a knowledge base built on the &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f"&gt;LLM Wiki&lt;/a&gt; principle described by Andrej Karpathy. The idea is to maintain a curated set of structured Markdown files that an AI agent helps write and update over time. Topics, entities, comparisons, each in its own file. The agent both contributes to this knowledge base and draws from it when working on research tasks.&lt;/p&gt;
&lt;p&gt;The third is a local mirror of Red Hat&amp;rsquo;s product documentation. A teammate built a tool called &lt;em&gt;rh-mastery&lt;/em&gt; that pulls documentation from &lt;em&gt;docs.redhat.com&lt;/em&gt;, converts it to Markdown, and stores it in a structured local directory. Pointed at that directory, Hermes can query accurate, version-tracked product documentation without touching the internet. For someone who spends a lot of time with Red Hat products, this closes a gap that is easy to overlook until you actually need it. More on rh-mastery in an upcoming post.&lt;/p&gt;
&lt;h2 id="practical-uses"&gt;Practical Uses&lt;/h2&gt;
&lt;p&gt;The combination of bookmarks, structured knowledge, Red Hat&amp;rsquo;s product documentation, and the skill system makes the agent genuinely useful for research. When I ask it to investigate a topic, it starts with what I have already collected: prior notes, bookmarks, and documentation. If that is not enough, and when asked, it reaches out to the web to fill the gaps. The result is something grounded in material I collected and curated myself, which makes the output in most cases very useful.&lt;/p&gt;
&lt;p&gt;One use I did not expect to find so useful: slide generation. I integrated &lt;em&gt;Marp&lt;/em&gt;, a Markdown-based presentation framework, into the workflow. When I need to put together a presentation and am staring at a blank file, I can ask the agent to draft an initial structure. Getting past that first empty screen is often the hardest part. Whether I keep most of what it produces is a different question, but having something to react to beats starting from nothing.&lt;/p&gt;
&lt;h2 id="skills-and-subagents"&gt;Skills and Subagents&lt;/h2&gt;
&lt;p&gt;The agent can develop and add skills on its own as it works, but skills can also be added manually or loaded from the community hub at &lt;a href="https://agentskills.io"&gt;agentskills.io&lt;/a&gt;. More interesting to me is the subagent capability: the agent can delegate tasks to specialized subagents, each backed by a specific AI service or holding a particular context. This makes it possible to compose workflows where different parts of a task go to the most appropriate model.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Several weeks in is not a long track record, and the project is still moving fast enough that some things will break between releases. That said, the architecture is sound and the development pace is truly impressive. Whether I will keep running it long-term, I genuinely do not know. For now, it is pulling its weight. For anyone already running a homelab and looking for a self-hosted agent that gets more useful over time rather than staying flat, Hermes Agent is worth the setup time.&lt;/p&gt;
&lt;p&gt;Peter Steinberger, the creator of &lt;em&gt;OpenClaw&lt;/em&gt;, another widely-used AI agent framework, put it well in a recent &lt;a href="https://www.youtube.com/watch?v=7rzYDM6vMtI"&gt;TED talk&lt;/a&gt;: &amp;ldquo;The bottleneck is no longer typing. It&amp;rsquo;s thinking.&amp;rdquo; That observation fits. The agent handles the mechanical parts of research and structuring. The judgment about what matters and what to do with it still has to come from someone. For now, a human in the loop is still necessary.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Hermes Agent on GitHub - &lt;a href="https://github.com/nousresearch/hermes-agent"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hermes Agent Documentation - &lt;a href="https://hermes-agent.nousresearch.com/docs/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Nous Research - &lt;a href="https://nousresearch.com/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Matrix - &lt;a href="https://matrix.org/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;OpenRouter - &lt;a href="https://openrouter.ai/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Andrej Karpathy LLM Wiki concept - &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Marp - &lt;a href="https://marp.app/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;agentskills.io - &lt;a href="https://agentskills.io/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Peter Steinberger TED talk - &lt;a href="https://www.youtube.com/watch?v=7rzYDM6vMtI"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Extending the Local AI Stack with On-Demand GPU Inference on RunPod</title><link>/2026/extending-the-local-ai-stack-with-on-demand-gpu-inference-on-runpod/</link><pubDate>Sat, 07 Mar 2026 00:00:00 +0000</pubDate><guid>/2026/extending-the-local-ai-stack-with-on-demand-gpu-inference-on-runpod/</guid><description>&lt;figure&gt;&lt;img src="/images/posts/post_24/overview.png"data-src="/images/posts/post_24/overview.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Conceptual illustration of the extended AI stack with elastic cloud GPU resources for running large language models on demand - AI generated&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In this post, I want to describe how I extended the local AI stack I built in my homelab with on-demand GPU-backed model inference, without adding any GPU hardware to the lab itself.&lt;/p&gt;
&lt;p&gt;The two previous posts in this series provide the context for what follows. The &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;homelab post&lt;/a&gt; covers the base infrastructure: thin clients, Docker Compose, Traefik, and internal DNS. The &lt;a href="/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/"&gt;local AI stack post&lt;/a&gt; describes how &lt;em&gt;Open WebUI&lt;/em&gt;, &lt;em&gt;LiteLLM&lt;/em&gt;, &lt;em&gt;SearXNG&lt;/em&gt;, and &lt;em&gt;Docling&lt;/em&gt; sit on top of that infrastructure to form a self-hosted AI environment. That stack works well, and I have been using it for a while. Keeping the lab CPU-only is a deliberate choice. For orchestration, document workflows, and routing requests to publicly available AI services, dedicated GPU hardware at home is simply not necessary. When I want to try a particular model that is not available through a managed API, or experiment with something freshly released on Hugging Face, I rent the compute on demand rather than maintain it permanently.&lt;/p&gt;
&lt;p&gt;The solution is straightforward: rent GPU capacity on demand from a specialized cloud provider, expose it as an OpenAI-compatible endpoint, and wire it into the existing stack. No new hardware, no permanent cost, no changes to the tools I already use.&lt;/p&gt;
&lt;h2 id="a-note-on-neo-clouds"&gt;A Note on Neo Clouds&lt;/h2&gt;
&lt;p&gt;The providers that specialize in this type of GPU-first infrastructure are sometimes called &lt;em&gt;Neo Clouds&lt;/em&gt;. The term emerged around 2024 to distinguish GPU-specialist vendors such as RunPod, CoreWeave and others from traditional hyperscalers. In practice, I am not sure the new term adds much. For me these are specialized cloud providers focused on GPU compute and AI workloads. Useful services, somewhat unnecessary branding.&lt;/p&gt;
&lt;h2 id="why-runpod"&gt;Why RunPod&lt;/h2&gt;
&lt;p&gt;I use &lt;a href="https://www.runpod.io/"&gt;RunPod&lt;/a&gt; for this setup for a few practical reasons. The interface is intuitive, the deployment path from template to running pod is short, and the GPU catalog is broad enough to cover most use cases. Pricing is per second with no ingress or egress fees, which makes on-demand experimentation economical. RunPod also exposes an API for its core operations, so deployments can be automated rather than driven entirely through the UI.&lt;/p&gt;
&lt;p&gt;A detailed description of all RunPod services is out of scope for this post. The focus here is on one specific workflow: deploying a &lt;em&gt;vLLM&lt;/em&gt; inference server with a model loaded from &lt;em&gt;Hugging Face&lt;/em&gt;, and connecting the resulting endpoint to Open WebUI.&lt;/p&gt;
&lt;h2 id="deploying-a-vllm-inference-server-on-runpod"&gt;Deploying a vLLM Inference Server on RunPod&lt;/h2&gt;
&lt;p&gt;RunPod uses templates to save pod configurations for reuse. A template defines the container image, the start command, the storage allocation, and other runtime parameters. I maintain a small collection of private templates, each configured for a different model.&lt;/p&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/list_of_private_templates.png"data-src="/images/posts/post_24/list_of_private_templates.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;A selection of saved vLLM templates on RunPod, each using a different model from Hugging Face&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The container image for all of these templates is &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;, which bundles &lt;em&gt;vLLM&lt;/em&gt; with an OpenAI-compatible API server. The model itself is specified in the container start command, which means swapping models is a matter of editing a single line.&lt;/p&gt;
&lt;h2 id="creating-a-template"&gt;Creating a Template&lt;/h2&gt;
&lt;p&gt;When creating or editing a template, the key fields are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Type:&lt;/strong&gt; Pod&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute type:&lt;/strong&gt; Nvidia GPU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container image:&lt;/strong&gt; &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container start command:&lt;/strong&gt; the vLLM arguments, including the model reference&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/vllm_start_cmd.png"data-src="/images/posts/post_24/vllm_start_cmd.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Template configuration for the vllm_gemma-3-12b template, showing the container image and start command&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Throughout the following steps, any value written in &lt;code&gt;&amp;lt;angle brackets&amp;gt;&lt;/code&gt; is a placeholder and must be replaced with your actual value before running the command.&lt;/p&gt;
&lt;p&gt;A start command for deploying Red Hat&amp;rsquo;s validated &lt;code&gt;RedHatAI/Qwen3-8B-FP8-dynamic&lt;/code&gt; model looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;--host 0.0.0.0 --port &lt;span class="m"&gt;8000&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --model RedHatAI/Qwen3-8B-FP8-dynamic &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --dtype bfloat16 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --enforce-eager &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --gpu-memory-utilization 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --api-key &amp;lt;api_key&amp;gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --max-model-len &lt;span class="m"&gt;8128&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The parameters worth noting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--model&lt;/code&gt;&lt;/strong&gt;: any model available on Hugging Face can be referenced here by its repository path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--dtype bfloat16&lt;/code&gt;&lt;/strong&gt;: sets the compute dtype; &lt;code&gt;bfloat16&lt;/code&gt; is a good default for inference on NVIDIA hardware&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--enforce-eager&lt;/code&gt;&lt;/strong&gt;: disables CUDA graph capture, which reduces memory overhead at the cost of some throughput; useful when fitting larger models on a single GPU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--gpu-memory-utilization 0.95&lt;/code&gt;&lt;/strong&gt;: allows vLLM to use up to 95% of available GPU memory for the KV cache&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--api-key&lt;/code&gt;&lt;/strong&gt;: sets a bearer token for the OpenAI-compatible endpoint; always set this when deploying a public endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--max-model-len&lt;/code&gt;&lt;/strong&gt;: caps the maximum sequence length; reducing this frees memory and allows larger models to fit on smaller GPUs&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="selecting-a-gpu-and-deploying"&gt;Selecting a GPU and Deploying&lt;/h2&gt;
&lt;p&gt;Once the template is configured, deploying it requires selecting a GPU and clicking deploy. RunPod shows available hardware with current pricing.&lt;/p&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/gpu_selection.png"data-src="/images/posts/post_24/gpu_selection.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;GPU selection on RunPod, ranging from RTX 3090 class cards to H200 and B200 datacenter accelerators&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;For most inference workloads with 8 to 12 billion parameter models, an RTX 4090 or L4 is a practical and cost-effective choice. Larger models with higher memory requirements will need 48 GB or 80 GB class cards. The per-hour pricing shown in the interface makes it easy to estimate cost for a session before committing.&lt;/p&gt;
&lt;p&gt;After deployment, RunPod assigns a public HTTPS endpoint to the pod. The vLLM server is reachable at that endpoint on port 8000, with the path structure matching the OpenAI API.&lt;/p&gt;
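&lt;p&gt;To confirm the server is reachable before wiring it into anything else, a quick request against the OpenAI-compatible API is enough. The snippet below is a minimal check, assuming &lt;code&gt;&amp;lt;runpod_endpoint&amp;gt;&lt;/code&gt; is the hostname RunPod assigned to the pod and &lt;code&gt;&amp;lt;api_key&amp;gt;&lt;/code&gt; is the key set in the start command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# List the models served by vLLM; the response should contain the model from the template
curl -s https://&amp;lt;runpod_endpoint&amp;gt;/v1/models \
  -H "Authorization: Bearer &amp;lt;api_key&amp;gt;"

# Send a short chat completion request to the same endpoint
curl -s https://&amp;lt;runpod_endpoint&amp;gt;/v1/chat/completions \
  -H "Authorization: Bearer &amp;lt;api_key&amp;gt;" \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/Qwen3-8B-FP8-dynamic", "messages": [{"role": "user", "content": "Hello"}]}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;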
&lt;h2 id="connecting-the-endpoint-to-open-webui"&gt;Connecting the Endpoint to Open WebUI&lt;/h2&gt;
&lt;p&gt;With the pod running and the model loaded, the endpoint can be added to Open WebUI as an external connection. In Open WebUI, navigate to &lt;strong&gt;Admin Panel&lt;/strong&gt; then &lt;strong&gt;Settings&lt;/strong&gt; and add a new connection with the following values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connection type:&lt;/strong&gt; External&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;https://&amp;lt;runpod_endpoint&amp;gt;/v1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auth:&lt;/strong&gt; API key set in the vLLM start command&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provider type:&lt;/strong&gt; OpenAI&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API type:&lt;/strong&gt; Chat Completions&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/open_webui_configuration.png"data-src="/images/posts/post_24/open_webui_configuration.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Adding the RunPod vLLM endpoint as an external OpenAI-compatible connection in Open WebUI&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Once saved, the model served by vLLM on RunPod appears in the model selector alongside any other configured backends. From a user perspective, it behaves like any other configured model, whether local or a commercial API.&lt;/p&gt;
&lt;p&gt;Alternatively, the endpoint can be added to LiteLLM as a named model alias. This is the better option if you want centralized credential management or want to expose the RunPod model alongside other backends under a consistent naming scheme across the stack.&lt;/p&gt;
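&lt;p&gt;As a sketch of what that looks like, the entry below maps a logical model name to the RunPod endpoint in LiteLLM&amp;rsquo;s &lt;code&gt;config.yaml&lt;/code&gt;. The alias &lt;code&gt;runpod-qwen3-8b&lt;/code&gt; and the placeholder values are illustrative, not fixed names:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;model_list:
  - model_name: runpod-qwen3-8b                    # logical name exposed to clients
    litellm_params:
      model: openai/RedHatAI/Qwen3-8B-FP8-dynamic  # served model, via the OpenAI-compatible provider
      api_base: https://&amp;lt;runpod_endpoint&amp;gt;/v1
      api_key: &amp;lt;api_key&amp;gt;                     # the key set in the vLLM start command
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;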
&lt;h2 id="why-this-setup-works-well"&gt;Why This Setup Works Well&lt;/h2&gt;
&lt;p&gt;The combination of a self-hosted orchestration stack and on-demand GPU inference fits well with a homelab where tooling and workflows are in place but on-premises compute is intentionally kept lean.&lt;/p&gt;
&lt;p&gt;A few things make this pattern practical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low cost for experimentation.&lt;/strong&gt; Models run only when needed. A session of an hour or two to test a new model costs a few dollars at most.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access to current models.&lt;/strong&gt; Many recently published models on Hugging Face can be loaded into vLLM, which makes it straightforward to test new releases without waiting for them to appear in a managed API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No changes to the existing stack.&lt;/strong&gt; Open WebUI, LiteLLM, SearXNG, and Docling continue to work exactly as before. The RunPod endpoint is just another backend.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatable.&lt;/strong&gt; RunPod exposes an API for managing pods, so deployments can be triggered programmatically. Combined with LiteLLM&amp;rsquo;s routing, it becomes possible to bring a model endpoint up on demand and tear it down again when it is no longer needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Adding RunPod as an on-demand GPU backend closes the main gap in a CPU-only homelab AI stack. The setup requires no changes to the existing infrastructure and takes only a few minutes from template to running endpoint. The result is the ability to experiment with current, capable models at low cost, using the same interface and workflows already in place.&lt;/p&gt;
&lt;p&gt;For on-demand model access that does not warrant the cost of persistent GPU hardware, this pattern is worth considering.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;My Homelab: A Traefik-centered Self-hosting Setup - &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling - &lt;a href="/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RunPod - project site - &lt;a href="https://www.runpod.io/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RunPod - documentation - &lt;a href="https://docs.runpod.io/overview"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;vLLM - project site - &lt;a href="https://docs.vllm.ai/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hugging Face - model hub - &lt;a href="https://huggingface.co/models"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RedHatAI models on Hugging Face - &lt;a href="https://huggingface.co/RedHatAI"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling</title><link>/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/</link><pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate><guid>/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/</guid><description>&lt;figure&gt;&lt;img src="/images/posts/post_19/overview.png"data-src="/images/posts/post_19/overview.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Overview of the modular self-hosted AI stack - AI generated&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In my previous post about my &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;homelab&lt;/a&gt;, I described the foundation I use for self-hosted services: a small set of low-power machines, Docker Compose for deployment, Traefik as the reverse proxy, and internal DNS to expose services with clean HTTPS hostnames. I have been running this setup for several years with very little maintenance overhead. That setup turned out to be a good base not only for classic self-hosting, but also for local AI workloads. Over the past two years or so, I started extending it with tools to use and experiment with AI services.&lt;/p&gt;
&lt;p&gt;Over time, I wanted more than a single chat UI connected to a single model provider. I wanted a setup that would let me experiment with different models, keep sensitive data inside my own network, enrich prompts with live web results, and work with local documents in a structured way. I also wanted to reuse the same operational patterns I already trusted in the rest of the homelab.&lt;/p&gt;
&lt;p&gt;The result is a local AI stack built from four components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open WebUI as the browser-based user interface&lt;/li&gt;
&lt;li&gt;LiteLLM as the OpenAI-compatible model gateway&lt;/li&gt;
&lt;li&gt;SearXNG as the privacy-friendly web search backend&lt;/li&gt;
&lt;li&gt;Docling as the document parsing layer for file-based workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Individually, each of these tools is useful. Combined, they form a practical self-hosted AI environment that fits neatly into the same Traefik-centered architecture as the rest of my homelab.&lt;/p&gt;
&lt;h2 id="base-platform-and-prerequisites"&gt;Base platform and prerequisites&lt;/h2&gt;
&lt;p&gt;The AI stack runs on the same infrastructure described in the &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;previous post&lt;/a&gt;: refurbished thin clients running CentOS Stream 9, Docker and Docker Compose, Traefik as the reverse proxy, and internal DNS for clean HTTPS hostnames. The key design principle carries over as well: every externally reachable service joins the &lt;code&gt;external&lt;/code&gt; Docker network and is exposed through Traefik using labels, giving a consistent way to publish services under HTTPS without managing ports or certificates per application.&lt;/p&gt;
&lt;p&gt;My current setup is CPU-only. That matters. It is perfectly usable for orchestration, document processing, and web-augmented prompting, but it is not the right environment for large, latency-sensitive inference workloads. In practice, that constraint pushed me toward an architecture where the user interface, routing, tools, and document workflows run locally, while the model backend remains flexible enough to use either local or remote providers.&lt;/p&gt;
&lt;h2 id="architecture-overview"&gt;Architecture overview&lt;/h2&gt;
&lt;p&gt;At a high level, the request flow looks like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A user opens Open WebUI in the browser.&lt;/li&gt;
&lt;li&gt;Open WebUI sends model requests to LiteLLM through its OpenAI-compatible API.&lt;/li&gt;
&lt;li&gt;LiteLLM routes the request to the selected backend model.&lt;/li&gt;
&lt;li&gt;If a prompt requires live information, Open WebUI can use SearXNG as a search tool.&lt;/li&gt;
&lt;li&gt;If a prompt requires document context, uploaded files are parsed with Docling and converted into Markdown.&lt;/li&gt;
&lt;li&gt;The model response is returned to Open WebUI and displayed to the user.&lt;/li&gt;
&lt;/ol&gt;
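&lt;p&gt;Steps 2 and 3 are plain OpenAI-style API calls. The example below is roughly the request Open WebUI issues against LiteLLM, using the hostname from the service definitions later in this post and a placeholder logical model name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# A standard OpenAI-style chat completion request against the LiteLLM gateway
curl -s https://litellm.home.example.com/v1/chat/completions \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "&amp;lt;logical_model_name&amp;gt;", "messages": [{"role": "user", "content": "Hello"}]}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;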
&lt;p&gt;This separation of concerns is what makes the stack useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open WebUI handles the human interaction layer&lt;/li&gt;
&lt;li&gt;LiteLLM abstracts model backends and credentials&lt;/li&gt;
&lt;li&gt;SearXNG provides fresh web context&lt;/li&gt;
&lt;li&gt;Docling turns messy source documents into structured text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Traefik remains the single public entry point. From an operations perspective, that is valuable because the AI stack behaves like any other part of the homelab.&lt;/p&gt;
&lt;h2 id="open-webui-as-the-central-interface"&gt;Open WebUI as the central interface&lt;/h2&gt;
&lt;p&gt;Open WebUI is the part of the stack I interact with every day. It provides the browser-based interface for conversations, model selection, file uploads, and tool-assisted prompting. The important point is that Open WebUI does not need to know anything about individual model providers. It only needs a single OpenAI-compatible endpoint, which in this setup is LiteLLM.&lt;/p&gt;
&lt;p&gt;That keeps the client configuration simple. If I want to add a new provider, swap one model for another, or change credentials, I do it behind the scenes in LiteLLM without having to reconfigure the user interface. Open WebUI also supports user and group management, making it straightforward to grant access to specific models or restrict certain users to a defined set of backends. A particularly useful feature is the ability to send a single prompt to multiple AI services simultaneously, which makes side-by-side model comparison a natural part of the workflow.&lt;/p&gt;
&lt;p&gt;A simplified Docker Compose service definition for Open WebUI in this setup looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;open-webui&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ghcr.io/open-webui/open-webui:main&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;OPENAI_API_BASE_URL=http://litellm:4000/v1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;OPENAI_API_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;./data/open-webui:/app/backend/data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;internal&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.openwebui.rule=Host(`ai.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.openwebui.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.openwebui.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.openwebui.loadbalancer.server.port=8080&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The exact image tag and environment variables may differ depending on the release and your setup, but the pattern stays the same: persistent storage for state, Traefik labels for routing, and a backend API endpoint that points to LiteLLM.&lt;/p&gt;
&lt;h2 id="litellm-as-the-model-gateway"&gt;LiteLLM as the model gateway&lt;/h2&gt;
&lt;p&gt;LiteLLM is the glue that makes the rest of the system flexible. It exposes a single OpenAI-style API while allowing multiple backends underneath. That means I can define logical model names and map them to either local inference backends or remote providers.&lt;/p&gt;
&lt;p&gt;This is useful for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open WebUI only has to speak to a single API endpoint&lt;/li&gt;
&lt;li&gt;I can standardize naming across models&lt;/li&gt;
&lt;li&gt;Provider credentials stay centralized&lt;/li&gt;
&lt;li&gt;Swapping backends becomes operationally cheap&lt;/li&gt;
&lt;li&gt;Logging and usage controls are easier to centralize&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Compose service definition for LiteLLM follows the same pattern:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;litellm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ghcr.io/berriai/litellm:main-latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;litellm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;--config&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/app/config.yaml&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;--port&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;4000&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;./litellm/config.yaml:/app/config.yaml:ro&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;internal&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.litellm.rule=Host(`litellm.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.litellm.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.litellm.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.litellm.loadbalancer.server.port=4000&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;style type="text/css"&gt;.notice{--root-color:#444;--root-background:#eff;--title-color:#fff;--title-background:#7bd;--warning-title:#c33;--warning-content:#fee;--info-title:#fb7;--info-content:#fec;--note-title:#6be;--note-content:#e7f2fa;--tip-title:#5a5;--tip-content:#efe}@media (prefers-color-scheme:dark){.notice{--root-color:#ddd;--root-background:#eff;--title-color:#fff;--title-background:#7bd;--warning-title:#800;--warning-content:#400;--info-title:#a50;--info-content:#420;--note-title:#069;--note-content:#023;--tip-title:#363;--tip-content:#121}}body.dark .notice{--root-color:#ddd;--root-background:#eff;--title-color:#fff;--title-background:#7bd;--warning-title:#800;--warning-content:#400;--info-title:#a50;--info-content:#420;--note-title:#069;--note-content:#023;--tip-title:#363;--tip-content:#121}.notice{line-height:24px;margin-bottom:24px;border-radius:4px;color:var(--root-color);background:var(--root-background)}.notice p:last-child{margin-bottom:0; padding: .5rem 1.2rem 1rem;}.notice-title{margin:-18px -18px 12px;padding:4px 18px;border-radius:4px 4px 0 0;font-weight:700;color:var(--title-color);background:var(--title-background)}.notice.warning .notice-title{background:var(--warning-title)}.notice.warning{background:var(--warning-content)}.notice.info .notice-title{background:var(--info-title)}.notice.info{background:var(--info-content)}.notice.note .notice-title{background:var(--note-title)}.notice.note{background:var(--note-content)}.notice.tip .notice-title{background:var(--tip-title)}.notice.tip{background:var(--tip-content)}.icon-notice{display:inline-flex;align-self:center;margin-right:8px}.icon-notice img,.icon-notice svg{height:1em;width:1em;fill:currentColor}.icon-notice img,.icon-notice.baseline svg{top:.125em;position:relative}&lt;/style&gt;
&lt;div&gt;&lt;svg width="0" height="0" display="none" xmlns="http://www.w3.org/2000/svg"&gt;&lt;symbol id="tip-notice" viewBox="0 0 512 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M504 256c0 136.967-111.033 248-248 248S8 392.967 8 256 119.033 8 256 8s248 111.033 248 248zM227.314 387.314l184-184c6.248-6.248 6.248-16.379 0-22.627l-22.627-22.627c-6.248-6.249-16.379-6.249-22.628 0L216 308.118l-70.059-70.059c-6.248-6.248-16.379-6.248-22.628 0l-22.627 22.627c-6.248 6.248-6.248 16.379 0 22.627l104 104c6.249 6.249 16.379 6.249 22.628.001z"/&gt;&lt;/symbol&gt;&lt;symbol id="note-notice" viewBox="0 0 512 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zm-248 50c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/&gt;&lt;/symbol&gt;&lt;symbol id="warning-notice" viewBox="0 0 576 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/&gt;&lt;/symbol&gt;&lt;symbol id="info-notice" viewBox="0 0 512 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M256 8C119.043 8 8 119.083 8 256c0 136.997 111.043 248 248 248s248-111.003 248-248C504 119.083 392.957 8 256 8zm0 110c23.196 0 42 18.804 42 42s-18.804 42-42 42-42-18.804-42-42 18.804-42 42-42zm56 254c0 6.627-5.373 12-12 12h-88c-6.627 0-12-5.373-12-12v-24c0-6.627 5.373-12 12-12h12v-64h-12c-6.627 0-12-5.373-12-12v-24c0-6.627 5.373-12 12-12h64c6.627 0 12 5.373 12 12v100h12c6.627 0 12 5.373 12 12v24z"/&gt;&lt;/symbol&gt;&lt;/svg&gt;&lt;/div&gt;&lt;div class="notice warning" &gt;
&lt;p class="first notice-title"&gt;&lt;span class="icon-notice baseline"&gt;&lt;svg&gt;&lt;use href="#warning-notice"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/span&gt;Warning&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Security note:&lt;/strong&gt;&lt;br&gt;
In March 2026, LiteLLM was subject to a suspected supply chain attack in which versions v1.82.7 and v1.82.8 on PyPI contained a malicious payload designed to harvest credentials and exfiltrate them to an external domain. Users running the official LiteLLM Docker image were not affected, as that deployment path pins dependencies and does not rely on the compromised PyPI packages. If you installed LiteLLM via &lt;code&gt;pip&lt;/code&gt; during the affected window, treat any secrets on that system as compromised and rotate them immediately. See the official incident report for full details and verified safe versions.&lt;/p&gt;&lt;/div&gt;
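&lt;p&gt;The &lt;code&gt;config.yaml&lt;/code&gt; mounted into the container is where the logical model names are defined. A minimal sketch, with illustrative model names and an assumed OpenAI-compatible inference host as the second backend:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;model_list:
  # Logical name exposed to Open WebUI, backed by a remote provider
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  # A second logical name pointing at any OpenAI-compatible endpoint
  - model_name: local-inference
    litellm_params:
      model: openai/&amp;lt;served_model_name&amp;gt;
      api_base: https://&amp;lt;inference_host&amp;gt;/v1
      api_key: &amp;lt;api_key&amp;gt;

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;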
&lt;h2 id="searxng-for-live-privacy-friendly-search"&gt;SearXNG for live, privacy-friendly search&lt;/h2&gt;
&lt;p&gt;One of the biggest limitations of a plain chat interface is the lack of current information. SearXNG solves that problem cleanly. It is a self-hosted metasearch engine that aggregates results from multiple sources and gives me a search API under my own control.&lt;/p&gt;
&lt;p&gt;Even outside the AI stack, SearXNG is useful as a search engine. Inside the stack, it becomes more interesting because it can be exposed as a tool for prompts that need fresh information.&lt;/p&gt;
&lt;p&gt;A minimal Compose service might look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;searxng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;docker.io/searxng/searxng:latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;searxng&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;./searxng:/etc/searxng&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.searxng.rule=Host(`search.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.searxng.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.searxng.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.searxng.loadbalancer.server.port=8080&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Once connected to Open WebUI as a tool, the flow is straightforward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The user asks a question that requires current information.&lt;/li&gt;
&lt;li&gt;The model decides to call the search tool.&lt;/li&gt;
&lt;li&gt;SearXNG performs the search.&lt;/li&gt;
&lt;li&gt;Titles, snippets, and URLs are returned as context.&lt;/li&gt;
&lt;li&gt;The model synthesizes an answer grounded in current results.&lt;/li&gt;
&lt;/ol&gt;
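&lt;p&gt;Outside of the tool integration, the same instance can be queried directly. A quick check against the JSON API, assuming the &lt;code&gt;json&lt;/code&gt; output format is enabled in the SearXNG settings:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Query the self-hosted SearXNG instance and get machine-readable results
curl -sG "https://search.home.example.com/search" \
  --data-urlencode "q=vllm inference server" \
  --data-urlencode "format=json"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;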
&lt;h2 id="docling-for-document-parsing"&gt;Docling for document parsing&lt;/h2&gt;
&lt;p&gt;The fourth component, Docling, addresses a different problem. Large language models work best with clean text, but many real documents are messy. PDFs, slide decks, and office files often contain broken text flows, layout artifacts, or table structures that are not useful when passed to a model as-is.&lt;/p&gt;
&lt;p&gt;Docling converts these documents into a Markdown representation that is much easier to use as model context. That sounds small, but it is a major quality improvement for local document workflows.&lt;/p&gt;
&lt;p&gt;The Docling service definition is straightforward:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;docling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;quay.io/docling-project/docling-serve:latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;docling&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;internal&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.docling.rule=Host(`docling.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.docling.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.docling.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.docling.loadbalancer.server.port=5001&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The typical usage pattern is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload a document in Open WebUI.&lt;/li&gt;
&lt;li&gt;Docling parses the file and converts it to Markdown.&lt;/li&gt;
&lt;li&gt;Feed that Markdown into the model as structured prompt context.&lt;/li&gt;
&lt;li&gt;Ask targeted questions against the extracted content.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is especially useful for technical notes, whitepapers, internal PDFs, or vendor documentation where the raw file format is not suitable for direct prompting.&lt;/p&gt;
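&lt;p&gt;Docling can also be called directly, independent of Open WebUI. The request below is a sketch assuming the &lt;code&gt;/v1alpha/convert/source&lt;/code&gt; route exposed by current docling-serve releases; the exact path may change between versions, so check the API documentation the service itself exposes if it does not match:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Ask docling-serve to fetch a PDF and convert it; the response contains the Markdown representation
curl -s https://docling.home.example.com/v1alpha/convert/source \
  -H "Content-Type: application/json" \
  -d '{"http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}]}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;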
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This stack did not start as an attempt to build a local alternative to a commercial AI product. It emerged naturally from an existing homelab that already had strong building blocks: containerized services, Traefik, DNS-based routing, and a bias toward self-hosting.&lt;/p&gt;
&lt;p&gt;Adding Open WebUI, LiteLLM, SearXNG, and Docling turned that base into a practical local AI environment. It gives me a single interface for model interaction, the ability to swap backends without changing clients, a way to enrich prompts with live web data, and a better workflow for document-driven tasks.&lt;/p&gt;
&lt;p&gt;Just as important, it stays operationally consistent with the rest of the homelab. That keeps the setup understandable, maintainable, and worth using day to day.&lt;/p&gt;
&lt;p&gt;Future extensions are obvious: adding a vector database, introducing GPU-backed local inference, routing requests to model endpoints running on specialized inference platforms, or using Open WebUI as a gateway to interact with AI agents. But even without those additions, this combination already covers a large share of the AI workflows I actually care about.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;My Homelab: A Traefik-centered Self-hosting Setup - &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open WebUI - project site - &lt;a href="https://openwebui.com/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open WebUI - GitHub - &lt;a href="https://github.com/open-webui/open-webui"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM - project site - &lt;a href="https://www.litellm.ai/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM - GitHub - &lt;a href="https://github.com/BerriAI/litellm"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM - Security incident report, March 2026 - &lt;a href="https://docs.litellm.ai/blog/security-update-march-2026"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SearXNG - documentation - &lt;a href="https://docs.searxng.org/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SearXNG - GitHub - &lt;a href="https://github.com/searxng/searxng"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docling - documentation - &lt;a href="https://docling-project.github.io/docling/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docling - GitHub - &lt;a href="https://github.com/docling-project/docling"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>