<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ai on Home</title><link>/tags/ai/</link><description>Recent content in Ai on Home</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 02 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="/tags/ai/" rel="self" type="application/rss+xml"/><item><title>Hermes Agent: A Personal AI That Gets More Useful Over Time</title><link>/2026/hermes-agent-a-personal-ai-that-gets-more-useful-over-time/</link><pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate><guid>/2026/hermes-agent-a-personal-ai-that-gets-more-useful-over-time/</guid><description>&lt;figure&gt;&lt;img src="/images/posts/post_28/overview.png"data-src="/images/posts/post_28/overview.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;How Hermes Agent Works: From Closed-Loop Learning to Multi-Platform Deployment - AI generated&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I came across the &lt;a href="https://github.com/nousresearch/hermes-agent"&gt;&lt;em&gt;Hermes Agent project&lt;/em&gt;&lt;/a&gt; in early March 2026 and deployed it a couple of days later. A couple of weeks in, I am still using it daily, and the use cases keep expanding rather than converging. Most tools settle into a narrow routine or fall out of use altogether. What keeps this one going is that the agent gets more useful the longer you run it. The project is young and moving fast, with new releases every few days. The initial setup requires patience: getting the configuration to a point where it actually saves time takes effort, and the frequent updates occasionally introduce breaking changes. That said, it is genuinely fun to use, and you learn a fair amount along the way.&lt;/p&gt;
&lt;p&gt;Hermes Agent is an open-source, self-hosted AI agent framework built by &lt;a href="https://nousresearch.com/"&gt;&lt;em&gt;Nous Research&lt;/em&gt;&lt;/a&gt;, an independent AI research lab based in Austin, Texas. Nous Research is best known for the Hermes model family, a series of open-weight models fine-tuned on Llama that are used widely in the open-source AI community. The agent framework shares the name but is a separate project. It is MIT-licensed, model-agnostic, and runs on your own infrastructure, either as a self-hosted Python service or as a containerized deployment.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How It Works&lt;/h2&gt;
&lt;p&gt;The part that makes Hermes Agent different from most agent frameworks is the skill system. The agent ships with a set of preconfigured skills covering common tasks. Beyond that, you can ask it to create a skill from something it just did: it writes a structured Markdown document capturing the approach, what worked, and possible edge cases. The next time a similar task appears, the agent loads the relevant skill rather than starting from scratch. Skills can be triggered directly by asking Hermes to run one, or set on a schedule and executed automatically at defined intervals. Over time this turns completed work into a growing library of reusable operating knowledge. Version 0.12.0 added an Autonomous Curator to keep that library from growing unwieldy. It runs on a seven-day cycle by default, grades skills by usage, consolidates overlapping ones, and removes those that have stopped being useful. A short report is written after each run, so you can see what changed and why.&lt;/p&gt;
&lt;p&gt;Alongside the skill system, the agent maintains three layers of memory: a persistent store for completed tasks and notes, a full-text search index across prior sessions, and a user model that accumulates preferences (coding style, communication tone, timezone, tools) over time. The idea is that the agent gets more useful the longer you run it, not just better at individual tasks in isolation.&lt;/p&gt;
&lt;h2 id="my-setup"&gt;My Setup&lt;/h2&gt;
&lt;p&gt;Hermes Agent runs in my &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;homelab&lt;/a&gt; as a service on a dedicated Linux host. Keeping it on a separate machine gives me direct control over what the agent has access to. Incoming traffic is routed through Traefik. I access it through three entry points depending on where I am and what I am doing. The primary interface is the &lt;a href="https://matrix.org/"&gt;&lt;em&gt;Matrix&lt;/em&gt;&lt;/a&gt; chat protocol, which means I can reach the agent from any Matrix client on any device. I also connected it to a dedicated email inbox, so it can handle certain tasks asynchronously. For longer sessions at my desk I use &lt;em&gt;Open WebUI&lt;/em&gt;, which gives a more comfortable interface for extended conversations.&lt;/p&gt;
&lt;p&gt;The model configuration is versatile: the agent supports various AI services and model providers.&lt;/p&gt;
&lt;h2 id="what-i-gave-it-access-to"&gt;What I Gave It Access To&lt;/h2&gt;
&lt;p&gt;I gave the agent access to three local knowledge sources: my bookmarks, a structured knowledge base, and a local mirror of Red Hat&amp;rsquo;s product documentation.&lt;/p&gt;
&lt;p&gt;The first is my bookmarks folder. I have been saving links as Markdown files in Obsidian for several years. The agent can search and cross-reference that collection when doing research, which means it draws on context I actually care about rather than training data alone.&lt;/p&gt;
&lt;p&gt;The second is a knowledge base built on the &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f"&gt;LLM Wiki&lt;/a&gt; principle described by Andrej Karpathy. The idea is to maintain a curated set of structured Markdown files that an AI agent helps write and update over time. Topics, entities, comparisons, each in its own file. The agent both contributes to this knowledge base and draws from it when working on research tasks.&lt;/p&gt;
&lt;p&gt;The third is a local mirror of Red Hat&amp;rsquo;s product documentation. A teammate built a tool called &lt;em&gt;rh-mastery&lt;/em&gt; that pulls documentation from &lt;em&gt;docs.redhat.com&lt;/em&gt;, converts it to Markdown, and stores it in a structured local directory. Pointed at that directory, Hermes can query accurate, version-tracked product documentation without touching the internet. For someone who spends a lot of time with Red Hat products, this closes a gap that is easy to overlook until you actually need it. More on rh-mastery in an upcoming post.&lt;/p&gt;
&lt;h2 id="practical-uses"&gt;Practical Uses&lt;/h2&gt;
&lt;p&gt;The combination of bookmarks, structured knowledge, Red Hat&amp;rsquo;s product documentation, and the skill system makes the agent genuinely useful for research. When I ask it to investigate a topic, it starts with what I have already collected: prior notes, bookmarks, and documentation. If that is not enough, and when asked, it reaches out to the web to fill the gaps. The result is something grounded in material I collected and curated myself, which makes the output in most cases very useful.&lt;/p&gt;
&lt;p&gt;One use I did not expect to find so useful: slide generation. I integrated &lt;em&gt;Marp&lt;/em&gt;, a Markdown-based presentation framework, into the workflow. When I need to put together a presentation and am staring at a blank file, I can ask the agent to draft an initial structure. Getting past that first empty screen is often the hardest part. Whether I keep most of what it produces is a different question, but having something to react to beats starting from nothing.&lt;/p&gt;
&lt;h2 id="skills-and-subagents"&gt;Skills and Subagents&lt;/h2&gt;
&lt;p&gt;The agent can develop and add skills on its own as it works, but skills can also be added manually or loaded from the community hub at &lt;a href="https://agentskills.io"&gt;agentskills.io&lt;/a&gt;. More interesting to me is the subagent capability: the agent can delegate tasks to specialized subagents, each backed by a specific AI service or holding a particular context. This makes it possible to compose workflows where different parts of a task go to the most appropriate model.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Several weeks in is not a long track record, and the project is still moving fast enough that some things will break between releases. That said, the architecture is sound and the development pace is truly impressive. Whether I will keep running it long-term, I genuinely do not know. For now, it is pulling its weight. For anyone already running a homelab and looking for a self-hosted agent that gets more useful over time rather than staying flat, Hermes Agent is worth the setup time.&lt;/p&gt;
&lt;p&gt;Peter Steinberger, the creator of &lt;em&gt;OpenClaw&lt;/em&gt;, another widely-used AI agent framework, put it well in a recent &lt;a href="https://www.youtube.com/watch?v=7rzYDM6vMtI"&gt;TED talk&lt;/a&gt;: &amp;ldquo;The bottleneck is no longer typing. It&amp;rsquo;s thinking.&amp;rdquo; That observation fits. The agent handles the mechanical parts of research and structuring. The judgment about what matters and what to do with it still has to come from someone. For now, a human in the loop is still necessary.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Hermes Agent on GitHub - &lt;a href="https://github.com/nousresearch/hermes-agent"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hermes Agent Documentation - &lt;a href="https://hermes-agent.nousresearch.com/docs/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Nous Research - &lt;a href="https://nousresearch.com/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Matrix - &lt;a href="https://matrix.org/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;OpenRouter - &lt;a href="https://openrouter.ai/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Andrej Karpathy LLM Wiki concept - &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Marp - &lt;a href="https://marp.app/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;agentskills.io - &lt;a href="https://agentskills.io/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Peter Steinberger TED talk - &lt;a href="https://www.youtube.com/watch?v=7rzYDM6vMtI"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Extending the Local AI Stack with On-Demand GPU Inference on RunPod</title><link>/2026/extending-the-local-ai-stack-with-on-demand-gpu-inference-on-runpod/</link><pubDate>Sat, 07 Mar 2026 00:00:00 +0000</pubDate><guid>/2026/extending-the-local-ai-stack-with-on-demand-gpu-inference-on-runpod/</guid><description>&lt;figure&gt;&lt;img src="/images/posts/post_24/overview.png"data-src="/images/posts/post_24/overview.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Conceptual illustration of the extended AI stack with elastic cloud GPU resources for running large language models on demand - AI generated&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In this post, I want to describe how I extended the local AI stack I built in my homelab with on-demand GPU-backed model inference, without adding any GPU hardware to the lab itself.&lt;/p&gt;
&lt;p&gt;The two previous posts in this series provide the context for what follows. The &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;homelab post&lt;/a&gt; covers the base infrastructure: thin clients, Docker Compose, Traefik, and internal DNS. The &lt;a href="/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/"&gt;local AI stack post&lt;/a&gt; describes how &lt;em&gt;Open WebUI&lt;/em&gt;, &lt;em&gt;LiteLLM&lt;/em&gt;, &lt;em&gt;SearXNG&lt;/em&gt;, and &lt;em&gt;Docling&lt;/em&gt; sit on top of that infrastructure to form a self-hosted AI environment. That stack works well, and I have been using it for a while. Keeping the lab CPU-only is a deliberate choice. For orchestration, document workflows, and routing requests to publicly available AI services, dedicated GPU hardware at home is simply not necessary. When I want to try a particular model that is not available through a managed API, or experiment with something freshly released on Hugging Face, I rent the compute on demand rather than maintain it permanently.&lt;/p&gt;
&lt;p&gt;The solution is straightforward: rent GPU capacity on demand from a specialized cloud provider, expose it as an OpenAI-compatible endpoint, and wire it into the existing stack. No new hardware, no permanent cost, no changes to the tools I already use.&lt;/p&gt;
&lt;h2 id="a-note-on-neo-clouds"&gt;A Note on Neo Clouds&lt;/h2&gt;
&lt;p&gt;The providers that specialize in this type of GPU-first infrastructure are sometimes called &lt;em&gt;Neo Clouds&lt;/em&gt;. The term emerged around 2024 to distinguish GPU-specialist vendors such as RunPod, CoreWeave and others from traditional hyperscalers. In practice, I am not sure the new term adds much. For me these are specialized cloud providers focused on GPU compute and AI workloads. Useful services, somewhat unnecessary branding.&lt;/p&gt;
&lt;h2 id="why-runpod"&gt;Why RunPod&lt;/h2&gt;
&lt;p&gt;I use &lt;a href="https://www.runpod.io/"&gt;RunPod&lt;/a&gt; for this setup for a few practical reasons. The interface is intuitive, the deployment path from template to running pod is short, and the GPU catalog is broad enough to cover most use cases. Pricing is per second with no ingress or egress fees, which makes on-demand experimentation economical. RunPod also exposes an API for its core operations, so deployments can be automated rather than driven entirely through the UI.&lt;/p&gt;
&lt;p&gt;A detailed description of all RunPod services is out of scope for this post. The focus here is on one specific workflow: deploying a &lt;em&gt;vLLM&lt;/em&gt; inference server with a model loaded from &lt;em&gt;Hugging Face&lt;/em&gt;, and connecting the resulting endpoint to Open WebUI.&lt;/p&gt;
&lt;h2 id="deploying-a-vllm-inference-server-on-runpod"&gt;Deploying a vLLM Inference Server on RunPod&lt;/h2&gt;
&lt;p&gt;RunPod uses templates to save pod configurations for reuse. A template defines the container image, the start command, the storage allocation, and other runtime parameters. I maintain a small collection of private templates, each configured for a different model.&lt;/p&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/list_of_private_templates.png"data-src="/images/posts/post_24/list_of_private_templates.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;A selection of saved vLLM templates on RunPod, each using a different model from Hugging Face&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The container image for all of these templates is &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;, which bundles &lt;em&gt;vLLM&lt;/em&gt; with an OpenAI-compatible API server. The model itself is specified in the container start command, which means swapping models is a matter of editing a single line.&lt;/p&gt;
&lt;h2 id="creating-a-template"&gt;Creating a Template&lt;/h2&gt;
&lt;p&gt;When creating or editing a template, the key fields are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Type:&lt;/strong&gt; Pod&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute type:&lt;/strong&gt; Nvidia GPU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container image:&lt;/strong&gt; &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container start command:&lt;/strong&gt; the vLLM arguments, including the model reference&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/vllm_start_cmd.png"data-src="/images/posts/post_24/vllm_start_cmd.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Template configuration for the vllm_gemma-3-12b template, showing the container image and start command&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Throughout the following steps, any value written in &lt;code&gt;&amp;lt;angle brackets&amp;gt;&lt;/code&gt; is a placeholder and must be replaced with your actual value before running the command.&lt;/p&gt;
&lt;p&gt;A start command for deploying Red Hat&amp;rsquo;s validated &lt;code&gt;RedHatAI/Qwen3-8B-FP8-dynamic&lt;/code&gt; model looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;--host 0.0.0.0 --port &lt;span class="m"&gt;8000&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --model RedHatAI/Qwen3-8B-FP8-dynamic &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --dtype bfloat16 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --enforce-eager &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --gpu-memory-utilization 0.95 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --api-key &amp;lt;api_key&amp;gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --max-model-len &lt;span class="m"&gt;8128&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The parameters worth noting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--model&lt;/code&gt;&lt;/strong&gt;: any model available on Hugging Face can be referenced here by its repository path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--dtype bfloat16&lt;/code&gt;&lt;/strong&gt;: sets the compute dtype; &lt;code&gt;bfloat16&lt;/code&gt; is a good default for inference on NVIDIA hardware&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--enforce-eager&lt;/code&gt;&lt;/strong&gt;: disables CUDA graph capture, which reduces memory overhead at the cost of some throughput; useful when fitting larger models on a single GPU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--gpu-memory-utilization 0.95&lt;/code&gt;&lt;/strong&gt;: allows vLLM to use up to 95% of available GPU memory for the KV cache&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--api-key&lt;/code&gt;&lt;/strong&gt;: sets a bearer token for the OpenAI-compatible endpoint; always set this when deploying a public endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--max-model-len&lt;/code&gt;&lt;/strong&gt;: caps the maximum sequence length; reducing this frees memory and allows larger models to fit on smaller GPUs&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="selecting-a-gpu-and-deploying"&gt;Selecting a GPU and Deploying&lt;/h2&gt;
&lt;p&gt;Once the template is configured, deploying it requires selecting a GPU and clicking deploy. RunPod shows available hardware with current pricing.&lt;/p&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/gpu_selection.png"data-src="/images/posts/post_24/gpu_selection.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;GPU selection on RunPod, ranging from RTX 3090 class cards to H200 and B200 datacenter accelerators&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;For most inference workloads with 8 to 12 billion parameter models, an RTX 4090 or L4 is a practical and cost-effective choice. Larger models with higher memory requirements will need 48 GB or 80 GB class cards. The per-hour pricing shown in the interface makes it easy to estimate cost for a session before committing.&lt;/p&gt;
&lt;p&gt;After deployment, RunPod assigns a public HTTPS endpoint to the pod. The vLLM server is reachable at that endpoint on port 8000, with the path structure matching the OpenAI API.&lt;/p&gt;
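&lt;p&gt;To confirm the server is reachable before wiring it into anything else, a quick request against the OpenAI-compatible API is enough. The snippet below is a minimal check, assuming &lt;code&gt;&amp;lt;runpod_endpoint&amp;gt;&lt;/code&gt; is the hostname RunPod assigned to the pod and &lt;code&gt;&amp;lt;api_key&amp;gt;&lt;/code&gt; is the key set in the start command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# List the models served by vLLM; the response should contain the model from the template
curl -s https://&amp;lt;runpod_endpoint&amp;gt;/v1/models \
  -H "Authorization: Bearer &amp;lt;api_key&amp;gt;"

# Send a short chat completion request to the same endpoint
curl -s https://&amp;lt;runpod_endpoint&amp;gt;/v1/chat/completions \
  -H "Authorization: Bearer &amp;lt;api_key&amp;gt;" \
  -H "Content-Type: application/json" \
  -d '{"model": "RedHatAI/Qwen3-8B-FP8-dynamic", "messages": [{"role": "user", "content": "Hello"}]}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;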
&lt;h2 id="connecting-the-endpoint-to-open-webui"&gt;Connecting the Endpoint to Open WebUI&lt;/h2&gt;
&lt;p&gt;With the pod running and the model loaded, the endpoint can be added to Open WebUI as an external connection. In Open WebUI, navigate to &lt;strong&gt;Admin Panel&lt;/strong&gt; then &lt;strong&gt;Settings&lt;/strong&gt; and add a new connection with the following values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connection type:&lt;/strong&gt; External&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;https://&amp;lt;runpod_endpoint&amp;gt;/v1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auth:&lt;/strong&gt; API key set in the vLLM start command&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provider type:&lt;/strong&gt; OpenAI&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API type:&lt;/strong&gt; Chat Completions&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;&lt;img src="/images/posts/post_24/open_webui_configuration.png"data-src="/images/posts/post_24/open_webui_configuration.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Adding the RunPod vLLM endpoint as an external OpenAI-compatible connection in Open WebUI&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Once saved, the model served by vLLM on RunPod appears in the model selector alongside any other configured backends. From a user perspective, it behaves like any other configured model, whether local or a commercial API.&lt;/p&gt;
&lt;p&gt;Alternatively, the endpoint can be added to LiteLLM as a named model alias. This is the better option if you want centralized credential management or want to expose the RunPod model alongside other backends under a consistent naming scheme across the stack.&lt;/p&gt;
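&lt;p&gt;As a sketch of what that looks like, the entry below maps a logical model name to the RunPod endpoint in LiteLLM&amp;rsquo;s &lt;code&gt;config.yaml&lt;/code&gt;. The alias &lt;code&gt;runpod-qwen3-8b&lt;/code&gt; and the placeholder values are illustrative, not fixed names:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;model_list:
  - model_name: runpod-qwen3-8b                    # logical name exposed to clients
    litellm_params:
      model: openai/RedHatAI/Qwen3-8B-FP8-dynamic  # served model, via the OpenAI-compatible provider
      api_base: https://&amp;lt;runpod_endpoint&amp;gt;/v1
      api_key: &amp;lt;api_key&amp;gt;                     # the key set in the vLLM start command
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;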
&lt;h2 id="why-this-setup-works-well"&gt;Why This Setup Works Well&lt;/h2&gt;
&lt;p&gt;The combination of a self-hosted orchestration stack and on-demand GPU inference fits well with a homelab where tooling and workflows are in place but on-premises compute is intentionally kept lean.&lt;/p&gt;
&lt;p&gt;A few things make this pattern practical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low cost for experimentation.&lt;/strong&gt; Models run only when needed. A session of an hour or two to test a new model costs a few dollars at most.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access to current models.&lt;/strong&gt; Many recently published models on Hugging Face can be loaded into vLLM, which makes it straightforward to test new releases without waiting for them to appear in a managed API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No changes to the existing stack.&lt;/strong&gt; Open WebUI, LiteLLM, SearXNG, and Docling continue to work exactly as before. The RunPod endpoint is just another backend.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatable.&lt;/strong&gt; RunPod exposes an API for managing pods, so deployments can be triggered programmatically. Combined with LiteLLM&amp;rsquo;s routing, it becomes possible to bring a model endpoint up on demand and tear it down again when it is no longer needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Adding RunPod as an on-demand GPU backend closes the main gap in a CPU-only homelab AI stack. The setup requires no changes to the existing infrastructure and takes only a few minutes from template to running endpoint. The result is the ability to experiment with current, capable models at low cost, using the same interface and workflows already in place.&lt;/p&gt;
&lt;p&gt;For on-demand model access that does not warrant the cost of persistent GPU hardware, this pattern is worth considering.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;My Homelab: A Traefik-centered Self-hosting Setup - &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling - &lt;a href="/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RunPod - project site - &lt;a href="https://www.runpod.io/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RunPod - documentation - &lt;a href="https://docs.runpod.io/overview"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;vLLM - project site - &lt;a href="https://docs.vllm.ai/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hugging Face - model hub - &lt;a href="https://huggingface.co/models"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RedHatAI models on Hugging Face - &lt;a href="https://huggingface.co/RedHatAI"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling</title><link>/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/</link><pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate><guid>/2026/my-local-ai-stack-open-webui-litellm-searxng-and-docling/</guid><description>&lt;figure&gt;&lt;img src="/images/posts/post_19/overview.png"data-src="/images/posts/post_19/overview.png"
/&gt;&lt;figcaption&gt;
&lt;h4&gt;Overview of the modular self-hosted AI stack - AI generated&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In my previous post about my &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;homelab&lt;/a&gt;, I described the foundation I use for self-hosted services: a small set of low-power machines, Docker Compose for deployment, Traefik as the reverse proxy, and internal DNS to expose services with clean HTTPS hostnames. I have been running this setup for several years with very little maintenance overhead. That setup turned out to be a good base not only for classic self-hosting, but also for local AI workloads. Over the past two years or so, I started extending it with tools to use and experiment with AI services.&lt;/p&gt;
&lt;p&gt;Over time, I wanted more than a single chat UI connected to a single model provider. I wanted a setup that would let me experiment with different models, keep sensitive data inside my own network, enrich prompts with live web results, and work with local documents in a structured way. I also wanted to reuse the same operational patterns I already trusted in the rest of the homelab.&lt;/p&gt;
&lt;p&gt;The result is a local AI stack built from four components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open WebUI as the browser-based user interface&lt;/li&gt;
&lt;li&gt;LiteLLM as the OpenAI-compatible model gateway&lt;/li&gt;
&lt;li&gt;SearXNG as the privacy-friendly web search backend&lt;/li&gt;
&lt;li&gt;Docling as the document parsing layer for file-based workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Individually, each of these tools is useful. Combined, they form a practical self-hosted AI environment that fits neatly into the same Traefik-centered architecture as the rest of my homelab.&lt;/p&gt;
&lt;h2 id="base-platform-and-prerequisites"&gt;Base platform and prerequisites&lt;/h2&gt;
&lt;p&gt;The AI stack runs on the same infrastructure described in the &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;previous post&lt;/a&gt;: refurbished thin clients running CentOS Stream 9, Docker and Docker Compose, Traefik as the reverse proxy, and internal DNS for clean HTTPS hostnames. The key design principle carries over as well: every externally reachable service joins the &lt;code&gt;external&lt;/code&gt; Docker network and is exposed through Traefik using labels, giving a consistent way to publish services under HTTPS without managing ports or certificates per application.&lt;/p&gt;
&lt;p&gt;My current setup is CPU-only. That matters. It is perfectly usable for orchestration, document processing, and web-augmented prompting, but it is not the right environment for large, latency-sensitive inference workloads. In practice, that constraint pushed me toward an architecture where the user interface, routing, tools, and document workflows run locally, while the model backend remains flexible enough to use either local or remote providers.&lt;/p&gt;
&lt;h2 id="architecture-overview"&gt;Architecture overview&lt;/h2&gt;
&lt;p&gt;At a high level, the request flow looks like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A user opens Open WebUI in the browser.&lt;/li&gt;
&lt;li&gt;Open WebUI sends model requests to LiteLLM through its OpenAI-compatible API.&lt;/li&gt;
&lt;li&gt;LiteLLM routes the request to the selected backend model.&lt;/li&gt;
&lt;li&gt;If a prompt requires live information, Open WebUI can use SearXNG as a search tool.&lt;/li&gt;
&lt;li&gt;If a prompt requires document context, uploaded files are parsed with Docling and converted into Markdown.&lt;/li&gt;
&lt;li&gt;The model response is returned to Open WebUI and displayed to the user.&lt;/li&gt;
&lt;/ol&gt;
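&lt;p&gt;Steps 2 and 3 are plain OpenAI-style API calls. The example below is roughly the request Open WebUI issues against LiteLLM, using the hostname from the service definitions later in this post and a placeholder logical model name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# A standard OpenAI-style chat completion request against the LiteLLM gateway
curl -s https://litellm.home.example.com/v1/chat/completions \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "&amp;lt;logical_model_name&amp;gt;", "messages": [{"role": "user", "content": "Hello"}]}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;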
&lt;p&gt;This separation of concerns is what makes the stack useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open WebUI handles the human interaction layer&lt;/li&gt;
&lt;li&gt;LiteLLM abstracts model backends and credentials&lt;/li&gt;
&lt;li&gt;SearXNG provides fresh web context&lt;/li&gt;
&lt;li&gt;Docling turns messy source documents into structured text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Traefik remains the single public entry point. From an operations perspective, that is valuable because the AI stack behaves like any other part of the homelab.&lt;/p&gt;
&lt;h2 id="open-webui-as-the-central-interface"&gt;Open WebUI as the central interface&lt;/h2&gt;
&lt;p&gt;Open WebUI is the part of the stack I interact with every day. It provides the browser-based interface for conversations, model selection, file uploads, and tool-assisted prompting. The important point is that Open WebUI does not need to know anything about individual model providers. It only needs a single OpenAI-compatible endpoint, which in this setup is LiteLLM.&lt;/p&gt;
&lt;p&gt;That keeps the client configuration simple. If I want to add a new provider, swap one model for another, or change credentials, I do it behind the scenes in LiteLLM without having to reconfigure the user interface. Open WebUI also supports user and group management, making it straightforward to grant access to specific models or restrict certain users to a defined set of backends. A particularly useful feature is the ability to send a single prompt to multiple AI services simultaneously, which makes side-by-side model comparison a natural part of the workflow.&lt;/p&gt;
&lt;p&gt;A simplified Docker Compose service definition for Open WebUI in this setup looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;open-webui&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ghcr.io/open-webui/open-webui:main&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;OPENAI_API_BASE_URL=http://litellm:4000/v1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;OPENAI_API_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;./data/open-webui:/app/backend/data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;internal&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.openwebui.rule=Host(`ai.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.openwebui.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.openwebui.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.openwebui.loadbalancer.server.port=8080&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The exact image tag and environment variables may differ depending on the release and your setup, but the pattern stays the same: persistent storage for state, Traefik labels for routing, and a backend API endpoint that points to LiteLLM.&lt;/p&gt;
&lt;h2 id="litellm-as-the-model-gateway"&gt;LiteLLM as the model gateway&lt;/h2&gt;
&lt;p&gt;LiteLLM is the glue that makes the rest of the system flexible. It exposes a single OpenAI-style API while allowing multiple backends underneath. That means I can define logical model names and map them to either local inference backends or remote providers.&lt;/p&gt;
&lt;p&gt;This is useful for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open WebUI only has to speak to a single API endpoint&lt;/li&gt;
&lt;li&gt;I can standardize naming across models&lt;/li&gt;
&lt;li&gt;Provider credentials stay centralized&lt;/li&gt;
&lt;li&gt;Swapping backends becomes operationally cheap&lt;/li&gt;
&lt;li&gt;Logging and usage controls are easier to centralize&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Compose service definition for LiteLLM follows the same pattern:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;litellm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ghcr.io/berriai/litellm:main-latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;litellm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;--config&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/app/config.yaml&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;--port&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;4000&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;./litellm/config.yaml:/app/config.yaml:ro&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;internal&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.litellm.rule=Host(`litellm.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.litellm.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.litellm.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.litellm.loadbalancer.server.port=4000&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;style type="text/css"&gt;.notice{--root-color:#444;--root-background:#eff;--title-color:#fff;--title-background:#7bd;--warning-title:#c33;--warning-content:#fee;--info-title:#fb7;--info-content:#fec;--note-title:#6be;--note-content:#e7f2fa;--tip-title:#5a5;--tip-content:#efe}@media (prefers-color-scheme:dark){.notice{--root-color:#ddd;--root-background:#eff;--title-color:#fff;--title-background:#7bd;--warning-title:#800;--warning-content:#400;--info-title:#a50;--info-content:#420;--note-title:#069;--note-content:#023;--tip-title:#363;--tip-content:#121}}body.dark .notice{--root-color:#ddd;--root-background:#eff;--title-color:#fff;--title-background:#7bd;--warning-title:#800;--warning-content:#400;--info-title:#a50;--info-content:#420;--note-title:#069;--note-content:#023;--tip-title:#363;--tip-content:#121}.notice{line-height:24px;margin-bottom:24px;border-radius:4px;color:var(--root-color);background:var(--root-background)}.notice p:last-child{margin-bottom:0; padding: .5rem 1.2rem 1rem;}.notice-title{margin:-18px -18px 12px;padding:4px 18px;border-radius:4px 4px 0 0;font-weight:700;color:var(--title-color);background:var(--title-background)}.notice.warning .notice-title{background:var(--warning-title)}.notice.warning{background:var(--warning-content)}.notice.info .notice-title{background:var(--info-title)}.notice.info{background:var(--info-content)}.notice.note .notice-title{background:var(--note-title)}.notice.note{background:var(--note-content)}.notice.tip .notice-title{background:var(--tip-title)}.notice.tip{background:var(--tip-content)}.icon-notice{display:inline-flex;align-self:center;margin-right:8px}.icon-notice img,.icon-notice svg{height:1em;width:1em;fill:currentColor}.icon-notice img,.icon-notice.baseline svg{top:.125em;position:relative}&lt;/style&gt;
&lt;div&gt;&lt;svg width="0" height="0" display="none" xmlns="http://www.w3.org/2000/svg"&gt;&lt;symbol id="tip-notice" viewBox="0 0 512 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M504 256c0 136.967-111.033 248-248 248S8 392.967 8 256 119.033 8 256 8s248 111.033 248 248zM227.314 387.314l184-184c6.248-6.248 6.248-16.379 0-22.627l-22.627-22.627c-6.248-6.249-16.379-6.249-22.628 0L216 308.118l-70.059-70.059c-6.248-6.248-16.379-6.248-22.628 0l-22.627 22.627c-6.248 6.248-6.248 16.379 0 22.627l104 104c6.249 6.249 16.379 6.249 22.628.001z"/&gt;&lt;/symbol&gt;&lt;symbol id="note-notice" viewBox="0 0 512 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zm-248 50c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/&gt;&lt;/symbol&gt;&lt;symbol id="warning-notice" viewBox="0 0 576 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/&gt;&lt;/symbol&gt;&lt;symbol id="info-notice" viewBox="0 0 512 512" preserveAspectRatio="xMidYMid meet"&gt;&lt;path d="M256 8C119.043 8 8 119.083 8 256c0 136.997 111.043 248 248 248s248-111.003 248-248C504 119.083 392.957 8 256 8zm0 110c23.196 0 42 18.804 42 42s-18.804 42-42 42-42-18.804-42-42 18.804-42 42-42zm56 254c0 6.627-5.373 12-12 12h-88c-6.627 0-12-5.373-12-12v-24c0-6.627 5.373-12 12-12h12v-64h-12c-6.627 0-12-5.373-12-12v-24c0-6.627 5.373-12 12-12h64c6.627 0 12 5.373 12 12v100h12c6.627 0 12 5.373 12 12v24z"/&gt;&lt;/symbol&gt;&lt;/svg&gt;&lt;/div&gt;&lt;div class="notice warning" &gt;
&lt;p class="first notice-title"&gt;&lt;span class="icon-notice baseline"&gt;&lt;svg&gt;&lt;use href="#warning-notice"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/span&gt;Warning&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Security note:&lt;/strong&gt;&lt;br&gt;
In March 2026, LiteLLM was subject to a suspected supply chain attack in which versions v1.82.7 and v1.82.8 on PyPI contained a malicious payload designed to harvest credentials and exfiltrate them to an external domain. Users running the official LiteLLM Docker image were not affected, as that deployment path pins dependencies and does not rely on the compromised PyPI packages. If you installed LiteLLM via &lt;code&gt;pip&lt;/code&gt; during the affected window, treat any secrets on that system as compromised and rotate them immediately. See the official incident report for full details and verified safe versions.&lt;/p&gt;&lt;/div&gt;
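&lt;p&gt;The &lt;code&gt;config.yaml&lt;/code&gt; mounted into the container is where the logical model names are defined. A minimal sketch, with illustrative model names and an assumed OpenAI-compatible inference host as the second backend:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;model_list:
  # Logical name exposed to Open WebUI, backed by a remote provider
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  # A second logical name pointing at any OpenAI-compatible endpoint
  - model_name: local-inference
    litellm_params:
      model: openai/&amp;lt;served_model_name&amp;gt;
      api_base: https://&amp;lt;inference_host&amp;gt;/v1
      api_key: &amp;lt;api_key&amp;gt;

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;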
&lt;h2 id="searxng-for-live-privacy-friendly-search"&gt;SearXNG for live, privacy-friendly search&lt;/h2&gt;
&lt;p&gt;One of the biggest limitations of a plain chat interface is the lack of current information. SearXNG solves that problem cleanly. It is a self-hosted metasearch engine that aggregates results from multiple sources and gives me a search API under my own control.&lt;/p&gt;
&lt;p&gt;Even outside the AI stack, SearXNG is useful as a search engine. Inside the stack, it becomes more interesting because it can be exposed as a tool for prompts that need fresh information.&lt;/p&gt;
&lt;p&gt;A minimal Compose service might look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;searxng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;docker.io/searxng/searxng:latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;searxng&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;./searxng:/etc/searxng&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.searxng.rule=Host(`search.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.searxng.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.searxng.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.searxng.loadbalancer.server.port=8080&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Once connected to Open WebUI as a tool, the flow is straightforward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The user asks a question that requires current information.&lt;/li&gt;
&lt;li&gt;The model decides to call the search tool.&lt;/li&gt;
&lt;li&gt;SearXNG performs the search.&lt;/li&gt;
&lt;li&gt;Titles, snippets, and URLs are returned as context.&lt;/li&gt;
&lt;li&gt;The model synthesizes an answer grounded in current results.&lt;/li&gt;
&lt;/ol&gt;
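&lt;p&gt;Outside of the tool integration, the same instance can be queried directly. A quick check against the JSON API, assuming the &lt;code&gt;json&lt;/code&gt; output format is enabled in the SearXNG settings:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Query the self-hosted SearXNG instance and get machine-readable results
curl -sG "https://search.home.example.com/search" \
  --data-urlencode "q=vllm inference server" \
  --data-urlencode "format=json"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;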
&lt;h2 id="docling-for-document-parsing"&gt;Docling for document parsing&lt;/h2&gt;
&lt;p&gt;The fourth component, Docling, addresses a different problem. Large language models work best with clean text, but many real documents are messy. PDFs, slide decks, and office files often contain broken text flows, layout artifacts, or table structures that are not useful when passed to a model as-is.&lt;/p&gt;
&lt;p&gt;Docling converts these documents into a Markdown representation that is much easier to use as model context. That sounds small, but it is a major quality improvement for local document workflows.&lt;/p&gt;
&lt;p&gt;The Docling service definition is straightforward:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;docling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;quay.io/docling-project/docling-serve:latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;docling&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;restart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;unless-stopped&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;networks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;internal&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="l"&gt;external&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.enable=true&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.docker.network=external&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.docling.rule=Host(`docling.home.example.com`)&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.docling.entrypoints=https&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.routers.docling.tls.certresolver=cloudflare&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="s2"&gt;&amp;#34;traefik.http.services.docling.loadbalancer.server.port=5001&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The typical usage pattern is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload a document in Open WebUI.&lt;/li&gt;
&lt;li&gt;Docling parses the file and converts it to Markdown.&lt;/li&gt;
&lt;li&gt;Feed that Markdown into the model as structured prompt context.&lt;/li&gt;
&lt;li&gt;Ask targeted questions against the extracted content.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is especially useful for technical notes, whitepapers, internal PDFs, or vendor documentation where the raw file format is not suitable for direct prompting.&lt;/p&gt;
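&lt;p&gt;Docling can also be called directly, independent of Open WebUI. The request below is a sketch assuming the &lt;code&gt;/v1alpha/convert/source&lt;/code&gt; route exposed by current docling-serve releases; the exact path may change between versions, so check the API documentation the service itself exposes if it does not match:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Ask docling-serve to fetch a PDF and convert it; the response contains the Markdown representation
curl -s https://docling.home.example.com/v1alpha/convert/source \
  -H "Content-Type: application/json" \
  -d '{"http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}]}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;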
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This stack did not start as an attempt to build a local alternative to a commercial AI product. It emerged naturally from an existing homelab that already had strong building blocks: containerized services, Traefik, DNS-based routing, and a bias toward self-hosting.&lt;/p&gt;
&lt;p&gt;Adding Open WebUI, LiteLLM, SearXNG, and Docling turned that base into a practical local AI environment. It gives me a single interface for model interaction, the ability to swap backends without changing clients, a way to enrich prompts with live web data, and a better workflow for document-driven tasks.&lt;/p&gt;
&lt;p&gt;Just as important, it stays operationally consistent with the rest of the homelab. That keeps the setup understandable, maintainable, and worth using day to day.&lt;/p&gt;
&lt;p&gt;Future extensions are obvious: adding a vector database, introducing GPU-backed local inference, routing requests to model endpoints running on specialized inference platforms, or using Open WebUI as a gateway to interact with AI agents. But even without those additions, this combination already covers a large share of the AI workflows I actually care about.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;My Homelab: A Traefik-centered Self-hosting Setup - &lt;a href="/2026/my-homelab-a-traefik-centered-self-hosting-setup/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open WebUI - project site - &lt;a href="https://openwebui.com/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open WebUI - GitHub - &lt;a href="https://github.com/open-webui/open-webui"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM - project site - &lt;a href="https://www.litellm.ai/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM - GitHub - &lt;a href="https://github.com/BerriAI/litellm"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;LiteLLM - Security incident report, March 2026 - &lt;a href="https://docs.litellm.ai/blog/security-update-march-2026"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SearXNG - documentation - &lt;a href="https://docs.searxng.org/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SearXNG - GitHub - &lt;a href="https://github.com/searxng/searxng"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docling - documentation - &lt;a href="https://docling-project.github.io/docling/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docling - GitHub - &lt;a href="https://github.com/docling-project/docling"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>