
Network Topology for a Hybrid Homelab-Cloud Setup

Someone asked me recently for a diagram of my homelab. I opened a blank canvas and stared at it for ten minutes. The problem wasn’t that I didn’t know the topology. The problem was that there are three topologies, layered on top of each other, and drawing one without the other two is misleading.

There’s the physical LAN. There’s the VPN overlay. And there’s the DNS layer that decides which path a request actually takes. None of them work in isolation. Together, they turn 9 scattered nodes into something that behaves like a single environment.

The physical layer

My LAN is a flat /24 at 172.16.1.0. Every bare-metal node has a static IP:

| Node | IP | Role |
|---|---|---|
| RPi4 (8GB) | .1 | DNS gateway (Pi-hole + CoreDNS) |
| Acemagic-1 (12GB) | .2 | Proxmox host (k3s-server + agent-1 VMs) |
| Beelink (8GB) | .3 | Ollama bare metal (LLM inference) |
| Jetson Nano (4GB) | .4 | Edge inference (llama.cpp) |
| Acemagic-2 (12GB) | .5 | Proxmox host (k3s-agent-2 VM) |

Then the VMs running on Proxmox:

| VM | IP | Host |
|---|---|---|
| k3s-server | .10 | Acemagic-1 |
| k3s-agent-1 | .11 | Acemagic-1 |
| k3s-agent-2 | .12 | Acemagic-2 |

The RPi3 (1GB) sits on a separate network entirely. It runs Uptime Kuma for external monitoring — you don’t put your monitoring on the same network as the thing you’re monitoring.

And then there’s the VPS. A Hetzner CX22 at 162.55.57.175. That’s production. Docker Compose today, K3s eventually. It’s not on my LAN. It’s not even in the same country.

```mermaid
graph TB
    subgraph "VPS Hetzner (162.55.57.175)"
        VPS["Production<br/>Docker Compose → K3s<br/>Traefik + Headscale"]
    end

    subgraph "Home LAN (172.16.1.0/24)"
        subgraph "Acemagic-1 (.2)"
            A1_PVE["Proxmox VE"]
            A1_VM1["k3s-server (.10)"]
            A1_VM2["k3s-agent-1 (.11)"]
        end
        subgraph "Acemagic-2 (.5)"
            A2_PVE["Proxmox VE"]
            A2_VM3["k3s-agent-2 (.12)"]
        end
        RPI4["RPi4 (.1)<br/>Pi-hole + CoreDNS"]
        BEE["Beelink (.3)<br/>Ollama"]
        JET["Jetson Nano (.4)<br/>llama.cpp"]
    end

    subgraph "External Network"
        RPI3["RPi3<br/>Uptime Kuma"]
    end

    subgraph "VPN Overlay (Headscale 100.64.0.0/24)"
        direction LR
        MESH["MSI .1 ── VPS .2 ── Beelink .3<br/>k3s-server .4 ── RPi4 .5 ── RPi3 .6<br/>agent-1 .7 ── Jetson .8 ── agent-2 .9"]
    end

    VPS -.->|Headscale mesh| MESH
    RPI4 -.->|Headscale mesh| MESH
    A1_VM1 -.->|Headscale mesh| MESH
    BEE -.->|Headscale mesh| MESH
    RPI3 -.->|Headscale mesh| MESH

    A1_VM1 ---|K3s cluster| A1_VM2
    A1_VM2 ---|K3s cluster| A2_VM3

    RPI4 -->|"advertises 172.16.1.0/24<br/>as subnet route"| MESH
```

Two things to notice. First, the K3s cluster spans two physical hosts but it’s three VMs — the control plane and one agent share Acemagic-1, the heavy worker gets all of Acemagic-2. Second, the VPN overlay connects everything, including nodes that can’t see each other on the LAN.

The VPN is the backbone

Headscale (self-hosted Tailscale control plane) runs on the VPS at vpn.kubelab.live. Every node runs a Tailscale client that connects to it. This creates a WireGuard mesh where any node can reach any other node, regardless of NATs, firewalls, or physical location.

The RPi4 does something critical: it advertises the entire 172.16.1.0/24 subnet as a Headscale route. This means my MSI workstation (which is only on the VPN, not on the LAN) can SSH directly into 172.16.1.10 (k3s-server) through the RPi4’s subnet route. Without this, VPN clients could only reach nodes that have Tailscale installed — not the Proxmox VMs behind them.
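On the client side, advertising a subnet route is a one-flag affair. The post doesn't show the exact commands, so the following is a sketch of the usual shape; note that Headscale's route-approval subcommands have shifted between releases:

```bash
# On the RPi4: allow forwarding, then advertise the LAN as a subnet route
echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-tailscale.conf
sudo sysctl --system

sudo tailscale up \
  --login-server https://vpn.kubelab.live \
  --advertise-routes 172.16.1.0/24

# On the VPS: approve the advertised route in Headscale
headscale routes list          # find the route ID
headscale routes enable -r 1   # subcommand names vary by Headscale version
```

Until the route is approved server-side, the advertisement sits pending and VPN clients still can't see the LAN.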

There’s a bootstrap dependency that took me a while to internalize. Headscale runs on the VPS. Tailscale clients need to reach the VPS to join the mesh. If DNS resolves vpn.kubelab.live through the VPN… you have a circular dependency. The VPN needs DNS, DNS needs the VPN.

The rule is absolute: vpn.kubelab.live must always resolve to the public IP 162.55.57.175, never to the Tailscale IP 100.64.0.2. Every node has a static /etc/hosts entry for this. An Ansible role called dns_resilience manages it across all 7 nodes. If the RPi4 goes down and takes DNS with it, Tailscale can still reconnect because it never depended on DNS for the control plane in the first place.
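The post names the role but not its contents, so here's a minimal sketch of the kind of task dns_resilience might run; the module choice and regexp are illustrative, not the real role:

```yaml
# Hypothetical task from the dns_resilience role — the actual role
# isn't shown in the post.
- name: Pin vpn.kubelab.live to the VPS public IP, never the Tailscale IP
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: 'vpn\.kubelab\.live$'
    line: "162.55.57.175 vpn.kubelab.live"
```

Because /etc/hosts wins over any resolver, this entry holds even when the RPi4 and the whole DNS layer are dark.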

Three ingress paths

This is where people’s eyes glaze over, but it’s the core of the whole design. There are three completely separate paths a request can take, depending on which domain you’re hitting.

```mermaid
flowchart TB
    subgraph "1. Production"
        U1["User"] -->|"*.kubelab.live"| CF["Cloudflare DNS"]
        CF -->|"162.55.57.175"| VPS_T["VPS Traefik\n(Docker)"]
        VPS_T --> APP1["App containers"]
    end

    subgraph "2. Staging (VPN only)"
        U2["Developer\n(on VPN)"] -->|"*.staging.kubelab.live"| HS["Headscale\nsplit DNS"]
        HS -->|"100.64.0.5"| PH["RPi4 Pi-hole\n:53"]
        PH -->|"conditional\nforward"| CD["CoreDNS\n:5353"]
        CD -->|"100.64.0.4"| K3S_T["K3s Traefik\nIngress"]
        K3S_T --> APP2["K3s pods"]
    end

    subgraph "3. Development"
        U3["Developer\n(localhost)"] -->|"*.kubelab.test"| ETC["/etc/hosts\n127.0.0.1"]
        ETC --> LOCAL["Local dev\nserver"]
    end
```

Production is public. *.kubelab.live goes through Cloudflare, hits the VPS at its public IP, and Traefik routes it to the right container. Standard setup, nothing unusual.

Staging is VPN-only. This is the interesting one. Headscale has split DNS configured for the staging.kubelab.live subdomain. When a VPN client queries grafana.staging.kubelab.live, Headscale intercepts the DNS query and sends it to the RPi4’s Tailscale IP (100.64.0.5) instead of a public resolver. Pi-hole receives it, matches the kubelab.live zone, and forwards it to CoreDNS on port 5353. CoreDNS knows that *.staging.kubelab.live resolves to 100.64.0.4 — the k3s-server’s Tailscale IP, where Traefik is listening.
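In CoreDNS terms, the wildcard-to-one-IP mapping can be expressed with the template plugin. The actual Corefile isn't shown in the post, so this is a sketch under the assumption that CoreDNS serves the zone directly on 5353:

```
# Hypothetical Corefile fragment — illustrative, not the real config.
staging.kubelab.live:5353 {
    # Answer every A query in the zone with the k3s-server's Tailscale IP,
    # where the K3s Traefik ingress listens.
    template IN A {
        answer "{{ .Name }} 60 IN A 100.64.0.4"
    }
}
```

On the Pi-hole side, the conditional forward is a single dnsmasq line of the form `server=/kubelab.live/127.0.0.1#5353`.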

The split DNS only targets staging.kubelab.live, not the broader kubelab.live. I learned this the hard way. When I initially configured it for all of kubelab.live, production domains also routed through the RPi4. If the RPi4 went down, prod domains became unreachable from VPN clients — even though they had perfectly valid public Cloudflare records. Narrowing the split to staging.kubelab.live means production always resolves through public DNS, regardless of the RPi4’s health.
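In Headscale's config this lives in the split-nameserver map. The key layout varies by release (recent versions nest it under dns.nameservers.split; older ones used dns_config.restricted_nameservers), so treat this as a sketch of the narrow version:

```yaml
# Headscale config.yaml — only the staging zone is steered at the RPi4;
# everything else, production included, uses public resolvers.
dns:
  nameservers:
    split:
      staging.kubelab.live:
        - 100.64.0.5   # RPi4's Tailscale IP
```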

Dev is trivial. /etc/hosts maps each kubelab.test hostname to 127.0.0.1 (hosts files don’t do wildcards, so every service gets its own line). No DNS infrastructure involved.

External services: not everything runs on K3s

The Beelink runs Ollama bare metal. It’s not a Kubernetes pod. The RPi3 runs Uptime Kuma. Also not a pod. But I still want to access them through K3s Traefik with proper TLS and authentication.

The pattern is a headless Service plus an EndpointSlice that points at the node’s real IP. Kubernetes thinks it’s routing to a pod. It’s actually routing to a bare-metal process on a different machine. The manifests live in infra/k8s/base/external/ and carry the label kubelab.live/location: external so I can identify them at a glance.
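A minimal sketch of that pattern for the Beelink's Ollama — the names are illustrative rather than lifted from the repo, and 11434 is Ollama's default port:

```yaml
# Hypothetical manifests — the real ones live in infra/k8s/base/external/.
apiVersion: v1
kind: Service
metadata:
  name: ollama
  labels:
    kubelab.live/location: external
spec:
  clusterIP: None            # headless; no selector, so no pods are matched
  ports:
    - name: http
      port: 11434
      targetPort: 11434
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: ollama-1
  labels:
    kubernetes.io/service-name: ollama   # binds the slice to the Service
    kubelab.live/location: external
addressType: IPv4
ports:
  - name: http
    port: 11434
endpoints:
  - addresses:
      - 172.16.1.3           # the Beelink's LAN IP
```

An IngressRoute can then point at ollama:11434 exactly as it would for any in-cluster service.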

It’s a clean abstraction. Every service in the cluster has the same ingress pattern — IngressRoute, TLS, middleware — regardless of whether it’s a pod on K3s or a process on a Beelink.

SSH and access patterns

Day to day, I SSH into nodes using their Headscale IPs. ssh 100.64.0.4 hits the k3s-server from anywhere, whether I’m at home or in a coffee shop. The VPN handles the routing.

For the Proxmox VMs, there’s a ProxyJump through the Proxmox host. The VMs don’t have Tailscale installed (they don’t need it — the RPi4’s subnet route covers them), so reaching them from outside the LAN means hopping through a node that is on both networks.
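In ~/.ssh/config the hop is two stanzas. The host aliases here are made up; the IPs come from the tables above:

```
# Hypothetical ~/.ssh/config stanzas — aliases are illustrative.
Host pve1
    HostName 172.16.1.2      # Acemagic-1, reachable via the subnet route

Host k3s-server-vm
    HostName 172.16.1.10
    ProxyJump pve1           # hop through the Proxmox host
```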

LAN IPs are the fallback. If Headscale is down, I can still reach everything from the local network. This has saved me twice — once when I accidentally broke Tailscale on the RPi4 while updating its config, and once during a Headscale upgrade that took longer than expected.

Security boundaries

Headscale supports ACLs with user-based groups. I have three:

- kubelab: Full admin access to everything. My personal devices.
- work: Windows PCs. Isolated. Can reach the VPN but nothing else on the mesh. Implicit deny.
- contractors: Not used yet, but the group exists. When I eventually give someone access to a specific service, they’ll get a scoped ACL that only reaches that service’s Tailscale IP.

No outbound rules on work devices. They can reach the internet through their own gateway, but they can’t reach a single node on my homelab. The VPN is not a flat network.
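Headscale ACL policies are HuJSON (JSON with comments). The actual policy isn't in the post, but a three-group version could look roughly like this:

```json
// Hypothetical Headscale ACL policy — illustrative only.
{
  "groups": {
    "group:kubelab":     ["kubelab"],
    "group:work":        ["work"],
    "group:contractors": []
  },
  "acls": [
    // kubelab devices reach everything; work and contractors match no
    // rule, so their traffic falls through to the implicit deny.
    { "action": "accept", "src": ["group:kubelab"], "dst": ["*:*"] }
  ]
}
```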

What I’ve learned

The VPN mesh is not a nice-to-have. Without Headscale, my homelab is five boxes in a closet and a VPS in Germany with no relationship between them. The mesh is what turns physical hardware into an environment.

DNS is the single point of failure. The RPi4 going down cascades into everything. The resilience patterns — /etc/hosts fallbacks, dual nameservers (127.0.0.1 + 8.8.8.8 on the RPi4 itself), Ansible-managed static entries — aren’t paranoia. They’re the difference between “the RPi4 is down” and “the entire homelab is down.”
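The dual-nameserver fallback on the RPi4 itself is just two lines of resolv.conf:

```
# /etc/resolv.conf on the RPi4 — local Pi-hole first, public fallback so
# the box itself can still resolve names even if Pi-hole is down.
nameserver 127.0.0.1
nameserver 8.8.8.8
```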

Every node has one job. The RPi4 is the gateway. The Beelink runs LLMs. The VPS runs prod. The Acemagics run K3s VMs. When I tried to also run a container registry on the RPi4, debugging became impossible because a DNS issue and a registry issue looked identical. I moved it to K3s and the RPi4 went back to doing one thing well.

The topology looks complex on a diagram. In practice, each layer has a clear purpose: physical gives you wires, VPN gives you reach, DNS gives you names. If you understand which layer is responsible for what, debugging becomes a matter of asking “which layer is broken?” instead of staring at tcpdump output for an hour.

Nine nodes, two networks, three DNS paths. It’s not simple. But every piece is there for a reason, and I can explain every one of them. That’s the point.