K8s Homelab: Deploying the Cluster Infrastructure

This is the fourth post in our “K8s Homelab” series. Check out the previous post to see how we deployed the physical K3s cluster with PXE boot and Butane configurations.

Building the Foundation Layer

With the physical K3s cluster up and running across our three Lenovo nodes, it was time to deploy the essential infrastructure services that would form the foundation for everything else. These services provide the core primitives that all workloads running on the cluster will depend on.

The foundation layer consists of four critical components:

  1. MetalLB - Load balancer for bare-metal Kubernetes
  2. Traefik - Ingress controller for HTTP/HTTPS traffic
  3. Longhorn - Distributed block storage for persistent volumes
  4. Container Registries - Local registry and mirrors for image management

Together, these four services provide the essential infrastructure layer that all applications and services running on the cluster will consume.

Challenge 1: Load Balancing Without a Cloud Provider

Kubernetes was designed for cloud environments where load balancers are a first-class primitive. In a bare-metal homelab, you don’t have AWS ELB or Google Cloud Load Balancer. You need a way to expose services to the network.

The Solution: MetalLB with Layer 2 Mode

MetalLB provides the “missing” load balancer for bare-metal Kubernetes. It operates in two modes:

  1. Layer 2 Mode (what we used): Assigns IP addresses from a configured pool, advertises them via ARP
  2. BGP Mode: Integrates with your network’s BGP infrastructure

For a homelab, Layer 2 mode is simpler and perfectly adequate. MetalLB:

  • Chooses one node as the “leader” to respond to ARP requests
  • Automatically migrates the IP if that node fails
  • Supports multiple IP pools for different services

The VIP Design

I decided to use a single Virtual IP (VIP) as the entry point for all cluster services:

192.168.X.254 → cluster.lab.x.y.z

This VIP becomes the single point of entry. All services get DNS CNAMEs that point to this VIP:

longhorn.lab.x.y.z → CNAME → cluster.lab.x.y.z
(the Traefik dashboard hostname is handled the same way)

DNS on the OpenWRT router handles the hostname routing, while MetalLB ensures the VIP is always reachable even if nodes fail.
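
To make the flow concrete, here is a minimal sketch of a Service that asks MetalLB for an address from the cluster-vip pool defined below. The manifest is illustrative (the app name is a placeholder); in this cluster only Traefik's LoadBalancer Service actually claims the VIP, and everything else reaches it through Ingress rules.

apiVersion: v1
kind: Service
metadata:
  name: example-app               # placeholder, not an actual workload
  annotations:
    # Ask MetalLB to allocate from the cluster-vip pool
    metallb.universe.tf/address-pool: cluster-vip
spec:
  type: LoadBalancer
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080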

MetalLB Implementation

# cluster/roles/metallb/defaults/main.yaml
metallb_namespace: metallb-system
metallb_version: v0.14.3
metallb_ip_pool_name: cluster-vip
metallb_ip_address: "{{ hostvars[groups['kubernetes_cluster'][0]].cluster_vip }}/32"

The MetalLB configuration creates an IPAddressPool and an L2Advertisement:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: cluster-vip
  namespace: metallb-system
spec:
  addresses:
  - 192.168.X.254/32

---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: cluster-l2-advert
  namespace: metallb-system
spec:
  ipAddressPools:
  - cluster-vip

One critical issue we discovered: MetalLB’s webhook service needs endpoints to be in the “ready” state before you can apply these configurations. The playbook needed a wait condition:

- name: Wait for webhook service to have endpoints
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Endpoints
    name: metallb-webhook-service
    namespace: "{{ metallb_namespace }}"
  register: webhook_endpoints
  until: >
    webhook_endpoints.resources | length > 0 
    and webhook_endpoints.resources[0].subsets | default([]) | length > 0
    and webhook_endpoints.resources[0].subsets[0].addresses | default([]) | length > 0    
  retries: 60
  delay: 5

This waits for the endpoint to be in the addresses array (ready state) rather than notReadyAddresses (when the controller is starting).

Challenge 2: Namespace Organization and the Traefik Saga

The original Traefik deployment used the kube-system namespace. This bothered me from a cleanliness perspective. I wanted dedicated namespaces for each infrastructure component:

  • metallb-system - Load balancer
  • traefik-system - Ingress controller
  • longhorn-system - Storage

After some consideration, I decided to migrate Traefik to its own namespace. This required:

  1. Removing the existing Traefik deployment
  2. Adding namespace creation to the Ansible playbook
  3. Updating default variables
  4. Redeploying with the new namespace

The migration went smoothly, and now each infrastructure component has clear ownership and isolation.

Traefik Configuration

Traefik needed to use the MetalLB VIP and serve both HTTP and HTTPS:

# cluster/roles/ingress/templates/values.yaml.j2
service:
  type: LoadBalancer
  spec:
    loadBalancerIP: {{ traefik_loadbalancer_ip }}

ports:
  web:
    port: 80
    expose: {}
  websecure:
    port: 443
    expose: {}

Note the expose: {} syntax—the newer Traefik Helm chart expects an object, not a boolean.
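
To show how applications plug into this, here is a hedged sketch of an Ingress that Traefik would serve through the VIP. The hostname and Service name are hypothetical; the only assumptions are the traefik ingress class and a CNAME pointing the host at cluster.lab.x.y.z.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app               # hypothetical workload
spec:
  ingressClassName: traefik
  rules:
    - host: example.lab.x.y.z     # hypothetical hostname, CNAME to cluster.lab.x.y.z
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80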

Challenge 3: Distributed Storage with Longhorn

Longhorn provides distributed block storage for Kubernetes. Unlike cloud volume services that rely on an external provider's storage API, Longhorn runs entirely within the cluster. It:

  • Creates replicated volumes across nodes
  • Handles node failures gracefully
  • Provides storage classes for different use cases
  • Includes a web UI for management

Longhorn Configuration

Longhorn’s configuration includes several important settings:

# cluster/roles/longhorn/defaults/main.yaml
longhorn_namespace: longhorn-system
longhorn_version: v1.5.3
longhorn_data_path: /var/lib/longhorn
longhorn_replica_count: 3
longhorn_ingress_enabled: true
longhorn_ingress_host: longhorn.lab.x.y.z
longhorn_ingress_class: traefik

The Helm values file configures Longhorn’s default settings:

# cluster/roles/longhorn/templates/values.yaml.j2
defaultSettings:
  defaultDataPath: {{ longhorn_data_path }}
  defaultReplicaCount: {{ longhorn_replica_count }}
  guaranteedEngineManagerCPU: "250"
  guaranteedReplicaManagerCPU: "250"
  storageReservedPercentageForDefaultDisk: "5"

The key decisions here:

  • Replica count of 3: With three nodes, storing three replicas means each node has one copy. This provides redundancy if one node fails.
  • Storage reserved percentage: Set to 5% (instead of the default 30%) to maximize available storage for volumes. This is important because Longhorn reserves space for system operations and to prevent over-provisioning.
  • Ingress enabled: Access Longhorn UI through Traefik at longhorn.lab.x.y.z
  • Data path: Store Longhorn data on dedicated HDDs (separate from the OS SSDs on each node)

Disk Tagging

Since our nodes have dedicated HDD disks for Longhorn storage (mounted at /var/lib/longhorn), we tag them with longhorn-hdd-raw to ensure they’re reserved for Longhorn use. The Ansible playbook automatically applies these tags:

# cluster/roles/longhorn/templates/node-disk-tags.yaml.j2
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: {{ node_item.name }}
  namespace: {{ longhorn_namespace }}
spec:
  disks:
    "{{ node_item.disk_key }}":
      tags:
        - longhorn-hdd-raw

The tag uses a longhorn- prefix for clarity, making it explicit that these disks are reserved for Longhorn storage.
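
To double-check that the tags landed, the Longhorn Node custom resources can be inspected directly; lenovo1 below is one of our nodes, and the disk key will match whatever the playbook generated:

# nodes.longhorn.io is Longhorn's Node CRD
kubectl -n longhorn-system get nodes.longhorn.io lenovo1 -o yaml
# look under spec.disks.<disk-key>.tags for "longhorn-hdd-raw"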

Custom Storage Classes

Longhorn comes with a default storage class, but we created custom ones with disk selectors to ensure volumes are only created on our tagged HDD disks:

# cluster/roles/longhorn/templates/storage-classes.yaml.j2
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-hdd-raw-delete
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  diskSelector: "longhorn-hdd-raw"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-hdd-raw-retain
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  diskSelector: "longhorn-hdd-raw"

This setup ensures:

  • Disk isolation: Volumes are only created on disks tagged with longhorn-hdd-raw
  • Reclaim policies: Two storage classes - one for Delete (default) and one for Retain (for important data)
  • Replica distribution: Three replicas spread across the three nodes for redundancy

The diskSelector parameter ensures that Longhorn only schedules volume replicas on our dedicated HDD disks, preventing accidental use of system SSDs.
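
As an example of how a workload would consume these classes, a PersistentVolumeClaim only needs to name the storage class; the claim name and size here are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data              # placeholder
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-hdd-raw-retain
  resources:
    requests:
      storage: 5Gi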

Challenge 4: The Great Version Mismatch Mystery

Everything was going smoothly until we noticed instability: pods restarting, services not responding. Investigation revealed a critical issue:

lenovo1: v1.28.5+k3s1
lenovo2: v1.33.5+k3s1  ← The problem
lenovo3: v1.28.5+k3s1

Somehow, lenovo2 was running a version five minor releases ahead of the other nodes! This created severe compatibility issues:

  • MetalLB speakers couldn’t communicate on port 7946 (member discovery)
  • Traefik pods were crashing
  • API compatibility mismatches throughout the cluster
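
A quick way to spot this kind of drift is to list the kubelet version reported by each node:

kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion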

The Root Cause: Manual Intervention

After examining logs and timestamps, we discovered that the K3s installer was manually re-run without specifying a version. The installer defaults to the latest version, resulting in the mismatch.

This incident highlighted the need for:

  1. Version pinning in inventory configuration
  2. Automated upgrades via Ansible
  3. Rolling upgrade strategy to maintain cluster health

The Solution: Rolling Upgrade Playbook

We created an Ansible playbook to safely upgrade K3s with:

  • Serial execution: Only one node at a time
  • Cordon and drain: Move workloads before upgrading
  • Health checks: Wait for nodes to be ready before proceeding
  • Automatic uncordon: Restore scheduling after successful upgrade

# machines/playbooks/upgrade-k3s.yaml
- name: Rolling K3s Upgrade
  hosts: kubernetes_cluster
  serial: 1  # One node at a time
  vars:
    target_version: "{{ k3s_version }}"

  tasks:
    - name: Cordon node
      command: kubectl cordon {{ node_name }}

    - name: Drain node
      command: kubectl drain {{ node_name }} --ignore-daemonsets --force

    - name: Upgrade K3s
      shell: curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="{{ target_version }}" sh -

    - name: Wait for node ready
      command: kubectl get node {{ node_name }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      register: node_status
      until: node_status.stdout == "True"
      retries: 60
      delay: 5

    - name: Uncordon node
      command: kubectl uncordon {{ node_name }}

This playbook upgrades one node at a time, so workloads stay available on the remaining nodes and the cluster remains healthy throughout the upgrade.

DNS Configuration: The Final Piece

With services deployed, we needed DNS resolution. The OpenWRT router’s DNSmasq configuration includes:

# A record for the cluster VIP
config domain
  option name 'cluster.lab.x.y.z'
  option ip '192.168.X.254'

# CNAME for each service
config cname
  option cname 'longhorn.lab.x.y.z'
  option target 'cluster.lab.x.y.z'

This allows accessing services by their friendly hostnames while only managing one actual IP address.
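
A quick sanity check from any LAN client that uses the router as its resolver: both names should come back with the MetalLB VIP.

nslookup cluster.lab.x.y.z
nslookup longhorn.lab.x.y.z   # resolves via the CNAME to the same VIP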

The Ansible Structure

All cluster infrastructure is organized in cluster/:

cluster/
├── playbooks/
│   └── deploy.yaml          # Main playbook
├── roles/
│   ├── metallb/
│   │   ├── defaults/main.yaml
│   │   ├── tasks/main.yaml
│   │   └── templates/config.yaml.j2
│   ├── ingress/
│   │   ├── defaults/main.yaml
│   │   ├── tasks/main.yaml
│   │   └── templates/values.yaml.j2
│   ├── longhorn/
│   │   ├── defaults/main.yaml
│   │   ├── tasks/main.yaml
│   │   └── templates/
│   │       ├── values.yaml.j2
│   │       ├── storage-classes.yaml.j2
│   │       └── node-disk-tags.yaml.j2
│   └── registries/
│       ├── defaults/main.yaml
│       ├── tasks/
│       │   ├── install.yaml
│       │   └── uninstall.yaml
│       └── templates/
│           ├── container-image-registry-*.yaml.j2
│           └── container-image-mirror-*.yaml.j2
└── data/                    # Generated files (gitignored)
    ├── metallb-config.yaml
    ├── traefik-values.yaml
    ├── longhorn-values.yaml
    ├── storage-classes.yaml
    ├── node-disk-tags.yaml
    └── registry-*.yaml

Each role follows a consistent pattern:

  1. Add Helm repository
  2. Create namespace
  3. Template values/configuration
  4. Deploy via Helm
  5. Wait for readiness
  6. Apply custom resources (StorageClasses, etc.)
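
Condensed into a tasks/main.yaml, the pattern looks roughly like the sketch below. It uses the kubernetes.core collection; the variable names are illustrative rather than copied from the repository.

# Sketch of a role following the pattern above (illustrative variable names)
- name: Add Helm repository
  kubernetes.core.helm_repository:
    name: "{{ chart_repo_name }}"
    repo_url: "{{ chart_repo_url }}"

- name: Create namespace
  kubernetes.core.k8s:
    kind: Namespace
    name: "{{ component_namespace }}"
    state: present

- name: Template Helm values into the data directory
  template:
    src: values.yaml.j2
    dest: "{{ playbook_dir }}/../data/{{ component_name }}-values.yaml"

- name: Deploy the chart and wait for readiness
  kubernetes.core.helm:
    name: "{{ component_name }}"
    chart_ref: "{{ chart_repo_name }}/{{ chart_name }}"
    chart_version: "{{ chart_version }}"
    release_namespace: "{{ component_namespace }}"
    values_files:
      - "{{ playbook_dir }}/../data/{{ component_name }}-values.yaml"
    wait: true

- name: Apply custom resources (StorageClasses, disk tags, etc.)
  kubernetes.core.k8s:
    state: present
    src: "{{ playbook_dir }}/../data/{{ component_name }}-extras.yaml"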

Python Installation: A Necessary Evil

One unexpected requirement: Fedora CoreOS doesn't include Python by default (it's a minimal immutable OS), and Ansible needs Python on the target for fact gathering and most modules.

The solution: Add a Python installation service to the Butane configurations:

# In all .bu.j2 files
- path: /etc/systemd/system/python-install.service
  mode: 0644
  contents:
    inline: |
      [Unit]
      Description=Install Python for Ansible
      After=network-online.target
      Before=k3s-install.service
      
      [Service]
      Type=oneshot
      ExecStart=/usr/bin/rpm-ostree install python3
      
      [Install]
      WantedBy=multi-user.target      

This layers Python 3 onto the host on first boot, and the layered package persists across Zincati updates (Fedora CoreOS's automatic update agent).

Makefile Integration

The deployment is integrated into the main Makefile:

# Cluster Commands
.PHONY: cluster/deploy
cluster/deploy: machines/kubeconfig
    @echo "📦 Deploying cluster infrastructure..."
    ansible-playbook -i inventory.yaml cluster/playbooks/deploy.yaml

Running make cluster/deploy:

  1. Ensures kubeconfig exists
  2. Creates required directories
  3. Runs the four infrastructure roles in sequence
  4. Deploys the complete foundation layer

Challenge 5: Container Image Registry and Mirrors

As the cluster grew, I needed a way to store custom container images and cache public images to avoid rate limiting and speed up pulls. The solution: deploy a local container registry and mirror registries for upstream sources.

The Registry Architecture

I deployed three registries in the registry-system namespace:

  1. Main Registry (registry): Stores custom-built images
     • Storage: 2Gi with Retain policy (data should persist)
     • Accessible at: registry.lab.x.y.z
  2. Docker Hub Mirror (docker-io): Caches images from docker.io
     • Storage: 2Gi with Delete policy (cache can be evicted)
     • Accessible at: docker-io.lab.x.y.z
  3. GHCR Mirror (ghcr-io): Caches images from ghcr.io
     • Storage: 2Gi with Delete policy (cache can be evicted)
     • Accessible at: ghcr-io.lab.x.y.z

All registries use Docker Registry v2 and are exposed via Traefik with TLS termination.

Registry Configuration

The registries are configured via Ansible with a flexible inventory structure:

# inventory.yaml
container_image_mirrors:
  - url: https://registry-1.docker.io
    registry: docker.io
    size: 2Gi
  - url: https://ghcr.io
    registry: ghcr.io
    size: 2Gi

The Ansible role automatically:

  • Creates ConfigMaps with proxy configuration for each mirror
  • Creates Deployments with appropriate resource limits
  • Creates Services and Ingress resources
  • Generates TLS certificates using the CA role
  • Configures K3s registries.yaml for automatic mirror usage

K3s Integration: Automatic Mirror Usage

The key feature is automatic mirror usage during node installation. The registries.yaml file is generated and served via PXE boot:

# Generated registries.yaml
mirrors:
  "docker.io":
    endpoint:
      - "https://docker-io.registry-system.svc.cluster.local:5000"
  "ghcr.io":
    endpoint:
      - "https://ghcr-io.registry-system.svc.cluster.local:5000"
configs:
  "registry.registry-system.svc.cluster.local:5000":
    tls:
      ca_file: /etc/rancher/k3s/ssl/registry-ca.crt
  "docker-io.registry-system.svc.cluster.local:5000":
    tls:
      ca_file: /etc/rancher/k3s/ssl/registry-ca.crt
  "ghcr-io.registry-system.svc.cluster.local:5000":
    tls:
      ca_file: /etc/rancher/k3s/ssl/registry-ca.crt

This configuration:

  • Routes docker.io pulls to the docker-io mirror
  • Routes ghcr.io pulls to the ghcr-io mirror
  • Falls back to upstream registries if mirrors are unavailable
  • Uses TLS with the cluster CA certificate

Docker Registry v2 Proxy Limitations

Docker Registry v2’s proxy mode only supports a single upstream registry per instance. This is why we need separate mirror instances for docker.io and ghcr.io. Each mirror:

  • Proxies requests to its upstream registry
  • Caches blobs locally with configurable TTL (default: 168h)
  • Returns cached blobs on subsequent requests
  • Evicts cached blobs once their TTL expires, so the cache does not grow without bound
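
For context, here is a trimmed sketch of the Docker Registry configuration a mirror instance runs with in proxy mode. Paths and the TLS section are simplified placeholders, not the exact values the Ansible templates render:

version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
  tls:
    certificate: /certs/tls.crt   # placeholder path
    key: /certs/tls.key           # placeholder path
proxy:
  # One upstream per instance; this one mirrors Docker Hub.
  # Cached blobs expire after 168h by default.
  remoteurl: https://registry-1.docker.io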

TLS Configuration

All registries use TLS certificates generated by the CA role:

  • Certificates are stored in cluster/secrets/ca/registry/
  • Each registry has its own certificate (registry.crt, docker-io.crt, ghcr-io.crt)
  • Certificates are signed by the cluster CA
  • The CA certificate is distributed to nodes during PXE boot

External Access

Registries are accessible both:

  • Internally: Via Kubernetes service DNS (registry.registry-system.svc.cluster.local:5000)
  • Externally: Via Traefik ingress (registry.lab.x.y.z:443)

External access requires installing the CA certificate on Docker Desktop or native Docker clients. The CA website (https://ca.lab.x.y.z/) provides installation instructions for both web browsers and Docker clients.
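
For a native Docker client on Linux, trusting the registry usually boils down to placing the CA certificate in Docker's per-registry certificate directory. The file name and image tag below are illustrative; the CA certificate itself comes from the CA website mentioned above.

# Trust the cluster CA for the custom-image registry (illustrative paths)
sudo mkdir -p /etc/docker/certs.d/registry.lab.x.y.z
sudo cp cluster-ca.crt /etc/docker/certs.d/registry.lab.x.y.z/ca.crt

# Pushing a locally built image then works over TLS
docker build -t registry.lab.x.y.z/my-app:dev .
docker push registry.lab.x.y.z/my-app:dev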

Benefits

This registry setup provides:

✅ Custom Image Storage: Build and store custom images locally
✅ Rate Limit Avoidance: Cache public images to avoid Docker Hub rate limits
✅ Faster Pulls: Local cache speeds up image pulls significantly
✅ Offline Capability: Cached images available even if upstream is down
✅ Automatic Usage: K3s nodes automatically use mirrors during installation
✅ TLS Security: All registry communication is encrypted

What We Achieved

With the foundation layer deployed, we now have:

✅ Load Balancing: MetalLB managing VIP distribution
✅ Ingress: Traefik routing HTTP/HTTPS traffic
✅ Storage: Longhorn providing persistent volumes
✅ Container Registries: Local registry and mirrors for image management
✅ DNS: CNAME-based routing for all services
✅ Automation: Complete Ansible-driven deployment
✅ Monitoring: All services accessible via friendly hostnames
✅ Redundancy: Services distributed across three nodes

The Road Ahead

The foundation is now solid. The next steps will be:

  1. Longhorn Access: Configure ingress to access Longhorn UI ✅
  2. Testing: Verify VIP failover and service resilience
  3. Service Migration: Move existing services to Kubernetes
  4. Monitoring Stack: Deploy Prometheus and Grafana ✅
  5. Container Registry: Local registry for custom images ✅

Lessons Learned

This phase taught me several important lessons:

  1. Webhooks Matter: Always wait for webhook endpoints to be ready before applying the custom resources they validate
  2. Namespace Organization: Dedicated namespaces improve clarity and maintainability
  3. Version Consistency: Never manually upgrade without proper procedures
  4. Layer Order: Infrastructure services must be stable before deploying workloads
  5. DNS Design: CNAME aliasing is cleaner than managing multiple A records
  6. Automation First: Infrastructure changes should always go through Ansible
  7. Registry Mirrors: Separate mirror instances needed for each upstream registry (Docker Registry v2 limitation)
  8. TLS Everywhere: Self-signed certificates require CA distribution to all clients (browsers, Docker, K3s)
  9. PXE Integration: Registry configuration must be available during node installation for automatic mirror usage

Conclusion

The cluster infrastructure is now deployed and stable. We have a solid base for running all workloads on the cluster. The combination of MetalLB, Traefik, Longhorn, and container registries provides all the essential primitives that applications and services will consume.

The foundation layer is complete and ready to support any workload we deploy to the cluster.


Check out the previous post to see how we built the physical cluster, or read the first post for the complete journey from the beginning.

