K8s Homelab: Deploying the Cluster Infrastructure
This is the fourth post in our “K8s Homelab” series. Check out the previous post to see how we deployed the physical K3s cluster with PXE boot and Butane configurations.
Building the Foundation Layer
With the physical K3s cluster up and running across our three Lenovo nodes, it was time to deploy the essential infrastructure services that would form the foundation for everything else. These services provide the core primitives that all workloads running on the cluster will depend on.
The foundation layer consists of four critical components:
- MetalLB - Load balancer for bare-metal Kubernetes
- Traefik - Ingress controller for HTTP/HTTPS traffic
- Longhorn - Distributed block storage for persistent volumes
- Container Registries - Local registry and mirrors for image management
Together, these four services provide the essential infrastructure layer that all applications and services running on the cluster will consume.
Challenge 1: Load Balancing Without a Cloud Provider
Kubernetes was designed for cloud environments where load balancers are a first-class primitive. In a bare-metal homelab there is no AWS ELB or Google Cloud Load Balancer, so a Service of type LoadBalancer never gets an external IP on its own. You need another way to expose services to the network.
The Solution: MetalLB with Layer 2 Mode
MetalLB provides the “missing” load balancer for bare-metal Kubernetes. It operates in two modes:
- Layer 2 Mode (what we used): Assigns IP addresses from a configured pool, advertises them via ARP
- BGP Mode: Integrates with your network’s BGP infrastructure
For a homelab, Layer 2 mode is simpler and perfectly adequate. MetalLB:
- Chooses one node as the “leader” to respond to ARP requests
- Automatically migrates the IP if that node fails
- Supports multiple IP pools for different services
The VIP Design
I decided to use a single Virtual IP (VIP) as the entry point for all cluster services:
192.168.X.254 → cluster.lab.x.y.z
This VIP becomes the single point of entry. All services get DNS CNAMEs that point to this VIP:
longhorn.lab.x.y.z → CNAME → cluster.lab.x.y.z
(the Traefik dashboard and any other service hostname follow the same pattern)
DNS on the OpenWRT router handles the hostname routing, while MetalLB ensures the VIP is always reachable even if nodes fail.
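To make the mechanics concrete, here is a minimal, purely illustrative sketch of how a Service claims an address from MetalLB. In our cluster it is Traefik's LoadBalancer Service that holds the VIP (shown later); the names below are placeholders, and the cluster-vip pool is defined in the next section.

# Illustrative only: a Service requesting the VIP from the cluster-vip pool
apiVersion: v1
kind: Service
metadata:
  name: example-lb
  annotations:
    metallb.universe.tf/address-pool: cluster-vip   # pick the pool explicitly
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.X.254                     # the shared cluster VIP
  selector:
    app: example
  ports:
    - port: 80
      targetPort: 8080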
MetalLB Implementation
# cluster/roles/metallb/defaults/main.yaml
metallb_namespace: metallb-system
metallb_version: v0.14.3
metallb_ip_pool_name: cluster-vip
metallb_ip_address: "{{ hostvars[groups['kubernetes_cluster'][0]].cluster_vip }}/32"
The MetalLB configuration creates an IPAddressPool and an L2Advertisement:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: cluster-vip
  namespace: metallb-system
spec:
  addresses:
    - 192.168.X.254/32
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: cluster-l2-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - cluster-vip
One critical issue we discovered: MetalLB’s webhook service needs endpoints to be in the “ready” state before you can apply these configurations. The playbook needed a wait condition:
- name: Wait for webhook service to have endpoints
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Endpoints
    name: metallb-webhook-service
    namespace: "{{ metallb_namespace }}"
  register: webhook_endpoints
  until: >
    webhook_endpoints.resources | length > 0
    and webhook_endpoints.resources[0].subsets | default([]) | length > 0
    and webhook_endpoints.resources[0].subsets[0].addresses | default([]) | length > 0
  retries: 60
  delay: 5
This waits for the endpoint to be in the addresses array (ready state) rather than notReadyAddresses (when the controller is starting).
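With the webhook ready, the rendered pool and advertisement manifests can be applied from the same playbook. A minimal sketch of that follow-up task, assuming the templated config is written to cluster/data/metallb-config.yaml as in the repository layout shown later:

- name: Apply MetalLB IPAddressPool and L2Advertisement
  kubernetes.core.k8s:
    state: present
    src: "{{ playbook_dir }}/../data/metallb-config.yaml"  # rendered from templates/config.yaml.j2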
Challenge 2: Namespace Organization and the Traefik Saga
The original Traefik deployment used the kube-system namespace. This bothered me from a cleanliness perspective. I wanted dedicated namespaces for each infrastructure component:
- metallb-system - Load balancer
- traefik-system - Ingress controller
- longhorn-system - Storage
After some consideration, I decided to migrate Traefik to its own namespace. This required:
- Removing the existing Traefik deployment
- Adding namespace creation to the Ansible playbook
- Updating default variables
- Redeploying with the new namespace
The migration went smoothly, and now each infrastructure component has clear ownership and isolation.
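In Ansible terms the migration boils down to removing the old release and making sure the new namespace exists before redeploying. A rough sketch, assuming the Helm release is simply named traefik:

- name: Remove the old Traefik release from kube-system
  kubernetes.core.helm:
    name: traefik
    release_namespace: kube-system
    state: absent

- name: Create the dedicated traefik-system namespace
  kubernetes.core.k8s:
    api_version: v1
    kind: Namespace
    name: traefik-system
    state: present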
Traefik Configuration
Traefik needed to use the MetalLB VIP and serve both HTTP and HTTPS:
# cluster/roles/ingress/templates/values.yaml.j2
service:
  type: LoadBalancer
  spec:
    loadBalancerIP: {{ traefik_loadbalancer_ip }}

ports:
  web:
    port: 80
    expose: {}
  websecure:
    port: 443
    expose: {}
Note the expose: {} syntax—the newer Traefik Helm chart expects an object, not a boolean.
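The actual rollout follows the same Helm-via-Ansible pattern as the other roles. A minimal sketch, assuming the values template above is rendered to cluster/data/traefik-values.yaml first:

- name: Deploy Traefik via Helm
  kubernetes.core.helm:
    name: traefik
    chart_ref: traefik/traefik
    release_namespace: traefik-system
    values_files:
      - "{{ playbook_dir }}/../data/traefik-values.yaml"
    wait: true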
Challenge 3: Distributed Storage with Longhorn
Longhorn provides distributed block storage for Kubernetes. Unlike cloud storage solutions that depend on an external provider's APIs, Longhorn runs entirely within the cluster. It:
- Creates replicated volumes across nodes
- Handles node failures gracefully
- Provides storage classes for different use cases
- Includes a web UI for management
Longhorn Configuration
Longhorn’s configuration includes several important settings:
# cluster/roles/longhorn/defaults/main.yaml
longhorn_namespace: longhorn-system
longhorn_version: v1.5.3
longhorn_data_path: /var/lib/longhorn
longhorn_replica_count: 3
longhorn_ingress_enabled: true
longhorn_ingress_host: longhorn.lab.x.y.z
longhorn_ingress_class: traefik
The Helm values file configures Longhorn’s default settings:
# cluster/roles/longhorn/templates/values.yaml.j2
defaultSettings:
  defaultDataPath: {{ longhorn_data_path }}
  defaultReplicaCount: {{ longhorn_replica_count }}
  guaranteedEngineManagerCPU: "250"
  guaranteedReplicaManagerCPU: "250"
  storageReservedPercentageForDefaultDisk: "5"
The key decisions here:
- Replica count of 3: With three nodes, storing three replicas means each node has one copy. This provides redundancy if one node fails.
- Storage reserved percentage: Set to 5% (instead of the default 30%) to maximize available storage for volumes. This is important because Longhorn reserves space for system operations and to prevent over-provisioning.
- Ingress enabled: Access the Longhorn UI through Traefik at longhorn.lab.x.y.z
- Data path: Store Longhorn data on dedicated HDDs (separate from the OS SSDs on each node)
Disk Tagging
Since our nodes have dedicated HDD disks for Longhorn storage (mounted at /var/lib/longhorn), we tag them with longhorn-hdd-raw to ensure they’re reserved for Longhorn use. The Ansible playbook automatically applies these tags:
# cluster/roles/longhorn/templates/node-disk-tags.yaml.j2
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: {{ node_item.name }}
  namespace: {{ longhorn_namespace }}
spec:
  disks:
    "{{ node_item.disk_key }}":
      tags:
        - longhorn-hdd-raw
The tag uses a longhorn- prefix for clarity, making it explicit that these disks are reserved for Longhorn storage.
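The tagging itself is a templated Node patch applied once per host. A sketch of the loop, where longhorn_nodes is a hypothetical list carrying each node's name and disk key:

- name: Apply Longhorn disk tags to each node
  kubernetes.core.k8s:
    state: present
    definition: "{{ lookup('template', 'node-disk-tags.yaml.j2') | from_yaml }}"
  loop: "{{ longhorn_nodes }}"   # hypothetical list of {name, disk_key} entries
  loop_control:
    loop_var: node_item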
Custom Storage Classes
Longhorn comes with a default storage class, but we created custom ones with disk selectors to ensure volumes are only created on our tagged HDD disks:
# cluster/roles/longhorn/templates/storage-classes.yaml.j2
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-hdd-raw-delete
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  diskSelector: "longhorn-hdd-raw"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-hdd-raw-retain
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  diskSelector: "longhorn-hdd-raw"
This setup ensures:
- Disk isolation: Volumes are only created on disks tagged with longhorn-hdd-raw
- Reclaim policies: Two storage classes - one for Delete (default) and one for Retain (for important data)
- Replica distribution: Three replicas spread across the three nodes for redundancy
The diskSelector parameter ensures that Longhorn only schedules volume replicas on our dedicated HDD disks, preventing accidental use of system SSDs.
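From a workload's perspective, using this storage is just a matter of naming the class in a PersistentVolumeClaim. A small illustrative example requesting a retained volume:

# Illustrative PVC backed by the tagged HDDs
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-hdd-raw-retain   # survives PVC deletion (Retain)
  resources:
    requests:
      storage: 5Gi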
Challenge 4: The Great Version Mismatch Mystery
Everything was going smoothly until we noticed instability: pods restarting, services not responding. Investigation revealed a critical issue:
lenovo1: v1.28.5+k3s1
lenovo2: v1.33.5+k3s1 ← The problem
lenovo3: v1.28.5+k3s1
Somehow, lenovo2 was running a version five minor releases ahead of the other nodes! This created severe compatibility issues:
- MetalLB speakers couldn’t communicate on port 7946 (member discovery)
- Traefik pods were crashing
- API compatibility mismatches throughout the cluster
The Root Cause: Manual Intervention
After examining logs and timestamps, we discovered that the K3s installer was manually re-run without specifying a version. The installer defaults to the latest version, resulting in the mismatch.
This incident highlighted the need for:
- Version pinning in inventory configuration
- Automated upgrades via Ansible
- Rolling upgrade strategy to maintain cluster health
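A cheap guardrail that would have caught this early: assert that every node reports the pinned version before touching anything else. A hedged sketch, assuming k3s_version is defined in the inventory:

- name: Read installed K3s version
  command: k3s --version
  register: k3s_version_output
  changed_when: false

- name: Assert the node matches the pinned version
  assert:
    that:
      - k3s_version in k3s_version_output.stdout
    fail_msg: "{{ inventory_hostname }} reports {{ k3s_version_output.stdout_lines[0] }}, expected {{ k3s_version }}"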
The Solution: Rolling Upgrade Playbook
We created an Ansible playbook to safely upgrade K3s with:
- Serial execution: Only one node at a time
- Cordon and drain: Move workloads before upgrading
- Health checks: Wait for nodes to be ready before proceeding
- Automatic uncordon: Restore scheduling after successful upgrade
# machines/playbooks/upgrade-k3s.yaml
- name: Rolling K3s Upgrade
  hosts: kubernetes_cluster
  serial: 1  # One node at a time
  vars:
    target_version: "{{ k3s_version }}"
  tasks:
    - name: Cordon node
      command: kubectl cordon {{ node_name }}

    - name: Drain node
      command: kubectl drain {{ node_name }} --ignore-daemonsets --force

    - name: Upgrade K3s
      shell: curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="{{ target_version }}" sh -

    - name: Wait for node ready
      command: kubectl get node {{ node_name }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      register: node_status
      until: node_status.stdout == "True"
      retries: 60
      delay: 5

    - name: Uncordon node
      command: kubectl uncordon {{ node_name }}
This playbook upgrades one node at a time, keeping workloads running and the cluster healthy throughout.
DNS Configuration: The Final Piece
With services deployed, we needed DNS resolution. The OpenWRT router’s DNSmasq configuration includes:
# A record for the cluster VIP
config domain
	option name 'cluster.lab.x.y.z'
	option ip '192.168.X.254'

# CNAME for each service
config cname
	option cname 'longhorn.lab.x.y.z'
	option target 'cluster.lab.x.y.z'
This allows accessing services by their friendly hostnames while only managing one actual IP address.
The Ansible Structure
All cluster infrastructure is organized in cluster/:
cluster/
├── playbooks/
│   └── deploy.yaml                  # Main playbook
├── roles/
│   ├── metallb/
│   │   ├── defaults/main.yaml
│   │   ├── tasks/main.yaml
│   │   └── templates/config.yaml.j2
│   ├── ingress/
│   │   ├── defaults/main.yaml
│   │   ├── tasks/main.yaml
│   │   └── templates/values.yaml.j2
│   ├── longhorn/
│   │   ├── defaults/main.yaml
│   │   ├── tasks/main.yaml
│   │   └── templates/
│   │       ├── values.yaml.j2
│   │       ├── storage-classes.yaml.j2
│   │       └── node-disk-tags.yaml.j2
│   └── registries/
│       ├── defaults/main.yaml
│       ├── tasks/
│       │   ├── install.yaml
│       │   └── uninstall.yaml
│       └── templates/
│           ├── container-image-registry-*.yaml.j2
│           └── container-image-mirror-*.yaml.j2
└── data/                            # Generated files (gitignored)
    ├── metallb-config.yaml
    ├── traefik-values.yaml
    ├── longhorn-values.yaml
    ├── storage-classes.yaml
    ├── node-disk-tags.yaml
    └── registry-*.yaml
Each role follows a consistent pattern:
- Add Helm repository
- Create namespace
- Template values/configuration
- Deploy via Helm
- Wait for readiness
- Apply custom resources (StorageClasses, etc.)
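As an illustration of that pattern, here is roughly what a role's tasks/main.yaml looks like, sketched for MetalLB (task names and paths are indicative rather than verbatim):

# Sketch of the per-role task flow (MetalLB shown)
- name: Add Helm repository
  kubernetes.core.helm_repository:
    name: metallb
    repo_url: https://metallb.github.io/metallb

- name: Create namespace
  kubernetes.core.k8s:
    api_version: v1
    kind: Namespace
    name: "{{ metallb_namespace }}"
    state: present

- name: Render configuration
  template:
    src: config.yaml.j2
    dest: "{{ playbook_dir }}/../data/metallb-config.yaml"

- name: Deploy chart and wait for readiness
  kubernetes.core.helm:
    name: metallb
    chart_ref: metallb/metallb
    chart_version: "{{ metallb_version }}"
    release_namespace: "{{ metallb_namespace }}"
    wait: true

- name: Apply custom resources
  kubernetes.core.k8s:
    state: present
    src: "{{ playbook_dir }}/../data/metallb-config.yaml"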
Python Installation: A Necessary Evil
One unexpected requirement: Fedora CoreOS doesn't include Python by default (it's a minimal immutable OS). This broke Ansible's fact gathering, which needs a Python interpreter on every managed host.
The solution: Add a Python installation service to the Butane configurations:
# In all .bu.j2 files
- path: /etc/systemd/system/python-install.service
  mode: 0644
  contents:
    inline: |
      [Unit]
      Description=Install Python for Ansible
      After=network-online.target
      Before=k3s-install.service

      [Service]
      Type=oneshot
      ExecStart=/usr/bin/rpm-ostree install python3

      [Install]
      WantedBy=multi-user.target
This layers python3 via rpm-ostree on first boot, and the layered package persists across Zincati-driven automatic updates.
Makefile Integration
The deployment is integrated into the main Makefile:
# Cluster Commands
.PHONY: cluster/deploy
cluster/deploy: machines/kubeconfig
	@echo "📦 Deploying cluster infrastructure..."
	ansible-playbook -i inventory.yaml cluster/playbooks/deploy.yaml
Running make cluster/deploy:
- Ensures kubeconfig exists
- Creates required directories
- Runs the infrastructure roles in sequence
- Deploys the complete foundation layer
Challenge 5: Container Image Registry and Mirrors
As the cluster grew, I needed a way to store custom container images and cache public images to avoid rate limiting and speed up pulls. The solution: deploy a local container registry and mirror registries for upstream sources.
The Registry Architecture
I deployed three registries in the registry-system namespace:
- Main Registry (registry): Stores custom-built images
  - Storage: 2Gi with Retain policy (data should persist)
  - Accessible at: registry.lab.x.y.z
- Docker Hub Mirror (docker-io): Caches images from docker.io
  - Storage: 2Gi with Delete policy (cache can be evicted)
  - Accessible at: docker-io.lab.x.y.z
- GHCR Mirror (ghcr-io): Caches images from ghcr.io
  - Storage: 2Gi with Delete policy (cache can be evicted)
  - Accessible at: ghcr-io.lab.x.y.z
All registries use Docker Registry v2 and are exposed via Traefik with TLS termination.
Registry Configuration
The registries are configured via Ansible with a flexible inventory structure:
# inventory.yaml
container_image_mirrors:
  - url: https://registry-1.docker.io
    registry: docker.io
    size: 2Gi
  - url: https://ghcr.io
    registry: ghcr.io
    size: 2Gi
The Ansible role automatically:
- Creates ConfigMaps with proxy configuration for each mirror
- Deploys Deployments with proper resource limits
- Creates Services and Ingress resources
- Generates TLS certificates using the CA role
- Configures the K3s registries.yaml for automatic mirror usage
K3s Integration: Automatic Mirror Usage
The key feature is automatic mirror usage during node installation. The registries.yaml file is generated and served via PXE boot:
# Generated registries.yaml
mirrors:
  "docker.io":
    endpoint:
      - "https://docker-io.registry-system.svc.cluster.local:5000"
  "ghcr.io":
    endpoint:
      - "https://ghcr-io.registry-system.svc.cluster.local:5000"
configs:
  "registry.registry-system.svc.cluster.local:5000":
    tls:
      ca_file: /etc/rancher/k3s/ssl/registry-ca.crt
  "docker-io.registry-system.svc.cluster.local:5000":
    tls:
      ca_file: /etc/rancher/k3s/ssl/registry-ca.crt
  "ghcr-io.registry-system.svc.cluster.local:5000":
    tls:
      ca_file: /etc/rancher/k3s/ssl/registry-ca.crt
This configuration:
- Routes docker.io pulls to the docker-io mirror
- Routes ghcr.io pulls to the ghcr-io mirror
- Falls back to upstream registries if mirrors are unavailable
- Uses TLS with the cluster CA certificate
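Because nodes consume registries.yaml at install time, the file has to be on disk before K3s starts. One way to get it there, sketched as a Butane storage entry consistent with the PXE setup from the previous post (the pxe_server variable and asset path are assumptions):

# Sketch: fetch the generated registries.yaml during provisioning
- path: /etc/rancher/k3s/registries.yaml
  mode: 0644
  contents:
    source: http://{{ pxe_server }}/assets/registries.yaml   # hypothetical asset URL on the PXE host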
Docker Registry v2 Proxy Limitations
Docker Registry v2’s proxy mode only supports a single upstream registry per instance. This is why we need separate mirror instances for docker.io and ghcr.io. Each mirror:
- Proxies requests to its upstream registry
- Caches blobs locally with configurable TTL (default: 168h)
- Returns cached blobs on subsequent requests
- Evicts cached blobs once their TTL expires, keeping the cache from growing without bound
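Under the hood, each mirror is a stock registry with a proxy section pointing at its single upstream. A minimal sketch of what the docker-io mirror's configuration might look like (the ttl knob assumes a registry build that supports it):

# Sketch: pull-through cache config for the docker-io mirror
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
proxy:
  remoteurl: https://registry-1.docker.io   # exactly one upstream per instance
  ttl: 168h                                 # cached content expires after a week
http:
  addr: :5000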
TLS Configuration
All registries use TLS certificates generated by the CA role:
- Certificates are stored in cluster/secrets/ca/registry/
- Each registry has its own certificate (registry.crt, docker-io.crt, ghcr-io.crt)
- Certificates are signed by the cluster CA
- The CA certificate is distributed to nodes during PXE boot
External Access
Registries are accessible both:
- Internally: via Kubernetes service DNS (registry.registry-system.svc.cluster.local:5000)
- Externally: via Traefik ingress (registry.lab.x.y.z:443)
External access requires installing the CA certificate on Docker Desktop or native Docker clients. The CA website (https://ca.lab.x.y.z/) provides installation instructions for both web browsers and Docker clients.
Benefits
This registry setup provides:
✅ Custom Image Storage: Build and store custom images locally
✅ Rate Limit Avoidance: Cache public images to avoid Docker Hub rate limits
✅ Faster Pulls: Local cache speeds up image pulls significantly
✅ Offline Capability: Cached images available even if upstream is down
✅ Automatic Usage: K3s nodes automatically use mirrors during installation
✅ TLS Security: All registry communication is encrypted
What We Achieved
With the foundation layer deployed, we now have:
✅ Load Balancing: MetalLB managing VIP distribution
✅ Ingress: Traefik routing HTTP/HTTPS traffic
✅ Storage: Longhorn providing persistent volumes
✅ Container Registries: Local registry and mirrors for image management
✅ DNS: CNAME-based routing for all services
✅ Automation: Complete Ansible-driven deployment
✅ Monitoring: All services accessible via friendly hostnames
✅ Redundancy: Services distributed across three nodes
The Road Ahead
The foundation is now solid. The next steps will be:
- Longhorn Access: Configure ingress to access Longhorn UI ✅
- Testing: Verify VIP failover and service resilience
- Service Migration: Move existing services to Kubernetes
- Monitoring Stack: Deploy Prometheus and Grafana ✅
- Container Registry: Local registry for custom images ✅
Lessons Learned
This phase taught me several important lessons:
- Webhooks Matter: Always wait for webhook endpoints to be ready before applying CRDs
- Namespace Organization: Dedicated namespaces improve clarity and maintainability
- Version Consistency: Never manually upgrade without proper procedures
- Layer Order: Infrastructure services must be stable before deploying workloads
- DNS Design: CNAME aliasing is cleaner than managing multiple A records
- Automation First: Infrastructure changes should always go through Ansible
- Registry Mirrors: Separate mirror instances needed for each upstream registry (Docker Registry v2 limitation)
- TLS Everywhere: Self-signed certificates require CA distribution to all clients (browsers, Docker, K3s)
- PXE Integration: Registry configuration must be available during node installation for automatic mirror usage
Conclusion
The cluster infrastructure is now deployed and stable. We have a solid base for running all workloads on the cluster. The combination of MetalLB, Traefik, Longhorn, and container registries provides all the essential primitives that applications and services will consume.
The foundation layer is complete and ready to support any workload we deploy to the cluster.
Check out the previous post to see how we built the physical cluster, or read the first post for the complete journey from the beginning.