K8s Homelab: Deploying the LGTM Observability Stack
This is the fifth post in our “K8s Homelab” series. Check out the previous post to see how we deployed MetalLB, Traefik, and Longhorn as the foundation layer.
From Infrastructure to Observability
With the cluster infrastructure in place—MetalLB providing load balancing, Traefik handling ingress, and Longhorn offering distributed storage—the cluster was functional but blind. I had no visibility into what was happening inside. Pods could crash, nodes could fail, services could degrade, and I’d have no idea until something broke catastrophically.
It was time to add observability. Not just basic monitoring, but a comprehensive observability stack that would give me insights into logs, metrics, and traces—the three pillars of observability.
Why LGTM: The Observability Stack Choice
After researching various observability solutions, I settled on the LGTM stack:
- Loki - Log aggregation, similar to Prometheus but for logs
- Grafana - Visualization and dashboards (the “G” in LGTM)
- Tempo - Distributed tracing backend
- Mimir - Long-term metrics storage (Prometheus-compatible)
This stack is designed to work together seamlessly, with Grafana as the unified interface for querying logs, metrics, and traces. Plus, it’s all open source and integrates perfectly with Kubernetes.
The missing piece was data collection. That’s where Grafana Alloy comes in—it’s the next-generation agent that can collect logs, metrics, and traces and forward them to the LGTM stack. It replaced Grafana Agent and provides a more modern, unified approach to data collection.
The Architecture: Observability at Scale
The observability architecture follows a clear pattern:
graph TB
subgraph "Kubernetes Cluster"
subgraph "Data Collection"
Alloy[Grafana Alloy<br/>DaemonSet]
end
subgraph "Data Backends"
Loki[Loki<br/>Log Storage]
Tempo[Tempo<br/>Trace Storage]
Mimir[Mimir<br/>Metric Storage]
end
subgraph "Visualization"
Grafana[Grafana<br/>Dashboards]
end
subgraph "Infrastructure"
Traefik[Traefik<br/>Metrics]
Longhorn[Longhorn<br/>Metrics]
MetalLB[MetalLB<br/>Metrics]
MetricsServer[Metrics Server<br/>K8s Metrics]
end
end
Alloy -->|Logs| Loki
Alloy -->|Traces| Tempo
Alloy -->|Metrics| Mimir
Alloy -->|Scrapes| Traefik
Alloy -->|Scrapes| Longhorn
Alloy -->|Scrapes| MetalLB
Alloy -->|Scrapes| MetricsServer
Grafana -->|Queries| Loki
Grafana -->|Queries| Tempo
Grafana -->|Queries| Mimir
Grafana Alloy runs as a DaemonSet on every node, collecting data from various sources:
- Kubernetes pod logs
- Node metrics (kubelet, cAdvisor)
- Service discovery for pods and services
- Infrastructure component metrics (Traefik, Longhorn, MetalLB)
Data Backends store the collected data:
- Loki stores logs with efficient compression
- Tempo stores distributed traces
- Mimir stores metrics with long-term retention
Grafana provides the unified interface for querying and visualizing all observability data.
Storage Strategy: Delete vs Retain
One of the key decisions was how to handle storage for different types of data:
- Logs, Metrics, Traces (lg-hdd-raw-x1-delete): Time-series data that can be regenerated. If we lose historical data, we can start fresh. A single replica is sufficient for a homelab, and the Delete reclaim policy means we don't waste storage on old data.
- Grafana Dashboards/Configs (lg-hdd-raw-x3): Configuration data that must persist. Three replicas provide redundancy, and the Retain policy ensures we never accidentally lose our dashboards and configurations.
This storage strategy balances cost (single replica for ephemeral data) with reliability (triple replica for critical configs).
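For reference, here is a minimal sketch of what the two Longhorn StorageClass definitions might look like. The class names match the ones above; the provisioner, replica counts, and reclaim policies follow what this post describes (including the dataEngine parameter covered later), but the exact template in storage-classes.yaml.j2 may differ:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lg-hdd-raw-x1-delete
provisioner: driver.longhorn.io
reclaimPolicy: Delete              # old data is dropped when the PVC is deleted
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"            # single replica for ephemeral observability data
  dataEngine: "v1"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lg-hdd-raw-x3
provisioner: driver.longhorn.io
reclaimPolicy: Retain              # volumes survive PVC deletion
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"            # triple replica for critical configuration data
  dataEngine: "v1"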
Challenge 1: Enabling Metrics in Existing Roles
Before deploying the LGTM stack, I needed to ensure that existing infrastructure components were exposing metrics. This required modifications to three roles:
Traefik Metrics
File: cluster/roles/ingress/templates/values.yaml.j2
I added Prometheus metrics configuration to Traefik:
metrics:
prometheus:
enabled: true
port: 8080
path: /metrics
Traefik exposes metrics on port 8080 at /metrics, which Grafana Alloy can scrape.
Longhorn Metrics
Longhorn manager exposes metrics by default on port 9500. I verified this was working and didn’t need any changes—the metrics endpoint was already available at longhorn-manager.longhorn-system.svc.cluster.local:9500.
MetalLB Metrics
MetalLB exposes metrics by default on port 7472 for both the controller and speaker components. Again, no changes were needed—the endpoints were already available.
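Before pointing Alloy at these endpoints, it's worth confirming they actually answer. A quick check from a throwaway pod, assuming the service names and ports used later in the Alloy configuration:

# Probe each infrastructure metrics endpoint from inside the cluster
kubectl run metrics-check --rm -it --image=curlimages/curl --restart=Never -- sh -c '
  curl -s http://traefik.traefik-system.svc.cluster.local:8080/metrics | head -n 5
  curl -s http://longhorn-manager.longhorn-system.svc.cluster.local:9500/metrics | head -n 5
  curl -s http://metallb-controller.metallb-system.svc.cluster.local:7472/metrics | head -n 5
'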
Challenge 2: Metrics Server Deployment
K3s bundles metrics-server, but it was disabled in this cluster's installation, so I needed to deploy it separately to expose Kubernetes resource metrics (CPU, memory) for pods and nodes.
File: cluster/roles/metrics-server/defaults/main.yaml
---
# Metrics-server role default variables
metrics_server_namespace: metrics-server-system
metrics_server_version: 0.7.0
File: cluster/roles/metrics-server/tasks/main.yaml
The metrics-server role follows the same pattern as other roles:
- Add Helm repository
- Create namespace
- Template values
- Deploy via Helm
The key configuration is in values.yaml.j2:
args:
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
The --kubelet-insecure-tls flag is necessary for K3s, which uses self-signed certificates.
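A minimal sketch of what those four steps might look like in the role's tasks/main.yaml, assuming the kubernetes.core collection is available; the task names, repository URL, and temporary values path are illustrative and may differ from the actual role:

- name: Add metrics-server Helm repository
  kubernetes.core.helm_repository:
    name: metrics-server
    repo_url: https://kubernetes-sigs.github.io/metrics-server/

- name: Create metrics-server namespace
  kubernetes.core.k8s:
    name: "{{ metrics_server_namespace }}"
    api_version: v1
    kind: Namespace
    state: present

- name: Template metrics-server values
  ansible.builtin.template:
    src: values.yaml.j2
    dest: /tmp/metrics-server-values.yaml

- name: Deploy metrics-server via Helm
  kubernetes.core.helm:
    name: metrics-server
    chart_ref: metrics-server/metrics-server
    chart_version: "{{ metrics_server_version }}"
    release_namespace: "{{ metrics_server_namespace }}"
    values_files:
      - /tmp/metrics-server-values.yaml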
Challenge 3: LGTM Stack Deployment
The LGTM role is the most complex, deploying five components in sequence:
Component Configuration
File: cluster/roles/lgtm/defaults/main.yaml
lgtm_namespace: lgtm-system
# Component versions (latest stable as of January 2025)
# These are Helm chart versions, not app versions
loki_version: "6.46.0"
grafana_version: "10.1.4"
tempo_version: "1.24.0"
mimir_version: "6.0.1"
alloy_version: "1.4.0"
# Storage classes
lgtm_data_storage_class: lg-hdd-raw-x1-delete
grafana_storage_class: lg-hdd-raw-x1-delete
# Grafana configuration
grafana_hostname: grafana.lab.x.y.z
grafana_ingress_class: traefik
# Grafana admin password should be set in cluster/secrets/grafana.yaml (git-crypt encrypted)
# This is a fallback default - will be overridden by cluster/secrets/grafana.yaml if it exists
grafana_admin_password: admin
# Mimir multi-tenancy configuration
mimir_org_id: "homelab"
# Storage sizes (conservative for homelab)
loki_storage_size: 50Gi
tempo_storage_size: 50Gi
mimir_storage_size: 50Gi
grafana_storage_size: 10Gi
# Retention periods (data rotation)
loki_retention_period: 168h # 7 days
tempo_retention_period: 360h # 15 days
mimir_retention_period: 720h # 30 days
The Grafana admin password is stored in cluster/secrets/grafana.yaml (git-crypted for security).
Deployment Order
The components are deployed in a specific order:
- Loki - Log storage backend
- Tempo - Trace storage backend
- Mimir - Metrics storage backend
- Grafana Alloy - Data collection (needs backends to be ready)
- Grafana - Visualization (needs backends and data sources configured)
We rely on Kubernetes reconciliation—no explicit wait conditions. Components will retry connections automatically until their dependencies are ready.
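In practice this just means each Helm release is installed without blocking on readiness, so Ansible moves on immediately and the pods sort themselves out. A hedged sketch of the pattern (task and chart names are illustrative):

- name: Deploy Loki (no wait - pods reconcile on their own)
  kubernetes.core.helm:
    name: loki
    chart_ref: grafana/loki
    chart_version: "{{ loki_version }}"
    release_namespace: "{{ lgtm_namespace }}"
    values_files:
      - /tmp/loki-values.yaml
    wait: false   # rely on Kubernetes reconciliation instead of blocking the playbook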
Loki Configuration
File: cluster/roles/lgtm/templates/loki-values.yaml.j2
Loki runs in distributed mode with separate backend and write components. The configuration includes:
loki:
auth_enabled: false
storage:
bucketNames:
chunks: chunks
ruler: ruler
admin: admin
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
gateway:
enabled: true
storage:
type: filesystem
filesystem:
directory: /loki/chunks
limits_config:
retention_period: {{ loki_retention_period }}
persistence:
enabled: true
storageClassName: {{ lgtm_data_storage_class }}
size: {{ loki_storage_size }}
# Backend persistence for distributed mode
backend:
persistence:
enabled: true
volumeClaimsEnabled: true
storageClass: {{ lgtm_data_storage_class }}
size: {{ loki_storage_size }}
# Write persistence for distributed mode
write:
persistence:
enabled: true
volumeClaimsEnabled: true
storageClass: {{ lgtm_data_storage_class }}
size: {{ loki_storage_size }}
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Key points:
- Schema Config: Required for Loki to define storage schema and index configuration
- Distributed Mode: Uses separate backend and write StatefulSets for scalability
- Persistence: Both backend and write components need volumeClaimsEnabled: true and use storageClass (not storageClassName) for StatefulSet volumeClaimTemplates
- Gateway: Provides a unified entry point for log ingestion
Tempo Configuration
File: cluster/roles/lgtm/templates/tempo-values.yaml.j2
tempo:
storage:
trace: # Note: 'trace' not 'traces' - this is the correct field name
backend: local
local:
path: /var/tempo/traces
compactor:
compaction:
retention: {{ tempo_retention_period }}
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
persistence:
enabled: true
storageClassName: {{ lgtm_data_storage_class }}
size: {{ tempo_storage_size }}
Tempo runs in single mode with local filesystem storage. For a homelab, this is simpler than distributed mode and perfectly adequate.
Important: The storage field is trace (singular), not traces (plural). This is a common mistake that causes configuration validation errors.
Mimir Configuration
File: cluster/roles/lgtm/templates/mimir-values.yaml.j2
Mimir runs in distributed mode with multiple components (distributor, ingester, querier, queryFrontend, compactor, storeGateway). Each component that needs storage requires its own persistence configuration:
# Configure Mimir for single mode (1 replica of each component)
distributor:
replicas: 1
ingester:
replicas: 1
persistentVolume: # Note: 'persistentVolume' not 'persistence'
enabled: true
storageClass: {{ lgtm_data_storage_class }} # Note: 'storageClass' not 'storageClassName'
size: {{ mimir_storage_size }}
querier:
replicas: 1
queryFrontend:
replicas: 1
compactor:
replicas: 1
persistentVolume:
enabled: true
storageClass: {{ lgtm_data_storage_class }}
size: {{ mimir_storage_size }}
store_gateway: # Note: snake_case, not camelCase
replicas: 1
persistentVolume:
enabled: true
storageClass: {{ lgtm_data_storage_class }}
size: {{ mimir_storage_size }}
# Disable ruler and alertmanager - using Grafana unified alerting instead
ruler:
replicas: 0
alertmanager:
replicas: 0
# Disable MinIO - we use filesystem storage
minio:
enabled: false
# Disable Kafka - not needed for single mode
kafka:
enabled: false
# Disable rollout operator - webhooks cause TLS certificate issues
# The rollout operator is mainly for zero-downtime rollouts in distributed mode
rolloutOperator:
enabled: false
mimir:
structuredConfig:
target: all
# Enable tenant federation to allow querying across multiple tenants
tenant_federation:
enabled: true
blocks_storage:
backend: filesystem
filesystem:
dir: /data/blocks
tsdb:
dir: /data/tsdb
common:
storage:
filesystem:
dir: /data/blocks
ingest_storage:
enabled: false
ingester:
# Enable Push gRPC method when ingest_storage is disabled
# This is required for Mimir v3.0.0+
push_grpc_method_enabled: true
compactor:
data_dir: /data/compactor
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
Critical configuration points:
- persistentVolume vs persistence: Mimir uses persistentVolume (not persistence) for StatefulSet components
- storageClass vs storageClassName: Use storageClass (not storageClassName) for StatefulSet volumeClaimTemplates
- store_gateway naming: Must use snake_case store_gateway, not camelCase storeGateway
- Ruler and Alertmanager disabled: Set to replicas: 0 since we use Grafana Unified Alerting instead of Mimir's built-in alerting
- Storage path separation: All components (blocks_storage, tsdb, compactor) must use different paths under /data to avoid conflicts
- Ingester Push gRPC: When ingest_storage.enabled: false, you must set ingester.push_grpc_method_enabled: true for Mimir v3.0.0+
- Rollout operator: Disabled to avoid webhook TLS certificate issues in homelab environments
- Multi-tenancy: Enabled by default, with tenant federation enabled to allow querying across multiple tenants
Grafana Alloy Configuration
File: cluster/roles/lgtm/templates/alloy-values.yaml.j2
Grafana Alloy configuration is the most complex, defining comprehensive scraping with multi-tenancy support:
alloy:
configMap:
create: true
content: |
logging {
level = "debug"
format = "logfmt"
}
// Prometheus remote write to Mimir
// Using cluster-specific orgId for multi-tenancy
prometheus.remote_write "mimir" {
endpoint {
url = "http://mimir-distributor:8080/api/v1/push"
headers = {
"X-Scope-OrgID" = "{{ mimir_org_id }}",
}
}
}
// Static service scraping
prometheus.scrape "traefik" {
targets = [{
__address__ = "traefik.traefik-system.svc.cluster.local:8080",
}]
forward_to = [prometheus.remote_write.mimir.receiver]
}
prometheus.scrape "longhorn" {
targets = [{
__address__ = "longhorn-manager.longhorn-system.svc.cluster.local:9500",
}]
forward_to = [prometheus.remote_write.mimir.receiver]
}
prometheus.scrape "metallb_controller" {
targets = [{
__address__ = "metallb-controller.metallb-system.svc.cluster.local:7472",
}]
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Discover Kubernetes nodes for kubelet scraping
discovery.kubernetes "kubelet_nodes" {
role = "node"
}
// Relabel to set correct address for kubelet scraping
// Filter to only scrape the node where this Alloy pod is running
discovery.relabel "kubelet_local" {
targets = discovery.kubernetes.kubelet_nodes.targets
// Only keep targets for the node where this pod is running
rule {
source_labels = ["__meta_kubernetes_node_name"]
regex = env("NODE_NAME")
action = "keep"
}
// Set the scrape address to node IP:10250
rule {
source_labels = ["__meta_kubernetes_node_address_InternalIP"]
target_label = "__address__"
replacement = "${1}:10250"
}
rule {
target_label = "__scheme__"
replacement = "https"
}
rule {
target_label = "__metrics_path__"
replacement = "/metrics/cadvisor"
}
}
// Scrape cAdvisor metrics from kubelets (each pod scrapes only its local node)
prometheus.scrape "kubelet_cadvisor" {
targets = discovery.relabel.kubelet_local.output
forward_to = [prometheus.remote_write.mimir.receiver]
scrape_interval = "15s"
scrape_timeout = "10s"
bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
scheme = "https"
tls_config {
insecure_skip_verify = true
}
}
// Self-monitoring
prometheus.exporter.self "alloy" {
}
prometheus.scrape "alloy" {
targets = prometheus.exporter.self.alloy.targets
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Loki log collection
loki.source.file "kubectl_logs" {
targets = [
{
__path__ = "/var/log/pods/*/*/*.log",
job = "kubernetes/pods",
},
]
forward_to = [loki.write.loki.receiver]
}
loki.write "loki" {
endpoint {
url = "http://loki-gateway:80/loki/api/v1/push"
headers = {
"X-Scope-OrgID" = "{{ mimir_org_id }}",
}
}
}
// OpenTelemetry traces
otelcol.receiver.otlp "tempo" {
grpc {
endpoint = "0.0.0.0:4317"
}
http {
endpoint = "0.0.0.0:4318"
}
output {
traces = [otelcol.exporter.otlp.tempo.input]
}
}
otelcol.exporter.otlp "tempo" {
client {
endpoint = "tempo:4317"
}
}
extraEnv:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: NODE_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
This configuration:
- Multi-tenancy: Sends metrics and logs with the X-Scope-OrgID header set to {{ mimir_org_id }} (defaults to "homelab")
- Node-local scraping: Each Alloy pod only scrapes its local node's kubelet to avoid out-of-order samples and reduce network traffic
- Kubernetes node discovery: Uses discovery.kubernetes to discover nodes, then filters to the pod's local node using env("NODE_NAME")
- cAdvisor metrics: Scrapes container metrics from the kubelet's /metrics/cadvisor endpoint
- Infrastructure metrics: Scrapes Traefik, Longhorn, and MetalLB metrics
- Log collection: Collects all pod logs from /var/log/pods
- Distributed tracing: Receives OpenTelemetry traces on ports 4317 (gRPC) and 4318 (HTTP)
- Self-monitoring: Monitors Alloy’s own metrics
Key improvements:
- Mimir remote write URL: Changed from mimir-gateway:8080/prometheus/api/v1/push to mimir-distributor:8080/api/v1/push (the distributor is the correct endpoint for ingestion)
- Node filtering: Prevents duplicate metrics and out-of-order sample errors by ensuring each Alloy pod only scrapes its local node
- Multi-tenancy headers: All data sent to Mimir and Loki includes the cluster-specific orgId
Grafana Configuration
File: cluster/roles/lgtm/templates/grafana-values.yaml.j2
admin:
existingSecret: grafana-admin-secret
passwordKey: admin-password
persistence:
enabled: true
storageClassName: {{ grafana_storage_class }}
size: {{ grafana_storage_size }}
ingress:
enabled: true
ingressClassName: {{ grafana_ingress_class }}
hosts:
- {{ grafana_hostname }}
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki-gateway:80
uid: loki-{{ mimir_org_id }}
isDefault: true
editable: true
jsonData:
# Add X-Scope-OrgID header for Loki multi-tenancy
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
# Set the orgId value for this cluster's datasource
httpHeaderValue1: "{{ mimir_org_id }}"
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3100
uid: tempo-{{ mimir_org_id }}
editable: true
# Note: Tempo may not support X-Scope-OrgID directly via OTLP
# But we configure it for consistency with the multi-tenant setup
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "{{ mimir_org_id }}"
- name: Mimir
type: prometheus
access: proxy
url: http://mimir-gateway:80/prometheus
uid: mimir-{{ mimir_org_id }}
editable: true
jsonData:
httpMethod: POST
# Add X-Scope-OrgID header for Mimir multi-tenancy
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
# Set the orgId value for this cluster's datasource
httpHeaderValue1: "{{ mimir_org_id }}"
version: 1
withCredentials: false
isDefault: false
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: false
options:
path: /var/lib/grafana/dashboards
extraConfigmapMounts:
- name: grafana-dashboards
mountPath: /var/lib/grafana/dashboards
configMap: grafana-dashboards
readOnly: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Key points:
- Admin password: Comes from a Kubernetes Secret (created from cluster/secrets/grafana.yaml)
- Storage: Uses the lg-hdd-raw-x1-delete storage class for persistence (updated from x3-retain)
- Ingress: Configured for grafana.lab.x.y.z
- Multi-tenancy: All datasources include the X-Scope-OrgID header with the cluster-specific orgId
- Datasource UIDs: Include the orgId in the UID (e.g., mimir-homelab, loki-homelab) for proper identification
- Dashboard provisioning: ConfigMaps can be mounted to automatically provision dashboards
- Loki default: Loki is set as the default data source for log queries
Challenge 4: Storage Class Configuration Issues
One of the most challenging issues during deployment was getting persistent storage correctly configured for StatefulSets. The Helm charts use different structures for different components, and getting the storage class names right was critical.
Longhorn Storage Class Fix
Longhorn 1.10.0 requires the dataEngine: "v1" parameter in StorageClass definitions for CSI storage capacity tracking to work correctly. Without this, Kubernetes cannot see available Longhorn storage, causing pods to remain in Pending state with “no persistent volumes available” errors.
File: cluster/roles/longhorn/templates/storage-classes.yaml.j2
parameters:
numberOfReplicas: "1"
dataEngine: "v1" # Required for Longhorn 1.10.0+ CSI capacity tracking
Helm Chart Structure Differences
Different Helm charts use different field names for persistence configuration:
- Loki: Uses backend.persistence.storageClass and write.persistence.storageClass with volumeClaimsEnabled: true
- Mimir: Uses component.persistentVolume.storageClass (note: persistentVolume, not persistence)
- Grafana: Uses persistence.storageClassName (standard Deployment)
- Tempo: Uses persistence.storageClassName (standard StatefulSet)
The key difference is that StatefulSets use volumeClaimTemplates which cannot be patched after creation. If the storage class is wrong, you must delete the StatefulSet and its PVCs to recreate them with the correct configuration.
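When that happens, the recovery is blunt but effective: delete the StatefulSet and its PVCs, fix the values, and redeploy. A sketch of the idea; the resource and label names below are illustrative, so check yours first with kubectl get sts,pvc -n lgtm-system:

# Example: recreate the Loki write StatefulSet with a corrected storage class
kubectl -n lgtm-system delete statefulset loki-write
kubectl -n lgtm-system delete pvc -l app.kubernetes.io/component=write
# Then re-run the Helm deployment so the StatefulSet and PVCs are recreated
make cluster/deploy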
Mimir Rollout Operator Webhook Issues
The Mimir rollout operator creates webhooks for zero-downtime rollouts, but these webhooks had TLS certificate issues in our homelab setup. The certificates were valid for rollout-operator.lgtm-system.svc, but the webhooks referenced mimir-rollout-operator.lgtm-system.svc, causing validation failures.
Solution: Disable the rollout operator for homelab use:
rolloutOperator:
enabled: false
This is safe for single-node deployments where zero-downtime rollouts are less critical.
Challenge 5: Configuration Validation Errors
Several configuration validation errors required careful attention:
Tempo Storage Configuration
The Tempo Helm chart expects storage.trace (singular), not storage.traces (plural). This caused validation errors on startup.
Mimir Ingester Configuration
Mimir v3.0.0+ requires ingester.push_grpc_method_enabled: true when ingest_storage.enabled: false. Without this, ingester pods crash with:
error validating config: cannot disable Push gRPC method in ingester,
while ingest storage (-ingest-storage.enabled) is not enabled
Mimir Storage Path Conflicts
Mimir requires all components to use non-overlapping filesystem paths. The initial configuration used /var/mimir/data for multiple components, causing conflicts. The solution is to use separate paths under /data:
blocks_storage:
filesystem:
dir: /data/blocks
tsdb:
dir: /data/tsdb
common:
storage:
filesystem:
dir: /data/blocks
ruler_storage:
filesystem:
dir: /data/ruler-storage
alertmanager:
data_dir: /data/alertmanager
compactor:
data_dir: /data/compactor
ruler:
rule_path: /data/ruler-rules
Each component must have its own unique path to avoid “directory cannot overlap” errors.
Grafana Datasource URL
The Mimir datasource URL must point to the gateway for queries:
url: http://mimir-gateway:80/prometheus # Not http://mimir:9009
However, for ingestion (remote write), Alloy must use the distributor:
url = "http://mimir-distributor:8080/api/v1/push" # Distributor for ingestion
The gateway is for queries, the distributor is for ingestion.
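A quick way to sanity-check the split, assuming the in-cluster service names above and Mimir's Prometheus-compatible query API exposed under /prometheus (run from a pod in the lgtm-system namespace, or use the full svc.cluster.local names):

# Queries go through the gateway...
curl -s -H "X-Scope-OrgID: homelab" \
  "http://mimir-gateway:80/prometheus/api/v1/query?query=up" | head -c 200
# ...while Alloy pushes samples to the distributor's remote-write endpoint
# (it expects snappy-compressed protobuf POSTs, so a plain GET only confirms the route exists)
curl -s -o /dev/null -w "%{http_code}\n" "http://mimir-distributor:8080/api/v1/push"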
Challenge 6: Kubernetes Reconciliation
One of the key design decisions was to rely on Kubernetes reconciliation instead of explicit wait conditions. This means:
- Components deploy in sequence but don’t wait for dependencies
- Kubernetes automatically retries connections
- Grafana retries data source connections until backends are ready
- Alloy retries remote write connections until Mimir is ready
This approach is simpler and more resilient—if a component restarts, it will automatically reconnect.
Challenge 7: Self-Signed CA Certificates for HTTPS
After deploying the LGTM stack and other services, I wanted to enable HTTPS for all internal services. While Let’s Encrypt is the standard solution for public-facing services, it requires public DNS resolution and can be complex to set up with DNS-01 challenges. For a homelab environment, a self-signed Certificate Authority (CA) provides a simpler, more flexible solution.
Why Self-Signed CA?
Benefits:
- No External Dependencies: No need for public DNS or external certificate authorities
- Full Control: Complete control over certificate validity periods and configurations
- Internal Network Focus: Perfect for services only accessible within the home network
- Simple Setup: No webhook integrations or DNS API credentials required
Trade-offs:
- Manual Trust: Users must manually install the CA certificate in their browsers/OS
- No Public Trust: Certificates won’t be trusted by default (expected for internal services)
CA Certificate Generation
The first step was creating our own Certificate Authority. All CA files are stored in cluster/secrets/ca/ and encrypted with git-crypt:
File: .gitattributes
cluster/secrets/ca/* filter=git-crypt diff=git-crypt
CA Root Certificate Generation:
# Generate CA private key
openssl genrsa -out cluster/secrets/ca/ca.key 4096
# Generate CA certificate (valid for 10 years)
openssl req -new -x509 -days 3650 -key cluster/secrets/ca/ca.key \
-out cluster/secrets/ca/ca.crt \
-subj "/CN=Home Lab CA/O=Home Lab/C=FR"
Service Certificate Generation:
For each service (Grafana, Longhorn, ntfy, and the CA service itself), we generate certificates with Subject Alternative Names (SAN) and proper extensions:
# Example: Grafana certificate
openssl genrsa -out cluster/secrets/ca/grafana.key 2048
# Create certificate signing request
openssl req -new -key cluster/secrets/ca/grafana.key \
-out cluster/secrets/ca/grafana.csr \
-subj "/CN=grafana.lab.x.y.z"
# Generate certificate with SAN and extensions
openssl x509 -req -in cluster/secrets/ca/grafana.csr \
-CA cluster/secrets/ca/ca.crt \
-CAkey cluster/secrets/ca/ca.key \
-CAcreateserial \
-out cluster/secrets/ca/grafana.crt \
-days 3650 \
-extensions v3_req \
-extfile <(echo -e "[v3_req]\nkeyUsage = keyEncipherment, dataEncipherment\nextendedKeyUsage = serverAuth\nsubjectAltName = @alt_names\n[alt_names]\nDNS.1 = grafana.lab.x.y.z")
# Repeat for other services:
# - longhorn.lab.x.y.z
# - ntfy.lab.x.y.z
# - ca.lab.x.y.z
Key Requirements:
- Subject Alternative Names (SAN): Required for modern browsers to trust certificates
- Key Usage Extensions: keyEncipherment, dataEncipherment, and serverAuth are essential
- Certificate Chain: Service certificates must include the CA certificate in the chain for browsers to trust them (a quick verification sketch follows below)
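Before loading a certificate into a Kubernetes secret, these requirements are easy to verify locally; the paths follow the layout above:

# Check that the service certificate chains back to our CA
openssl verify -CAfile cluster/secrets/ca/ca.crt cluster/secrets/ca/grafana.crt
# Confirm the SAN and extended key usage made it into the certificate
openssl x509 -in cluster/secrets/ca/grafana.crt -noout -text \
  | grep -A1 -E "Subject Alternative Name|Extended Key Usage"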
Kubernetes TLS Secret Creation
Each service needs a Kubernetes TLS secret containing both the service certificate and the CA certificate (for the full chain). The Ansible role loads the certificates and creates the secrets:
File: cluster/roles/lgtm/tasks/main.yaml
- name: Load Grafana certificate
slurp:
src: "{{ playbook_dir }}/../secrets/ca/grafana.crt"
register: grafana_cert_file
- name: Load CA certificate for chain
slurp:
src: "{{ playbook_dir }}/../secrets/ca/ca.crt"
register: ca_cert_file
- name: Load Grafana private key
slurp:
src: "{{ playbook_dir }}/../secrets/ca/grafana.key"
register: grafana_key_file
- name: Create Grafana TLS secret with certificate chain
kubernetes.core.k8s:
state: present
resource_definition:
apiVersion: v1
kind: Secret
metadata:
name: grafana-tls
namespace: "{{ lgtm_namespace }}"
type: kubernetes.io/tls
stringData:
tls.crt: "{{ grafana_cert_file.content | b64decode }}{{ ca_cert_file.content | b64decode }}"
tls.key: "{{ grafana_key_file.content | b64decode }}"
Critical Point: The tls.crt field must contain the full certificate chain—first the service certificate, then the CA certificate. Browsers require the complete chain to establish trust.
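A simple way to confirm the chain actually ended up in the secret, assuming the grafana-tls secret in lgtm-system as above:

# Count the certificates stored in tls.crt - we expect 2 (service cert followed by CA cert)
kubectl -n lgtm-system get secret grafana-tls -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | grep -c "BEGIN CERTIFICATE"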
CA Certificate Distribution Service
To make it easy for users to install the CA certificate, we deployed a lightweight Nginx service that serves the CA certificate and installation instructions:
File: cluster/roles/ca/templates/ca-deployment.yaml.j2
The CA service includes:
- Nginx Deployment: Lightweight static web server
- HTML Download Page: Instructions for installing the CA certificate on different platforms (Chrome, Firefox, macOS, Windows)
- CA Certificate Endpoint: Direct download link at /ca.crt
- HTTPS Access: The service itself uses HTTPS with a certificate from our CA
File: cluster/roles/ca/templates/nginx.conf.j2
server {
listen 80;
server_name _;
root /usr/share/nginx/html;
index index.html;
# Serve CA certificate file
location /ca.crt {
alias /etc/ssl/ca/ca.crt;
default_type application/x-x509-ca-cert;
add_header Content-Disposition 'attachment; filename="home-lab-ca.crt"';
}
location / {
try_files $uri $uri/ /index.html;
}
}
The CA service is accessible at https://ca.lab.x.y.z, providing a user-friendly way to download and install the root CA certificate.
Ingress TLS Configuration
All services were updated to use HTTPS with the self-signed certificates:
Grafana Ingress:
ingress:
enabled: true
ingressClassName: traefik
hosts:
- grafana.lab.x.y.z
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
tls:
- hosts:
- grafana.lab.x.y.z
secretName: grafana-tls
Longhorn Ingress (with WebSocket support):
ingress:
enabled: true
ingressClassName: traefik
host: longhorn.lab.x.y.z
tls: true
tlsSecretName: longhorn-tls
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/websocket: "true" # Required for Longhorn UI
Important: Longhorn’s UI uses WebSocket connections, which require the traefik.ingress.kubernetes.io/websocket: "true" annotation. Without this, WebSocket connections fail with HTTP 200 responses instead of proper WebSocket upgrades (HTTP 101).
Certificate Chain Validation
One of the most common issues was browsers showing “Potential Security Risk Ahead” warnings even after installing the CA certificate. This was caused by:
- Missing Certificate Chain: The TLS secret only contained the service certificate, not the full chain
- Missing SAN Extensions: Certificates without Subject Alternative Names are rejected by modern browsers
- Incorrect Key Usage: Missing serverAuth extension
Solution: Ensure all service certificates include:
- SAN extensions with the correct DNS names
- Proper key usage extensions (keyEncipherment, dataEncipherment, serverAuth)
- Full certificate chain in the Kubernetes TLS secret (service cert + CA cert)
Longhorn Ingress Patching
The Longhorn Helm chart generates its own ingress with a default TLS secret name. We need to patch it after deployment to use our custom TLS secret:
File: cluster/roles/longhorn/tasks/main.yaml
- name: Patch Longhorn ingress to use correct TLS secret and WebSocket support
kubernetes.core.k8s:
state: present
definition:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: longhorn-ingress
namespace: "{{ longhorn_namespace }}"
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/websocket: "true"
spec:
tls:
- hosts:
- "{{ longhorn_ingress_host }}"
secretName: longhorn-tls
This ensures the Longhorn ingress uses our custom TLS secret and has WebSocket support enabled.
DNS Configuration
All services are accessible via DNS names configured in the router:
File: network/roles/router/templates/network/dhcp.conf
# Cluster services
cname=longhorn.lab.x.y.z,cluster.lab.x.y.z
cname=grafana.lab.x.y.z,cluster.lab.x.y.z
cname=ntfy.lab.x.y.z,cluster.lab.x.y.z
cname=ca.lab.x.y.z,cluster.lab.x.y.z
All services resolve to the cluster VIP (cluster.lab.x.y.z), which is handled by MetalLB and Traefik.
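Resolution can be checked from any machine on the home network:

# Each service name is a CNAME to the cluster VIP record
dig +short grafana.lab.x.y.z
# Expected output: cluster.lab.x.y.z. followed by the VIP that MetalLB announces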
User Experience
After deploying the CA service and installing the root CA certificate:
- Visit CA Service: Navigate to https://ca.lab.x.y.z
- Download Certificate: Click the download button to get home-lab-ca.crt
- Install in Browser/OS: Follow platform-specific instructions (Chrome, Firefox, macOS, Windows)
- Access Services: All services (grafana, longhorn, ntfy) now load without security warnings
Security Considerations
- Private Keys: All private keys are stored in cluster/secrets/ca/ and encrypted with git-crypt
- Certificate Validity: The CA certificate is valid for 10 years; service certificates are also valid for 10 years
- Internal Use Only: These certificates are only valid for internal network services
- No Public Trust: This is expected and acceptable for homelab environments
Challenge 8: Alert Notifications with ntfy
Having metrics and alerts is only useful if you’re notified when something goes wrong. I set up ntfy as a self-hosted push notification service to receive alerts from Grafana’s Unified Alerting system.
Why ntfy?
ntfy is a simple, self-hosted push notification service that:
- Has a web interface for testing
- Supports mobile apps (iOS/Android) for receiving notifications
- Works with webhooks (perfect for Grafana integration)
- Is lightweight and easy to deploy
- Requires no external services or API keys
Deploying ntfy
Created a dedicated Ansible role for ntfy:
File: cluster/roles/ntfy/tasks/main.yaml
- name: Create ntfy namespace
kubernetes.core.k8s:
name: "{{ ntfy_namespace }}"
api_version: v1
kind: Namespace
state: present
- name: Create ntfy deployment
kubernetes.core.k8s:
state: present
definition:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ntfy
namespace: "{{ ntfy_namespace }}"
spec:
replicas: 1
selector:
matchLabels:
app: ntfy
template:
metadata:
labels:
app: ntfy
spec:
containers:
- name: ntfy
image: binwiederhier/ntfy:latest
command: ["ntfy", "serve"]
ports:
- containerPort: 80
name: http
File: cluster/roles/ntfy/templates/ntfy-ingress.yaml.j2
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ntfy
namespace: {{ ntfy_namespace }}
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
spec:
ingressClassName: {{ ntfy_ingress_class }}
tls:
- hosts:
- {{ ntfy_hostname }}
secretName: ntfy-tls
rules:
- host: {{ ntfy_hostname }}
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ntfy
port:
number: 80
Like other services, ntfy uses our self-signed CA certificate for HTTPS.
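Publishing to ntfy is just an HTTP POST to the topic URL, which makes it easy to test end-to-end before wiring Grafana in. For example, using the grafana-alerts topic that the alerts will use later:

# Send a test notification (use --cacert with our CA, or --insecure if it isn't installed yet)
curl --cacert cluster/secrets/ca/ca.crt \
  -H "Title: Homelab test" \
  -d "ntfy is up and reachable" \
  https://ntfy.lab.x.y.z/grafana-alerts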
Configuring Grafana Unified Alerting
Grafana’s Unified Alerting system replaced the old alerting mechanisms in Grafana 8.0+. It uses a more flexible architecture with:
- Alert Rules: Define the conditions that trigger alerts
- Contact Points: Define where to send notifications (webhooks, email, Slack, etc.)
- Notification Policies: Define routing rules for alerts to contact points
Creating Alert Rules
Created alert rules for CPU and RAM usage:
File: cluster/roles/lgtm/templates/grafana-alert-rules.yaml.j2
apiVersion: 1
groups:
- name: cpu-ram-usage
interval: 1m
orgId: 1
folder: "Infrastructure"
rules:
- uid: high-cpu-usage
title: High CPU Usage
condition: cpu_threshold
data:
- refId: cpu_usage
relativeTimeRange:
from: 300
to: 0
datasourceUid: mimir-homelab
model:
expr: (sum(rate(container_cpu_usage_seconds_total{id=~"/kubepods/.*"}[5m])) / sum(machine_cpu_cores)) * 100
- refId: cpu_reduced
datasourceUid: __expr__
model:
expression: cpu_usage
reducer: last
type: reduce
- refId: cpu_threshold
datasourceUid: __expr__
model:
expression: cpu_reduced
type: threshold
conditions:
- evaluator:
params: [90]
type: gt
noDataState: NoData
execErrState: Alerting
for: 5m
annotations:
description: "CPU usage is above 90% for more than 5 minutes. Current value: {{ $values.cpu_usage }}%"
summary: "High CPU usage detected"
labels:
severity: warning
cluster: "homelab"
Critical Insight: Alert expressions must follow a specific flow: Query → Reduce → Threshold. The threshold type requires reduced (single-value) data, not time series data. Without the reduce step, alerts will fail with “invalid format of evaluation results: looks like time series data, only reduced data can be alerted on.”
Configuring Contact Points via API
After experimenting with file-based provisioning, I switched to API-based contact point configuration for greater flexibility and better notification formatting. This approach allows us to use ntfy’s JSON message format for rich notifications:
File: cluster/roles/lgtm/tasks/configure.yaml
- name: Create or update ntfy contact point via API
shell: |
GRAFANA_POD="{{ grafana_pod_info.resources[0].metadata.name }}"
GRAFANA_PASSWORD="{{ grafana_secret.resources[0].data['admin-password'] | b64decode }}"
NAMESPACE="{{ lgtm_namespace }}"
# Check if contact point exists
EXISTING=$(kubectl exec -n $NAMESPACE "$GRAFANA_POD" -- sh -c "
curl -s -u admin:\"$GRAFANA_PASSWORD\" \
http://localhost:3000/api/v1/provisioning/contact-points
" | grep -o '\"uid\":\"ntfy\"' || echo "")
if [ -n "$EXISTING" ]; then
# Update existing contact point
kubectl exec -n $NAMESPACE "$GRAFANA_POD" -- sh -c "
curl -s -X PUT \
-H 'Content-Type: application/json' \
-H 'X-Disable-Provenance: true' \
-u admin:\"$GRAFANA_PASSWORD\" \
-d '{
\"uid\": \"ntfy\",
\"name\": \"ntfy\",
\"type\": \"webhook\",
\"settings\": {
\"url\": \"http://ntfy.ntfy-system.svc.cluster.local/grafana-alerts\",
\"httpMethod\": \"POST\",
\"message\": \"{{ if eq .Status \\\"firing\\\" }}🔥 FIRING{{ else }}✅ RESOLVED{{ end }}: {{ .CommonLabels.alertname }}\\n\\n📊 Cluster: {{ .CommonLabels.cluster }}\\n⚠️ Severity: {{ .CommonLabels.severity }}\\n\\n{{ .CommonAnnotations.description }}\\n\\n🔗 View in Grafana: https://grafana.lab.x.y.z/alerting/list\"
}
}' \
http://localhost:3000/api/v1/provisioning/contact-points/ntfy
"
else
# Create new contact point via POST to the contact-points endpoint
# (same JSON body as the update above, elided here for brevity)
:
fi
This configuration provides:
- Emojis: 🔥 for firing alerts, ✅ for resolved alerts
- Dynamic Content: Alert status, cluster name, and severity information
- Rich Formatting: Multi-line messages with emojis for better readability
- Direct Links: Direct links to Grafana’s alerting page in the message
Configuring Notification Policies via API
File-based provisioning for notification policies proved unreliable. The solution was to configure notification policies via Grafana’s API in a separate playbook:
File: cluster/playbooks/configure.yaml
---
- name: Configure Grafana notification policies via API
hosts: localhost
gather_facts: false
vars:
kubeconfig_path: "{{ playbook_dir }}/../../machines/data/kubeconfig"
tasks:
- name: Include configure tasks from lgtm role
ansible.builtin.include_role:
name: lgtm
tasks_from: configure
File: cluster/roles/lgtm/tasks/configure.yaml
- name: Configure notification policy via kubectl exec
shell: |
GRAFANA_POD="{{ grafana_pod_info.resources[0].metadata.name }}"
GRAFANA_PASSWORD="{{ grafana_secret.resources[0].data['admin-password'] | b64decode }}"
kubectl exec -n {{ lgtm_namespace }} "$GRAFANA_POD" -- sh -c "
curl -s -X PUT \
-H 'Content-Type: application/json' \
-H 'X-Disable-Provenance: true' \
-u admin:\"$GRAFANA_PASSWORD\" \
-d '{
\"receiver\": \"ntfy\",
\"group_by\": [\"alertname\", \"cluster\"],
\"group_wait\": \"10s\",
\"group_interval\": \"10s\",
\"repeat_interval\": \"12h\"
}' \
http://localhost:3000/api/v1/provisioning/policies
"
The X-Disable-Provenance: true header is critical—it allows API-based configuration to work alongside file-based provisioning.
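With GRAFANA_POD and GRAFANA_PASSWORD set as in the task above, the applied policy tree can be read back through the same provisioning API to confirm the routing took effect:

# Read the current notification policy tree back from Grafana
kubectl exec -n lgtm-system "$GRAFANA_POD" -- \
  curl -s -u admin:"$GRAFANA_PASSWORD" \
  http://localhost:3000/api/v1/provisioning/policies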
Makefile Integration
Added a separate target for configuring notification policies:
File: Makefile
.PHONY: cluster/configure
cluster/configure: machines/kubeconfig
@echo "⚙️ Configuring Grafana notification policies via API..."
ansible-playbook -i inventory.yaml cluster/playbooks/configure.yaml
Deployment workflow:
make cluster/deploy # Deploy infrastructure and Grafana with alert rules
make cluster/configure # Configure contact points and notification policies via API
Fixing False Positive Alerts
After deployment, both CPU and RAM alerts were firing constantly despite actual usage being only ~15%. The issue was the alert expression structure.
Root Cause: Using threshold type directly on time series data. Grafana’s threshold expressions require reduced (single-value) data, not time series data. The error in Grafana logs:
Error: invalid format of evaluation results for the alert definition cpu_threshold: looks like time series data, only reduced data can be alerted on.
Solution: Add a reduce step between the query and threshold:
- Query (cpu_usage): Get metrics from Prometheus/Mimir → time series data
- Reduce (cpu_reduced): Convert the time series to a single value using the last reducer
- Threshold (cpu_threshold): Evaluate whether the reduced value is > 90
This three-step flow ensures alerts only fire when metrics actually exceed thresholds.
Testing Notifications
With contact points and notification policies configured via API, notifications flow correctly:
curl -s "https://ntfy.lab.x.y.z/grafana-alerts/json?poll=1&since=5m" --insecure
Output shows successful notifications:
{
"receiver": "ntfy",
"status": "firing",
"alerts": [{
"status": "firing",
"labels": {
"alertname": "High CPU Usage",
"cluster": "homelab",
"severity": "warning"
},
"annotations": {
"description": "CPU usage is above 90% for more than 5 minutes.",
"summary": "High CPU usage detected"
}
}],
"title": "Grafana Alert: High CPU Usage",
"message": "CPU usage is above 90% for more than 5 minutes."
}
Notifications are received for both:
- Firing alerts (status: "firing") when thresholds are exceeded
- Resolved alerts (status: "resolved") when conditions return to normal
Mobile App Integration
The ntfy mobile app (available for iOS and Android) can subscribe to the grafana-alerts topic at https://ntfy.lab.x.y.z. This enables push notifications to your phone when alerts fire.
Key Takeaways
Alert Expression Flow: Grafana Unified Alerting requires a specific flow: Query → Reduce → Threshold. Threshold expressions need reduced data, not time series. Always test alert rules with real data to avoid false positives.
API-Based Contact Points: API-based provisioning for contact points offers greater flexibility than file-based, especially for complex message formatting (like ntfy’s JSON format with emojis, priorities, and clickable links).
Separate Playbooks: Keep infrastructure deployment (make cluster/deploy) separate from configuration (make cluster/configure). This separation allows you to reconfigure notification policies without redeploying the entire stack.
X-Disable-Provenance Header: This header allows API-configured resources to coexist with file-based provisioned resources, enabling a hybrid approach.
ntfy is Simple: No external services, no API keys, no complex configuration—just a webhook URL and you’re done. Perfect for homelab environments.
Rich Notifications: ntfy’s JSON message format supports emojis, priority levels, tags, and clickable links—making alerts more informative and actionable on mobile devices.
Test Alert Rules: Always verify alert thresholds with actual metrics before deployment. False positive alerts create alert fatigue and undermine the entire monitoring system.
The Deployment
With all roles created and configured, deployment is straightforward:
make cluster/deploy
This runs the playbook that deploys:
- MetalLB
- Traefik (with metrics enabled)
- Longhorn
- Metrics Server
- LGTM Stack (Loki, Tempo, Mimir, Alloy, Grafana)
After deployment, Grafana is accessible at grafana.lab.x.y.z with all data sources pre-configured.
Note: Ensure Grafana is added to the cluster_services list in your inventory and redeploy the router configuration to create the DNS CNAME record:
cluster_services:
- longhorn
- grafana
Then redeploy router configuration:
make router/deploy
What We Achieved
With the LGTM stack deployed, we now have:
✅ Comprehensive Log Collection: All pod logs collected and stored in Loki
✅ Metrics Collection: Kubernetes metrics, node metrics, and infrastructure metrics collected and stored in Mimir
✅ Distributed Tracing: OpenTelemetry traces collected and stored in Tempo
✅ Unified Visualization: Grafana providing a single interface for logs, metrics, and traces
✅ Infrastructure Monitoring: Traefik, Longhorn, and MetalLB metrics exposed and scraped
✅ Kubernetes Metrics: Metrics Server providing resource usage data
✅ Node Failure Detection: Alloy can detect node failures and alert via Grafana
✅ Data Persistence: Critical configs retained, ephemeral data deletable
✅ HTTPS for All Services: Self-signed CA certificates enable secure HTTPS access to Grafana, Longhorn, ntfy, and the CA distribution service
✅ CA Certificate Distribution: User-friendly service for downloading and installing the root CA certificate
✅ Alert Notifications: Grafana Unified Alerting integrated with ntfy for push notifications to web and mobile
✅ Self-Hosted Notifications: ntfy provides notifications without external dependencies or API keys
The Road Ahead
The observability stack is now complete with metrics, logs, traces, and notifications. The next steps will be:
- Dashboard Expansion: Create custom dashboards for cluster health, node status, and application metrics
- Additional Alert Rules: Expand alerting to cover node failures, pod crashes, disk space, and network issues
- Service Monitoring: Add application-specific monitoring as services are deployed
- Trace Instrumentation: Add OpenTelemetry instrumentation to applications for distributed tracing
- Alert Tuning: Refine alert thresholds and notification policies based on real-world usage
Lessons Learned
This phase taught me several important lessons:
Storage Strategy Matters: Different data types need different storage policies—ephemeral data can use single replica with Delete, while configs need triple replica with Retain
Helm Chart Structure Varies: Different Helm charts use different field names (persistence vs persistentVolume, storageClass vs storageClassName). Always check the chart's template files to understand the expected structure.
StatefulSet Limitations: StatefulSet volumeClaimTemplates cannot be patched. If the storage configuration is wrong, you must delete the StatefulSet and its PVCs to recreate them correctly.
Longhorn CSI Requirements: Longhorn 1.10.0+ requires dataEngine: "v1" in StorageClass parameters for CSI storage capacity tracking. Without this, Kubernetes cannot see available storage.
Component Naming Conventions: Some charts use camelCase (storeGateway) while templates expect snake_case (store_gateway). Always verify the actual template structure.
Version-Specific Configuration: Newer versions may require additional configuration. For example, Mimir v3.0.0+ requires push_grpc_method_enabled: true when ingest storage is disabled.
Webhook Complexity: Webhooks can cause TLS certificate issues in homelab environments. Disable optional webhooks (like the rollout operator) unless specifically needed.
Storage Path Conflicts: All Mimir components (blocks_storage, tsdb, compactor) must use different filesystem paths under /data to avoid "directory cannot overlap" errors. Ruler and Alertmanager are disabled since we use Grafana Unified Alerting.
Metrics Enablement: Infrastructure components need metrics enabled—don't assume they're on by default
Kubernetes Reconciliation: Trust Kubernetes reconciliation—explicit waits are often unnecessary
Alloy is Powerful: Grafana Alloy provides a unified way to collect logs, metrics, and traces—much simpler than separate agents
Pre-configured Data Sources: Helm provisioning of Grafana data sources saves manual configuration
Security First: Store sensitive data (like Grafana passwords) in git-crypted files
DNS Configuration: Adding services to cluster_services in the inventory automatically creates DNS CNAME records, but the router configuration must be redeployed for them to take effect.
Node-Local Scraping: Each Alloy pod should only scrape its local node's kubelet to avoid duplicate metrics and out-of-order sample errors.
Mimir Endpoints: Use mimir-distributor for ingestion (remote write) and mimir-gateway for queries. The gateway is not the correct endpoint for ingestion.
Grafana Custom Headers: Grafana datasources support custom HTTP headers via jsonData.httpHeaderName1 and secureJsonData.httpHeaderValue1 for multi-tenancy support.
Dashboard Provisioning: Grafana can automatically provision dashboards from ConfigMaps mounted at /var/lib/grafana/dashboards using the extraConfigmapMounts feature.
Self-Signed CA for Homelab: A self-signed Certificate Authority is simpler and more flexible than Let's Encrypt for internal-only services. No external dependencies, full control over validity periods, and perfect for homelab environments.
Certificate Chain is Critical: Kubernetes TLS secrets must include the full certificate chain (service certificate + CA certificate) in the tls.crt field. Browsers require the complete chain to establish trust.
SAN Extensions Required: Modern browsers reject certificates without Subject Alternative Names (SAN). Always include SAN extensions with the correct DNS names when generating service certificates.
WebSocket Support in Traefik: Longhorn and other services using WebSocket require the traefik.ingress.kubernetes.io/websocket: "true" annotation. Without this, WebSocket connections fail with HTTP 200 instead of proper WebSocket upgrades (HTTP 101).
Longhorn Ingress Patching: The Longhorn Helm chart generates an ingress with a default TLS secret name. You must patch it after deployment to use your custom TLS secret and add WebSocket annotations.
CA Distribution Service: Deploying a lightweight Nginx service to distribute the CA certificate makes it easy for users to install the root certificate. Include platform-specific installation instructions for better user experience.
Alert Expression Flow Matters: Grafana Unified Alerting requires a specific expression flow: Query → Reduce → Threshold. Threshold expressions need reduced (single-value) data, not time series. Using threshold directly on time series results in “invalid format of evaluation results” errors.
API-Based Contact Points for Flexibility: API-based provisioning for contact points offers greater flexibility than file-based, especially for complex message formatting. Use API when you need dynamic content, conditional logic (like emojis based on status), or rich notification formats.
Separate Infrastructure and Configuration: Use file-based provisioning for infrastructure (alert rules) and API-based configuration for operational resources (contact points, notification policies). This separation keeps Ansible playbooks focused on infrastructure while allowing operational flexibility.
X-Disable-Provenance Header: When using API-based Grafana configuration, include the X-Disable-Provenance: true header to prevent provenance conflicts and allow API modifications to work seamlessly.
ntfy for Simple Notifications: ntfy provides a self-hosted notification solution without external dependencies, API keys, or complex configuration. Its JSON message format supports emojis, priority levels, tags, and clickable links—perfect for homelab environments.
Grafana Unified Alerting Architecture: Understand the three-layer architecture: Alert Rules (conditions), Contact Points (destinations), and Notification Policies (routing). Each layer has different provisioning options and constraints.
Test Alert Rules with Real Data: Always verify alert thresholds with actual metrics before deployment. False positive alerts create alert fatigue and undermine the entire monitoring system. Test your alert expression flow to ensure it evaluates correctly.
Conclusion
The LGTM observability stack is now fully deployed with comprehensive monitoring and alerting capabilities. We have:
- Logs collected and searchable via Loki
- Metrics stored and queryable via Mimir
- Traces collected via Tempo (ready for instrumentation)
- Unified Visualization through Grafana
- Alert Rules for CPU and memory thresholds
- Push Notifications via ntfy to web and mobile
- Secure HTTPS access to all services
The infrastructure is production-ready and accurately monitoring the cluster’s health. Notifications arrive on both desktop and mobile devices when actual issues arise (metrics above 90%), enabling rapid response while avoiding alert fatigue from false positives.
In the next post, we’ll explore how to expand this observability stack with custom dashboards, additional alert rules, and application-specific monitoring.
Check out the previous post to see how we built the foundation infrastructure, or read the first post for the complete journey from the beginning.