1 - Technology Analysis

In-depth analysis of technologies and tools evaluated for home lab infrastructure

Technology Analysis

This section contains detailed research and analysis of various technologies evaluated for potential use in the home lab infrastructure.

Network Boot & Provisioning

  • Matchbox - Network boot service for bare-metal provisioning
    • Comprehensive analysis of PXE/iPXE/GRUB support
    • Configuration model (profiles, groups, templating)
    • Deployment patterns and operational considerations
    • Use case evaluation and comparison with alternatives

Cloud Providers

  • Google Cloud Platform - GCP capabilities for network boot infrastructure
    • Network boot protocol support (TFTP, HTTP, HTTPS)
    • WireGuard VPN deployment and integration
    • Cost analysis and performance considerations
  • Amazon Web Services - AWS capabilities for network boot infrastructure
    • Network boot protocol support (TFTP, HTTP, HTTPS)
    • WireGuard VPN deployment and integration
    • Cost analysis and performance considerations

Operating Systems

  • Server Operating Systems - OS evaluation for Kubernetes homelab infrastructure
    • Ubuntu Server analysis (kubeadm, k3s, MicroK8s)
    • Fedora Server analysis (kubeadm with CRI-O)
    • Talos Linux analysis (purpose-built Kubernetes OS)
    • Harvester HCI analysis (hyperconverged platform)
    • Comparison of setup complexity, maintenance, security, and resource overhead

Hardware

Future Analysis Topics

Planned technology evaluations:

  • Storage Solutions: Ceph, GlusterFS, ZFS over iSCSI
  • Container Orchestration: Kubernetes distributions (k3s, Talos, etc.)
  • Observability: Prometheus, Grafana, Loki, Tempo stack
  • Service Mesh: Istio, Linkerd, Cilium comparison
  • CI/CD: GitLab Runner, Tekton, Argo Workflows
  • Secret Management: Vault, External Secrets Operator
  • Load Balancing: MetalLB, kube-vip, Cilium LB-IPAM

1.1 - Server Operating System Analysis

Evaluation of operating systems for homelab Kubernetes infrastructure

This section provides detailed analysis of operating systems evaluated for the homelab server infrastructure, with a focus on Kubernetes cluster setup and maintenance.

Overview

The selection of a server operating system is critical for homelab infrastructure. The primary evaluation criterion is ease of Kubernetes cluster initialization and ongoing maintenance burden.

Evaluated Options

  • Ubuntu - Traditional general-purpose Linux distribution

    • Kubernetes via kubeadm, k3s, or MicroK8s
    • Strong community support and extensive documentation
    • Familiar package management and system administration
  • Fedora - Cutting-edge Linux distribution

    • Latest kernel and system components
    • Kubernetes via kubeadm or k3s
    • Shorter support lifecycle with more frequent upgrades
  • Talos Linux - Purpose-built Kubernetes OS

    • API-driven, immutable infrastructure
    • Built-in Kubernetes with minimal attack surface
    • Designed specifically for container workloads
  • Harvester - Hyperconverged infrastructure platform

    • Built on Rancher and K3s
    • Combines compute, storage, and networking
    • VM and container workloads on unified platform

Evaluation Criteria

Each option is evaluated based on:

  1. Kubernetes Installation Methods - Available tooling and installation approaches
  2. Cluster Initialization Process - Steps required to bootstrap a cluster
  3. Maintenance Requirements - OS updates, Kubernetes upgrades, security patches
  4. Resource Overhead - Memory, CPU, and storage footprint
  5. Learning Curve - Ease of adoption and operational complexity
  6. Community Support - Documentation quality and ecosystem maturity
  7. Security Posture - Attack surface and security-first design

1.1.1 - Ubuntu Analysis

Analysis of Ubuntu for Kubernetes homelab infrastructure

Overview

Ubuntu Server is a popular general-purpose Linux distribution developed by Canonical. It provides Long Term Support (LTS) releases with 5 years of standard support and optional Extended Security Maintenance (ESM).

Key Facts:

  • Latest LTS: Ubuntu 24.04 LTS (Noble Numbat)
  • Support Period: 5 years standard, 10 years with Ubuntu Pro (free for personal use)
  • Kernel: Linux 6.8+ (LTS), regular HWE updates
  • Package Manager: APT/DPKG, Snap
  • Init System: systemd

Kubernetes Installation Methods

Ubuntu supports multiple Kubernetes installation approaches:

1. kubeadm (Official Kubernetes Tool)

Installation:

# Install container runtime (containerd)
sudo apt-get update
sudo apt-get install -y containerd

# Configure containerd (use the systemd cgroup driver expected by kubelet)
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
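
The kubeadm flow above also assumes the standard node prerequisites from the Kubernetes documentation; a minimal sketch (adjust to your environment):

# Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Load required kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# Enable bridged traffic and IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system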

Cluster Initialization:

# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Configure kubectl for admin
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI (e.g., Calico, Flannel)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

# Join worker nodes
kubeadm token create --print-join-command

Pros:

  • Official Kubernetes tooling, well-documented
  • Full control over cluster configuration
  • Supports latest Kubernetes versions
  • Large community and extensive resources

Cons:

  • More manual steps than turnkey solutions
  • Requires understanding of Kubernetes architecture
  • Manual upgrade process for each component
  • More complex troubleshooting

2. k3s (Lightweight Kubernetes)

Installation:

# Single-command install on control plane
curl -sfL https://get.k3s.io | sh -

# Get node token for workers
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on worker nodes
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
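
k3s can also be customized at install time through its config file; a minimal sketch (file path per the k3s docs, values shown are illustrative):

# /etc/rancher/k3s/config.yaml (read by the k3s service on startup)
write-kubeconfig-mode: "0644"
tls-san:
  - "control-plane.homelab.local"  # extra SAN for the API certificate (example hostname)
disable:
  - traefik                        # skip the bundled ingress controller if bringing your own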

Pros:

  • Extremely simple installation (single command)
  • Lightweight (< 512MB RAM)
  • Built-in container runtime (containerd)
  • Automatic updates via Rancher System Upgrade Controller
  • Great for edge and homelab use cases

Cons:

  • Less customizable than kubeadm
  • Some features removed (e.g., in-tree storage, cloud providers)
  • Slightly different from upstream Kubernetes

3. MicroK8s (Canonical’s Distribution)

Installation:

# Install via snap
sudo snap install microk8s --classic

# Join cluster
sudo microk8s add-node
# Run output command on worker nodes

# Enable addons
microk8s enable dns storage ingress
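
After installation, the cluster can be verified and used through the bundled kubectl; a short usage sketch:

# Wait for the node to become ready and inspect it
microk8s status --wait-ready
microk8s kubectl get nodes

# Optional: use MicroK8s' kubectl as the default kubectl
alias kubectl='microk8s kubectl'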

Pros:

  • Zero-ops, single package install
  • Snap-based automatic updates
  • Addons for common services (DNS, storage, ingress)
  • Canonical support available

Cons:

  • Requires snap (not universally liked)
  • Less ecosystem compatibility than vanilla Kubernetes
  • Ubuntu-specific (less portable)

Cluster Initialization Sequence

kubeadm Approach

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system (apt update && upgrade)
    Admin->>Server: Install containerd
    Server->>Server: Configure containerd (CRI)
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, configure kernel modules
    Admin->>K8s: kubeadm init --pod-network-cidr=10.244.0.0/16
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s->>Server: Start controller-manager
    K8s->>Server: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply -f calico.yaml
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (on workers)
    K8s->>Server: Add worker nodes
    K8s-->>Admin: Cluster ready

k3s Approach

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize etcd (embedded)
    K3s->>Server: Start API server
    K3s->>Server: Start controller-manager
    K3s->>Server: Start scheduler
    K3s->>Server: Deploy built-in CNI (Flannel)
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers to cluster
    K3s-->>Admin: Cluster ready (5-10 minutes total)

Maintenance Requirements

OS Updates

Security Patches:

# Automatic security updates (recommended)
sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# Manual updates
sudo apt-get update
sudo apt-get upgrade
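
Automatic reboots after kernel updates can also be scheduled through unattended-upgrades; a minimal sketch (standard directives, the reboot window is an example):

# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";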

Frequency:

  • Security patches: Weekly to monthly
  • Kernel updates: Monthly (may require reboot)
  • Major version upgrades: Every 2 years (LTS to LTS)

Kubernetes Upgrades

kubeadm Upgrade:

# Upgrade control plane
# (first point the apt repo at the v1.32 package stream, since pkgs.k8s.io is versioned per minor release)
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm=1.32.0-*
sudo kubeadm upgrade apply v1.32.0
sudo apt-get install -y kubelet=1.32.0-* kubectl=1.32.0-*
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl restart kubelet

# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo apt-get install -y kubeadm=1.32.0-* kubelet=1.32.0-* kubectl=1.32.0-*
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>

k3s Upgrade:

# Manual upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -

# Automatic upgrade via system-upgrade-controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
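
The system-upgrade-controller is driven by Plan resources; a sketch of a control-plane plan (fields follow the k3s automated-upgrade docs, adjust selectors and channel to your cluster):

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  channel: https://update.k3s.io/v1-release/channels/stable
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: Exists}
  upgrade:
    image: rancher/k3s-upgrade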

Upgrade Frequency: Every 3-6 months (Kubernetes minor versions)

Resource Overhead

Minimal Installation (Ubuntu Server + k3s):

  • RAM: ~512MB (OS) + 512MB (k3s) = 1GB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: 10GB (OS) + 10GB (container images) = 20GB
  • Network: 1 Gbps recommended

Full Installation (Ubuntu Server + kubeadm):

  • RAM: ~512MB (OS) + 1-2GB (Kubernetes components) = 2GB+ total
  • CPU: 2 cores minimum
  • Disk: 15GB (OS) + 20GB (container images/etcd) = 35GB
  • Network: 1 Gbps recommended

Security Posture

Strengths:

  • Regular security updates via Ubuntu Security Team
  • AppArmor enabled by default
  • SELinux support available
  • Kernel hardening features (ASLR, stack protection)
  • Ubuntu Pro ESM for extended CVE coverage (free for personal use)

Attack Surface:

  • Full general-purpose OS (larger attack surface than minimal OS)
  • Many installed packages by default (can be minimized)
  • Requires manual hardening for production use

Hardening Steps:

# Disable unnecessary services
sudo systemctl disable snapd.service
sudo systemctl disable bluetooth.service

# Configure firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 6443/tcp  # Kubernetes API
sudo ufw allow 10250/tcp # Kubelet
sudo ufw enable

# CIS Kubernetes Benchmark compliance
# Use tools like kube-bench for validation
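
kube-bench can be run as a one-off Kubernetes Job; a sketch (manifest path as published in the kube-bench repository, verify before use):

# Run the CIS benchmark checks and read the results
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench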

Learning Curve

Ease of Adoption: ⭐⭐⭐⭐⭐ (Excellent)

  • Most familiar Linux distribution for many users
  • Extensive documentation and tutorials
  • Large community support (forums, Stack Overflow)
  • Straightforward package management
  • Similar to Debian-based systems

Required Knowledge:

  • Basic Linux system administration (apt, systemd, networking)
  • Kubernetes concepts (pods, services, deployments)
  • Container runtime basics (containerd, Docker)
  • Text editor (vim, nano) for configuration

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐⭐ (Excellent)

  • Documentation: Comprehensive official docs, community guides
  • Community: Massive user base, active forums
  • Commercial Support: Available from Canonical (Ubuntu Pro)
  • Third-Party Tools: Excellent compatibility with all Kubernetes tools
  • Tutorials: Abundant resources for Kubernetes on Ubuntu

Pros and Cons Summary

Pros

  • Good, because most familiar and well-documented Linux distribution
  • Good, because 5-year LTS support (10 years with Ubuntu Pro)
  • Good, because multiple Kubernetes installation options (kubeadm, k3s, MicroK8s)
  • Good, because k3s provides extremely simple setup (single command)
  • Good, because extensive package ecosystem (60,000+ packages)
  • Good, because strong community support and resources
  • Good, because automatic security updates available
  • Good, because low learning curve for most administrators
  • Good, because compatible with all Kubernetes tooling and addons
  • Good, because Ubuntu Pro free for personal use (extended security)

Cons

  • Bad, because general-purpose OS has larger attack surface than minimal OS
  • Bad, because more resource overhead than purpose-built Kubernetes OS (1-2GB RAM)
  • Bad, because requires manual OS updates and reboots
  • Bad, because kubeadm setup is complex with many manual steps
  • Bad, because snap packaging is controversial (relevant to MicroK8s)
  • Bad, because Kubernetes upgrades require manual intervention (unless using k3s auto-upgrade)
  • Bad, because managing OS + Kubernetes lifecycle separately increases complexity
  • Neutral, because many preinstalled packages (can be removed, but require effort)

Recommendations

Best for:

  • Users familiar with Ubuntu/Debian ecosystem
  • Homelabs requiring general-purpose server functionality (not just Kubernetes)
  • Teams wanting multiple Kubernetes installation options
  • Users prioritizing community support and documentation

Best Installation Method:

  • Homelab/Learning: k3s (simplest, auto-updates, lightweight)
  • Production-like: kubeadm (full control, upstream Kubernetes)
  • Ubuntu-specific: MicroK8s (Canonical support, snap-based)

Avoid if:

  • Seeking minimal attack surface (consider Talos Linux)
  • Want infrastructure-as-code for OS layer (consider Talos Linux)
  • Prefer hyperconverged platform (consider Harvester)

1.1.2 - Fedora Analysis

Analysis of Fedora Server for Kubernetes homelab infrastructure

Overview

Fedora Server is a cutting-edge Linux distribution sponsored by Red Hat, serving as the upstream for Red Hat Enterprise Linux (RHEL). It emphasizes innovation with the latest software packages and kernel versions.

Key Facts:

  • Latest Version: Fedora 41 (October 2024)
  • Support Period: ~13 months per release (shorter than Ubuntu LTS)
  • Kernel: Linux 6.11+ (latest stable)
  • Package Manager: DNF/RPM, Flatpak
  • Init System: systemd

Kubernetes Installation Methods

Fedora supports standard Kubernetes installation approaches:

1. kubeadm (Official Kubernetes Tool)

Installation:

# Install container runtime (CRI-O preferred on Fedora)
sudo dnf install -y cri-o
sudo systemctl enable --now crio

# Add Kubernetes repository
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
EOF

# Install kubeadm, kubelet, kubectl
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet

Cluster Initialization:

# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

# Join workers
kubeadm token create --print-join-command

Pros:

  • CRI-O is native to Fedora ecosystem (same as RHEL/OpenShift)
  • Latest Kubernetes versions available quickly
  • Familiar to RHEL/CentOS users
  • Fully upstream Kubernetes

Cons:

  • Manual setup process (same as Ubuntu/kubeadm)
  • Requires Kubernetes knowledge
  • More complex than turnkey solutions

2. k3s (Lightweight Kubernetes)

Installation:

# Same single-command install
curl -sfL https://get.k3s.io | sh -

# Retrieve token
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
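
On Fedora, firewalld is active by default; rather than disabling it, the ports listed in the k3s requirements can be opened (a sketch assuming the default Flannel VXLAN backend and default CIDRs):

sudo firewall-cmd --permanent --add-port=6443/tcp   # Kubernetes API server
sudo firewall-cmd --permanent --add-port=10250/tcp  # kubelet metrics
sudo firewall-cmd --permanent --add-port=8472/udp   # Flannel VXLAN
sudo firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16  # pod CIDR
sudo firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16  # service CIDR
sudo firewall-cmd --reload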

Pros:

  • Simple installation (identical to Ubuntu)
  • Lightweight and fast
  • Well-tested on Fedora/RHEL family

Cons:

  • Less customizable
  • Not using native CRI-O by default (uses embedded containerd)

3. OKD (OpenShift Kubernetes Distribution)

Installation (Single-Node):

# Download and install OKD
wget https://github.com/okd-project/okd/releases/download/4.15.0-0.okd-2024-01-27-070424/openshift-install-linux-4.15.0-0.okd-2024-01-27-070424.tar.gz
tar -xvf openshift-install-linux-*.tar.gz
sudo mv openshift-install /usr/local/bin/

# Create install config
openshift-install create install-config --dir=cluster

# Install cluster
openshift-install create cluster --dir=cluster

Pros:

  • Enterprise features (operators, web console, image registry)
  • Built-in CI/CD and developer tools
  • Based on Fedora CoreOS (immutable, auto-updating)

Cons:

  • Very heavy resource requirements (16GB+ RAM)
  • Complex installation and management
  • Overkill for simple homelab use

Cluster Initialization Sequence

kubeadm with CRI-O

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Install CRI-O
    Server->>Server: Configure CRI-O runtime
    Server->>Server: Enable crio.service
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, load kernel modules
    Server->>Server: Configure SELinux (permissive for Kubernetes)
    Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s->>Server: Start controller-manager
    K8s->>Server: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply CNI
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (workers)
    K8s->>Server: Add worker nodes
    K8s-->>Admin: Cluster ready

k3s Approach

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Disable firewalld (or configure)
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize embedded etcd
    K3s->>Server: Start API server
    K3s->>Server: Deploy built-in CNI
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers
    K3s-->>Admin: Cluster ready (5-10 minutes)

Maintenance Requirements

OS Updates

Security and System Updates:

# Automatic updates (dnf-automatic)
sudo dnf install -y dnf-automatic
sudo systemctl enable --now dnf-automatic.timer

# Manual updates
sudo dnf update -y
sudo reboot  # if kernel updated
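
dnf-automatic only applies updates when configured to do so; a sketch of the relevant settings:

# /etc/dnf/automatic.conf (excerpt)
[commands]
upgrade_type = security  # or "default" for all updates
apply_updates = yes      # download and apply, not just notify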

Frequency:

  • Security patches: Weekly to monthly
  • Kernel updates: Monthly (frequent updates)
  • Major version upgrades: Every ~13 months (Fedora releases)

Version Upgrade:

# Upgrade to next Fedora release
sudo dnf upgrade --refresh
sudo dnf install dnf-plugin-system-upgrade
sudo dnf system-upgrade download --releasever=42
sudo dnf system-upgrade reboot

Kubernetes Upgrades

kubeadm Upgrade:

# Upgrade control plane
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
sudo systemctl restart kubelet

# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo dnf update -y kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>

k3s Upgrade: Same as Ubuntu (curl script or system-upgrade-controller)

Upgrade Frequency: Kubernetes every 3-6 months, Fedora OS every ~13 months

Resource Overhead

Minimal Installation (Fedora Server + k3s):

  • RAM: ~600MB (OS) + 512MB (k3s) = 1.2GB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: 12GB (OS) + 10GB (containers) = 22GB
  • Network: 1 Gbps recommended

Full Installation (Fedora Server + kubeadm + CRI-O):

  • RAM: ~700MB (OS) + 1.5GB (Kubernetes) = 2.2GB total
  • CPU: 2 cores minimum
  • Disk: 15GB (OS) + 20GB (containers) = 35GB
  • Network: 1 Gbps recommended

Note: Slightly higher overhead than Ubuntu due to SELinux and newer components.

Security Posture

Strengths:

  • SELinux enabled by default (stronger than AppArmor)
  • Latest security patches and kernel (bleeding edge)
  • CRI-O container runtime (security-focused, used by OpenShift)
  • Shorter support window = less legacy CVEs
  • Active security team and rapid response

Attack Surface:

  • General-purpose OS (larger surface than minimal OS)
  • More installed packages than minimal server
  • SELinux can be complex to configure for Kubernetes

Hardening Steps:

# Configure firewall (firewalld default on Fedora)
sudo firewall-cmd --permanent --add-port=6443/tcp  # API server
sudo firewall-cmd --permanent --add-port=10250/tcp # Kubelet
sudo firewall-cmd --reload

# SELinux configuration for Kubernetes
sudo setenforce 0  # Permissive (Kubernetes not fully SELinux-ready)
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# Disable unnecessary services
sudo systemctl disable bluetooth.service

Learning Curve

Ease of Adoption: ⭐⭐⭐⭐ (Good)

  • Familiar for RHEL/CentOS/Alma/Rocky users
  • DNF package manager (similar to APT)
  • Excellent documentation
  • SELinux learning curve can be steep

Required Knowledge:

  • RPM-based system administration (dnf, systemd)
  • SELinux basics (or willingness to use permissive mode)
  • Kubernetes concepts
  • Firewalld configuration

Differences from Ubuntu:

  • DNF vs APT package manager
  • SELinux vs AppArmor
  • Firewalld vs UFW
  • Faster release cycle (more frequent upgrades)

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐ (Good)

  • Documentation: Excellent official docs, Red Hat resources
  • Community: Large user base, active forums
  • Commercial Support: RHEL support available (paid)
  • Third-Party Tools: Good compatibility with Kubernetes tools
  • Tutorials: Abundant resources, especially for RHEL ecosystem

Pros and Cons Summary

Pros

  • Good, because latest kernel and software packages (bleeding edge)
  • Good, because SELinux enabled by default (stronger MAC than AppArmor)
  • Good, because native CRI-O support (same as RHEL/OpenShift)
  • Good, because upstream for RHEL (enterprise compatibility)
  • Good, because multiple Kubernetes installation options
  • Good, because k3s simplifies setup dramatically
  • Good, because strong security focus and rapid CVE response
  • Good, because familiar to RHEL/CentOS ecosystem
  • Good, because automatic updates available (dnf-automatic)
  • Neutral, because shorter support cycle (13 months) ensures latest features

Cons

  • Bad, because short support cycle requires frequent OS upgrades (every ~13 months)
  • Bad, because bleeding-edge packages can introduce instability
  • Bad, because SELinux configuration for Kubernetes is complex (often set to permissive)
  • Bad, because smaller community than Ubuntu (though still large)
  • Bad, because general-purpose OS has larger attack surface than minimal OS
  • Bad, because more resource overhead than purpose-built Kubernetes OS
  • Bad, because OS upgrade every 13 months adds maintenance burden
  • Bad, because less beginner-friendly than Ubuntu
  • Bad, because managing OS + Kubernetes lifecycle separately
  • Neutral, because rapid release cycle can be pro or con depending on preference

Recommendations

Best for:

  • Users familiar with RHEL/CentOS/Rocky/Alma ecosystem
  • Teams wanting latest kernel and software features
  • Environments requiring SELinux (compliance, enterprise standards)
  • Learning OpenShift/OKD ecosystem (Fedora CoreOS foundation)
  • Users comfortable with frequent OS upgrades

Best Installation Method:

  • Homelab/Learning: k3s (simplest, lightweight)
  • Enterprise-like: kubeadm + CRI-O (OpenShift compatibility)
  • Advanced: OKD (if resources available, 16GB+ RAM)

Avoid if:

  • Prefer long-term stability (choose Ubuntu LTS)
  • Want minimal maintenance (frequent Fedora upgrades required)
  • Seeking minimal attack surface (consider Talos Linux)
  • Uncomfortable with SELinux complexity
  • Want infrastructure-as-code for OS (consider Talos Linux)

Comparison with Ubuntu

Aspect              | Fedora           | Ubuntu LTS
--------------------|------------------|----------------------
Support Period      | 13 months        | 5 years (10 with Pro)
Kernel              | Latest (6.11+)   | LTS (6.8+)
Security            | SELinux          | AppArmor
Package Manager     | DNF/RPM          | APT/DEB
Release Cycle       | 6 months         | 2 years (LTS)
Upgrade Frequency   | Every 13 months  | Every 2-5 years
Community Size      | Large            | Very Large
Enterprise Upstream | RHEL             | N/A
Stability           | Bleeding edge    | Stable/Conservative
Learning Curve      | Moderate         | Easy

Verdict: Fedora is excellent for those wanting latest features and comfortable with frequent upgrades. Ubuntu LTS is better for long-term stability and minimal maintenance.

1.1.3 - Talos Linux Analysis

Analysis of Talos Linux for Kubernetes homelab infrastructure

Overview

Talos Linux is a modern operating system designed specifically for running Kubernetes. It is API-driven, immutable, and minimal, with no SSH access, shell, or package manager. All configuration is done via a declarative API.

Key Facts:

  • Latest Version: Talos 1.9 (supports Kubernetes 1.31)
  • Support: Community-driven, commercial support available from Sidero Labs
  • Kernel: Linux 6.6+ LTS
  • Architecture: Immutable, API-driven, no shell access
  • Management: talosctl CLI + Kubernetes API

Kubernetes Installation Methods

Talos Linux has built-in Kubernetes - there is only one installation method.

Built-in Kubernetes (Only Option)

Installation Process:

  1. Boot Talos ISO/PXE (maintenance mode)
  2. Apply machine configuration via talosctl
  3. Bootstrap Kubernetes via talosctl bootstrap

Machine Configuration (YAML):

# controlplane.yaml
version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.1.10/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

Cluster Initialization:

# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443

# Apply config to control plane node (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml

# Wait for install to complete, then bootstrap
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10

# Retrieve kubeconfig
talosctl kubeconfig --nodes 192.168.1.10 --endpoints 192.168.1.10

# Apply config to worker nodes
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
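
Once the configs are applied, cluster health can be verified from the workstation; a short sketch:

# Wait for the cluster to converge and verify nodes
talosctl health --nodes 192.168.1.10 --endpoints 192.168.1.10
kubectl get nodes -o wide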

Pros:

  • Kubernetes built-in, no separate installation
  • Declarative configuration (GitOps-friendly)
  • Extremely minimal attack surface (no shell, no SSH)
  • Immutable infrastructure (config changes require reboot)
  • Automatic updates via Talos controller
  • Designed from ground up for Kubernetes

Cons:

  • Steep learning curve (completely different paradigm)
  • No SSH/shell access (all via API)
  • Troubleshooting requires different mindset
  • Limited to Kubernetes workloads only (not general-purpose)
  • Smaller community than traditional distros

Cluster Initialization Sequence

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Talos as Talos Linux
    participant K8s as Kubernetes Components
    
    Admin->>Server: Boot Talos ISO (PXE or USB)
    Server->>Talos: Start in maintenance mode
    Talos-->>Admin: API endpoint ready (no shell)
    Admin->>Admin: Generate configs (talosctl gen config)
    Admin->>Talos: talosctl apply-config (controlplane.yaml)
    Talos->>Server: Partition disk
    Talos->>Server: Install Talos to /dev/sda
    Talos->>Server: Write machine config
    Server->>Server: Reboot from disk
    Talos->>Talos: Load machine config
    Talos->>K8s: Start kubelet
    Talos->>K8s: Start etcd
    Talos->>K8s: Start API server
    Admin->>Talos: talosctl bootstrap
    Talos->>K8s: Initialize cluster
    K8s->>Talos: Start controller-manager
    K8s->>Talos: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: Apply CNI (via talosctl or kubectl)
    K8s->>Talos: Deploy CNI pods
    Admin->>Talos: Apply worker configs
    Talos->>K8s: Join workers to cluster
    K8s-->>Admin: Cluster ready (10-15 minutes)

Maintenance Requirements

OS Updates

Declarative Upgrades:

# Upgrade Talos version (rolling upgrade)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0

# Kubernetes version upgrade (also declarative)
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0

Automatic Updates (via Talos System Extensions):

# machine config with auto-update extension
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/system-upgrade-controller

Frequency:

  • Talos releases: Every 2-3 months
  • Kubernetes upgrades: Follow upstream cadence (quarterly)
  • Security patches: Built into Talos releases
  • No traditional OS patching (immutable system)

Configuration Changes

All changes via machine config:

# Edit machine config YAML
vim controlplane.yaml

# Apply updated config (triggers reboot if needed)
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml

No manual package installs - everything declarative.
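
Smaller changes can also be applied as patches instead of re-sending the full file; a sketch (the sysctl shown is illustrative):

# patch.yaml - example change
# machine:
#   sysctls:
#     net.core.somaxconn: "65535"

# Interactively edit a node's machine config (opens $EDITOR, applies on save)
talosctl edit machineconfig --nodes 192.168.1.10

# Or apply a prepared patch file
talosctl patch machineconfig --nodes 192.168.1.10 --patch @patch.yaml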

Resource Overhead

Minimal Footprint (Talos Linux + Kubernetes):

  • RAM: ~256MB (OS) + 512MB (Kubernetes) = 768MB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: ~500MB (OS) + 10GB (container images/etcd) = 10-15GB total
  • Network: 1 Gbps recommended

Comparison:

  • Ubuntu + k3s: ~1GB RAM
  • Talos: ~768MB RAM (lighter)
  • Ubuntu + kubeadm: ~2GB RAM
  • Talos: ~768MB RAM (much lighter)

Minimal install size: ~500MB (vs 10GB+ for Ubuntu/Fedora)

Security Posture

Strengths: ⭐⭐⭐⭐⭐ (Excellent)

  • No SSH access - attack surface eliminated
  • No shell - cannot install malware
  • No package manager - no additional software installation
  • Immutable filesystem - rootfs read-only
  • Minimal components: Only Kubernetes and essential services
  • API-only access - mTLS-authenticated talosctl
  • KSPP compliance: Kernel Self-Protection Project standards
  • Signed images: Cryptographically signed Talos images
  • Secure Boot support: UEFI Secure Boot compatible

Attack Surface:

  • Smallest possible: Only Kubernetes API, kubelet, and Talos API
  • ~30 running processes (vs 100+ on Ubuntu/Fedora)
  • ~200MB filesystem (vs 5-10GB on Ubuntu/Fedora)

No hardening needed - secure by default.

Security Features:

# Built-in security (example config)
machine:
  sysctls:
    kernel.kptr_restrict: "2"
    kernel.yama.ptrace_scope: "1"
  kernel:
    modules:
      - name: br_netfilter
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader

Learning Curve

Ease of Adoption: ⭐⭐ (Challenging)

  • Paradigm shift: No shell/SSH, API-only management
  • Requires understanding of declarative infrastructure
  • Talosctl CLI has learning curve
  • Excellent documentation helps
  • Different troubleshooting approach (logs via API)

Required Knowledge:

  • Kubernetes fundamentals (critical)
  • YAML configuration syntax
  • Networking basics (especially CNI)
  • GitOps concepts helpful
  • Comfort with “infrastructure as code”

Debugging without shell:

# View logs via API
talosctl logs --nodes 192.168.1.10 kubelet

# Interactive dashboard (system metrics, service health, logs)
talosctl dashboard --nodes 192.168.1.10

# Service status
talosctl service --nodes 192.168.1.10

Community Support

Ecosystem Maturity: ⭐⭐⭐ (Growing)

  • Documentation: Excellent official docs
  • Community: Smaller but very active (Slack, GitHub Discussions)
  • Commercial Support: Available from Sidero Labs
  • Third-Party Tools: Growing ecosystem (Cluster API, GitOps tools)
  • Tutorials: Increasing number of community guides

Community Size: Smaller than Ubuntu/Fedora, but dedicated and helpful.

Pros and Cons Summary

Pros

  • Good, because Kubernetes is built-in (no separate installation)
  • Good, because minimal attack surface (no SSH, shell, or package manager)
  • Good, because immutable infrastructure (config drift impossible)
  • Good, because API-driven management (GitOps-friendly)
  • Good, because extremely low resource overhead (~768MB RAM)
  • Good, because automatic security patches via Talos upgrades
  • Good, because declarative configuration (version-controlled)
  • Good, because secure by default (no hardening required)
  • Good, because smallest disk footprint (~500MB OS)
  • Good, because designed specifically for Kubernetes (opinionated and optimized)
  • Good, because UEFI Secure Boot support
  • Good, because upgrades are simple and declarative (talosctl upgrade)

Cons

  • Bad, because steep learning curve (no shell/SSH paradigm shift)
  • Bad, because limited to Kubernetes workloads only (not general-purpose)
  • Bad, because troubleshooting without shell requires different approach
  • Bad, because smaller community than Ubuntu/Fedora
  • Bad, because relatively new (less mature than traditional distros)
  • Bad, because no escape hatch for manual intervention
  • Bad, because requires comfort with declarative infrastructure
  • Bad, because debugging is harder for beginners
  • Neutral, because opinionated design (pro for K8s-only, con for general use)

Recommendations

Best for:

  • Kubernetes-dedicated infrastructure (no general-purpose workloads)
  • Security-focused environments (minimal attack surface)
  • GitOps workflows (declarative configuration)
  • Immutable infrastructure advocates
  • Teams comfortable with API-driven management
  • Production Kubernetes clusters (once team is trained)

Best Installation Method:

  • Only option: Built-in Kubernetes via talosctl

Avoid if:

  • Need general-purpose server functionality (SSH, cron jobs, etc.)
  • Team unfamiliar with Kubernetes (too steep a learning curve)
  • Require shell access for troubleshooting comfort
  • Want traditional package management (apt, dnf)
  • Prefer familiar Linux administration tools

Comparison with Ubuntu and Fedora

Aspect            | Talos Linux             | Ubuntu + k3s       | Fedora + kubeadm
------------------|-------------------------|--------------------|-------------------
K8s Installation  | Built-in                | Single command     | Manual (kubeadm)
Attack Surface    | Minimal (~30 processes) | Medium (~100)      | Medium (~100)
Resource Overhead | 768MB RAM               | 1GB RAM            | 2.2GB RAM
Disk Footprint    | 500MB                   | 10GB               | 15GB
Security Model    | Immutable, no shell     | AppArmor, shell    | SELinux, shell
Management        | API-only (talosctl)     | SSH + kubectl      | SSH + kubectl
Learning Curve    | Steep                   | Easy               | Moderate
Community Size    | Small (growing)         | Very Large         | Large
Support Period    | Rolling releases        | 5-10 years         | 13 months
Use Case          | Kubernetes only         | General-purpose    | General-purpose
Upgrades          | Declarative, simple     | Manual OS + K8s    | Manual OS + K8s
Configuration     | Declarative YAML        | Imperative + YAML  | Imperative + YAML
Troubleshooting   | API logs/metrics        | SSH + logs         | SSH + logs
GitOps-Friendly   | Excellent               | Good               | Good
Best for          | K8s-dedicated infra     | Homelabs, learning | RHEL ecosystem

Verdict: Talos is the most secure and efficient option for Kubernetes-only infrastructure, but requires team buy-in to API-driven, immutable paradigm. Ubuntu/Fedora better for general-purpose servers or teams wanting shell access.

Advanced Features

Talos System Extensions

Extend Talos functionality with extensions:

machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20240312
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4

Cluster API Integration

Talos works natively with Cluster API:

# Install Cluster API with the Talos bootstrap and control-plane providers
# (choose an infrastructure provider for your environment, e.g. Sidero Metal for bare metal)
clusterctl init --bootstrap talos --control-plane talos --infrastructure sidero

# Create cluster from template
clusterctl generate cluster homelab --infrastructure sidero > cluster.yaml
kubectl apply -f cluster.yaml

Image Factory

Custom Talos images with extensions are built via the Image Factory: a schematic describing the customizations is submitted, and images are then downloaded by the returned schematic ID:

# Submit a schematic (YAML listing extensions under customization.systemExtensions) and note the returned ID
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

# Download the matching image, e.g.
# https://factory.talos.dev/image/<schematic-id>/v1.9.0/metal-amd64.iso

Disaster Recovery

Talos supports etcd backup/restore:

# Backup etcd to a local snapshot file
talosctl etcd snapshot ./etcd-snapshot.db --nodes 192.168.1.10

# Restore from snapshot
talosctl bootstrap --recover-from ./etcd-snapshot.db

Production Readiness

Production Use: ✅ Yes (many companies run Talos in production)

High Availability:

  • 3+ control plane nodes recommended
  • External etcd supported
  • Load balancer for API server

Monitoring:

  • Prometheus metrics built-in
  • Talos dashboard for health
  • Standard Kubernetes observability tools

Example Production Clusters:

  • Sidero Metal (bare metal provisioning)
  • Various cloud providers (AWS, GCP, Azure)
  • Edge deployments (minimal footprint)

1.1.4 - Harvester Analysis

Analysis of Harvester HCI for Kubernetes homelab infrastructure

Overview

Harvester is a Hyperconverged Infrastructure (HCI) platform built on Kubernetes, designed to provide VM and container management on a unified platform. It combines compute, storage, and networking with built-in K3s for orchestration.

Key Facts:

  • Latest Version: Harvester 1.4 (based on K3s 1.30+)
  • Foundation: Built on RancherOS 2.0, K3s, and KubeVirt
  • Support: Supported by SUSE (acquired Rancher)
  • Architecture: HCI platform with VM + container workloads
  • Management: Web UI + kubectl + Rancher integration

Kubernetes Installation Methods

Harvester includes K3s as its foundation - Kubernetes is built-in.

Built-in K3s (Only Option)

Installation Process:

  1. Boot Harvester ISO (interactive installer or PXE)
  2. Complete installation wizard (web UI or console)
  3. Create cluster (automatic K3s deployment)
  4. Access via web UI or kubectl

Interactive Installation:

# Boot from Harvester ISO
1. Choose "Create a new Harvester cluster"
2. Configure:
   - Cluster token
   - Node role (management/worker/witness)
   - Network interface (management network)
   - VIP (Virtual IP for cluster access)
   - Storage disk (Longhorn persistent storage)
3. Install completes (15-20 minutes)
4. Access web UI at https://<VIP>

Configuration (cloud-init for automated install):

# config.yaml
token: my-cluster-token
os:
  hostname: harvester-node-1
  modules:
    - kvm
  kernel_parameters:
    - intel_iommu=on
install:
  mode: create
  device: /dev/sda
  iso_url: https://releases.rancher.com/harvester/v1.4.0/harvester-v1.4.0-amd64.iso
  vip: 192.168.1.100
  vip_mode: static
  networks:
    harvester-mgmt:
      interfaces:
        - name: eth0
      default_route: true
      ip: 192.168.1.10
      subnet_mask: 255.255.255.0
      gateway: 192.168.1.1

Pros:

  • Complete HCI solution (VMs + containers)
  • Web UI for management (no CLI required)
  • Built-in storage (Longhorn CSI)
  • Built-in networking (multus, SR-IOV)
  • VM live migration
  • Rancher integration for multi-cluster management
  • K3s built-in (no separate Kubernetes install)

Cons:

  • Heavy resource requirements (8GB+ RAM per node)
  • Complex architecture (steep learning curve)
  • Larger attack surface than minimal OS
  • Overkill for container-only workloads
  • Requires 3+ nodes for production HA

Cluster Initialization Sequence

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Harvester as Harvester HCI
    participant K3s as K3s / KubeVirt
    participant Storage as Longhorn Storage
    
    Admin->>Server: Boot Harvester ISO
    Server->>Harvester: Start installation wizard
    Harvester-->>Admin: Interactive console/web UI
    Admin->>Harvester: Configure cluster (token, VIP, storage)
    Harvester->>Server: Partition disks (OS + Longhorn storage)
    Harvester->>Server: Install RancherOS 2.0 base
    Harvester->>Server: Install K3s components
    Server->>Server: Reboot
    Harvester->>K3s: Start K3s server
    K3s->>Server: Initialize control plane
    K3s->>Server: Deploy Harvester operators
    K3s->>Storage: Deploy Longhorn for persistent storage
    K3s->>Server: Deploy KubeVirt for VM management
    K3s->>Server: Deploy multus CNI (multi-network)
    Harvester-->>Admin: Web UI ready at https://<VIP>
    Admin->>Harvester: Add additional nodes (join cluster)
    Harvester->>K3s: Join nodes to cluster
    K3s->>Storage: Replicate storage across nodes
    Harvester-->>Admin: Cluster ready (20-30 minutes)
    Admin->>Harvester: Create VMs or deploy containers

Maintenance Requirements

OS Updates

Harvester Upgrades (includes OS + K3s):

# Via Web UI:
# Settings → Upgrade → Select version → Start upgrade

# Via kubectl (after downloading upgrade image):
kubectl apply -f https://releases.rancher.com/harvester/v1.4.0/version.yaml

# Monitor upgrade progress
kubectl get upgrades -n harvester-system

Frequency:

  • Harvester releases: Every 2-3 months (minor versions)
  • Security patches: Included in Harvester releases
  • K3s upgrades: Bundled with Harvester upgrades
  • No separate OS patching (managed by Harvester)

Kubernetes Upgrades

K3s is upgraded with Harvester - no separate upgrade process.

Version Compatibility:

  • Harvester 1.4.x → K3s 1.30+
  • Harvester 1.3.x → K3s 1.28+
  • Harvester 1.2.x → K3s 1.26+

Upgrade Process:

  1. Web UI or kubectl to trigger upgrade
  2. Rolling upgrade of nodes (one at a time)
  3. VM live migration during node upgrades
  4. Automatic rollback on failure

Resource Overhead

Single Node (Harvester HCI):

  • RAM: 8GB minimum (16GB recommended for VMs)
  • CPU: 4 cores minimum (8 cores recommended)
  • Disk: 250GB minimum (SSD recommended)
    • 100GB for OS/Harvester components
    • 150GB+ for Longhorn storage (VM disks)
  • Network: 1 Gbps minimum (10 Gbps for production)

Three-Node Cluster (Production HA):

  • RAM: 32GB per node (64GB for VM-heavy workloads)
  • CPU: 8 cores per node minimum
  • Disk: 500GB+ per node (NVMe SSD recommended)
  • Network: 10 Gbps recommended (separate storage network ideal)

Comparison:

  • Ubuntu + k3s: 1GB RAM
  • Talos: 768MB RAM
  • Harvester: 8GB+ RAM (much heavier)

Note: Harvester is designed for multi-node HCI, not single-node homelabs.

Security Posture

Strengths:

  • SELinux-based (RancherOS 2.0 foundation)
  • Immutable OS layer (similar to Talos)
  • RBAC built-in (Kubernetes + Rancher)
  • Network segmentation (multus CNI)
  • VM isolation (KubeVirt)
  • Signed images and secure boot support

Attack Surface:

  • Larger than Talos/k3s: Includes web UI, VM management, storage layer
  • KubeVirt adds additional components
  • Web UI is additional attack vector
  • More processes than minimal OS (~50+ services)

Security Features:

# VM network isolation example
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: production-vlan
spec:
  vlanID: 100
  uplink:
    linkAttributes:
      mtu: 1500

Hardening:

  • Firewall rules (web UI or kubectl)
  • RBAC policies (restrict VM/namespace access)
  • Network policies (isolate workloads)
  • Rancher authentication integration (LDAP, SAML)

Learning Curve

Ease of Adoption: ⭐⭐⭐ (Moderate)

  • Web UI simplifies management (no CLI required for basic tasks)
  • Requires understanding of VMs + containers
  • Kubernetes knowledge helpful but not required initially
  • Longhorn storage concepts (replicas, snapshots)
  • KubeVirt for VM management (learning curve)

Required Knowledge:

  • Basic Kubernetes concepts (pods, services)
  • VM management (KubeVirt/libvirt)
  • Storage concepts (Longhorn, CSI)
  • Networking (VLANs, SR-IOV optional)
  • Web UI navigation

Debugging:

# Access via kubectl (kubeconfig downloaded from the web UI)
kubectl get nodes
kubectl get pods -n harvester-system

# View Harvester logs
kubectl logs -n harvester-system <pod-name>

# VM console access (via web UI or virtctl)
virtctl console <vm-name>

# Storage debugging (Longhorn volumes)
kubectl get volumes.longhorn.io -A

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐ (Good)

  • Documentation: Excellent official docs
  • Community: Active Slack, GitHub Discussions, forums
  • Commercial Support: Available from SUSE/Rancher
  • Third-Party Tools: Rancher ecosystem integration
  • Tutorials: Growing number of guides and videos

Pros and Cons Summary

Pros

  • Good, because unified platform for VMs + containers (no separate hypervisor)
  • Good, because built-in K3s (Kubernetes included)
  • Good, because web UI simplifies management (no CLI required)
  • Good, because built-in persistent storage (Longhorn CSI)
  • Good, because VM live migration (no downtime during maintenance)
  • Good, because multi-network support (multus CNI, SR-IOV)
  • Good, because Rancher integration (multi-cluster management)
  • Good, because automatic upgrades (OS + K3s + components)
  • Good, because commercial support available (SUSE)
  • Good, because designed for bare-metal HCI (no cloud dependencies)
  • Neutral, because immutable OS layer (similar to Talos benefits)

Cons

  • Bad, because very heavy resource requirements (8GB+ RAM minimum)
  • Bad, because complex architecture (KubeVirt, Longhorn, multus, etc.)
  • Bad, because overkill for container-only workloads (use k3s/Talos instead)
  • Bad, because larger attack surface than minimal OS (web UI, VM layer)
  • Bad, because requires 3+ nodes for production HA (not single-node friendly)
  • Bad, because steep learning curve for full feature set (VMs + storage + networking)
  • Bad, because relatively new platform (less mature than Ubuntu/Fedora)
  • Bad, because limited to Rancher ecosystem (vendor lock-in)
  • Bad, because slower to adopt latest Kubernetes versions (depends on K3s bundle)
  • Neutral, because opinionated HCI design (pro for VM use cases, con for simplicity)

Recommendations

Best for:

  • Hybrid workloads (VMs + containers on same platform)
  • Homelab users wanting to consolidate VM hypervisor + Kubernetes
  • Teams familiar with Rancher ecosystem
  • Multi-node clusters (3+ nodes)
  • Environments requiring VM live migration
  • Users wanting web UI for infrastructure management
  • Replacing VMware/Proxmox + Kubernetes with unified platform

Best Installation Method:

  • Only option: Interactive ISO install or PXE with cloud-init

Avoid if:

  • Running container-only workloads (use k3s or Talos instead)
  • Limited resources (< 8GB RAM per node)
  • Single-node homelab (Harvester designed for multi-node)
  • Want minimal attack surface (use Talos)
  • Prefer traditional Linux shell access (use Ubuntu/Fedora)
  • Need latest Kubernetes versions immediately (Harvester lags upstream)

Comparison with Other Options

Aspect            | Harvester         | Talos Linux         | Ubuntu + k3s     | Fedora + kubeadm
------------------|-------------------|---------------------|------------------|------------------
Primary Use Case  | VMs + Containers  | Containers only     | General-purpose  | General-purpose
Resource Overhead | 8GB+ RAM          | 768MB RAM           | 1GB RAM          | 2.2GB RAM
Kubernetes        | Built-in K3s      | Built-in            | Install k3s      | Install kubeadm
Management        | Web UI + kubectl  | API-only (talosctl) | SSH + kubectl    | SSH + kubectl
Storage           | Built-in Longhorn | External CSI        | External CSI     | External CSI
VM Support        | Native (KubeVirt) | No                  | Via KubeVirt     | Via KubeVirt
Learning Curve    | Moderate          | Steep               | Easy             | Moderate
Attack Surface    | Large             | Minimal             | Medium           | Medium
Multi-Node        | Designed for      | Supports            | Supports         | Supports
Single-Node       | Not ideal         | Excellent           | Excellent        | Good
Best for          | VM + K8s hybrid   | K8s-only            | Homelab/learning | RHEL ecosystem

Verdict: Harvester is excellent for VM + container hybrid workloads with 3+ nodes, but overkill for container-only infrastructure. Use Talos or k3s for Kubernetes-only clusters, Ubuntu/Fedora for general-purpose servers.

Advanced Features

VM Management (KubeVirt)

Create VMs via YAML:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ubuntu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
        resources:
          requests:
            memory: 4Gi
            cpu: 2
      volumes:
        - name: root
          containerDisk:
            image: docker.io/harvester/ubuntu:22.04
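
A brief usage sketch, assuming the manifest above is saved as ubuntu-vm.yaml (virtctl is the KubeVirt CLI referenced earlier):

# Create the VM and open its serial console
kubectl apply -f ubuntu-vm.yaml
virtctl console ubuntu-vm

# Stop / start the VM
virtctl stop ubuntu-vm
virtctl start ubuntu-vm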

Live Migration

Move VMs between nodes:

# Via web UI: VM → Actions → Migrate

# Via virtctl (triggers a live migration; patching spec.running would instead stop/start the VM)
virtctl migrate ubuntu-vm

Backup and Restore

Harvester supports VM backups:

# Configure S3 backup target (web UI)
# Create VM snapshot
# Restore from snapshot or backup

Rancher Integration

Manage multiple clusters:

# Import Harvester cluster into Rancher
# Deploy workloads across clusters
# Central authentication and RBAC

Use Case Examples

Use Case 1: Replace VMware + Kubernetes

Scenario: Currently running VMware ESXi for VMs + separate Kubernetes cluster

Harvester Solution:

  • Consolidate to 3-node Harvester cluster
  • Migrate VMs to KubeVirt
  • Deploy containers on same cluster
  • Save VMware licensing costs

Benefits:

  • Single platform for VMs + containers
  • Unified management (web UI + kubectl)
  • Built-in HA and live migration

Use Case 2: Homelab with Mixed Workloads

Scenario: Need Windows VMs + Linux containers + storage server

Harvester Solution:

  • Windows VMs via KubeVirt (GPU passthrough supported)
  • Linux containers via K3s workloads
  • Longhorn for persistent storage (NFS export supported)

Benefits:

  • No need for separate Proxmox/ESXi
  • Kubernetes-native management
  • Learn enterprise HCI platform

Use Case 3: Edge Computing

Scenario: Deploy compute at remote sites (3-5 nodes each)

Harvester Solution:

  • Harvester cluster at each edge location
  • Rancher for central management
  • VM + container workloads

Benefits:

  • Autonomous operation (no cloud dependency)
  • Rancher multi-cluster management
  • Built-in storage and networking

Production Readiness

Production Use: ✅ Yes (used in enterprise environments)

High Availability:

  • 3+ nodes required for HA
  • Witness node for even-node clusters
  • VM live migration during maintenance
  • Longhorn 3-replica storage

Monitoring:

  • Built-in Prometheus + Grafana
  • Rancher monitoring integration
  • Alerting and notifications

Disaster Recovery:

  • VM backups to S3
  • Cluster backups (etcd + config)
  • Restore to new cluster

Enterprise Features:

  • Rancher authentication (LDAP, SAML, OAuth)
  • Multi-tenancy (namespaces, RBAC)
  • Audit logging
  • Network policies

1.2 - Amazon Web Services Analysis

Technical analysis of Amazon Web Services capabilities for hosting network boot infrastructure

This section contains detailed analysis of Amazon Web Services (AWS) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Amazon Web Services is Amazon’s comprehensive cloud computing platform, offering compute, storage, networking, and managed services. This analysis focuses on AWS’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • EC2: Virtual machine instances for hosting boot server
  • VPN: Managed VPN options and self-managed WireGuard connectivity
  • Elastic Load Balancing: Application and Network Load Balancers
  • NAT Gateway: Network address translation for outbound connectivity
  • VPC: Virtual Private Cloud networking and routing

1.2.1 - AWS Network Boot Protocol Support

Analysis of Amazon Web Services support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Amazon Web Services

This document analyzes AWS’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Elastic Load Balancing

AWS’s Elastic Load Balancing services do not support TFTP protocol natively:

  • Application Load Balancer (ALB): HTTP/HTTPS only (Layer 7)
  • Network Load Balancer (NLB): TCP/UDP support, but not TFTP-aware
  • Classic Load Balancer: Previous generation (legacy), similar limitations

TFTP operates on UDP port 69 with unique protocol semantics (variable block sizes, retransmissions, port negotiation) that standard load balancers cannot parse.

Implementation Options

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from an EC2 instance:

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on an EC2 instance
  • Access: Home lab connects via VPN tunnel to instance’s private IP
  • Security Group: Allow UDP/69 from VPN subnet/security group
  • Pros:
    • Simple implementation
    • No load balancer needed (single boot server sufficient for home lab)
    • TFTP traffic encrypted through VPN tunnel
    • Direct instance-to-client communication
  • Cons:
    • Single point of failure (no HA)
    • Manual failover if instance fails

Option 2: Network Load Balancer (NLB) UDP Passthrough

While NLB doesn’t understand TFTP protocol, it can forward UDP traffic:

  • Approach: Configure NLB to forward UDP/69 to target group
  • Limitations:
    • No TFTP-specific health checks
    • Health checks would use TCP or different protocol
    • Adds cost and complexity without significant benefit for single server
  • Use Case: Only relevant for multi-AZ HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP itself is unencrypted, but VPN tunnel provides encryption
  • Security Groups: Restrict UDP/69 to VPN security group or CIDR only
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
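
As a concrete example of the read-only, VPN-restricted setup above, a dnsmasq-based TFTP server on the EC2 instance could be configured roughly as follows (interface name and paths are assumptions for illustration):

# /etc/dnsmasq.d/tftp.conf - TFTP only, bound to the WireGuard interface
port=0               # disable dnsmasq's DNS function
interface=wg0        # assumed VPN interface name
enable-tftp
tftp-root=/srv/tftp  # serve files from this directory only
tftp-secure          # restrict to files owned by the dnsmasq user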

HTTP Support

Native Support

Status: ✅ Fully supported

AWS provides comprehensive HTTP support through multiple services:

Elastic Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, and gRPC (HTTP/3 is offered by CloudFront, not ALB)
  • Port: Any port (typically 80 for HTTP)
  • Routing: Path-based, host-based, query string, header-based routing
  • Health Checks: HTTP health checks with configurable paths and response codes
  • SSL Offloading: Terminate SSL at ALB and use HTTP to backend
  • Backend: EC2 instances, ECS, EKS, Lambda

EC2 Direct Access

For VPN scenario, HTTP can be served directly from EC2 instance:

  • Approach: Run HTTP server (nginx, Apache, custom service) on EC2
  • Access: Home lab accesses via VPN tunnel to private IP
  • Security Group: Allow TCP/80 from VPN security group
  • Pros: Simpler than ALB for single boot server
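
A minimal nginx configuration for serving boot artifacts over the VPN might look like the following (listen address, server name, and directory are assumptions):

# /etc/nginx/conf.d/boot.conf - serve kernels/initrds to VPN clients
server {
    listen 10.8.0.1:80;          # instance's VPN-facing address (example)
    server_name boot.homelab.internal;

    root /srv/boot;              # kernels, initrds, iPXE scripts
    autoindex on;                # convenient for debugging
    access_log /var/log/nginx/boot_access.log;
}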

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads kernel/initrd via HTTP
  3. Kernel/Initrd: Large boot files served efficiently over HTTP
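
For step 2, the iPXE script that chainloads the kernel and initrd over HTTP is small; a sketch (hostname, file names, and kernel arguments are placeholders):

#!ipxe
# boot.ipxe - fetched by iPXE after the TFTP stage
kernel http://boot.homelab.internal/vmlinuz console=tty0
initrd http://boot.homelab.internal/initrd.img
boot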

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based configs
  • CloudFront: Optional CDN for caching boot files (probably overkill for VPN scenario)
  • TCP Optimization: AWS network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

AWS provides enterprise-grade HTTPS support:

Elastic Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1 and HTTP/2 over TLS
  • SSL/TLS Termination: Terminate SSL at ALB
  • Certificate Management:
    • AWS Certificate Manager (ACM) - free SSL certificates with automatic renewal
    • Import custom certificates
    • Integration with private CA via ACM Private CA
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable via security policy)
  • Cipher Suites: Predefined security policies (modern, compatible, legacy)
  • SNI Support: Multiple certificates on single load balancer

AWS Certificate Manager (ACM)

  • Free Certificates: No cost for public SSL certificates used with AWS services
  • Automatic Renewal: ACM automatically renews certificates before expiration
  • Private CA: ACM Private CA for internal PKI (additional cost)
  • Integration: Native integration with ALB, CloudFront, API Gateway

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot
  • Security: Boot file integrity verified via HTTPS chain of trust

Implementation on AWS

  1. Certificate Provisioning:

    • Use ACM certificate for public domain (free, auto-renewed)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use ACM Private CA for internal PKI ($400/month - expensive for home lab)
  2. ALB Configuration:

    • HTTPS listener on port 443
    • Target group pointing to EC2 boot server
    • Security policy with TLS 1.2+ minimum
  3. Alternative: Direct EC2 HTTPS:

    • Run nginx/Apache with TLS on EC2 instance
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario
    • Use Let’s Encrypt or a self-signed certificate (see the nginx sketch below)
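
A sketch of the direct-EC2 nginx approach (the certificate paths, hostname, and asset directory are assumptions):

# /etc/nginx/conf.d/boot-server.conf
server {
    listen 443 ssl;
    server_name boot.homelab.internal;          # placeholder hostname

    ssl_certificate     /etc/ssl/certs/boot-server.crt;
    ssl_certificate_key /etc/ssl/private/boot-server.key;
    ssl_protocols       TLSv1.2 TLSv1.3;

    location / {
        root /var/lib/boot-assets;              # kernels, initrds, iPXE scripts
        autoindex off;
    }
}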

Mutual TLS (mTLS) Support

AWS ALB supports mutual TLS authentication (introduced in 2023):

  • Client Certificates: Require client certificates for authentication
  • Trust Store: Upload trusted CA certificates to ALB
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth
  • Passthrough Mode: ALB can pass client cert to backend for validation

Routing and Load Balancing Capabilities

VPC Routing

  • Route Tables: Define routes to direct traffic through VPN gateway
  • Route Propagation: BGP route propagation for VPN connections
  • Transit Gateway: Advanced multi-VPC/VPN routing (overkill for home lab)

Security Groups

  • Stateful Firewall: Automatic return traffic handling
  • Ingress/Egress Rules: Fine-grained control by protocol, port, source/destination
  • Security Group Chaining: Reference security groups in rules (elegant for VPN setup)
  • VPN Subnet Restriction: Allow traffic only from VPN-connected subnet

Network ACLs (Optional)

  • Stateless Firewall: Subnet-level access control
  • Defense in Depth: Additional layer beyond security groups
  • Use Case: Probably unnecessary for simple VPN boot server

Cost Implications

Data Transfer Costs

  • VPN Traffic: Data transfer through VPN gateway charged at standard rates
  • Intra-Region: Free for traffic within same region/VPC
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.14/month (US East egress)

Load Balancing Costs

  • Application Load Balancer: $0.0225/hour + $0.008 per LCU-hour ($16-20/month minimum)
  • Network Load Balancer: $0.0225/hour + $0.006 per NLCU-hour ($16-18/month minimum)
  • For VPN Scenario: Load balancer unnecessary (single EC2 instance sufficient)

Compute Costs

  • t3.micro Instance: ~$7.50/month (on-demand pricing, US East)
  • t4g.micro Instance: ~$6.00/month (ARM-based, cheaper, sufficient for boot server)
  • Reserved Instances: Up to 72% savings with 1-year or 3-year commitment
  • Savings Plans: Flexible discounts for consistent compute usage

ACM Certificate Costs

  • Public Certificates: Free when used with AWS services
  • Private CA: $400/month (too expensive for home lab)

Comparison with Requirements

Requirement | AWS Support | Implementation
TFTP | ⚠️ Via EC2, not ELB | Direct EC2 access via VPN
HTTP | ✅ Full support | EC2 or ALB
HTTPS | ✅ Full support | EC2 or ALB with ACM
VPN Integration | ✅ Native VPN | Site-to-Site VPN or self-managed
Load Balancing | ✅ ALB, NLB | Optional for HA
Certificate Mgmt | ✅ ACM (free) | Automatic renewal
Cost Efficiency | ✅ Low-cost instances | t4g.micro sufficient

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. EC2 Instance: Deploy single t4g.micro or t3.micro instance with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with Let’s Encrypt or self-signed certificate
  2. VPN Connection: Connect home lab to AWS via:

    • Site-to-Site VPN (IPsec) - managed service, higher cost (~$36/month)
    • Self-managed WireGuard on EC2 - lower cost, more control
  3. Security Groups: Restrict access to:

    • UDP/69 (TFTP) from VPN security group only
    • TCP/80 (HTTP) from VPN security group only
    • TCP/443 (HTTPS) from VPN security group only
  4. No Load Balancer: For home lab scale, direct EC2 access is sufficient

  5. Health Monitoring: Use CloudWatch for instance and service health

If HA Required (Future Enhancement)

  • Deploy multi-AZ EC2 instances with Network Load Balancer
  • Use S3 as backend for boot files with EC2 serving as cache
  • Implement auto-recovery with Auto Scaling Group (min=max=1)

References

1.2.2 - AWS WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Amazon Web Services for secure site-to-site connectivity

WireGuard VPN Support on Amazon Web Services

This document analyzes options for deploying WireGuard VPN on AWS to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

AWS Native VPN Support

Site-to-Site VPN (IPsec)

Status: ❌ WireGuard not natively supported

AWS’s managed Site-to-Site VPN supports:

  • IPsec VPN: IKEv1, IKEv2 with pre-shared keys
  • Redundancy: Two VPN tunnels per connection for high availability
  • BGP Support: Dynamic routing via BGP
  • Transit Gateway: Scalable multi-VPC VPN hub

Limitation: Site-to-Site VPN does not support WireGuard protocol natively.

Cost: Site-to-Site VPN

  • VPN Connection: ~$0.05/hour = ~$36/month
  • Data Transfer: Standard data transfer out rates (~$0.09/GB for first 10TB)
  • Total Estimate: ~$36-50/month for managed IPsec VPN

Self-Managed WireGuard on EC2

Implementation Approach

Since AWS doesn’t offer managed WireGuard, deploy WireGuard on an EC2 instance:

Status: ✅ Fully supported via EC2

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[AWS EC2 Instance]
    B -->|VPC Network| C[Boot Server EC2]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "AWS VPC"
        B[WireGuard Gateway EC2]
        C[Boot Server EC2]
    end

EC2 Configuration

  1. WireGuard Gateway Instance:

    • Instance Type: t4g.micro or t3.micro ($6-7.50/month)
    • OS: Ubuntu 22.04 LTS or Amazon Linux 2023 (native WireGuard support)
    • Source/Dest Check: Disable to allow IP forwarding
    • Elastic IP: Allocate Elastic IP for stable WireGuard endpoint
    • Security Group: Allow UDP port 51820 from home lab public IP
  2. Boot Server Instance:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No Elastic IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway instance

Installation Steps

# On EC2 Instance (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on AWS EC2:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <AWS_ELASTIC_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.0.0.0/16
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

AWS VPC Configuration

Security Groups

Create security group for WireGuard gateway:

aws ec2 create-security-group \
    --group-name wireguard-gateway-sg \
    --description "WireGuard VPN gateway" \
    --vpc-id vpc-xxxxxx

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxx \
    --protocol udp \
    --port 51820 \
    --cidr <HOME_LAB_PUBLIC_IP>/32

Allow SSH for management (optional, restrict to trusted IP):

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxx \
    --protocol tcp \
    --port 22 \
    --cidr <TRUSTED_IP>/32

Disable Source/Destination Check

Required for IP forwarding to work:

aws ec2 modify-instance-attribute \
    --instance-id i-xxxxxx \
    --no-source-dest-check

Elastic IP Allocation

Allocate and associate Elastic IP for stable endpoint:

aws ec2 allocate-address --domain vpc

aws ec2 associate-address \
    --instance-id i-xxxxxx \
    --allocation-id eipalloc-xxxxxx

Cost: An Elastic IP was historically free while associated with a running instance (and ~$3.60/month if unattached); since February 2024, AWS bills all public IPv4 addresses at ~$0.005/hour (~$3.65/month), which adds slightly to the estimates below.

Route Table Configuration

Add route to direct home lab subnet traffic through WireGuard gateway:

aws ec2 create-route \
    --route-table-id rtb-xxxxxx \
    --destination-cidr-block 192.168.1.0/24 \
    --instance-id i-xxxxxx

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway instance.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: AWS EC2 WireGuard instance’s public key
    • Endpoint: AWS Elastic IP address
    • Port: 51820
    • Allowed IPs: AWS VPC CIDR (e.g., 10.0.0.0/16)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to AWS subnets
    • Home lab servers can reach AWS boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard on boot
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on EC2 performance varies by instance type:

  • t4g.micro (2 vCPU, ARM): ~100-300 Mbps
  • t3.micro (2 vCPU, x86): ~100-300 Mbps
  • t3.small (2 vCPU): ~500-800 Mbps
  • t3.medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even t4g.micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: t4g.micro adequate and most cost-effective

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms)
  • AWS Network: Low-latency network infrastructure
  • Total Latency: Primarily dependent on home ISP and AWS region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • t4g.micro: Sufficient CPU for home lab VPN throughput
  • ARM Advantage: t4g instances use Graviton processors (better price/performance)

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secrets Manager: Store WireGuard private keys in AWS Secrets Manager
    • Retrieve at instance startup via user data script
    • Avoid storing in AMIs or instance metadata
  • IAM Role: Grant EC2 instance IAM role to read secret

Firewall Hardening

  • Security Group Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server security group allows only VPN security group
  • No Public Access: Boot server has no Elastic IP or public route

Monitoring and Alerts

  • CloudWatch Logs: Stream WireGuard logs to CloudWatch
  • CloudWatch Alarms: Alert on VPN tunnel down (no recent handshakes)
  • VPC Flow Logs: Monitor VPN traffic patterns

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification attacks
  • AWS Shield: Basic DDoS protection included free on all AWS resources
  • Shield Advanced: Optional ($3,000/month - overkill for VPN endpoint)

High Availability Options

Multi-AZ Failover

Deploy WireGuard gateways in multiple Availability Zones:

  • Primary: us-east-1a WireGuard instance
  • Secondary: us-east-1b WireGuard instance
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles instance costs (~$12-15/month for 2 instances)

Auto Scaling Group (Single Instance)

Use Auto Scaling Group with min=max=1 for auto-recovery:

  • Health Checks: EC2 status checks
  • Auto-Recovery: ASG replaces failed instance automatically
  • Elastic IP: Reassociate the Elastic IP to the new instance via Lambda or a script (see the sketch after this list)
  • Limitation: Brief downtime during recovery (~2-5 minutes)
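
A sketch of the Elastic IP reassociation step, for example run from the replacement instance's user data (the allocation ID is a placeholder):

#!/bin/bash
# Claim the shared Elastic IP when the replacement instance launches
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 associate-address \
    --instance-id "$INSTANCE_ID" \
    --allocation-id eipalloc-xxxxxx \
    --allow-reassociation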

Health Monitoring

Monitor WireGuard tunnel health with CloudWatch custom metrics:

#!/bin/bash
# On the EC2 instance, run periodically via cron
HANDSHAKE=$(wg show wg0 latest-handshakes | awk '{print $2}')
NOW=$(date +%s)
AGE=$((NOW - HANDSHAKE))

aws cloudwatch put-metric-data \
    --namespace WireGuard \
    --metric-name TunnelAge \
    --value $AGE \
    --unit Seconds

Alert if handshake age exceeds threshold (e.g., 180 seconds).
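
The corresponding alarm could be created once with the AWS CLI (the SNS topic ARN is a placeholder):

aws cloudwatch put-metric-alarm \
    --alarm-name wireguard-tunnel-stale \
    --namespace WireGuard \
    --metric-name TunnelAge \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 180 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:homelab-alerts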

User Data Script for Auto-Configuration

EC2 user data script to configure WireGuard on launch:

#!/bin/bash
# Install WireGuard and the AWS CLI (used below to retrieve the key from Secrets Manager)
apt update && apt install -y wireguard wireguard-tools awscli

# Retrieve private key from Secrets Manager
aws secretsmanager get-secret-value \
    --secret-id wireguard-server-key \
    --query SecretString \
    --output text > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Configure interface (full config omitted for brevity)
# ...

# Enable and start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Requires IAM instance role with secretsmanager:GetSecretValue permission.
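
A minimal IAM policy sketch for that role (account ID, region, and the secret name suffix are placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:wireguard-server-key-*"
    }
  ]
}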

Cost Analysis

Self-Managed WireGuard on EC2

Component | Cost (US East)
t4g.micro instance (730 hrs/month) | ~$6.00
Elastic IP (attached) | $0.00
Data transfer out (1GB/month) | ~$0.09
Monthly Total | ~$6.09
Annual Total | ~$73

With Reserved Instance (1-year, no upfront):

Component | Cost
t4g.micro RI (1-year) | ~$3.50/month
Elastic IP | $0.00
Data transfer | ~$0.09
Monthly Total | ~$3.59
Annual Total | ~$43

Site-to-Site VPN (IPsec - if WireGuard not used)

Component | Cost
VPN Connection (2 tunnels) | ~$36
Data transfer (1GB/month) | ~$0.09
Monthly Total | ~$36
Annual Total | ~$432

Cost Savings: Self-managed WireGuard saves ~$360/year vs Site-to-Site VPN (or ~$390/year with Reserved Instance).

Comparison with Requirements

Requirement | AWS Support | Implementation
WireGuard Protocol | ✅ Via EC2 | Self-managed on instance
Site-to-Site VPN | ✅ Yes | WireGuard tunnel
UDM Pro Integration | ✅ Native support | WireGuard peer config
Cost Efficiency | ✅ Very low cost | t4g.micro ~$6/month (on-demand)
Performance | ✅ Sufficient | 100+ Mbps on t4g.micro
Security | ✅ Modern crypto | ChaCha20, Curve25519
HA (optional) | ⚠️ Manual setup | Multi-AZ or ASG

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on EC2 t4g.micro instance

    • Cost: ~$6/month on-demand, ~$3.50/month with Reserved Instance
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single AZ Deployment: Unless HA required, single instance adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • AZ: Single AZ sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add AWS instance as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes AWS subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secrets Manager
    • Restrict security group to home lab public IP only
    • Use user data script to retrieve key and configure on boot
    • Enable CloudWatch logging for VPN events
    • Assign IAM instance role with minimal permissions
  5. Monitoring: Set up CloudWatch alarms for:

    • Instance status check failures
    • High CPU usage
    • VPN tunnel age (custom metric)

Cost Optimization

  • Reserved Instance: Commit to 1-year Reserved Instance for ~40% savings
  • Spot Instance: Consider Spot for even lower cost (~70% savings), but adds complexity (handle interruptions)
  • ARM Architecture: Use t4g (Graviton) for 20% better price/performance vs t3

Future Enhancements

  • HA Setup: Deploy secondary WireGuard instance in different AZ
  • Automated Failover: Lambda function to reassociate Elastic IP on failure
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added

References

1.3 - Google Cloud Platform Analysis

Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure

This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • Compute Engine: Virtual machine instances for hosting boot server
  • Cloud VPN / VPC: Network connectivity and VPN capabilities
  • Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
  • Cloud NAT: Network address translation for outbound connectivity
  • VPC Network: Software-defined networking and routing

Documentation Sections

1.3.1 - Cloud Storage FUSE (gcsfuse)

Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure

Overview

Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.

Project: GoogleCloudPlatform/gcsfuse
License: Apache 2.0
Status: Generally Available (GA)
Latest Version: v2.x (as of 2024)

How gcsfuse Works

gcsfuse translates filesystem operations into GCS API calls:

  1. Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
  2. Directory Structure: Interprets / in object names as directory separators
  3. File Operations: Translates read(), write(), open(), etc. into GCS API requests
  4. Metadata: Maintains file attributes (size, modification time) via GCS metadata
  5. Caching: Optional stat, type, list, and file caching to reduce API calls

Example:

  • GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
  • Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img

Relevance to Network Boot Infrastructure

In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.

Potential Use Cases

  1. Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
  2. Configuration Sync: Access boot profiles and machine mappings from GCS as local files
  3. Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
  4. Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
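
As an example of use case 1, the mount could be made persistent on a Compute Engine VM with an /etc/fstab entry using the gcsfuse mount helper (bucket name and options are illustrative):

# /etc/fstab
boot-assets /var/lib/boot-server/assets gcsfuse rw,_netdev,allow_other,implicit_dirs,file_mode=644,dir_mode=755 0 0

After adding the entry, sudo mount -a mounts the bucket without a manual gcsfuse invocation.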

Architecture Pattern

┌─────────────────────────┐
│   Boot Server Process   │
│  (Cloud Run/Compute)    │
└───────────┬─────────────┘
            │ filesystem operations
            │ (read, open, stat)
            ▼
┌─────────────────────────┐
│   gcsfuse mount point   │
│   /var/lib/boot-assets  │
└───────────┬─────────────┘
            │ FUSE layer
            │ (translates to GCS API)
            ▼
┌─────────────────────────┐
│  Cloud Storage Bucket   │
│   gs://boot-assets/     │
└─────────────────────────┘

Performance Characteristics

Latency

  • Much higher latency than local filesystem: Every operation requires GCS API call(s)
  • No file caching by default: without the file cache enabled, every read re-fetches object content from GCS
  • Network round-trip: Minimum ~10-50ms latency per operation (depending on region)

Throughput

Single Large File:

  • Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
  • Write: Comparable to gsutil cp for large files
  • With parallel downloads: Up to 9x faster for single-threaded reads of large files

Small Files:

  • Poor performance for random I/O on small files
  • Bulk operations on many small files create significant bottlenecks
  • ls on directories with thousands of objects can take minutes

Concurrent Access:

  • Performance degrades significantly with parallel readers (in one reported benchmark, a job running 8 parallel instances took ~30 hours via gcsfuse versus ~16 minutes with local data)
  • Not recommended for high-concurrency scenarios (web servers, NAS)

Performance Improvements (Recent Features)

  1. Streaming Writes (default): Upload data directly to GCS as written

    • Up to 40% faster for large sequential writes
    • Reduces local disk usage (no staging file)
  2. Parallel Downloads: Download large files using multiple workers

    • Up to 9x faster model load times
    • Best for single-threaded reads of large files
  3. File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)

    • Up to 2.3x faster training time (AI/ML workloads)
    • Up to 3.4x higher throughput
    • Requires explicit cache directory configuration
  4. Metadata Cache: Cache stat, type, and list operations

    • Stat and type caches enabled by default
    • Configurable TTL (default: 60s, set -1 for unlimited)

Caching Configuration

gcsfuse provides four types of caching:

1. Stat Cache

Caches file attributes (size, modification time, existence).

# Enable with unlimited size and TTL
gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).

2. Type Cache

Caches file vs directory type information.

gcsfuse \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Speeds up directory traversal and ls operations.

3. List Cache

Caches directory listing results.

gcsfuse \
  --max-conns-per-host=100 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Improves performance for applications that repeatedly list directory contents.

4. File Cache

Caches actual file contents locally.

gcsfuse \
  --file-cache-max-size-mb=-1 \
  --cache-dir=/mnt/local-ssd \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  bucket-name /mount/point

Use case: Essential for AI/ML training, repeated reads of large files.

Recommended cache storage:

  • Local SSD: Fastest, but ephemeral (data lost on restart)
  • Persistent Disk: Persistent but slower than Local SSD
  • tmpfs (RAM disk): Fastest but limited by memory

Production Configuration Example

# config.yaml for gcsfuse
metadata-cache:
  ttl-secs: -1  # Never expire (use only if bucket is read-only or single-writer)
  stat-cache-max-size-mb: -1
  type-cache-max-size-mb: -1

file-cache:
  max-size-mb: -1  # Unlimited (limited by disk space)
  cache-file-for-range-read: true
  enable-parallel-downloads: true
  parallel-downloads-per-file: 16
  download-chunk-size-mb: 50

write:
  create-empty-file: false  # Streaming writes (default)

logging:
  severity: info
  format: json

Mount using the config file:

gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets

Limitations and Considerations

Filesystem Semantics

gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:

  • No atomic rename: Rename operations are copy-then-delete (not atomic)
  • No hard links: GCS doesn’t support hard links
  • No file locking: flock() is a no-op
  • Limited permissions: GCS has simpler ACLs than POSIX permissions
  • No sparse files: Writes always materialize full file content

Performance Anti-Patterns

Avoid:

  • Serving web content or acting as NAS (concurrent connections)
  • Random I/O on many small files (image datasets, text corpora)
  • Reading during ML training loops (download first, then train)
  • High-concurrency workloads (multiple parallel readers/writers)

Good for:

  • Sequential reads of large files (models, checkpoints, kernels)
  • Infrequent writes of entire files
  • Read-mostly workloads with caching enabled
  • Single-writer scenarios

Consistency Trade-offs

With caching enabled:

  • Stale reads possible if cache TTL > 0 and external modifications occur
  • Safe only for:
    • Read-only buckets
    • Single-writer, single-mount scenarios
    • Workloads tolerant of eventual consistency

Without caching:

  • Strong consistency (every read fetches latest from GCS)
  • Much slower performance

Resource Requirements

  • Disk space: File cache and streaming writes require local storage
    • File cache: Size of cached files (can be large for ML datasets)
    • Streaming writes: Temporary staging (proportional to concurrent writes)
  • Memory: Metadata caches consume RAM
  • File handles: Can exceed system limits with high concurrency
  • Network bandwidth: All data transfers via GCS API

Installation

On Compute Engine (manual .deb install)

# Install gcsfuse from the release .deb (works on Debian/Ubuntu images; Container-Optimized OS
# has no package manager and would instead run gcsfuse inside a container)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb

On Debian/Ubuntu

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-get update
sudo apt-get install gcsfuse

In Docker/Cloud Run

FROM ubuntu:22.04

# Install gcsfuse
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    lsb-release \
  && export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
  && echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install -y gcsfuse \
  && rm -rf /var/lib/apt/lists/*

# Create mount point
RUN mkdir -p /mnt/boot-assets

# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
    /usr/local/bin/boot-server

Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.

Network Boot Infrastructure Evaluation

Applicability to ADR-0005

Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:

❌ Cloud Run Incompatibility

  • gcsfuse requires FUSE kernel module and privileged containers
  • Cloud Run does not support FUSE or privileged mode
  • ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
  • Impact: Blocks Cloud Run deployment, forcing Compute Engine VM

❌ Boot Latency Requirements

  • Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
  • gcsfuse adds 10-50ms+ latency per operation (network round-trips)
  • Kernel/initrd downloads are latency-sensitive (network boot timeout)
  • Impact: May exceed boot timeout thresholds

❌ No Caching for Read-Write Workloads

  • Boot server needs to write new assets and read existing ones
  • File cache with unlimited TTL requires read-only or single-writer assumption
  • Multiple boot server instances (autoscaling) violate single-writer constraint
  • Impact: Either accept stale reads or disable caching (slow)

❌ Small File Performance

  • Machine mapping configs, boot scripts, profiles are small files (KB range)
  • gcsfuse performs poorly on small, random I/O
  • ls operations on directories with many profiles can be slow
  • Impact: Slow boot configuration lookups

✅ Alternative: Direct Cloud Storage SDK

Using cloud.google.com/go/storage SDK directly offers:

  • Lower latency: Direct API calls without FUSE overhead
  • Cloud Run compatible: No kernel module or privileged mode required
  • Better control: Explicit caching, parallel downloads, streaming
  • Simpler deployment: No mount management, no FUSE dependencies
  • Cost: Similar API call costs to gcsfuse

Recommended approach (from ADR-0005):

// Custom boot server handler using the Cloud Storage SDK
client, err := storage.NewClient(ctx)
if err != nil {
    log.Fatalf("storage.NewClient: %v", err)
}
bucket := client.Bucket("boot-assets")

// Stream kernel to boot client
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
    http.Error(w, "boot asset not found", http.StatusNotFound)
    return
}
defer reader.Close()
io.Copy(w, reader) // Stream object contents directly to the HTTP response

When gcsfuse MIGHT Be Useful

Despite the above limitations, gcsfuse could be considered for:

  1. Matchbox on Compute Engine:

    • Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
    • Compute Engine VM supports FUSE
    • Read-heavy workload (boot assets rarely change)
    • Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
  2. Development/Testing:

    • Quick prototyping without writing Cloud Storage integration
    • Local development with production bucket access
    • Not recommended for production deployment
  3. Low-Throughput Scenarios:

    • Home lab scale (< 10 boots/hour)
    • File cache enabled with Local SSD
    • Single Compute Engine VM (not autoscaled)

Configuration for Matchbox + gcsfuse:

#!/bin/bash
# Mount boot assets for Matchbox

BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"

mkdir -p "$MOUNT_POINT" "$CACHE_DIR"

gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  --file-cache-max-size-mb=-1 \
  --cache-dir="$CACHE_DIR" \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  --implicit-dirs \
  --foreground \
  "$BUCKET" "$MOUNT_POINT"

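To run this mount automatically at boot, the script above could be wrapped in a simple systemd service (a sketch; the script path /usr/local/bin/mount-boot-assets.sh is an assumption):

# /etc/systemd/system/gcsfuse-matchbox.service
[Unit]
Description=Mount Matchbox boot assets via gcsfuse
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mount-boot-assets.sh
ExecStop=/bin/fusermount -u /var/lib/matchbox/assets
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now gcsfuse-matchbox.service.
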
Monitoring and Troubleshooting

Metrics

gcsfuse exposes Prometheus metrics:

gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point

Key metrics:

  • gcs_read_count: Number of GCS read operations
  • gcs_write_count: Number of GCS write operations
  • gcs_read_bytes: Bytes read from GCS
  • gcs_write_bytes: Bytes written to GCS
  • fs_ops_count: Filesystem operations by type (open, read, write, etc.)
  • fs_ops_error_count: Filesystem operation errors

Logging

# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point

Common Issues

Issue: ls on large directories takes minutes

Solution:

  • Enable list caching with --metadata-cache-ttl-secs=-1
  • Reduce directory depth (flatten object hierarchy)
  • Consider prefix-based filtering instead of full listings

Issue: Stale reads after external bucket modifications

Solution:

  • Reduce --metadata-cache-ttl-secs (default 60s)
  • Disable caching entirely for strong consistency
  • Use versioned object names (immutable assets)

Issue: Transport endpoint is not connected errors

Solution:

  • Unmount cleanly before remounting: fusermount -u /mnt/point
  • Check GCS bucket permissions (IAM roles)
  • Verify network connectivity to storage.googleapis.com

Issue: High memory usage

Solution:

  • Limit metadata cache sizes: --stat-cache-max-size-mb=1024
  • Disable file cache if not needed
  • Monitor with --prometheus metrics

Comparison to Alternatives

gcsfuse vs Direct Cloud Storage SDK

Aspect | gcsfuse | Cloud Storage SDK
Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API)
Cloud Run | ❌ Not supported | ✅ Fully supported
Development Effort | Low (standard filesystem code) | Medium (SDK integration)
Performance | Slower (filesystem abstraction) | Faster (optimized for use case)
Caching | Built-in (stat, type, list, file) | Manual (application-level)
Streaming | Automatic | Explicit (io.Copy)
Dependencies | FUSE kernel module, privileged mode | None (pure Go library)

Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.

gcsfuse vs rsync/gsutil Sync

Periodic sync pattern:

# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets

Aspect | gcsfuse | rsync/gsutil sync
Consistency | Eventual (with caching) | Strong (within sync interval)
Disk Usage | Minimal (file cache optional) | Full copy of assets
Latency | GCS API per request | Local disk (fast)
Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes)
Deployment | Requires FUSE | Simple cron job

Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.

Conclusion

Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:

  1. Cloud Run incompatibility (requires FUSE kernel module)
  2. Added latency (FUSE overhead + network round-trips)
  3. Poor performance for small files and concurrent access
  4. Caching trade-offs (consistency vs performance)

Recommended alternatives:

  • Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
  • Matchbox on Compute Engine: rsync/gsutil sync to local disk
  • Cloud Run Deployment: Direct SDK (no gcsfuse possible)

gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.

References

1.3.2 - GCP Network Boot Protocol Support

Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Google Cloud Platform

This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Cloud Load Balancing

GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.

Implementation Options

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing:

Option 1: Direct Compute Engine VM Access (Recommended)

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
  • Access: Home lab connects via VPN tunnel to the VM’s private IP
  • Routing: VPC firewall rules allow UDP/69 from the VPN subnet (see the firewall example after this list)
  • Pros:
    • Simple implementation
    • No need for load balancing (single boot server sufficient)
    • TFTP traffic encrypted through VPN tunnel
    • Direct VM-to-client communication
  • Cons:
    • Single point of failure (no load balancing/HA)
    • Manual failover required if VM fails
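
A sketch of the corresponding VPC firewall rule (network name, target tag, and VPN subnet CIDR are placeholders):

gcloud compute firewall-rules create allow-boot-from-vpn \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=udp:69,tcp:80,tcp:443 \
    --source-ranges=10.200.0.0/24 \
    --target-tags=boot-server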

Option 2: Network Load Balancer (NLB) Passthrough

While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:

  • Approach: Configure Network Load Balancer for UDP/69 passthrough
  • Limitations:
    • No protocol-aware health checks for TFTP
    • Health checks would use TCP or HTTP on alternate port
    • Adds complexity without significant benefit for single boot server
  • Use Case: Only relevant for multi-region HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
  • Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads

HTTP Support

Native Support

Status: ✅ Fully supported

GCP provides comprehensive HTTP support through multiple services:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
  • Port: Any port (typically 80 for HTTP)
  • Routing: URL-based routing, host-based routing, path-based routing
  • Health Checks: HTTP health checks with configurable paths
  • SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
  • Backend: Compute Engine VMs, instance groups, Cloud Run, GKE

Compute Engine Direct Access

For VPN scenario, HTTP can be served directly from VM:

  • Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
  • Access: Home lab accesses via VPN tunnel to private IP
  • Firewall: VPC firewall rules allow TCP/80 from VPN subnet
  • Pros: Simpler than load balancer for single boot server

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
  3. Kernel/Initrd: Large boot files served efficiently over HTTP

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based boot configs
  • Caching: Cloud CDN can cache boot files for faster delivery
  • TCP Optimization: GCP’s network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

GCP provides enterprise-grade HTTPS support:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
  • SSL/TLS Termination: Terminate SSL at load balancer
  • Certificate Management:
    • Google-managed SSL certificates (automatic renewal)
    • Self-managed certificates (bring your own)
    • Certificate Map for multiple domains
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
  • Cipher Suites: Modern, compatible, or custom cipher suites
  • mTLS Support: Mutual TLS authentication (client certificates)

Certificate Manager

  • Managed Certificates: Automatic provisioning and renewal via Let’s Encrypt integration
  • Private CA: Integration with Google Cloud Certificate Authority Service
  • Certificate Maps: Route different domains to different backends based on SNI
  • Certificate Monitoring: Automatic alerts before expiration

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (defined in UEFI specification 2.5 and later)
  • Security: Boot file integrity verified via HTTPS chain of trust

Implementation on GCP

  1. Certificate Provisioning:

    • Use Google-managed certificate for public domain (if boot server has public DNS)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use private CA for internal PKI
  2. Load Balancer Configuration:

    • HTTPS frontend (port 443)
    • Backend service to Compute Engine VM running boot server
    • SSL policy with TLS 1.2+ minimum
  3. Alternative: Direct VM HTTPS:

    • Run nginx/Apache with TLS on Compute Engine VM
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario

mTLS Support for Enhanced Security

GCP’s Application Load Balancer supports mutual TLS authentication:

  • Client Certificates: Require client certificates for additional authentication
  • Certificate Validation: Validate client certificates against trusted CA
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth

Routing and Load Balancing Capabilities

VPC Routing

  • Custom Routes: Define routes to direct traffic through VPN gateway
  • Route Priority: Configure route priorities for failover scenarios
  • BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)

Firewall Rules

  • Ingress/Egress Rules: Fine-grained control over traffic
  • Source/Destination Filters: IP ranges, tags, service accounts
  • Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
  • VPN Subnet Restriction: Limit access to VPN-connected home lab subnet

Cloud Armor (Optional)

For additional security if boot server has public access:

  • DDoS Protection: Layer 3/4 DDoS mitigation
  • WAF Rules: Application-level filtering
  • IP Allowlisting: Restrict to known public IPs
  • Rate Limiting: Prevent abuse

Cost Implications

Network Egress Costs

  • VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
  • Intra-Region: Free for traffic within same region
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)

Load Balancing Costs

  • Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
  • Network Load Balancer: ~$0.025/hour + data processing charges
  • For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)

Compute Costs

  • e2-micro Instance: ~$6-7/month (suitable for boot server)
  • f1-micro Instance: ~$4-5/month (even smaller, might suffice)
  • Reserved/Committed Use: Discounts for long-term commitment

Comparison with Requirements

Requirement | GCP Support | Implementation
TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN
HTTP | ✅ Full support | VM or ALB
HTTPS | ✅ Full support | VM or ALB with Certificate Manager
VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard
Load Balancing | ✅ ALB, NLB | Optional for HA
Certificate Mgmt | ✅ Managed certs | Certificate Manager
Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. Compute Engine VM: Deploy single e2-micro VM with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with self-signed certificate
  2. VPN Tunnel: Connect home lab to GCP via:

    • Cloud VPN (IPsec) - easier setup, higher cost
    • Self-managed WireGuard on Compute Engine - lower cost, more control
  3. VPC Firewall: Restrict access to:

    • UDP/69 (TFTP) from VPN subnet only
    • TCP/80 (HTTP) from VPN subnet only
    • TCP/443 (HTTPS) from VPN subnet only
  4. No Load Balancer: For home lab scale, direct VM access is sufficient

  5. Health Monitoring: Use Cloud Monitoring for VM and service health

If HA Required (Future Enhancement)

  • Deploy multi-zone VMs with Network Load Balancer
  • Use Cloud Storage as backend for boot files with VM serving as cache
  • Implement failover automation with Cloud Functions

References

1.3.3 - GCP WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity

WireGuard VPN Support on Google Cloud Platform

This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

GCP Native VPN Support

Cloud VPN (IPsec)

Status: ❌ WireGuard not natively supported

GCP’s managed Cloud VPN service supports:

  • IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
  • HA VPN: Highly available VPN with 99.99% SLA
  • Classic VPN: Single-tunnel VPN (deprecated)

Limitation: Cloud VPN does not support WireGuard protocol natively.

Cost: Cloud VPN

  • HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
  • Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
  • Total Estimate: ~$75-100/month for managed VPN

Self-Managed WireGuard on Compute Engine

Implementation Approach

Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:

Status: ✅ Fully supported via Compute Engine

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
    B -->|Private VPC Network| C[Boot Server VM]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "GCP VPC"
        B[WireGuard Gateway VM]
        C[Boot Server VM]
    end

VM Configuration

  1. WireGuard Gateway VM:

    • Instance Type: e2-micro or f1-micro ($4-7/month)
    • OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
    • IP Forwarding: Enable IP forwarding to route traffic to other VMs
    • External IP: Static external IP for stable WireGuard endpoint
    • Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
  2. Boot Server VM:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No external IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway VM

Installation Steps

# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on GCP VM:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

GCP VPC Configuration

Firewall Rules

Create VPC firewall rule to allow WireGuard:

gcloud compute firewall-rules create allow-wireguard \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=udp:51820 \
    --source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
    --target-tags=wireguard-gateway

Tag the WireGuard VM:

gcloud compute instances add-tags wireguard-gateway-vm \
    --tags=wireguard-gateway \
    --zone=us-central1-a

Static External IP

Reserve static IP for stable WireGuard endpoint:

gcloud compute addresses create wireguard-gateway-ip \
    --region=us-central1

gcloud compute instances delete-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --zone=us-central1-a

gcloud compute instances add-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --address=wireguard-gateway-ip \
    --zone=us-central1-a

Cost: A static external IP attached to a running VM costs roughly $3-4/month; a reserved address left unattached is billed at a higher rate.

Route Configuration

For traffic from boot server to reach home lab via WireGuard VM:

gcloud compute routes create route-to-homelab \
    --network=default \
    --priority=100 \
    --destination-range=192.168.1.0/24 \
    --next-hop-instance=wireguard-gateway-vm \
    --next-hop-instance-zone=us-central1-a

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: GCP WireGuard VM’s public key
    • Endpoint: GCP VM’s static external IP
    • Port: 51820
    • Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to GCP subnets
    • Home lab servers can reach GCP boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on Compute Engine performance:

  • e2-micro (2 vCPU, shared core): ~100-300 Mbps
  • e2-small (2 vCPU): ~500-800 Mbps
  • e2-medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even e2-micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: e2-micro adequate for home lab scale

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
  • GCP Network: Low-latency network to most regions
  • Total Latency: Primarily dependent on home ISP and GCP region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • e2-micro: Sufficient CPU for home lab VPN throughput

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secret Manager: Store WireGuard private keys in GCP Secret Manager
    • Retrieve at VM startup via startup script
    • Avoid storing in VM metadata or disk images

Firewall Hardening

  • Source IP Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server firewall allows only VPN subnet
  • No Public Access: Boot server has no external IP

Monitoring and Alerts

  • Cloud Logging: Log WireGuard connection events
  • Cloud Monitoring: Alert on VPN tunnel down
  • Metrics: Monitor handshake failures, data transfer

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification
  • Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)

High Availability Options

Multi-Region Failover

Deploy WireGuard gateways in multiple regions:

  • Primary: us-central1 WireGuard VM
  • Secondary: us-east1 WireGuard VM
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles VM costs (~$8-14/month for 2 VMs)

Health Checks

Monitor WireGuard tunnel health:

# On UDM Pro (via SSH)
wg show wg0 latest-handshakes

# If handshake timestamp old (>3 minutes), tunnel may be down

Automate failover with script on UDM Pro or external monitoring.
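
A minimal failover sketch that could run periodically on the UDM Pro (the peer key and secondary endpoint are placeholders, and it assumes shell access to the UDM Pro):

#!/bin/bash
# Switch to the secondary WireGuard endpoint if no handshake for more than 3 minutes
PEER_KEY="<SERVER_PUBLIC_KEY>"
SECONDARY_ENDPOINT="<SECONDARY_VM_IP>:51820"

LAST=$(wg show wg0 latest-handshakes | awk -v k="$PEER_KEY" '$1 == k {print $2}')
NOW=$(date +%s)

if [ -n "$LAST" ] && [ $((NOW - LAST)) -gt 180 ]; then
    wg set wg0 peer "$PEER_KEY" endpoint "$SECONDARY_ENDPOINT"
fi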

Startup Scripts for Auto-Healing

GCP VM startup script to ensure WireGuard starts on boot:

#!/bin/bash
# /etc/startup-script.sh

# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Attach as metadata:

gcloud compute instances add-metadata wireguard-gateway-vm \
    --metadata-from-file startup-script=/path/to/startup-script.sh \
    --zone=us-central1-a

Cost Analysis

Self-Managed WireGuard on Compute Engine

Component | Cost
e2-micro VM (730 hrs/month) | ~$6.50
Static External IP | ~$3.50
Egress (1GB/month boot traffic) | ~$0.12
Monthly Total | ~$10.12
Annual Total | ~$121

Cloud VPN (IPsec - if WireGuard not used)

Component | Cost
HA VPN Gateway (2 tunnels) | ~$73
Egress (1GB/month) | ~$0.12
Monthly Total | ~$73
Annual Total | ~$876

Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.

Comparison with Requirements

Requirement | GCP Support | Implementation
WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM
Site-to-Site VPN | ✅ Yes | WireGuard tunnel
UDM Pro Integration | ✅ Native support | WireGuard peer config
Cost Efficiency | ✅ Low cost | e2-micro ~$10/month
Performance | ✅ Sufficient | 100+ Mbps on e2-micro
Security | ✅ Modern crypto | ChaCha20, Curve25519
HA (optional) | ⚠️ Manual setup | Multi-region VMs

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM

    • Cost: ~$10/month (vs ~$73/month for Cloud VPN)
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single Region Deployment: Unless HA required, single VM adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • Zone: Single zone sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes GCP subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secret Manager
    • Restrict WireGuard port to home public IP only
    • Use startup script to configure VM on boot
    • Enable Cloud Logging for VPN events
  5. Monitoring: Set up Cloud Monitoring alerts for:

    • VM down
    • High CPU usage (indicates traffic spike or issue)
    • Firewall rule blocks (indicates misconfiguration)

Future Enhancements

  • HA Setup: Deploy secondary WireGuard VM in different region
  • Automated Failover: Script on UDM Pro to switch endpoints
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added

References

1.4 - HP ProLiant DL360 Gen9 Analysis

Technical analysis of HP ProLiant DL360 Gen9 server capabilities with focus on network boot support

This section contains detailed analysis of the HP ProLiant DL360 Gen9 server platform, including hardware specifications, network boot capabilities, and configuration guidance for home lab deployments.

Overview

The HP ProLiant DL360 Gen9 is a 1U rack-mountable server released by HPE as part of their Generation 9 (Gen9) product line, introduced in 2014. It’s a popular choice for home labs due to its balance of performance, density, and relative power efficiency compared to earlier generations.

Key Features

  • Form Factor: 1U rack-mountable
  • Processor Support: Dual Intel Xeon E5-2600 v3/v4 processors (Haswell/Broadwell)
  • Memory: Up to 768GB DDR4 RAM (24 DIMM slots)
  • Storage: Flexible SFF/LFF drive configurations
  • Network: Integrated quad-port 1GbE or 10GbE FlexibleLOM options
  • Management: iLO 4 (Integrated Lights-Out) with remote KVM and virtual media
  • Boot Options: UEFI and Legacy BIOS support with extensive network boot capabilities

Documentation Sections

1.4.1 - Configuration Guide

Setup, optimization, and configuration recommendations for HP ProLiant DL360 Gen9 in home lab environments

Initial Setup

Hardware Assembly

  1. Install Processors:

    • Use thermal paste (HPE thermal grease recommended)
    • Align CPU carefully with socket (LGA 2011-3)
    • Secure heatsink with proper torque (hand-tighten screws in cross pattern)
    • Install both CPUs for dual-socket configuration
  2. Install Memory:

    • Populate channels evenly (see Memory Configuration below)
    • Seat DIMMs firmly until retention clips engage
    • Verify all DIMMs recognized in POST
  3. Install Storage:

    • Insert drives into hot-swap caddies
    • Label drives clearly for identification
    • Configure RAID controller (see Storage Configuration below)
  4. Install Network Cards:

    • FlexibleLOM: Slide into dedicated slot until seated
    • PCIe cards: Ensure low-profile brackets, secure with screw
    • Note MAC addresses for DHCP reservations
  5. Connect Power:

    • Install PSUs (both for redundancy)
    • Connect power cords
    • Verify PSU LEDs indicate proper operation
  6. Initial Power-On:

    • Press power button
    • Monitor POST on screen or via iLO remote console
    • Address any POST errors before proceeding

iLO 4 Initial Configuration

Physical iLO Connection

  1. Connect Ethernet cable to dedicated iLO port (not FlexibleLOM)
  2. Default iLO IP: Obtains via DHCP, or use temporary address via RBSU
  3. Check DHCP server logs for iLO MAC and assigned IP

First Login

  1. Access iLO web interface: https://<ilo-ip>
  2. Default credentials:
    • Username: Administrator
    • Password: On label on server pull-out tab (or rear label)
  3. Immediately change default password (Administration > Access Settings)

Essential iLO Settings

Network Configuration (Administration > Network):

  • Set static IP or DHCP reservation
  • Configure DNS servers
  • Set hostname (e.g., ilo-dl360-01)
  • Enable SNTP time sync

Security (Administration > Security):

  • Enforce HTTPS only (disable HTTP)
  • Configure SSH key authentication if using CLI
  • Set strong password policy
  • Enable iLO Security features

Access (Administration > Access Settings):

  • Configure iLO username/password for automation
  • Create additional user accounts (separation of duties)
  • Set session timeout (default: 30 minutes)

Date and Time (Administration > Date and Time):

  • Set NTP servers for accurate timestamps
  • Configure timezone

Licenses (Administration > Licensing):

  • Install iLO Advanced license key (required for full virtual media)
  • License can be purchased or acquired from secondary market

iLO Firmware Update

Before production use, update iLO to latest version:

  1. Download latest iLO 4 firmware from HPE Support Portal
  2. Administration > Firmware > Update Firmware
  3. Upload .bin file, apply update
  4. iLO will reboot automatically (system stays running)

System ROM (BIOS/UEFI) Configuration

Accessing RBSU

  • Local: Press F9 during POST
  • Remote: iLO Remote Console > Power > Momentary Press > Press F9 when prompted

Boot Mode Selection

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Mode:

  • UEFI Mode (recommended for modern OS):

    • Supports GPT partitions (>2TB disks)
    • Required for Secure Boot
    • Better UEFI HTTP boot support
    • IPv6 PXE boot support
  • Legacy BIOS Mode:

    • For older OS or compatibility
    • MBR partition tables only
    • Traditional PXE boot

Recommendation: Use UEFI Mode unless legacy compatibility required

Boot Order Configuration

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > UEFI Boot Order:

Recommended order for network boot deployment:

  1. Network Boot: FlexibleLOM or PCIe NIC
  2. Internal Storage: RAID controller or disk
  3. Virtual Media: iLO virtual CD/DVD (for installation media)
  4. USB: For rescue/recovery

Enable Network Boot:

  • System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > Network Boot
  • Set to “Enabled”

Performance and Power Settings

System Configuration > BIOS/Platform Configuration (RBSU) > Power Management:

  • Power Regulator Mode:

    • HP Dynamic Power Savings: Balanced power/performance (recommended for home lab)
    • HP Static High Performance: Maximum performance, higher power draw
    • HP Static Low Power: Minimize power, reduced performance
    • OS Control: Let OS manage (e.g., Linux cpufreq)
  • Collaborative Power Control: Disabled (for standalone servers)

  • Minimum Processor Idle Power Core C-State: C6 (lower idle power)

  • Energy/Performance Bias: Balanced Performance (or Maximum Performance for compute workloads)

Recommendation: Start with “Dynamic Power Savings” and adjust based on workload

Memory Configuration

Optimal Population (dual-CPU configuration):

For maximum performance, populate all channels before adding second DIMM per channel:

64GB (8x 8GB):

  • CPU1: Slots 1, 4, 7, 10 and CPU2: Slots 1, 4, 7, 10
  • Result: 4 channels per CPU, 1 DIMM per channel

128GB (8x 16GB):

  • Same as above with 16GB DIMMs

192GB (12x 16GB):

  • CPU1: Slots 1, 4, 7, 10, 2, 5 and CPU2: Slots 1, 4, 7, 10, 2, 5
  • Result: 4 channels per CPU, some with 2 DIMMs per channel

768GB (24x 32GB):

  • All slots populated

Check Configuration: RBSU > System Information > Memory Information

Processor Options

System Configuration > BIOS/Platform Configuration (RBSU) > Processor Options:

  • Intel Hyperthreading: Enabled (recommended for most workloads)

    • Doubles logical cores (e.g., 12-core CPU shows as 24 cores)
    • Benefits most virtualization and multi-threaded workloads
    • Disable only for specific security compliance (e.g., some cloud providers)
  • Intel Virtualization Technology (VT-x): Enabled (required for hypervisors)

  • Intel VT-d (IOMMU): Enabled (required for PCI passthrough, SR-IOV)

  • Turbo Boost: Enabled (allows CPU to exceed base clock)

  • Cores Enabled: All (or reduce to lower power/heat if needed)

Integrated Devices

System Configuration > BIOS/Platform Configuration (RBSU) > System Options > Integrated Devices:

  • Embedded SATA Controller: Enabled (if using SATA drives)
  • Embedded RAID Controller: Enabled (for Smart Array controllers)
  • SR-IOV: Enabled (if using virtual network interfaces with VMs)

Network Controller Options

For each NIC (FlexibleLOM, PCIe):

System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > [Adapter]:

  • Network Boot: Enabled (for network boot on that NIC)
  • PXE/iSCSI: Select PXE for standard network boot
  • Link Speed: Auto-Negotiation (recommended) or force 1G/10G
  • IPv4: Enabled (for IPv4 PXE boot)
  • IPv6: Enabled (if using IPv6 PXE boot)

Boot Order: Configure which NIC boots first if multiple are enabled

Secure Boot Configuration

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > Secure Boot:

  • Secure Boot: Disabled (for unsigned boot loaders, custom kernels)
  • Secure Boot: Enabled (for signed boot loaders, Windows, some Linux distros)

Note: If using PXE with unsigned images (e.g., custom iPXE), Secure Boot must be disabled

Firmware Updates

Update System ROM to latest version:

  1. Via iLO:

    • iLO web > Administration > Firmware > Update Firmware
    • Upload System ROM .fwpkg or .bin file
    • Server reboots automatically to apply
  2. Via Service Pack for ProLiant (SPP):

    • Download SPP ISO from HPE Support Portal
    • Mount via iLO Virtual Media
    • Boot server from SPP ISO
    • Smart Update Manager (SUM) runs in Linux environment
    • Select components to update (System ROM, iLO, controller firmware, NIC firmware)
    • Apply updates, reboot

Recommendation: Use SPP for comprehensive updates on initial setup, then iLO for individual component updates

Storage Configuration

Smart Array Controller Setup

Access Smart Array Configuration

  • During POST: Press F5 when “Smart Array Configuration Utility” message appears
  • Via RBSU: System Configuration > BIOS/Platform Configuration (RBSU) > System Options > ROM-Based Setup Utility > Smart Array Configuration

Create RAID Arrays

  1. Delete Existing Arrays (if reconfiguring):

    • Select controller > Configuration > Delete Array
    • Confirm deletion (data loss warning)
  2. Create New Array:

    • Select controller > Configuration > Create Array
    • Select physical drives to include
    • Choose RAID level:
      • RAID 0: Striping, no redundancy (maximum performance, maximum capacity)
      • RAID 1: Mirroring (redundancy, half capacity, good for boot drives)
      • RAID 5: Striping + parity (redundancy, n-1 capacity, balanced)
      • RAID 6: Striping + double parity (dual-drive failure tolerance, n-2 capacity)
      • RAID 10: Mirror + stripe (high performance + redundancy, half capacity)
    • Configure spare drives (hot spares for automatic rebuild)
    • Create logical drive
    • Set bootable flag if boot drive
  3. Recommended Configurations:

    • Boot/OS: 2x SSD in RAID 1 (redundancy, fast boot)
    • Data (performance): 4-6x SSD in RAID 10 (fast, redundant)
    • Data (capacity): 4-8x HDD in RAID 6 (capacity, dual-drive tolerance)

Controller Settings

  • Cache Settings:

    • Write Cache: Enabled (requires battery/flash-backed cache)
    • Read Cache: Enabled
    • No-Battery Write Cache: Disabled (data safety) or Enabled (performance, risk)
  • Rebuild Priority: Medium or High (faster rebuild, may impact performance)

  • Surface Scan Delay: 3-7 days (periodic integrity check)

HBA Mode (Non-RAID)

For software RAID (ZFS, mdadm, Ceph):

  1. Access Smart Array Configuration (F5 during POST)
  2. Controller > Configuration > Enable HBA Mode
  3. Confirm (RAID arrays will be deleted)
  4. Reboot

Note: Not all Smart Array controllers support HBA mode. Check compatibility. Alternative: Use separate LSI HBA in PCIe slot.

Network Configuration for Boot

DHCP Server Setup

For PXE/UEFI network boot, configure DHCP server with appropriate options:

ISC DHCP Example (/etc/dhcp/dhcpd.conf):

# Client system architecture (RFC 4578, option 93), used below to tell UEFI from BIOS clients
option arch code 93 = unsigned integer 16;

# Define subnet
subnet 192.168.10.0 netmask 255.255.255.0 {
    range 192.168.10.100 192.168.10.200;
    option routers 192.168.10.1;
    option domain-name-servers 192.168.10.1;
    
    # PXE boot options
    next-server 192.168.10.5;  # TFTP server IP
    
    # Differentiate UEFI vs BIOS
    if exists user-class and option user-class = "iPXE" {
        # iPXE boot script
        filename "http://boot.example.com/boot.ipxe";
    } elsif option arch = 00:07 or option arch = 00:09 {
        # UEFI (x86-64)
        filename "bootx64.efi";
    } else {
        # Legacy BIOS
        filename "undionly.kpxe";
    }
}

# Static reservation for DL360
host dl360-01 {
    hardware ethernet xx:xx:xx:xx:xx:xx;  # FlexibleLOM MAC
    fixed-address 192.168.10.50;
    option host-name "dl360-01";
}
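After editing, the configuration can be syntax-checked before restarting the daemon; the service name below assumes a Debian/Ubuntu-style isc-dhcp-server installation.

dhcpd -t -cf /etc/dhcp/dhcpd.conf      # parse the config without starting the daemon
systemctl restart isc-dhcp-server      # service name on Debian/Ubuntu
journalctl -u isc-dhcp-server -f       # watch for DISCOVER/OFFER exchanges from the DL360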

FlexibleLOM Configuration

Configure FlexibleLOM NIC for network boot:

  1. RBSU > Network Options > FlexibleLOM
  2. Enable “Network Boot”
  3. Select PXE or iSCSI
  4. Configure IPv4/IPv6 as needed
  5. Set as first boot device in boot order

Multi-NIC Boot Priority

If multiple NICs have network boot enabled:

  1. RBSU > Network Options > Network Boot Order
  2. Drag/drop to prioritize NIC boot order
  3. First NIC in list attempts boot first

Recommendation: Enable network boot on one NIC (typically FlexibleLOM port 1) to avoid confusion

Operating System Installation

Traditional Installation (Virtual Media)

  1. Download OS ISO (e.g., Ubuntu Server, ESXi, Proxmox)
  2. Upload ISO to HTTP/HTTPS server or local file
  3. iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
  4. Browse to ISO location, click “Insert Media”
  5. Set boot order to prioritize virtual media
  6. Reboot server, boot from virtual CD/DVD
  7. Proceed with OS installation

Network Installation (PXE)

See Network Boot Capabilities for detailed PXE/UEFI boot setup

Quick workflow:

  1. Configure DHCP server with PXE options
  2. Setup TFTP server with boot files
  3. Enable network boot in BIOS
  4. Reboot, server PXE boots
  5. Select OS installer from PXE menu
  6. Automated installation proceeds (Kickstart/Preseed/Ignition)

Optimization for Specific Workloads

Virtualization (ESXi, Proxmox, Hyper-V)

BIOS Settings:

  • Hyperthreading: Enabled
  • VT-x: Enabled
  • VT-d: Enabled
  • Power Management: Dynamic or OS Control
  • Turbo Boost: Enabled

Hardware:

  • Maximum memory (384GB+ recommended)
  • Fast storage (SSD RAID 10 for VM storage)
  • 10GbE networking for VM traffic

Configuration:

  • Pass through NICs to VMs (SR-IOV or PCI passthrough)
  • Use storage controller in HBA mode for direct disk access to VM storage (ZFS, Ceph)

Kubernetes/Container Platforms

BIOS Settings:

  • Hyperthreading: Enabled
  • VT-x/VT-d: Enabled (for nested virtualization, kata containers)
  • Power Management: Dynamic or High Performance

Hardware:

  • 128GB+ RAM for multi-tenant workloads
  • Fast local NVMe/SSD for container image cache and ephemeral storage
  • 10GbE for pod networking

OS Recommendations:

  • Talos Linux: Network-bootable, immutable k8s OS
  • Flatcar Container Linux: Auto-updating, minimal OS
  • Ubuntu Server: Broad compatibility, snap/docker native

Storage Server (NAS, SAN)

BIOS Settings:

  • Disable Hyperthreading (slight performance improvement for ZFS)
  • VT-d: Enabled (if passing through HBA to VM)
  • Power Management: High Performance

Hardware:

  • Maximum drive bays (8-10 SFF)
  • HBA mode or separate LSI HBA controller
  • 10GbE or bonded 1GbE for network storage traffic
  • ECC memory (critical for ZFS)

Software:

  • TrueNAS SCALE (Linux-based, k8s apps)
  • OpenMediaVault (Debian-based, plugins)
  • Ubuntu + ZFS (custom setup)

Compute/HPC Workloads

BIOS Settings:

  • Hyperthreading: Depends on workload (test both)
  • Turbo Boost: Enabled
  • Power Management: Maximum Performance
  • C-States: Disabled (reduce latency)

Hardware:

  • High core count CPUs (E5-2680 v4, 2690 v4)
  • Maximum memory bandwidth (populate all channels)
  • Fast local scratch storage (NVMe)

Monitoring and Maintenance

iLO Health Monitoring

Information > System Information:

  • CPU temperature and status
  • Memory status
  • Drive status (via controller)
  • Fan speeds
  • PSU status
  • Overall system health LED status

Alerting (Administration > Alerting):

  • Configure email alerts for:
    • Fan failures
    • Temperature warnings
    • Drive failures
    • Memory errors
    • PSU failures
  • Set up SNMP traps for integration with monitoring systems (Nagios, Zabbix, Prometheus)

Integrated Management Log (IML)

Information > Integrated Management Log:

  • View hardware events and errors
  • Filter by severity (Informational, Caution, Critical)
  • Export log for troubleshooting

Regular Checks:

  • Review IML weekly for early warning signs
  • Address caution-level events before they become critical

Firmware Update Cadence

Recommendation:

  • iLO: Update quarterly or when security advisories released
  • System ROM: Update annually or for bug fixes
  • Storage Controller: Update when issues arise or annually
  • NIC Firmware: Update when issues arise

Method: Use SPP for annual comprehensive updates, iLO web interface for individual component updates

Physical Maintenance

Monthly:

  • Check fan noise (increased noise may indicate clogged air filters or failing fan)
  • Verify PSU and drive LEDs (no amber lights)
  • Check iLO for alerts

Quarterly:

  • Clean air filters (if accessible, depends on rack airflow)
  • Verify backup of iLO configuration
  • Test iLO Virtual Media functionality

Annually:

  • Update all firmware via SPP
  • Verify RAID battery/flash-backed cache status
  • Review and update BIOS settings as workload evolves

Troubleshooting Common Issues

Server Won’t Power On

  1. Check PSU power cords connected
  2. Verify PSU LEDs indicate power
  3. Press iLO power button via web interface
  4. Check iLO IML for power-related errors
  5. Reseat PSUs, check for blown fuses

POST Errors

Memory Errors:

  • Reseat memory DIMMs
  • Test with minimal configuration (1 DIMM per CPU)
  • Replace failing DIMMs identified in POST

CPU Errors:

  • Verify heatsink properly seated
  • Check thermal paste application
  • Reseat CPU (careful with pins)

Drive Errors:

  • Check drive connection to caddy
  • Verify controller recognizes drive
  • Replace failing drive

No Network Boot

See Network Boot Troubleshooting for detailed diagnostics

Quick checks:

  1. Verify NIC link light
  2. Confirm network boot enabled in BIOS
  3. Check DHCP server logs for PXE request
  4. Test TFTP server accessibility
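Checks 3 and 4 can be run from an admin host on the boot VLAN; the interface name, server IP, and filename below are illustrative.

# Watch for the server's DHCPDISCOVER (PXE requests include the option 93 architecture)
sudo tcpdump -ni eth0 'port 67 or port 68'

# Confirm the TFTP server serves the boot file (tftp-hpa client syntax)
tftp 192.168.10.5 -c get undionly.kpxe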

iLO Not Accessible

  1. Check physical Ethernet connection to iLO port
  2. Verify switch port active
  3. Reset iLO: Press and hold iLO NMI button (rear) for 5 seconds
  4. Factory reset iLO via jumper (see maintenance guide)
  5. Check iLO firmware version, update if outdated

High Fan Noise

  1. Check ambient temperature (<25°C recommended)
  2. Verify airflow not blocked (front/rear clearance)
  3. Clean dust from intake (compressed air)
  4. Check iLO temperature sensors for elevated temps
  5. Lower CPU TDP if temperatures excessive (lower power CPUs)
  6. Verify all fans operational (replace failed fans)

Security Hardening

iLO Security

  1. Change Default Credentials: Immediately on first boot
  2. Disable Unused Services: SSH, IPMI if not needed
  3. Use HTTPS Only: Disable HTTP (Administration > Network > HTTP Port)
  4. Network Isolation: Dedicated management VLAN, firewall iLO access
  5. Update Firmware: Apply security patches promptly
  6. Account Management: Use separate accounts, least privilege

BIOS/UEFI Security

  1. BIOS Password: Set administrator password (RBSU > System Options > BIOS Admin Password)
  2. Secure Boot: Enable if using signed boot loaders
  3. Boot Order Lock: Prevent unauthorized boot device changes
  4. TPM: Enable if using BitLocker or LUKS disk encryption

Operating System Security

  1. Minimal Installation: Install only required packages
  2. Firewall: Enable host firewall (iptables, firewalld, ufw)
  3. SSH Hardening: Key-based auth, disable password auth, non-standard port
  4. Automatic Updates: Enable for security patches
  5. Monitoring: Deploy intrusion detection (fail2ban, OSSEC)

Conclusion

Proper configuration of the HP ProLiant DL360 Gen9 ensures optimal performance, reliability, and manageability for home lab and production deployments. The combination of UEFI boot capabilities, iLO remote management, and flexible hardware configuration makes the DL360 Gen9 a versatile platform for virtualization, containerization, storage, and compute workloads.

Key takeaways:

  • Update firmware early (iLO, System ROM, controllers)
  • Configure iLO for remote management and monitoring
  • Choose boot mode (UEFI recommended) and configure network boot appropriately
  • Optimize BIOS settings for specific workload (virtualization, storage, compute)
  • Implement security hardening (iLO, BIOS, OS)
  • Establish monitoring and maintenance schedule

For network boot-specific configuration, refer to the Network Boot Capabilities guide.

1.4.2 - Hardware Specifications

Detailed hardware specifications and configuration options for HP ProLiant DL360 Gen9

System Overview

The HP ProLiant DL360 Gen9 is a dual-socket 1U rack server designed for data center and enterprise deployments, also popular in home lab environments due to its performance and manageability.

Generation: Gen9 (2014-2017 product cycle)
Form Factor: 1U rack-mountable (19-inch standard rack)
Dimensions: 43.46 x 67.31 x 4.29 cm (17.1 x 26.5 x 1.69 in)

Processor Support

Supported CPU Families

The DL360 Gen9 supports Intel Xeon E5-2600 v3 and v4 series processors:

  • E5-2600 v3 (Haswell-EP): Released Q3 2014

    • Process: 22nm
    • Cores: 4-18 per socket
    • TDP: 55W-145W
    • Max Memory Speed: DDR4-2133
  • E5-2600 v4 (Broadwell-EP): Released Q1 2016

    • Process: 14nm
    • Cores: 4-22 per socket
    • TDP: 55W-145W
    • Max Memory Speed: DDR4-2400

Common home lab CPU choices:

  • Value: E5-2620 v3/v4 (6-8 cores, 15-20MB cache, 85W)
  • Balanced: E5-2650 v3/v4 (10-12 cores, 25-30MB cache, 105W)
  • Performance: E5-2680 v3/v4 (12-14 cores, 30-35MB cache, 120W)
  • High Core Count: E5-2699 v4 (22 cores, 55MB cache, 145W)

Configuration Options

  • Single Processor: One CPU socket populated (budget option)
  • Dual Processor: Both sockets populated (full performance)

Note: Memory and I/O performance scales with processor count. Single-CPU configuration limits memory channels and PCIe lanes.

Memory Architecture

Memory Specifications

  • Type: DDR4 RDIMM or LRDIMM
  • Speed: DDR4-2133 (v3) or DDR4-2400 (v4)
  • Slots: 24 DIMM slots (12 per processor)
  • Maximum Capacity:
    • 768GB with 32GB RDIMMs
    • 1.5TB with 64GB LRDIMMs (v4 processors)
  • Minimum: 8GB (1x 8GB DIMM)

Memory Configuration Rules

  • Channels per CPU: 4 channels, 3 DIMMs per channel
  • Population: Populate channels evenly for optimal bandwidth
  • Mixing: Do not mix RDIMM and LRDIMM types
  • Speed: All DIMMs run at speed of slowest DIMM

Example configurations:

Basic Home Lab (Single CPU):

  • 4x 16GB = 64GB (one DIMM in each of the four channels)

Standard (Dual CPU):

  • 8x 16GB = 128GB (one DIMM per channel)
  • 12x 16GB = 192GB (two DIMMs per channel on primary channels)

High Capacity (Dual CPU):

  • 24x 32GB = 768GB (all slots populated, RDIMM)

Performance Priority: Populate all channels before adding second DIMM per channel

Storage Options

Drive Bay Configurations

The DL360 Gen9 offers multiple drive bay configurations:

  1. 8 SFF (2.5-inch): Most common configuration
  2. 10 SFF: Extended bay version
  3. 4 LFF (3.5-inch): Less common in 1U form factor

Drive Types Supported

  • SAS: 12Gb/s, 6Gb/s (enterprise-grade)
  • SATA: 6Gb/s, 3Gb/s (value option)
  • SSD: SAS/SATA SSD, NVMe (with appropriate controller)

Storage Controllers

Smart Array Controllers (HPE proprietary RAID):

  • P440ar: Entry-level, 2GB FBWC (Flash-Backed Write Cache), RAID 0/1/5/6/10
  • P840ar: High-performance, 4GB FBWC, RAID 0/1/5/6/10/50/60
  • P440: PCIe card version, 2GB FBWC
  • P840: PCIe card version, 4GB FBWC

HBA Mode (non-RAID pass-through):

  • Smart Array controllers in HBA mode for software RAID (ZFS, mdadm)
  • Limited support; check firmware version

Alternative Controllers:

  • LSI/Broadcom HBA controllers in PCIe slots
  • H240ar (12Gb/s HBA mode)

Boot Drive Options

For network-focused deployments:

  • Minimal Local Storage: 2x SSD in RAID 1 for hypervisor/OS
  • USB/SD Boot: iLO supports USB boot, SD card (internal USB)
  • Diskless: Pure network boot (subject of network-boot.md)

Network Connectivity

Integrated FlexibleLOM

The DL360 Gen9 includes a FlexibleLOM slot for swappable network adapters:

Common FlexibleLOM Options:

  • HPE 366FLR: 4x 1GbE (Broadcom BCM5719)

    • Most common, good for general use
    • Supports PXE, UEFI network boot, SR-IOV
  • HPE 560FLR-SFP+: 2x 10GbE SFP+ (Intel X710)

    • High performance, fiber or DAC
    • Supports PXE, UEFI boot, SR-IOV, RDMA (RoCE)
  • HPE 361i: 2x 1GbE (Intel I350)

    • Entry-level, good driver support

PCIe Expansion Slots

Slot Configuration:

  • Slot 1: PCIe 3.0 x16 (low-profile)
  • Slot 2: PCIe 3.0 x8 (low-profile)
  • Slot 3: PCIe 3.0 x8 (low-profile) - optional, depends on riser

Network Card Options:

  • Intel X520/X710 (10GbE)
  • Mellanox ConnectX-3/ConnectX-4 (10/25/40GbE, InfiniBand)
  • Broadcom NetXtreme (1/10/25GbE)

Note: Ensure cards are low-profile for 1U chassis compatibility

Power Supply

PSU Options

  • 500W: Single PSU, non-redundant (not recommended)
  • 800W: Common, supports dual CPU + moderate expansion
  • 1400W: High-power, dual CPU with high TDP + GPUs
  • Redundancy: 1+1 redundant hot-plug recommended

Power Configuration

  • Platinum Efficiency: 94%+ at 50% load
  • Hot-Plug: Replace without powering down
  • Auto-Switching: 100-240V AC, 50/60Hz

Home Lab Power Draw (typical):

  • Idle (dual E5-2650 v3, 128GB RAM): 100-130W
  • Load: 200-350W depending on CPU and drive configuration

Power Management

  • HPE Dynamic Power Capping: Limit max power via iLO
  • Collaborative Power: Share power budget across chassis in blade environments
  • Energy Efficient Ethernet (EEE): Reduce NIC power during low utilization

Cooling and Acoustics

Fan Configuration

  • 6x Hot-Plug Fans: Front-mounted, redundant (N+1)
  • Variable Speed: Controlled by System ROM based on thermal sensors
  • iLO Management: Monitor fan speed, temperature via iLO

Thermal Management

  • Temperature Range: 10-35°C (50-95°F) operating
  • Altitude: Up to 3,050m (10,000 ft) at reduced temperature
  • Airflow: Front-to-back, ensure clear intake and exhaust

Noise Level

  • Idle: ~45 dBA (quiet for 1U server)
  • Load: 55-70 dBA depending on thermal demand
  • Home Lab Consideration: Audible but acceptable in dedicated space; louder than desktop workstation

Noise Reduction:

  • Run lower TDP CPUs (e.g., E5-2620 series)
  • Maintain ambient temperature <25°C
  • Ensure adequate airflow (not in enclosed cabinet without ventilation)

Management - iLO 4

iLO 4 Features

The Integrated Lights-Out 4 (iLO 4) provides out-of-band management:

  • Web Interface: HTTPS management console
  • Remote Console: HTML5 or Java-based KVM
  • Virtual Media: Mount ISOs/images remotely
  • Power Control: Power on/off, reset, cold boot
  • Monitoring: Sensors, event logs, hardware health
  • Alerting: Email alerts, SNMP traps, syslog
  • Scripting: RESTful API (Redfish standard)

iLO Licensing

  • iLO Standard (included): Basic management, remote console
  • iLO Advanced (license required):
    • Virtual media
    • Remote console performance improvements
    • Directory integration (LDAP/AD)
    • Graphical remote console
  • iLO Advanced Premium (license required):
    • Insight Remote Support
    • Federation
    • Jitter smoothing

Home Lab: iLO Advanced license highly recommended for virtual media and full remote console features

iLO Network Configuration

  • Dedicated iLO Port: Separate 1GbE management port (recommended)
  • Shared LOM: Share FlexibleLOM port with OS (not recommended for isolation)

Security: Isolate iLO on dedicated management VLAN, disable if not needed

BIOS and Firmware

System ROM (BIOS/UEFI)

  • Firmware Type: UEFI 2.31 or later
  • Boot Modes: UEFI, Legacy BIOS, or hybrid
  • Configuration: RBSU (ROM-Based Setup Utility) accessible via F9

Firmware Update Methods

  1. Service Pack for ProLiant (SPP): Comprehensive bundle of all firmware
  2. iLO Online Flash: Update via web interface
  3. Online ROM Flash: Linux utility for online updates
  4. USB Flash: Boot from USB with firmware update utility

Recommended Practice: Update to latest SPP for security patches and feature improvements

Secure Boot

  • UEFI Secure Boot: Supported, validates boot loader signatures
  • TPM: Optional Trusted Platform Module 1.2 or 2.0
  • Boot Order Protection: Prevent unauthorized boot device changes

Expansion and Modularity

GPU Support

Limited GPU support due to 1U form factor and power constraints:

  • Low-Profile GPUs: NVIDIA T4 or similar low-profile, passively cooled cards (adequate chassis airflow required)
  • Power: Consider 1400W PSU for high-power GPUs
  • Not Ideal: For GPU-heavy workloads, consider 2U+ servers (e.g., DL380 Gen9)

USB Ports

  • Front: 1x USB 3.0
  • Rear: 2x USB 3.0
  • Internal: 1x USB 2.0 (for SD/USB boot device)

Serial Port

  • Rear serial port for legacy console access
  • Useful for network equipment serial console, debug

Home Lab Considerations

Pros for Home Lab

  1. Density: 1U form factor saves rack space
  2. iLO Management: Enterprise remote management without KVM
  3. Network Boot: Excellent PXE/UEFI boot support (see network-boot.md)
  4. Serviceability: Hot-swap drives, PSU, fans
  5. Documentation: Extensive HPE documentation and community support
  6. Parts Availability: Common on secondary market, affordable

Cons for Home Lab

  1. Noise: Louder than tower servers or workstations
  2. Power: Higher idle power than consumer hardware (100-130W idle)
  3. 1U Limitations: Limited GPU, PCIe expansion vs 2U/4U chassis
  4. Firmware: Requires HPE account for SPP downloads (free but registration required)

Budget (~$500-800 used):

  • Dual E5-2620 v3 or v4 (6 cores each, 85W TDP)
  • 128GB RAM (8x 16GB DDR4)
  • 2x SSD (boot), 4-6x HDD/SSD (data)
  • HPE 366FLR (4x 1GbE)
  • Dual 500W or 800W PSU (redundant)
  • iLO Advanced license

Performance (~$1000-1500 used):

  • Dual E5-2680 v4 (14 cores each, 120W TDP)
  • 256GB RAM (16x 16GB DDR4)
  • 2x NVMe SSD (boot/cache), 6-8x SSD (data)
  • HPE 560FLR-SFP+ (2x 10GbE) + PCIe 4x1GbE card
  • Dual 800W PSU
  • iLO Advanced license

Comparison with Other Generations

vs Gen8 (Previous)

Gen9 Advantages:

  • DDR4 vs DDR3 (lower power, higher capacity)
  • Better UEFI support and HTTP boot
  • Newer processor architecture (Haswell/Broadwell vs Sandy Bridge/Ivy Bridge)
  • iLO 4 vs iLO 3 (better HTML5 console)

Gen8 Advantages:

  • Lower cost on secondary market
  • Adequate for light workloads

vs Gen10 (Next)

Gen10 Advantages:

  • Newer CPUs (Skylake-SP/Cascade Lake)
  • More PCIe lanes
  • Better UEFI firmware and security features
  • DDR4-2666/2933 support

Gen9 Advantages:

  • Lower cost (mature product cycle)
  • Excellent value for performance/dollar
  • Still well-supported by modern OS and firmware

Technical Resources

  • QuickSpecs: HPE ProLiant DL360 Gen9 Server QuickSpecs
  • User Guide: HPE ProLiant DL360 Gen9 Server User Guide
  • Maintenance and Service Guide: Detailed disassembly and part replacement
  • Firmware Downloads: HPE Support Portal (requires free account)

Summary

The HP ProLiant DL360 Gen9 remains an excellent choice for home labs and small deployments in 2024-2025. Its balance of performance (dual Xeon v4, 768GB RAM capacity), manageability (iLO 4), and network boot capabilities make it particularly well-suited for virtualization, container hosting, and infrastructure automation workflows. While not the latest generation, it offers strong value with robust firmware support and wide secondary market availability.

Best For:

  • Virtualization hosts (ESXi, Proxmox, Hyper-V)
  • Kubernetes/container platforms
  • Network boot/diskless deployments
  • Storage servers (with appropriate controller)
  • General compute workloads

Avoid For:

  • GPU-intensive workloads (1U constraints)
  • Noise-sensitive environments (unless isolated)
  • Extreme low-power requirements (100W+ idle)

1.4.3 - Network Boot Capabilities

Comprehensive analysis of network boot support on HP ProLiant DL360 Gen9

Overview

The HP ProLiant DL360 Gen9 provides robust network boot capabilities through multiple protocols and firmware interfaces. This makes it particularly well-suited for diskless deployments, automated provisioning, and infrastructure-as-code workflows.

Supported Network Boot Protocols

PXE (Preboot Execution Environment)

The DL360 Gen9 fully supports PXE boot via both legacy BIOS and UEFI firmware modes:

  • Legacy BIOS PXE: Traditional PXE implementation using TFTP

    • Protocol: PXEv2 (PXE 2.1)
    • Network Stack: IPv4 only in legacy mode
    • Boot files: pxelinux.0, undionly.kpxe, or custom NBP
    • DHCP options: Standard options 66 (TFTP server) and 67 (boot filename)
  • UEFI PXE: Modern UEFI network boot implementation

    • Protocol: PXEv2 with UEFI extensions
    • Network Stack: IPv4 and IPv6 support
    • Boot files: bootx64.efi, grubx64.efi, shimx64.efi
    • Architecture: x64 (EFI BC)
    • DHCP Architecture ID: 0x0007 (EFI BC) or 0x0009 (EFI x86-64)

iPXE Support

The DL360 Gen9 can boot iPXE, enabling advanced features:

  • Chainloading: Boot standard PXE, then chainload iPXE for enhanced capabilities
  • HTTP/HTTPS Boot: Download kernels and images over HTTP(S) instead of TFTP
  • SAN Boot: iSCSI and AoE (ATA over Ethernet) support
  • Scripting: Conditional boot logic and dynamic configuration
  • Embedded Scripts: iPXE can be compiled with embedded boot scripts

Implementation Methods:

  1. Chainload from standard PXE: DHCP points to undionly.kpxe or ipxe.efi
  2. Flash iPXE to FlexibleLOM option ROM (advanced, requires care)
  3. Boot iPXE from USB, then continue network boot
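As a rough sketch of method 1, the chainloaded URL usually points at a small iPXE script; the example below writes one into a web root, with placeholder URLs and kernel arguments.

# Write a minimal iPXE boot script into the HTTP server's web root (paths/URLs are placeholders)
cat > /var/www/html/boot.ipxe <<'EOF'
#!ipxe
kernel http://boot.example.com/assets/vmlinuz initrd=initrd.img console=tty0
initrd http://boot.example.com/assets/initrd.img
boot
EOF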

UEFI HTTP Boot

Native UEFI HTTP boot is supported on Gen9 servers with recent firmware:

  • Protocol: RFC 7230 HTTP/1.1
  • Requirements:
    • UEFI firmware version 2.40 or later (check via iLO)
    • DHCP option 60 (vendor class identifier) = “HTTPClient”
    • DHCP option 67 pointing to HTTP(S) URL
  • Advantages:
    • No TFTP server required
    • Faster transfers than TFTP
    • Support for HTTPS with certificate validation
    • Better suited for large images (kernels, initramfs)
  • Limitations:
    • UEFI mode only (not available in legacy BIOS)
    • Requires DHCP server with HTTP URL support

HTTP(S) Boot Configuration

For UEFI HTTP boot on DL360 Gen9:

# Example ISC DHCP configuration for UEFI HTTP boot
class "httpclients" {
    match if substring(option vendor-class-identifier, 0, 10) = "HTTPClient";
}

pool {
    allow members of "httpclients";
    option vendor-class-identifier "HTTPClient";
    # Point to HTTP boot URI
    filename "http://boot.example.com/boot/efi/bootx64.efi";
}

Network Interface Options

The DL360 Gen9 supports multiple network adapter configurations for boot:

FlexibleLOM (LOM = LAN on Motherboard)

HPE FlexibleLOM slot supports:

  • HPE 366FLR: Quad-port 1GbE (Broadcom BCM5719)
  • HPE 560FLR-SFP+: Dual-port 10GbE (Intel X710)
  • HPE 361i: Dual-port 1GbE (Intel I350)

All FlexibleLOM adapters support PXE and UEFI network boot. The option ROM can be configured via BIOS/UEFI settings.

PCIe Network Adapters

Standard PCIe network cards with PXE/UEFI boot ROM support:

  • Intel X520, X710 series (10GbE)
  • Broadcom NetXtreme series
  • Mellanox ConnectX-3/4 (with appropriate firmware)

Boot Priority: Configure via System ROM > Network Boot Options to select which NIC boots first.

Firmware Configuration

Accessing Boot Configuration

  1. RBSU (ROM-Based Setup Utility): Press F9 during POST
  2. iLO 4 Remote Console: Access via network, then virtual F9
  3. UEFI System Utilities: Modern interface for UEFI firmware settings

Key Settings

Navigate to: System Configuration > BIOS/Platform Configuration (RBSU) > Network Boot Options

  • Network Boot: Enable/Disable
  • Boot Mode: UEFI or Legacy BIOS
  • IPv4/IPv6: Enable protocol support
  • Boot Retry: Number of attempts before falling back to next boot device
  • Boot Order: Prioritize network boot in boot sequence

Per-NIC Configuration

In RBSU > Network Options:

  • Option ROM: Enable/Disable per adapter
  • Link Speed: Force speed/duplex or auto-negotiate
  • VLAN: VLAN tagging for boot (if supported by DHCP/PXE environment)
  • PXE Menu: Enable interactive PXE menu (Ctrl+S during PXE boot)

iLO 4 Integration

The DL360 Gen9’s iLO 4 provides additional network boot features:

Virtual Media Network Boot

  • Mount ISO images remotely via iLO Virtual Media
  • Boot from network-attached ISO without physical media
  • Useful for OS installation or diagnostics

Workflow:

  1. Upload ISO to HTTP/HTTPS server or use SMB/NFS share
  2. iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
  3. Set boot order to prioritize virtual optical drive
  4. Reboot server

Scripted Deployment via iLO

iLO 4 RESTful API allows:

  • Setting one-time boot to network via API call
  • Automating PXE boot for provisioning pipelines
  • Integration with tools like Terraform, Ansible

Example using iLO RESTful API:

curl -k -u admin:password -X PATCH \
  https://ilo-hostname/redfish/v1/Systems/1/ \
  -d '{"Boot":{"BootSourceOverrideTarget":"Pxe","BootSourceOverrideEnabled":"Once"}}'

Boot Process Flow

Legacy BIOS PXE Boot

  1. Server powers on, initializes NICs
  2. NIC sends DHCPDISCOVER with PXE vendor options
  3. DHCP server responds with IP, TFTP server (option 66), boot file (option 67)
  4. NIC downloads NBP (Network Bootstrap Program) via TFTP
  5. NBP executes (e.g., pxelinux.0 loads syslinux menu)
  6. User selects boot target or automated script continues
  7. Kernel and initramfs download and boot

UEFI PXE Boot

  1. UEFI firmware initializes network stack
  2. UEFI PXE driver sends DHCPv4/v6 DISCOVER
  3. DHCP responds with boot file (e.g., bootx64.efi)
  4. UEFI downloads boot file via TFTP
  5. UEFI loads and executes boot loader (GRUB2, systemd-boot, iPXE)
  6. Boot loader may download additional files (kernel, initrd, config)
  7. OS boots

UEFI HTTP Boot

  1. UEFI firmware with HTTP Boot support enabled
  2. DHCP request includes “HTTPClient” vendor class
  3. DHCP responds with HTTP(S) URL in option 67
  4. UEFI HTTP client downloads boot file over HTTP(S)
  5. Execution continues as with UEFI PXE

Performance Considerations

TFTP vs HTTP

  • TFTP: Slow for large files (typical: 1-5 MB/s)
    • Use for small boot loaders only
    • Chainload to iPXE or HTTP boot for better performance
  • HTTP: 10-100x faster depending on network and server
    • Recommended for kernels, initramfs, live OS images
    • iPXE or UEFI HTTP boot required

Network Speed Impact

DL360 Gen9 boot performance by NIC speed:

  • 1GbE: Adequate for most PXE deployments (100-125 MB/s theoretical max)
  • 10GbE: Significant improvement for large image downloads (1-2 GB/s)
  • Bonding/Teaming: Not typically used for boot (single NIC boots)

Recommendation: For production diskless nodes or frequent re-provisioning, 10GbE with HTTP boot provides best performance.

Common Use Cases

1. Automated OS Provisioning

Boot into installer via PXE:

  • Kickstart (RHEL/CentOS/Rocky)
  • Preseed (Debian/Ubuntu)
  • Ignition (Fedora CoreOS, Flatcar)

2. Diskless Boot

Boot OS entirely from network/RAM:

  • Network root: NFS or iSCSI root filesystem
  • Overlay: Persistent storage via network overlay
  • Stateless: Boot identical image, no local state

3. Rescue and Diagnostics

Boot live environments:

  • SystemRescue
  • Clonezilla
  • Memtest86+
  • Hardware diagnostics (HPE Service Pack for ProLiant)

4. Kubernetes/Container Hosts

PXE boot immutable OS images:

  • Talos Linux: API-driven, diskless k8s nodes
  • Flatcar Container Linux: Automated updates
  • k3OS: Lightweight k8s OS

Troubleshooting

PXE Boot Fails

Symptoms: “PXE-E51: No DHCP or proxy DHCP offers received” or timeout

Checks:

  1. Verify NIC link light and switch port status
  2. Confirm DHCP server is responding (check DHCP logs)
  3. Ensure DHCP options 66 and 67 are set correctly
  4. Test TFTP server accessibility (tftp -i <server> GET <file>)
  5. Check BIOS/UEFI network boot is enabled
  6. Verify boot order prioritizes network boot
  7. Disable Secure Boot if using unsigned boot files

UEFI Network Boot Not Available

Symptoms: Network boot option missing in UEFI boot menu

Resolution:

  1. Enter RBSU (F9), navigate to Network Options
  2. Ensure at least one NIC has “Option ROM” enabled
  3. Verify Boot Mode is set to UEFI (not Legacy)
  4. Update System ROM to latest version if option is missing
  5. Some FlexibleLOM cards require firmware update for UEFI boot support

HTTP Boot Fails

Symptoms: UEFI HTTP boot option present but fails to download

Checks:

  1. Verify firmware version supports HTTP boot (>=2.40)
  2. Ensure DHCP option 67 contains valid HTTP(S) URL
  3. Test URL accessibility from another client
  4. Check DNS resolution if using hostname in URL
  5. For HTTPS: Verify certificate is trusted (or disable cert validation in test)
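Checks 2-4 can be approximated from another host on the boot VLAN; the URL below reuses the earlier example configuration.

# HEAD request against the URL handed out in DHCP option 67
curl -fsSI http://boot.example.com/boot/efi/bootx64.efi

# If the URL uses a hostname, confirm it resolves on the boot network
dig +short boot.example.com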

Slow PXE Boot

Symptoms: Boot process takes minutes instead of seconds

Optimizations:

  1. Switch from TFTP to HTTP (chainload iPXE or use UEFI HTTP boot)
  2. Increase TFTP server block size (tftp-hpa --blocksize 1468)
  3. Tune DHCP response times (reduce lease query delays)
  4. Use local network segment for boot server (avoid WAN/VPN)
  5. Enable NIC interrupt coalescing in BIOS for 10GbE

Security Considerations

Secure Boot

DL360 Gen9 supports UEFI Secure Boot:

  • Validates signed boot loaders (shim, GRUB, kernel)
  • Prevents unsigned code execution during boot
  • Required for some compliance scenarios

Configuration: RBSU > Boot Options > Secure Boot = Enabled

Implications for Network Boot:

  • Must use signed boot loaders (e.g., shim.efi signed by Microsoft/vendor)
  • Custom kernels require signing or disabling Secure Boot
  • iPXE must be signed or chainloaded from signed shim

Network Security

Risks:

  • PXE/TFTP is unencrypted and unauthenticated
  • Attacker on network can serve malicious boot images
  • DHCP spoofing can redirect to malicious boot server

Mitigations:

  1. Network Segmentation: Isolate PXE boot to management VLAN
  2. DHCP Snooping: Prevent rogue DHCP servers on switch
  3. HTTPS Boot: Use UEFI HTTP boot with TLS and certificate validation
  4. iPXE with HTTPS: Chainload iPXE, then use HTTPS for all downloads
  5. Signed Images: Use Secure Boot with signed boot chain
  6. 802.1X: Require network authentication before DHCP (complex for PXE)

iLO Security

  • Change default iLO password immediately
  • Use TLS for iLO web interface and API
  • Restrict iLO network access (firewall, separate VLAN)
  • Disable iLO Virtual Media if not needed
  • Leave the iLO Security Override switch off (it bypasses iLO authentication and is intended only for recovery)

Firmware and Driver Resources

Required Firmware Versions

For optimal network boot support:

  • System ROM: v2.60 or later (latest recommended)
  • iLO 4 Firmware: v2.80 or later
  • NIC Firmware: Latest for specific FlexibleLOM/PCIe card

Check current versions: iLO web interface > Information > Firmware Information

Updating Firmware

Methods:

  1. HPE Service Pack for ProLiant (SPP): Comprehensive update bundle

    • Boot from SPP ISO (via iLO Virtual Media or USB)
    • Runs Smart Update Manager (SUM) in Linux environment
    • Updates all firmware, drivers, system ROM automatically
  2. iLO Web Interface: Individual component updates

    • System ROM: Administration > Firmware > Update Firmware
    • Upload .fwpkg or .bin files from HPE support site
  3. Online Flash Component: Linux Online ROM Flash utility

    • Install hp-firmware-* packages
    • Run updates while OS is running (requires reboot to apply)

Download Source: https://support.hpe.com/connect/s/product?language=en_US&kmpmoid=1010026910 (requires HPE Passport account, free registration)

Best Practices

  1. Use UEFI Mode: Better security, IPv6 support, larger disk support
  2. Enable HTTP Boot: Faster and more reliable than TFTP for large files
  3. Chainload iPXE: Flexibility of iPXE with standard PXE infrastructure
  4. Update Firmware: Keep System ROM and iLO current for bug fixes and features
  5. Isolate Boot Network: Use dedicated management VLAN for PXE/provisioning
  6. Test Failover: Configure multiple DHCP servers and boot mirrors for redundancy
  7. Document Configuration: Record BIOS settings, DHCP config, and boot infrastructure
  8. Monitor iLO Logs: Track boot failures and hardware issues via iLO event log

References

  • HPE ProLiant DL360 Gen9 Server User Guide
  • HPE UEFI System Utilities User Guide
  • iLO 4 User Guide (firmware version 2.80)
  • Intel PXE Specification v2.1
  • UEFI Specification v2.8 (HTTP Boot)
  • iPXE Documentation: https://ipxe.org/

Conclusion

The HP ProLiant DL360 Gen9 provides enterprise-grade network boot capabilities suitable for both traditional PXE deployments and modern UEFI HTTP boot scenarios. Its flexible configuration options, mature firmware support, and iLO integration make it an excellent platform for automated provisioning, diskless computing, and infrastructure-as-code workflows in home lab environments.

For home lab use, the recommended configuration is:

  • UEFI boot mode with Secure Boot disabled (unless required)
  • iPXE chainloading for flexibility and HTTP performance
  • iLO 4 configured for remote management and scripted provisioning
  • Latest firmware for stability and feature support

1.5 - Matchbox Analysis

Analysis of Matchbox network boot service capabilities and architecture

Matchbox Network Boot Analysis

This section contains a comprehensive analysis of Matchbox, a network boot service for provisioning bare-metal machines.

Overview

Matchbox is an HTTP and gRPC service developed by Poseidon that automates bare-metal machine provisioning through network booting. It matches machines to configuration profiles based on hardware attributes and serves boot configurations, kernel images, and provisioning configs.

Primary Repository: poseidon/matchbox
Documentation: https://matchbox.psdn.io/
License: Apache 2.0

Key Features

  • Network Boot Support: iPXE, PXELINUX, GRUB2 chainloading
  • OS Provisioning: Fedora CoreOS, Flatcar Linux, RHEL CoreOS
  • Configuration Management: Ignition v3.x configs, Butane transpilation
  • Machine Matching: Label-based matching (MAC, UUID, hostname, serial, custom)
  • API: Read-only HTTP API + authenticated gRPC API
  • Asset Serving: Local caching of OS images for faster deployment
  • Templating: Go template support for dynamic configuration

Use Cases

  1. Bare-metal Kubernetes clusters - Provision CoreOS nodes for k8s
  2. Lab/development environments - Quick PXE boot for testing
  3. Datacenter provisioning - Automate OS installation across fleets
  4. Immutable infrastructure - Declarative machine provisioning via Terraform

Analysis Contents

Quick Architecture

┌─────────────┐
│   Machine   │ PXE Boot
│  (BIOS/UEFI)│───┐
└─────────────┘   │
                  │
┌─────────────┐   │ DHCP/TFTP
│   dnsmasq   │◄──┘ (chainload to iPXE)
│  DHCP+TFTP  │
└─────────────┘
       │
       │ HTTP
       ▼
┌─────────────────────────┐
│      Matchbox           │
│  ┌──────────────────┐   │
│  │  HTTP Endpoints  │   │ /boot.ipxe, /ignition
│  └──────────────────┘   │
│  ┌──────────────────┐   │
│  │   gRPC API       │   │ Terraform provider
│  └──────────────────┘   │
│  ┌──────────────────┐   │
│  │ Profile/Group    │   │ Match machines
│  │   Matcher        │   │ to configs
│  └──────────────────┘   │
└─────────────────────────┘

Technology Stack

  • Language: Go
  • Config Formats: Ignition JSON, Butane YAML
  • Boot Protocols: PXE, iPXE, GRUB2
  • APIs: HTTP (read-only), gRPC (authenticated)
  • Deployment: Binary, container (Podman/Docker), Kubernetes

Integration Points

  • Terraform: terraform-provider-matchbox for declarative provisioning
  • Ignition/Butane: CoreOS provisioning configs
  • dnsmasq: Reference DHCP/TFTP/DNS implementation (quay.io/poseidon/dnsmasq)
  • Asset sources: Can serve local or remote (HTTPS) OS images

1.5.1 - Configuration Model

Analysis of Matchbox’s profile, group, and templating system

Matchbox Configuration Model

Matchbox uses a flexible configuration model based on Profiles (what to provision) and Groups (which machines get which profile), with support for templating and metadata.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    Matchbox Store                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐            │
│  │  Profiles  │  │   Groups   │  │   Assets   │            │
│  └────────────┘  └────────────┘  └────────────┘            │
│        │               │                                    │
│        │               │                                    │
│        ▼               ▼                                    │
│  ┌─────────────────────────────────────┐                   │
│  │       Matcher Engine                │                   │
│  │  (Label-based group selection)      │                   │
│  └─────────────────────────────────────┘                   │
│                    │                                        │
│                    ▼                                        │
│  ┌─────────────────────────────────────┐                   │
│  │    Template Renderer                │                   │
│  │  (Go templates + metadata)          │                   │
│  └─────────────────────────────────────┘                   │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
            Rendered Config (iPXE, Ignition, etc.)

Data Directory Structure

Matchbox uses a FileStore (default) that reads from -data-path (default: /var/lib/matchbox):

/var/lib/matchbox/
├── groups/              # Machine group definitions (JSON)
│   ├── default.json
│   ├── node1.json
│   └── us-west.json
├── profiles/            # Profile definitions (JSON)
│   ├── worker.json
│   ├── controller.json
│   └── etcd.json
├── ignition/            # Ignition configs (.ign) or Butane (.yaml)
│   ├── worker.ign
│   ├── controller.ign
│   └── butane-example.yaml
├── cloud/               # Cloud-Config templates (DEPRECATED)
│   └── legacy.yaml.tmpl
├── generic/             # Arbitrary config templates
│   ├── setup.cfg
│   └── metadata.yaml.tmpl
└── assets/              # Static files (kernel, initrd)
    ├── fedora-coreos/
    └── flatcar/

Version control: Poseidon recommends keeping /var/lib/matchbox under git for auditability and rollback.
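A minimal way to follow that recommendation (paths as above):

cd /var/lib/matchbox
git init
git add groups/ profiles/ ignition/ generic/
git commit -m "Track matchbox provisioning data"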

Profiles

Profiles define what to provision: network boot settings (kernel, initrd, args) and config references (Ignition, Cloud-Config, generic).

Profile Schema

{
  "id": "worker",
  "name": "Fedora CoreOS Worker Node",
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox.example.com:8080/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  },
  "ignition_id": "worker.ign",
  "cloud_id": "",
  "generic_id": ""
}

Profile Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique profile identifier (referenced by groups) |
| name | string | Human-readable description |
| boot | object | Network boot configuration |
| boot.kernel | string | Kernel URL (HTTP/HTTPS or /assets path) |
| boot.initrd | array | Initrd URLs (can specify --name for multi-initrd) |
| boot.args | array | Kernel command-line arguments |
| ignition_id | string | Ignition/Butane config filename in ignition/ |
| cloud_id | string | Cloud-Config filename in cloud/ (deprecated) |
| generic_id | string | Generic config filename in generic/ |

Boot Configuration Patterns

Pattern 1: Live PXE (RAM-based, ephemeral)

Boot and run OS entirely from RAM, no disk install:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
      "ignition.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  }
}

Use case: Diskless workers, testing, ephemeral compute

Pattern 2: Disk Install (persistent)

PXE boot live image, install to disk, reboot to disk:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  }
}

Key difference: coreos.inst.install_dev triggers disk install before reboot

Pattern 3: Multi-initrd (layered)

Multiple initrds can be loaded (e.g., base + drivers):

{
  "initrd": [
    "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img",
    "--name drivers /assets/drivers/custom-drivers.img"
  ],
  "args": [
    "initrd=main,drivers",
    "..."
  ]
}

Config References

Ignition Configs

Direct Ignition (.ign files):

{
  "ignition_id": "worker.ign"
}

File: /var/lib/matchbox/ignition/worker.ign

{
  "ignition": { "version": "3.3.0" },
  "systemd": {
    "units": [{
      "name": "example.service",
      "enabled": true,
      "contents": "[Service]\nType=oneshot\nExecStart=/usr/bin/echo Hello\n\n[Install]\nWantedBy=multi-user.target"
    }]
  }
}

Butane Configs (transpiled to Ignition):

{
  "ignition_id": "worker.yaml"
}

File: /var/lib/matchbox/ignition/worker.yaml

variant: fcos
version: 1.5.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...
systemd:
  units:
    - name: etcd.service
      enabled: true

Matchbox automatically:

  1. Detects Butane format (file doesn’t end in .ign or .ignition)
  2. Transpiles Butane → Ignition using embedded library
  3. Renders templates with group metadata
  4. Serves as Ignition v3.3.0
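Butane files can also be validated locally before being dropped into ignition/, using the standalone butane tool (a pre-check only; Matchbox performs its own transpilation at serve time):

butane --strict < /var/lib/matchbox/ignition/worker.yaml > /tmp/worker.ign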

Generic Configs

For non-Ignition configs (scripts, YAML, arbitrary data):

{
  "generic_id": "setup-script.sh.tmpl"
}

File: /var/lib/matchbox/generic/setup-script.sh.tmpl

#!/bin/bash
# Rendered with group metadata
NODE_NAME={{.node_name}}
CLUSTER_ID={{.cluster_id}}
echo "Provisioning ${NODE_NAME} in cluster ${CLUSTER_ID}"

Access via: GET /generic?uuid=...&mac=...

Groups

Groups match machines to profiles using selectors (label matching) and provide metadata for template rendering.

Group Schema

{
  "id": "node1-worker",
  "name": "Worker Node 1",
  "profile": "worker",
  "selector": {
    "mac": "52:54:00:89:d8:10",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  },
  "metadata": {
    "node_name": "worker-01",
    "cluster_id": "prod-cluster",
    "etcd_endpoints": "https://10.0.1.10:2379,https://10.0.1.11:2379",
    "ssh_authorized_keys": [
      "ssh-ed25519 AAAA...",
      "ssh-rsa AAAA..."
    ]
  }
}

Group Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique group identifier |
| name | string | Human-readable description |
| profile | string | Profile ID to apply |
| selector | object | Label match criteria (omit for default group) |
| metadata | object | Key-value data for template rendering |

Selector Matching

Reserved selectors (automatically populated from machine attributes):

| Selector | Source | Example | Normalized |
|---|---|---|---|
| uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase |
| mac | Primary NIC MAC | 52:54:00:89:d8:10 | Colon-separated |
| hostname | Network hostname | node1.example.com | As reported |
| serial | Hardware serial | VMware-42 1a... | As reported |
Custom selectors (passed as query params):

{
  "selector": {
    "region": "us-west",
    "environment": "production",
    "rack": "A23"
  }
}

Matching request: /ipxe?mac=52:54:00:89:d8:10&region=us-west&environment=production&rack=A23

Matching logic:

  1. All selector key-value pairs must match request labels (AND logic)
  2. Most specific group wins (most selector matches)
  3. If multiple groups have the same specificity, the selection is non-deterministic (group ordering is undefined), so avoid overlapping selectors of equal specificity
  4. Groups with no selectors = default group (matches all)

Default Groups

Group with empty selector matches all machines:

{
  "id": "default-worker",
  "name": "Default Worker",
  "profile": "worker",
  "metadata": {
    "environment": "dev"
  }
}

⚠️ Warning: Avoid multiple default groups (non-deterministic matching)

Example: Region-based Matching

Group 1: US-West Workers

{
  "id": "us-west-workers",
  "profile": "worker",
  "selector": {
    "region": "us-west"
  },
  "metadata": {
    "etcd_endpoints": "https://etcd-usw.example.com:2379"
  }
}

Group 2: EU Workers

{
  "id": "eu-workers",
  "profile": "worker",
  "selector": {
    "region": "eu"
  },
  "metadata": {
    "etcd_endpoints": "https://etcd-eu.example.com:2379"
  }
}

Group 3: Specific Machine Override

{
  "id": "node-special",
  "profile": "controller",
  "selector": {
    "mac": "52:54:00:89:d8:10",
    "region": "us-west"
  },
  "metadata": {
    "role": "controller"
  }
}

Matching precedence:

  • Machine with mac=52:54:00:89:d8:10&region=us-westnode-special (2 selectors)
  • Machine with region=us-westus-west-workers (1 selector)
  • Machine with region=eueu-workers (1 selector)

Templating System

Matchbox uses Go’s text/template for rendering configs with group metadata.

Template Context

Available variables in Ignition/Butane/Cloud-Config/generic templates:

// Group metadata (all keys from group.metadata)
{{.node_name}}
{{.cluster_id}}
{{.etcd_endpoints}}

// Group selectors (normalized)
{{.mac}}      // e.g., "52:54:00:89:d8:10"
{{.uuid}}     // e.g., "550e8400-..."
{{.region}}   // Custom selector

// Request query params (raw)
{{.request.query.mac}}     // As passed in URL
{{.request.query.foo}}     // Custom query param
{{.request.raw_query}}     // Full query string

// Special functions
{{if index . "ssh_authorized_keys"}}  // Check if key exists
{{range $element := .ssh_authorized_keys}}  // Iterate arrays

Example: Templated Butane Config

Group metadata:

{
  "metadata": {
    "node_name": "worker-01",
    "ssh_authorized_keys": [
      "ssh-ed25519 AAA...",
      "ssh-rsa BBB..."
    ],
    "ntp_servers": ["time1.google.com", "time2.google.com"]
  }
}

Butane template: /var/lib/matchbox/ignition/worker.yaml

variant: fcos
version: 1.5.0

storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: {{.node_name}}

    - path: /etc/systemd/timesyncd.conf
      mode: 0644
      contents:
        inline: |
          [Time]
          {{range $server := .ntp_servers}}
          NTP={{$server}}
          {{end}}

{{if index . "ssh_authorized_keys"}}
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        {{range $key := .ssh_authorized_keys}}
        - {{$key}}
        {{end}}
{{end}}

Rendered Ignition (simplified):

{
  "ignition": {"version": "3.3.0"},
  "storage": {
    "files": [
      {
        "path": "/etc/hostname",
        "contents": {"source": "data:,worker-01"},
        "mode": 420
      },
      {
        "path": "/etc/systemd/timesyncd.conf",
        "contents": {"source": "data:,%5BTime%5D%0ANTP%3Dtime1.google.com%0ANTP%3Dtime2.google.com"},
        "mode": 420
      }
    ]
  },
  "passwd": {
    "users": [{
      "name": "core",
      "sshAuthorizedKeys": ["ssh-ed25519 AAA...", "ssh-rsa BBB..."]
    }]
  }
}

Template Best Practices

  1. Prefer external rendering: Use Terraform + ct_config provider for complex templates
  2. Validate Butane: Use strict: true in Terraform or fcct --strict
  3. Escape carefully: Go templates use {{}}, Butane uses YAML - mind the interaction
  4. Test rendering: Request /ignition?mac=... directly to inspect output
  5. Version control: Keep templates + groups in git for auditability

Reserved Metadata Keys

Warning: .request is reserved for query param access. Group metadata with "request": {...} will be overwritten.

Reserved keys:

  • request.query.* - Query parameters
  • request.raw_query - Raw query string

API Integration

HTTP Endpoints (Read-only)

Endpoint | Purpose | Template Context
/ipxe | iPXE boot script | Profile boot section
/grub | GRUB config | Profile boot section
/ignition | Ignition config | Group metadata + selectors + query
/cloud | Cloud-Config (deprecated) | Group metadata + selectors + query
/generic | Generic config | Group metadata + selectors + query
/metadata | Key-value env format | Group metadata + selectors + query

Example metadata endpoint response:

GET /metadata?mac=52:54:00:89:d8:10&foo=bar

NODE_NAME=worker-01
CLUSTER_ID=prod
MAC=52:54:00:89:d8:10
REQUEST_QUERY_MAC=52:54:00:89:d8:10
REQUEST_QUERY_FOO=bar
REQUEST_RAW_QUERY=mac=52:54:00:89:d8:10&foo=bar
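
Because the response is plain KEY=VALUE lines, a post-boot script can pull individual values out of it; a minimal sketch, assuming the illustrative hostname and MAC above:

#!/bin/bash
# Read selected keys from the env-format metadata endpoint
META_URL='http://matchbox.example.com:8080/metadata?mac=52:54:00:89:d8:10'
NODE_NAME=$(curl -fsS "$META_URL" | grep '^NODE_NAME=' | cut -d= -f2-)
CLUSTER_ID=$(curl -fsS "$META_URL" | grep '^CLUSTER_ID=' | cut -d= -f2-)
echo "Provisioning ${NODE_NAME} for cluster ${CLUSTER_ID}"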

gRPC API (Authenticated, mutable)

Used by terraform-provider-matchbox for declarative infrastructure:

Terraform example:

provider "matchbox" {
  endpoint    = "matchbox.example.com:8081"
  client_cert = file("~/.matchbox/client.crt")
  client_key  = file("~/.matchbox/client.key")
  ca          = file("~/.matchbox/ca.crt")
}

resource "matchbox_profile" "worker" {
  name   = "worker"
  kernel = "/assets/fedora-coreos/.../kernel"
  initrd = ["--name main /assets/fedora-coreos/.../initramfs.img"]
  args   = [
    "initrd=main",
    "coreos.inst.install_dev=/dev/sda",
    "coreos.inst.ignition_url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}"
  ]
  raw_ignition = data.ct_config.worker.rendered
}

resource "matchbox_group" "node1" {
  name    = "node1"
  profile = matchbox_profile.worker.name
  selector = {
    mac = "52:54:00:89:d8:10"
  }
  metadata = {
    node_name = "worker-01"
  }
}

Operations:

  • CreateProfile, GetProfile, UpdateProfile, DeleteProfile
  • CreateGroup, GetGroup, UpdateGroup, DeleteGroup

TLS client authentication required (see deployment docs)
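
With the provider block above in place, profile and group changes go through the normal Terraform cycle; a short sketch (directory name is illustrative):

cd matchbox-infra/
terraform init    # downloads the matchbox (and ct) providers
terraform plan    # preview the profile/group changes before they are pushed
terraform apply   # creates/updates profiles and groups over the gRPC API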

Configuration Workflow

┌─────────────────────────────────────────────────────────────┐
│ 1. Write Butane configs (YAML)                             │
│    - worker.yaml, controller.yaml                          │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Terraform ct_config transpiles Butane → Ignition        │
│    data "ct_config" "worker" {                             │
│      content = file("worker.yaml")                         │
│      strict  = true                                        │
│    }                                                        │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Terraform creates profiles + groups in Matchbox         │
│    matchbox_profile.worker → gRPC CreateProfile()          │
│    matchbox_group.node1 → gRPC CreateGroup()               │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Machine PXE boots, queries Matchbox                     │
│    GET /ipxe?mac=... → matches group → returns profile     │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Ignition fetches rendered config                        │
│    GET /ignition?mac=... → Matchbox returns Ignition       │
└─────────────────────────────────────────────────────────────┘

Benefits:

  • Rich Terraform templating (loops, conditionals, external data sources)
  • Butane validation before deployment
  • Declarative infrastructure (can terraform plan before apply)
  • Version control workflow (git + CI/CD)

Alternative: Manual FileStore

┌─────────────────────────────────────────────────────────────┐
│ 1. Create profile JSON manually                            │
│    /var/lib/matchbox/profiles/worker.json                  │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Create group JSON manually                              │
│    /var/lib/matchbox/groups/node1.json                     │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Write Ignition/Butane config                            │
│    /var/lib/matchbox/ignition/worker.ign                   │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Restart matchbox (to reload FileStore)                  │
│    systemctl restart matchbox                              │
└─────────────────────────────────────────────────────────────┘

Drawbacks:

  • Manual file management
  • No validation before deployment
  • Requires matchbox restart to pick up changes
  • Error-prone for large fleets
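
For completeness, a small lab can still script the manual flow; a hedged sketch reusing the example profile and group fields from this document (adjust IDs and paths to taste):

# 1. Profile: what to boot / which Ignition file to serve
sudo tee /var/lib/matchbox/profiles/worker.json <<'EOF'
{
  "id": "worker",
  "ignition_id": "worker.ign"
}
EOF

# 2. Group: which machines receive that profile
sudo tee /var/lib/matchbox/groups/node1.json <<'EOF'
{
  "id": "node1-worker",
  "profile": "worker",
  "selector": {"mac": "52:54:00:89:d8:10"},
  "metadata": {"node_name": "worker-01"}
}
EOF

# 3. Restart to reload the FileStore
sudo systemctl restart matchbox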

Storage Backends

FileStore (Default)

Config: -data-path=/var/lib/matchbox

Pros:

  • Simple file-based storage
  • Easy to version control (git)
  • Human-readable JSON

Cons:

  • Requires file system access
  • Manual reload for gRPC-created resources

Custom Store (Extensible)

Matchbox’s Store interface allows custom backends:

type Store interface {
  ProfileGet(id string) (*Profile, error)
  GroupGet(id string) (*Group, error)
  IgnitionGet(name string) (string, error)
  // ... other methods
}

Potential custom stores:

  • etcd backend (for HA Matchbox)
  • Database backend (PostgreSQL, MySQL)
  • S3/object storage backend

Note: Not officially provided by Matchbox project; requires custom implementation

Security Considerations

  1. gRPC API authentication: Requires TLS client certificates

    • ca.crt - CA that signed client certs
    • server.crt/server.key - Server TLS identity
    • client.crt/client.key - Client credentials (Terraform)
  2. HTTP endpoints are read-only: No auth, machines fetch configs

    • Do NOT put secrets in Ignition configs
    • Use external secret stores (Vault, GCP Secret Manager)
    • Reference secrets via Ignition files.source with auth headers
  3. Network segmentation: Matchbox on provisioning VLAN, isolate from production

  4. Config validation: Validate Ignition/Butane before deployment to avoid boot failures

  5. Audit logging: Version control groups/profiles; log gRPC API changes

Operational Tips

  1. Test groups with curl:

    curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
    
  2. List profiles:

    ls -la /var/lib/matchbox/profiles/
    
  3. Validate Butane:

    podman run -i --rm quay.io/coreos/fcct:release --strict < worker.yaml
    
  4. Check group matching:

    # Default group (no selectors)
    curl http://matchbox.example.com:8080/ignition
    
    # Specific machine
    curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&uuid=550e8400-e29b-41d4-a716-446655440000'
    
  5. Backup configs:

    tar -czf matchbox-backup-$(date +%F).tar.gz /var/lib/matchbox/{groups,profiles,ignition}
    

Summary

Matchbox’s configuration model provides:

  • Separation of concerns: Profiles (what) vs Groups (who/where)
  • Flexible matching: Label-based, multi-attribute, custom selectors
  • Template support: Go templates for dynamic configs (but prefer external rendering)
  • API-driven: Terraform integration for GitOps workflows
  • Storage options: FileStore (simple) or custom backends (extensible)
  • OS-agnostic: Works with any Ignition-based distro (FCOS, Flatcar, RHCOS)

Best practice: Use Terraform + external Butane configs for production; manual FileStore for labs/development.

1.5.2 - Deployment Patterns

Matchbox deployment options and operational considerations

Matchbox Deployment Patterns

Analysis of deployment architectures, installation methods, and operational considerations for running Matchbox in production.

Deployment Architectures

Single-Host Deployment

┌─────────────────────────────────────────────────────┐
│           Provisioning Host                         │
│  ┌─────────────┐        ┌─────────────┐            │
│  │  Matchbox   │        │  dnsmasq    │            │
│  │  :8080 HTTP │        │  DHCP/TFTP  │            │
│  │  :8081 gRPC │        │  :67,:69    │            │
│  └─────────────┘        └─────────────┘            │
│         │                      │                    │
│         └──────────┬───────────┘                    │
│                    │                                │
│  /var/lib/matchbox/                                 │
│  ├── groups/                                        │
│  ├── profiles/                                      │
│  ├── ignition/                                      │
│  └── assets/                                        │
└─────────────────────────────────────────────────────┘
              │
              │ Network
              ▼
     ┌──────────────┐
     │ PXE Clients  │
     └──────────────┘

Use case: Lab, development, small deployments (<50 machines)

Pros:

  • Simple setup
  • Single service to manage
  • Minimal resource requirements

Cons:

  • Single point of failure
  • No scalability
  • Downtime during updates

HA Deployment (Multiple Matchbox Instances)

┌─────────────────────────────────────────────────────┐
│              Load Balancer (Ingress/HAProxy)        │
│           :8080 HTTP        :8081 gRPC              │
└─────────────────────────────────────────────────────┘
       │                              │
       ├─────────────┬────────────────┤
       ▼             ▼                ▼
┌──────────┐  ┌──────────┐    ┌──────────┐
│Matchbox 1│  │Matchbox 2│    │Matchbox N│
│ (Pod/VM) │  │ (Pod/VM) │    │ (Pod/VM) │
└──────────┘  └──────────┘    └──────────┘
       │             │                │
       └─────────────┴────────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Shared Storage        │
         │  /var/lib/matchbox     │
         │  (NFS, PV, ConfigMap)  │
         └────────────────────────┘

Use case: Production, datacenter-scale (100+ machines)

Pros:

  • High availability (no single point of failure)
  • Rolling updates (zero downtime)
  • Load distribution

Cons:

  • Complex storage (shared volume or etcd backend)
  • More infrastructure required

Storage options:

  1. Kubernetes PersistentVolume (RWX mode)
  2. NFS share mounted on multiple hosts
  3. Custom etcd-backed Store (requires custom implementation)
  4. Git-sync sidecar (read-only, periodic pull)
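
As one concrete form of option 2, every instance mounts the same export before matchbox starts; a minimal sketch with a placeholder NFS server and export path:

# On each Matchbox host (nfs.example.com:/srv/matchbox is a placeholder)
sudo mkdir -p /var/lib/matchbox
echo 'nfs.example.com:/srv/matchbox  /var/lib/matchbox  nfs  defaults  0 0' | sudo tee -a /etc/fstab
sudo mount /var/lib/matchbox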

Kubernetes Deployment

┌─────────────────────────────────────────────────────┐
│              Ingress Controller                     │
│  matchbox.example.com → Service matchbox:8080       │
│  matchbox-rpc.example.com → Service matchbox:8081   │
└─────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│          Service: matchbox (ClusterIP)              │
│            ports: 8080/TCP, 8081/TCP                │
└─────────────────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│  Pod: matchbox  │     │  Pod: matchbox  │
│  replicas: 2+   │     │  replicas: 2+   │
└─────────────────┘     └─────────────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
┌─────────────────────────────────────────────────────┐
│    PersistentVolumeClaim: matchbox-data             │
│    /var/lib/matchbox (RWX mode)                     │
└─────────────────────────────────────────────────────┘

Manifest structure:

contrib/k8s/
├── matchbox-deployment.yaml  # Deployment + replicas
├── matchbox-service.yaml     # Service (8080, 8081)
├── matchbox-ingress.yaml     # Ingress (HTTP + gRPC TLS)
└── matchbox-pvc.yaml         # PersistentVolumeClaim

Key configurations:

  1. Secret for gRPC TLS:

    kubectl create secret generic matchbox-rpc \
      --from-file=ca.crt \
      --from-file=server.crt \
      --from-file=server.key
    
  2. Ingress for gRPC (TLS passthrough):

    metadata:
      annotations:
        nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    
  3. Volume mount:

    volumes:
      - name: data
        persistentVolumeClaim:
          claimName: matchbox-data
    volumeMounts:
      - name: data
        mountPath: /var/lib/matchbox
    

Use case: Cloud-native deployments, Kubernetes-based infrastructure

Pros:

  • Native Kubernetes primitives (Deployments, Services, Ingress)
  • Rolling updates via Deployment strategy
  • Easy scaling (kubectl scale)
  • Health checks + auto-restart

Cons:

  • Requires RWX PersistentVolume or shared storage
  • Ingress TLS configuration complexity (gRPC passthrough)
  • Cluster dependency (can’t provision cluster bootstrap nodes)

⚠️ Bootstrap problem: Kubernetes-hosted Matchbox can’t PXE boot its own cluster nodes (chicken-and-egg). Use external Matchbox for initial cluster bootstrap, then migrate.

Installation Methods

1. Binary Installation (systemd)

Recommended for: Bare-metal hosts, VMs, traditional Linux servers

Steps:

  1. Download and verify:

    wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz
    wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz.asc
    gpg --verify matchbox-v0.10.0-linux-amd64.tar.gz.asc
    
  2. Extract and install:

    tar xzf matchbox-v0.10.0-linux-amd64.tar.gz
    sudo cp matchbox-v0.10.0-linux-amd64/matchbox /usr/local/bin/
    
  3. Create user and directories:

    sudo useradd -U matchbox
    sudo mkdir -p /var/lib/matchbox/{assets,groups,profiles,ignition}
    sudo chown -R matchbox:matchbox /var/lib/matchbox
    
  4. Install systemd unit:

    sudo cp contrib/systemd/matchbox.service /etc/systemd/system/
    
  5. Configure via systemd dropin:

    sudo systemctl edit matchbox
    
    [Service]
    Environment="MATCHBOX_ADDRESS=0.0.0.0:8080"
    Environment="MATCHBOX_RPC_ADDRESS=0.0.0.0:8081"
    Environment="MATCHBOX_LOG_LEVEL=debug"
    
  6. Start service:

    sudo systemctl daemon-reload
    sudo systemctl start matchbox
    sudo systemctl enable matchbox
    
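
A quick post-install sanity check; the HTTP root should answer with the literal string matchbox (see the monitoring section below):

systemctl status matchbox --no-pager
curl http://127.0.0.1:8080          # expect: matchbox
journalctl -u matchbox -n 50 --no-pager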

Pros:

  • Direct control over service
  • Easy log access (journalctl -u matchbox)
  • Native OS integration

Cons:

  • Manual updates required
  • OS dependency (package compatibility)

2. Container Deployment (Docker/Podman)

Recommended for: Docker hosts, quick testing, immutable infrastructure

Docker:

mkdir -p /var/lib/matchbox/assets
docker run -d --name matchbox \
  --net=host \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug

Podman:

podman run -d --name matchbox \
  --net=host \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug

Volume mounts:

  • /var/lib/matchbox - Data directory (groups, profiles, configs, assets)
  • /etc/matchbox - TLS certificates (ca.crt, server.crt, server.key)

Network mode:

  • --net=host - Required for DHCP/TFTP interaction on same host
  • Bridge mode possible if Matchbox is on separate host from dnsmasq

Pros:

  • Immutable deployments
  • Easy updates (pull new image)
  • Portable across hosts

Cons:

  • Volume management complexity
  • SELinux considerations (:Z flag)

3. Kubernetes Deployment

Recommended for: Kubernetes environments, cloud platforms

Quick start:

# Create TLS secret for gRPC
kubectl create secret generic matchbox-rpc \
  --from-file=ca.crt=~/.matchbox/ca.crt \
  --from-file=server.crt=~/.matchbox/server.crt \
  --from-file=server.key=~/.matchbox/server.key

# Deploy manifests
kubectl apply -R -f contrib/k8s/

# Check status
kubectl get pods -l app=matchbox
kubectl get svc matchbox
kubectl get ingress matchbox matchbox-rpc

Persistence options:

Option 1: emptyDir (ephemeral, dev only):

volumes:
  - name: data
    emptyDir: {}

Option 2: PersistentVolumeClaim (production):

volumes:
  - name: data
    persistentVolumeClaim:
      claimName: matchbox-data

Option 3: ConfigMap (static configs):

volumes:
  - name: groups
    configMap:
      name: matchbox-groups
  - name: profiles
    configMap:
      name: matchbox-profiles

Option 4: Git-sync sidecar (GitOps):

initContainers:
  - name: git-sync
    image: k8s.gcr.io/git-sync:v3.6.3
    env:
      - name: GIT_SYNC_REPO
        value: https://github.com/example/matchbox-configs
      - name: GIT_SYNC_DEST
        value: /var/lib/matchbox
    volumeMounts:
      - name: data
        mountPath: /var/lib/matchbox

Pros:

  • Native k8s features (scaling, health checks, rolling updates)
  • Ingress integration
  • GitOps workflows

Cons:

  • Complexity (Ingress, PVC, TLS)
  • Can’t bootstrap own cluster

Network Boot Environment Setup

Matchbox requires separate DHCP/TFTP/DNS services. Options:

Option 1: dnsmasq Container (Quickest)

Use case: Lab, testing, environments without existing DHCP

Full DHCP + TFTP + DNS:

docker run -d --name dnsmasq \
  --cap-add=NET_ADMIN \
  --net=host \
  quay.io/poseidon/dnsmasq:latest \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254,30m \
  --enable-tftp \
  --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries \
  --log-dhcp

Proxy DHCP (alongside existing DHCP):

docker run -d --name dnsmasq \
  --cap-add=NET_ADMIN \
  --net=host \
  quay.io/poseidon/dnsmasq:latest \
  -d -q \
  --dhcp-range=192.168.1.1,proxy,255.255.255.0 \
  --enable-tftp \
  --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
  --pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
  --log-queries \
  --log-dhcp

Included files: undionly.kpxe, ipxe.efi, grub.efi (bundled in image)

Option 2: Existing DHCP/TFTP Infrastructure

Use case: Enterprise environments with network admin policies

Required DHCP options (ISC DHCP example):

subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.10 192.168.1.250;
  
  # BIOS clients
  if option architecture-type = 00:00 {
    filename "undionly.kpxe";
  }
  # UEFI clients
  elsif option architecture-type = 00:09 {
    filename "ipxe.efi";
  }
  # iPXE clients
  elsif exists user-class and option user-class = "iPXE" {
    filename "http://matchbox.example.com:8080/boot.ipxe";
  }
  
  next-server 192.168.1.100;  # TFTP server IP
}

TFTP files (place in the TFTP root): undionly.kpxe, ipxe.efi, and grub.efi (the same bootloaders bundled in the poseidon/dnsmasq image)

Option 3: iPXE-only (No PXE Chainload)

Use case: Modern hardware with native iPXE firmware

DHCP config (simpler):

filename "http://matchbox.example.com:8080/boot.ipxe";

No TFTP server needed (iPXE fetches directly via HTTP)

Limitation: Doesn’t support legacy BIOS with basic PXE ROM

TLS Certificate Setup

gRPC API requires TLS client certificates for authentication.

Option 1: Provided cert-gen Script

cd scripts/tls
export SAN=DNS.1:matchbox.example.com,IP.1:192.168.1.100
./cert-gen

Generates:

  • ca.crt - Self-signed CA
  • server.crt, server.key - Server credentials
  • client.crt, client.key - Client credentials (for Terraform)

Install server certs:

sudo mkdir -p /etc/matchbox
sudo cp ca.crt server.crt server.key /etc/matchbox/
sudo chown -R matchbox:matchbox /etc/matchbox

Save client certs for Terraform:

mkdir -p ~/.matchbox
cp client.crt client.key ca.crt ~/.matchbox/
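
Before distributing the generated certificates, it may be worth confirming the SAN and validity dates landed in the server certificate; a short sketch using the paths above:

openssl x509 -in /etc/matchbox/server.crt -noout -subject -dates
openssl x509 -in /etc/matchbox/server.crt -noout -text | grep -A1 'Subject Alternative Name'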

Option 2: Corporate PKI

Preferred for production: Use organization’s certificate authority

Requirements:

  • Server cert with SAN: DNS:matchbox.example.com
  • Client cert issued by same CA
  • CA cert for validation

Matchbox flags:

-ca-file=/etc/matchbox/ca.crt
-cert-file=/etc/matchbox/server.crt
-key-file=/etc/matchbox/server.key

Terraform provider config:

provider "matchbox" {
  endpoint    = "matchbox.example.com:8081"
  client_cert = file("/path/to/client.crt")
  client_key  = file("/path/to/client.key")
  ca          = file("/path/to/ca.crt")
}

Option 3: Let’s Encrypt (HTTP API only)

Note: gRPC requires client cert auth (incompatible with Let’s Encrypt)

Use case: TLS for HTTP endpoints only (read-only API)

Matchbox flags:

-web-ssl=true
-web-cert-file=/etc/letsencrypt/live/matchbox.example.com/fullchain.pem
-web-key-file=/etc/letsencrypt/live/matchbox.example.com/privkey.pem

Limitation: Still need self-signed certs for gRPC API

Configuration Flags

Core Flags

Flag | Default | Description
-address | 127.0.0.1:8080 | HTTP API listen address
-rpc-address | (empty) | gRPC API listen address (empty = disabled)
-data-path | /var/lib/matchbox | Data directory (FileStore)
-assets-path | /var/lib/matchbox/assets | Static assets directory
-log-level | info | Logging level (debug, info, warn, error)

TLS Flags (gRPC)

Flag | Default | Description
-ca-file | /etc/matchbox/ca.crt | CA certificate for client verification
-cert-file | /etc/matchbox/server.crt | Server TLS certificate
-key-file | /etc/matchbox/server.key | Server TLS private key

TLS Flags (HTTP, optional)

Flag | Default | Description
-web-ssl | false | Enable TLS for HTTP API
-web-cert-file | (empty) | HTTP server TLS certificate
-web-key-file | (empty) | HTTP server TLS private key

Environment Variables

All flags can be set via environment variables with MATCHBOX_ prefix:

export MATCHBOX_ADDRESS=0.0.0.0:8080
export MATCHBOX_RPC_ADDRESS=0.0.0.0:8081
export MATCHBOX_LOG_LEVEL=debug
export MATCHBOX_DATA_PATH=/custom/path

Operational Considerations

Firewall Configuration

Matchbox host:

firewall-cmd --permanent --add-port=8080/tcp  # HTTP API
firewall-cmd --permanent --add-port=8081/tcp  # gRPC API
firewall-cmd --reload

dnsmasq host (if separate):

firewall-cmd --permanent --add-service=dhcp
firewall-cmd --permanent --add-service=tftp
firewall-cmd --permanent --add-service=dns  # optional
firewall-cmd --reload

Monitoring

Health check endpoints:

# HTTP API
curl http://matchbox.example.com:8080
# Should return: matchbox

# gRPC API
openssl s_client -connect matchbox.example.com:8081 \
  -CAfile ~/.matchbox/ca.crt \
  -cert ~/.matchbox/client.crt \
  -key ~/.matchbox/client.key

Prometheus metrics: not built in; consider fronting Matchbox with a reverse proxy (e.g., nginx) that exposes metrics
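
In the absence of native metrics, a scheduled external probe of the HTTP root is a common stopgap; a minimal sketch (run from cron or a systemd timer; the hostname is the example one used throughout):

#!/bin/bash
# Log a warning and exit non-zero if the Matchbox HTTP API stops answering
if ! curl -fsS --max-time 5 http://matchbox.example.com:8080 >/dev/null; then
  logger -t matchbox-probe "Matchbox HTTP API health check failed"
  exit 1
fi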

Logs (systemd):

journalctl -u matchbox -f

Logs (container):

docker logs -f matchbox

Backup Strategy

What to backup:

  1. /var/lib/matchbox/{groups,profiles,ignition} - Configs
  2. /etc/matchbox/*.{crt,key} - TLS certificates
  3. Terraform state (if using Terraform provider)

Backup command:

tar -czf matchbox-backup-$(date +%F).tar.gz \
  /var/lib/matchbox/{groups,profiles,ignition} \
  /etc/matchbox

Restore:

tar -xzf matchbox-backup-YYYY-MM-DD.tar.gz -C /
sudo chown -R matchbox:matchbox /var/lib/matchbox
sudo systemctl restart matchbox

GitOps approach: Store configs in git repository for versioning and auditability

Updates

Binary deployment:

# Download new version
wget https://github.com/poseidon/matchbox/releases/download/vX.Y.Z/matchbox-vX.Y.Z-linux-amd64.tar.gz
tar xzf matchbox-vX.Y.Z-linux-amd64.tar.gz

# Replace binary
sudo systemctl stop matchbox
sudo cp matchbox-vX.Y.Z-linux-amd64/matchbox /usr/local/bin/
sudo systemctl start matchbox

Container deployment:

docker pull quay.io/poseidon/matchbox:vX.Y.Z
docker stop matchbox
docker rm matchbox
docker run -d --name matchbox ... quay.io/poseidon/matchbox:vX.Y.Z ...

Kubernetes deployment:

kubectl set image deployment/matchbox matchbox=quay.io/poseidon/matchbox:vX.Y.Z
kubectl rollout status deployment/matchbox

Scaling Considerations

Vertical scaling (single instance):

  • CPU: Minimal (config rendering is lightweight)
  • Memory: ~50MB base + asset cache
  • Disk: Depends on cached assets (100MB - 10GB+)

Horizontal scaling (multiple instances):

  • Stateless HTTP API (load balance round-robin)
  • Shared storage required (RWX PV, NFS, or custom backend)
  • gRPC API can be load-balanced with gRPC-aware LB

Asset serving optimization:

  • Use CDN or cache proxy for remote assets
  • Local asset caching for <100 machines
  • Dedicated HTTP server (nginx) for large deployments (1000+ machines)

Security Best Practices

  1. Don’t store secrets in Ignition configs

    • Use Ignition files.source with auth headers to fetch from Vault
    • Or provision minimal config, fetch secrets post-boot
  2. Network segmentation

    • Provision VLAN isolated from production
    • Firewall rules: only allow provisioning traffic
  3. gRPC API access control

    • Client cert authentication (mandatory)
    • Restrict cert issuance to authorized personnel/systems
    • Rotate certs periodically
  4. Audit logging

    • Version control groups/profiles (git)
    • Log gRPC API changes (Terraform state tracking)
    • Monitor HTTP endpoint access
  5. Validate configs before deployment

    • fcct --strict for Butane configs
    • Terraform plan before apply
    • Test in dev environment first

Troubleshooting

Common Issues

1. Machines not PXE booting:

# Check DHCP responses
tcpdump -i eth0 port 67 and port 68

# Verify TFTP files
ls -la /var/lib/tftpboot/
curl tftp://192.168.1.100/undionly.kpxe

# Check Matchbox accessibility
curl http://matchbox.example.com:8080/boot.ipxe

2. 404 Not Found on /ignition:

# Test group matching
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'

# Check group exists
ls -la /var/lib/matchbox/groups/

# Check profile referenced by group exists
ls -la /var/lib/matchbox/profiles/

# Verify ignition_id file exists
ls -la /var/lib/matchbox/ignition/

3. gRPC connection refused (Terraform):

# Test TLS connection
openssl s_client -connect matchbox.example.com:8081 \
  -CAfile ~/.matchbox/ca.crt \
  -cert ~/.matchbox/client.crt \
  -key ~/.matchbox/client.key

# Check Matchbox gRPC is listening
sudo ss -tlnp | grep 8081

# Verify firewall
sudo firewall-cmd --list-ports

4. Ignition config validation errors:

# Validate Butane locally
podman run -i --rm quay.io/coreos/fcct:release --strict < config.yaml

# Fetch rendered Ignition
curl 'http://matchbox.example.com:8080/ignition?mac=...' | jq .

# Validate Ignition spec
curl 'http://matchbox.example.com:8080/ignition?mac=...' | \
  podman run -i --rm quay.io/coreos/ignition-validate:latest

Summary

Matchbox deployment considerations:

  • Architecture: Single-host (dev/lab) vs HA (production) vs Kubernetes
  • Installation: Binary (systemd), container (Docker/Podman), or Kubernetes manifests
  • Network boot: dnsmasq container (quick), existing infrastructure (enterprise), or iPXE-only (modern)
  • TLS: Self-signed (dev), corporate PKI (production), Let’s Encrypt (HTTP only)
  • Scaling: Vertical (simple) vs horizontal (requires shared storage)
  • Security: Client cert auth, network segmentation, no secrets in configs
  • Operations: Backup configs, GitOps workflow, monitoring/logging

Recommendation for production:

  • HA deployment (2+ instances) with load balancer
  • Shared storage (NFS or RWX PV on Kubernetes)
  • Corporate PKI for TLS certificates
  • GitOps workflow (Terraform + git-controlled configs)
  • Network segmentation (dedicated provisioning VLAN)
  • Prometheus/Grafana monitoring

1.5.3 - Network Boot Support

Detailed analysis of Matchbox’s network boot capabilities

Network Boot Support in Matchbox

Matchbox provides comprehensive network boot support for bare-metal provisioning, supporting multiple boot firmware types and protocols.

Overview

Matchbox serves as an HTTP entrypoint for network-booted machines but does not implement DHCP, TFTP, or DNS services itself. Instead, it integrates with existing network infrastructure (or companion services like dnsmasq) to provide a complete PXE boot solution.

Boot Protocol Support

1. PXE (Preboot Execution Environment)

Legacy BIOS support via chainloading to iPXE:

Machine BIOS → DHCP (gets TFTP server) → TFTP (gets undionly.kpxe) 
→ iPXE firmware → HTTP (Matchbox /boot.ipxe)

Key characteristics:

  • Requires TFTP server to serve undionly.kpxe (iPXE bootloader)
  • Chainloads from legacy PXE ROM to modern iPXE
  • Supports older hardware with basic PXE firmware
  • TFTP only used for initial iPXE bootstrap; subsequent downloads via HTTP

2. iPXE (Enhanced PXE)

Primary boot method supported by Matchbox:

iPXE Client → DHCP (gets boot script URL) → HTTP (Matchbox endpoints)
→ Kernel/initrd download → Boot with Ignition config

Endpoints served by Matchbox:

Endpoint | Purpose
/boot.ipxe | Static script that gathers machine attributes (UUID, MAC, hostname, serial)
/ipxe?<labels> | Rendered iPXE script with kernel, initrd, and boot args for matched machine
/assets/ | Optional local caching of kernel/initrd images

Example iPXE flow:

  1. Machine boots with iPXE firmware
  2. DHCP response points to http://matchbox.example.com:8080/boot.ipxe
  3. iPXE fetches /boot.ipxe:
    #!ipxe
    chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&domain=${domain}&hostname=${hostname}&serial=${serial}
    
  4. iPXE makes request to /ipxe?uuid=...&mac=... with machine attributes
  5. Matchbox matches machine to group/profile and renders iPXE script:
    #!ipxe
    kernel /assets/coreos/VERSION/coreos_production_pxe.vmlinuz \
      coreos.config.url=http://matchbox.foo:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp} \
      coreos.first_boot=1
    initrd /assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz
    boot
    

Advantages:

  • HTTP downloads (faster than TFTP)
  • Scriptable boot logic
  • Can fetch configs from HTTP endpoints
  • Supports HTTPS (if compiled with TLS support)

3. GRUB2

UEFI firmware support:

UEFI Firmware → DHCP (gets GRUB bootloader) → TFTP (grub.efi)
→ GRUB → HTTP (Matchbox /grub endpoint)

Matchbox endpoint: /grub?<labels>

Example GRUB config rendered by Matchbox:

default=0
timeout=1
menuentry "CoreOS" {
  echo "Loading kernel"
  linuxefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe.vmlinuz" \
    "coreos.config.url=http://matchbox.foo:8080/ignition" "coreos.first_boot"
  echo "Loading initrd"
  initrdefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz"
}

Use case:

  • UEFI systems that prefer GRUB over iPXE
  • Environments with existing GRUB network boot infrastructure

4. PXELINUX (Legacy, via TFTP)

While not a primary Matchbox target, PXELINUX clients can be configured to chainload iPXE:

# /var/lib/tftpboot/pxelinux.cfg/default
timeout 10
default iPXE
LABEL iPXE
KERNEL ipxe.lkrn
APPEND dhcp && chain http://matchbox.example.com:8080/boot.ipxe

DHCP Configuration Patterns

Matchbox supports two DHCP deployment models:

Pattern 1: PXE-Enabled DHCP

Full DHCP server provides IP allocation + PXE boot options.

Example dnsmasq configuration:

dhcp-range=192.168.1.1,192.168.1.254,30m
enable-tftp
tftp-root=/var/lib/tftpboot

# Legacy BIOS → chainload to iPXE
dhcp-match=set:bios,option:client-arch,0
dhcp-boot=tag:bios,undionly.kpxe

# UEFI → iPXE
dhcp-match=set:efi32,option:client-arch,6
dhcp-boot=tag:efi32,ipxe.efi
dhcp-match=set:efi64,option:client-arch,9
dhcp-boot=tag:efi64,ipxe.efi

# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe

# DNS for Matchbox
address=/matchbox.example.com/192.168.1.100

Client architecture detection:

  • Option 93 (client-arch): Identifies BIOS (0), UEFI32 (6), UEFI64 (9)
  • User class: Detects iPXE clients to skip TFTP chainloading

Pattern 2: Proxy DHCP

Runs alongside existing DHCP server; provides only boot options (no IP allocation).

Example dnsmasq proxy-DHCP:

dhcp-range=192.168.1.1,proxy,255.255.255.0
enable-tftp
tftp-root=/var/lib/tftpboot

# Chainload legacy PXE to iPXE
pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe

Benefits:

  • Non-invasive: doesn’t replace existing DHCP
  • PXE clients receive merged responses from both DHCP servers
  • Ideal for environments where main DHCP cannot be modified

Network Boot Flow (Complete)

Scenario: BIOS machine with legacy PXE firmware

┌──────────────────────────────────────────────────────────────────┐
│ 1. Machine powers on, BIOS set to network boot                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. NIC PXE firmware broadcasts DHCPDISCOVER (PXEClient)          │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. DHCP/proxyDHCP responds with:                                 │
│    - IP address (if full DHCP)                                   │
│    - Next-server: TFTP server IP                                 │
│    - Filename: undionly.kpxe (based on arch=0)                   │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. PXE firmware downloads undionly.kpxe via TFTP                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. Execute iPXE (undionly.kpxe)                                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. iPXE requests DHCP again, identifies as iPXE (user-class)     │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. DHCP responds with boot URL (not TFTP):                       │
│    http://matchbox.example.com:8080/boot.ipxe                    │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 8. iPXE fetches /boot.ipxe via HTTP:                             │
│    #!ipxe                                                        │
│    chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&...                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 9. iPXE chains to /ipxe?uuid=XXX&mac=YYY (introspected labels)   │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 10. Matchbox matches machine to group/profile                    │
│     - Finds most specific group matching labels                  │
│     - Retrieves profile (kernel, initrd, args, configs)          │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 11. Matchbox renders iPXE script with:                           │
│     - kernel URL (local asset or remote HTTPS)                   │
│     - initrd URL                                                 │
│     - kernel args (including ignition.config.url)                │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 12. iPXE downloads kernel + initrd (HTTP/HTTPS)                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 13. iPXE boots kernel with specified args                        │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 14. Fedora CoreOS/Flatcar boots, Ignition runs                   │
│     - Fetches /ignition?uuid=XXX&mac=YYY from Matchbox           │
│     - Matchbox renders Ignition config with group metadata       │
│     - Ignition partitions disk, writes files, creates users      │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 15. System reboots (if disk install), boots from disk            │
└──────────────────────────────────────────────────────────────────┘

Asset Serving

Matchbox can serve static assets (kernel, initrd images) from a local directory to reduce bandwidth and increase speed:

Asset directory structure:

/var/lib/matchbox/assets/
├── fedora-coreos/
│   └── 36.20220906.3.2/
│       ├── fedora-coreos-36.20220906.3.2-live-kernel-x86_64
│       ├── fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img
│       └── fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img
└── flatcar/
    └── 3227.2.0/
        ├── flatcar_production_pxe.vmlinuz
        ├── flatcar_production_pxe_image.cpio.gz
        └── version.txt

HTTP endpoint: http://matchbox.example.com:8080/assets/

Scripts provided:

  • scripts/get-fedora-coreos - Download/verify Fedora CoreOS images
  • scripts/get-flatcar - Download/verify Flatcar Linux images
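
If the helper scripts are not used, assets can also be staged by hand under the assets directory; a hedged sketch following the Fedora CoreOS release URL pattern shown further below (version and stream are illustrative):

VER=36.20220906.3.2
BASE=https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/${VER}/x86_64
DEST=/var/lib/matchbox/assets/fedora-coreos/${VER}
sudo mkdir -p "${DEST}"
for f in live-kernel-x86_64 live-initramfs.x86_64.img live-rootfs.x86_64.img; do
  sudo curl -L -o "${DEST}/fedora-coreos-${VER}-${f}" "${BASE}/fedora-coreos-${VER}-${f}"
done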

Profile reference:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": ["--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"]
  }
}

Alternative: Profiles can reference remote HTTPS URLs (requires iPXE compiled with TLS support):

{
  "kernel": "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-kernel-x86_64"
}

OS Support

Fedora CoreOS

Boot types:

  1. Live PXE (RAM-only, ephemeral)
  2. Install to disk (persistent, recommended)

Required kernel args:

  • coreos.inst.install_dev=/dev/sda - Target disk for install
  • coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Provisioning config
  • coreos.live.rootfs_url=... - Root filesystem image

Ignition fetch: During first boot, ignition.service fetches config from Matchbox

Flatcar Linux

Boot types:

  1. Live PXE (RAM-only)
  2. Install to disk

Required kernel args:

  • flatcar.first_boot=yes - Marks first boot
  • flatcar.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Ignition config URL
  • flatcar.autologin - Auto-login to console (optional, dev/debug)

Ignition support: Flatcar uses Ignition v3.x for provisioning

RHEL CoreOS

Supported as it uses Ignition like Fedora CoreOS. Requires Red Hat-specific image sources.

Machine Matching & Labels

Matchbox matches machines to profiles using labels extracted during boot:

Reserved Label Selectors

Label | Source | Example | Normalized
uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase
mac | NIC MAC address | 52:54:00:89:d8:10 | Normalized to colons
hostname | Network boot program | node1.example.com | As-is
serial | Hardware serial | VMware-42 1a... | As-is

Custom Labels

Groups can match on arbitrary labels passed as query params:

/ipxe?mac=52:54:00:89:d8:10&region=us-west&env=prod

Matching precedence: Most specific group wins (most selector matches)

Firmware Compatibility

Firmware Type | Client Arch | Boot File | Protocol | Matchbox Support
BIOS (legacy PXE) | 0 | undionly.kpxe → iPXE | TFTP → HTTP | ✅ Via chainload
UEFI 32-bit | 6 | ipxe.efi | TFTP → HTTP | ✅
UEFI (BIOS compat) | 7 | ipxe.efi | TFTP → HTTP | ✅
UEFI 64-bit | 9 | ipxe.efi | TFTP → HTTP | ✅
Native iPXE | - | N/A | HTTP | ✅ Direct
GRUB (UEFI) | - | grub.efi | TFTP → HTTP | ✅ /grub endpoint

Network Requirements

Firewall rules on Matchbox host:

# HTTP API (read-only)
firewall-cmd --add-port=8080/tcp --permanent

# gRPC API (authenticated, Terraform)
firewall-cmd --add-port=8081/tcp --permanent

DNS requirement:

  • matchbox.example.com must resolve to Matchbox server IP
  • Can be configured in dnsmasq, corporate DNS, or /etc/hosts on DHCP server

DHCP/TFTP host (if using dnsmasq):

firewall-cmd --add-service=dhcp --permanent
firewall-cmd --add-service=tftp --permanent
firewall-cmd --add-service=dns --permanent  # optional

Troubleshooting Tips

  1. Verify Matchbox endpoints:

    curl http://matchbox.example.com:8080
    # Should return: matchbox
    
    curl http://matchbox.example.com:8080/boot.ipxe
    # Should return iPXE script
    
  2. Test machine matching:

    curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10'
    # Should return rendered iPXE script with kernel/initrd
    
  3. Check TFTP files:

    ls -la /var/lib/tftpboot/
    # Should contain: undionly.kpxe, ipxe.efi, grub.efi
    
  4. Verify DHCP responses:

    tcpdump -i eth0 -n port 67 and port 68
    # Watch for DHCP offers with PXE options
    
  5. iPXE console debugging:

    • Press Ctrl+B during iPXE boot to enter console
    • Commands: dhcp, ifstat, show net0/ip, chain http://...

Limitations

  1. HTTPS support: iPXE must be compiled with crypto support (larger binary, ~80KB vs ~45KB)
  2. TFTP dependency: Legacy PXE requires TFTP for initial chainload (can’t skip)
  3. No DHCP/TFTP built-in: Must use external services or dnsmasq container
  4. Boot firmware variations: Some vendor PXE implementations have quirks
  5. SecureBoot: iPXE and GRUB must be signed (or SecureBoot disabled)

Reference Implementation: dnsmasq Container

Matchbox project provides quay.io/poseidon/dnsmasq with:

  • Pre-configured DHCP/TFTP/DNS service
  • Bundled ipxe.efi, undionly.kpxe, grub.efi
  • Example configs for PXE-DHCP and proxy-DHCP modes

Quick start (full DHCP):

docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries --log-dhcp

Quick start (proxy-DHCP):

docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.1,proxy,255.255.255.0 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
  --pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
  --log-queries --log-dhcp

Summary

Matchbox provides robust network boot support through:

  • Protocol flexibility: iPXE (primary), GRUB2, legacy PXE (via chainload)
  • Firmware compatibility: BIOS and UEFI
  • Modern approach: HTTP-based with optional local asset caching
  • Clean separation: Matchbox handles config rendering; external services handle DHCP/TFTP
  • Production-ready: Used by Typhoon Kubernetes distributions for bare-metal provisioning

1.5.4 - Use Case Evaluation

Evaluation of Matchbox for specific use cases and comparison with alternatives

Matchbox Use Case Evaluation

Analysis of Matchbox’s suitability for various use cases, strengths, limitations, and comparison with alternative provisioning solutions.

Use Case Fit Analysis

✅ Ideal Use Cases

1. Bare-Metal Kubernetes Clusters

Scenario: Provisioning 10-1000 physical servers for Kubernetes nodes

Why Matchbox Excels:

  • Ignition-native (perfect for Fedora CoreOS/Flatcar)
  • Declarative machine provisioning via Terraform
  • Label-based matching (region, role, hardware type)
  • Integration with Typhoon Kubernetes distribution
  • Minimal OS surface (immutable, container-optimized)

Example workflow:

resource "matchbox_profile" "k8s_controller" {
  name   = "k8s-controller"
  kernel = "/assets/fedora-coreos/.../kernel"
  raw_ignition = data.ct_config.controller.rendered
}

resource "matchbox_group" "controllers" {
  profile = matchbox_profile.k8s_controller.name
  selector = {
    role = "controller"
  }
}

Alternatives considered:

  • Cloud-init + netboot.xyz: Less declarative, no native Ignition support
  • Foreman: Heavier, more complex for container-centric workloads
  • Metal³: Kubernetes-native but requires existing cluster

Verdict: ⭐⭐⭐⭐⭐ Matchbox is purpose-built for this


2. Lab/Development Environments

Scenario: Rapid PXE boot testing with QEMU/KVM VMs or homelab servers

Why Matchbox Excels:

  • Quick setup (binary + dnsmasq container)
  • No DHCP infrastructure required (proxy-DHCP mode)
  • Localhost deployment (no external dependencies)
  • Fast iteration (change configs, re-PXE)
  • Included examples and scripts

Example setup:

# Start Matchbox locally
docker run -d --net=host -v /var/lib/matchbox:/var/lib/matchbox \
  quay.io/poseidon/matchbox:latest -address=0.0.0.0:8080

# Start dnsmasq on same host
docker run -d --net=host --cap-add=NET_ADMIN \
  quay.io/poseidon/dnsmasq ...

Alternatives considered:

  • netboot.xyz: Great for manual OS selection, no automation
  • PiXE server: Simpler but less flexible matching logic
  • Manual iPXE scripts: No dynamic matching, manual maintenance

Verdict: ⭐⭐⭐⭐⭐ Minimal setup, maximum flexibility


3. Edge/Remote Site Provisioning

Scenario: Provision machines at 10+ remote datacenters or edge locations

Why Matchbox Excels:

  • Lightweight (single binary, ~20MB)
  • Declarative region-based matching
  • Centralized config management (Terraform)
  • Can run on minimal hardware (ARM support)
  • HTTP-based (works over WAN with reverse proxy)

Architecture:

Central Matchbox (via Terraform)
  ↓ gRPC API
Regional Matchbox Instances (read-only cache)
  ↓ HTTP
Edge Machines (PXE boot)

Label-based routing:

{
  "selector": {
    "region": "us-west",
    "site": "pdx-1"
  },
  "metadata": {
    "ntp_servers": ["10.100.1.1", "10.100.1.2"]
  }
}

Alternatives considered:

  • Foreman: Requires more resources per site
  • Ansible + netboot: No declarative PXE boot, post-install only
  • Cloud-init datasources: Requires cloud metadata service per site

Verdict: ⭐⭐⭐⭐☆ Good fit, but consider caching strategy for WAN


⚠️ Moderate Fit Use Cases

4. Multi-Tenant Bare-Metal Cloud

Scenario: Provide bare-metal-as-a-service to multiple customers

Matchbox challenges:

  • No built-in multi-tenancy (single namespace)
  • No RBAC (gRPC API is all-or-nothing with client certs)
  • No customer self-service portal

Workarounds:

  • Deploy separate Matchbox per tenant (isolation via separate instances)
  • Proxy gRPC API with custom RBAC layer
  • Use group selectors with customer IDs

Better alternatives:

  • Metal³ (Kubernetes-native, better multi-tenancy)
  • OpenStack Ironic (purpose-built for bare-metal cloud)
  • MAAS (Ubuntu-specific, has RBAC)

Verdict: ⭐⭐☆☆☆ Possible but architecturally challenging


5. Heterogeneous OS Provisioning

Scenario: Need to provision Fedora CoreOS, Ubuntu, RHEL, Windows

Matchbox challenges:

  • Designed for Ignition-based OSes (FCOS, Flatcar, RHCOS)
  • No native support for Kickstart (RHEL/CentOS)
  • No support for Preseed (Ubuntu/Debian)
  • No Windows unattend.xml support

What works:

  • Fedora CoreOS ✅
  • Flatcar Linux ✅
  • RHEL CoreOS ✅
  • Container Linux (deprecated but supported) ✅

What requires workarounds:

  • RHEL/CentOS: Possible via generic configs + Kickstart URLs, but not native
  • Ubuntu: Can PXE boot and point to autoinstall ISO, but loses Matchbox templating benefits
  • Debian: Similar to Ubuntu
  • Windows: Not supported (different PXE boot mechanisms)

Better alternatives for heterogeneous environments:

  • Foreman (supports Kickstart, Preseed, unattend.xml)
  • MAAS (Ubuntu-centric but extensible)
  • Cobbler (older but supports many OS types)

Verdict: ⭐⭐☆☆☆ Stick to Ignition-based OSes or use different tool


❌ Poor Fit Use Cases

6. Windows PXE Boot

Why Matchbox doesn’t fit:

  • No WinPE support
  • No unattend.xml rendering
  • Different PXE boot chain (WDS/SCCM model)

Recommendation: Use Microsoft WDS or SCCM

Verdict: ⭐☆☆☆☆ Not designed for this


7. BIOS/Firmware Updates

Why Matchbox doesn’t fit:

  • Focused on OS provisioning, not firmware
  • No vendor-specific tooling (Dell iDRAC, HP iLO integration)

Recommendation: Use vendor tools or Ansible with ipmi/redfish modules

Verdict: ⭐☆☆☆☆ Out of scope


Strengths

1. Ignition-First Design

  • Native support for modern immutable OSes
  • Declarative, atomic provisioning (no config drift)
  • First-boot partition/filesystem setup

2. Label-Based Matching

  • Flexible machine classification (MAC, UUID, region, role, custom)
  • Most-specific-match algorithm (override defaults per machine)
  • Query params for dynamic attributes

3. Terraform Integration

  • Declarative infrastructure as code
  • Plan before apply (preview changes)
  • State tracking for auditability
  • Rich templating (ct_config provider for Butane)

4. Minimal Dependencies

  • Single static binary (~20MB)
  • No database required (FileStore default)
  • No built-in DHCP/TFTP (separation of concerns)
  • Container-ready (OCI image available)

5. HTTP-Centric

  • Faster downloads than TFTP (iPXE via HTTP)
  • Proxy/CDN friendly for asset distribution
  • Standard web tooling (curl, load balancers, Ingress)

6. Production-Ready

  • Used by Typhoon Kubernetes (battle-tested)
  • Clear upgrade path (SemVer releases)
  • OpenPGP signature support for config integrity

Limitations

1. No Multi-Tenancy

  • Single namespace (all groups/profiles global)
  • No RBAC on gRPC API (client cert = full access)
  • Requires separate instances per tenant

2. Ignition-Only Focus

  • Cloud-Config deprecated (legacy support only)
  • No native Kickstart/Preseed/unattend.xml
  • Limits OS choice to CoreOS family

3. Storage Constraints

  • FileStore doesn’t scale to 10,000+ profiles
  • No built-in HA storage (requires NFS or custom backend)
  • Kubernetes deployment needs RWX PersistentVolume

4. No Machine Discovery

  • Doesn’t detect new machines (passive service)
  • No inventory management (use external CMDB)
  • No hardware introspection (use Ironic for that)

5. Limited Observability

  • No built-in metrics (Prometheus integration requires reverse proxy)
  • Logs are minimal (request logging only)
  • No audit trail for gRPC API changes (use Terraform state)

6. TFTP Still Required

  • Legacy BIOS PXE needs TFTP for chainloading to iPXE
  • Can’t fully eliminate TFTP unless all machines have native iPXE

Comparison with Alternatives

vs. Foreman

Feature | Matchbox | Foreman
OS Support | Ignition-based | Kickstart, Preseed, AutoYaST, etc.
Complexity | Low (single binary) | High (Rails app, DB, Puppet/Ansible)
Config Model | Declarative (Ignition) | Imperative (post-install scripts)
API | HTTP + gRPC | REST API
UI | None (API-only) | Full web UI
Terraform | Native provider | Community modules
Use Case | Container-centric infra | Traditional Linux servers

When to choose Matchbox: CoreOS-based Kubernetes clusters, minimal infrastructure
When to choose Foreman: Heterogeneous OS, need web UI, traditional config mgmt


vs. Metal³

Feature | Matchbox | Metal³
Platform | Standalone | Kubernetes-native (operator)
Bootstrap | Can bootstrap k8s cluster | Needs existing k8s cluster
Machine Lifecycle | Provision only | Provision + decommission + reprovision
Hardware Introspection | No (labels passed manually) | Yes (via Ironic)
Multi-tenancy | No | Yes (via k8s namespaces)
Complexity | Low | High (requires Ironic, DHCP, etc.)

When to choose Matchbox: Greenfield bare-metal, no existing k8s
When to choose Metal³: Existing k8s, need hardware mgmt lifecycle


vs. Cobbler

Feature | Matchbox | Cobbler
Age | Modern (2016+) | Legacy (2008+)
Config Format | Ignition (declarative) | Kickstart/Preseed (imperative)
Templating | Go templates (minimal) | Cheetah templates (extensive)
Language | Go (static binary) | Python (requires interpreter)
DHCP Management | External | Can manage DHCP
Maintenance | Active (Poseidon) | Low activity

When to choose Matchbox: Modern immutable OSes, container workloads
When to choose Cobbler: Legacy infra, need DHCP management, heterogeneous OS


vs. MAAS (Ubuntu)

Feature | Matchbox | MAAS
OS Support | CoreOS family | Ubuntu (primary), others (limited)
IPAM | No (external DHCP) | Built-in IPAM
Power Mgmt | No (manual or scripts) | Built-in (IPMI, AMT, etc.)
UI | No | Full web UI
Declarative | Yes (Terraform) | Limited (CLI mostly)
Cloud Integration | No | Yes (libvirt, LXD, VM hosts)

When to choose Matchbox: Non-Ubuntu, Kubernetes, minimal dependencies
When to choose MAAS: Ubuntu-centric, need power mgmt, cloud integration


vs. netboot.xyz

Feature | Matchbox | netboot.xyz
Purpose | Automated provisioning | Manual OS selection menu
Automation | Full (API-driven) | None (interactive menu)
Customization | Per-machine configs | Global menu
Ignition | Native support | No
Complexity | Medium | Very low

When to choose Matchbox: Automated fleet provisioning
When to choose netboot.xyz: Ad-hoc OS installation, homelab


Decision Matrix

Use this table to evaluate Matchbox for your use case:

Requirement | Weight | Matchbox Score | Notes
Ignition/CoreOS support | High | ⭐⭐⭐⭐⭐ | Native, first-class
Heterogeneous OS | High | ⭐⭐☆☆☆ | Limited to Ignition OSes
Declarative provisioning | Medium | ⭐⭐⭐⭐⭐ | Terraform native
Multi-tenancy | Medium | ⭐☆☆☆☆ | Requires separate instances
Web UI | Medium | ☆☆☆☆☆ | No UI (API-only)
Ease of deployment | Medium | ⭐⭐⭐⭐☆ | Binary or container, minimal deps
Scalability | Medium | ⭐⭐⭐☆☆ | FileStore limits, need shared storage for HA
Hardware mgmt | Low | ☆☆☆☆☆ | No power mgmt, no introspection
Cost | Low | ⭐⭐⭐⭐⭐ | Open source, Apache 2.0

Scoring:

  • ⭐⭐⭐⭐⭐ Excellent
  • ⭐⭐⭐⭐☆ Good
  • ⭐⭐⭐☆☆ Adequate
  • ⭐⭐☆☆☆ Limited
  • ⭐☆☆☆☆ Poor
  • ☆☆☆☆☆ Not supported

Recommendations

Choose Matchbox if:

  1. ✅ Provisioning Fedora CoreOS, Flatcar, or RHEL CoreOS
  2. ✅ Building bare-metal Kubernetes clusters
  3. ✅ Prefer declarative infrastructure (Terraform)
  4. ✅ Want minimal dependencies (single binary)
  5. ✅ Need flexible label-based machine matching
  6. ✅ Have homogeneous OS requirements (all Ignition-based)

Avoid Matchbox if:

  1. ❌ Need multi-OS support (Windows, traditional Linux)
  2. ❌ Require web UI for operations teams
  3. ❌ Need built-in hardware management (power, BIOS config)
  4. ❌ Have strict multi-tenancy requirements
  5. ❌ Need automated hardware discovery/introspection

Hybrid Approaches

Pattern 1: Matchbox + Ansible

  • Matchbox: Initial OS provisioning
  • Ansible: Post-boot configuration, app deployment
  • Works well for stateful services on bare-metal

Pattern 2: Matchbox + Metal³

  • Matchbox: Bootstrap initial k8s cluster
  • Metal³: Ongoing cluster node lifecycle management
  • Gradual migration from Matchbox to Metal³

Pattern 3: Matchbox + Terraform + External Secrets

  • Matchbox: Base OS + minimal config
  • Ignition: Fetch secrets from Vault/GCP Secret Manager
  • Terraform: Orchestrate end-to-end provisioning

Conclusion

Matchbox is a purpose-built, minimalist network boot service optimized for modern immutable operating systems (Ignition-based). It excels in container-centric bare-metal environments, particularly for Kubernetes clusters built with Fedora CoreOS or Flatcar Linux.

Best fit: Organizations adopting immutable infrastructure patterns, container orchestration, and declarative provisioning workflows.

Not ideal for: Heterogeneous OS environments, multi-tenant bare-metal clouds, or teams requiring extensive web UI and built-in hardware management.

For home labs and development, Matchbox offers an excellent balance of simplicity and power. For production Kubernetes deployments, it’s a proven, battle-tested solution (via Typhoon). For complex enterprise provisioning with mixed OS requirements, consider Foreman or MAAS instead.

1.6 - Ubiquiti Dream Machine Pro Analysis

Comprehensive analysis of the Ubiquiti Dream Machine Pro capabilities, focusing on network boot (PXE) support and infrastructure integration.

Overview

The Ubiquiti Dream Machine Pro (UDM Pro) is an all-in-one network gateway, router, and switch designed for enterprise and advanced home lab environments. This analysis focuses on its capabilities relevant to infrastructure automation and network boot scenarios.

Key Specifications

Hardware

  • Processor: Quad-core ARM Cortex-A57 @ 1.7 GHz
  • RAM: 4GB DDR4
  • Storage: 128GB eMMC (for UniFi OS, applications, and logs)
  • Network Interfaces:
    • 1x WAN port (RJ45, SFP, or SFP+)
    • 8x LAN ports (1 Gbps RJ45, configurable)
    • 1x SFP+ port (10 Gbps)
    • 1x SFP port (1 Gbps)
  • Additional Features:
    • 3.5" SATA HDD bay (for UniFi Protect surveillance)
    • IDS/IPS engine
    • Deep packet inspection
    • Built-in UniFi Network Controller

Software

  • OS: UniFi OS (Linux-based)
  • Controller: Built-in UniFi Network Controller
  • Services: DHCP, DNS, routing, firewall, VPN (site-to-site and remote access)

Network Boot (PXE) Support

Native DHCP PXE Capabilities

The UDM Pro provides basic PXE boot support through its DHCP server:

Supported:

  • DHCP Option 66 (next-server / TFTP server address)
  • DHCP Option 67 (filename / boot file name)
  • Basic single-architecture PXE booting

Configuration via UniFi Controller:

  1. Navigate to SettingsNetworks → Select your network
  2. Scroll to DHCP section
  3. Enable DHCP
  4. Under Advanced DHCP Options:
    • TFTP Server: IP address of your TFTP/PXE server (e.g., 192.168.42.16)
    • Boot Filename: Name of the bootloader file (e.g., pxelinux.0 for BIOS or bootx64.efi for UEFI)

Limitations:

  • No multi-architecture support: Cannot differentiate boot files based on client architecture (BIOS vs. UEFI, x86_64 vs. ARM64)
  • No conditional DHCP options: Cannot vary filename or next-server based on client characteristics
  • Fixed boot parameters: One boot configuration for all PXE clients
  • Single bootloader only: Must choose either BIOS or UEFI bootloader, not both

Use Cases:

  • ✅ Homogeneous environments (all BIOS or all UEFI)
  • ✅ Single OS deployment scenarios
  • ✅ Simple provisioning workflows
  • ❌ Mixed BIOS/UEFI environments (requires external DHCP server with conditional logic)
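When conditional logic is required, the usual workaround is a small external DHCP/TFTP service on the provisioning VLAN (with DHCP disabled on the UDM Pro for that network). A minimal dnsmasq sketch; the config path, IP ranges, and TFTP server address (192.168.42.16) are illustrative:

```bash
# Sketch: external dnsmasq replacing UDM Pro DHCP on the provisioning VLAN,
# choosing a bootloader per client architecture (all values are illustrative)
sudo tee /etc/dnsmasq.d/pxe.conf >/dev/null <<'EOF'
port=0                                    # disable DNS; DHCP + TFTP only
interface=eth0
dhcp-range=192.168.42.100,192.168.42.200,12h
# Tag clients by PXE client architecture (DHCP option 93)
dhcp-match=set:bios,option:client-arch,0
dhcp-match=set:efi64,option:client-arch,7
dhcp-match=set:efi64,option:client-arch,9
# Hand out a different boot file per tag from the same TFTP server
dhcp-boot=tag:bios,pxelinux.0,,192.168.42.16
dhcp-boot=tag:efi64,bootx64.efi,,192.168.42.16
enable-tftp
tftp-root=/srv/tftp
EOF
sudo systemctl restart dnsmasq
```

The dhcp-match lines key off DHCP option 93 (client system architecture), which is exactly the differentiation the UDM Pro's DHCP server cannot express.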

Network Segmentation & VLANs

The UDM Pro excels at network segmentation, critical for infrastructure isolation:

  • VLAN Support: Native 802.1Q tagging
  • Firewall Rules: Inter-VLAN routing with granular firewall policies
  • Network Isolation: Can create fully isolated networks or controlled inter-network traffic
  • Use Cases for Infrastructure:
    • Management VLAN (for PXE/provisioning)
    • Production VLAN (workloads)
    • IoT/OT VLAN (isolated devices)
    • DMZ (exposed services)

VPN Capabilities

Site-to-Site VPN

  • Protocols: IPsec, WireGuard (experimental)
  • Use Case: Connect home lab to cloud infrastructure (GCP, AWS, Azure)
  • Performance: Hardware-accelerated encryption on UDM Pro

Remote Access VPN

  • Protocols: L2TP, OpenVPN
  • Use Case: Remote administration of home lab infrastructure
  • Integration: Can work with Cloudflare Access for additional security layer

IDS/IPS Engine

  • Technology: Suricata-based
  • Capabilities:
    • Intrusion detection
    • Intrusion prevention (can drop malicious traffic)
    • Threat signatures updated via UniFi
  • Performance Impact: Can affect throughput on high-bandwidth connections
  • Recommendation: Enable for security-sensitive infrastructure segments

DNS & DHCP Services

DNS

  • Local DNS: Can act as caching DNS resolver
  • Custom DNS Records: Limited to UniFi controller hostname
  • Recommendation: Use external DNS (Pi-hole, Bind9) for advanced features like split-horizon DNS

DHCP

  • Static Leases: Supports MAC-based static IP assignments
  • DHCP Options: Can configure common options (NTP, DNS, domain name)
  • Reservations: Per-client reservations via GUI
  • PXE Options: Basic Option 66/67 support (as noted above)

Integration with Infrastructure-as-Code

UniFi Network API

  • REST API: Available for configuration automation
  • Python Libraries: pyunifi and others for programmatic access
  • Use Cases:
    • Terraform provider for network state management
    • Ansible modules for configuration automation
    • CI/CD integration for network-as-code

Terraform Provider

  • Provider: paultyng/unifi
  • Capabilities: Manage networks, firewall rules, port forwarding, DHCP settings
  • Limitations: Not all UI features exposed via API

Configuration Persistence

  • Backup/Restore: JSON-based configuration export
  • Version Control: Can track config changes in Git
  • Recovery: Auto-backup to cloud (optional)

Performance Characteristics

Throughput

  • Routing/NAT: ~3.5 Gbps (without IDS/IPS)
  • IDS/IPS Enabled: ~850 Mbps - 1 Gbps
  • VPN (IPsec): ~1 Gbps
  • Inter-VLAN Routing: Wire speed (8 Gbps backplane)

Scalability

  • Concurrent Devices: 500+ clients tested
  • VLANs: Up to 32 networks/VLANs
  • Firewall Rules: Thousands (performance depends on complexity)
  • DHCP Leases: Supports large pools efficiently

Comparison to Alternatives

| Feature | UDM Pro | pfSense | OPNsense | MikroTik |
|---|---|---|---|---|
| Basic PXE | ✅ | ✅ | ✅ | ✅ |
| Conditional DHCP | ❌ | ✅ | ✅ | ✅ |
| All-in-one | ✅ | ❌ | ❌ | Varies |
| GUI Ease-of-use | ✅ | ✅ | ⚠️ | ⚠️ |
| API/Automation | ⚠️ | ✅ | ✅ | ✅ |
| IDS/IPS Built-in | ✅ | ⚠️ (addon) | ⚠️ (addon) | ❌ |
| Hardware | Fixed | Flexible | Flexible | Flexible |
| Price | $$$ | $ (+ hardware) | $ (+ hardware) | $ - $$$ |

Recommendations for Home Lab Use

Ideal Use Cases

Use the UDM Pro when:

  • You want an all-in-one solution with minimal configuration
  • You need integrated UniFi controller and network management
  • Your home lab has mixed UniFi hardware (switches, APs)
  • You want a polished GUI and mobile app management
  • Network segmentation and VLANs are critical

Consider Alternatives When

⚠️ Look elsewhere if:

  • You need conditional DHCP options or multi-architecture PXE boot
  • You require advanced routing protocols (BGP, OSPF beyond basics)
  • You need granular firewall control and scripting (pfSense/OPNsense better)
  • Budget is tight and you already have x86 hardware (pfSense on old PC)
  • You need extremely low latency (sub-1ms) routing
Recommended Configuration

  1. Network Segmentation:

    • VLAN 10: Management (PXE, Ansible, provisioning tools)
    • VLAN 20: Kubernetes cluster
    • VLAN 30: Storage network (NFS, iSCSI)
    • VLAN 40: Public-facing services (behind Cloudflare)
  2. DHCP Strategy:

    • Use UDM Pro native DHCP with basic PXE options for single-arch PXE needs
    • Static reservations for infrastructure components
    • Consider external DHCP server if conditional options are required
  3. Firewall Rules:

    • Default deny between VLANs
    • Allow management VLAN → all (with source IP restrictions)
    • Allow cluster VLAN → storage VLAN (on specific ports)
    • NAT only on VLAN 40 (public services)
  4. VPN Configuration:

    • Site-to-Site to GCP via WireGuard (lower overhead than IPsec)
    • Remote access VPN on separate VLAN with restrictive firewall
  5. Integration:

    • Terraform for network state management
    • Ansible for DHCP/DNS servers in management VLAN
    • Cloudflare Access for secure public service exposure

Conclusion

The UDM Pro is a capable all-in-one network device ideal for home labs that prioritize ease-of-use and integration with the UniFi ecosystem. It provides basic PXE boot support suitable for single-architecture environments, though conditional DHCP options require external DHCP servers for complex scenarios.

For infrastructure automation projects, the UDM Pro serves well as a reliable network foundation that handles VLANs, routing, and basic services, allowing you to focus on higher-level infrastructure concerns like container orchestration and cloud integration.

1.6.1 - UDM Pro VLAN Configuration & Capabilities

Detailed analysis of VLAN support on the Ubiquiti Dream Machine Pro, including port-based VLAN assignment and VPN integration.

Overview

The Ubiquiti Dream Machine Pro (UDM Pro) provides robust VLAN support through native 802.1Q tagging, enabling network segmentation for security, performance, and organizational purposes. This document covers VLAN configuration capabilities, port assignments, and VPN integration.

VLAN Fundamentals on UDM Pro

Supported Standards

  • 802.1Q VLAN Tagging: Full support for standard VLAN tagging
  • VLAN Range: IDs 1-4094 (standard IEEE 802.1Q range)
  • Maximum VLANs: Up to 32 networks/VLANs per device
  • Native VLAN: Configurable per port (default: VLAN 1)

VLAN Types

Corporate Network

  • Default network type for general-purpose VLANs
  • Provides DHCP, inter-VLAN routing, and firewall capabilities
  • Can enable/disable guest policies, IGMP snooping, and multicast DNS

Guest Network

  • Isolated network with internet-only access
  • Automatic firewall rules preventing access to other VLANs
  • Captive portal support for guest authentication

IoT Network

  • Optimized for IoT devices with device isolation
  • Prevents lateral movement between IoT devices
  • Allows communication with controller/gateway only

Port-Based VLAN Assignment

Per-Port VLAN Configuration

The UDM Pro’s 8x 1 Gbps LAN ports and SFP/SFP+ ports support flexible VLAN assignment:

Configuration Options per Port:

  1. Native VLAN/Untagged VLAN: The default VLAN for untagged traffic on the port
  2. Tagged VLANs: Multiple VLANs that can pass through the port with 802.1Q tags
  3. Port Profile: Pre-configured VLAN assignments that can be applied to ports

Port Profile Types

All: Port accepts all VLANs (trunk mode)

  • Passes all configured VLANs with tags
  • Used for connecting managed switches or access points
  • Native VLAN for untagged traffic

Specific VLANs: Port limited to selected VLANs

  • Choose which VLANs are allowed (tagged)
  • Set native/untagged VLAN
  • Used for controlled trunk links

Single VLAN: Access port mode

  • Port carries only one VLAN (untagged)
  • All traffic on this port belongs to specified VLAN
  • Used for end devices (PCs, servers, printers)

Configuration Steps

Via UniFi Controller GUI:

  1. Create Port Profile:

    • Navigate to SettingsProfilesPort Manager
    • Click Create New Port Profile
    • Select profile type (All, LAN, or Custom)
    • Configure VLAN settings:
      • Native VLAN/Network: Untagged VLAN
      • Tagged VLANs: Select allowed VLANs (for trunk mode)
    • Enable/disable settings: PoE, Storm Control, Port Isolation
  2. Assign Profile to Ports:

    • Navigate to UniFi Devices → Select UDM Pro
    • Go to Ports tab
    • For each LAN port (1-8) or SFP port:
      • Click port to edit
      • Select Port Profile from dropdown
      • Apply changes
  3. Quick Port Assignment (Alternative):

    • SettingsNetworks → Select VLAN
    • Under Port Manager, assign specific ports to this network
    • Ports become access ports for this VLAN

Example Port Layout

UDM Pro Port Assignment Example:

Port 1: Native VLAN 10 (Management) - Access Mode
        └── Use: Ansible control server

Port 2: Native VLAN 20 (Kubernetes) - Access Mode
        └── Use: K8s master node

Port 3: Native VLAN 30 (Storage) - Access Mode
        └── Use: NAS/SAN device

Port 4: Native VLAN 1, Tagged: 10,20,30,40 - Trunk Mode
        └── Use: Managed switch uplink

Port 5-7: Native VLAN 40 (DMZ) - Access Mode
          └── Use: Public-facing servers

Port 8: Native VLAN 1 (Default/Untagged) - Access Mode
        └── Use: Management laptop (temporary)

SFP+: Native VLAN 1, Tagged: All - Trunk Mode
      └── Use: 10G uplink to core switch

VLAN Features and Capabilities

Inter-VLAN Routing

Enabled by Default:

  • Hardware-accelerated routing between VLANs
  • Wire-speed performance (8 Gbps backplane)
  • Routing decisions made at Layer 3

Firewall Control:

  • Default behavior: Allow all inter-VLAN traffic
  • Recommended: Create explicit allow/deny rules per VLAN pair
  • Granular control: Protocol, port, source/destination filtering

Example Firewall Rules:

Rule 1: Allow Management (VLAN 10) → All VLANs
        Source: 192.168.10.0/24
        Destination: Any
        Action: Accept

Rule 2: Allow K8s (VLAN 20) → Storage (VLAN 30) - NFS only
        Source: 192.168.20.0/24
        Destination: 192.168.30.0/24
        Ports: 2049 (NFS), 111 (Portmapper)
        Action: Accept

Rule 3: Block IoT (VLAN 50) → All Private Networks
        Source: 192.168.50.0/24
        Destination: 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12
        Action: Drop

Rule 4 (Implicit): Default Deny Between VLANs
        Source: Any
        Destination: Any
        Action: Drop

DHCP per VLAN

Each VLAN can have its own DHCP server:

  • Independent IP ranges per VLAN
  • Separate DHCP options (DNS, gateway, NTP, domain)
  • Static DHCP reservations per VLAN
  • PXE boot options (Option 66/67) per network

Configuration:

  • SettingsNetworks → Select VLAN
  • DHCP section:
    • Enable DHCP server
    • Define IP range (e.g., 192.168.10.100-192.168.10.254)
    • Set lease time
    • Configure gateway (usually UDM Pro’s IP on this VLAN)
    • Add custom DHCP options

Example DHCP Configuration:

VLAN 10 (Management):
  Subnet: 192.168.10.0/24
  Gateway: 192.168.10.1 (UDM Pro)
  DHCP Range: 192.168.10.100-192.168.10.200
  DNS: 192.168.10.10 (local DNS server)
  TFTP Server (Option 66): 192.168.10.16
  Boot Filename (Option 67): pxelinux.0

VLAN 20 (Kubernetes):
  Subnet: 192.168.20.0/24
  Gateway: 192.168.20.1 (UDM Pro)
  DHCP Range: 192.168.20.50-192.168.20.99
  DNS: 8.8.8.8, 8.8.4.4
  Domain Name: k8s.lab.local

VLAN Isolation

Guest Portal Isolation:

  • Guest networks auto-configured with isolation rules
  • Prevents access to RFC1918 private networks
  • Internet-only access by default

Manual Isolation (Firewall Rules):

  • Create LAN In rules to block inter-VLAN traffic
  • Use groups for easier management of multiple VLANs
  • Apply port isolation for additional security

Device Isolation (IoT Networks):

  • Prevents devices on same VLAN from communicating
  • Only controller/gateway access allowed
  • Use for untrusted IoT devices (cameras, smart home)

VPN and VLAN Integration

Site-to-Site VPN VLAN Assignment

✅ VLANs CAN be assigned to site-to-site VPN connections:

WireGuard VPN:

  • Configure remote subnet to map to specific local VLAN
  • Example: GCP subnet 10.128.0.0/20 → routed through VLAN 10
  • Routing table automatically updated
  • Firewall rules apply to VPN traffic

IPsec Site-to-Site:

  • Specify local networks (can select specific VLANs)
  • Remote networks configured in tunnel settings
  • Multiple VLANs can traverse single VPN tunnel
  • Perfect Forward Secrecy supported

Configuration Steps:

  1. SettingsVPNSite-to-Site VPN
  2. Create New VPN tunnel (WireGuard or IPsec)
  3. Under Local Networks, select VLANs to include:
    • Option 1: Select “All” networks
    • Option 2: Choose specific VLANs (e.g., VLAN 10, 20 only)
  4. Configure Remote Networks (cloud provider subnets)
  5. Set encryption parameters and pre-shared keys
  6. Create Firewall Rules for VPN traffic:
    • Allow specific VLAN → VPN tunnel
    • Control which VLANs can reach remote networks

Example Site-to-Site Config:

Home Lab → GCP WireGuard VPN

Local Networks:
  - VLAN 10 (Management): 192.168.10.0/24
  - VLAN 20 (Kubernetes): 192.168.20.0/24

Remote Networks:
  - GCP VPC: 10.128.0.0/20

Firewall Rules:
  - Allow VLAN 10 → GCP VPC (all protocols)
  - Allow VLAN 20 → GCP VPC (HTTPS, kubectl API only)
  - Block all other VLANs from VPN tunnel

Remote Access VPN VLAN Assignment

✅ VLANs CAN be assigned to remote access VPN clients:

L2TP/IPsec Remote Access:

  • VPN clients land on a specific VLAN
  • Default: All clients in same VPN subnet
  • Firewall rules control VLAN access from VPN

OpenVPN Remote Access (via UniFi Network Application addon):

  • Not natively built into UDM Pro
  • Requires UniFi Network Application 6.0+
  • Can route VPN clients to specific VLAN

Teleport VPN (UniFi’s solution):

  • Built-in remote access VPN
  • Clients route through UDM Pro
  • Can access specific VLANs based on firewall rules
  • Layer 3 routing to VLANs

Configuration:

  1. SettingsVPNRemote Access
  2. Enable L2TP or configure Teleport
  3. Set VPN Network (e.g., 192.168.100.0/24)
  4. Advanced:
    • Enable access to specific VLANs
    • By default, VPN network is treated as separate VLAN
  5. Firewall Rules to allow VPN → VLANs:
    • Source: VPN network (192.168.100.0/24)
    • Destination: VLAN 10, VLAN 20 (or specific resources)
    • Action: Accept

Example Remote Access Config:

Remote VPN Users → Home Lab Access

VPN Network: 192.168.100.0/24
VPN Gateway: 192.168.100.1 (UDM Pro)

Firewall Rules:
  Rule 1: Allow VPN → Management VLAN (admin users)
          Source: 192.168.100.0/24
          Dest: 192.168.10.0/24
          Ports: SSH (22), HTTPS (443)
  
  Rule 2: Allow VPN → Kubernetes VLAN (developers)
          Source: 192.168.100.0/24
          Dest: 192.168.20.0/24
          Ports: kubectl (6443), app ports (8080-8090)
  
  Rule 3: Block VPN → Storage VLAN (security)
          Source: 192.168.100.0/24
          Dest: 192.168.30.0/24
          Action: Drop

VPN VLAN Routing Limitations

Current Limitations:

  • Cannot assign individual VPN clients to different VLANs dynamically
  • No VLAN assignment based on user identity (all clients in same VPN network)
  • RADIUS integration does not support per-user VLAN assignment for VPN
  • For per-user VLAN control, use firewall rules based on source IP

Workarounds:

  • Use firewall rules with VPN client IP ranges for granular access
  • Deploy separate VPN tunnels for different access levels
  • Use RADIUS for authentication + firewall rules for authorization

VLAN Best Practices for Home Lab

Network Segmentation Strategy

Recommended VLAN Layout:

VLAN 1:   Default/Management (UDM Pro access)
VLAN 10:  Infrastructure Management (Ansible, PXE, monitoring)
VLAN 20:  Kubernetes Cluster (control plane + workers)
VLAN 30:  Storage Network (NFS, iSCSI, object storage)
VLAN 40:  DMZ/Public Services (exposed to internet via Cloudflare)
VLAN 50:  IoT Devices (isolated smart home devices)
VLAN 60:  Guest Network (visitor WiFi, untrusted devices)
VLAN 100: VPN Remote Access (remote admin/dev access)

Firewall Policy Design

Default Deny Approach:

  1. Create explicit allow rules for necessary traffic
  2. Set implicit deny for all inter-VLAN traffic
  3. Log dropped packets for troubleshooting

Rule Order (top to bottom):

  1. Management VLAN → All (with source IP restrictions)
  2. Kubernetes → Storage (specific ports)
  3. DMZ → Internet (outbound only)
  4. VPN → Specific VLANs (based on role)
  5. All → Internet (NAT)
  6. Block RFC1918 from DMZ
  7. Drop all (implicit)

Performance Optimization

VLAN Routing Performance:

  • Inter-VLAN routing is hardware-accelerated
  • No performance penalty for multiple VLANs
  • Use VLAN tagging on trunk ports to reduce switch load

Multicast and Broadcast Control:

  • Enable IGMP snooping per VLAN for multicast efficiency
  • Disable multicast DNS (mDNS) between VLANs if not needed
  • Use multicast routing for cross-VLAN multicast (advanced)

Advanced VLAN Features

VLAN-Specific Services

DNS per VLAN:

  • Configure different DNS servers per VLAN via DHCP
  • Example: Management VLAN uses local DNS, DMZ uses public DNS

NTP per VLAN:

  • DHCP Option 42 for NTP server
  • Different time sources per network segment

Domain Name per VLAN:

  • DHCP Option 15 for domain name
  • Useful for split-horizon DNS setups

VLAN Tagging on WiFi

UniFi WiFi Integration:

  • Each WiFi SSID can map to a specific VLAN
  • Multiple SSIDs on same AP → different VLANs
  • Seamless VLAN tagging for wireless clients

Configuration:

  • Create WiFi network in UniFi Controller
  • Assign VLAN ID to SSID
  • Client traffic automatically tagged

VLAN Monitoring and Troubleshooting

Traffic Statistics:

  • Per-VLAN bandwidth usage visible in UniFi Controller
  • Deep Packet Inspection (DPI) provides application-level stats
  • Export data for analysis in external tools

Debugging Tools:

  • Port mirroring for packet capture
  • Flow logs for traffic analysis
  • Firewall logs show inter-VLAN blocks
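For quick checks from a Linux host attached to a trunk or access port, a packet capture confirms whether tagging and DHCP behave as expected; a small sketch (interface name and VLAN ID are illustrative):

```bash
# Show 802.1Q-tagged frames for VLAN 20 arriving on a trunk port
sudo tcpdump -i eth0 -e -nn vlan 20
# Watch DHCP traffic on the port's native VLAN
sudo tcpdump -i eth0 -nn udp port 67 or udp port 68
```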

Common Issues:

  1. VLAN not working: Check port profile assignment and native VLAN config
  2. No inter-VLAN routing: Verify firewall rules aren’t blocking traffic
  3. DHCP not working on VLAN: Ensure DHCP server enabled on that network
  4. VPN can’t reach VLAN: Check VPN local networks include the VLAN

Summary

VLAN Port Assignment: ✅ YES

The UDM Pro fully supports port-based VLAN assignment:

  • Individual ports can be assigned to specific VLANs (access mode)
  • Ports can carry multiple tagged VLANs (trunk mode)
  • Native/untagged VLAN configurable per port
  • Port profiles simplify configuration across multiple devices

VPN VLAN Assignment: ✅ YES

VLANs can be assigned to VPN connections:

  • Site-to-Site VPN: Select which VLANs traverse the tunnel
  • Remote Access VPN: VPN clients route to specific VLANs via firewall rules
  • Routing Control: Full control over which VLANs are accessible via VPN
  • Limitations: No per-user VLAN assignment; use firewall rules for granular access

Key Capabilities

  • Up to 32 VLANs supported
  • Hardware-accelerated inter-VLAN routing
  • Per-VLAN DHCP, DNS, and firewall policies
  • Full integration with UniFi WiFi for SSID-to-VLAN mapping
  • Flexible port profiles for easy configuration
  • VPN integration for both site-to-site and remote access scenarios

2 - Architecture Decision Records

Documentation of architectural decisions made using MADR 4.0.0 standard

Architecture Decision Records (ADRs)

This section contains architectural decision records that document the key design choices made. Each ADR follows the MADR 4.0.0 format and includes:

  • Context and problem statement
  • Decision drivers and constraints
  • Considered options with pros and cons
  • Decision outcome and rationale
  • Consequences (positive and negative)
  • Confirmation methods

ADR Categories

ADRs are classified into three categories:

  • Strategic - High-level architectural decisions affecting the entire system (frameworks, authentication strategies, cross-cutting patterns). Use for foundational technology choices.
  • User Journey - Decisions solving specific user journey problems. More tactical than strategic, but still architectural. Use when evaluating approaches to implement user-facing features.
  • API Design - API endpoint implementation decisions (pagination, filtering, bulk operations). Use for significant API design trade-offs that warrant documentation.

Status Values

Each ADR has a status that reflects its current state:

  • proposed - Decision is under consideration
  • accepted - Decision has been approved and should be implemented
  • rejected - Decision was considered but not approved
  • deprecated - Decision is no longer relevant or has been superseded
  • superseded by ADR-XXXX - Decision has been replaced by a newer ADR

These records provide historical context for architectural decisions and help ensure consistency across the platform.

2.1 - [0001] Use MADR for Architecture Decision Records

Adopt Markdown Architectural Decision Records (MADR) as the standard format for documenting architectural decisions in the project.

Context and Problem Statement

As the project grows, architectural decisions are made that have long-term impacts on the system’s design, maintainability, and scalability. Without a structured way to document these decisions, we risk losing the context and rationale behind important choices, making it difficult for current and future team members to understand why certain approaches were taken.

How should we document architectural decisions in a way that is accessible, maintainable, and provides sufficient context for future reference?

Decision Drivers

  • Need for clear documentation of architectural decisions and their rationale
  • Easy accessibility and searchability of past decisions
  • Low barrier to entry for creating and maintaining decision records
  • Integration with existing documentation workflow
  • Version control friendly format
  • Industry-standard approach that team members may already be familiar with

Considered Options

  • MADR (Markdown Architectural Decision Records)
  • ADR using custom format
  • Wiki-based documentation
  • No formal ADR process

Decision Outcome

Chosen option: “MADR (Markdown Architectural Decision Records)”, because it provides a well-established, standardized format that is lightweight, version-controlled, and integrates seamlessly with our existing documentation structure. MADR 4.0.0 offers a clear template that captures all necessary information while remaining flexible enough for different types of decisions.

Consequences

  • Good, because MADR is a widely adopted standard with clear documentation and examples
  • Good, because markdown files are easy to create, edit, and review through pull requests
  • Good, because ADRs will be version-controlled alongside code, maintaining historical context
  • Good, because the format is flexible enough to accommodate strategic, user-journey, and API design decisions
  • Good, because team members can easily search and reference past decisions
  • Neutral, because requires discipline to maintain and update ADR status as decisions evolve
  • Bad, because team members need to learn and follow the MADR format conventions

Confirmation

Compliance will be confirmed through:

  • Code reviews ensuring new architectural decisions are documented as ADRs
  • ADRs are stored in docs/content/r&d/adrs/ following the naming convention NNNN-title-with-dashes.md
  • Regular reviews during architecture discussions to reference and update existing ADRs

Pros and Cons of the Options

MADR (Markdown Architectural Decision Records)

MADR 4.0.0 is a standardized format for documenting architectural decisions using markdown.

  • Good, because it’s a well-established standard with extensive documentation
  • Good, because markdown is simple, portable, and version-control friendly
  • Good, because it provides a clear structure while remaining flexible
  • Good, because it integrates with static site generators and documentation tools
  • Good, because it’s lightweight and doesn’t require special tools
  • Neutral, because it requires some initial learning of the format
  • Neutral, because maintaining consistency requires discipline

ADR using custom format

Create our own custom format for architectural decision records.

  • Good, because we can tailor it exactly to our needs
  • Bad, because it requires defining and maintaining our own standard
  • Bad, because new team members won’t be familiar with the format
  • Bad, because we lose the benefits of community knowledge and tooling
  • Bad, because it may evolve inconsistently over time

Wiki-based documentation

Use a wiki system (like Confluence, Notion, or GitHub Wiki) to document decisions.

  • Good, because wikis provide easy editing and hyperlinking
  • Good, because some team members may be familiar with wiki tools
  • Neutral, because it may or may not integrate with version control
  • Bad, because content may not be version-controlled alongside code
  • Bad, because it creates a separate system to maintain
  • Bad, because it’s harder to review changes through standard PR process
  • Bad, because portability and long-term accessibility may be concerns

No formal ADR process

Continue without a structured approach to documenting architectural decisions.

  • Good, because it requires no additional overhead
  • Bad, because context and rationale for decisions are lost over time
  • Bad, because new team members struggle to understand why decisions were made
  • Bad, because it leads to repeated discussions of previously settled questions
  • Bad, because it makes it difficult to track when decisions should be revisited

More Information

  • MADR 4.0.0 specification: https://adr.github.io/madr/
  • ADRs will be categorized as: strategic, user-journey, or api-design
  • ADR status values: proposed | accepted | rejected | deprecated | superseded by ADR-XXXX
  • All ADRs are stored in docs/content/r&d/adrs/ directory

2.2 - [0002] Network Boot Architecture for Home Lab

Evaluate options for network booting servers in a home lab environment, considering local vs cloud-hosted boot servers.

Context and Problem Statement

When setting up a home lab infrastructure, servers need to be provisioned and booted over the network using PXE (Preboot Execution Environment). This requires a TFTP/HTTP server to serve boot files to requesting machines. The question is: where should this boot server be hosted to balance security, reliability, cost, and operational complexity?

Decision Drivers

  • Security: Minimize attack surface and ensure only authorized servers receive boot files
  • Reliability: Boot process should be resilient and not dependent on external network connectivity
  • Cost: Minimize ongoing infrastructure costs
  • Complexity: Keep the operational burden manageable
  • Trust Model: Clear verification of requesting server identity

Considered Options

  • Option 1: TFTP/HTTP server locally on home lab network
  • Option 2: TFTP/HTTP server on public cloud (without VPN)
  • Option 3: TFTP/HTTP server on public cloud (with VPN)

Decision Outcome

Chosen option: “Option 3: TFTP/HTTP server on public cloud (with VPN)”, because:

  1. No local machine management: Unlike Option 1, this avoids the need to maintain dedicated local hardware for the boot server, reducing operational overhead
  2. Secure protocol support: The VPN tunnel encrypts all traffic, allowing unsecured protocols like TFTP to be used without risk of data exposure over public internet routes (unlike Option 2)
  3. Cost-effective VPN: The UDM Pro natively supports WireGuard, enabling a self-managed VPN solution that avoids expensive managed VPN services (~$180-300/year vs ~$540-900/year)

Consequences

  • Good, because all traffic is encrypted through WireGuard VPN tunnel
  • Good, because boot server is not exposed to public internet (no public attack surface)
  • Good, because trust model is simple - subnet validation similar to local option
  • Good, because centralized cloud management reduces local maintenance burden
  • Good, because boot server remains available even if home lab storage fails
  • Good, because UDM Pro’s native WireGuard support keeps costs at ~$180-300/year
  • Bad, because boot process depends on both internet connectivity and VPN availability
  • Bad, because VPN adds latency to boot file transfers
  • Bad, because VPN gateway becomes an additional failure point
  • Bad, because higher ongoing cost compared to local-only option (~$180-300/year vs ~$10/year)

Confirmation

The implementation will be confirmed by:

  • Successfully network booting a test server using the chosen architecture
  • Validating the trust model prevents unauthorized boot requests
  • Measuring actual costs against estimates

Pros and Cons of the Options

Option 1: TFTP/HTTP server locally on home lab network

Run the boot server on local infrastructure (e.g., Raspberry Pi, dedicated VM, or container) within the home lab network.

Boot Flow Sequence

sequenceDiagram
    participant Server as Home Lab Server
    participant DHCP as Local DHCP Server
    participant Boot as Local TFTP/HTTP Server

    Server->>DHCP: PXE Boot Request (DHCP Discover)
    DHCP->>Server: DHCP Offer with Boot Server IP
    Server->>Boot: TFTP Request for Boot File
    Boot->>Boot: Verify MAC/IP against allowlist
    Boot->>Server: Send iPXE/Boot Loader
    Server->>Boot: HTTP Request for Kernel/Initrd
    Boot->>Server: Send Boot Files
    Server->>Server: Boot into OS

Trust Model

  • MAC Address Allowlist: Maintain a list of known server MAC addresses (see the dnsmasq sketch after this list)
  • Network Isolation: Boot server only accessible from home lab VLAN
  • No external exposure: Traffic never leaves local network
  • Physical security: Relies on physical access control to home lab
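If the local boot server runs dnsmasq, the MAC allowlist can live directly in its configuration; a sketch with illustrative MAC addresses and IPs:

```bash
# Sketch: dnsmasq MAC allowlist on the local boot server (values are illustrative)
sudo tee /etc/dnsmasq.d/allowlist.conf >/dev/null <<'EOF'
# Hosts matched by dhcp-host receive dnsmasq's built-in "known" tag
dhcp-host=aa:bb:cc:dd:ee:01,192.168.10.21
dhcp-host=aa:bb:cc:dd:ee:02,192.168.10.22
# Ignore DHCP requests from anything not on the allowlist
dhcp-ignore=tag:!known
dhcp-boot=pxelinux.0
EOF
sudo systemctl restart dnsmasq
```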

Cost Estimate

  • Hardware: ~$50-100 one-time (Raspberry Pi or repurposed hardware)
  • Power: ~$5-10/year (low power consumption)
  • Total: ~$55-110 initial + ~$10/year ongoing

Pros and Cons

  • Good, because no dependency on internet connectivity for booting
  • Good, because lowest latency for boot file transfers
  • Good, because all data stays within local network (maximum privacy)
  • Good, because lowest ongoing cost
  • Good, because simple trust model based on network isolation
  • Neutral, because requires dedicated local hardware or resources
  • Bad, because single point of failure if boot server goes down
  • Bad, because requires local maintenance and updates

Option 2: TFTP/HTTP server on public cloud (without VPN)

Host the boot server on a cloud provider (AWS, GCP, Azure) and expose it directly to the internet.

Boot Flow Sequence

sequenceDiagram
    participant Server as Home Lab Server
    participant DHCP as Local DHCP Server
    participant Router as Home Router/NAT
    participant Internet as Internet
    participant Boot as Cloud TFTP/HTTP Server

    Server->>DHCP: PXE Boot Request (DHCP Discover)
    DHCP->>Server: DHCP Offer with Cloud Boot Server IP
    Server->>Router: TFTP Request
    Router->>Internet: NAT Translation
    Internet->>Boot: TFTP Request from Home IP
    Boot->>Boot: Verify source IP + token/certificate
    Boot->>Internet: Send iPXE/Boot Loader
    Internet->>Router: Response
    Router->>Server: Boot Loader
    Server->>Router: HTTP Request for Kernel/Initrd
    Router->>Internet: NAT Translation
    Internet->>Boot: HTTP Request with auth headers
    Boot->>Boot: Validate request authenticity
    Boot->>Internet: Send Boot Files
    Internet->>Router: Response
    Router->>Server: Boot Files
    Server->>Server: Boot into OS

Trust Model

  • Source IP Validation: Restrict to home lab’s public IP (dynamic IP is problematic)
  • Certificate/Token Authentication: Embed certificates in initial bootloader
  • TLS for HTTP: All HTTP traffic encrypted
  • Challenge-Response: Boot server can challenge requesting server
  • Risk: TFTP typically unencrypted, vulnerable to interception

Cost Estimate

  • Cloud VM (t3.micro or equivalent): ~$10-15/month
  • Data Transfer: ~$1-5/month (boot files are typically small)
  • Static IP: ~$3-5/month
  • Total: ~$170-300/year

Pros and Cons

  • Good, because boot server remains available even if home lab has issues
  • Good, because centralized management in cloud console
  • Good, because easy to scale or replicate
  • Neutral, because requires internet connectivity for every boot
  • Bad, because significantly higher ongoing cost
  • Bad, because TFTP protocol is inherently insecure over public internet
  • Bad, because complex trust model required (IP validation, certificates)
  • Bad, because boot process depends on internet availability
  • Bad, because higher latency for boot file transfers
  • Bad, because public exposure increases attack surface

Option 3: TFTP/HTTP server on public cloud (with VPN)

Host the boot server in the cloud but connect the home lab to the cloud via a site-to-site VPN tunnel.

Boot Flow Sequence

sequenceDiagram
    participant Server as Home Lab Server
    participant DHCP as Local DHCP Server
    participant VPN as VPN Gateway (Home)
    participant CloudVPN as VPN Gateway (Cloud)
    participant Boot as Cloud TFTP/HTTP Server

    Note over VPN,CloudVPN: Site-to-Site VPN Tunnel Established

    Server->>DHCP: PXE Boot Request (DHCP Discover)
    DHCP->>Server: DHCP Offer with Boot Server Private IP
    Server->>VPN: TFTP Request to Private IP
    VPN->>CloudVPN: Encrypted VPN Tunnel
    CloudVPN->>Boot: TFTP Request (appears local)
    Boot->>Boot: Verify source IP from home lab subnet
    Boot->>CloudVPN: Send iPXE/Boot Loader
    CloudVPN->>VPN: Encrypted Response
    VPN->>Server: Boot Loader
    Server->>VPN: HTTP Request for Kernel/Initrd
    VPN->>CloudVPN: Encrypted VPN Tunnel
    CloudVPN->>Boot: HTTP Request
    Boot->>Boot: Validate subnet membership
    Boot->>CloudVPN: Send Boot Files
    CloudVPN->>VPN: Encrypted Response
    VPN->>Server: Boot Files
    Server->>Server: Boot into OS

Trust Model

  • VPN Tunnel Encryption: All traffic encrypted end-to-end
  • Private IP Addressing: Boot server only accessible via VPN
  • Subnet Validation: Verify requests come from trusted home lab subnet (see the nginx sketch after this list)
  • VPN Authentication: Strong auth at tunnel level (certificates, pre-shared keys)
  • No public exposure: Boot server has no public IP
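The subnet check can be enforced where the boot files are served; a minimal nginx sketch for the HTTP side (paths and subnets are illustrative):

```bash
# Sketch: restrict boot file downloads to home lab subnets arriving via the VPN
sudo tee /etc/nginx/conf.d/boot.conf >/dev/null <<'EOF'
server {
    listen 80;
    root /var/www/html/boot;
    allow 192.168.10.0/24;   # management VLAN
    allow 192.168.20.0/24;   # kubernetes VLAN
    deny  all;               # everything else is rejected
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```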

Cost Estimate

  • Cloud VM (t3.micro or equivalent): ~$10-15/month
  • Data Transfer (VPN): ~$5-10/month
  • VPN Gateway Service (if using managed): ~$30-50/month OR
  • Self-managed VPN (WireGuard/OpenVPN): ~$0 additional
  • Total (self-managed VPN): ~$180-300/year
  • Total (managed VPN): ~$540-900/year

Pros and Cons

  • Good, because all traffic encrypted through VPN tunnel
  • Good, because boot server not exposed to public internet
  • Good, because trust model similar to local option (subnet validation)
  • Good, because centralized cloud management benefits
  • Good, because boot server available if home lab storage fails
  • Neutral, because moderate complexity (VPN setup and maintenance)
  • Bad, because higher cost than local option
  • Bad, because boot process still depends on internet + VPN availability
  • Bad, because VPN adds latency to boot process
  • Bad, because VPN gateway becomes additional failure point
  • Bad, because most expensive option if using managed VPN service

More Information

Key Questions for Decision

  1. How critical is boot availability during internet outages?
  2. Is the home lab public IP static or dynamic?
  3. What is the acceptable boot time latency?
  4. How many servers need to be supported?
  5. Is there existing VPN infrastructure?
  • Issue #595 - story(docs): create adr for network boot architecture

2.3 - [0003] Cloud Provider Selection for Network Boot Infrastructure

Evaluate Google Cloud Platform vs Amazon Web Services for hosting network boot server infrastructure as required by ADR-0002.

Context and Problem Statement

ADR-0002 established that network boot infrastructure will be hosted on a cloud provider and accessed via VPN (specifically WireGuard from the UDM Pro). The decision to use cloud hosting provides resilience against local hardware failures while maintaining security through encrypted VPN tunnels.

The question now is: Which cloud provider should host the network boot infrastructure?

This decision will affect:

  • Cost: Ongoing monthly/annual infrastructure costs
  • Protocol Support: Ability to serve TFTP, HTTP, and HTTPS boot files
  • VPN Integration: Ease of WireGuard deployment and management
  • Operational Complexity: Management overhead and maintenance burden
  • Performance: Boot file transfer latency and throughput
  • Vendor Lock-in: Future flexibility to migrate or multi-cloud

Decision Drivers

  • Cost Efficiency: Minimize ongoing infrastructure costs for home lab scale
  • Protocol Support: Must support TFTP (UDP/69), HTTP (TCP/80), and HTTPS (TCP/443) for network boot workflows
  • WireGuard Compatibility: Must support self-managed WireGuard VPN with reasonable effort
  • UDM Pro Integration: Should work seamlessly with UniFi Dream Machine Pro’s native WireGuard client
  • Simplicity: Minimize operational complexity for a single-person home lab
  • Existing Expertise: Leverage existing team knowledge and infrastructure
  • Performance: Sufficient throughput and low latency for boot file transfers (50-200MB per boot)

Considered Options

  • Option 1: Google Cloud Platform (GCP)
  • Option 2: Amazon Web Services (AWS)

Decision Outcome

Chosen option: “Option 1: Google Cloud Platform (GCP)”, because:

  1. Existing Infrastructure: The home lab already uses GCP extensively (Cloud Run services, load balancers, mTLS infrastructure per existing codebase), reducing operational overhead and leveraging existing expertise
  2. Comparable Costs: Both providers offer similar costs for the required infrastructure (~$6-12/month for compute + VPN), with GCP’s e2-micro being sufficient
  3. Equivalent Protocol Support: Both support TFTP/HTTP/HTTPS via direct VM access (load balancers unnecessary for single boot server), meeting all protocol requirements
  4. WireGuard Compatibility: Both require self-managed WireGuard deployment (neither has native WireGuard support), with nearly identical implementation complexity
  5. Unified Management: Consolidating all cloud infrastructure on GCP simplifies monitoring, billing, IAM, and operational workflows

While AWS would be a viable alternative (especially with t4g.micro ARM instances offering slightly better price/performance), the existing GCP investment makes GCP the pragmatic choice and avoids multi-cloud complexity.

Consequences

  • Good, because consolidates all cloud infrastructure on a single provider (reduced operational complexity)
  • Good, because leverages existing GCP expertise and IAM configurations
  • Good, because unified Cloud Monitoring/Logging across all services
  • Good, because single cloud bill simplifies cost tracking
  • Good, because existing Terraform modules and patterns can be reused
  • Good, because GCP’s e2-micro instances (~$6.50/month) are cost-effective for the workload
  • Good, because self-managed WireGuard provides flexibility and low cost (~$10/month total)
  • Neutral, because both providers have comparable protocol support (TFTP/HTTP/HTTPS via VM)
  • Neutral, because both require self-managed WireGuard (no native support)
  • Bad, because creates vendor lock-in to GCP (migration would require relearning and reconfiguration)
  • Bad, because foregoes AWS’s slightly cheaper t4g.micro ARM instances (~$6/month vs GCP’s ~$6.50/month)
  • Bad, because multi-cloud strategy could provide redundancy (accepted trade-off for simplicity)

Confirmation

The implementation will be confirmed by:

  • Successfully deploying WireGuard VPN gateway on GCP Compute Engine
  • Establishing site-to-site VPN tunnel between UDM Pro and GCP
  • Network booting a test server via VPN using TFTP and HTTP protocols
  • Measuring actual costs against estimates (~$10-15/month)
  • Validating boot performance (transfer time < 30 seconds for typical boot)
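A few commands from a home lab host cover the first confirmation pass; a sketch assuming the boot server answers on 10.100.0.1 inside the WireGuard tunnel (address and file names are illustrative):

```bash
ping -c 3 10.100.0.1                              # is the WireGuard tunnel up?
tftp 10.100.0.1 -c get ipxe.efi                   # TFTP fetch through the tunnel
curl -o /dev/null -w '%{time_total}s %{size_download} bytes\n' \
  http://10.100.0.1/boot/vmlinuz                  # HTTP fetch with rough timing
```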

Pros and Cons of the Options

Option 1: Google Cloud Platform (GCP)

Host network boot infrastructure on Google Cloud Platform.

Architecture Overview

graph TB
    subgraph "Home Lab Network"
        A[Home Lab Servers]
        B[UDM Pro - WireGuard Client]
    end
    
    subgraph "GCP VPC"
        C[WireGuard Gateway VM<br/>e2-micro]
        D[Boot Server VM<br/>e2-micro]
        C -->|VPC Routing| D
    end
    
    A -->|PXE Boot Request| B
    B -->|WireGuard Tunnel| C
    C -->|TFTP/HTTP/HTTPS| D
    D -->|Boot Files| C
    C -->|Encrypted Response| B
    B -->|Boot Files| A

Implementation Details

Compute:

  • WireGuard Gateway: e2-micro VM (~$6.50/month) running Ubuntu 22.04
    • Self-managed WireGuard server
    • IP forwarding enabled
    • Static external IP (~$3.50/month if VM ever stops)
  • Boot Server: e2-micro VM (same or consolidated with gateway)
    • TFTP server (tftpd-hpa)
    • HTTP server (nginx or simple Python server)
    • Optional HTTPS with self-signed cert or Let’s Encrypt
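A sketch of the boot-server half of that VM, assuming Ubuntu 22.04 and illustrative file names (tftpd-hpa serves /srv/tftp by default on Ubuntu):

```bash
sudo apt-get update
sudo apt-get install -y tftpd-hpa nginx
sudo mkdir -p /srv/tftp /var/www/html/boot
# Chainload binaries over TFTP; kernel/initrd over HTTP (example file names)
sudo cp undionly.kpxe ipxe.efi /srv/tftp/
sudo cp vmlinuz initramfs.img /var/www/html/boot/
sudo systemctl enable --now tftpd-hpa nginx
```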

Networking:

  • VPC: Default VPC or custom VPC with private subnets
  • Firewall Rules (see the gcloud sketch after this list):
    • Allow UDP/51820 from home lab public IP (WireGuard)
    • Allow UDP/69, TCP/80, TCP/443 from VPN subnet (boot protocols)
  • Routes: Custom route to direct home lab subnet through WireGuard gateway
  • Cloud VPN: Not used (self-managed WireGuard instead to save ~$65/month)
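The firewall rules above translate into two gcloud commands; a sketch where the rule names, target tags, home lab public IP (203.0.113.10), and home lab subnets are illustrative:

```bash
gcloud compute firewall-rules create allow-wireguard \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=udp:51820 --source-ranges=203.0.113.10/32 \
  --target-tags=wireguard-gateway

gcloud compute firewall-rules create allow-boot-protocols \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=udp:69,tcp:80,tcp:443 \
  --source-ranges=192.168.10.0/24,192.168.20.0/24 \
  --target-tags=boot-server
```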

WireGuard Setup:

  • Install WireGuard on Compute Engine VM
  • Configure wg0 interface with PostUp/PostDown iptables rules
  • Store private key in Secret Manager
  • UDM Pro connects as WireGuard peer
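A condensed sketch of that setup, assuming Ubuntu 22.04 on the VM (ens4 is the typical primary NIC on GCP Ubuntu images); addresses, the secret name, and the peer key are illustrative:

```bash
sudo apt-get update && sudo apt-get install -y wireguard

# Allow the VM to route boot traffic beyond the tunnel
echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-wireguard.conf
sudo sysctl --system

# Generate keys and keep the private key in Secret Manager
wg genkey | tee server.key | wg pubkey > server.pub
gcloud secrets create wireguard-server-key --data-file=server.key

sudo tee /etc/wireguard/wg0.conf >/dev/null <<EOF
[Interface]
Address    = 10.100.0.1/24
ListenPort = 51820
PrivateKey = $(cat server.key)
PostUp     = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown   = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# UDM Pro peer; AllowedIPs lists the home lab subnets behind it
PublicKey  = <udm-pro-public-key>
AllowedIPs = 192.168.10.0/24, 192.168.20.0/24
EOF

sudo systemctl enable --now wg-quick@wg0
```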

Cost Breakdown (US regions):

| Component | Monthly Cost |
|---|---|
| e2-micro VM (WireGuard + Boot) | ~$6.50 |
| Static External IP (if attached) | ~$3.50 |
| Egress (10 boots × 150MB) | ~$0.18 |
| Total | ~$10.18 |
| Annual | ~$122 |

Pros and Cons

  • Good, because existing home lab infrastructure already uses GCP extensively
  • Good, because consolidates all cloud resources on single provider (unified billing, IAM, monitoring)
  • Good, because leverages existing GCP expertise and Terraform modules
  • Good, because Cloud Monitoring/Logging already configured for other services
  • Good, because Secret Manager integration for WireGuard key storage
  • Good, because e2-micro instance size is sufficient for network boot workload
  • Good, because low cost (~$10/month for self-managed WireGuard)
  • Good, because VPC networking is familiar and well-documented
  • Neutral, because requires self-managed WireGuard (no native support, same as AWS)
  • Neutral, because TFTP/HTTP/HTTPS served directly from VM (no special GCP features needed)
  • Bad, because slightly more expensive than AWS t4g.micro (~$6.50/month vs ~$6/month)
  • Bad, because creates vendor lock-in to GCP ecosystem
  • Bad, because Cloud VPN (managed IPsec) is expensive (~$73/month), so must use self-managed WireGuard

Option 2: Amazon Web Services (AWS)

Host network boot infrastructure on Amazon Web Services.

Architecture Overview

graph TB
    subgraph "Home Lab Network"
        A[Home Lab Servers]
        B[UDM Pro - WireGuard Client]
    end
    
    subgraph "AWS VPC"
        C[WireGuard Gateway EC2<br/>t4g.micro]
        D[Boot Server EC2<br/>t4g.micro]
        C -->|VPC Routing| D
    end
    
    A -->|PXE Boot Request| B
    B -->|WireGuard Tunnel| C
    C -->|TFTP/HTTP/HTTPS| D
    D -->|Boot Files| C
    C -->|Encrypted Response| B
    B -->|Boot Files| A

Implementation Details

Compute:

  • WireGuard Gateway: t4g.micro EC2 (~$6/month, ARM-based Graviton)
    • Self-managed WireGuard server
    • Source/Dest check disabled for IP forwarding
    • Elastic IP (free when attached to running instance)
  • Boot Server: t4g.micro EC2 (same or consolidated with gateway)
    • TFTP server (tftpd-hpa)
    • HTTP server (nginx)
    • Optional HTTPS with Let’s Encrypt or self-signed cert

Networking:

  • VPC: Default VPC or custom VPC with private subnets
  • Security Groups:
    • WireGuard SG: Allow UDP/51820 from home lab public IP
    • Boot Server SG: Allow UDP/69, TCP/80, TCP/443 from WireGuard SG
  • Route Table: Add route for home lab subnet via WireGuard instance
  • Site-to-Site VPN: Not used (self-managed WireGuard saves ~$30/month)

WireGuard Setup:

  • Install WireGuard on Ubuntu 22.04 or Amazon Linux 2023 EC2
  • Configure wg0 with iptables MASQUERADE
  • Store private key in Secrets Manager
  • UDM Pro connects as WireGuard peer
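The WireGuard installation itself mirrors the GCP sketch above; the AWS-specific pieces are the source/destination check, key storage, and routing. A sketch with illustrative resource IDs:

```bash
# IP forwarding through an EC2 instance requires disabling the source/dest check
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check

# Keep the WireGuard private key in Secrets Manager
aws secretsmanager create-secret \
  --name wireguard-server-key \
  --secret-string "$(cat server.key)"

# Route the home lab subnet back through the WireGuard instance
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 192.168.10.0/24 \
  --instance-id i-0123456789abcdef0
```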

Cost Breakdown (US East):

| Component | Monthly Cost |
|---|---|
| t4g.micro EC2 (WireGuard + Boot) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Egress (10 boots × 150MB) | ~$0.09 |
| Total (On-Demand) | ~$6.09 |
| Total (1-yr Reserved) | ~$3.59 |
| Annual (On-Demand) | ~$73 |
| Annual (Reserved) | ~$43 |

Pros and Cons

  • Good, because t4g.micro ARM instances offer best price/performance (~$6/month on-demand)
  • Good, because Reserved Instances provide significant savings (~40% with 1-year commitment)
  • Good, because Elastic IP is free when attached to running instance
  • Good, because AWS has extensive documentation and community support
  • Good, because potential for future multi-cloud strategy
  • Good, because ACM provides free SSL certificates (if public domain used)
  • Good, because Secrets Manager for WireGuard key storage
  • Good, because low cost (~$6/month on-demand, ~$3.50/month with RI)
  • Neutral, because requires self-managed WireGuard (no native support, same as GCP)
  • Neutral, because TFTP/HTTP/HTTPS served directly from EC2 (no special AWS features)
  • Bad, because introduces multi-cloud complexity (separate billing, IAM, monitoring)
  • Bad, because no existing AWS infrastructure in home lab (new learning curve)
  • Bad, because requires separate monitoring/logging setup (CloudWatch vs Cloud Monitoring)
  • Bad, because separate Terraform state and modules needed
  • Bad, because Site-to-Site VPN is expensive (~$36/month), so must use self-managed WireGuard

More Information

Detailed Analysis

For in-depth analysis of each provider's capabilities, see the dedicated Google Cloud Platform and Amazon Web Services analyses in the Technology Analysis section.

Key Findings Summary

Both providers offer:

  • TFTP Support: Via direct VM/EC2 access (load balancers don’t support TFTP)
  • HTTP/HTTPS Support: Full support via direct VM/EC2 or load balancers
  • WireGuard Compatibility: Self-managed deployment on VM/EC2 (neither has native support)
  • UDM Pro Integration: Native WireGuard client works with both
  • Low Cost: $6-12/month for compute + VPN infrastructure
  • Sufficient Performance: 100+ Mbps throughput on smallest instances

Key differences:

  • GCP: Slightly higher cost (~$10/month), but consolidates with existing infrastructure
  • AWS: Slightly lower cost (~$6/month on-demand, ~$3.50/month Reserved), but introduces multi-cloud complexity

Cost Comparison Table

| Component | GCP (e2-micro) | AWS (t4g.micro On-Demand) | AWS (t4g.micro 1-yr RI) |
|---|---|---|---|
| Compute | $6.50/month | $6.00/month | $3.50/month |
| Static IP | $3.50/month | $0.00 (Elastic IP free when attached) | $0.00 |
| Egress (1.5GB) | $0.18/month | $0.09/month | $0.09/month |
| Monthly | $10.18 | $6.09 | $3.59 |
| Annual | $122 | $73 | $43 |

Savings Analysis: AWS is ~$49-79/year cheaper, but introduces operational complexity.

Protocol Support Comparison

| Protocol | GCP Support | AWS Support | Implementation |
|---|---|---|---|
| TFTP (UDP/69) | ⚠️ Via VM | ⚠️ Via EC2 | Direct VM/EC2 access (no LB support) |
| HTTP (TCP/80) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer |
| HTTPS (TCP/443) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer + cert |
| WireGuard | ⚠️ Self-managed | ⚠️ Self-managed | Install on VM/EC2 |

WireGuard Deployment Comparison

| Aspect | GCP | AWS |
|---|---|---|
| Native Support | ❌ No (IPsec Cloud VPN only) | ❌ No (IPsec Site-to-Site VPN only) |
| Self-Managed | ✅ Compute Engine | ✅ EC2 |
| Setup Complexity | Similar (install, configure, firewall) | Similar (install, configure, SG) |
| IP Forwarding | Enable on VM | Disable Source/Dest check |
| Firewall | VPC Firewall rules | Security Groups |
| Key Storage | Secret Manager | Secrets Manager |
| Cost | ~$10/month total | ~$6/month total |

Trade-offs Analysis

Choosing GCP:

  • Wins: Operational simplicity, unified infrastructure, existing expertise
  • Loses: ~$50-80/year higher cost, vendor lock-in

Choosing AWS:

  • Wins: Lower cost, Reserved Instance savings, multi-cloud optionality
  • Loses: Multi-cloud complexity, separate monitoring/billing, new tooling

For a home lab prioritizing simplicity over cost optimization, GCP’s consolidation benefits outweigh the modest cost difference.

Future Considerations

  1. Cost Reevaluation: If annual costs become significant, reconsider AWS Reserved Instances
  2. Multi-Cloud: If multi-cloud strategy emerges, migrate boot server to AWS
  3. Managed WireGuard: If GCP or AWS adds native WireGuard support, reevaluate managed option
  4. High Availability: If HA required, evaluate multi-region deployment costs on both providers
  • Issue #597 - story(docs): create adr for cloud provider selection

2.4 - [0004] Server Operating System Selection

Evaluate operating systems for homelab server infrastructure with focus on Kubernetes cluster setup and maintenance.

Context and Problem Statement

The homelab infrastructure requires a server operating system to run Kubernetes clusters for container workloads. The choice of operating system significantly impacts ease of cluster initialization, ongoing maintenance burden, security posture, and operational complexity.

The question is: Which operating system should be used for homelab Kubernetes servers?

This decision will affect:

  • Cluster Initialization: Complexity and time required to bootstrap Kubernetes
  • Maintenance Burden: Frequency and complexity of OS updates, Kubernetes upgrades, and patching
  • Security Posture: Attack surface, built-in security features, and hardening requirements
  • Resource Efficiency: RAM, CPU, and disk overhead
  • Operational Complexity: Day-to-day management, troubleshooting, and debugging
  • Learning Curve: Time required for team to become proficient

Decision Drivers

  • Ease of Kubernetes Setup: Minimize steps and complexity for cluster initialization
  • Maintenance Simplicity: Reduce ongoing operational burden for updates and upgrades
  • Security-First Design: Minimal attack surface and strong security defaults
  • Resource Efficiency: Low RAM/CPU/disk overhead for cost-effective homelab
  • Learning Curve: Reasonable adoption time for single-person homelab
  • Community Support: Strong documentation and active community
  • Immutability: Prefer declarative, version-controlled configuration (GitOps-friendly)
  • Purpose-Built: OS optimized specifically for Kubernetes vs general-purpose

Considered Options

  • Option 1: Ubuntu Server with k3s
  • Option 2: Fedora Server with kubeadm
  • Option 3: Talos Linux (purpose-built Kubernetes OS)
  • Option 4: Harvester HCI (hyperconverged platform)

Decision Outcome

Chosen option: “Option 3: Talos Linux”, because:

  1. Minimal Attack Surface: No SSH, shell, or package manager eliminates entire classes of vulnerabilities, providing the strongest security posture
  2. Built-in Kubernetes: No separate installation or configuration complexity - Kubernetes is included and optimized
  3. Declarative Configuration: API-driven, immutable infrastructure aligns with GitOps principles and prevents configuration drift
  4. Lowest Resource Overhead: ~768MB RAM vs 1-2GB+ for traditional distros, maximizing homelab hardware efficiency
  5. Simplified Maintenance: Declarative upgrades (talosctl upgrade) for both OS and Kubernetes reduce operational burden
  6. Security by Default: Immutable filesystem, no shell, KSPP compliance - secure without manual hardening

While the learning curve is steeper than traditional Linux distributions, the benefits of purpose-built Kubernetes infrastructure, minimal maintenance, and superior security outweigh the initial learning investment for a dedicated Kubernetes homelab.

Consequences

  • Good, because minimal attack surface (no SSH/shell) provides strongest security posture
  • Good, because declarative configuration enables GitOps workflows and prevents drift
  • Good, because lowest resource overhead (~768MB RAM) maximizes homelab efficiency
  • Good, because built-in Kubernetes eliminates installation complexity
  • Good, because immutable infrastructure prevents configuration drift
  • Good, because simplified upgrades (single command for OS + K8s) reduce maintenance burden
  • Good, because smallest disk footprint (~500MB) vs 10GB+ for traditional distros
  • Good, because secure by default (no manual hardening required)
  • Good, because purpose-built design optimized specifically for Kubernetes
  • Good, because API-driven management (talosctl) enables automation
  • Neutral, because steeper learning curve (paradigm shift from shell-based management)
  • Neutral, because smaller community than Ubuntu/Fedora (but active and helpful)
  • Bad, because limited to Kubernetes workloads only (not general-purpose)
  • Bad, because no shell access requires different troubleshooting approach
  • Bad, because newer platform (less mature than Ubuntu/Fedora)
  • Bad, because no escape hatch for manual intervention when needed

Confirmation

The implementation will be confirmed by the following checks (sketched as commands after the list):

  • Successfully bootstrapping a Talos cluster using talosctl
  • Deploying test workloads and validating functionality
  • Performing declarative OS and Kubernetes upgrades
  • Measuring actual resource usage (RAM < 1GB per node)
  • Validating security posture (no SSH/shell, immutable filesystem)
  • Testing GitOps workflow (machine configs in version control)

Pros and Cons of the Options

Option 1: Ubuntu Server with k3s

Host Kubernetes on Ubuntu Server 24.04 LTS with the k3s lightweight Kubernetes distribution.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize etcd (embedded)
    K3s->>Server: Start API server
    K3s->>Server: Deploy built-in CNI (Flannel)
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers to cluster
    K3s-->>Admin: Cluster ready (5-10 minutes)

Implementation Details

Installation:

# Single-command k3s install
curl -sfL https://get.k3s.io | sh -

# Get token for workers
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -

Resource Requirements:

  • RAM: 1GB total (512MB OS + 512MB k3s)
  • CPU: 1-2 cores
  • Disk: 20GB (10GB OS + 10GB containers)

Maintenance:

# OS updates
sudo apt update && sudo apt upgrade

# k3s upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -

# Or automatic via system-upgrade-controller
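
For the automatic path, Rancher's system-upgrade-controller watches Plan resources and upgrades k3s node by node. A hedged sketch follows; the manifest URL and Plan fields mirror the upstream examples and should be verified against the current k3s documentation:

# Install the controller (upstream manifest; recent releases also ship a separate CRD manifest)
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml

# Upgrade control-plane nodes one at a time, tracking the stable k3s channel
kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  channel: https://update.k3s.io/v1-release/channels/stable
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  upgrade:
    image: rancher/k3s-upgrade
EOF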

Pros and Cons

  • Good, because most familiar Linux distribution (easy adoption)
  • Good, because 5-year LTS support (10 years with Ubuntu Pro)
  • Good, because k3s provides single-command setup
  • Good, because extensive documentation and community support
  • Good, because compatible with all Kubernetes tooling
  • Good, because automatic security updates available
  • Good, because general-purpose (can run non-K8s workloads)
  • Good, because low learning curve
  • Neutral, because moderate resource overhead (1GB RAM)
  • Bad, because general-purpose OS has larger attack surface
  • Bad, because requires manual OS updates and reboots
  • Bad, because managing OS + Kubernetes lifecycle separately
  • Bad, because imperative configuration (not GitOps-native)
  • Bad, because mutable filesystem (configuration drift possible)

Option 2: Fedora Server with kubeadm

Host Kubernetes on Fedora Server with kubeadm (the official Kubernetes bootstrapping tool) and the CRI-O container runtime.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Install CRI-O
    Server->>Server: Configure CRI-O runtime
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, load kernel modules
    Server->>Server: Configure SELinux
    Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply CNI
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (workers)
    K8s-->>Admin: Cluster ready (15-20 minutes)

Implementation Details

Installation:

# Install CRI-O
sudo dnf install -y cri-o
sudo systemctl enable --now crio

# Install kubeadm components and enable the kubelet
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet

# Initialize cluster
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock

# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
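
The commands above assume the standard kubeadm host preparation has already been done; a sketch of those prerequisite steps, following the upstream kubeadm documentation (adjust for the Fedora release in use):

# Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Load required kernel modules
sudo modprobe overlay
sudo modprobe br_netfilter
printf 'overlay\nbr_netfilter\n' | sudo tee /etc/modules-load.d/k8s.conf

# Networking sysctls for pod traffic
sudo tee /etc/sysctl.d/k8s.conf <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

# SELinux: CRI-O supports enforcing mode; consult the Fedora/CRI-O docs before relaxing it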

Resource Requirements:

  • RAM: 2.2GB total (700MB OS + 1.5GB Kubernetes)
  • CPU: 2+ cores
  • Disk: 35GB (15GB OS + 20GB containers)

Maintenance:

# OS updates (each release is supported ~13 months, so major version upgrades are frequent)
sudo dnf update -y

# Kubernetes upgrade
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl

Pros and Cons

  • Good, because SELinux enabled by default (stronger than AppArmor)
  • Good, because latest kernel and packages (bleeding edge)
  • Good, because native CRI-O support (OpenShift compatibility)
  • Good, because upstream for RHEL (enterprise patterns)
  • Good, because kubeadm provides full control over cluster
  • Neutral, because faster release cycle (latest features, but more upgrades)
  • Bad, because short support cycle (13 months per release)
  • Bad, because bleeding-edge can introduce instability
  • Bad, because complex kubeadm setup (many manual steps)
  • Bad, because higher resource overhead (2.2GB RAM)
  • Bad, because SELinux configuration for Kubernetes is complex
  • Bad, because frequent OS upgrades required (every 13 months)
  • Bad, because managing OS + Kubernetes separately
  • Bad, because imperative configuration (not GitOps-native)

Option 3: Talos Linux (purpose-built Kubernetes OS)

Use Talos Linux, an immutable, API-driven operating system designed specifically for Kubernetes with built-in cluster management.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Talos as Talos Linux
    participant K8s as Kubernetes Components
    
    Admin->>Server: Boot Talos ISO (PXE or USB)
    Server->>Talos: Start in maintenance mode
    Talos-->>Admin: API endpoint ready
    Admin->>Admin: Generate configs (talosctl gen config)
    Admin->>Talos: talosctl apply-config (controlplane.yaml)
    Talos->>Server: Install Talos to disk
    Server->>Server: Reboot from disk
    Talos->>K8s: Start kubelet
    Talos->>K8s: Start etcd
    Talos->>K8s: Start API server
    Admin->>Talos: talosctl bootstrap
    Talos->>K8s: Initialize cluster
    K8s->>Talos: Start controller-manager
    K8s-->>Admin: Control plane ready
    Admin->>K8s: Apply CNI
    Admin->>Talos: Apply worker configs
    Talos->>K8s: Join workers
    K8s-->>Admin: Cluster ready (10-15 minutes)

Implementation Details

Installation:

# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443

# Apply config to control plane (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml

# Bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10

# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.10

# Add workers
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml

Machine Configuration (declarative YAML):

version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.10/24
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

Resource Requirements:

  • RAM: 768MB total (256MB OS + 512MB Kubernetes)
  • CPU: 1-2 cores
  • Disk: 10-15GB (500MB OS + 10GB containers)

Maintenance:

# Upgrade Talos (OS + Kubernetes)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0

# Upgrade Kubernetes version
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0

# Apply config changes
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml
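
For clusters with more than one node, upgrades are typically rolled out one node at a time, waiting for the cluster to report healthy in between; a minimal sketch with illustrative node IPs:

# Rolling upgrade across nodes, one at a time
for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
  talosctl upgrade --nodes "$node" --image ghcr.io/siderolabs/installer:v1.9.0
  talosctl --nodes "$node" --endpoints 192.168.1.10 health
done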

Pros and Cons

  • Good, because Kubernetes built-in (no separate installation)
  • Good, because minimal attack surface (no SSH, shell, package manager)
  • Good, because immutable infrastructure (config drift impossible)
  • Good, because API-driven management (GitOps-friendly)
  • Good, because lowest resource overhead (~768MB RAM)
  • Good, because declarative configuration (YAML in version control)
  • Good, because secure by default (no manual hardening)
  • Good, because smallest disk footprint (~500MB OS)
  • Good, because designed specifically for Kubernetes
  • Good, because simple declarative upgrades (OS + K8s)
  • Good, because UEFI Secure Boot support
  • Neutral, because smaller community (but active and helpful)
  • Bad, because steep learning curve (paradigm shift)
  • Bad, because limited to Kubernetes workloads only
  • Bad, because troubleshooting without shell requires different approach
  • Bad, because relatively new (less mature than Ubuntu/Fedora)
  • Bad, because no escape hatch for manual intervention

Option 4: Harvester HCI (hyperconverged platform)

Use Harvester, a hyperconverged infrastructure platform built on K3s and KubeVirt for unified VM + container management.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Harvester as Harvester HCI
    participant K3s as K3s / KubeVirt
    participant Storage as Longhorn Storage
    
    Admin->>Server: Boot Harvester ISO
    Server->>Harvester: Installation wizard
    Admin->>Harvester: Configure cluster (VIP, storage)
    Harvester->>Server: Install RancherOS 2.0
    Harvester->>Server: Install K3s
    Server->>Server: Reboot
    Harvester->>K3s: Start K3s server
    K3s->>Storage: Deploy Longhorn
    K3s->>Server: Deploy KubeVirt
    K3s->>Server: Deploy multus CNI
    Harvester-->>Admin: Web UI ready
    Admin->>Harvester: Add nodes
    Harvester->>K3s: Join cluster
    K3s-->>Admin: Cluster ready (20-30 minutes)

Implementation Details

Installation: Interactive ISO wizard or cloud-init config

Resource Requirements:

  • RAM: 8GB minimum per node (16GB+ recommended)
  • CPU: 4+ cores per node
  • Disk: 250GB+ per node (100GB OS + 150GB storage)
  • Nodes: 3+ for production HA

Features:

  • Web UI management
  • Built-in storage (Longhorn)
  • VM support (KubeVirt)
  • Live migration
  • Rancher integration

Pros and Cons

  • Good, because unified VM + container platform
  • Good, because built-in K3s (Kubernetes included)
  • Good, because web UI simplifies management
  • Good, because built-in persistent storage (Longhorn)
  • Good, because VM live migration
  • Good, because Rancher integration
  • Neutral, because immutable OS layer
  • Bad, because very heavy resource requirements (8GB+ RAM)
  • Bad, because complex architecture (KubeVirt, Longhorn, multus)
  • Bad, because overkill for container-only workloads
  • Bad, because larger attack surface (web UI, VM layer)
  • Bad, because requires 3+ nodes for HA (not single-node friendly)
  • Bad, because steep learning curve for full feature set

More Information

Detailed Analysis

For in-depth analysis of each operating system:

  • Ubuntu Server Analysis

    • Installation methods (kubeadm, k3s, MicroK8s)
    • Cluster initialization sequences
    • Maintenance requirements and upgrade procedures
    • Resource overhead and security posture
  • Fedora Server Analysis

    • kubeadm with CRI-O installation
    • SELinux configuration for Kubernetes
    • Rapid release cycle implications
    • RHEL ecosystem compatibility
  • Talos Linux Analysis

    • API-driven, immutable architecture
    • Declarative configuration model
    • Security-first design principles
    • Production readiness and advanced features
  • Harvester HCI Analysis

    • Hyperconverged infrastructure capabilities
    • VM + container unified platform
    • KubeVirt and Longhorn integration
    • Multi-node cluster requirements

Key Findings Summary

Resource efficiency comparison:

  • Talos: 768MB RAM, 500MB disk (most efficient)
  • Ubuntu + k3s: 1GB RAM, 20GB disk (efficient)
  • ⚠️ Fedora + kubeadm: 2.2GB RAM, 35GB disk (moderate)
  • Harvester: 8GB+ RAM, 250GB+ disk (heavy)

Security posture comparison:

  • Talos: Minimal attack surface (no SSH/shell, immutable)
  • Fedora: SELinux by default (strong MAC)
  • ⚠️ Ubuntu: AppArmor (moderate security)
  • ⚠️ Harvester: Larger attack surface (web UI, VM layer)

Operational complexity comparison:

  • Ubuntu + k3s: Single command install, familiar management
  • Talos: Declarative, automated (after learning curve)
  • ⚠️ Fedora + kubeadm: Manual kubeadm steps, frequent OS upgrades
  • Harvester: Complex HCI architecture, heavy requirements

Decision Matrix

CriterionUbuntu + k3sFedora + kubeadmTalos LinuxHarvester
Setup Simplicity⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Maintenance Burden⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Security Posture⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Resource Efficiency⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Learning Curve⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Community Support⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Immutability⭐⭐⭐⭐⭐⭐⭐⭐⭐
GitOps-Friendly⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Purpose-Built⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Overall Score29/4524/4538/4528/45

Talos Linux scores highest for Kubernetes-dedicated homelab infrastructure prioritizing security, efficiency, and GitOps workflows.

Trade-offs Analysis

Choosing Talos Linux:

  • Wins: Best security, lowest overhead, declarative configuration, minimal maintenance
  • Loses: Steeper learning curve, no shell access, smaller community

Choosing Ubuntu + k3s:

  • Wins: Easiest adoption, largest community, general-purpose flexibility
  • Loses: Higher attack surface, manual OS management, imperative config

Choosing Fedora + kubeadm:

  • Wins: Latest features, SELinux, enterprise compatibility
  • Loses: Frequent OS upgrades, complex setup, higher overhead

Choosing Harvester:

  • Wins: VM + container unified platform, web UI
  • Loses: Heavy resources, complex architecture, overkill for K8s-only

For a Kubernetes-dedicated homelab prioritizing security and efficiency, Talos Linux’s benefits outweigh the learning curve investment.

Future Considerations

  1. Team Growth: If team grows beyond single person, reassess Ubuntu for familiarity
  2. VM Requirements: If VM workloads emerge, consider Harvester or KubeVirt on Talos
  3. Enterprise Patterns: If RHEL compatibility needed, reconsider Fedora/CentOS Stream
  4. Maintenance Burden: If Talos learning curve proves too steep, fallback to k3s
  5. Talos Maturity: Monitor Talos ecosystem growth and production adoption
  • Issue #598 - story(docs): create adr for server operating system

2.5 - [0005] Network Boot Infrastructure Implementation on Google Cloud

Evaluate implementation approaches for deploying network boot infrastructure on Google Cloud Platform using UEFI HTTP boot, comparing custom server implementation versus Matchbox-based solution.

Context and Problem Statement

ADR-0002 established that network boot infrastructure will be hosted on a cloud provider accessed via WireGuard VPN. ADR-0003 selected Google Cloud Platform as the hosting provider to consolidate infrastructure and leverage existing expertise.

The remaining question is: How should the network boot server itself be implemented?

This decision affects:

  • Development Effort: Time required to build, test, and maintain the solution
  • Feature Completeness: Capabilities for boot image management, machine mapping, and provisioning workflows
  • Operational Complexity: Deployment, monitoring, and troubleshooting burden
  • Security: Boot image integrity, access control, and audit capabilities
  • Scalability: Ability to grow from single home lab to multiple environments

The boot server must handle:

  1. HTTP/HTTPS requests for UEFI boot scripts, kernels, initrd images, and cloud-init configurations
  2. Machine-to-image mapping to serve appropriate boot files based on MAC address, hardware profile, or tags
  3. Boot image lifecycle management including upload, versioning, and rollback capabilities

Hardware-Specific Context

The target bare metal servers (HP DL360 Gen 9) have the following network boot capabilities:

  • UEFI HTTP Boot: Supported in iLO 4 firmware v2.40+ (released 2016)
  • TLS Support: Server-side TLS only (no client certificate authentication)
  • Boot Process: Firmware handles initial HTTP requests directly (no PXE/TFTP chain loading required)
  • Configuration: Boot URL configured via iLO RBSU or UEFI System Utilities

Security Implications: Since the servers cannot present client certificates for mTLS authentication with Cloudflare, the WireGuard VPN serves as the secure transport layer for boot traffic. The HTTP boot server is only accessible through the VPN tunnel.

Reference: HP DL360 Gen 9 Network Boot Analysis

Decision Drivers

  • Time to Production: Minimize time to get a working network boot infrastructure
  • Feature Requirements: Must support machine-specific boot configurations, image versioning, and cloud-init integration
  • Maintenance Burden: Prefer solutions that minimize ongoing maintenance and updates
  • GCP Integration: Should leverage GCP services (Cloud Storage, Secret Manager, IAM)
  • Security: Boot images must be served securely with access control and integrity verification
  • Observability: Comprehensive logging and monitoring for troubleshooting boot failures
  • Cost: Minimize infrastructure costs while meeting functional requirements
  • Future Flexibility: Ability to extend or customize as needs evolve

Considered Options

  • Option 1: Custom server implementation (Go-based)
  • Option 2: Matchbox-based solution

Decision Outcome

Chosen option: “Option 1: Custom implementation”, because:

  1. UEFI HTTP Boot Simplification: Elimination of TFTP/PXE dramatically reduces implementation complexity
  2. Cloud Run Deployment: HTTP-only boot enables serverless deployment (~$5/month vs $8-17/month)
  3. Development Time Manageable: UEFI HTTP boot reduces custom development to 2-3 weeks
  4. Full Control: Custom implementation maintains flexibility for future home lab requirements
  5. GCP Native Integration: Direct Cloud Storage, Firestore, Secret Manager, and IAM integration
  6. Existing Framework: Leverages z5labs/humus patterns already in use across services
  7. HTTP REST API: Native HTTP REST admin API via z5labs/humus framework provides better integration with existing tooling

Consequences

  • Good, because UEFI HTTP boot eliminates TFTP complexity entirely
  • Good, because Cloud Run deployment reduces operational overhead and cost
  • Good, because leverages existing z5labs/humus framework and Go expertise
  • Good, because GCP native integration (Cloud Storage, Firestore, Secret Manager, IAM)
  • Good, because full control over implementation enables future customization
  • Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
  • Good, because OpenTelemetry observability built-in from existing patterns
  • Neutral, because requires 2-3 weeks development time vs 1 week for Matchbox setup
  • Neutral, because ongoing maintenance responsibility (no upstream project support)
  • Bad, because custom implementation may miss edge cases that Matchbox handles
  • Bad, because reinvents machine matching and boot configuration patterns
  • Bad, because Cloud Run cold start latency needs monitoring (mitigated with min instances = 1)

Confirmation

The implementation success will be validated by:

  • Successfully deploying custom boot server on GCP Cloud Run
  • Successfully network booting HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
  • Confirming iLO 4 firmware v2.40+ compatibility with HTTP boot workflow
  • Validating boot image upload and versioning workflows via HTTP REST API
  • Measuring Cloud Run cold start latency for boot requests (target: < 100ms)
  • Measuring boot file request latency for kernel/initrd downloads (target: < 100ms)
  • Confirming Cloud Storage integration for boot asset storage
  • Testing machine-to-image mapping based on MAC address using Firestore
  • Validating WireGuard VPN security for boot traffic (compensating for lack of client cert support)
  • Verifying OpenTelemetry observability integration with Cloud Monitoring

Pros and Cons of the Options

Option 1: Custom Server Implementation (Go-based)

Build a custom network boot server in Go, leveraging the existing z5labs/humus framework for HTTP services.

Architecture Overview

architecture-beta
    group gcp(cloud)[GCP VPC]

    service wg_nlb(internet)[Network LB] in gcp
    service wireguard(server)[WireGuard Gateway] in gcp
    service https_lb(internet)[HTTPS LB] in gcp
    service compute(server)[Compute Engine] in gcp
    service storage(database)[Cloud Storage] in gcp
    service firestore(database)[Firestore] in gcp
    service secrets(disk)[Secret Manager] in gcp
    service monitoring(internet)[Cloud Monitoring] in gcp

    group homelab(cloud)[Home Lab]
    service udm(server)[UDM Pro] in homelab
    service servers(server)[Bare Metal Servers] in homelab

    servers:L -- R:udm
    udm:R -- L:wg_nlb
    wg_nlb:R -- L:wireguard
    wireguard:R -- L:https_lb
    https_lb:R -- L:compute
    compute:B --> T:storage
    compute:B --> T:firestore
    compute:R --> L:secrets
    compute:T --> B:monitoring

Components:

  • Boot Server: Go service deployed to Cloud Run (or Compute Engine VM as fallback)
    • HTTP/HTTPS server (using z5labs/humus framework with OpenAPI)
    • UEFI HTTP boot endpoint serving boot scripts and assets
    • HTTP REST admin API for boot configuration management
  • Cloud Storage: Buckets for boot images, boot scripts, kernels, initrd files
  • Firestore/Datastore: Machine-to-image mapping database (MAC → boot profile)
  • Secret Manager: WireGuard keys, TLS certificates (optional for HTTPS boot)
  • Cloud Monitoring: Metrics for boot requests, success/failure rates, latency

Boot Image Lifecycle

sequenceDiagram
    participant Admin
    participant API as Boot Server API
    participant Storage as Cloud Storage
    participant DB as Firestore
    participant Monitor as Cloud Monitoring

    Note over Admin,Monitor: Upload Boot Image
    Admin->>API: POST /api/v1/images (kernel, initrd, metadata)
    API->>API: Validate image integrity (checksum)
    API->>Storage: Upload kernel to gs://boot-images/kernels/
    API->>Storage: Upload initrd to gs://boot-images/initrd/
    API->>DB: Store metadata (version, checksum, tags)
    API->>Monitor: Log upload event
    API->>Admin: 201 Created (image ID)

    Note over Admin,Monitor: Map Machine to Image
    Admin->>API: POST /api/v1/machines (MAC, image_id, profile)
    API->>DB: Store machine mapping
    API->>Admin: 201 Created

    Note over Admin,Monitor: UEFI HTTP Boot Request
    participant Server as Home Lab Server
    Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
    Server->>API: HTTP GET /boot?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
    API->>DB: Query machine mapping by MAC
    API->>API: Generate iPXE script (kernel, initrd URLs)
    API->>Monitor: Log boot script request
    API->>Server: Send iPXE script
    
    Server->>API: HTTP GET /kernels/ubuntu-22.04.img
    API->>Storage: Fetch kernel from Cloud Storage
    API->>Monitor: Log kernel download (size, duration)
    API->>Server: Stream kernel file
    
    Server->>API: HTTP GET /initrd/ubuntu-22.04.img
    API->>Storage: Fetch initrd from Cloud Storage
    API->>Monitor: Log initrd download
    API->>Server: Stream initrd file
    
    Server->>Server: Boot into OS
    
    Note over Admin,Monitor: Rollback Image Version
    Admin->>API: POST /api/v1/machines/{mac}/rollback
    API->>DB: Update machine mapping to previous image_id
    API->>Monitor: Log rollback event
    API->>Admin: 200 OK
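
The same lifecycle exercised end to end with curl; the endpoint paths mirror the sequence diagram above, while the hostname, payload fields, and file names are illustrative assumptions rather than a finalized API contract:

# Upload a boot image (kernel + initrd + metadata)
curl -X POST http://boot.internal/api/v1/images \
  -F kernel=@vmlinuz -F initrd=@initrd.img \
  -F 'metadata={"name":"ubuntu-22.04","version":"1"}'

# Map a machine (by MAC) to the uploaded image
curl -X POST http://boot.internal/api/v1/machines \
  -H 'Content-Type: application/json' \
  -d '{"mac":"aa:bb:cc:dd:ee:ff","image_id":"ubuntu-22.04","profile":"ubuntu-22.04-server"}'

# Simulate the firmware's boot request (returns an iPXE script)
curl 'http://boot.internal/boot?mac=aa:bb:cc:dd:ee:ff'

# Roll the machine back to its previous image
curl -X POST http://boot.internal/api/v1/machines/aa:bb:cc:dd:ee:ff/rollback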

Implementation Details

Development Stack:

  • Language: Go 1.24 (leverage existing Go expertise)
  • HTTP Framework: z5labs/humus (consistent with existing services)
  • UEFI Boot: Standard HTTP handlers (no special libraries needed)
  • Storage Client: cloud.google.com/go/storage
  • Database: Firestore for machine mappings (or simple JSON config in Cloud Storage)
  • Observability: OpenTelemetry (metrics, traces, logs to Cloud Monitoring/Trace)

Deployment (sketched as gcloud commands after the list):

  • Cloud Run (preferred - HTTP-only boot enables serverless deployment):
    • Min instances: 1 (ensures fast boot response, avoids cold start delays)
    • Max instances: 2 (home lab scale)
    • Memory: 512MB
    • CPU: 1 vCPU
    • Health checks: /health/startup, /health/liveness
    • Concurrency: 10 requests per instance
  • Alternative - Compute Engine VM (if Cloud Run latency unacceptable):
    • e2-micro instance ($6.50/month)
    • Container-Optimized OS with Docker
    • systemd service for boot server
    • Health checks: /health/startup, /health/liveness
  • Networking:
    • VPC firewall: Allow TCP/80, TCP/443 from WireGuard subnet (no UDP/69 needed)
    • Static internal IP for boot server (Compute Engine) or HTTPS Load Balancer (Cloud Run)
    • Cloud NAT for outbound connectivity (Cloud Storage access)
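
A deployment sketch matching the parameters above; the project, image path, region, network name, and WireGuard subnet range are placeholders, and IAM is left open because the firmware cannot present credentials (access is restricted at the network layer instead):

# Deploy the boot server to Cloud Run (min instances = 1 avoids cold starts)
gcloud run deploy boot-server \
  --image=us-docker.pkg.dev/PROJECT_ID/homelab/boot-server:latest \
  --region=us-central1 \
  --min-instances=1 --max-instances=2 \
  --memory=512Mi --cpu=1 --concurrency=10 \
  --ingress=internal \
  --allow-unauthenticated

# Allow boot traffic from the WireGuard subnet only (TCP only, no UDP/69)
gcloud compute firewall-rules create allow-boot-http \
  --network=homelab-vpc \
  --allow=tcp:80,tcp:443 \
  --source-ranges=10.8.0.0/24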

Configuration Management:

  • Machine mappings stored in Firestore or Cloud Storage JSON files
  • Boot profiles defined in YAML (similar to Matchbox groups):
    profiles:
      - name: ubuntu-22.04-server
        kernel: gs://boot-images/kernels/ubuntu-22.04.img
        initrd: gs://boot-images/initrd/ubuntu-22.04.img
        cmdline: "console=tty0 console=ttyS0"
        cloud_init: gs://boot-images/cloud-init/ubuntu-base.yaml
    
    machines:
      - mac: "aa:bb:cc:dd:ee:ff"
        profile: ubuntu-22.04-server
        hostname: node-01
    

Cost Breakdown:

Option A: Cloud Run Deployment (Preferred):

| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 512MB, always-on) | $3.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$5.18 |

Option B: Compute Engine Deployment (If Cloud Run latency unacceptable):

| Component | Monthly Cost |
|---|---|
| e2-micro VM (boot server) | $6.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |

Pros and Cons

  • Good, because UEFI HTTP boot eliminates TFTP complexity entirely
  • Good, because Cloud Run deployment option reduces operational overhead and infrastructure cost
  • Good, because full control over boot server implementation and features
  • Good, because leverages existing Go expertise and z5labs/humus framework patterns
  • Good, because seamless GCP integration (Cloud Storage, Firestore, Secret Manager, IAM)
  • Good, because minimal dependencies (no external projects to track)
  • Good, because customizable to specific home lab requirements
  • Good, because OpenTelemetry observability built-in from existing patterns
  • Good, because can optimize for home lab scale (< 20 machines)
  • Good, because lightweight implementation (no unnecessary features)
  • Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
  • Good, because standard HTTP serving is well-understood (lower risk than TFTP)
  • Neutral, because development effort required (2-3 weeks for MVP, reduced from 3-4 weeks)
  • Neutral, because requires ongoing maintenance and security updates
  • Neutral, because Cloud Run cold start latency needs validation (POC required)
  • Bad, because reinvents machine matching and boot configuration patterns
  • Bad, because testing network boot scenarios still requires hardware
  • Bad, because potential for bugs in custom implementation
  • Bad, because no community support or established best practices
  • Bad, because development time still longer than Matchbox (2-3 weeks vs 1 week)

Option 2: Matchbox-Based Solution

Deploy Matchbox, an open-source network boot server developed by CoreOS (now part of Red Hat), to handle UEFI HTTP boot workflows.

Architecture Overview

architecture-beta
    group gcp(cloud)[GCP VPC]
    
    service wg_nlb(internet)[Network LB] in gcp
    service wireguard(server)[WireGuard Gateway] in gcp
    service https_lb(internet)[HTTPS LB] in gcp
    service compute(server)[Compute Engine] in gcp
    service storage(database)[Cloud Storage] in gcp
    service secrets(disk)[Secret Manager] in gcp
    service monitoring(internet)[Cloud Monitoring] in gcp
    
    group homelab(cloud)[Home Lab]
    service udm(server)[UDM Pro] in homelab
    service servers(server)[Bare Metal Servers] in homelab
    
    servers:L -- R:udm
    udm:R -- L:wg_nlb
    wg_nlb:R -- L:wireguard
    wireguard:R -- L:https_lb
    https_lb:R -- L:compute
    compute:B --> T:storage
    compute:R --> L:secrets
    compute:T --> B:monitoring

Components:

  • Matchbox Server: Container deployed to Cloud Run or Compute Engine VM
    • HTTP/gRPC APIs for boot workflows and configuration
    • UEFI HTTP boot support (TFTP disabled)
    • Machine grouping and profile templating
    • Ignition, Cloud-Init, and generic boot support
  • Cloud Storage: Backend for boot assets (mounted via gcsfuse or synced periodically)
  • Local Storage (Compute Engine only): /var/lib/matchbox for assets and configuration (synced from Cloud Storage)
  • Secret Manager: WireGuard keys, Matchbox TLS certificates
  • Cloud Monitoring: Logs from Matchbox container, custom metrics via log parsing

Boot Image Lifecycle

sequenceDiagram
    participant Admin
    participant CLI as matchbox CLI / API
    participant Matchbox as Matchbox Server
    participant Storage as Cloud Storage
    participant Monitor as Cloud Monitoring

    Note over Admin,Monitor: Upload Boot Image
    Admin->>CLI: Upload kernel/initrd via gRPC API
    CLI->>Matchbox: gRPC CreateAsset(kernel, initrd)
    Matchbox->>Matchbox: Validate asset integrity
    Matchbox->>Matchbox: Store to /var/lib/matchbox/assets/
    Matchbox->>Storage: Sync to gs://boot-assets/ (via sidecar script)
    Matchbox->>Monitor: Log asset upload event
    Matchbox->>CLI: Asset ID, checksum

    Note over Admin,Monitor: Create Boot Profile
    Admin->>CLI: Create profile YAML (kernel, initrd, cmdline)
    CLI->>Matchbox: gRPC CreateProfile(profile.yaml)
    Matchbox->>Matchbox: Store to /var/lib/matchbox/profiles/
    Matchbox->>Storage: Sync profiles to gs://boot-config/
    Matchbox->>CLI: Profile ID

    Note over Admin,Monitor: Create Machine Group
    Admin->>CLI: Create group YAML (MAC selector, profile mapping)
    CLI->>Matchbox: gRPC CreateGroup(group.yaml)
    Matchbox->>Matchbox: Store to /var/lib/matchbox/groups/
    Matchbox->>Storage: Sync groups to gs://boot-config/
    Matchbox->>CLI: Group ID

    Note over Admin,Monitor: UEFI HTTP Boot Request
    participant Server as Home Lab Server
    Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
    Server->>Matchbox: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
    Matchbox->>Matchbox: Match MAC to group
    Matchbox->>Matchbox: Render iPXE template with profile
    Matchbox->>Monitor: Log boot request (MAC, group, profile)
    Matchbox->>Server: Send iPXE script
    
    Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-kernel.img
    Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
    Matchbox->>Monitor: Log asset download (size, duration)
    Matchbox->>Server: Stream kernel file
    
    Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-initrd.img
    Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
    Matchbox->>Monitor: Log asset download
    Matchbox->>Server: Stream initrd file
    
    Server->>Server: Boot into OS
    
    Note over Admin,Monitor: Rollback Machine Group
    Admin->>CLI: Update group YAML (change profile reference)
    CLI->>Matchbox: gRPC UpdateGroup(group.yaml)
    Matchbox->>Matchbox: Update /var/lib/matchbox/groups/
    Matchbox->>Storage: Sync updated group config
    Matchbox->>Monitor: Log group update
    Matchbox->>CLI: Success

Implementation Details

Matchbox Deployment (a container run sketch follows the list):

  • Container: quay.io/poseidon/matchbox:latest (official image)
  • Deployment Options:
    • Cloud Run (preferred - HTTP-only boot enables serverless deployment):
      • Min instances: 1 (ensures fast boot response)
      • Memory: 1GB RAM (Matchbox recommendation)
      • CPU: 1 vCPU
      • Storage: Cloud Storage for assets/profiles/groups (via HTTP API)
    • Compute Engine VM (if persistent local storage preferred):
      • e2-small instance ($14/month, 2GB RAM recommended for Matchbox)
      • /var/lib/matchbox: Persistent disk (10GB SSD, $1.70/month)
      • Cloud Storage sync: Periodic backup of assets/profiles/groups to gs://matchbox-config/
      • Option: Use gcsfuse to mount Cloud Storage directly (adds latency but simplifies backups)
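
A minimal sketch of running the official container on a Compute Engine VM; the flags follow the Matchbox documentation's examples and should be verified against the deployed version, and /etc/matchbox is assumed to hold the TLS certificates for the gRPC API:

# Run Matchbox with the HTTP endpoint on 8080 and the gRPC API on 8081
docker run -d --name matchbox \
  -p 8080:8080 -p 8081:8081 \
  -v /var/lib/matchbox:/var/lib/matchbox \
  -v /etc/matchbox:/etc/matchbox:ro \
  quay.io/poseidon/matchbox:latest \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -data-path=/var/lib/matchbox \
  -assets-path=/var/lib/matchbox/assets \
  -log-level=info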

Configuration Structure:

/var/lib/matchbox/
├── assets/           # Boot images (kernels, initrds, ISOs)
│   ├── ubuntu-22.04-kernel.img
│   ├── ubuntu-22.04-initrd.img
│   └── flatcar-stable.img.gz
├── profiles/         # Boot profiles (YAML)
│   ├── ubuntu-server.yaml
│   └── flatcar-container.yaml
└── groups/           # Machine groups (YAML)
    ├── default.yaml
    ├── node-01.yaml
    └── storage-nodes.yaml

Example Profile (profiles/ubuntu-server.yaml):

id: ubuntu-22.04-server
name: Ubuntu 22.04 LTS Server
boot:
  kernel: /assets/ubuntu-22.04-kernel.img
  initrd:
    - /assets/ubuntu-22.04-initrd.img
  args:
    - console=tty0
    - console=ttyS0
    - ip=dhcp
ignition_id: ubuntu-base.yaml

Example Group (groups/node-01.yaml):

id: node-01
name: Node 01 - Ubuntu Server
profile: ubuntu-22.04-server
selector:
  mac: "aa:bb:cc:dd:ee:ff"
metadata:
  hostname: node-01.homelab.local
  ssh_authorized_keys:
    - "ssh-ed25519 AAAA..."

GCP Integration:

  • Cloud Storage Sync: Cron job or sidecar container to sync /var/lib/matchbox to Cloud Storage
    # Sync every 5 minutes
    */5 * * * * gsutil -m rsync -r /var/lib/matchbox gs://matchbox-config/
    
  • Secret Manager: Store Matchbox TLS certificates for gRPC API authentication
  • Cloud Monitoring: Ship Matchbox logs to Cloud Logging, parse for metrics:
    • Boot request count by MAC/group
    • Asset download success/failure rates
    • Boot script vs asset request distribution (a log-based metric sketch follows this list)
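
As a sketch, boot activity in the Matchbox logs can be turned into Cloud Monitoring metrics via log-based metrics; the resource type and filter below are assumptions about the log shape and will need tuning against real output:

# Count boot script requests from Matchbox container logs
gcloud logging metrics create matchbox_boot_requests \
  --description="Matchbox boot script requests" \
  --log-filter='resource.type="gce_instance" AND textPayload:"boot.ipxe"'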

Networking:

  • VPC firewall: Allow TCP/8080 (HTTP), TCP/8081 (gRPC) from WireGuard subnet (no UDP/69 needed)
  • Optional: Internal load balancer if high availability required (adds ~$18/month)
  • Note: Cloud Run deployment includes integrated HTTPS load balancing

Cost Breakdown:

Option A: Cloud Run Deployment (Preferred):

| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 1GB RAM, always-on) | $7.00 |
| Cloud Storage (50GB boot images) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |

Option B: Compute Engine Deployment (If persistent local storage preferred):

| Component | Monthly Cost |
|---|---|
| e2-small VM (Matchbox server) | $14.00 |
| Persistent SSD (10GB) | $1.70 |
| Cloud Storage (50GB backups) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$16.88 |

Pros and Cons

  • Good, because HTTP-only boot enables Cloud Run deployment (reduces cost significantly)
  • Good, because UEFI HTTP boot eliminates TFTP complexity and potential failure points
  • Good, because production-ready boot server with extensive real-world usage
  • Good, because feature-complete with machine grouping, templating, and multi-OS support
  • Good, because gRPC API for programmatic boot configuration management
  • Good, because supports Ignition (Flatcar, CoreOS), Cloud-Init, and generic boot workflows
  • Good, because well-documented with established best practices
  • Good, because active community and upstream maintenance (Red Hat/CoreOS)
  • Good, because reduces development time to days (deploy + configure vs weeks of coding)
  • Good, because avoids reinventing network boot patterns (machine matching, boot configuration)
  • Good, because proven security model (TLS for gRPC, asset integrity checks)
  • Neutral, because requires learning Matchbox configuration patterns (YAML profiles/groups)
  • Neutral, because containerized deployment (Docker on Compute Engine or Cloud Run)
  • Neutral, because Cloud Run deployment option competitive with custom implementation cost
  • Bad, because introduces external dependency (Matchbox project maintenance)
  • Bad, because some features unnecessary for home lab scale (large-scale provisioning, etcd backend)
  • Bad, because less control over implementation details (limited customization)
  • Bad, because Cloud Storage integration requires custom sync scripts (Matchbox doesn’t natively support GCS backend)
  • Bad, because dependency on upstream for security patches and bug fixes

UEFI HTTP Boot Architecture

This section documents the UEFI HTTP boot capability that fundamentally changes the network boot infrastructure design.

Boot Process Overview

Traditional PXE Boot (NOT USED - shown for comparison):

sequenceDiagram
    participant Server as Bare Metal Server
    participant DHCP as DHCP Server
    participant TFTP as TFTP Server
    participant HTTP as HTTP Server

    Note over Server,HTTP: Traditional PXE Boot Chain (NOT USED)
    Server->>DHCP: DHCP Discover
    DHCP->>Server: DHCP Offer (TFTP server, boot filename)
    Server->>TFTP: TFTP GET /pxelinux.0
    TFTP->>Server: Send PXE bootloader
    Server->>TFTP: TFTP GET /ipxe.efi
    TFTP->>Server: Send iPXE binary
    Server->>HTTP: HTTP GET /boot.ipxe
    HTTP->>Server: Send boot script
    Server->>HTTP: HTTP GET /kernel, /initrd
    HTTP->>Server: Stream boot files

UEFI HTTP Boot (ACTUAL IMPLEMENTATION):

sequenceDiagram
    participant Server as HP DL360 Gen 9<br/>(iLO 4 v2.40+)
    participant DHCP as DHCP Server<br/>(UDM Pro)
    participant VPN as WireGuard VPN
    participant HTTP as HTTP Boot Server<br/>(GCP Cloud Run)

    Note over Server,HTTP: UEFI HTTP Boot (ACTUAL IMPLEMENTATION)
    Server->>DHCP: DHCP Discover
    DHCP->>Server: DHCP Offer (boot URL: http://boot.internal/boot.ipxe?mac=...)
    Note right of Server: Firmware initiates HTTP request directly<br/>(no TFTP/PXE chain loading)
    Server->>VPN: WireGuard tunnel established
    Server->>HTTP: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff
    HTTP->>Server: Send boot script with kernel/initrd URLs
    Server->>HTTP: HTTP GET /assets/talos-kernel.img
    HTTP->>Server: Stream kernel (via WireGuard)
    Server->>HTTP: HTTP GET /assets/talos-initrd.img
    HTTP->>Server: Stream initrd (via WireGuard)
    Server->>Server: Boot into OS

Key Differences

| Aspect | Traditional PXE | UEFI HTTP Boot |
|---|---|---|
| Initial Protocol | TFTP (UDP/69) | HTTP (TCP/80) or HTTPS (TCP/443) |
| Boot Loader | Requires TFTP transfer of iPXE binary | Firmware has HTTP client built-in |
| Chain Loading | PXE → TFTP → iPXE → HTTP | Direct HTTP boot (no chain) |
| Firewall Rules | UDP/69, TCP/80, TCP/443 | TCP/80, TCP/443 only |
| Cloud Run Support | ❌ (UDP not supported) | ✅ (HTTP-only) |
| Transfer Speed | ~1-5 Mbps (TFTP) | 10-100 Mbps (HTTP) |
| Complexity | High (multiple protocols) | Low (HTTP-only) |

Security Architecture

Challenge: HP DL360 Gen 9 UEFI HTTP boot does not support client-side TLS certificates (mTLS).

Solution: WireGuard VPN provides transport-layer security:

flowchart LR
    subgraph homelab[Home Lab]
        server[HP DL360 Gen 9<br/>UEFI HTTP Boot<br/>iLO 4 v2.40+]
        udm[UDM Pro<br/>WireGuard Client]
    end

    subgraph gcp[Google Cloud Platform]
        wg_gw[WireGuard Gateway<br/>Compute Engine]
        cr[Boot Server<br/>Cloud Run]
    end

    server -->|HTTP| udm
    udm -->|Encrypted WireGuard Tunnel| wg_gw
    wg_gw -->|HTTP| cr

    style server fill:#f9f,stroke:#333
    style udm fill:#bbf,stroke:#333
    style wg_gw fill:#bfb,stroke:#333
    style cr fill:#fbb,stroke:#333

Why WireGuard instead of Cloudflare mTLS?

  • Cloudflare mTLS Limitation: Requires client certificates at TLS layer
  • UEFI Firmware Limitation: Cannot present client certificates during TLS handshake
  • WireGuard Solution: Provides mutual authentication at the network layer via exchanged peer public keys (key setup sketched after this list)
  • Security Equivalent: WireGuard offers same security properties as mTLS:
    • Mutual authentication (both endpoints authenticated)
    • Confidentiality (all traffic encrypted)
    • Integrity (authenticated encryption via ChaCha20-Poly1305)
    • No Internet exposure (boot server only accessible via VPN)
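
A brief sketch of the key material involved; interface and peer addresses are placeholders:

# Generate a key pair on each endpoint (UDM Pro and GCP gateway)
wg genkey | tee privatekey | wg pubkey > publickey

# Optional additional symmetric pre-shared key per peer pair
wg genpsk > presharedkey

# After both peers are configured, confirm handshakes and connectivity
sudo wg show
ping -c 3 10.8.0.1   # WireGuard gateway address (illustrative)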

Firmware Configuration

HP iLO 4 UEFI HTTP Boot Setup:

  1. Access Configuration:

    • iLO web interface → Remote Console → Power On → Press F9 (RBSU)
    • Or: Direct RBSU access during POST (Press F9)
  2. Enable UEFI HTTP Boot:

    • Navigate: System Configuration → BIOS/Platform Configuration (RBSU) → Network Options
    • Set Network Boot to Enabled
    • Set Boot Mode to UEFI (not Legacy BIOS)
    • Enable UEFI HTTP Boot Support
  3. Configure NIC:

    • Navigate: RBSU → Network Options → [FlexibleLOM/PCIe NIC]
    • Set Option ROM to Enabled (required for UEFI boot option to appear)
    • Set Network Boot to Enabled
    • Configure IPv4/IPv6 settings (DHCP or static)
  4. Set Boot Order:

    • Navigate: RBSU → Boot Options → UEFI Boot Order
    • Move network device to top priority
  5. Configure Boot URL (via DHCP or static):

    • DHCP option 67: http://10.x.x.x/boot.ipxe?mac=${net0/mac}
    • Or: Static configuration in UEFI System Utilities

Required Firmware Versions:

  • iLO 4: v2.40 or later (for UEFI HTTP boot support)
  • System ROM: P89 v2.60 or later (recommended)

Verification:

# Check iLO firmware version via REST API
curl -k -u admin:password https://ilo-address/redfish/v1/Managers/1/ | jq '.FirmwareVersion'

# Expected output: "2.40" or higher

Architectural Implications

TFTP Elimination Impact:

  1. Deployment: Cloud Run becomes viable (no UDP/TFTP requirement)
  2. Cost: Reduced infrastructure costs (~$5-8/month vs $8-17/month)
  3. Complexity: Simplified networking (TCP-only firewall rules)
  4. Development: Reduced effort (no TFTP library, testing, edge cases)
  5. Scalability: Cloud Run autoscaling vs fixed VM capacity
  6. Maintenance: Serverless reduces operational overhead

Decision Impact:

The removal of TFTP complexity fundamentally shifts the cost/benefit analysis:

  • Custom Implementation: More attractive (Cloud Run, reduced development time)
  • Matchbox: Still valid but cost/complexity advantage reduced
  • TCO Gap: Narrowed from ~$8,000-12,000 to ~$4,000-8,000 (Year 1)
  • Development Gap: Reduced from 2-3 weeks to 1-2 weeks

Detailed Comparison

Feature Comparison

| Feature | Custom Implementation | Matchbox |
|---|---|---|
| UEFI HTTP Boot | ✅ Native (standard HTTP) | ✅ Built-in |
| HTTP/HTTPS Boot | ✅ Via z5labs/humus | ✅ Built-in |
| Cloud Run Deployment | ✅ Preferred option | ✅ Enabled by HTTP-only |
| Boot Scripting | ✅ Custom templates | ✅ Go templates |
| Machine-to-Image Mapping | ✅ Firestore/JSON | ✅ YAML groups with selectors |
| Boot Profile Management | ✅ Custom API | ✅ gRPC API + YAML |
| Cloud-Init Support | ⚠️ Requires implementation | ✅ Native support |
| Ignition Support | ❌ Not planned | ✅ Native support (Flatcar, CoreOS) |
| Asset Versioning | ⚠️ Requires implementation | ⚠️ Manual (via Cloud Storage versioning) |
| Rollback Capability | ⚠️ Requires implementation | ✅ Update group to previous profile |
| OpenTelemetry Observability | ✅ Built-in | ⚠️ Logs only (requires parsing) |
| GCP Cloud Storage Integration | ✅ Native SDK | ⚠️ Requires sync scripts |
| HTTP REST Admin API | ✅ Native (z5labs/humus) | ⚠️ gRPC only |
| Multi-Environment Support | ⚠️ Requires implementation | ✅ Groups + metadata |

Development Effort Comparison

| Task | Custom Implementation | Matchbox |
|---|---|---|
| Initial Setup | 1-2 days (project scaffolding) | 4-8 hours (deployment + config) |
| UEFI HTTP Boot | 1-2 days (standard HTTP endpoints) | ✅ Included |
| HTTP Boot API | 2-3 days (z5labs/humus endpoints) | ✅ Included |
| Machine Matching Logic | 2-3 days (database queries, selectors) | ✅ Included |
| Boot Script Templates | 2-3 days (boot script templating) | ✅ Included |
| Cloud-Init Support | 3-5 days (parsing, injection) | ✅ Included |
| Asset Management | 2-3 days (upload, storage) | ✅ Included |
| HTTP REST Admin API | 2-3 days (OpenAPI endpoints) | ✅ Included (gRPC) |
| Cloud Run Deployment | 1 day (Cloud Run config) | 1 day (Cloud Run config) |
| Testing | 3-5 days (unit, integration, E2E - simplified) | 2-3 days (integration only) |
| Documentation | 2-3 days | 1 day (reference existing docs) |
| Total Effort | 2-3 weeks | 1 week |

Operational Complexity

| Aspect | Custom Implementation | Matchbox |
|---|---|---|
| Deployment | Docker container on Compute Engine | Docker container on Compute Engine |
| Configuration Updates | API calls or Terraform updates | YAML file updates + API/filesystem sync |
| Monitoring | OpenTelemetry metrics to Cloud Monitoring | Log parsing + custom metrics |
| Troubleshooting | Full access to code, custom logging | Matchbox logs + gRPC API inspection |
| Security Patches | Manual code updates | Upstream container image updates |
| Dependency Updates | Manual Go module updates | Upstream Matchbox updates |
| Backup/Restore | Cloud Storage + Firestore backups | Sync /var/lib/matchbox to Cloud Storage |

Cost Comparison Summary

Comparing Cloud Run Deployments (Preferred for both options):

| Item | Custom (Cloud Run) | Matchbox (Cloud Run) | Difference |
|---|---|---|---|
| Compute | Cloud Run ($3.50/month) | Cloud Run ($7/month) | +$3.50/month |
| Storage | Cloud Storage ($1/month) | Cloud Storage ($1/month) | $0 |
| Development | 2-3 weeks @ $100/hour = $8,000-12,000 | 1 week @ $100/hour = $4,000 | -$4,000-8,000 |
| Annual Infrastructure | ~$54 | ~$96 | +$42/year |
| TCO (Year 1) | ~$8,054-12,054 | ~$4,096 | -$3,958-7,958 |
| TCO (Year 3) | ~$8,162-12,162 | ~$4,288 | -$3,874-7,874 |

Key Insights:

  • UEFI HTTP boot enables Cloud Run deployment for both options, dramatically reducing infrastructure costs
  • Custom implementation TCO gap narrowed from $7,895-11,895 to $3,958-7,958 (Year 1)
  • Both options now cost ~$5-8/month for infrastructure (vs $8-17/month with TFTP)
  • Development time difference reduced from 2-3 weeks to 1-2 weeks
  • Decision is much closer than originally assessed

Risk Analysis

| Risk | Custom Implementation | Matchbox | Mitigation |
|---|---|---|---|
| Security Vulnerabilities | Medium (standard HTTP code, well-understood) | Medium (upstream dependency) | Both: Monitor for security updates, automated deployments |
| Boot Failures | Medium (HTTP-only reduces complexity) | Low (battle-tested) | Custom: Comprehensive E2E testing with real hardware |
| Cloud Run Cold Starts | Medium (needs validation) | Medium (needs validation) | Both: Min instances = 1 (always-on) |
| Maintenance Burden | Medium (ongoing code maintenance) | Low (upstream handles updates) | Both: Automated deployment pipelines |
| GCP Integration Issues | Low (native SDK) | Medium (sync scripts) | Matchbox: Robust sync with error handling |
| Scalability Limits | Low (Cloud Run autoscaling) | Low (handles thousands of nodes) | Both: Monitor boot request latency |
| Dependency Abandonment | N/A (no external deps) | Low (Red Hat backing) | Matchbox: Can fork if necessary |

Implementation Plan

Phase 1: Core Boot Server (Week 1)

  1. Project Setup (1-2 days)

    • Create Go project with z5labs/humus framework
    • Set up OpenAPI specification for HTTP REST admin API
    • Configure Cloud Storage and Firestore clients
    • Implement basic health check endpoints
  2. UEFI HTTP Boot Endpoints (2-3 days)

    • HTTP endpoint serving boot scripts (iPXE format)
    • Kernel and initrd streaming from Cloud Storage
    • MAC-based machine matching using Firestore
    • Boot script templating with machine-specific parameters
  3. Testing & Deployment (2-3 days)

    • Deploy to Cloud Run with min instances = 1
    • Configure WireGuard VPN connectivity
    • Test UEFI HTTP boot from HP DL360 Gen 9 (iLO 4 v2.40+)
    • Validate boot latency and Cloud Run cold start metrics (see the curl sketch after this list)
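
A quick latency smoke test for this phase, run from a host inside the WireGuard tunnel; the hostname and asset path are illustrative:

# Time the boot script request (includes any Cloud Run cold start)
curl -o /dev/null -s -w 'boot script: %{time_total}s\n' \
  'http://boot.internal/boot?mac=aa:bb:cc:dd:ee:ff'

# Time a kernel download end to end
curl -o /dev/null -s -w 'kernel: %{time_total}s (%{size_download} bytes)\n' \
  'http://boot.internal/kernels/ubuntu-22.04.img'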

Phase 2: Admin API & Management (Week 2)

  1. HTTP REST Admin API (2-3 days)

    • Boot image upload endpoints (kernel, initrd, metadata)
    • Machine-to-image mapping management
    • Boot profile CRUD operations
    • Asset versioning and integrity validation
  2. Cloud-Init Integration (2-3 days)

    • Cloud-init configuration templating
    • Metadata injection for machine-specific settings
    • Integration with boot workflow
  3. Observability & Documentation (2-3 days)

    • OpenTelemetry metrics integration
    • Cloud Monitoring dashboards
    • API documentation
    • Operational runbooks

Success Criteria

  • ✅ Successfully boot HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
  • ✅ Boot latency < 100ms for HTTP requests (kernel/initrd downloads)
  • ✅ Cloud Run cold start latency < 100ms (with min instances = 1)
  • ✅ Machine-to-image mapping works correctly based on MAC address
  • ✅ Cloud Storage integration functional (upload, retrieve boot assets)
  • ✅ HTTP REST API fully functional for boot configuration management
  • ✅ Firestore stores machine mappings and boot profiles correctly
  • ✅ OpenTelemetry metrics available in Cloud Monitoring
  • ✅ Configuration update workflow clear and documented
  • ✅ Firmware compatibility confirmed (no TFTP fallback needed)

More Information

Future Considerations

  1. High Availability: If boot server uptime becomes critical, evaluate multi-region deployment or failover strategies
  2. Multi-Cloud: If multi-cloud strategy emerges, custom implementation provides better portability
  3. Enterprise Features: If advanced provisioning workflows required (bare metal Kubernetes, Ignition support, etc.), evaluate adding features to custom implementation
  4. Asset Versioning: Implement comprehensive boot image versioning and rollback capabilities beyond basic Cloud Storage versioning
  5. Multi-Environment Support: Add support for multiple environments (dev, staging, prod) with environment-specific boot profiles
  • Issue #601 - story(docs): create adr for network boot infrastructure on google cloud
  • Issue #595 - story(docs): create adr for network boot architecture
  • Issue #597 - story(docs): create adr for cloud provider selection