Technology Analysis

In-depth analysis of technologies and tools evaluated for home lab infrastructure

This section contains detailed research and analysis of various technologies evaluated for potential use in the home lab infrastructure.

Network Boot & Provisioning

  • Matchbox - Network boot service for bare-metal provisioning
    • Comprehensive analysis of PXE/iPXE/GRUB support
    • Configuration model (profiles, groups, templating)
    • Deployment patterns and operational considerations
    • Use case evaluation and comparison with alternatives

Cloud Providers

  • Google Cloud Platform - GCP capabilities for network boot infrastructure
    • Network boot protocol support (TFTP, HTTP, HTTPS)
    • WireGuard VPN deployment and integration
    • Cost analysis and performance considerations
  • Amazon Web Services - AWS capabilities for network boot infrastructure
    • Network boot protocol support (TFTP, HTTP, HTTPS)
    • WireGuard VPN deployment and integration
    • Cost analysis and performance considerations

Operating Systems

  • Server Operating Systems - OS evaluation for Kubernetes homelab infrastructure
    • Ubuntu Server analysis (kubeadm, k3s, MicroK8s)
    • Fedora Server analysis (kubeadm with CRI-O)
    • Talos Linux analysis (purpose-built Kubernetes OS)
    • Harvester HCI analysis (hyperconverged platform)
    • Comparison of setup complexity, maintenance, security, and resource overhead

Hardware

Future Analysis Topics

Planned technology evaluations:

  • Storage Solutions: Ceph, GlusterFS, ZFS over iSCSI
  • Container Orchestration: Kubernetes distributions (k3s, Talos, etc.)
  • Observability: Prometheus, Grafana, Loki, Tempo stack
  • Service Mesh: Istio, Linkerd, Cilium comparison
  • CI/CD: GitLab Runner, Tekton, Argo Workflows
  • Secret Management: Vault, External Secrets Operator
  • Load Balancing: MetalLB, kube-vip, Cilium LB-IPAM

1 - Server Operating System Analysis

Evaluation of operating systems for homelab Kubernetes infrastructure

This section provides detailed analysis of operating systems evaluated for the homelab server infrastructure, with a focus on Kubernetes cluster setup and maintenance.

Overview

The selection of a server operating system is critical for homelab infrastructure. The primary evaluation criteria are the ease of Kubernetes cluster initialization and the ongoing maintenance burden.

Evaluated Options

  • Ubuntu - Traditional general-purpose Linux distribution

    • Kubernetes via kubeadm, k3s, or MicroK8s
    • Strong community support and extensive documentation
    • Familiar package management and system administration
  • Fedora - Cutting-edge Linux distribution

    • Latest kernel and system components
    • Kubernetes via kubeadm or k3s
    • Shorter support lifecycle with more frequent upgrades
  • Talos Linux - Purpose-built Kubernetes OS

    • API-driven, immutable infrastructure
    • Built-in Kubernetes with minimal attack surface
    • Designed specifically for container workloads
  • Harvester - Hyperconverged infrastructure platform

    • Built on Rancher and K3s
    • Combines compute, storage, and networking
    • VM and container workloads on unified platform

Evaluation Criteria

Each option is evaluated based on:

  1. Kubernetes Installation Methods - Available tooling and installation approaches
  2. Cluster Initialization Process - Steps required to bootstrap a cluster
  3. Maintenance Requirements - OS updates, Kubernetes upgrades, security patches
  4. Resource Overhead - Memory, CPU, and storage footprint
  5. Learning Curve - Ease of adoption and operational complexity
  6. Community Support - Documentation quality and ecosystem maturity
  7. Security Posture - Attack surface and security-first design

1.1 - Ubuntu Analysis

Analysis of Ubuntu for Kubernetes homelab infrastructure

Overview

Ubuntu Server is a popular general-purpose Linux distribution developed by Canonical. It provides Long Term Support (LTS) releases with 5 years of standard support and optional Extended Security Maintenance (ESM).

Key Facts:

  • Latest LTS: Ubuntu 24.04 LTS (Noble Numbat)
  • Support Period: 5 years standard, 10 years with Ubuntu Pro (free for personal use)
  • Kernel: Linux 6.8+ (LTS), regular HWE updates
  • Package Manager: APT/DPKG, Snap
  • Init System: systemd

Kubernetes Installation Methods

Ubuntu supports multiple Kubernetes installation approaches:

1. kubeadm (Official Kubernetes Tool)

Installation:

# Install container runtime (containerd)
sudo apt-get update
sudo apt-get install -y containerd

# Configure containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd

# Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
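
Before running kubeadm init, swap must be disabled and the standard bridge/forwarding settings applied (the "disable swap, configure kernel modules" step in the initialization sequence diagram below); a typical sketch following the upstream kubeadm prerequisites:

# Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Kernel modules and sysctls for pod networking
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system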

Cluster Initialization:

# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Configure kubectl for admin
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI (e.g., Calico, Flannel)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

# Join worker nodes
kubeadm token create --print-join-command

Pros:

  • Official Kubernetes tooling, well-documented
  • Full control over cluster configuration
  • Supports latest Kubernetes versions
  • Large community and extensive resources

Cons:

  • More manual steps than turnkey solutions
  • Requires understanding of Kubernetes architecture
  • Manual upgrade process for each component
  • More complex troubleshooting

2. k3s (Lightweight Kubernetes)

Installation:

# Single-command install on control plane
curl -sfL https://get.k3s.io | sh -

# Get node token for workers
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on worker nodes
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -

Pros:

  • Extremely simple installation (single command)
  • Lightweight (< 512MB RAM)
  • Built-in container runtime (containerd)
  • Automatic updates via Rancher System Upgrade Controller
  • Great for edge and homelab use cases

Cons:

  • Less customizable than kubeadm
  • Some features removed (e.g., in-tree storage, cloud providers)
  • Slightly different from upstream Kubernetes

3. MicroK8s (Canonical’s Distribution)

Installation:

# Install via snap
sudo snap install microk8s --classic

# Join cluster
sudo microk8s add-node
# Run output command on worker nodes

# Enable addons
microk8s enable dns storage ingress

Pros:

  • Zero-ops, single package install
  • Snap-based automatic updates
  • Addons for common services (DNS, storage, ingress)
  • Canonical support available

Cons:

  • Requires snap (not universally liked)
  • Less ecosystem compatibility than vanilla Kubernetes
  • Ubuntu-specific (less portable)

Cluster Initialization Sequence

kubeadm Approach

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system (apt update && upgrade)
    Admin->>Server: Install containerd
    Server->>Server: Configure containerd (CRI)
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, configure kernel modules
    Admin->>K8s: kubeadm init --pod-network-cidr=10.244.0.0/16
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s->>Server: Start controller-manager
    K8s->>Server: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply -f calico.yaml
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (on workers)
    K8s->>Server: Add worker nodes
    K8s-->>Admin: Cluster ready

k3s Approach

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize etcd (embedded)
    K3s->>Server: Start API server
    K3s->>Server: Start controller-manager
    K3s->>Server: Start scheduler
    K3s->>Server: Deploy built-in CNI (Flannel)
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers to cluster
    K3s-->>Admin: Cluster ready (5-10 minutes total)

Maintenance Requirements

OS Updates

Security Patches:

# Automatic security updates (recommended)
sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# Manual updates
sudo apt-get update
sudo apt-get upgrade

Frequency:

  • Security patches: Weekly to monthly
  • Kernel updates: Monthly (may require reboot)
  • Major version upgrades: Every 2 years (LTS to LTS)
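
LTS-to-LTS upgrades are handled by Ubuntu's release upgrader; a minimal sketch (the tool ships with Ubuntu Server as part of update-manager-core):

# Fully update the current release, then move to the next LTS
sudo apt-get update && sudo apt-get dist-upgrade -y
sudo do-release-upgrade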

Kubernetes Upgrades

kubeadm Upgrade:

# Upgrade control plane
sudo apt-get update
sudo apt-get install -y kubeadm=1.32.0-*
sudo kubeadm upgrade apply v1.32.0
sudo apt-get install -y kubelet=1.32.0-* kubectl=1.32.0-*
sudo systemctl restart kubelet

# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo apt-get install -y kubeadm=1.32.0-* kubelet=1.32.0-* kubectl=1.32.0-*
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>

k3s Upgrade:

# Manual upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -

# Automatic upgrade via system-upgrade-controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
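
The controller acts on Plan custom resources; a minimal sketch of a plan for k3s server nodes (the version, node selector, and concurrency values are illustrative):

# k3s-server-plan.yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.32.0+k3s1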

Upgrade Frequency: Every 3-6 months (Kubernetes minor versions)

Resource Overhead

Minimal Installation (Ubuntu Server + k3s):

  • RAM: ~512MB (OS) + 512MB (k3s) = 1GB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: 10GB (OS) + 10GB (container images) = 20GB
  • Network: 1 Gbps recommended

Full Installation (Ubuntu Server + kubeadm):

  • RAM: ~512MB (OS) + 1-2GB (Kubernetes components) = 2GB+ total
  • CPU: 2 cores minimum
  • Disk: 15GB (OS) + 20GB (container images/etcd) = 35GB
  • Network: 1 Gbps recommended

Security Posture

Strengths:

  • Regular security updates via Ubuntu Security Team
  • AppArmor enabled by default
  • SELinux support available
  • Kernel hardening features (ASLR, stack protection)
  • Ubuntu Pro ESM for extended CVE coverage (free for personal use)

Attack Surface:

  • Full general-purpose OS (larger attack surface than minimal OS)
  • Many installed packages by default (can be minimized)
  • Requires manual hardening for production use

Hardening Steps:

# Disable unnecessary services
sudo systemctl disable snapd.service
sudo systemctl disable bluetooth.service

# Configure firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 6443/tcp  # Kubernetes API
sudo ufw allow 10250/tcp # Kubelet
sudo ufw enable

# CIS Kubernetes Benchmark compliance
# Use tools like kube-bench for validation

Learning Curve

Ease of Adoption: ⭐⭐⭐⭐⭐ (Excellent)

  • Most familiar Linux distribution for many users
  • Extensive documentation and tutorials
  • Large community support (forums, Stack Overflow)
  • Straightforward package management
  • Similar to Debian-based systems

Required Knowledge:

  • Basic Linux system administration (apt, systemd, networking)
  • Kubernetes concepts (pods, services, deployments)
  • Container runtime basics (containerd, Docker)
  • Text editor (vim, nano) for configuration

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐⭐ (Excellent)

  • Documentation: Comprehensive official docs, community guides
  • Community: Massive user base, active forums
  • Commercial Support: Available from Canonical (Ubuntu Pro)
  • Third-Party Tools: Excellent compatibility with all Kubernetes tools
  • Tutorials: Abundant resources for Kubernetes on Ubuntu

Pros and Cons Summary

Pros

  • Good, because most familiar and well-documented Linux distribution
  • Good, because 5-year LTS support (10 years with Ubuntu Pro)
  • Good, because multiple Kubernetes installation options (kubeadm, k3s, MicroK8s)
  • Good, because k3s provides extremely simple setup (single command)
  • Good, because extensive package ecosystem (60,000+ packages)
  • Good, because strong community support and resources
  • Good, because automatic security updates available
  • Good, because low learning curve for most administrators
  • Good, because compatible with all Kubernetes tooling and addons
  • Good, because Ubuntu Pro free for personal use (extended security)

Cons

  • Bad, because general-purpose OS has larger attack surface than minimal OS
  • Bad, because more resource overhead than purpose-built Kubernetes OS (1-2GB RAM)
  • Bad, because requires manual OS updates and reboots
  • Bad, because kubeadm setup is complex with many manual steps
  • Bad, because snap packages controversial (for MicroK8s)
  • Bad, because Kubernetes upgrades require manual intervention (unless using k3s auto-upgrade)
  • Bad, because managing OS + Kubernetes lifecycle separately increases complexity
  • Neutral, because many preinstalled packages (can be removed, but require effort)

Recommendations

Best for:

  • Users familiar with Ubuntu/Debian ecosystem
  • Homelabs requiring general-purpose server functionality (not just Kubernetes)
  • Teams wanting multiple Kubernetes installation options
  • Users prioritizing community support and documentation

Best Installation Method:

  • Homelab/Learning: k3s (simplest, auto-updates, lightweight)
  • Production-like: kubeadm (full control, upstream Kubernetes)
  • Ubuntu-specific: MicroK8s (Canonical support, snap-based)

Avoid if:

  • Seeking minimal attack surface (consider Talos Linux)
  • Want infrastructure-as-code for OS layer (consider Talos Linux)
  • Prefer hyperconverged platform (consider Harvester)

1.2 - Fedora Analysis

Analysis of Fedora Server for Kubernetes homelab infrastructure

Overview

Fedora Server is a cutting-edge Linux distribution sponsored by Red Hat, serving as the upstream for Red Hat Enterprise Linux (RHEL). It emphasizes innovation with the latest software packages and kernel versions.

Key Facts:

  • Latest Version: Fedora 41 (October 2024)
  • Support Period: ~13 months per release (shorter than Ubuntu LTS)
  • Kernel: Linux 6.11+ (latest stable)
  • Package Manager: DNF/RPM, Flatpak
  • Init System: systemd

Kubernetes Installation Methods

Fedora supports standard Kubernetes installation approaches:

1. kubeadm (Official Kubernetes Tool)

Installation:

# Install container runtime (CRI-O preferred on Fedora)
sudo dnf install -y cri-o
sudo systemctl enable --now crio

# Add Kubernetes repository
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
EOF

# Install kubeadm, kubelet, kubectl
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet

Cluster Initialization:

# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

# Join workers
kubeadm token create --print-join-command

Pros:

  • CRI-O is native to Fedora ecosystem (same as RHEL/OpenShift)
  • Latest Kubernetes versions available quickly
  • Familiar to RHEL/CentOS users
  • Fully upstream Kubernetes

Cons:

  • Manual setup process (same as Ubuntu/kubeadm)
  • Requires Kubernetes knowledge
  • More complex than turnkey solutions

2. k3s (Lightweight Kubernetes)

Installation:

# Same single-command install
curl -sfL https://get.k3s.io | sh -

# Retrieve token
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -

Pros:

  • Simple installation (identical to Ubuntu)
  • Lightweight and fast
  • Well-tested on Fedora/RHEL family

Cons:

  • Less customizable
  • Does not use Fedora's native CRI-O by default (k3s ships its own embedded containerd)

3. OKD (OpenShift Kubernetes Distribution)

Installation (Single-Node):

# Download and install OKD
wget https://github.com/okd-project/okd/releases/download/4.15.0-0.okd-2024-01-27-070424/openshift-install-linux-4.15.0-0.okd-2024-01-27-070424.tar.gz
tar -xvf openshift-install-linux-*.tar.gz
sudo mv openshift-install /usr/local/bin/

# Create install config
./openshift-install create install-config --dir=cluster

# Install cluster
./openshift-install create cluster --dir=cluster

Pros:

  • Enterprise features (operators, web console, image registry)
  • Built-in CI/CD and developer tools
  • Based on Fedora CoreOS (immutable, auto-updating)

Cons:

  • Very heavy resource requirements (16GB+ RAM)
  • Complex installation and management
  • Overkill for simple homelab use

Cluster Initialization Sequence

kubeadm with CRI-O

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Install CRI-O
    Server->>Server: Configure CRI-O runtime
    Server->>Server: Enable crio.service
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, load kernel modules
    Server->>Server: Configure SELinux (permissive for Kubernetes)
    Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s->>Server: Start controller-manager
    K8s->>Server: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply CNI
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (workers)
    K8s->>Server: Add worker nodes
    K8s-->>Admin: Cluster ready

k3s Approach

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Disable firewalld (or configure)
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize embedded etcd
    K3s->>Server: Start API server
    K3s->>Server: Deploy built-in CNI
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers
    K3s-->>Admin: Cluster ready (5-10 minutes)
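
If firewalld is left enabled instead of disabled, the k3s documentation recommends opening the API port and trusting the default pod and service CIDRs; a sketch (CIDRs shown are the k3s defaults):

sudo firewall-cmd --permanent --add-port=6443/tcp                        # Kubernetes API
sudo firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16   # pod CIDR
sudo firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16   # service CIDR
sudo firewall-cmd --reload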

Maintenance Requirements

OS Updates

Security and System Updates:

# Automatic updates (dnf-automatic)
sudo dnf install -y dnf-automatic
sudo systemctl enable --now dnf-automatic.timer

# Manual updates
sudo dnf update -y
sudo reboot  # if kernel updated

Frequency:

  • Security patches: Weekly to monthly
  • Kernel updates: Monthly (frequent updates)
  • Major version upgrades: Every ~13 months (Fedora releases)

Version Upgrade:

# Upgrade to next Fedora release
sudo dnf upgrade --refresh
sudo dnf install dnf-plugin-system-upgrade
sudo dnf system-upgrade download --releasever=42
sudo dnf system-upgrade reboot

Kubernetes Upgrades

kubeadm Upgrade:

# Upgrade control plane
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
sudo systemctl restart kubelet

# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo dnf update -y kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>

k3s Upgrade: Same as Ubuntu (curl script or system-upgrade-controller)

Upgrade Frequency: Kubernetes every 3-6 months, Fedora OS every ~13 months

Resource Overhead

Minimal Installation (Fedora Server + k3s):

  • RAM: ~600MB (OS) + 512MB (k3s) = 1.2GB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: 12GB (OS) + 10GB (containers) = 22GB
  • Network: 1 Gbps recommended

Full Installation (Fedora Server + kubeadm + CRI-O):

  • RAM: ~700MB (OS) + 1.5GB (Kubernetes) = 2.2GB total
  • CPU: 2 cores minimum
  • Disk: 15GB (OS) + 20GB (containers) = 35GB
  • Network: 1 Gbps recommended

Note: Slightly higher overhead than Ubuntu due to SELinux and newer components.

Security Posture

Strengths:

  • SELinux enabled by default (stronger than AppArmor)
  • Latest security patches and kernel (bleeding edge)
  • CRI-O container runtime (security-focused, used by OpenShift)
  • Shorter support window = less legacy CVEs
  • Active security team and rapid response

Attack Surface:

  • General-purpose OS (larger surface than minimal OS)
  • More installed packages than minimal server
  • SELinux can be complex to configure for Kubernetes

Hardening Steps:

# Configure firewall (firewalld default on Fedora)
sudo firewall-cmd --permanent --add-port=6443/tcp  # API server
sudo firewall-cmd --permanent --add-port=10250/tcp # Kubelet
sudo firewall-cmd --reload

# SELinux configuration for Kubernetes
sudo setenforce 0  # Permissive (Kubernetes not fully SELinux-ready)
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# Disable unnecessary services
sudo systemctl disable bluetooth.service

Learning Curve

Ease of Adoption: ⭐⭐⭐⭐ (Good)

  • Familiar for RHEL/CentOS/Alma/Rocky users
  • DNF package manager (similar to APT)
  • Excellent documentation
  • SELinux learning curve can be steep

Required Knowledge:

  • RPM-based system administration (dnf, systemd)
  • SELinux basics (or willingness to use permissive mode)
  • Kubernetes concepts
  • Firewalld configuration

Differences from Ubuntu:

  • DNF vs APT package manager
  • SELinux vs AppArmor
  • Firewalld vs UFW
  • Faster release cycle (more frequent upgrades)

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐ (Good)

  • Documentation: Excellent official docs, Red Hat resources
  • Community: Large user base, active forums
  • Commercial Support: RHEL support available (paid)
  • Third-Party Tools: Good compatibility with Kubernetes tools
  • Tutorials: Abundant resources, especially for RHEL ecosystem

Pros and Cons Summary

Pros

  • Good, because latest kernel and software packages (bleeding edge)
  • Good, because SELinux enabled by default (stronger MAC than AppArmor)
  • Good, because native CRI-O support (same as RHEL/OpenShift)
  • Good, because upstream for RHEL (enterprise compatibility)
  • Good, because multiple Kubernetes installation options
  • Good, because k3s simplifies setup dramatically
  • Good, because strong security focus and rapid CVE response
  • Good, because familiar to RHEL/CentOS ecosystem
  • Good, because automatic updates available (dnf-automatic)
  • Neutral, because the rapid release cycle keeps packages current, at the cost of a shorter support window (13 months)

Cons

  • Bad, because short support cycle requires frequent OS upgrades (every ~13 months)
  • Bad, because bleeding-edge packages can introduce instability
  • Bad, because SELinux configuration for Kubernetes is complex (often set to permissive)
  • Bad, because smaller community than Ubuntu (though still large)
  • Bad, because general-purpose OS has larger attack surface than minimal OS
  • Bad, because more resource overhead than purpose-built Kubernetes OS
  • Bad, because OS upgrade every 13 months adds maintenance burden
  • Bad, because less beginner-friendly than Ubuntu
  • Bad, because managing OS + Kubernetes lifecycle separately
  • Neutral, because rapid release cycle can be pro or con depending on preference

Recommendations

Best for:

  • Users familiar with RHEL/CentOS/Rocky/Alma ecosystem
  • Teams wanting latest kernel and software features
  • Environments requiring SELinux (compliance, enterprise standards)
  • Learning OpenShift/OKD ecosystem (Fedora CoreOS foundation)
  • Users comfortable with frequent OS upgrades

Best Installation Method:

  • Homelab/Learning: k3s (simplest, lightweight)
  • Enterprise-like: kubeadm + CRI-O (OpenShift compatibility)
  • Advanced: OKD (if resources available, 16GB+ RAM)

Avoid if:

  • Prefer long-term stability (choose Ubuntu LTS)
  • Want minimal maintenance (frequent Fedora upgrades required)
  • Seeking minimal attack surface (consider Talos Linux)
  • Uncomfortable with SELinux complexity
  • Want infrastructure-as-code for OS (consider Talos Linux)

Comparison with Ubuntu

| Aspect | Fedora | Ubuntu LTS |
|--------|--------|------------|
| Support Period | 13 months | 5 years (10 with Pro) |
| Kernel | Latest (6.11+) | LTS (6.8+) |
| Security | SELinux | AppArmor |
| Package Manager | DNF/RPM | APT/DEB |
| Release Cycle | 6 months | 2 years (LTS) |
| Upgrade Frequency | Every 13 months | Every 2-5 years |
| Community Size | Large | Very Large |
| Enterprise Upstream | RHEL | N/A |
| Stability | Bleeding edge | Stable/Conservative |
| Learning Curve | Moderate | Easy |

Verdict: Fedora is excellent for those wanting latest features and comfortable with frequent upgrades. Ubuntu LTS is better for long-term stability and minimal maintenance.

1.3 - Talos Linux Analysis

Analysis of Talos Linux for Kubernetes homelab infrastructure

Overview

Talos Linux is a modern operating system designed specifically for running Kubernetes. It is API-driven, immutable, and minimal, with no SSH access, shell, or package manager. All configuration is done via a declarative API.

Key Facts:

  • Latest Version: Talos 1.9 (supports Kubernetes 1.31)
  • Support: Community-driven, commercial support available from Sidero Labs
  • Kernel: Linux 6.6+ LTS
  • Architecture: Immutable, API-driven, no shell access
  • Management: talosctl CLI + Kubernetes API

Kubernetes Installation Methods

Talos Linux has built-in Kubernetes - there is only one installation method.

Built-in Kubernetes (Only Option)

Installation Process:

  1. Boot Talos ISO/PXE (maintenance mode)
  2. Apply machine configuration via talosctl
  3. Bootstrap Kubernetes via talosctl bootstrap

Machine Configuration (YAML):

# controlplane.yaml
version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.1.10/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

Cluster Initialization:

# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443

# Apply config to control plane node (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml

# Wait for install to complete, then bootstrap
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10

# Retrieve kubeconfig
talosctl kubeconfig --nodes 192.168.1.10 --endpoints 192.168.1.10

# Apply config to worker nodes
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
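
The worker.yaml referenced above is produced by the same talosctl gen config step; an abridged sketch of its machine-specific portion (hostname, addresses, and disk are illustrative):

# worker.yaml (abridged)
version: v1alpha1
machine:
  type: worker
  install:
    disk: /dev/sda
  network:
    hostname: worker-1
    interfaces:
      - interface: eth0
        dhcp: true
cluster:
  controlPlane:
    endpoint: https://192.168.1.10:6443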

Pros:

  • Kubernetes built-in, no separate installation
  • Declarative configuration (GitOps-friendly)
  • Extremely minimal attack surface (no shell, no SSH)
  • Immutable infrastructure (no configuration drift; changes are applied declaratively)
  • Automatic updates via Talos controller
  • Designed from ground up for Kubernetes

Cons:

  • Steep learning curve (completely different paradigm)
  • No SSH/shell access (all via API)
  • Troubleshooting requires different mindset
  • Limited to Kubernetes workloads only (not general-purpose)
  • Smaller community than traditional distros

Cluster Initialization Sequence

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Talos as Talos Linux
    participant K8s as Kubernetes Components
    
    Admin->>Server: Boot Talos ISO (PXE or USB)
    Server->>Talos: Start in maintenance mode
    Talos-->>Admin: API endpoint ready (no shell)
    Admin->>Admin: Generate configs (talosctl gen config)
    Admin->>Talos: talosctl apply-config (controlplane.yaml)
    Talos->>Server: Partition disk
    Talos->>Server: Install Talos to /dev/sda
    Talos->>Server: Write machine config
    Server->>Server: Reboot from disk
    Talos->>Talos: Load machine config
    Talos->>K8s: Start kubelet
    Talos->>K8s: Start etcd
    Talos->>K8s: Start API server
    Admin->>Talos: talosctl bootstrap
    Talos->>K8s: Initialize cluster
    K8s->>Talos: Start controller-manager
    K8s->>Talos: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: Apply CNI (via talosctl or kubectl)
    K8s->>Talos: Deploy CNI pods
    Admin->>Talos: Apply worker configs
    Talos->>K8s: Join workers to cluster
    K8s-->>Admin: Cluster ready (10-15 minutes)

Maintenance Requirements

OS Updates

Declarative Upgrades:

# Upgrade Talos version (rolling upgrade)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0

# Kubernetes version upgrade (also declarative)
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0

Automatic Updates (via Talos System Extensions):

# machine config with auto-update extension
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/system-upgrade-controller

Frequency:

  • Talos releases: Every 2-3 months
  • Kubernetes upgrades: Follow upstream cadence (quarterly)
  • Security patches: Built into Talos releases
  • No traditional OS patching (immutable system)

Configuration Changes

All changes via machine config:

# Edit machine config YAML
vim controlplane.yaml

# Apply updated config (triggers reboot if needed)
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml

No manual package installs - everything declarative.

Resource Overhead

Minimal Footprint (Talos Linux + Kubernetes):

  • RAM: ~256MB (OS) + 512MB (Kubernetes) = 768MB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: ~500MB (OS) + 10GB (container images/etcd) = 10-15GB total
  • Network: 1 Gbps recommended

Comparison:

  • Ubuntu + k3s: ~1GB RAM
  • Ubuntu + kubeadm: ~2GB RAM
  • Talos: ~768MB RAM (lightest of the three)

Minimal install size: ~500MB (vs 10GB+ for Ubuntu/Fedora)

Security Posture

Strengths: ⭐⭐⭐⭐⭐ (Excellent)

  • No SSH access - attack surface eliminated
  • No shell - cannot install malware
  • No package manager - no additional software installation
  • Immutable filesystem - rootfs read-only
  • Minimal components: Only Kubernetes and essential services
  • API-only access - mTLS-authenticated talosctl
  • KSPP compliance: Kernel Self-Protection Project standards
  • Signed images: Cryptographically signed Talos images
  • Secure Boot support: UEFI Secure Boot compatible

Attack Surface:

  • Smallest possible: Only Kubernetes API, kubelet, and Talos API
  • ~30 running processes (vs 100+ on Ubuntu/Fedora)
  • ~200MB filesystem (vs 5-10GB on Ubuntu/Fedora)

No hardening needed - secure by default.

Security Features:

# Built-in security (example config)
machine:
  sysctls:
    kernel.kptr_restrict: "2"
    kernel.yama.ptrace_scope: "1"
  kernel:
    modules:
      - name: br_netfilter
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader

Learning Curve

Ease of Adoption: ⭐⭐ (Challenging)

  • Paradigm shift: No shell/SSH, API-only management
  • Requires understanding of declarative infrastructure
  • Talosctl CLI has learning curve
  • Excellent documentation helps
  • Different troubleshooting approach (logs via API)

Required Knowledge:

  • Kubernetes fundamentals (critical)
  • YAML configuration syntax
  • Networking basics (especially CNI)
  • GitOps concepts helpful
  • Comfort with “infrastructure as code”

Debugging without shell:

# View logs via API
talosctl logs --nodes 192.168.1.10 kubelet

# Get system metrics
talosctl dashboard --nodes 192.168.1.10

# Service status
talosctl service --nodes 192.168.1.10

Community Support

Ecosystem Maturity: ⭐⭐⭐ (Growing)

  • Documentation: Excellent official docs
  • Community: Smaller but very active (Slack, GitHub Discussions)
  • Commercial Support: Available from Sidero Labs
  • Third-Party Tools: Growing ecosystem (Cluster API, GitOps tools)
  • Tutorials: Increasing number of community guides

Community Size: Smaller than Ubuntu/Fedora, but dedicated and helpful.

Pros and Cons Summary

Pros

  • Good, because Kubernetes is built-in (no separate installation)
  • Good, because minimal attack surface (no SSH, shell, or package manager)
  • Good, because immutable infrastructure (config drift impossible)
  • Good, because API-driven management (GitOps-friendly)
  • Good, because extremely low resource overhead (~768MB RAM)
  • Good, because automatic security patches via Talos upgrades
  • Good, because declarative configuration (version-controlled)
  • Good, because secure by default (no hardening required)
  • Good, because smallest disk footprint (~500MB OS)
  • Good, because designed specifically for Kubernetes (opinionated and optimized)
  • Good, because UEFI Secure Boot support
  • Good, because upgrades are simple and declarative (talosctl upgrade)

Cons

  • Bad, because steep learning curve (no shell/SSH paradigm shift)
  • Bad, because limited to Kubernetes workloads only (not general-purpose)
  • Bad, because troubleshooting without shell requires different approach
  • Bad, because smaller community than Ubuntu/Fedora
  • Bad, because relatively new (less mature than traditional distros)
  • Bad, because no escape hatch for manual intervention
  • Bad, because requires comfort with declarative infrastructure
  • Bad, because debugging is harder for beginners
  • Neutral, because opinionated design (pro for K8s-only, con for general use)

Recommendations

Best for:

  • Kubernetes-dedicated infrastructure (no general-purpose workloads)
  • Security-focused environments (minimal attack surface)
  • GitOps workflows (declarative configuration)
  • Immutable infrastructure advocates
  • Teams comfortable with API-driven management
  • Production Kubernetes clusters (once team is trained)

Best Installation Method:

  • Only option: Built-in Kubernetes via talosctl

Avoid if:

  • Need general-purpose server functionality (SSH, cron jobs, etc.)
  • Team unfamiliar with Kubernetes (too steep a learning curve)
  • Require shell access for troubleshooting comfort
  • Want traditional package management (apt, dnf)
  • Prefer familiar Linux administration tools

Comparison with Ubuntu and Fedora

| Aspect | Talos Linux | Ubuntu + k3s | Fedora + kubeadm |
|--------|-------------|--------------|------------------|
| K8s Installation | Built-in | Single command | Manual (kubeadm) |
| Attack Surface | Minimal (~30 processes) | Medium (~100) | Medium (~100) |
| Resource Overhead | 768MB RAM | 1GB RAM | 2.2GB RAM |
| Disk Footprint | 500MB | 10GB | 15GB |
| Security Model | Immutable, no shell | AppArmor, shell | SELinux, shell |
| Management | API-only (talosctl) | SSH + kubectl | SSH + kubectl |
| Learning Curve | Steep | Easy | Moderate |
| Community Size | Small (growing) | Very Large | Large |
| Support Period | Rolling releases | 5-10 years | 13 months |
| Use Case | Kubernetes only | General-purpose | General-purpose |
| Upgrades | Declarative, simple | Manual OS + K8s | Manual OS + K8s |
| Configuration | Declarative YAML | Imperative + YAML | Imperative + YAML |
| Troubleshooting | API logs/metrics | SSH + logs | SSH + logs |
| GitOps-Friendly | Excellent | Good | Good |
| Best for | K8s-dedicated infra | Homelabs, learning | RHEL ecosystem |

Verdict: Talos is the most secure and efficient option for Kubernetes-only infrastructure, but requires team buy-in to API-driven, immutable paradigm. Ubuntu/Fedora better for general-purpose servers or teams wanting shell access.

Advanced Features

Talos System Extensions

Extend Talos functionality with extensions:

machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20240312
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4

Cluster API Integration

Talos works natively with Cluster API:

# Install Cluster API with the Talos bootstrap and control-plane providers
# (Sidero is the matching bare-metal infrastructure provider)
clusterctl init --bootstrap talos --control-plane talos --infrastructure sidero

# Create cluster from template
clusterctl generate cluster homelab --infrastructure sidero > cluster.yaml
kubectl apply -f cluster.yaml

Image Factory

Custom Talos images with extensions:

# Register a schematic (the YAML file lists the desired extensions); returns a schematic ID
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

# Download the built image for that schematic ID and Talos version, e.g.
# https://factory.talos.dev/image/<schematic-id>/v1.9.0/metal-amd64.iso

Disaster Recovery

Talos supports etcd backup/restore:

# Backup etcd to a local snapshot file
talosctl etcd snapshot db.snapshot --nodes 192.168.1.10

# Restore from the snapshot
talosctl bootstrap --recover-from ./db.snapshot

Production Readiness

Production Use: ✅ Yes (many companies run Talos in production)

High Availability:

  • 3+ control plane nodes recommended
  • External etcd supported
  • Load balancer for API server

Monitoring:

  • Prometheus metrics built-in
  • Talos dashboard for health
  • Standard Kubernetes observability tools

Example Production Clusters:

  • Sidero Metal (bare metal provisioning)
  • Various cloud providers (AWS, GCP, Azure)
  • Edge deployments (minimal footprint)

1.4 - Harvester Analysis

Analysis of Harvester HCI for Kubernetes homelab infrastructure

Overview

Harvester is a Hyperconverged Infrastructure (HCI) platform built on Kubernetes, designed to provide VM and container management on a unified platform. It combines compute, storage, and networking with built-in K3s for orchestration.

Key Facts:

  • Latest Version: Harvester 1.4 (based on K3s 1.30+)
  • Foundation: Built on RancherOS 2.0, K3s, and KubeVirt
  • Support: Supported by SUSE (acquired Rancher)
  • Architecture: HCI platform with VM + container workloads
  • Management: Web UI + kubectl + Rancher integration

Kubernetes Installation Methods

Harvester includes K3s as its foundation - Kubernetes is built-in.

Built-in K3s (Only Option)

Installation Process:

  1. Boot Harvester ISO (interactive installer or PXE)
  2. Complete installation wizard (web UI or console)
  3. Create cluster (automatic K3s deployment)
  4. Access via web UI or kubectl

Interactive Installation:

# Boot from Harvester ISO
1. Choose "Create a new Harvester cluster"
2. Configure:
   - Cluster token
   - Node role (management/worker/witness)
   - Network interface (management network)
   - VIP (Virtual IP for cluster access)
   - Storage disk (Longhorn persistent storage)
3. Install completes (15-20 minutes)
4. Access web UI at https://<VIP>

Configuration (cloud-init for automated install):

# config.yaml
token: my-cluster-token
os:
  hostname: harvester-node-1
  modules:
    - kvm
  kernel_parameters:
    - intel_iommu=on
install:
  mode: create
  device: /dev/sda
  iso_url: https://releases.rancher.com/harvester/v1.4.0/harvester-v1.4.0-amd64.iso
  vip: 192.168.1.100
  vip_mode: static
  networks:
    harvester-mgmt:
      interfaces:
        - name: eth0
      default_route: true
      ip: 192.168.1.10
      subnet_mask: 255.255.255.0
      gateway: 192.168.1.1

Pros:

  • Complete HCI solution (VMs + containers)
  • Web UI for management (no CLI required)
  • Built-in storage (Longhorn CSI)
  • Built-in networking (multus, SR-IOV)
  • VM live migration
  • Rancher integration for multi-cluster management
  • K3s built-in (no separate Kubernetes install)

Cons:

  • Heavy resource requirements (8GB+ RAM per node)
  • Complex architecture (steep learning curve)
  • Larger attack surface than minimal OS
  • Overkill for container-only workloads
  • Requires 3+ nodes for production HA

Cluster Initialization Sequence

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Harvester as Harvester HCI
    participant K3s as K3s / KubeVirt
    participant Storage as Longhorn Storage
    
    Admin->>Server: Boot Harvester ISO
    Server->>Harvester: Start installation wizard
    Harvester-->>Admin: Interactive console/web UI
    Admin->>Harvester: Configure cluster (token, VIP, storage)
    Harvester->>Server: Partition disks (OS + Longhorn storage)
    Harvester->>Server: Install RancherOS 2.0 base
    Harvester->>Server: Install K3s components
    Server->>Server: Reboot
    Harvester->>K3s: Start K3s server
    K3s->>Server: Initialize control plane
    K3s->>Server: Deploy Harvester operators
    K3s->>Storage: Deploy Longhorn for persistent storage
    K3s->>Server: Deploy KubeVirt for VM management
    K3s->>Server: Deploy multus CNI (multi-network)
    Harvester-->>Admin: Web UI ready at https://<VIP>
    Admin->>Harvester: Add additional nodes (join cluster)
    Harvester->>K3s: Join nodes to cluster
    K3s->>Storage: Replicate storage across nodes
    Harvester-->>Admin: Cluster ready (20-30 minutes)
    Admin->>Harvester: Create VMs or deploy containers

Maintenance Requirements

OS Updates

Harvester Upgrades (includes OS + K3s):

# Via Web UI:
# Settings → Upgrade → Select version → Start upgrade

# Via kubectl (after downloading upgrade image):
kubectl apply -f https://releases.rancher.com/harvester/v1.4.0/version.yaml

# Monitor upgrade progress
kubectl get upgrades -n harvester-system

Frequency:

  • Harvester releases: Every 2-3 months (minor versions)
  • Security patches: Included in Harvester releases
  • K3s upgrades: Bundled with Harvester upgrades
  • No separate OS patching (managed by Harvester)

Kubernetes Upgrades

K3s is upgraded with Harvester - no separate upgrade process.

Version Compatibility:

  • Harvester 1.4.x → K3s 1.30+
  • Harvester 1.3.x → K3s 1.28+
  • Harvester 1.2.x → K3s 1.26+

Upgrade Process:

  1. Web UI or kubectl to trigger upgrade
  2. Rolling upgrade of nodes (one at a time)
  3. VM live migration during node upgrades
  4. Automatic rollback on failure

Resource Overhead

Single Node (Harvester HCI):

  • RAM: 8GB minimum (16GB recommended for VMs)
  • CPU: 4 cores minimum (8 cores recommended)
  • Disk: 250GB minimum (SSD recommended)
    • 100GB for OS/Harvester components
    • 150GB+ for Longhorn storage (VM disks)
  • Network: 1 Gbps minimum (10 Gbps for production)

Three-Node Cluster (Production HA):

  • RAM: 32GB per node (64GB for VM-heavy workloads)
  • CPU: 8 cores per node minimum
  • Disk: 500GB+ per node (NVMe SSD recommended)
  • Network: 10 Gbps recommended (separate storage network ideal)

Comparison:

  • Ubuntu + k3s: 1GB RAM
  • Talos: 768MB RAM
  • Harvester: 8GB+ RAM (much heavier)

Note: Harvester is designed for multi-node HCI, not single-node homelabs.

Security Posture

Strengths:

  • SELinux-based (RancherOS 2.0 foundation)
  • Immutable OS layer (similar to Talos)
  • RBAC built-in (Kubernetes + Rancher)
  • Network segmentation (multus CNI)
  • VM isolation (KubeVirt)
  • Signed images and secure boot support

Attack Surface:

  • Larger than Talos/k3s: Includes web UI, VM management, storage layer
  • KubeVirt adds additional components
  • Web UI is additional attack vector
  • More processes than minimal OS (~50+ services)

Security Features:

# VM network isolation example
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: production-vlan
spec:
  vlanID: 100
  uplink:
    linkAttributes: 1500

Hardening:

  • Firewall rules (web UI or kubectl)
  • RBAC policies (restrict VM/namespace access)
  • Network policies (isolate workloads)
  • Rancher authentication integration (LDAP, SAML)
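
As an illustration of the network-policy item above, a default deny-ingress policy for a workload namespace is plain Kubernetes and applies unchanged on Harvester (the namespace name is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: lab-workloads
spec:
  podSelector: {}
  policyTypes:
    - Ingress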

Learning Curve

Ease of Adoption: ⭐⭐⭐ (Moderate)

  • Web UI simplifies management (no CLI required for basic tasks)
  • Requires understanding of VMs + containers
  • Kubernetes knowledge helpful but not required initially
  • Longhorn storage concepts (replicas, snapshots)
  • KubeVirt for VM management (learning curve)

Required Knowledge:

  • Basic Kubernetes concepts (pods, services)
  • VM management (KubeVirt/libvirt)
  • Storage concepts (Longhorn, CSI)
  • Networking (VLANs, SR-IOV optional)
  • Web UI navigation

Debugging:

# Access via kubectl (kubeconfig downloadable from the web UI)
kubectl get nodes
kubectl get pods -n harvester-system

# View Harvester logs
kubectl logs -n harvester-system <pod-name>

# VM console access (via web UI or virtctl)
virtctl console <vm-name>

# Storage debugging (Longhorn volumes)
kubectl get volumes.longhorn.io -A

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐ (Good)

  • Documentation: Excellent official docs
  • Community: Active Slack, GitHub Discussions, forums
  • Commercial Support: Available from SUSE/Rancher
  • Third-Party Tools: Rancher ecosystem integration
  • Tutorials: Growing number of guides and videos

Pros and Cons Summary

Pros

  • Good, because unified platform for VMs + containers (no separate hypervisor)
  • Good, because built-in K3s (Kubernetes included)
  • Good, because web UI simplifies management (no CLI required)
  • Good, because built-in persistent storage (Longhorn CSI)
  • Good, because VM live migration (no downtime during maintenance)
  • Good, because multi-network support (multus CNI, SR-IOV)
  • Good, because Rancher integration (multi-cluster management)
  • Good, because automatic upgrades (OS + K3s + components)
  • Good, because commercial support available (SUSE)
  • Good, because designed for bare-metal HCI (no cloud dependencies)
  • Neutral, because immutable OS layer (similar to Talos benefits)

Cons

  • Bad, because very heavy resource requirements (8GB+ RAM minimum)
  • Bad, because complex architecture (KubeVirt, Longhorn, multus, etc.)
  • Bad, because overkill for container-only workloads (use k3s/Talos instead)
  • Bad, because larger attack surface than minimal OS (web UI, VM layer)
  • Bad, because requires 3+ nodes for production HA (not single-node friendly)
  • Bad, because steep learning curve for full feature set (VMs + storage + networking)
  • Bad, because relatively new platform (less mature than Ubuntu/Fedora)
  • Bad, because limited to Rancher ecosystem (vendor lock-in)
  • Bad, because slower to adopt latest Kubernetes versions (depends on K3s bundle)
  • Neutral, because opinionated HCI design (pro for VM use cases, con for simplicity)

Recommendations

Best for:

  • Hybrid workloads (VMs + containers on same platform)
  • Homelab users wanting to consolidate VM hypervisor + Kubernetes
  • Teams familiar with Rancher ecosystem
  • Multi-node clusters (3+ nodes)
  • Environments requiring VM live migration
  • Users wanting web UI for infrastructure management
  • Replacing VMware/Proxmox + Kubernetes with unified platform

Best Installation Method:

  • Only option: Interactive ISO install or PXE with cloud-init

Avoid if:

  • Running container-only workloads (use k3s or Talos instead)
  • Limited resources (< 8GB RAM per node)
  • Single-node homelab (Harvester designed for multi-node)
  • Want minimal attack surface (use Talos)
  • Prefer traditional Linux shell access (use Ubuntu/Fedora)
  • Need latest Kubernetes versions immediately (Harvester lags upstream)

Comparison with Other Options

| Aspect | Harvester | Talos Linux | Ubuntu + k3s | Fedora + kubeadm |
|--------|-----------|-------------|--------------|------------------|
| Primary Use Case | VMs + Containers | Containers only | General-purpose | General-purpose |
| Resource Overhead | 8GB+ RAM | 768MB RAM | 1GB RAM | 2.2GB RAM |
| Kubernetes | Built-in K3s | Built-in | Install k3s | Install kubeadm |
| Management | Web UI + kubectl | API-only (talosctl) | SSH + kubectl | SSH + kubectl |
| Storage | Built-in Longhorn | External CSI | External CSI | External CSI |
| VM Support | Native (KubeVirt) | No | Via KubeVirt | Via KubeVirt |
| Learning Curve | Moderate | Steep | Easy | Moderate |
| Attack Surface | Large | Minimal | Medium | Medium |
| Multi-Node | Designed for | Supports | Supports | Supports |
| Single-Node | Not ideal | Excellent | Excellent | Good |
| Best for | VM + K8s hybrid | K8s-only | Homelab/learning | RHEL ecosystem |

Verdict: Harvester is excellent for VM + container hybrid workloads with 3+ nodes, but overkill for container-only infrastructure. Use Talos or k3s for Kubernetes-only clusters, Ubuntu/Fedora for general-purpose servers.

Advanced Features

VM Management (KubeVirt)

Create VMs via YAML:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ubuntu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
        resources:
          requests:
            memory: 4Gi
            cpu: 2
      volumes:
        - name: root
          containerDisk:
            image: docker.io/harvester/ubuntu:22.04

Live Migration

Move VMs between nodes:

# Via web UI: VM → Actions → Migrate

# Via virtctl (KubeVirt CLI) - triggers a live migration to another node
virtctl migrate ubuntu-vm

Backup and Restore

Harvester supports VM backups:

# Configure S3 backup target (web UI)
# Create VM snapshot
# Restore from snapshot or backup

Rancher Integration

Manage multiple clusters:

# Import Harvester cluster into Rancher
# Deploy workloads across clusters
# Central authentication and RBAC

Use Case Examples

Use Case 1: Replace VMware + Kubernetes

Scenario: Currently running VMware ESXi for VMs + separate Kubernetes cluster

Harvester Solution:

  • Consolidate to 3-node Harvester cluster
  • Migrate VMs to KubeVirt
  • Deploy containers on same cluster
  • Save VMware licensing costs

Benefits:

  • Single platform for VMs + containers
  • Unified management (web UI + kubectl)
  • Built-in HA and live migration

Use Case 2: Homelab with Mixed Workloads

Scenario: Need Windows VMs + Linux containers + storage server

Harvester Solution:

  • Windows VMs via KubeVirt (GPU passthrough supported)
  • Linux containers via K3s workloads
  • Longhorn for persistent storage (NFS export supported)

Benefits:

  • No need for separate Proxmox/ESXi
  • Kubernetes-native management
  • Learn enterprise HCI platform

Use Case 3: Edge Computing

Scenario: Deploy compute at remote sites (3-5 nodes each)

Harvester Solution:

  • Harvester cluster at each edge location
  • Rancher for central management
  • VM + container workloads

Benefits:

  • Autonomous operation (no cloud dependency)
  • Rancher multi-cluster management
  • Built-in storage and networking

Production Readiness

Production Use: ✅ Yes (used in enterprise environments)

High Availability:

  • 3+ nodes required for HA
  • Witness node for even-node clusters
  • VM live migration during maintenance
  • Longhorn 3-replica storage

Monitoring:

  • Built-in Prometheus + Grafana
  • Rancher monitoring integration
  • Alerting and notifications

Disaster Recovery:

  • VM backups to S3
  • Cluster backups (etcd + config)
  • Restore to new cluster

Enterprise Features:

  • Rancher authentication (LDAP, SAML, OAuth)
  • Multi-tenancy (namespaces, RBAC)
  • Audit logging
  • Network policies

2 - Amazon Web Services Analysis

Technical analysis of Amazon Web Services capabilities for hosting network boot infrastructure

This section contains detailed analysis of Amazon Web Services (AWS) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Amazon Web Services is Amazon’s comprehensive cloud computing platform, offering compute, storage, networking, and managed services. This analysis focuses on AWS’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • EC2: Virtual machine instances for hosting boot server
  • VPN / VPC: Network connectivity and VPN capabilities
  • Elastic Load Balancing: Application and Network Load Balancers
  • NAT Gateway: Network address translation for outbound connectivity
  • VPC: Virtual Private Cloud networking and routing

2.1 - AWS Network Boot Protocol Support

Analysis of Amazon Web Services support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Amazon Web Services

This document analyzes AWS’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Elastic Load Balancing

AWS’s Elastic Load Balancing services do not support TFTP protocol natively:

  • Application Load Balancer (ALB): HTTP/HTTPS only (Layer 7)
  • Network Load Balancer (NLB): TCP/UDP support, but not TFTP-aware
  • Classic Load Balancer: Deprecated, similar limitations

TFTP operates on UDP port 69 with unique protocol semantics (variable block sizes, retransmissions, port negotiation) that standard load balancers cannot parse.

Implementation Options

Option 1: TFTP Server on EC2 Instance (Recommended)

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from an EC2 instance:

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on an EC2 instance
  • Access: Home lab connects via VPN tunnel to instance’s private IP
  • Security Group: Allow UDP/69 from VPN subnet/security group
  • Pros:
    • Simple implementation
    • No load balancer needed (single boot server sufficient for home lab)
    • TFTP traffic encrypted through VPN tunnel
    • Direct instance-to-client communication
  • Cons:
    • Single point of failure (no HA)
    • Manual failover if instance fails
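
With this approach the security group rule can be reduced to a single UDP permit from the VPN subnet; a sketch using the AWS CLI (the group ID and CIDR are placeholders):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp \
  --port 69 \
  --cidr 10.8.0.0/24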

Option 2: Network Load Balancer (NLB) UDP Passthrough

While NLB doesn’t understand TFTP protocol, it can forward UDP traffic:

  • Approach: Configure NLB to forward UDP/69 to target group
  • Limitations:
    • No TFTP-specific health checks
    • Health checks would use TCP or different protocol
    • Adds cost and complexity without significant benefit for single server
  • Use Case: Only relevant for multi-AZ HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP itself is unencrypted, but VPN tunnel provides encryption
  • Security Groups: Restrict UDP/69 to VPN security group or CIDR only
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
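
The last two points map directly onto tftpd-hpa's defaults file on Ubuntu/Debian; a sketch (directory path is illustrative; omitting the --create option keeps the server read-only):

cat <<'EOF' | sudo tee /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS=":69"
TFTP_OPTIONS="--secure"
EOF
sudo systemctl restart tftpd-hpa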

HTTP Support

Native Support

Status: ✅ Fully supported

AWS provides comprehensive HTTP support through multiple services:

Elastic Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (preview)
  • Port: Any port (typically 80 for HTTP)
  • Routing: Path-based, host-based, query string, header-based routing
  • Health Checks: HTTP health checks with configurable paths and response codes
  • SSL Offloading: Terminate SSL at ALB and use HTTP to backend
  • Backend: EC2 instances, ECS, EKS, Lambda

EC2 Direct Access

For VPN scenario, HTTP can be served directly from EC2 instance:

  • Approach: Run HTTP server (nginx, Apache, custom service) on EC2
  • Access: Home lab accesses via VPN tunnel to private IP
  • Security Group: Allow TCP/80 from VPN security group
  • Pros: Simpler than ALB for single boot server
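
A minimal nginx site definition is enough to serve kernels, initrds, and iPXE scripts over HTTP; a sketch (paths are illustrative):

cat <<'EOF' | sudo tee /etc/nginx/conf.d/netboot.conf
server {
    listen 80;
    root /srv/netboot;     # kernels, initrds, iPXE scripts
    autoindex on;          # convenient for manual checks over the VPN
}
EOF
sudo nginx -t && sudo systemctl reload nginx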

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads kernel/initrd via HTTP
  3. Kernel/Initrd: Large boot files served efficiently over HTTP
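
Step 2 of this flow is typically driven by a small iPXE script hosted alongside the boot files; a sketch written into the HTTP root used above (the private IP and paths are illustrative):

cat <<'EOF' | sudo tee /srv/netboot/boot.ipxe
#!ipxe
kernel http://10.0.1.10/boot/vmlinuz console=tty0
initrd http://10.0.1.10/boot/initrd.img
boot
EOF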

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based configs
  • CloudFront: Optional CDN for caching boot files (probably overkill for VPN scenario)
  • TCP Optimization: AWS network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

AWS provides enterprise-grade HTTPS support:

Elastic Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 (preview)
  • SSL/TLS Termination: Terminate SSL at ALB
  • Certificate Management:
    • AWS Certificate Manager (ACM) - free SSL certificates with automatic renewal
    • Import custom certificates
    • Integration with private CA via ACM Private CA
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable via security policy)
  • Cipher Suites: Predefined security policies (modern, compatible, legacy)
  • SNI Support: Multiple certificates on single load balancer

AWS Certificate Manager (ACM)

  • Free Certificates: No cost for public SSL certificates used with AWS services
  • Automatic Renewal: ACM automatically renews certificates before expiration
  • Private CA: ACM Private CA for internal PKI (additional cost)
  • Integration: Native integration with ALB, CloudFront, API Gateway
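
A sketch of requesting a public ACM certificate with DNS validation (the domain name is a placeholder); once the DNS record is created, ACM validates and renews the certificate automatically:

aws acm request-certificate \
    --domain-name boot.example.com \
    --validation-method DNS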

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot
  • Security: Boot file integrity verified via HTTPS chain of trust

Implementation on AWS

  1. Certificate Provisioning:

    • Use ACM certificate for public domain (free, auto-renewed)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use ACM Private CA for internal PKI ($400/month - expensive for home lab)
  2. ALB Configuration:

    • HTTPS listener on port 443
    • Target group pointing to EC2 boot server
    • Security policy with TLS 1.2+ minimum
  3. Alternative: Direct EC2 HTTPS:

    • Run nginx/Apache with TLS on EC2 instance
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario
    • Use Let’s Encrypt or self-signed certificate
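
For the VPN-only self-signed option above, a minimal sketch of generating the certificate and trusting it from iPXE (the common name and IP are assumptions; iPXE builds can embed trusted certificates via its TRUST= build parameter):

# Generate a self-signed certificate for the boot server's VPN-facing address
openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
    -keyout /etc/ssl/private/boot.key \
    -out /etc/ssl/certs/boot.crt \
    -subj "/CN=boot.internal" \
    -addext "subjectAltName=IP:10.0.1.10"
# Then rebuild iPXE with TRUST=boot.crt so clients accept this certificate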

Mutual TLS (mTLS) Support

AWS ALB supports mutual TLS authentication (announced in late 2023):

  • Client Certificates: Require client certificates for authentication
  • Trust Store: Upload trusted CA certificates to ALB
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth
  • Passthrough Mode: ALB can pass client cert to backend for validation

Routing and Load Balancing Capabilities

VPC Routing

  • Route Tables: Define routes to direct traffic through VPN gateway
  • Route Propagation: BGP route propagation for VPN connections
  • Transit Gateway: Advanced multi-VPC/VPN routing (overkill for home lab)

Security Groups

  • Stateful Firewall: Automatic return traffic handling
  • Ingress/Egress Rules: Fine-grained control by protocol, port, source/destination
  • Security Group Chaining: Reference security groups in rules (elegant for VPN setup)
  • VPN Subnet Restriction: Allow traffic only from VPN-connected subnet

Network ACLs (Optional)

  • Stateless Firewall: Subnet-level access control
  • Defense in Depth: Additional layer beyond security groups
  • Use Case: Probably unnecessary for simple VPN boot server

Cost Implications

Data Transfer Costs

  • VPN Traffic: Data transfer through VPN gateway charged at standard rates
  • Intra-Region: Free for traffic within same region/VPC
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.14/month (US East egress)

Load Balancing Costs

  • Application Load Balancer: $0.0225/hour + $0.008 per LCU-hour ($16-20/month minimum)
  • Network Load Balancer: $0.0225/hour + $0.006 per NLCU-hour ($16-18/month minimum)
  • For VPN Scenario: Load balancer unnecessary (single EC2 instance sufficient)

Compute Costs

  • t3.micro Instance: ~$7.50/month (on-demand pricing, US East)
  • t4g.micro Instance: ~$6.00/month (ARM-based, cheaper, sufficient for boot server)
  • Reserved Instances: Up to 72% savings with 1-year or 3-year commitment
  • Savings Plans: Flexible discounts for consistent compute usage

ACM Certificate Costs

  • Public Certificates: Free when used with AWS services
  • Private CA: $400/month (too expensive for home lab)

Comparison with Requirements

| Requirement | AWS Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via EC2, not ELB | Direct EC2 access via VPN |
| HTTP | ✅ Full support | EC2 or ALB |
| HTTPS | ✅ Full support | EC2 or ALB with ACM |
| VPN Integration | ✅ Native VPN | Site-to-Site VPN or self-managed |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ ACM (free) | Automatic renewal |
| Cost Efficiency | ✅ Low-cost instances | t4g.micro sufficient |

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. EC2 Instance: Deploy single t4g.micro or t3.micro instance with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with Let’s Encrypt or self-signed certificate
  2. VPN Connection: Connect home lab to AWS via:

    • Site-to-Site VPN (IPsec) - managed service, higher cost (~$36/month)
    • Self-managed WireGuard on EC2 - lower cost, more control
  3. Security Groups: Restrict access to:

    • UDP/69 (TFTP) from VPN security group only
    • TCP/80 (HTTP) from VPN security group only
    • TCP/443 (HTTPS) from VPN security group only
  4. No Load Balancer: For home lab scale, direct EC2 access is sufficient

  5. Health Monitoring: Use CloudWatch for instance and service health

If HA Required (Future Enhancement)

  • Deploy multi-AZ EC2 instances with Network Load Balancer
  • Use S3 as backend for boot files with EC2 serving as cache
  • Implement auto-recovery with Auto Scaling Group (min=max=1)


2.2 - AWS WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Amazon Web Services for secure site-to-site connectivity

WireGuard VPN Support on Amazon Web Services

This document analyzes options for deploying WireGuard VPN on AWS to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

AWS Native VPN Support

Site-to-Site VPN (IPsec)

Status: ❌ WireGuard not natively supported

AWS’s managed Site-to-Site VPN supports:

  • IPsec VPN: IKEv1, IKEv2 with pre-shared keys
  • Redundancy: Two VPN tunnels per connection for high availability
  • BGP Support: Dynamic routing via BGP
  • Transit Gateway: Scalable multi-VPC VPN hub

Limitation: Site-to-Site VPN does not support WireGuard protocol natively.

Cost: Site-to-Site VPN

  • VPN Connection: ~$0.05/hour = ~$36/month
  • Data Transfer: Standard data transfer out rates (~$0.09/GB for first 10TB)
  • Total Estimate: ~$36-50/month for managed IPsec VPN

Self-Managed WireGuard on EC2

Implementation Approach

Since AWS doesn’t offer managed WireGuard, deploy WireGuard on an EC2 instance:

Status: ✅ Fully supported via EC2

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[AWS EC2 Instance]
    B -->|VPC Network| C[Boot Server EC2]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "AWS VPC"
        B[WireGuard Gateway EC2]
        C[Boot Server EC2]
    end

EC2 Configuration

  1. WireGuard Gateway Instance:

    • Instance Type: t4g.micro or t3.micro ($6-7.50/month)
    • OS: Ubuntu 22.04 LTS or Amazon Linux 2023 (native WireGuard support)
    • Source/Dest Check: Disable to allow IP forwarding
    • Elastic IP: Allocate Elastic IP for stable WireGuard endpoint
    • Security Group: Allow UDP port 51820 from home lab public IP
  2. Boot Server Instance:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No Elastic IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway instance

Installation Steps

# On EC2 Instance (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | sudo tee /etc/wireguard/server_private.key | wg pubkey | sudo tee /etc/wireguard/server_public.key > /dev/null
sudo chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on AWS EC2:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <AWS_ELASTIC_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.0.0.0/16
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

AWS VPC Configuration

Security Groups

Create security group for WireGuard gateway:

aws ec2 create-security-group \
    --group-name wireguard-gateway-sg \
    --description "WireGuard VPN gateway" \
    --vpc-id vpc-xxxxxx

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxx \
    --protocol udp \
    --port 51820 \
    --cidr <HOME_LAB_PUBLIC_IP>/32

Allow SSH for management (optional, restrict to trusted IP):

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxx \
    --protocol tcp \
    --port 22 \
    --cidr <TRUSTED_IP>/32

Disable Source/Destination Check

Required for IP forwarding to work:

aws ec2 modify-instance-attribute \
    --instance-id i-xxxxxx \
    --no-source-dest-check

Elastic IP Allocation

Allocate and associate Elastic IP for stable endpoint:

aws ec2 allocate-address --domain vpc

aws ec2 associate-address \
    --instance-id i-xxxxxx \
    --allocation-id eipalloc-xxxxxx

Cost: Historically an Elastic IP was free while associated with a running instance and charged ~$3.60/month when unattached; note that as of February 2024 AWS bills all public IPv4 addresses, including attached Elastic IPs, at ~$0.005/hour (~$3.60/month).

Route Table Configuration

Add route to direct home lab subnet traffic through WireGuard gateway:

aws ec2 create-route \
    --route-table-id rtb-xxxxxx \
    --destination-cidr-block 192.168.1.0/24 \
    --instance-id i-xxxxxx

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway instance.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: AWS EC2 WireGuard instance’s public key
    • Endpoint: AWS Elastic IP address
    • Port: 51820
    • Allowed IPs: AWS VPC CIDR (e.g., 10.0.0.0/16)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to AWS subnets
    • Home lab servers can reach AWS boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard on boot
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on EC2 performance varies by instance type:

  • t4g.micro (2 vCPU, ARM): ~100-300 Mbps
  • t3.micro (2 vCPU, x86): ~100-300 Mbps
  • t3.small (2 vCPU): ~500-800 Mbps
  • t3.medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even t4g.micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: t4g.micro adequate and most cost-effective

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms)
  • AWS Network: Low-latency network infrastructure
  • Total Latency: Primarily dependent on home ISP and AWS region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • t4g.micro: Sufficient CPU for home lab VPN throughput
  • ARM Advantage: t4g instances use Graviton processors (better price/performance)

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secrets Manager: Store WireGuard private keys in AWS Secrets Manager
    • Retrieve at instance startup via user data script
    • Avoid storing in AMIs or instance metadata
  • IAM Role: Grant EC2 instance IAM role to read secret
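
A minimal sketch of storing the server key in Secrets Manager as described above (the secret name matches the one referenced later in this document):

# Generate the server key locally and store it in Secrets Manager
wg genkey > server_private.key
aws secretsmanager create-secret \
    --name wireguard-server-key \
    --secret-string file://server_private.key
shred -u server_private.key   # remove the local plaintext copy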

Firewall Hardening

  • Security Group Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server security group allows only VPN security group
  • No Public Access: Boot server has no Elastic IP or public route

Monitoring and Alerts

  • CloudWatch Logs: Stream WireGuard logs to CloudWatch
  • CloudWatch Alarms: Alert on VPN tunnel down (no recent handshakes)
  • VPC Flow Logs: Monitor VPN traffic patterns

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification attacks
  • AWS Shield: Basic DDoS protection included free on all AWS resources
  • Shield Advanced: Optional ($3,000/month - overkill for VPN endpoint)

High Availability Options

Multi-AZ Failover

Deploy WireGuard gateways in multiple Availability Zones:

  • Primary: us-east-1a WireGuard instance
  • Secondary: us-east-1b WireGuard instance
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles instance costs (~$12-15/month for 2 instances)

Auto Scaling Group (Single Instance)

Use Auto Scaling Group with min=max=1 for auto-recovery:

  • Health Checks: EC2 status checks
  • Auto-Recovery: ASG replaces failed instance automatically
  • Elastic IP: Reassociate Elastic IP to new instance via Lambda/script
  • Limitation: Brief downtime during recovery (~2-5 minutes)

Health Monitoring

Monitor WireGuard tunnel health with CloudWatch custom metrics:

#!/bin/bash
# Run periodically via cron on the EC2 instance
HANDSHAKE=$(wg show wg0 latest-handshakes | awk '{print $2}')  # epoch seconds of last handshake
NOW=$(date +%s)
AGE=$((NOW - HANDSHAKE))

aws cloudwatch put-metric-data \
    --namespace WireGuard \
    --metric-name TunnelAge \
    --value "$AGE" \
    --unit Seconds

Alert if handshake age exceeds threshold (e.g., 180 seconds).
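
A hedged sketch of the corresponding CloudWatch alarm on the custom metric above (the SNS topic ARN is a placeholder):

# Alarm if the latest WireGuard handshake is older than 180 seconds
aws cloudwatch put-metric-alarm \
    --alarm-name wireguard-tunnel-stale \
    --namespace WireGuard \
    --metric-name TunnelAge \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 180 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:vpn-alerts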

User Data Script for Auto-Configuration

EC2 user data script to configure WireGuard on launch:

#!/bin/bash
# Install WireGuard plus the AWS CLI (needed for the Secrets Manager call; Ubuntu AMIs do not ship it)
apt update && apt install -y wireguard wireguard-tools awscli

# Retrieve private key from Secrets Manager
aws secretsmanager get-secret-value \
    --secret-id wireguard-server-key \
    --query SecretString \
    --output text > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Configure interface (full config omitted for brevity)
# ...

# Enable and start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Requires IAM instance role with secretsmanager:GetSecretValue permission.
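
A minimal sketch of granting that permission as an inline policy on the instance role (the role name and secret ARN are placeholders):

aws iam put-role-policy \
    --role-name wireguard-gateway-role \
    --policy-name read-wireguard-key \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:wireguard-server-key-*"
      }]
    }'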

Cost Analysis

Self-Managed WireGuard on EC2

| Component | Cost (US East) |
|---|---|
| t4g.micro instance (730 hrs/month) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Data transfer out (1GB/month) | ~$0.09 |
| Monthly Total | ~$6.09 |
| Annual Total | ~$73 |

With Reserved Instance (1-year, no upfront):

| Component | Cost |
|---|---|
| t4g.micro RI (1-year) | ~$3.50/month |
| Elastic IP | $0.00 |
| Data transfer | ~$0.09 |
| Monthly Total | ~$3.59 |
| Annual Total | ~$43 |

Site-to-Site VPN (IPsec - if WireGuard not used)

| Component | Cost |
|---|---|
| VPN Connection (2 tunnels) | ~$36 |
| Data transfer (1GB/month) | ~$0.09 |
| Monthly Total | ~$36 |
| Annual Total | ~$432 |

Cost Savings: Self-managed WireGuard saves ~$360/year vs Site-to-Site VPN (or ~$390/year with Reserved Instance).

Comparison with Requirements

| Requirement | AWS Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via EC2 | Self-managed on instance |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Very low cost | t4g.micro ~$6/month (on-demand) |
| Performance | ✅ Sufficient | 100+ Mbps on t4g.micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-AZ or ASG |

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on EC2 t4g.micro instance

    • Cost: ~$6/month on-demand, ~$3.50/month with Reserved Instance
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single AZ Deployment: Unless HA required, single instance adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • AZ: Single AZ sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add AWS instance as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes AWS subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secrets Manager
    • Restrict security group to home lab public IP only
    • Use user data script to retrieve key and configure on boot
    • Enable CloudWatch logging for VPN events
    • Assign IAM instance role with minimal permissions
  5. Monitoring: Set up CloudWatch alarms for:

    • Instance status check failures
    • High CPU usage
    • VPN tunnel age (custom metric)

Cost Optimization

  • Reserved Instance: Commit to 1-year Reserved Instance for ~40% savings
  • Spot Instance: Consider Spot for even lower cost (~70% savings), but adds complexity (handle interruptions)
  • ARM Architecture: Use t4g (Graviton) for 20% better price/performance vs t3

Future Enhancements

  • HA Setup: Deploy secondary WireGuard instance in different AZ
  • Automated Failover: Lambda function to reassociate Elastic IP on failure
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added


3 - Google Cloud Platform Analysis

Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure

This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • Compute Engine: Virtual machine instances for hosting boot server
  • Cloud VPN / VPC: Network connectivity and VPN capabilities
  • Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
  • Cloud NAT: Network address translation for outbound connectivity
  • VPC Network: Software-defined networking and routing

Documentation Sections

3.1 - Cloud Storage FUSE (gcsfuse)

Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure

Overview

Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.

  • Project: GoogleCloudPlatform/gcsfuse
  • License: Apache 2.0
  • Status: Generally Available (GA)
  • Latest Version: v2.x (as of 2024)

How gcsfuse Works

gcsfuse translates filesystem operations into GCS API calls:

  1. Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
  2. Directory Structure: Interprets / in object names as directory separators
  3. File Operations: Translates read(), write(), open(), etc. into GCS API requests
  4. Metadata: Maintains file attributes (size, modification time) via GCS metadata
  5. Caching: Optional stat, type, list, and file caching to reduce API calls

Example:

  • GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
  • Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img
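
A minimal usage sketch of this mapping, assuming a bucket named boot-assets and a mount point under /mnt:

# Mount the bucket, browse it like a directory, then unmount
gcsfuse --implicit-dirs boot-assets /mnt/boot-assets
ls /mnt/boot-assets/kernels/
fusermount -u /mnt/boot-assets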

Relevance to Network Boot Infrastructure

In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.

Potential Use Cases

  1. Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
  2. Configuration Sync: Access boot profiles and machine mappings from GCS as local files
  3. Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
  4. Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code

Architecture Pattern

┌─────────────────────────┐
│   Boot Server Process   │
│  (Cloud Run/Compute)    │
└───────────┬─────────────┘
            │ filesystem operations
            │ (read, open, stat)
            ▼
┌─────────────────────────┐
│   gcsfuse mount point   │
│   /var/lib/boot-assets  │
└───────────┬─────────────┘
            │ FUSE layer
            │ (translates to GCS API)
            ▼
┌─────────────────────────┐
│  Cloud Storage Bucket   │
│   gs://boot-assets/     │
└─────────────────────────┘

Performance Characteristics

Latency

  • Much higher latency than local filesystem: Every operation requires GCS API call(s)
  • No default caching: Without caching enabled, every read re-fetches from GCS
  • Network round-trip: Minimum ~10-50ms latency per operation (depending on region)

Throughput

Single Large File:

  • Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
  • Write: Comparable to gsutil cp for large files
  • With parallel downloads: Up to 9x faster for single-threaded reads of large files

Small Files:

  • Poor performance for random I/O on small files
  • Bulk operations on many small files create significant bottlenecks
  • ls on directories with thousands of objects can take minutes

Concurrent Access:

  • Performance degrades significantly with parallel readers (8 instances: ~30 hours vs 16 minutes with local data)
  • Not recommended for high-concurrency scenarios (web servers, NAS)

Performance Improvements (Recent Features)

  1. Streaming Writes (default): Upload data directly to GCS as written

    • Up to 40% faster for large sequential writes
    • Reduces local disk usage (no staging file)
  2. Parallel Downloads: Download large files using multiple workers

    • Up to 9x faster model load times
    • Best for single-threaded reads of large files
  3. File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)

    • Up to 2.3x faster training time (AI/ML workloads)
    • Up to 3.4x higher throughput
    • Requires explicit cache directory configuration
  4. Metadata Cache: Cache stat, type, and list operations

    • Stat and type caches enabled by default
    • Configurable TTL (default: 60s, set -1 for unlimited)

Caching Configuration

gcsfuse provides four types of caching:

1. Stat Cache

Caches file attributes (size, modification time, existence).

# Enable with unlimited size and TTL
gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).

2. Type Cache

Caches file vs directory type information.

gcsfuse \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Speeds up directory traversal and ls operations.

3. List Cache

Caches directory listing results.

gcsfuse \
  --max-conns-per-host=100 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Improves performance for applications that repeatedly list directory contents.

4. File Cache

Caches actual file contents locally.

gcsfuse \
  --file-cache-max-size-mb=-1 \
  --cache-dir=/mnt/local-ssd \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  bucket-name /mount/point

Use case: Essential for AI/ML training, repeated reads of large files.

Recommended cache storage:

  • Local SSD: Fastest, but ephemeral (data lost on restart)
  • Persistent Disk: Persistent but slower than Local SSD
  • tmpfs (RAM disk): Fastest but limited by memory

Production Configuration Example

# config.yaml for gcsfuse
metadata-cache:
  ttl-secs: -1  # Never expire (use only if bucket is read-only or single-writer)
  stat-cache-max-size-mb: -1
  type-cache-max-size-mb: -1

file-cache:
  max-size-mb: -1  # Unlimited (limited by disk space)
  cache-file-for-range-read: true
  enable-parallel-downloads: true
  parallel-downloads-per-file: 16
  download-chunk-size-mb: 50

write:
  create-empty-file: false  # Streaming writes (default)

logging:
  severity: info
  format: json

Mount using the config file:

gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets

Limitations and Considerations

Filesystem Semantics

gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:

  • No atomic rename: Rename operations are copy-then-delete (not atomic)
  • No hard links: GCS doesn’t support hard links
  • No file locking: flock() is a no-op
  • Limited permissions: GCS has simpler ACLs than POSIX permissions
  • No sparse files: Writes always materialize full file content

Performance Anti-Patterns

Avoid:

  • Serving web content or acting as NAS (concurrent connections)
  • Random I/O on many small files (image datasets, text corpora)
  • Reading during ML training loops (download first, then train)
  • High-concurrency workloads (multiple parallel readers/writers)

Good for:

  • Sequential reads of large files (models, checkpoints, kernels)
  • Infrequent writes of entire files
  • Read-mostly workloads with caching enabled
  • Single-writer scenarios

Consistency Trade-offs

With caching enabled:

  • Stale reads possible if cache TTL > 0 and external modifications occur
  • Safe only for:
    • Read-only buckets
    • Single-writer, single-mount scenarios
    • Workloads tolerant of eventual consistency

Without caching:

  • Strong consistency (every read fetches latest from GCS)
  • Much slower performance

Resource Requirements

  • Disk space: File cache and streaming writes require local storage
    • File cache: Size of cached files (can be large for ML datasets)
    • Streaming writes: Temporary staging (proportional to concurrent writes)
  • Memory: Metadata caches consume RAM
  • File handles: Can exceed system limits with high concurrency
  • Network bandwidth: All data transfers via GCS API

Installation

On Compute Engine (manual .deb install)

# Install a specific gcsfuse release directly from the GitHub .deb package
# (Container-Optimized OS has no package manager; run gcsfuse inside a container there instead)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb

On Debian/Ubuntu

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-get update
sudo apt-get install gcsfuse

In Docker/Cloud Run

FROM ubuntu:22.04

# Install gcsfuse
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    lsb-release \
  && export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
  && echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install -y gcsfuse \
  && rm -rf /var/lib/apt/lists/*

# Create mount point
RUN mkdir -p /mnt/boot-assets

# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
    /usr/local/bin/boot-server

Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.

Network Boot Infrastructure Evaluation

Applicability to ADR-0005

Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:

❌ Cloud Run Incompatibility

  • gcsfuse requires FUSE kernel module and privileged containers
  • Cloud Run does not support FUSE or privileged mode
  • ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
  • Impact: Blocks Cloud Run deployment, forcing Compute Engine VM

❌ Boot Latency Requirements

  • Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
  • gcsfuse adds 10-50ms+ latency per operation (network round-trips)
  • Kernel/initrd downloads are latency-sensitive (network boot timeout)
  • Impact: May exceed boot timeout thresholds

❌ No Caching for Read-Write Workloads

  • Boot server needs to write new assets and read existing ones
  • File cache with unlimited TTL requires read-only or single-writer assumption
  • Multiple boot server instances (autoscaling) violate single-writer constraint
  • Impact: Either accept stale reads or disable caching (slow)

❌ Small File Performance

  • Machine mapping configs, boot scripts, profiles are small files (KB range)
  • gcsfuse performs poorly on small, random I/O
  • ls operations on directories with many profiles can be slow
  • Impact: Slow boot configuration lookups

✅ Alternative: Direct Cloud Storage SDK

Using cloud.google.com/go/storage SDK directly offers:

  • Lower latency: Direct API calls without FUSE overhead
  • Cloud Run compatible: No kernel module or privileged mode required
  • Better control: Explicit caching, parallel downloads, streaming
  • Simpler deployment: No mount management, no FUSE dependencies
  • Cost: Similar API call costs to gcsfuse

Recommended approach (from ADR-0005):

// Custom boot server handler using the Cloud Storage SDK
// (error handling abbreviated; ctx and the http.ResponseWriter w are assumed in scope)
client, err := storage.NewClient(ctx)
if err != nil {
    http.Error(w, "storage client error", http.StatusInternalServerError)
    return
}
bucket := client.Bucket("boot-assets")

// Stream kernel to boot client
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
    http.Error(w, "kernel not found", http.StatusNotFound)
    return
}
defer reader.Close()
io.Copy(w, reader) // Stream the object directly to the HTTP response

When gcsfuse MIGHT Be Useful

Despite the above limitations, gcsfuse could be considered for:

  1. Matchbox on Compute Engine:

    • Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
    • Compute Engine VM supports FUSE
    • Read-heavy workload (boot assets rarely change)
    • Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
  2. Development/Testing:

    • Quick prototyping without writing Cloud Storage integration
    • Local development with production bucket access
    • Not recommended for production deployment
  3. Low-Throughput Scenarios:

    • Home lab scale (< 10 boots/hour)
    • File cache enabled with Local SSD
    • Single Compute Engine VM (not autoscaled)

Configuration for Matchbox + gcsfuse:

#!/bin/bash
# Mount boot assets for Matchbox

BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"

mkdir -p "$MOUNT_POINT" "$CACHE_DIR"

gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  --file-cache-max-size-mb=-1 \
  --cache-dir="$CACHE_DIR" \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  --implicit-dirs \
  --foreground \
  "$BUCKET" "$MOUNT_POINT"

Monitoring and Troubleshooting

Metrics

gcsfuse exposes Prometheus metrics:

gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point

Key metrics:

  • gcs_read_count: Number of GCS read operations
  • gcs_write_count: Number of GCS write operations
  • gcs_read_bytes: Bytes read from GCS
  • gcs_write_bytes: Bytes written to GCS
  • fs_ops_count: Filesystem operations by type (open, read, write, etc.)
  • fs_ops_error_count: Filesystem operation errors

Logging

# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point

Common Issues

Issue: ls on large directories takes minutes

Solution:

  • Enable list caching with --metadata-cache-ttl-secs=-1
  • Reduce directory depth (flatten object hierarchy)
  • Consider prefix-based filtering instead of full listings

Issue: Stale reads after external bucket modifications

Solution:

  • Reduce --metadata-cache-ttl-secs (default 60s)
  • Disable caching entirely for strong consistency
  • Use versioned object names (immutable assets)

Issue: Transport endpoint is not connected errors

Solution:

  • Unmount cleanly before remounting: fusermount -u /mnt/point
  • Check GCS bucket permissions (IAM roles)
  • Verify network connectivity to storage.googleapis.com

Issue: High memory usage

Solution:

  • Limit metadata cache sizes: --stat-cache-max-size-mb=1024
  • Disable file cache if not needed
  • Monitor with --prometheus metrics

Comparison to Alternatives

gcsfuse vs Direct Cloud Storage SDK

| Aspect | gcsfuse | Cloud Storage SDK |
|---|---|---|
| Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API) |
| Cloud Run | ❌ Not supported | ✅ Fully supported |
| Development Effort | Low (standard filesystem code) | Medium (SDK integration) |
| Performance | Slower (filesystem abstraction) | Faster (optimized for use case) |
| Caching | Built-in (stat, type, list, file) | Manual (application-level) |
| Streaming | Automatic | Explicit (io.Copy) |
| Dependencies | FUSE kernel module, privileged mode | None (pure Go library) |

Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.

gcsfuse vs rsync/gsutil Sync

Periodic sync pattern:

# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets

| Aspect | gcsfuse | rsync/gsutil sync |
|---|---|---|
| Consistency | Eventual (with caching) | Strong (within sync interval) |
| Disk Usage | Minimal (file cache optional) | Full copy of assets |
| Latency | GCS API per request | Local disk (fast) |
| Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes) |
| Deployment | Requires FUSE | Simple cron job |

Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.

Conclusion

Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:

  1. Cloud Run incompatibility (requires FUSE kernel module)
  2. Added latency (FUSE overhead + network round-trips)
  3. Poor performance for small files and concurrent access
  4. Caching trade-offs (consistency vs performance)

Recommended alternatives:

  • Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
  • Matchbox on Compute Engine: rsync/gsutil sync to local disk
  • Cloud Run Deployment: Direct SDK (no gcsfuse possible)

gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.


3.2 - GCP Network Boot Protocol Support

Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Google Cloud Platform

This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Cloud Load Balancing

GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.

Implementation Options

Option 1: Direct TFTP from Compute Engine VM (Recommended)

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing:

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
  • Access: Home lab connects via VPN tunnel to the VM’s private IP
  • Routing: VPC firewall rules allow UDP/69 from VPN subnet
  • Pros:
    • Simple implementation
    • No need for load balancing (single boot server sufficient)
    • TFTP traffic encrypted through VPN tunnel
    • Direct VM-to-client communication
  • Cons:
    • Single point of failure (no load balancing/HA)
    • Manual failover required if VM fails
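
A minimal sketch of the VPC firewall rule described above (the source CIDRs and network tag are assumptions; the tunnel and home LAN subnets match the examples later in this document):

# Allow TFTP only from VPN-connected subnets to VMs tagged as boot servers
gcloud compute firewall-rules create allow-tftp-from-vpn \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=udp:69 \
    --source-ranges=10.200.0.0/24,192.168.1.0/24 \
    --target-tags=boot-server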

Option 2: Network Load Balancer (NLB) Passthrough

While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:

  • Approach: Configure Network Load Balancer for UDP/69 passthrough
  • Limitations:
    • No protocol-aware health checks for TFTP
    • Health checks would use TCP or HTTP on alternate port
    • Adds complexity without significant benefit for single boot server
  • Use Case: Only relevant for multi-region HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
  • Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads

HTTP Support

Native Support

Status: ✅ Fully supported

GCP provides comprehensive HTTP support through multiple services:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
  • Port: Any port (typically 80 for HTTP)
  • Routing: URL-based routing, host-based routing, path-based routing
  • Health Checks: HTTP health checks with configurable paths
  • SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
  • Backend: Compute Engine VMs, instance groups, Cloud Run, GKE

Compute Engine Direct Access

For VPN scenario, HTTP can be served directly from VM:

  • Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
  • Access: Home lab accesses via VPN tunnel to private IP
  • Firewall: VPC firewall rules allow TCP/80 from VPN subnet
  • Pros: Simpler than load balancer for single boot server

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
  3. Kernel/Initrd: Large boot files served efficiently over HTTP

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based boot configs
  • Caching: Cloud CDN can cache boot files for faster delivery
  • TCP Optimization: GCP’s network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

GCP provides enterprise-grade HTTPS support:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
  • SSL/TLS Termination: Terminate SSL at load balancer
  • Certificate Management:
    • Google-managed SSL certificates (automatic renewal)
    • Self-managed certificates (bring your own)
    • Certificate Map for multiple domains
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
  • Cipher Suites: Modern, compatible, or custom cipher suites
  • mTLS Support: Mutual TLS authentication (client certificates)

Certificate Manager

  • Managed Certificates: Automatic provisioning and renewal of Google-managed certificates (issued via Google Trust Services)
  • Private CA: Integration with Google Cloud Certificate Authority Service
  • Certificate Maps: Route different domains to different backends based on SNI
  • Certificate Monitoring: Automatic alerts before expiration

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (introduced in the UEFI 2.5 specification)
  • Security: Boot file integrity verified via HTTPS chain of trust

Implementation on GCP

  1. Certificate Provisioning:

    • Use Google-managed certificate for public domain (if boot server has public DNS)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use private CA for internal PKI
  2. Load Balancer Configuration:

    • HTTPS frontend (port 443)
    • Backend service to Compute Engine VM running boot server
    • SSL policy with TLS 1.2+ minimum
  3. Alternative: Direct VM HTTPS:

    • Run nginx/Apache with TLS on Compute Engine VM
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario

mTLS Support for Enhanced Security

GCP’s Application Load Balancer supports mutual TLS authentication:

  • Client Certificates: Require client certificates for additional authentication
  • Certificate Validation: Validate client certificates against trusted CA
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth

Routing and Load Balancing Capabilities

VPC Routing

  • Custom Routes: Define routes to direct traffic through VPN gateway
  • Route Priority: Configure route priorities for failover scenarios
  • BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)

Firewall Rules

  • Ingress/Egress Rules: Fine-grained control over traffic
  • Source/Destination Filters: IP ranges, tags, service accounts
  • Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
  • VPN Subnet Restriction: Limit access to VPN-connected home lab subnet

Cloud Armor (Optional)

For additional security if boot server has public access:

  • DDoS Protection: Layer 3/4 DDoS mitigation
  • WAF Rules: Application-level filtering
  • IP Allowlisting: Restrict to known public IPs
  • Rate Limiting: Prevent abuse

Cost Implications

Network Egress Costs

  • VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
  • Intra-Region: Free for traffic within same region
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)

Load Balancing Costs

  • Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
  • Network Load Balancer: ~$0.025/hour + data processing charges
  • For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)

Compute Costs

  • e2-micro Instance: ~$6-7/month (suitable for boot server)
  • f1-micro Instance: ~$4-5/month (even smaller, might suffice)
  • Reserved/Committed Use: Discounts for long-term commitment

Comparison with Requirements

| Requirement | GCP Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN |
| HTTP | ✅ Full support | VM or ALB |
| HTTPS | ✅ Full support | VM or ALB with Certificate Manager |
| VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ Managed certs | Certificate Manager |
| Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient |

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. Compute Engine VM: Deploy single e2-micro VM with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with self-signed certificate
  2. VPN Tunnel: Connect home lab to GCP via:

    • Cloud VPN (IPsec) - easier setup, higher cost
    • Self-managed WireGuard on Compute Engine - lower cost, more control
  3. VPC Firewall: Restrict access to:

    • UDP/69 (TFTP) from VPN subnet only
    • TCP/80 (HTTP) from VPN subnet only
    • TCP/443 (HTTPS) from VPN subnet only
  4. No Load Balancer: For home lab scale, direct VM access is sufficient

  5. Health Monitoring: Use Cloud Monitoring for VM and service health

If HA Required (Future Enhancement)

  • Deploy multi-zone VMs with Network Load Balancer
  • Use Cloud Storage as backend for boot files with VM serving as cache
  • Implement failover automation with Cloud Functions


3.3 - GCP WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity

WireGuard VPN Support on Google Cloud Platform

This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

GCP Native VPN Support

Cloud VPN (IPsec)

Status: ❌ WireGuard not natively supported

GCP’s managed Cloud VPN service supports:

  • IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
  • HA VPN: Highly available VPN with 99.99% SLA
  • Classic VPN: Single-tunnel VPN (deprecated)

Limitation: Cloud VPN does not support WireGuard protocol natively.

Cost: Cloud VPN

  • HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
  • Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
  • Total Estimate: ~$75-100/month for managed VPN

Self-Managed WireGuard on Compute Engine

Implementation Approach

Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:

Status: ✅ Fully supported via Compute Engine

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
    B -->|Private VPC Network| C[Boot Server VM]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "GCP VPC"
        B[WireGuard Gateway VM]
        C[Boot Server VM]
    end

VM Configuration

  1. WireGuard Gateway VM:

    • Instance Type: e2-micro or f1-micro ($4-7/month)
    • OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
    • IP Forwarding: Enable IP forwarding to route traffic to other VMs
    • External IP: Static external IP for stable WireGuard endpoint
    • Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
  2. Boot Server VM:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No external IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway VM

Installation Steps

# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | sudo tee /etc/wireguard/server_private.key | wg pubkey | sudo tee /etc/wireguard/server_public.key > /dev/null
sudo chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on GCP VM:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

GCP VPC Configuration

Firewall Rules

Create VPC firewall rule to allow WireGuard:

gcloud compute firewall-rules create allow-wireguard \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=udp:51820 \
    --source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
    --target-tags=wireguard-gateway

Tag the WireGuard VM:

gcloud compute instances add-tags wireguard-gateway-vm \
    --tags=wireguard-gateway \
    --zone=us-central1-a

Static External IP

Reserve static IP for stable WireGuard endpoint:

gcloud compute addresses create wireguard-gateway-ip \
    --region=us-central1

gcloud compute instances delete-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --zone=us-central1-a

gcloud compute instances add-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --address=wireguard-gateway-ip \
    --zone=us-central1-a

Cost: A static external IP attached to a running VM is billed at roughly $0.004/hour (~$3/month); a reserved address left unattached is billed at a higher rate (~$0.01/hour, ~$7/month).

Route Configuration

For traffic from boot server to reach home lab via WireGuard VM:

gcloud compute routes create route-to-homelab \
    --network=default \
    --priority=100 \
    --destination-range=192.168.1.0/24 \
    --next-hop-instance=wireguard-gateway-vm \
    --next-hop-instance-zone=us-central1-a

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: GCP WireGuard VM’s public key
    • Endpoint: GCP VM’s static external IP
    • Port: 51820
    • Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to GCP subnets
    • Home lab servers can reach GCP boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on Compute Engine performance:

  • e2-micro (2 vCPU, shared core): ~100-300 Mbps
  • e2-small (2 vCPU): ~500-800 Mbps
  • e2-medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even e2-micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: e2-micro adequate for home lab scale

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
  • GCP Network: Low-latency network to most regions
  • Total Latency: Primarily dependent on home ISP and GCP region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • e2-micro: Sufficient CPU for home lab VPN throughput

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secret Manager: Store WireGuard private keys in GCP Secret Manager
    • Retrieve at VM startup via startup script
    • Avoid storing in VM metadata or disk images
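
A minimal sketch of storing the key in Secret Manager as described above (the secret name matches the one used in the startup script later in this section):

# Create the secret and load the private key as its first version
gcloud secrets create wireguard-server-key --replication-policy=automatic
wg genkey | gcloud secrets versions add wireguard-server-key --data-file=-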

Firewall Hardening

  • Source IP Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server firewall allows only VPN subnet
  • No Public Access: Boot server has no external IP

Monitoring and Alerts

  • Cloud Logging: Log WireGuard connection events
  • Cloud Monitoring: Alert on VPN tunnel down
  • Metrics: Monitor handshake failures, data transfer

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification
  • Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)

High Availability Options

Multi-Region Failover

Deploy WireGuard gateways in multiple regions:

  • Primary: us-central1 WireGuard VM
  • Secondary: us-east1 WireGuard VM
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles VM costs (~$8-14/month for 2 VMs)

Health Checks

Monitor WireGuard tunnel health:

# On UDM Pro (via SSH)
wg show wg0 latest-handshakes

# If handshake timestamp old (>3 minutes), tunnel may be down

Automate failover with script on UDM Pro or external monitoring.
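
A hedged sketch of such a watchdog, suitable for cron on the gateway VM or the UDM Pro; it simply bounces the interface when the handshake goes stale (switching to a secondary endpoint with wg set could be added similarly):

#!/bin/bash
# If no handshake for 3 minutes, restart the WireGuard interface
LAST=$(wg show wg0 latest-handshakes | awk '{print $2}')
if [ $(( $(date +%s) - LAST )) -gt 180 ]; then
    wg-quick down wg0 && wg-quick up wg0
fi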

Startup Scripts for Auto-Healing

GCP VM startup script to ensure WireGuard starts on boot:

#!/bin/bash
# /etc/startup-script.sh

# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Attach as metadata:

gcloud compute instances add-metadata wireguard-gateway-vm \
    --metadata-from-file startup-script=/path/to/startup-script.sh \
    --zone=us-central1-a

Cost Analysis

Self-Managed WireGuard on Compute Engine

| Component | Cost |
|---|---|
| e2-micro VM (730 hrs/month) | ~$6.50 |
| Static External IP | ~$3.50 |
| Egress (1GB/month boot traffic) | ~$0.12 |
| Monthly Total | ~$10.12 |
| Annual Total | ~$121 |

Cloud VPN (IPsec - if WireGuard not used)

| Component | Cost |
|---|---|
| HA VPN Gateway (2 tunnels) | ~$73 |
| Egress (1GB/month) | ~$0.12 |
| Monthly Total | ~$73 |
| Annual Total | ~$876 |

Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.

Comparison with Requirements

| Requirement | GCP Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Low cost | e2-micro ~$10/month |
| Performance | ✅ Sufficient | 100+ Mbps on e2-micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-region VMs |

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM

    • Cost: ~$10/month (vs ~$73/month for Cloud VPN)
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single Region Deployment: Unless HA required, single VM adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • Zone: Single zone sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes GCP subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secret Manager
    • Restrict WireGuard port to home public IP only
    • Use startup script to configure VM on boot
    • Enable Cloud Logging for VPN events
  5. Monitoring: Set up Cloud Monitoring alerts for:

    • VM down
    • High CPU usage (indicates traffic spike or issue)
    • Firewall rule blocks (indicates misconfiguration)

Future Enhancements

  • HA Setup: Deploy secondary WireGuard VM in different region
  • Automated Failover: Script on UDM Pro to switch endpoints
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added


4 - HP ProLiant DL360 Gen9 Analysis

Technical analysis of HP ProLiant DL360 Gen9 server capabilities with focus on network boot support

This section contains detailed analysis of the HP ProLiant DL360 Gen9 server platform, including hardware specifications, network boot capabilities, and configuration guidance for home lab deployments.

Overview

The HP ProLiant DL360 Gen9 is a 1U rack-mountable server released by HPE as part of their Generation 9 (Gen9) product line, introduced in 2014. It’s a popular choice for home labs due to its balance of performance, density, and relative power efficiency compared to earlier generations.

Key Features

  • Form Factor: 1U rack-mountable
  • Processor Support: Dual Intel Xeon E5-2600 v3/v4 processors (Haswell/Broadwell)
  • Memory: Up to 768GB DDR4 RAM (24 DIMM slots)
  • Storage: Flexible SFF/LFF drive configurations
  • Network: Integrated quad-port 1GbE or 10GbE FlexibleLOM options
  • Management: iLO 4 (Integrated Lights-Out) with remote KVM and virtual media
  • Boot Options: UEFI and Legacy BIOS support with extensive network boot capabilities

Documentation Sections

4.1 - Configuration Guide

Setup, optimization, and configuration recommendations for HP ProLiant DL360 Gen9 in home lab environments

Initial Setup

Hardware Assembly

  1. Install Processors:

    • Use thermal paste (HPE thermal grease recommended)
    • Align CPU carefully with socket (LGA 2011-3)
    • Secure heatsink with proper torque (hand-tighten screws in cross pattern)
    • Install both CPUs for dual-socket configuration
  2. Install Memory:

    • Populate channels evenly (see Memory Configuration below)
    • Seat DIMMs firmly until retention clips engage
    • Verify all DIMMs recognized in POST
  3. Install Storage:

    • Insert drives into hot-swap caddies
    • Label drives clearly for identification
    • Configure RAID controller (see Storage Configuration below)
  4. Install Network Cards:

    • FlexibleLOM: Slide into dedicated slot until seated
    • PCIe cards: Ensure low-profile brackets, secure with screw
    • Note MAC addresses for DHCP reservations
  5. Connect Power:

    • Install PSUs (both for redundancy)
    • Connect power cords
    • Verify PSU LEDs indicate proper operation
  6. Initial Power-On:

    • Press power button
    • Monitor POST on screen or via iLO remote console
    • Address any POST errors before proceeding

iLO 4 Initial Configuration

Physical iLO Connection

  1. Connect Ethernet cable to dedicated iLO port (not FlexibleLOM)
  2. Default iLO IP: Obtains via DHCP, or use temporary address via RBSU
  3. Check DHCP server logs for iLO MAC and assigned IP

First Login

  1. Access iLO web interface: https://<ilo-ip>
  2. Default credentials:
    • Username: Administrator
    • Password: Printed on the label on the server's pull-out tab (or on the rear label)
  3. Immediately change default password (Administration > Access Settings)

Essential iLO Settings

Network Configuration (Administration > Network):

  • Set static IP or DHCP reservation
  • Configure DNS servers
  • Set hostname (e.g., ilo-dl360-01)
  • Enable SNTP time sync

Security (Administration > Security):

  • Enforce HTTPS only (disable HTTP)
  • Configure SSH key authentication if using CLI
  • Set strong password policy
  • Enable iLO Security features

Access (Administration > Access Settings):

  • Configure iLO username/password for automation
  • Create additional user accounts (separation of duties)
  • Set session timeout (default: 30 minutes)

Date and Time (Administration > Date and Time):

  • Set NTP servers for accurate timestamps
  • Configure timezone

Licenses (Administration > Licensing):

  • Install iLO Advanced license key (required for full virtual media)
  • License can be purchased or acquired from secondary market

iLO Firmware Update

Before production use, update iLO to latest version:

  1. Download latest iLO 4 firmware from HPE Support Portal
  2. Administration > Firmware > Update Firmware
  3. Upload .bin file, apply update
  4. iLO will reboot automatically (system stays running)

System ROM (BIOS/UEFI) Configuration

Accessing RBSU

  • Local: Press F9 during POST
  • Remote: iLO Remote Console > Power > Momentary Press > Press F9 when prompted

Boot Mode Selection

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Mode:

  • UEFI Mode (recommended for modern OS):

    • Supports GPT partitions (>2TB disks)
    • Required for Secure Boot
    • Better UEFI HTTP boot support
    • IPv6 PXE boot support
  • Legacy BIOS Mode:

    • For older OS or compatibility
    • MBR partition tables only
    • Traditional PXE boot

Recommendation: Use UEFI Mode unless legacy compatibility required

Boot Order Configuration

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > UEFI Boot Order:

Recommended order for network boot deployment:

  1. Network Boot: FlexibleLOM or PCIe NIC
  2. Internal Storage: RAID controller or disk
  3. Virtual Media: iLO virtual CD/DVD (for installation media)
  4. USB: For rescue/recovery

Enable Network Boot:

  • System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > Network Boot
  • Set to “Enabled”

Performance and Power Settings

System Configuration > BIOS/Platform Configuration (RBSU) > Power Management:

  • Power Regulator Mode:

    • HP Dynamic Power Savings: Balanced power/performance (recommended for home lab)
    • HP Static High Performance: Maximum performance, higher power draw
    • HP Static Low Power: Minimize power, reduced performance
    • OS Control: Let OS manage (e.g., Linux cpufreq)
  • Collaborative Power Control: Disabled (for standalone servers)

  • Minimum Processor Idle Power Core C-State: C6 (lower idle power)

  • Energy/Performance Bias: Balanced Performance (or Maximum Performance for compute workloads)

Recommendation: Start with “Dynamic Power Savings” and adjust based on workload
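If OS Control is selected, the active governor can be checked and changed from Linux; a brief sketch assuming the cpupower utility (from the distribution's linux-tools package) is installed:

# Show the current frequency driver, governor, and supported range
cpupower frequency-info

# Switch all cores to the performance governor ("powersave" favors efficiency instead)
sudo cpupower frequency-set -g performance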

Memory Configuration

Optimal Population (dual-CPU configuration):

For maximum performance, populate all channels before adding second DIMM per channel:

64GB (8x 8GB):

  • CPU1: Slots 1, 4, 7, 10 and CPU2: Slots 1, 4, 7, 10
  • Result: 4 channels per CPU, 1 DIMM per channel

128GB (8x 16GB):

  • Same as above with 16GB DIMMs

192GB (12x 16GB):

  • CPU1: Slots 1, 4, 7, 10, 2, 5 and CPU2: Slots 1, 4, 7, 10, 2, 5
  • Result: 4 channels per CPU, some with 2 DIMMs per channel

768GB (24x 32GB):

  • All slots populated

Check Configuration: RBSU > System Information > Memory Information
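Once an OS is installed, the DIMM population can also be cross-checked without rebooting into RBSU; a quick sketch using dmidecode (packaged by most distributions):

# Show size, speed, and slot locator for every DIMM slot
# (empty slots report "Size: No Module Installed")
sudo dmidecode -t memory | grep -E '^\s*(Size|Speed|Locator):'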

Processor Options

System Configuration > BIOS/Platform Configuration (RBSU) > Processor Options:

  • Intel Hyperthreading: Enabled (recommended for most workloads)

    • Doubles logical cores (e.g., 12-core CPU shows as 24 cores)
    • Benefits most virtualization and multi-threaded workloads
    • Disable only for specific security compliance (e.g., some cloud providers)
  • Intel Virtualization Technology (VT-x): Enabled (required for hypervisors)

  • Intel VT-d (IOMMU): Enabled (required for PCI passthrough, SR-IOV)

  • Turbo Boost: Enabled (allows CPU to exceed base clock)

  • Cores Enabled: All (or reduce to lower power/heat if needed)

Integrated Devices

System Configuration > BIOS/Platform Configuration (RBSU) > System Options > Integrated Devices:

  • Embedded SATA Controller: Enabled (if using SATA drives)
  • Embedded RAID Controller: Enabled (for Smart Array controllers)
  • SR-IOV: Enabled (if using virtual network interfaces with VMs)

Network Controller Options

For each NIC (FlexibleLOM, PCIe):

System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > [Adapter]:

  • Network Boot: Enabled (for network boot on that NIC)
  • PXE/iSCSI: Select PXE for standard network boot
  • Link Speed: Auto-Negotiation (recommended) or force 1G/10G
  • IPv4: Enabled (for IPv4 PXE boot)
  • IPv6: Enabled (if using IPv6 PXE boot)

Boot Order: Configure which NIC boots first if multiple are enabled

Secure Boot Configuration

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > Secure Boot:

  • Secure Boot: Disabled (for unsigned boot loaders, custom kernels)
  • Secure Boot: Enabled (for signed boot loaders, Windows, some Linux distros)

Note: If using PXE with unsigned images (e.g., custom iPXE), Secure Boot must be disabled

Firmware Updates

Update System ROM to latest version:

  1. Via iLO:

    • iLO web > Administration > Firmware > Update Firmware
    • Upload System ROM .fwpkg or .bin file
    • Server reboots automatically to apply
  2. Via Service Pack for ProLiant (SPP):

    • Download SPP ISO from HPE Support Portal
    • Mount via iLO Virtual Media
    • Boot server from SPP ISO
    • Smart Update Manager (SUM) runs in Linux environment
    • Select components to update (System ROM, iLO, controller firmware, NIC firmware)
    • Apply updates, reboot

Recommendation: Use SPP for comprehensive updates on initial setup, then iLO for individual component updates

Storage Configuration

Smart Array Controller Setup

Access Smart Array Configuration

  • During POST: Press F5 when “Smart Array Configuration Utility” message appears
  • Via RBSU: System Configuration > BIOS/Platform Configuration (RBSU) > System Options > ROM-Based Setup Utility > Smart Array Configuration

Create RAID Arrays

  1. Delete Existing Arrays (if reconfiguring):

    • Select controller > Configuration > Delete Array
    • Confirm deletion (data loss warning)
  2. Create New Array:

    • Select controller > Configuration > Create Array
    • Select physical drives to include
    • Choose RAID level:
      • RAID 0: Striping, no redundancy (maximum performance, maximum capacity)
      • RAID 1: Mirroring (redundancy, half capacity, good for boot drives)
      • RAID 5: Striping + parity (redundancy, n-1 capacity, balanced)
      • RAID 6: Striping + double parity (dual-drive failure tolerance, n-2 capacity)
      • RAID 10: Mirror + stripe (high performance + redundancy, half capacity)
    • Configure spare drives (hot spares for automatic rebuild)
    • Create logical drive
    • Set bootable flag if boot drive
  3. Recommended Configurations (see the ssacli sketch after this list):

    • Boot/OS: 2x SSD in RAID 1 (redundancy, fast boot)
    • Data (performance): 4-6x SSD in RAID 10 (fast, redundant)
    • Data (capacity): 4-8x HDD in RAID 6 (capacity, dual-drive tolerance)
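These arrays can also be created from a running OS with HPE's ssacli utility instead of the F5 utility; a hedged sketch, assuming the controller sits in slot 0 and using placeholder drive bay IDs:

# Show the controller, existing arrays, and unassigned drives
sudo ssacli ctrl slot=0 show config

# RAID 1 logical drive from two SSDs for the boot/OS volume
sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=1

# RAID 6 logical drive across four data disks
sudo ssacli ctrl slot=0 create type=ld drives=1I:1:3,1I:1:4,1I:1:5,1I:1:6 raid=6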

Controller Settings

  • Cache Settings:

    • Write Cache: Enabled (requires battery/flash-backed cache)
    • Read Cache: Enabled
    • No-Battery Write Cache: Disabled (data safety) or Enabled (performance, risk)
  • Rebuild Priority: Medium or High (faster rebuild, may impact performance)

  • Surface Scan Delay: 3-7 days (periodic integrity check)

HBA Mode (Non-RAID)

For software RAID (ZFS, mdadm, Ceph):

  1. Access Smart Array Configuration (F5 during POST)
  2. Controller > Configuration > Enable HBA Mode
  3. Confirm (RAID arrays will be deleted)
  4. Reboot

Note: Not all Smart Array controllers support HBA mode. Check compatibility. Alternative: Use separate LSI HBA in PCIe slot.

Network Configuration for Boot

DHCP Server Setup

For PXE/UEFI network boot, configure DHCP server with appropriate options:

ISC DHCP Example (/etc/dhcp/dhcpd.conf):

# Client system architecture option (RFC 4578), used to tell UEFI clients from BIOS clients
option arch code 93 = unsigned integer 16;

# Define subnet
subnet 192.168.10.0 netmask 255.255.255.0 {
    range 192.168.10.100 192.168.10.200;
    option routers 192.168.10.1;
    option domain-name-servers 192.168.10.1;
    
    # PXE boot options
    next-server 192.168.10.5;  # TFTP server IP
    
    # Differentiate UEFI vs BIOS
    if exists user-class and option user-class = "iPXE" {
        # iPXE boot script
        filename "http://boot.example.com/boot.ipxe";
    } elsif option arch = 00:07 or option arch = 00:09 {
        # UEFI (x86-64)
        filename "bootx64.efi";
    } else {
        # Legacy BIOS
        filename "undionly.kpxe";
    }
}

# Static reservation for DL360
host dl360-01 {
    hardware ethernet xx:xx:xx:xx:xx:xx;  # FlexibleLOM MAC
    fixed-address 192.168.10.50;
    option host-name "dl360-01";
}
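If dnsmasq is preferred over ISC DHCP, a roughly equivalent sketch that serves DHCP and TFTP on the boot VLAN; the interface name, address ranges, and TFTP root are assumptions:

# /etc/dnsmasq.conf - dnsmasq acting as DHCP + TFTP for the boot network
interface=eth0
dhcp-range=192.168.10.100,192.168.10.200,255.255.255.0,12h

# Built-in TFTP server for the initial boot files
enable-tftp
tftp-root=/var/lib/tftpboot

# Tag clients by architecture (RFC 4578) and serve the matching loader
dhcp-match=set:bios,option:client-arch,0
dhcp-match=set:efi64,option:client-arch,7
dhcp-match=set:efi64,option:client-arch,9
dhcp-boot=tag:bios,undionly.kpxe
dhcp-boot=tag:efi64,bootx64.efi

# Static reservation for the DL360 (MAC left as a placeholder)
dhcp-host=xx:xx:xx:xx:xx:xx,192.168.10.50,dl360-01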

FlexibleLOM Configuration

Configure FlexibleLOM NIC for network boot:

  1. RBSU > Network Options > FlexibleLOM
  2. Enable “Network Boot”
  3. Select PXE or iSCSI
  4. Configure IPv4/IPv6 as needed
  5. Set as first boot device in boot order

Multi-NIC Boot Priority

If multiple NICs have network boot enabled:

  1. RBSU > Network Options > Network Boot Order
  2. Drag/drop to prioritize NIC boot order
  3. First NIC in list attempts boot first

Recommendation: Enable network boot on one NIC (typically FlexibleLOM port 1) to avoid confusion

Operating System Installation

Traditional Installation (Virtual Media)

  1. Download OS ISO (e.g., Ubuntu Server, ESXi, Proxmox)
  2. Upload ISO to HTTP/HTTPS server or local file
  3. iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
  4. Browse to ISO location, click “Insert Media”
  5. Set boot order to prioritize virtual media
  6. Reboot server, boot from virtual CD/DVD
  7. Proceed with OS installation

Network Installation (PXE)

See Network Boot Capabilities for detailed PXE/UEFI boot setup

Quick workflow:

  1. Configure DHCP server with PXE options
  2. Setup TFTP server with boot files
  3. Enable network boot in BIOS
  4. Reboot, server PXE boots
  5. Select OS installer from PXE menu
  6. Automated installation proceeds (Kickstart/Preseed/Ignition)

Optimization for Specific Workloads

Virtualization (ESXi, Proxmox, Hyper-V)

BIOS Settings:

  • Hyperthreading: Enabled
  • VT-x: Enabled
  • VT-d: Enabled
  • Power Management: Dynamic or OS Control
  • Turbo Boost: Enabled

Hardware:

  • Maximum memory (384GB+ recommended)
  • Fast storage (SSD RAID 10 for VM storage)
  • 10GbE networking for VM traffic

Configuration:

  • Pass through NICs to VMs (SR-IOV or PCI passthrough)
  • Use storage controller in HBA mode for direct disk access to VM storage (ZFS, Ceph)

Kubernetes/Container Platforms

BIOS Settings:

  • Hyperthreading: Enabled
  • VT-x/VT-d: Enabled (for nested virtualization, kata containers)
  • Power Management: Dynamic or High Performance

Hardware:

  • 128GB+ RAM for multi-tenant workloads
  • Fast local NVMe/SSD for container image cache and ephemeral storage
  • 10GbE for pod networking

OS Recommendations:

  • Talos Linux: Network-bootable, immutable k8s OS
  • Flatcar Container Linux: Auto-updating, minimal OS
  • Ubuntu Server: Broad compatibility, snap/docker native

Storage Server (NAS, SAN)

BIOS Settings:

  • Disable Hyperthreading (slight performance improvement for ZFS)
  • VT-d: Enabled (if passing through HBA to VM)
  • Power Management: High Performance

Hardware:

  • Maximum drive bays (8-10 SFF)
  • HBA mode or separate LSI HBA controller
  • 10GbE or bonded 1GbE for network storage traffic
  • ECC memory (critical for ZFS)

Software:

  • TrueNAS SCALE (Linux-based, k8s apps)
  • OpenMediaVault (Debian-based, plugins)
  • Ubuntu + ZFS (custom setup)

Compute/HPC Workloads

BIOS Settings:

  • Hyperthreading: Depends on workload (test both)
  • Turbo Boost: Enabled
  • Power Management: Maximum Performance
  • C-States: Disabled (reduce latency)

Hardware:

  • High core count CPUs (E5-2680 v4, 2690 v4)
  • Maximum memory bandwidth (populate all channels)
  • Fast local scratch storage (NVMe)

Monitoring and Maintenance

iLO Health Monitoring

Information > System Information:

  • CPU temperature and status
  • Memory status
  • Drive status (via controller)
  • Fan speeds
  • PSU status
  • Overall system health LED status

Alerting (Administration > Alerting):

  • Configure email alerts for:
    • Fan failures
    • Temperature warnings
    • Drive failures
    • Memory errors
    • PSU failures
  • Set up SNMP traps for integration with monitoring systems (Nagios, Zabbix, Prometheus)
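Besides SNMP, iLO 4 exposes sensor readings over IPMI, which external monitoring stacks can scrape; a hedged sketch using ipmitool, assuming IPMI/DCMI over LAN is enabled in the iLO access settings (hostname and credentials are placeholders):

# Temperature, fan, PSU, and voltage sensor readings over the network
ipmitool -I lanplus -H ilo-dl360-01.example.com -U monitoring -P 'REDACTED' sensor

# Hardware fault history from the System Event Log
ipmitool -I lanplus -H ilo-dl360-01.example.com -U monitoring -P 'REDACTED' sel list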

Integrated Management Log (IML)

Information > Integrated Management Log:

  • View hardware events and errors
  • Filter by severity (Informational, Caution, Critical)
  • Export log for troubleshooting

Regular Checks:

  • Review IML weekly for early warning signs
  • Address caution-level events before they become critical

Firmware Update Cadence

Recommendation:

  • iLO: Update quarterly or when security advisories released
  • System ROM: Update annually or for bug fixes
  • Storage Controller: Update when issues arise or annually
  • NIC Firmware: Update when issues arise

Method: Use SPP for annual comprehensive updates, iLO web interface for individual component updates

Physical Maintenance

Monthly:

  • Check fan noise (increased noise may indicate clogged air filters or failing fan)
  • Verify PSU and drive LEDs (no amber lights)
  • Check iLO for alerts

Quarterly:

  • Clean air filters (if accessible, depends on rack airflow)
  • Verify backup of iLO configuration
  • Test iLO Virtual Media functionality

Annually:

  • Update all firmware via SPP
  • Verify RAID battery/flash-backed cache status
  • Review and update BIOS settings as workload evolves

Troubleshooting Common Issues

Server Won’t Power On

  1. Check PSU power cords connected
  2. Verify PSU LEDs indicate power
  3. Press iLO power button via web interface
  4. Check iLO IML for power-related errors
  5. Reseat PSUs, check for blown fuses

POST Errors

Memory Errors:

  • Reseat memory DIMMs
  • Test with minimal configuration (1 DIMM per CPU)
  • Replace failing DIMMs identified in POST

CPU Errors:

  • Verify heatsink properly seated
  • Check thermal paste application
  • Reseat CPU (careful with pins)

Drive Errors:

  • Check drive connection to caddy
  • Verify controller recognizes drive
  • Replace failing drive

No Network Boot

See Network Boot Troubleshooting for detailed diagnostics

Quick checks:

  1. Verify NIC link light
  2. Confirm network boot enabled in BIOS
  3. Check DHCP server logs for PXE request
  4. Test TFTP server accessibility
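Two of these checks can be run from another host on the same segment; a short sketch assuming a Linux admin machine, with the TFTP server address and boot filename as placeholders:

# Watch for the server's DHCP/PXE DISCOVER packets while it attempts to boot
sudo tcpdump -i eth0 -n port 67 or port 68

# Confirm the TFTP server actually serves the boot file
tftp 192.168.10.5 -c get undionly.kpxe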

iLO Not Accessible

  1. Check physical Ethernet connection to iLO port
  2. Verify switch port active
  3. Reset iLO: Press and hold iLO NMI button (rear) for 5 seconds
  4. Factory reset iLO via jumper (see maintenance guide)
  5. Check iLO firmware version, update if outdated

High Fan Noise

  1. Check ambient temperature (<25°C recommended)
  2. Verify airflow not blocked (front/rear clearance)
  3. Clean dust from intake (compressed air)
  4. Check iLO temperature sensors for elevated temps
  5. Lower CPU TDP if temperatures excessive (lower power CPUs)
  6. Verify all fans operational (replace failed fans)

Security Hardening

iLO Security

  1. Change Default Credentials: Immediately on first boot
  2. Disable Unused Services: SSH, IPMI if not needed
  3. Use HTTPS Only: Disable HTTP (Administration > Network > HTTP Port)
  4. Network Isolation: Dedicated management VLAN, firewall iLO access
  5. Update Firmware: Apply security patches promptly
  6. Account Management: Use separate accounts, least privilege

BIOS/UEFI Security

  1. BIOS Password: Set administrator password (RBSU > System Options > BIOS Admin Password)
  2. Secure Boot: Enable if using signed boot loaders
  3. Boot Order Lock: Prevent unauthorized boot device changes
  4. TPM: Enable if using BitLocker or LUKS disk encryption

Operating System Security

  1. Minimal Installation: Install only required packages
  2. Firewall: Enable host firewall (iptables, firewalld, ufw)
  3. SSH Hardening: Key-based auth, disable password auth, non-standard port
  4. Automatic Updates: Enable for security patches
  5. Monitoring: Deploy intrusion detection (fail2ban, OSSEC)
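A minimal sketch of items 2 and 3 on an Ubuntu host, assuming ufw and a 192.168.10.0/24 management subnet:

# Default-deny inbound, allow SSH only from the management subnet
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.10.0/24 to any port 22 proto tcp
sudo ufw enable

# Disable SSH password authentication (key-based logins only)
sudo sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh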

Conclusion

Proper configuration of the HP ProLiant DL360 Gen9 ensures optimal performance, reliability, and manageability for home lab and production deployments. The combination of UEFI boot capabilities, iLO remote management, and flexible hardware configuration makes the DL360 Gen9 a versatile platform for virtualization, containerization, storage, and compute workloads.

Key takeaways:

  • Update firmware early (iLO, System ROM, controllers)
  • Configure iLO for remote management and monitoring
  • Choose boot mode (UEFI recommended) and configure network boot appropriately
  • Optimize BIOS settings for specific workload (virtualization, storage, compute)
  • Implement security hardening (iLO, BIOS, OS)
  • Establish monitoring and maintenance schedule

For network boot-specific configuration, refer to the Network Boot Capabilities guide.

4.2 - Hardware Specifications

Detailed hardware specifications and configuration options for HP ProLiant DL360 Gen9

System Overview

The HP ProLiant DL360 Gen9 is a dual-socket 1U rack server designed for data center and enterprise deployments, also popular in home lab environments due to its performance and manageability.

Generation: Gen9 (2014-2017 product cycle)
Form Factor: 1U rack-mountable (19-inch standard rack)
Dimensions: 43.46 x 67.31 x 4.29 cm (17.1 x 26.5 x 1.69 in)

Processor Support

Supported CPU Families

The DL360 Gen9 supports Intel Xeon E5-2600 v3 and v4 series processors:

  • E5-2600 v3 (Haswell-EP): Released Q3 2014

    • Process: 22nm
    • Cores: 4-18 per socket
    • TDP: 55W-145W
    • Max Memory Speed: DDR4-2133
  • E5-2600 v4 (Broadwell-EP): Released Q1 2016

    • Process: 14nm
    • Cores: 4-22 per socket
    • TDP: 55W-145W
    • Max Memory Speed: DDR4-2400

Common processor choices for home lab builds:

  • Value: E5-2620 v3/v4 (6 cores, 15MB cache, 85W)
  • Balanced: E5-2650 v3/v4 (10-12 cores, 25-30MB cache, 105W)
  • Performance: E5-2680 v3/v4 (12-14 cores, 30-35MB cache, 120W)
  • High Core Count: E5-2699 v4 (22 cores, 55MB cache, 145W)

Configuration Options

  • Single Processor: One CPU socket populated (budget option)
  • Dual Processor: Both sockets populated (full performance)

Note: Memory and I/O performance scales with processor count. Single-CPU configuration limits memory channels and PCIe lanes.

Memory Architecture

Memory Specifications

  • Type: DDR4 RDIMM or LRDIMM
  • Speed: DDR4-2133 (v3) or DDR4-2400 (v4)
  • Slots: 24 DIMM slots (12 per processor)
  • Maximum Capacity:
    • 768GB with 32GB RDIMMs
    • 1.5TB with 64GB LRDIMMs (v4 processors)
  • Minimum: 8GB (1x 8GB DIMM)

Memory Configuration Rules

  • Channels per CPU: 4 channels, 3 DIMMs per channel
  • Population: Populate channels evenly for optimal bandwidth
  • Mixing: Do not mix RDIMM and LRDIMM types
  • Speed: All DIMMs run at speed of slowest DIMM

Basic Home Lab (Single CPU):

  • 4x 16GB = 64GB (one DIMM in each of the four memory channels)

Standard (Dual CPU):

  • 8x 16GB = 128GB (one DIMM per channel)
  • 12x 16GB = 192GB (two DIMMs per channel on primary channels)

High Capacity (Dual CPU):

  • 24x 32GB = 768GB (all slots populated, RDIMM)

Performance Priority: Populate all channels before adding second DIMM per channel

Storage Options

Drive Bay Configurations

The DL360 Gen9 offers multiple drive bay configurations:

  1. 8 SFF (2.5-inch): Most common configuration
  2. 10 SFF: Extended bay version
  3. 4 LFF (3.5-inch): Less common in 1U form factor

Drive Types Supported

  • SAS: 12Gb/s, 6Gb/s (enterprise-grade)
  • SATA: 6Gb/s, 3Gb/s (value option)
  • SSD: SAS/SATA SSD, NVMe (with appropriate controller)

Storage Controllers

Smart Array Controllers (HPE proprietary RAID):

  • P440ar: Entry-level, 2GB FBWC (Flash-Backed Write Cache), RAID 0/1/5/6/10
  • P840ar: High-performance, 4GB FBWC, RAID 0/1/5/6/10/50/60
  • P440: PCIe card version, 2GB FBWC
  • P840: PCIe card version, 4GB FBWC

HBA Mode (non-RAID pass-through):

  • Smart Array controllers in HBA mode for software RAID (ZFS, mdadm)
  • Limited support; check firmware version

Alternative Controllers:

  • LSI/Broadcom HBA controllers in PCIe slots
  • H240ar (12Gb/s HBA mode)

Boot Drive Options

For network-focused deployments:

  • Minimal Local Storage: 2x SSD in RAID 1 for hypervisor/OS
  • USB/SD Boot: Boot from the internal USB port or SD card slot
  • Diskless: Pure network boot (subject of network-boot.md)

Network Connectivity

Integrated FlexibleLOM

The DL360 Gen9 includes a FlexibleLOM slot for swappable network adapters:

Common FlexibleLOM Options:

  • HPE 366FLR: 4x 1GbE (Broadcom BCM5719)

    • Most common, good for general use
    • Supports PXE, UEFI network boot, SR-IOV
  • HPE 560FLR-SFP+: 2x 10GbE SFP+ (Intel X710)

    • High performance, fiber or DAC
    • Supports PXE, UEFI boot, SR-IOV, RDMA (RoCE)
  • HPE 361i: 2x 1GbE (Intel I350)

    • Entry-level, good driver support

PCIe Expansion Slots

Slot Configuration:

  • Slot 1: PCIe 3.0 x16 (low-profile)
  • Slot 2: PCIe 3.0 x8 (low-profile)
  • Slot 3: PCIe 3.0 x8 (low-profile) - optional, depends on riser

Network Card Options:

  • Intel X520/X710 (10GbE)
  • Mellanox ConnectX-3/ConnectX-4 (10/25/40GbE, InfiniBand)
  • Broadcom NetXtreme (1/10/25GbE)

Note: Ensure cards are low-profile for 1U chassis compatibility

Power Supply

PSU Options

  • 500W: Single PSU, non-redundant (not recommended)
  • 800W: Common, supports dual CPU + moderate expansion
  • 1400W: High-power, dual CPU with high TDP + GPUs
  • Redundancy: 1+1 redundant hot-plug recommended

Power Configuration

  • Platinum Efficiency: 94%+ at 50% load
  • Hot-Plug: Replace without powering down
  • Auto-Switching: 100-240V AC, 50/60Hz

Home Lab Power Draw (typical):

  • Idle (dual E5-2650 v3, 128GB RAM): 100-130W
  • Load: 200-350W depending on CPU and drive configuration

Power Management

  • HPE Dynamic Power Capping: Limit max power via iLO
  • Collaborative Power: Share power budget across chassis in blade environments
  • Energy Efficient Ethernet (EEE): Reduce NIC power during low utilization

Cooling and Acoustics

Fan Configuration

  • 6x Hot-Plug Fans: Front-mounted, redundant (N+1)
  • Variable Speed: Controlled by System ROM based on thermal sensors
  • iLO Management: Monitor fan speed, temperature via iLO

Thermal Management

  • Temperature Range: 10-35°C (50-95°F) operating
  • Altitude: Up to 3,050m (10,000 ft) at reduced temperature
  • Airflow: Front-to-back, ensure clear intake and exhaust

Noise Level

  • Idle: ~45 dBA (quiet for 1U server)
  • Load: 55-70 dBA depending on thermal demand
  • Home Lab Consideration: Audible but acceptable in dedicated space; louder than desktop workstation

Noise Reduction:

  • Run lower TDP CPUs (e.g., E5-2620 series)
  • Maintain ambient temperature <25°C
  • Ensure adequate airflow (not in enclosed cabinet without ventilation)

Management - iLO 4

iLO 4 Features

The Integrated Lights-Out 4 (iLO 4) provides out-of-band management:

  • Web Interface: HTTPS management console
  • Remote Console: HTML5 or Java-based KVM
  • Virtual Media: Mount ISOs/images remotely
  • Power Control: Power on/off, reset, cold boot
  • Monitoring: Sensors, event logs, hardware health
  • Alerting: Email alerts, SNMP traps, syslog
  • Scripting: RESTful API (Redfish standard)

iLO Licensing

  • iLO Standard (included): Basic management, remote console
  • iLO Advanced (license required):
    • Virtual media
    • Remote console performance improvements
    • Directory integration (LDAP/AD)
    • Graphical remote console
  • iLO Advanced Premium (license required):
    • Insight Remote Support
    • Federation
    • Jitter smoothing

Home Lab: iLO Advanced license highly recommended for virtual media and full remote console features

iLO Network Configuration

  • Dedicated iLO Port: Separate 1GbE management port (recommended)
  • Shared LOM: Share FlexibleLOM port with OS (not recommended for isolation)

Security: Isolate iLO on a dedicated management VLAN; if remote management is not needed, disable iLO network access

BIOS and Firmware

System ROM (BIOS/UEFI)

  • Firmware Type: UEFI 2.31 or later
  • Boot Modes: UEFI, Legacy BIOS, or hybrid
  • Configuration: RBSU (ROM-Based Setup Utility) accessible via F9

Firmware Update Methods

  1. Service Pack for ProLiant (SPP): Comprehensive bundle of all firmware
  2. iLO Online Flash: Update via web interface
  3. Online ROM Flash: Linux utility for online updates
  4. USB Flash: Boot from USB with firmware update utility

Recommended Practice: Update to latest SPP for security patches and feature improvements

Secure Boot

  • UEFI Secure Boot: Supported, validates boot loader signatures
  • TPM: Optional Trusted Platform Module 1.2 or 2.0
  • Boot Order Protection: Prevent unauthorized boot device changes

Expansion and Modularity

GPU Support

Limited GPU support due to 1U form factor and power constraints:

  • Low-Profile GPUs: Nvidia T4, AMD Instinct MI25 (may require custom cooling)
  • Power: Consider 1400W PSU for high-power GPUs
  • Not Ideal: For GPU-heavy workloads, consider 2U+ servers (e.g., DL380 Gen9)

USB Ports

  • Front: 1x USB 3.0
  • Rear: 2x USB 3.0
  • Internal: 1x USB 2.0 (for SD/USB boot device)

Serial Port

  • Rear serial port for legacy console access
  • Useful for network equipment serial console, debug

Home Lab Considerations

Pros for Home Lab

  1. Density: 1U form factor saves rack space
  2. iLO Management: Enterprise remote management without KVM
  3. Network Boot: Excellent PXE/UEFI boot support (see network-boot.md)
  4. Serviceability: Hot-swap drives, PSU, fans
  5. Documentation: Extensive HPE documentation and community support
  6. Parts Availability: Common on secondary market, affordable

Cons for Home Lab

  1. Noise: Louder than tower servers or workstations
  2. Power: Higher idle power than consumer hardware (100-130W idle)
  3. 1U Limitations: Limited GPU, PCIe expansion vs 2U/4U chassis
  4. Firmware: Requires HPE account for SPP downloads (free but registration required)

Example Home Lab Configurations

Budget (~$500-800 used):

  • Dual E5-2620 v3 or v4 (6 cores each, 85W TDP)
  • 128GB RAM (8x 16GB DDR4)
  • 2x SSD (boot), 4-6x HDD/SSD (data)
  • HPE 366FLR (4x 1GbE)
  • Dual 500W or 800W PSU (redundant)
  • iLO Advanced license

Performance (~$1000-1500 used):

  • Dual E5-2680 v4 (14 cores each, 120W TDP)
  • 256GB RAM (16x 16GB DDR4)
  • 2x NVMe SSD (boot/cache), 6-8x SSD (data)
  • HPE 560FLR-SFP+ (2x 10GbE) + PCIe 4x1GbE card
  • Dual 800W PSU
  • iLO Advanced license

Comparison with Other Generations

vs Gen8 (Previous)

Gen9 Advantages:

  • DDR4 vs DDR3 (lower power, higher capacity)
  • Better UEFI support and HTTP boot
  • Newer processor architecture (Haswell/Broadwell vs Sandy Bridge/Ivy Bridge)
  • iLO 4 vs iLO 3 (better HTML5 console)

Gen8 Advantages:

  • Lower cost on secondary market
  • Adequate for light workloads

vs Gen10 (Next)

Gen10 Advantages:

  • Newer CPUs (Skylake-SP/Cascade Lake)
  • More PCIe lanes
  • Better UEFI firmware and security features
  • DDR4-2666/2933 support

Gen9 Advantages:

  • Lower cost (mature product cycle)
  • Excellent value for performance/dollar
  • Still well-supported by modern OS and firmware

Technical Resources

  • QuickSpecs: HPE ProLiant DL360 Gen9 Server QuickSpecs
  • User Guide: HPE ProLiant DL360 Gen9 Server User Guide
  • Maintenance and Service Guide: Detailed disassembly and part replacement
  • Firmware Downloads: HPE Support Portal (requires free account)

Summary

The HP ProLiant DL360 Gen9 remains an excellent choice for home labs and small deployments in 2024-2025. Its balance of performance (dual Xeon v4, 768GB RAM capacity), manageability (iLO 4), and network boot capabilities make it particularly well-suited for virtualization, container hosting, and infrastructure automation workflows. While not the latest generation, it offers strong value with robust firmware support and wide secondary market availability.

Best For:

  • Virtualization hosts (ESXi, Proxmox, Hyper-V)
  • Kubernetes/container platforms
  • Network boot/diskless deployments
  • Storage servers (with appropriate controller)
  • General compute workloads

Avoid For:

  • GPU-intensive workloads (1U constraints)
  • Noise-sensitive environments (unless isolated)
  • Extreme low-power requirements (100W+ idle)

4.3 - Network Boot Capabilities

Comprehensive analysis of network boot support on HP ProLiant DL360 Gen9

Overview

The HP ProLiant DL360 Gen9 provides robust network boot capabilities through multiple protocols and firmware interfaces. This makes it particularly well-suited for diskless deployments, automated provisioning, and infrastructure-as-code workflows.

Supported Network Boot Protocols

PXE (Preboot Execution Environment)

The DL360 Gen9 fully supports PXE boot via both legacy BIOS and UEFI firmware modes:

  • Legacy BIOS PXE: Traditional PXE implementation using TFTP

    • Protocol: PXEv2 (PXE 2.1)
    • Network Stack: IPv4 only in legacy mode
    • Boot files: pxelinux.0, undionly.kpxe, or custom NBP
    • DHCP options: Standard options 66 (TFTP server) and 67 (boot filename)
  • UEFI PXE: Modern UEFI network boot implementation

    • Protocol: PXEv2 with UEFI extensions
    • Network Stack: IPv4 and IPv6 support
    • Boot files: bootx64.efi, grubx64.efi, shimx64.efi
    • Architecture: x64 (EFI BC)
    • DHCP Architecture ID: 0x0007 (EFI BC) or 0x0009 (EFI x86-64)

iPXE Support

The DL360 Gen9 can boot iPXE, enabling advanced features:

  • Chainloading: Boot standard PXE, then chainload iPXE for enhanced capabilities
  • HTTP/HTTPS Boot: Download kernels and images over HTTP(S) instead of TFTP
  • SAN Boot: iSCSI and AoE (ATA over Ethernet) support
  • Scripting: Conditional boot logic and dynamic configuration
  • Embedded Scripts: iPXE can be compiled with embedded boot scripts

Implementation Methods:

  1. Chainload from standard PXE: DHCP points to undionly.kpxe or ipxe.efi
  2. Flash iPXE to FlexibleLOM option ROM (advanced, requires care)
  3. Boot iPXE from USB, then continue network boot
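As an illustration of method 1, DHCP hands plain PXE clients undionly.kpxe (or ipxe.efi for UEFI), and the chainloaded iPXE then fetches its real configuration over HTTP; a minimal sketch of such a script, with the URL as a placeholder:

#!ipxe
# Bring up the NIC, obtain a lease, then hand off to an HTTP-hosted boot script
dhcp
chain http://boot.example.com/boot.ipxe || shell

The || shell fallback drops to the interactive iPXE shell if the download fails, which is convenient for debugging.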

UEFI HTTP Boot

Native UEFI HTTP boot is supported on Gen9 servers with recent firmware:

  • Protocol: RFC 7230 HTTP/1.1
  • Requirements:
    • UEFI firmware version 2.40 or later (check via iLO)
    • DHCP option 60 (vendor class identifier) = “HTTPClient”
    • DHCP option 67 pointing to HTTP(S) URL
  • Advantages:
    • No TFTP server required
    • Faster transfers than TFTP
    • Support for HTTPS with certificate validation
    • Better suited for large images (kernels, initramfs)
  • Limitations:
    • UEFI mode only (not available in legacy BIOS)
    • Requires DHCP server with HTTP URL support

HTTP(S) Boot Configuration

For UEFI HTTP boot on DL360 Gen9:

# Example ISC DHCP configuration for UEFI HTTP boot
class "httpclients" {
    match if substring(option vendor-class-identifier, 0, 10) = "HTTPClient";
}

pool {
    allow members of "httpclients";
    option vendor-class-identifier "HTTPClient";
    # Point to HTTP boot URI
    filename "http://boot.example.com/boot/efi/bootx64.efi";
}

Network Interface Options

The DL360 Gen9 supports multiple network adapter configurations for boot:

FlexibleLOM (LOM = LAN on Motherboard)

HPE FlexibleLOM slot supports:

  • HPE 366FLR: Quad-port 1GbE (Broadcom BCM5719)
  • HPE 560FLR-SFP+: Dual-port 10GbE (Intel X710)
  • HPE 361i: Dual-port 1GbE (Intel I350)

All FlexibleLOM adapters support PXE and UEFI network boot. The option ROM can be configured via BIOS/UEFI settings.

PCIe Network Adapters

Standard PCIe network cards with PXE/UEFI boot ROM support:

  • Intel X520, X710 series (10GbE)
  • Broadcom NetXtreme series
  • Mellanox ConnectX-3/4 (with appropriate firmware)

Boot Priority: Configure via System ROM > Network Boot Options to select which NIC boots first.

Firmware Configuration

Accessing Boot Configuration

  1. RBSU (ROM-Based Setup Utility): Press F9 during POST
  2. iLO 4 Remote Console: Access via network, then virtual F9
  3. UEFI System Utilities: Modern interface for UEFI firmware settings

Key Settings

Navigate to: System Configuration > BIOS/Platform Configuration (RBSU) > Network Boot Options

  • Network Boot: Enable/Disable
  • Boot Mode: UEFI or Legacy BIOS
  • IPv4/IPv6: Enable protocol support
  • Boot Retry: Number of attempts before falling back to next boot device
  • Boot Order: Prioritize network boot in boot sequence

Per-NIC Configuration

In RBSU > Network Options:

  • Option ROM: Enable/Disable per adapter
  • Link Speed: Force speed/duplex or auto-negotiate
  • VLAN: VLAN tagging for boot (if supported by DHCP/PXE environment)
  • PXE Menu: Enable interactive PXE menu (Ctrl+S during PXE boot)

iLO 4 Integration

The DL360 Gen9’s iLO 4 provides additional network boot features:

Virtual Media Network Boot

  • Mount ISO images remotely via iLO Virtual Media
  • Boot from network-attached ISO without physical media
  • Useful for OS installation or diagnostics

Workflow:

  1. Upload ISO to HTTP/HTTPS server or use SMB/NFS share
  2. iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
  3. Set boot order to prioritize virtual optical drive
  4. Reboot server

Scripted Deployment via iLO

iLO 4 RESTful API allows:

  • Setting one-time boot to network via API call
  • Automating PXE boot for provisioning pipelines
  • Integration with tools like Terraform, Ansible

Example using iLO RESTful API:

curl -k -u admin:password -X PATCH \
  -H "Content-Type: application/json" \
  https://ilo-hostname/redfish/v1/Systems/1/ \
  -d '{"Boot":{"BootSourceOverrideTarget":"Pxe","BootSourceOverrideEnabled":"Once"}}'

Boot Process Flow

Legacy BIOS PXE Boot

  1. Server powers on, initializes NICs
  2. NIC sends DHCPDISCOVER with PXE vendor options
  3. DHCP server responds with IP, TFTP server (option 66), boot file (option 67)
  4. NIC downloads NBP (Network Bootstrap Program) via TFTP
  5. NBP executes (e.g., pxelinux.0 loads syslinux menu)
  6. User selects boot target or automated script continues
  7. Kernel and initramfs download and boot

UEFI PXE Boot

  1. UEFI firmware initializes network stack
  2. UEFI PXE driver sends DHCPv4/v6 DISCOVER
  3. DHCP responds with boot file (e.g., bootx64.efi)
  4. UEFI downloads boot file via TFTP
  5. UEFI loads and executes boot loader (GRUB2, systemd-boot, iPXE)
  6. Boot loader may download additional files (kernel, initrd, config)
  7. OS boots

UEFI HTTP Boot

  1. UEFI firmware with HTTP Boot support enabled
  2. DHCP request includes “HTTPClient” vendor class
  3. DHCP responds with HTTP(S) URL in option 67
  4. UEFI HTTP client downloads boot file over HTTP(S)
  5. Execution continues as with UEFI PXE

Performance Considerations

TFTP vs HTTP

  • TFTP: Slow for large files (typical: 1-5 MB/s)
    • Use for small boot loaders only
    • Chainload to iPXE or HTTP boot for better performance
  • HTTP: 10-100x faster depending on network and server
    • Recommended for kernels, initramfs, live OS images
    • iPXE or UEFI HTTP boot required

Network Speed Impact

DL360 Gen9 boot performance by NIC speed:

  • 1GbE: Adequate for most PXE deployments (100-125 MB/s theoretical max)
  • 10GbE: Significant improvement for large image downloads (~1.25 GB/s theoretical max)
  • Bonding/Teaming: Not typically used for boot (single NIC boots)

Recommendation: For production diskless nodes or frequent re-provisioning, 10GbE with HTTP boot provides best performance.

Common Use Cases

1. Automated OS Provisioning

Boot into installer via PXE:

  • Kickstart (RHEL/CentOS/Rocky)
  • Preseed (Debian/Ubuntu)
  • Ignition (Fedora CoreOS, Flatcar)

2. Diskless Boot

Boot OS entirely from network/RAM:

  • Network root: NFS or iSCSI root filesystem
  • Overlay: Persistent storage via network overlay
  • Stateless: Boot identical image, no local state

3. Rescue and Diagnostics

Boot live environments:

  • SystemRescue
  • Clonezilla
  • Memtest86+
  • Hardware diagnostics (HPE Service Pack for ProLiant)

4. Kubernetes/Container Hosts

PXE boot immutable OS images:

  • Talos Linux: API-driven, diskless k8s nodes
  • Flatcar Container Linux: Automated updates
  • k3OS: Lightweight k8s OS

Troubleshooting

PXE Boot Fails

Symptoms: “PXE-E51: No DHCP or proxy DHCP offers received” or timeout

Checks:

  1. Verify NIC link light and switch port status
  2. Confirm DHCP server is responding (check DHCP logs)
  3. Ensure DHCP options 66 and 67 are set correctly
  4. Test TFTP server accessibility (tftp -i <server> GET <file>)
  5. Check BIOS/UEFI network boot is enabled
  6. Verify boot order prioritizes network boot
  7. Disable Secure Boot if using unsigned boot files

UEFI Network Boot Not Available

Symptoms: Network boot option missing in UEFI boot menu

Resolution:

  1. Enter RBSU (F9), navigate to Network Options
  2. Ensure at least one NIC has “Option ROM” enabled
  3. Verify Boot Mode is set to UEFI (not Legacy)
  4. Update System ROM to latest version if option is missing
  5. Some FlexibleLOM cards require firmware update for UEFI boot support

HTTP Boot Fails

Symptoms: UEFI HTTP boot option present but fails to download

Checks:

  1. Verify firmware version supports HTTP boot (>=2.40)
  2. Ensure DHCP option 67 contains valid HTTP(S) URL
  3. Test URL accessibility from another client
  4. Check DNS resolution if using hostname in URL
  5. For HTTPS: Verify certificate is trusted (or disable cert validation in test)

Slow PXE Boot

Symptoms: Boot process takes minutes instead of seconds

Optimizations:

  1. Switch from TFTP to HTTP (chainload iPXE or use UEFI HTTP boot)
  2. Increase TFTP server block size (tftp-hpa --blocksize 1468)
  3. Tune DHCP response times (reduce lease query delays)
  4. Use local network segment for boot server (avoid WAN/VPN)
  5. Enable NIC interrupt coalescing in BIOS for 10GbE

Security Considerations

Secure Boot

DL360 Gen9 supports UEFI Secure Boot:

  • Validates signed boot loaders (shim, GRUB, kernel)
  • Prevents unsigned code execution during boot
  • Required for some compliance scenarios

Configuration: RBSU > Boot Options > Secure Boot = Enabled

Implications for Network Boot:

  • Must use signed boot loaders (e.g., shim.efi signed by Microsoft/vendor)
  • Custom kernels require signing or disabling Secure Boot
  • iPXE must be signed or chainloaded from signed shim

Network Security

Risks:

  • PXE/TFTP is unencrypted and unauthenticated
  • Attacker on network can serve malicious boot images
  • DHCP spoofing can redirect to malicious boot server

Mitigations:

  1. Network Segmentation: Isolate PXE boot to management VLAN
  2. DHCP Snooping: Prevent rogue DHCP servers on switch
  3. HTTPS Boot: Use UEFI HTTP boot with TLS and certificate validation
  4. iPXE with HTTPS: Chainload iPXE, then use HTTPS for all downloads
  5. Signed Images: Use Secure Boot with signed boot chain
  6. 802.1X: Require network authentication before DHCP (complex for PXE)

iLO Security

  • Change default iLO password immediately
  • Use TLS for iLO web interface and API
  • Restrict iLO network access (firewall, separate VLAN)
  • Disable iLO Virtual Media if not needed
  • Leave the iLO Security Override switch off; it bypasses iLO authentication and is intended only for password recovery

Firmware and Driver Resources

Required Firmware Versions

For optimal network boot support:

  • System ROM: v2.60 or later (latest recommended)
  • iLO 4 Firmware: v2.80 or later
  • NIC Firmware: Latest for specific FlexibleLOM/PCIe card

Check current versions: iLO web interface > Information > Firmware Information
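The same information is available programmatically; a hedged sketch against the Redfish API (property locations can differ slightly between iLO 4 firmware releases):

# iLO firmware version
curl -sk -u admin:password https://ilo-hostname/redfish/v1/Managers/1/ | jq -r '.FirmwareVersion'

# System ROM (BIOS) version
curl -sk -u admin:password https://ilo-hostname/redfish/v1/Systems/1/ | jq -r '.BiosVersion'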

Updating Firmware

Methods:

  1. HPE Service Pack for ProLiant (SPP): Comprehensive update bundle

    • Boot from SPP ISO (via iLO Virtual Media or USB)
    • Runs Smart Update Manager (SUM) in Linux environment
    • Updates all firmware, drivers, system ROM automatically
  2. iLO Web Interface: Individual component updates

    • System ROM: Administration > Firmware > Update Firmware
    • Upload .fwpkg or .bin files from HPE support site
  3. Online Flash Component: Linux Online ROM Flash utility

    • Install hp-firmware-* packages
    • Run updates while OS is running (requires reboot to apply)

Download Source: https://support.hpe.com/connect/s/product?language=en_US&kmpmoid=1010026910 (requires HPE Passport account, free registration)

Best Practices

  1. Use UEFI Mode: Better security, IPv6 support, larger disk support
  2. Enable HTTP Boot: Faster and more reliable than TFTP for large files
  3. Chainload iPXE: Flexibility of iPXE with standard PXE infrastructure
  4. Update Firmware: Keep System ROM and iLO current for bug fixes and features
  5. Isolate Boot Network: Use dedicated management VLAN for PXE/provisioning
  6. Test Failover: Configure multiple DHCP servers and boot mirrors for redundancy
  7. Document Configuration: Record BIOS settings, DHCP config, and boot infrastructure
  8. Monitor iLO Logs: Track boot failures and hardware issues via iLO event log

References

  • HPE ProLiant DL360 Gen9 Server User Guide
  • HPE UEFI System Utilities User Guide
  • iLO 4 User Guide (firmware version 2.80)
  • Intel PXE Specification v2.1
  • UEFI Specification v2.8 (HTTP Boot)
  • iPXE Documentation: https://ipxe.org/

Conclusion

The HP ProLiant DL360 Gen9 provides enterprise-grade network boot capabilities suitable for both traditional PXE deployments and modern UEFI HTTP boot scenarios. Its flexible configuration options, mature firmware support, and iLO integration make it an excellent platform for automated provisioning, diskless computing, and infrastructure-as-code workflows in home lab environments.

For home lab use, the recommended configuration is:

  • UEFI boot mode with Secure Boot disabled (unless required)
  • iPXE chainloading for flexibility and HTTP performance
  • iLO 4 configured for remote management and scripted provisioning
  • Latest firmware for stability and feature support

5 - Matchbox Analysis

Analysis of Matchbox network boot service capabilities and architecture

Matchbox Network Boot Analysis

This section contains a comprehensive analysis of Matchbox, a network boot service for provisioning bare-metal machines.

Overview

Matchbox is an HTTP and gRPC service developed by Poseidon that automates bare-metal machine provisioning through network booting. It matches machines to configuration profiles based on hardware attributes and serves boot configurations, kernel images, and provisioning configs.

Primary Repository: poseidon/matchbox
Documentation: https://matchbox.psdn.io/
License: Apache 2.0

Key Features

  • Network Boot Support: iPXE, PXELINUX, GRUB2 chainloading
  • OS Provisioning: Fedora CoreOS, Flatcar Linux, RHEL CoreOS
  • Configuration Management: Ignition v3.x configs, Butane transpilation
  • Machine Matching: Label-based matching (MAC, UUID, hostname, serial, custom)
  • API: Read-only HTTP API + authenticated gRPC API
  • Asset Serving: Local caching of OS images for faster deployment
  • Templating: Go template support for dynamic configuration

Use Cases

  1. Bare-metal Kubernetes clusters - Provision CoreOS nodes for k8s
  2. Lab/development environments - Quick PXE boot for testing
  3. Datacenter provisioning - Automate OS installation across fleets
  4. Immutable infrastructure - Declarative machine provisioning via Terraform

Analysis Contents

Quick Architecture

┌─────────────┐
│   Machine   │ PXE Boot
│  (BIOS/UEFI)│───┐
└─────────────┘   │
                  │
┌─────────────┐   │ DHCP/TFTP
│   dnsmasq   │◄──┘ (chainload to iPXE)
│  DHCP+TFTP  │
└─────────────┘
       │
       │ HTTP
       ▼
┌─────────────────────────┐
│      Matchbox           │
│  ┌──────────────────┐   │
│  │  HTTP Endpoints  │   │ /boot.ipxe, /ignition
│  └──────────────────┘   │
│  ┌──────────────────┐   │
│  │   gRPC API       │   │ Terraform provider
│  └──────────────────┘   │
│  ┌──────────────────┐   │
│  │ Profile/Group    │   │ Match machines
│  │   Matcher        │   │ to configs
│  └──────────────────┘   │
└─────────────────────────┘

Technology Stack

  • Language: Go
  • Config Formats: Ignition JSON, Butane YAML
  • Boot Protocols: PXE, iPXE, GRUB2
  • APIs: HTTP (read-only), gRPC (authenticated)
  • Deployment: Binary, container (Podman/Docker), Kubernetes

Integration Points

  • Terraform: terraform-provider-matchbox for declarative provisioning
  • Ignition/Butane: CoreOS provisioning configs
  • dnsmasq: Reference DHCP/TFTP/DNS implementation (quay.io/poseidon/dnsmasq)
  • Asset sources: Can serve local or remote (HTTPS) OS images

5.1 - Configuration Model

Analysis of Matchbox’s profile, group, and templating system

Matchbox Configuration Model

Matchbox uses a flexible configuration model based on Profiles (what to provision) and Groups (which machines get which profile), with support for templating and metadata.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    Matchbox Store                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐            │
│  │  Profiles  │  │   Groups   │  │   Assets   │            │
│  └────────────┘  └────────────┘  └────────────┘            │
│        │               │                                    │
│        │               │                                    │
│        ▼               ▼                                    │
│  ┌─────────────────────────────────────┐                   │
│  │       Matcher Engine                │                   │
│  │  (Label-based group selection)      │                   │
│  └─────────────────────────────────────┘                   │
│                    │                                        │
│                    ▼                                        │
│  ┌─────────────────────────────────────┐                   │
│  │    Template Renderer                │                   │
│  │  (Go templates + metadata)          │                   │
│  └─────────────────────────────────────┘                   │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
            Rendered Config (iPXE, Ignition, etc.)

Data Directory Structure

Matchbox uses a FileStore (default) that reads from -data-path (default: /var/lib/matchbox):

/var/lib/matchbox/
├── groups/              # Machine group definitions (JSON)
│   ├── default.json
│   ├── node1.json
│   └── us-west.json
├── profiles/            # Profile definitions (JSON)
│   ├── worker.json
│   ├── controller.json
│   └── etcd.json
├── ignition/            # Ignition configs (.ign) or Butane (.yaml)
│   ├── worker.ign
│   ├── controller.ign
│   └── butane-example.yaml
├── cloud/               # Cloud-Config templates (DEPRECATED)
│   └── legacy.yaml.tmpl
├── generic/             # Arbitrary config templates
│   ├── setup.cfg
│   └── metadata.yaml.tmpl
└── assets/              # Static files (kernel, initrd)
    ├── fedora-coreos/
    └── flatcar/

Version control: Poseidon recommends keeping /var/lib/matchbox under git for auditability and rollback.

Profiles

Profiles define what to provision: network boot settings (kernel, initrd, args) and config references (Ignition, Cloud-Config, generic).

Profile Schema

{
  "id": "worker",
  "name": "Fedora CoreOS Worker Node",
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox.example.com:8080/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  },
  "ignition_id": "worker.ign",
  "cloud_id": "",
  "generic_id": ""
}

Profile Fields

Field       | Type   | Required | Description
id          | string |          | Unique profile identifier (referenced by groups)
name        | string |          | Human-readable description
boot        | object |          | Network boot configuration
boot.kernel | string |          | Kernel URL (HTTP/HTTPS or /assets path)
boot.initrd | array  |          | Initrd URLs (can specify --name for multi-initrd)
boot.args   | array  |          | Kernel command-line arguments
ignition_id | string |          | Ignition/Butane config filename in ignition/
cloud_id    | string |          | Cloud-Config filename in cloud/ (deprecated)
generic_id  | string |          | Generic config filename in generic/

Boot Configuration Patterns

Pattern 1: Live PXE (RAM-based, ephemeral)

Boot and run OS entirely from RAM, no disk install:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
      "ignition.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  }
}

Use case: Diskless workers, testing, ephemeral compute

Pattern 2: Disk Install (persistent)

PXE boot live image, install to disk, reboot to disk:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  }
}

Key difference: coreos.inst.install_dev triggers disk install before reboot

Pattern 3: Multi-initrd (layered)

Multiple initrds can be loaded (e.g., base + drivers):

{
  "initrd": [
    "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img",
    "--name drivers /assets/drivers/custom-drivers.img"
  ],
  "args": [
    "initrd=main,drivers",
    "..."
  ]
}

Config References

Ignition Configs

Direct Ignition (.ign files):

{
  "ignition_id": "worker.ign"
}

File: /var/lib/matchbox/ignition/worker.ign

{
  "ignition": { "version": "3.3.0" },
  "systemd": {
    "units": [{
      "name": "example.service",
      "enabled": true,
      "contents": "[Service]\nType=oneshot\nExecStart=/usr/bin/echo Hello\n\n[Install]\nWantedBy=multi-user.target"
    }]
  }
}

Butane Configs (transpiled to Ignition):

{
  "ignition_id": "worker.yaml"
}

File: /var/lib/matchbox/ignition/worker.yaml

variant: fcos
version: 1.5.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...
systemd:
  units:
    - name: etcd.service
      enabled: true

Matchbox automatically:

  1. Detects Butane format (file doesn’t end in .ign or .ignition)
  2. Transpiles Butane → Ignition using embedded library
  3. Renders templates with group metadata
  4. Serves as Ignition v3.3.0
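Butane files can also be transpiled and validated locally before Matchbox ever serves them; a brief sketch using the butane CLI, noting that files containing Go template directives (e.g. {{.node_name}}) must be rendered or have those directives substituted first:

# Transpile and strictly validate a Butane file without template directives
butane --pretty --strict < butane-example.yaml > /tmp/example.ign

# Sanity-check that the output is well-formed JSON
jq . /tmp/example.ign > /dev/null && echo "valid Ignition JSON"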

Generic Configs

For non-Ignition configs (scripts, YAML, arbitrary data):

{
  "generic_id": "setup-script.sh.tmpl"
}

File: /var/lib/matchbox/generic/setup-script.sh.tmpl

#!/bin/bash
# Rendered with group metadata
NODE_NAME={{.node_name}}
CLUSTER_ID={{.cluster_id}}
echo "Provisioning ${NODE_NAME} in cluster ${CLUSTER_ID}"

Access via: GET /generic?uuid=...&mac=...

Groups

Groups match machines to profiles using selectors (label matching) and provide metadata for template rendering.

Group Schema

{
  "id": "node1-worker",
  "name": "Worker Node 1",
  "profile": "worker",
  "selector": {
    "mac": "52:54:00:89:d8:10",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  },
  "metadata": {
    "node_name": "worker-01",
    "cluster_id": "prod-cluster",
    "etcd_endpoints": "https://10.0.1.10:2379,https://10.0.1.11:2379",
    "ssh_authorized_keys": [
      "ssh-ed25519 AAAA...",
      "ssh-rsa AAAA..."
    ]
  }
}

Group Fields

Field    | Type   | Required | Description
id       | string |          | Unique group identifier
name     | string |          | Human-readable description
profile  | string |          | Profile ID to apply
selector | object |          | Label match criteria (omit for default group)
metadata | object |          | Key-value data for template rendering

Selector Matching

Reserved selectors (automatically populated from machine attributes):

Selector | Source           | Example                              | Normalized
uuid     | SMBIOS UUID      | 550e8400-e29b-41d4-a716-446655440000 | Lowercase
mac      | Primary NIC MAC  | 52:54:00:89:d8:10                    | Colon-separated
hostname | Network hostname | node1.example.com                    | As reported
serial   | Hardware serial  | VMware-42 1a...                      | As reported
Custom selectors (passed as query params):

{
  "selector": {
    "region": "us-west",
    "environment": "production",
    "rack": "A23"
  }
}

Matching request: /ipxe?mac=52:54:00:89:d8:10&region=us-west&environment=production&rack=A23

Matching logic:

  1. All selector key-value pairs must match request labels (AND logic)
  2. Most specific group wins (most selector matches)
  3. If multiple groups match with the same specificity, the winner is effectively arbitrary (iteration order is undefined)
  4. Groups with no selectors = default group (matches all)

Default Groups

Group with empty selector matches all machines:

{
  "id": "default-worker",
  "name": "Default Worker",
  "profile": "worker",
  "metadata": {
    "environment": "dev"
  }
}

⚠️ Warning: Avoid multiple default groups (non-deterministic matching)

Example: Region-based Matching

Group 1: US-West Workers

{
  "id": "us-west-workers",
  "profile": "worker",
  "selector": {
    "region": "us-west"
  },
  "metadata": {
    "etcd_endpoints": "https://etcd-usw.example.com:2379"
  }
}

Group 2: EU Workers

{
  "id": "eu-workers",
  "profile": "worker",
  "selector": {
    "region": "eu"
  },
  "metadata": {
    "etcd_endpoints": "https://etcd-eu.example.com:2379"
  }
}

Group 3: Specific Machine Override

{
  "id": "node-special",
  "profile": "controller",
  "selector": {
    "mac": "52:54:00:89:d8:10",
    "region": "us-west"
  },
  "metadata": {
    "role": "controller"
  }
}

Matching precedence:

  • Machine with mac=52:54:00:89:d8:10&region=us-westnode-special (2 selectors)
  • Machine with region=us-westus-west-workers (1 selector)
  • Machine with region=eueu-workers (1 selector)

Templating System

Matchbox uses Go’s text/template for rendering configs with group metadata.

Template Context

Available variables in Ignition/Butane/Cloud-Config/generic templates:

// Group metadata (all keys from group.metadata)
{{.node_name}}
{{.cluster_id}}
{{.etcd_endpoints}}

// Group selectors (normalized)
{{.mac}}      // e.g., "52:54:00:89:d8:10"
{{.uuid}}     // e.g., "550e8400-..."
{{.region}}   // Custom selector

// Request query params (raw)
{{.request.query.mac}}     // As passed in URL
{{.request.query.foo}}     // Custom query param
{{.request.raw_query}}     // Full query string

// Special functions
{{if index . "ssh_authorized_keys"}}  // Check if key exists
{{range $element := .ssh_authorized_keys}}  // Iterate arrays

Example: Templated Butane Config

Group metadata:

{
  "metadata": {
    "node_name": "worker-01",
    "ssh_authorized_keys": [
      "ssh-ed25519 AAA...",
      "ssh-rsa BBB..."
    ],
    "ntp_servers": ["time1.google.com", "time2.google.com"]
  }
}

Butane template: /var/lib/matchbox/ignition/worker.yaml

variant: fcos
version: 1.5.0

storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: {{.node_name}}

    - path: /etc/systemd/timesyncd.conf
      mode: 0644
      contents:
        inline: |
          [Time]
          {{range $server := .ntp_servers}}
          NTP={{$server}}
          {{end}}

{{if index . "ssh_authorized_keys"}}
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        {{range $key := .ssh_authorized_keys}}
        - {{$key}}
        {{end}}
{{end}}

Rendered Ignition (simplified):

{
  "ignition": {"version": "3.3.0"},
  "storage": {
    "files": [
      {
        "path": "/etc/hostname",
        "contents": {"source": "data:,worker-01"},
        "mode": 420
      },
      {
        "path": "/etc/systemd/timesyncd.conf",
        "contents": {"source": "data:,%5BTime%5D%0ANTP%3Dtime1.google.com%0ANTP%3Dtime2.google.com"},
        "mode": 420
      }
    ]
  },
  "passwd": {
    "users": [{
      "name": "core",
      "sshAuthorizedKeys": ["ssh-ed25519 AAA...", "ssh-rsa BBB..."]
    }]
  }
}

Template Best Practices

  1. Prefer external rendering: Use Terraform + ct_config provider for complex templates
  2. Validate Butane: Use strict: true in Terraform or fcct --strict
  3. Escape carefully: Go templates use {{}}, Butane uses YAML - mind the interaction
  4. Test rendering: Request /ignition?mac=... directly to inspect output
  5. Version control: Keep templates + groups in git for auditability

Reserved Metadata Keys

Warning: .request is reserved for query param access. Group metadata with "request": {...} will be overwritten.

Reserved keys:

  • request.query.* - Query parameters
  • request.raw_query - Raw query string

API Integration

HTTP Endpoints (Read-only)

Endpoint  | Purpose                   | Template Context
/ipxe     | iPXE boot script          | Profile boot section
/grub     | GRUB config               | Profile boot section
/ignition | Ignition config           | Group metadata + selectors + query
/cloud    | Cloud-Config (deprecated) | Group metadata + selectors + query
/generic  | Generic config            | Group metadata + selectors + query
/metadata | Key-value env format      | Group metadata + selectors + query

Example metadata endpoint response:

GET /metadata?mac=52:54:00:89:d8:10&foo=bar

NODE_NAME=worker-01
CLUSTER_ID=prod
MAC=52:54:00:89:d8:10
REQUEST_QUERY_MAC=52:54:00:89:d8:10
REQUEST_QUERY_FOO=bar
REQUEST_RAW_QUERY=mac=52:54:00:89:d8:10&foo=bar
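
Since the response is plain KEY=value lines, a boot-time script can pull individual values into its environment. A small sketch (host and MAC are placeholders):

# Fetch metadata for this machine and export selected keys
curl -fsS 'http://matchbox.example.com:8080/metadata?mac=52:54:00:89:d8:10' -o /run/matchbox-metadata
export NODE_NAME="$(grep '^NODE_NAME=' /run/matchbox-metadata | cut -d= -f2-)"
export CLUSTER_ID="$(grep '^CLUSTER_ID=' /run/matchbox-metadata | cut -d= -f2-)"
echo "Provisioning ${NODE_NAME} in cluster ${CLUSTER_ID}"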

gRPC API (Authenticated, mutable)

Used by terraform-provider-matchbox for declarative infrastructure:

Terraform example:

provider "matchbox" {
  endpoint    = "matchbox.example.com:8081"
  client_cert = file("~/.matchbox/client.crt")
  client_key  = file("~/.matchbox/client.key")
  ca          = file("~/.matchbox/ca.crt")
}

resource "matchbox_profile" "worker" {
  name   = "worker"
  kernel = "/assets/fedora-coreos/.../kernel"
  initrd = ["--name main /assets/fedora-coreos/.../initramfs.img"]
  args   = [
    "initrd=main",
    "coreos.inst.install_dev=/dev/sda",
    "coreos.inst.ignition_url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}"
  ]
  raw_ignition = data.ct_config.worker.rendered
}

resource "matchbox_group" "node1" {
  name    = "node1"
  profile = matchbox_profile.worker.name
  selector = {
    mac = "52:54:00:89:d8:10"
  }
  metadata = {
    node_name = "worker-01"
  }
}

Operations:

  • CreateProfile, GetProfile, UpdateProfile, DeleteProfile
  • CreateGroup, GetGroup, UpdateGroup, DeleteGroup

TLS client authentication required (see deployment docs)

Configuration Workflow

┌─────────────────────────────────────────────────────────────┐
│ 1. Write Butane configs (YAML)                             │
│    - worker.yaml, controller.yaml                          │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Terraform ct_config transpiles Butane → Ignition        │
│    data "ct_config" "worker" {                             │
│      content = file("worker.yaml")                         │
│      strict  = true                                        │
│    }                                                        │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Terraform creates profiles + groups in Matchbox         │
│    matchbox_profile.worker → gRPC CreateProfile()          │
│    matchbox_group.node1 → gRPC CreateGroup()               │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Machine PXE boots, queries Matchbox                     │
│    GET /ipxe?mac=... → matches group → returns profile     │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Ignition fetches rendered config                        │
│    GET /ignition?mac=... → Matchbox returns Ignition       │
└─────────────────────────────────────────────────────────────┘

Benefits:

  • Rich Terraform templating (loops, conditionals, external data sources)
  • Butane validation before deployment
  • Declarative infrastructure (can terraform plan before apply)
  • Version control workflow (git + CI/CD)
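
Operationally, this is the standard Terraform cycle run from the directory holding the provider and resources shown above (a sketch; no Matchbox-specific commands are involved):

# Preview profile/group changes against the Matchbox gRPC API, then apply them
terraform init
terraform plan -out=matchbox.plan
terraform apply matchbox.plan

# After editing a Butane file, plan again; only affected profiles/groups change
terraform plan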

Alternative: Manual FileStore

┌─────────────────────────────────────────────────────────────┐
│ 1. Create profile JSON manually                            │
│    /var/lib/matchbox/profiles/worker.json                  │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Create group JSON manually                              │
│    /var/lib/matchbox/groups/node1.json                     │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Write Ignition/Butane config                            │
│    /var/lib/matchbox/ignition/worker.ign                   │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Restart matchbox (to reload FileStore)                  │
│    systemctl restart matchbox                              │
└─────────────────────────────────────────────────────────────┘
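
A condensed sketch of steps 1-4 on a systemd host. The profile below is intentionally stripped down to an id and an ignition_id (real profiles also carry the boot section shown earlier); IDs, MAC, and file names are placeholders:

# 1-2. Create a profile and a group selecting one machine by MAC
cat > /var/lib/matchbox/profiles/worker.json <<'EOF'
{"id": "worker", "ignition_id": "worker.ign"}
EOF
cat > /var/lib/matchbox/groups/node1.json <<'EOF'
{"id": "node1", "profile": "worker", "selector": {"mac": "52:54:00:89:d8:10"}}
EOF

# 3. Drop the Ignition config the profile references
cp worker.ign /var/lib/matchbox/ignition/worker.ign

# 4. Reload the FileStore
sudo systemctl restart matchbox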

Drawbacks:

  • Manual file management
  • No validation before deployment
  • Requires matchbox restart to pick up changes
  • Error-prone for large fleets

Storage Backends

FileStore (Default)

Config: -data-path=/var/lib/matchbox

Pros:

  • Simple file-based storage
  • Easy to version control (git)
  • Human-readable JSON

Cons:

  • Requires file system access
  • Manual reload for gRPC-created resources

Custom Store (Extensible)

Matchbox’s Store interface allows custom backends:

type Store interface {
  ProfileGet(id string) (*Profile, error)
  GroupGet(id string) (*Group, error)
  IgnitionGet(name string) (string, error)
  // ... other methods
}

Potential custom stores:

  • etcd backend (for HA Matchbox)
  • Database backend (PostgreSQL, MySQL)
  • S3/object storage backend

Note: Not officially provided by Matchbox project; requires custom implementation

Security Considerations

  1. gRPC API authentication: Requires TLS client certificates

    • ca.crt - CA that signed client certs
    • server.crt/server.key - Server TLS identity
    • client.crt/client.key - Client credentials (Terraform)
  2. HTTP endpoints are read-only: No auth, machines fetch configs

    • Do NOT put secrets in Ignition configs
    • Use external secret stores (Vault, GCP Secret Manager)
    • Reference secrets via Ignition files.source with auth headers
  3. Network segmentation: Matchbox on provisioning VLAN, isolate from production

  4. Config validation: Validate Ignition/Butane before deployment to avoid boot failures

  5. Audit logging: Version control groups/profiles; log gRPC API changes

Operational Tips

  1. Test groups with curl:

    curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
    
  2. List profiles:

    ls -la /var/lib/matchbox/profiles/
    
  3. Validate Butane:

    podman run -i --rm quay.io/coreos/fcct:release --strict < worker.yaml
    
  4. Check group matching:

    # Default group (no selectors)
    curl http://matchbox.example.com:8080/ignition
    
    # Specific machine
    curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&uuid=550e8400-e29b-41d4-a716-446655440000'
    
  5. Backup configs:

    tar -czf matchbox-backup-$(date +%F).tar.gz /var/lib/matchbox/{groups,profiles,ignition}
    

Summary

Matchbox’s configuration model provides:

  • Separation of concerns: Profiles (what) vs Groups (who/where)
  • Flexible matching: Label-based, multi-attribute, custom selectors
  • Template support: Go templates for dynamic configs (but prefer external rendering)
  • API-driven: Terraform integration for GitOps workflows
  • Storage options: FileStore (simple) or custom backends (extensible)
  • OS-agnostic: Works with any Ignition-based distro (FCOS, Flatcar, RHCOS)

Best practice: Use Terraform + external Butane configs for production; manual FileStore for labs/development.

5.2 - Deployment Patterns

Matchbox deployment options and operational considerations

Matchbox Deployment Patterns

Analysis of deployment architectures, installation methods, and operational considerations for running Matchbox in production.

Deployment Architectures

Single-Host Deployment

┌─────────────────────────────────────────────────────┐
│           Provisioning Host                         │
│  ┌─────────────┐        ┌─────────────┐            │
│  │  Matchbox   │        │  dnsmasq    │            │
│  │  :8080 HTTP │        │  DHCP/TFTP  │            │
│  │  :8081 gRPC │        │  :67,:69    │            │
│  └─────────────┘        └─────────────┘            │
│         │                      │                    │
│         └──────────┬───────────┘                    │
│                    │                                │
│  /var/lib/matchbox/                                 │
│  ├── groups/                                        │
│  ├── profiles/                                      │
│  ├── ignition/                                      │
│  └── assets/                                        │
└─────────────────────────────────────────────────────┘
              │
              │ Network
              ▼
     ┌──────────────┐
     │ PXE Clients  │
     └──────────────┘

Use case: Lab, development, small deployments (<50 machines)

Pros:

  • Simple setup
  • Single service to manage
  • Minimal resource requirements

Cons:

  • Single point of failure
  • No scalability
  • Downtime during updates

HA Deployment (Multiple Matchbox Instances)

┌─────────────────────────────────────────────────────┐
│              Load Balancer (Ingress/HAProxy)        │
│           :8080 HTTP        :8081 gRPC              │
└─────────────────────────────────────────────────────┘
       │                              │
       ├─────────────┬────────────────┤
       ▼             ▼                ▼
┌──────────┐  ┌──────────┐    ┌──────────┐
│Matchbox 1│  │Matchbox 2│    │Matchbox N│
│ (Pod/VM) │  │ (Pod/VM) │    │ (Pod/VM) │
└──────────┘  └──────────┘    └──────────┘
       │             │                │
       └─────────────┴────────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Shared Storage        │
         │  /var/lib/matchbox     │
         │  (NFS, PV, ConfigMap)  │
         └────────────────────────┘

Use case: Production, datacenter-scale (100+ machines)

Pros:

  • High availability (no single point of failure)
  • Rolling updates (zero downtime)
  • Load distribution

Cons:

  • Complex storage (shared volume or etcd backend)
  • More infrastructure required

Storage options:

  1. Kubernetes PersistentVolume (RWX mode)
  2. NFS share mounted on multiple hosts
  3. Custom etcd-backed Store (requires custom implementation)
  4. Git-sync sidecar (read-only, periodic pull)
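
For the NFS option, every Matchbox host mounts the same export at the data path. A sketch (server address and export path are assumptions):

# Mount a shared NFS export as the Matchbox data directory on each instance
sudo mkdir -p /var/lib/matchbox
sudo mount -t nfs nfs.example.com:/exports/matchbox /var/lib/matchbox

# Persist the mount across reboots
echo 'nfs.example.com:/exports/matchbox /var/lib/matchbox nfs defaults,_netdev 0 0' | sudo tee -a /etc/fstab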

Kubernetes Deployment

┌─────────────────────────────────────────────────────┐
│              Ingress Controller                     │
│  matchbox.example.com → Service matchbox:8080       │
│  matchbox-rpc.example.com → Service matchbox:8081   │
└─────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│          Service: matchbox (ClusterIP)              │
│            ports: 8080/TCP, 8081/TCP                │
└─────────────────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│  Pod: matchbox  │     │  Pod: matchbox  │
│  replicas: 2+   │     │  replicas: 2+   │
└─────────────────┘     └─────────────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
┌─────────────────────────────────────────────────────┐
│    PersistentVolumeClaim: matchbox-data             │
│    /var/lib/matchbox (RWX mode)                     │
└─────────────────────────────────────────────────────┘

Manifest structure:

contrib/k8s/
├── matchbox-deployment.yaml  # Deployment + replicas
├── matchbox-service.yaml     # Service (8080, 8081)
├── matchbox-ingress.yaml     # Ingress (HTTP + gRPC TLS)
└── matchbox-pvc.yaml         # PersistentVolumeClaim

Key configurations:

  1. Secret for gRPC TLS:

    kubectl create secret generic matchbox-rpc \
      --from-file=ca.crt \
      --from-file=server.crt \
      --from-file=server.key
    
  2. Ingress for gRPC (TLS passthrough):

    metadata:
      annotations:
        nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    
  3. Volume mount:

    volumes:
      - name: data
        persistentVolumeClaim:
          claimName: matchbox-data
    volumeMounts:
      - name: data
        mountPath: /var/lib/matchbox
    

Use case: Cloud-native deployments, Kubernetes-based infrastructure

Pros:

  • Native Kubernetes primitives (Deployments, Services, Ingress)
  • Rolling updates via Deployment strategy
  • Easy scaling (kubectl scale)
  • Health checks + auto-restart

Cons:

  • Requires RWX PersistentVolume or shared storage
  • Ingress TLS configuration complexity (gRPC passthrough)
  • Cluster dependency (can’t provision cluster bootstrap nodes)

⚠️ Bootstrap problem: Kubernetes-hosted Matchbox can’t PXE boot its own cluster nodes (chicken-and-egg). Use external Matchbox for initial cluster bootstrap, then migrate.

Installation Methods

1. Binary Installation (systemd)

Recommended for: Bare-metal hosts, VMs, traditional Linux servers

Steps:

  1. Download and verify:

    wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz
    wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz.asc
    gpg --verify matchbox-v0.10.0-linux-amd64.tar.gz.asc
    
  2. Extract and install:

    tar xzf matchbox-v0.10.0-linux-amd64.tar.gz
    sudo cp matchbox-v0.10.0-linux-amd64/matchbox /usr/local/bin/
    
  3. Create user and directories:

    sudo useradd -U matchbox
    sudo mkdir -p /var/lib/matchbox/{assets,groups,profiles,ignition}
    sudo chown -R matchbox:matchbox /var/lib/matchbox
    
  4. Install systemd unit:

    sudo cp contrib/systemd/matchbox.service /etc/systemd/system/
    
  5. Configure via systemd dropin:

    sudo systemctl edit matchbox
    
    [Service]
    Environment="MATCHBOX_ADDRESS=0.0.0.0:8080"
    Environment="MATCHBOX_RPC_ADDRESS=0.0.0.0:8081"
    Environment="MATCHBOX_LOG_LEVEL=debug"
    
  6. Start service:

    sudo systemctl daemon-reload
    sudo systemctl start matchbox
    sudo systemctl enable matchbox
    

Pros:

  • Direct control over service
  • Easy log access (journalctl -u matchbox)
  • Native OS integration

Cons:

  • Manual updates required
  • OS dependency (package compatibility)

2. Container Deployment (Docker/Podman)

Recommended for: Docker hosts, quick testing, immutable infrastructure

Docker:

mkdir -p /var/lib/matchbox/assets
docker run -d --name matchbox \
  --net=host \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug

Podman:

podman run -d --name matchbox \
  --net=host \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug

Volume mounts:

  • /var/lib/matchbox - Data directory (groups, profiles, configs, assets)
  • /etc/matchbox - TLS certificates (ca.crt, server.crt, server.key)

Network mode:

  • --net=host - Required for DHCP/TFTP interaction on same host
  • Bridge mode possible if Matchbox is on separate host from dnsmasq
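
For the bridge-mode case, publishing the two API ports is sufficient; a sketch mirroring the host-network example above:

# Bridge mode: publish only the HTTP and gRPC ports instead of sharing the host network
docker run -d --name matchbox \
  -p 8080:8080 -p 8081:8081 \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug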

Pros:

  • Immutable deployments
  • Easy updates (pull new image)
  • Portable across hosts

Cons:

  • Volume management complexity
  • SELinux considerations (:Z flag)

3. Kubernetes Deployment

Recommended for: Kubernetes environments, cloud platforms

Quick start:

# Create TLS secret for gRPC
kubectl create secret generic matchbox-rpc \
  --from-file=ca.crt=~/.matchbox/ca.crt \
  --from-file=server.crt=~/.matchbox/server.crt \
  --from-file=server.key=~/.matchbox/server.key

# Deploy manifests
kubectl apply -R -f contrib/k8s/

# Check status
kubectl get pods -l app=matchbox
kubectl get svc matchbox
kubectl get ingress matchbox matchbox-rpc

Persistence options:

Option 1: emptyDir (ephemeral, dev only):

volumes:
  - name: data
    emptyDir: {}

Option 2: PersistentVolumeClaim (production):

volumes:
  - name: data
    persistentVolumeClaim:
      claimName: matchbox-data

Option 3: ConfigMap (static configs):

volumes:
  - name: groups
    configMap:
      name: matchbox-groups
  - name: profiles
    configMap:
      name: matchbox-profiles

Option 4: Git-sync sidecar (GitOps):

initContainers:
  - name: git-sync
    image: k8s.gcr.io/git-sync:v3.6.3
    env:
      - name: GIT_SYNC_REPO
        value: https://github.com/example/matchbox-configs
      - name: GIT_SYNC_DEST
        value: /var/lib/matchbox
    volumeMounts:
      - name: data
        mountPath: /var/lib/matchbox

Pros:

  • Native k8s features (scaling, health checks, rolling updates)
  • Ingress integration
  • GitOps workflows

Cons:

  • Complexity (Ingress, PVC, TLS)
  • Can’t bootstrap own cluster

Network Boot Environment Setup

Matchbox requires separate DHCP/TFTP/DNS services. Options:

Option 1: dnsmasq Container (Quickest)

Use case: Lab, testing, environments without existing DHCP

Full DHCP + TFTP + DNS:

docker run -d --name dnsmasq \
  --cap-add=NET_ADMIN \
  --net=host \
  quay.io/poseidon/dnsmasq:latest \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254,30m \
  --enable-tftp \
  --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries \
  --log-dhcp

Proxy DHCP (alongside existing DHCP):

docker run -d --name dnsmasq \
  --cap-add=NET_ADMIN \
  --net=host \
  quay.io/poseidon/dnsmasq:latest \
  -d -q \
  --dhcp-range=192.168.1.1,proxy,255.255.255.0 \
  --enable-tftp \
  --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
  --pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
  --log-queries \
  --log-dhcp

Included files: undionly.kpxe, ipxe.efi, grub.efi (bundled in image)
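
To confirm the bundled bootloaders are present, list the TFTP root inside the running container (container name as in the example above):

# The image ships undionly.kpxe, ipxe.efi, and grub.efi under the TFTP root
docker exec dnsmasq ls -la /var/lib/tftpboot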

Option 2: Existing DHCP/TFTP Infrastructure

Use case: Enterprise environments with network admin policies

Required DHCP options (ISC DHCP example):

subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.10 192.168.1.250;
  
  # BIOS clients
  if option architecture-type = 00:00 {
    filename "undionly.kpxe";
  }
  # UEFI clients
  elsif option architecture-type = 00:09 {
    filename "ipxe.efi";
  }
  # iPXE clients
  elsif exists user-class and option user-class = "iPXE" {
    filename "http://matchbox.example.com:8080/boot.ipxe";
  }
  
  next-server 192.168.1.100;  # TFTP server IP
}

TFTP files (place in the TFTP root): undionly.kpxe, ipxe.efi, and grub.efi (the same bootloaders bundled in the dnsmasq image above)

Option 3: iPXE-only (No PXE Chainload)

Use case: Modern hardware with native iPXE firmware

DHCP config (simpler):

filename "http://matchbox.example.com:8080/boot.ipxe";

No TFTP server needed (iPXE fetches directly via HTTP)

Limitation: Doesn’t support legacy BIOS with basic PXE ROM

TLS Certificate Setup

gRPC API requires TLS client certificates for authentication.

Option 1: Provided cert-gen Script

cd scripts/tls
export SAN=DNS.1:matchbox.example.com,IP.1:192.168.1.100
./cert-gen

Generates:

  • ca.crt - Self-signed CA
  • server.crt, server.key - Server credentials
  • client.crt, client.key - Client credentials (for Terraform)

Install server certs:

sudo mkdir -p /etc/matchbox
sudo cp ca.crt server.crt server.key /etc/matchbox/
sudo chown -R matchbox:matchbox /etc/matchbox

Save client certs for Terraform:

mkdir -p ~/.matchbox
cp client.crt client.key ca.crt ~/.matchbox/

Option 2: Corporate PKI

Preferred for production: Use organization’s certificate authority

Requirements:

  • Server cert with SAN: DNS:matchbox.example.com
  • Client cert issued by same CA
  • CA cert for validation

Matchbox flags:

-ca-file=/etc/matchbox/ca.crt
-cert-file=/etc/matchbox/server.crt
-key-file=/etc/matchbox/server.key

Terraform provider config:

provider "matchbox" {
  endpoint    = "matchbox.example.com:8081"
  client_cert = file("/path/to/client.crt")
  client_key  = file("/path/to/client.key")
  ca          = file("/path/to/ca.crt")
}

Option 3: Let’s Encrypt (HTTP API only)

Note: gRPC requires client cert auth (incompatible with Let’s Encrypt)

Use case: TLS for HTTP endpoints only (read-only API)

Matchbox flags:

-web-ssl=true
-web-cert-file=/etc/letsencrypt/live/matchbox.example.com/fullchain.pem
-web-key-file=/etc/letsencrypt/live/matchbox.example.com/privkey.pem

Limitation: Still need self-signed certs for gRPC API

Configuration Flags

Core Flags

Flag         | Default                  | Description
-address     | 127.0.0.1:8080           | HTTP API listen address
-rpc-address | (empty)                  | gRPC API listen address (empty = disabled)
-data-path   | /var/lib/matchbox        | Data directory (FileStore)
-assets-path | /var/lib/matchbox/assets | Static assets directory
-log-level   | info                     | Logging level (debug, info, warn, error)

TLS Flags (gRPC)

Flag       | Default                  | Description
-ca-file   | /etc/matchbox/ca.crt     | CA certificate for client verification
-cert-file | /etc/matchbox/server.crt | Server TLS certificate
-key-file  | /etc/matchbox/server.key | Server TLS private key

TLS Flags (HTTP, optional)

Flag           | Default | Description
-web-ssl       | false   | Enable TLS for HTTP API
-web-cert-file | (empty) | HTTP server TLS certificate
-web-key-file  | (empty) | HTTP server TLS private key

Environment Variables

All flags can be set via environment variables with MATCHBOX_ prefix:

export MATCHBOX_ADDRESS=0.0.0.0:8080
export MATCHBOX_RPC_ADDRESS=0.0.0.0:8081
export MATCHBOX_LOG_LEVEL=debug
export MATCHBOX_DATA_PATH=/custom/path

Operational Considerations

Firewall Configuration

Matchbox host:

firewall-cmd --permanent --add-port=8080/tcp  # HTTP API
firewall-cmd --permanent --add-port=8081/tcp  # gRPC API
firewall-cmd --reload

dnsmasq host (if separate):

firewall-cmd --permanent --add-service=dhcp
firewall-cmd --permanent --add-service=tftp
firewall-cmd --permanent --add-service=dns  # optional
firewall-cmd --reload

Monitoring

Health check endpoints:

# HTTP API
curl http://matchbox.example.com:8080
# Should return: matchbox

# gRPC API
openssl s_client -connect matchbox.example.com:8081 \
  -CAfile ~/.matchbox/ca.crt \
  -cert ~/.matchbox/client.crt \
  -key ~/.matchbox/client.key

Prometheus metrics: Not built-in; consider adding reverse proxy (e.g., nginx) with metrics exporter

Logs (systemd):

journalctl -u matchbox -f

Logs (container):

docker logs -f matchbox

Backup Strategy

What to backup:

  1. /var/lib/matchbox/{groups,profiles,ignition} - Configs
  2. /etc/matchbox/*.{crt,key} - TLS certificates
  3. Terraform state (if using Terraform provider)

Backup command:

tar -czf matchbox-backup-$(date +%F).tar.gz \
  /var/lib/matchbox/{groups,profiles,ignition} \
  /etc/matchbox

Restore:

tar -xzf matchbox-backup-YYYY-MM-DD.tar.gz -C /
sudo chown -R matchbox:matchbox /var/lib/matchbox
sudo systemctl restart matchbox

GitOps approach: Store configs in git repository for versioning and auditability
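
One simple way to get that GitOps behavior is to version the FileStore directories directly; a sketch (remote URL and branch name are placeholders; large assets and TLS material stay out of the repo):

cd /var/lib/matchbox
git init
echo 'assets/' > .gitignore          # kernel/initrd images are too large to track
git add .gitignore groups profiles ignition
git commit -m "Snapshot matchbox configs"
git remote add origin git@git.example.com:infra/matchbox-configs.git
git push -u origin main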

Updates

Binary deployment:

# Download new version
wget https://github.com/poseidon/matchbox/releases/download/vX.Y.Z/matchbox-vX.Y.Z-linux-amd64.tar.gz
tar xzf matchbox-vX.Y.Z-linux-amd64.tar.gz

# Replace binary
sudo systemctl stop matchbox
sudo cp matchbox-vX.Y.Z-linux-amd64/matchbox /usr/local/bin/
sudo systemctl start matchbox

Container deployment:

docker pull quay.io/poseidon/matchbox:vX.Y.Z
docker stop matchbox
docker rm matchbox
docker run -d --name matchbox ... quay.io/poseidon/matchbox:vX.Y.Z ...

Kubernetes deployment:

kubectl set image deployment/matchbox matchbox=quay.io/poseidon/matchbox:vX.Y.Z
kubectl rollout status deployment/matchbox

Scaling Considerations

Vertical scaling (single instance):

  • CPU: Minimal (config rendering is lightweight)
  • Memory: ~50MB base + asset cache
  • Disk: Depends on cached assets (100MB - 10GB+)

Horizontal scaling (multiple instances):

  • Stateless HTTP API (load balance round-robin)
  • Shared storage required (RWX PV, NFS, or custom backend)
  • gRPC API can be load-balanced with gRPC-aware LB

Asset serving optimization:

  • Use CDN or cache proxy for remote assets
  • Local asset caching for <100 machines
  • Dedicated HTTP server (nginx) for large deployments (1000+ machines)

Security Best Practices

  1. Don’t store secrets in Ignition configs

    • Use Ignition files.source with auth headers to fetch from Vault
    • Or provision minimal config, fetch secrets post-boot
  2. Network segmentation

    • Provision VLAN isolated from production
    • Firewall rules: only allow provisioning traffic
  3. gRPC API access control

    • Client cert authentication (mandatory)
    • Restrict cert issuance to authorized personnel/systems
    • Rotate certs periodically
  4. Audit logging

    • Version control groups/profiles (git)
    • Log gRPC API changes (Terraform state tracking)
    • Monitor HTTP endpoint access
  5. Validate configs before deployment

    • fcct --strict for Butane configs
    • Terraform plan before apply
    • Test in dev environment first

Troubleshooting

Common Issues

1. Machines not PXE booting:

# Check DHCP responses
tcpdump -i eth0 port 67 and port 68

# Verify TFTP files
ls -la /var/lib/tftpboot/
curl tftp://192.168.1.100/undionly.kpxe

# Check Matchbox accessibility
curl http://matchbox.example.com:8080/boot.ipxe

2. 404 Not Found on /ignition:

# Test group matching
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'

# Check group exists
ls -la /var/lib/matchbox/groups/

# Check profile referenced by group exists
ls -la /var/lib/matchbox/profiles/

# Verify ignition_id file exists
ls -la /var/lib/matchbox/ignition/

3. gRPC connection refused (Terraform):

# Test TLS connection
openssl s_client -connect matchbox.example.com:8081 \
  -CAfile ~/.matchbox/ca.crt \
  -cert ~/.matchbox/client.crt \
  -key ~/.matchbox/client.key

# Check Matchbox gRPC is listening
sudo ss -tlnp | grep 8081

# Verify firewall
sudo firewall-cmd --list-ports

4. Ignition config validation errors:

# Validate Butane locally
podman run -i --rm quay.io/coreos/fcct:release --strict < config.yaml

# Fetch rendered Ignition
curl 'http://matchbox.example.com:8080/ignition?mac=...' | jq .

# Validate Ignition spec
curl 'http://matchbox.example.com:8080/ignition?mac=...' | \
  podman run -i --rm quay.io/coreos/ignition-validate:latest

Summary

Matchbox deployment considerations:

  • Architecture: Single-host (dev/lab) vs HA (production) vs Kubernetes
  • Installation: Binary (systemd), container (Docker/Podman), or Kubernetes manifests
  • Network boot: dnsmasq container (quick), existing infrastructure (enterprise), or iPXE-only (modern)
  • TLS: Self-signed (dev), corporate PKI (production), Let’s Encrypt (HTTP only)
  • Scaling: Vertical (simple) vs horizontal (requires shared storage)
  • Security: Client cert auth, network segmentation, no secrets in configs
  • Operations: Backup configs, GitOps workflow, monitoring/logging

Recommendation for production:

  • HA deployment (2+ instances) with load balancer
  • Shared storage (NFS or RWX PV on Kubernetes)
  • Corporate PKI for TLS certificates
  • GitOps workflow (Terraform + git-controlled configs)
  • Network segmentation (dedicated provisioning VLAN)
  • Prometheus/Grafana monitoring

5.3 - Network Boot Support

Detailed analysis of Matchbox’s network boot capabilities

Network Boot Support in Matchbox

Matchbox provides comprehensive network boot support for bare-metal provisioning, supporting multiple boot firmware types and protocols.

Overview

Matchbox serves as an HTTP entrypoint for network-booted machines but does not implement DHCP, TFTP, or DNS services itself. Instead, it integrates with existing network infrastructure (or companion services like dnsmasq) to provide a complete PXE boot solution.

Boot Protocol Support

1. PXE (Preboot Execution Environment)

Legacy BIOS support via chainloading to iPXE:

Machine BIOS → DHCP (gets TFTP server) → TFTP (gets undionly.kpxe) 
→ iPXE firmware → HTTP (Matchbox /boot.ipxe)

Key characteristics:

  • Requires TFTP server to serve undionly.kpxe (iPXE bootloader)
  • Chainloads from legacy PXE ROM to modern iPXE
  • Supports older hardware with basic PXE firmware
  • TFTP only used for initial iPXE bootstrap; subsequent downloads via HTTP

2. iPXE (Enhanced PXE)

Primary boot method supported by Matchbox:

iPXE Client → DHCP (gets boot script URL) → HTTP (Matchbox endpoints)
→ Kernel/initrd download → Boot with Ignition config

Endpoints served by Matchbox:

Endpoint       | Purpose
/boot.ipxe     | Static script that gathers machine attributes (UUID, MAC, hostname, serial)
/ipxe?<labels> | Rendered iPXE script with kernel, initrd, and boot args for the matched machine
/assets/       | Optional local caching of kernel/initrd images

Example iPXE flow:

  1. Machine boots with iPXE firmware
  2. DHCP response points to http://matchbox.example.com:8080/boot.ipxe
  3. iPXE fetches /boot.ipxe:
    #!ipxe
    chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&domain=${domain}&hostname=${hostname}&serial=${serial}
    
  4. iPXE makes request to /ipxe?uuid=...&mac=... with machine attributes
  5. Matchbox matches machine to group/profile and renders iPXE script:
    #!ipxe
    kernel /assets/coreos/VERSION/coreos_production_pxe.vmlinuz \
      coreos.config.url=http://matchbox.foo:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp} \
      coreos.first_boot=1
    initrd /assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz
    boot
    

Advantages:

  • HTTP downloads (faster than TFTP)
  • Scriptable boot logic
  • Can fetch configs from HTTP endpoints
  • Supports HTTPS (if compiled with TLS support)

3. GRUB2

UEFI firmware support:

UEFI Firmware → DHCP (gets GRUB bootloader) → TFTP (grub.efi)
→ GRUB → HTTP (Matchbox /grub endpoint)

Matchbox endpoint: /grub?<labels>

Example GRUB config rendered by Matchbox:

default=0
timeout=1
menuentry "CoreOS" {
  echo "Loading kernel"
  linuxefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe.vmlinuz" \
    "coreos.config.url=http://matchbox.foo:8080/ignition" "coreos.first_boot"
  echo "Loading initrd"
  initrdefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz"
}

Use case:

  • UEFI systems that prefer GRUB over iPXE
  • Environments with existing GRUB network boot infrastructure

4. PXELINUX (Legacy, via TFTP)

While not a primary Matchbox target, PXELINUX clients can be configured to chainload iPXE:

# /var/lib/tftpboot/pxelinux.cfg/default
timeout 10
default iPXE
LABEL iPXE
KERNEL ipxe.lkrn
APPEND dhcp && chain http://matchbox.example.com:8080/boot.ipxe

DHCP Configuration Patterns

Matchbox supports two DHCP deployment models:

Pattern 1: PXE-Enabled DHCP

Full DHCP server provides IP allocation + PXE boot options.

Example dnsmasq configuration:

dhcp-range=192.168.1.1,192.168.1.254,30m
enable-tftp
tftp-root=/var/lib/tftpboot

# Legacy BIOS → chainload to iPXE
dhcp-match=set:bios,option:client-arch,0
dhcp-boot=tag:bios,undionly.kpxe

# UEFI → iPXE
dhcp-match=set:efi32,option:client-arch,6
dhcp-boot=tag:efi32,ipxe.efi
dhcp-match=set:efi64,option:client-arch,9
dhcp-boot=tag:efi64,ipxe.efi

# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe

# DNS for Matchbox
address=/matchbox.example.com/192.168.1.100

Client architecture detection:

  • Option 93 (client-arch): Identifies BIOS (0), UEFI32 (6), UEFI64 (9)
  • User class: Detects iPXE clients to skip TFTP chainloading

Pattern 2: Proxy DHCP

Runs alongside existing DHCP server; provides only boot options (no IP allocation).

Example dnsmasq proxy-DHCP:

dhcp-range=192.168.1.1,proxy,255.255.255.0
enable-tftp
tftp-root=/var/lib/tftpboot

# Chainload legacy PXE to iPXE
pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe

Benefits:

  • Non-invasive: doesn’t replace existing DHCP
  • PXE clients receive merged responses from both DHCP servers
  • Ideal for environments where main DHCP cannot be modified

Network Boot Flow (Complete)

Scenario: BIOS machine with legacy PXE firmware

┌──────────────────────────────────────────────────────────────────┐
│ 1. Machine powers on, BIOS set to network boot                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. NIC PXE firmware broadcasts DHCPDISCOVER (PXEClient)          │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. DHCP/proxyDHCP responds with:                                 │
│    - IP address (if full DHCP)                                   │
│    - Next-server: TFTP server IP                                 │
│    - Filename: undionly.kpxe (based on arch=0)                   │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. PXE firmware downloads undionly.kpxe via TFTP                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. Execute iPXE (undionly.kpxe)                                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. iPXE requests DHCP again, identifies as iPXE (user-class)     │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. DHCP responds with boot URL (not TFTP):                       │
│    http://matchbox.example.com:8080/boot.ipxe                    │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 8. iPXE fetches /boot.ipxe via HTTP:                             │
│    #!ipxe                                                        │
│    chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&...                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 9. iPXE chains to /ipxe?uuid=XXX&mac=YYY (introspected labels)   │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 10. Matchbox matches machine to group/profile                    │
│     - Finds most specific group matching labels                  │
│     - Retrieves profile (kernel, initrd, args, configs)          │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 11. Matchbox renders iPXE script with:                           │
│     - kernel URL (local asset or remote HTTPS)                   │
│     - initrd URL                                                 │
│     - kernel args (including ignition.config.url)                │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 12. iPXE downloads kernel + initrd (HTTP/HTTPS)                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 13. iPXE boots kernel with specified args                        │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 14. Fedora CoreOS/Flatcar boots, Ignition runs                   │
│     - Fetches /ignition?uuid=XXX&mac=YYY from Matchbox           │
│     - Matchbox renders Ignition config with group metadata       │
│     - Ignition partitions disk, writes files, creates users      │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 15. System reboots (if disk install), boots from disk            │
└──────────────────────────────────────────────────────────────────┘

Asset Serving

Matchbox can serve static assets (kernel, initrd images) from a local directory to reduce bandwidth and increase speed:

Asset directory structure:

/var/lib/matchbox/assets/
├── fedora-coreos/
│   └── 36.20220906.3.2/
│       ├── fedora-coreos-36.20220906.3.2-live-kernel-x86_64
│       ├── fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img
│       └── fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img
└── flatcar/
    └── 3227.2.0/
        ├── flatcar_production_pxe.vmlinuz
        ├── flatcar_production_pxe_image.cpio.gz
        └── version.txt

HTTP endpoint: http://matchbox.example.com:8080/assets/

Scripts provided:

  • scripts/get-fedora-coreos - Download/verify Fedora CoreOS images
  • scripts/get-flatcar - Download/verify Flatcar Linux images

Profile reference:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": ["--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"]
  }
}

Alternative: Profiles can reference remote HTTPS URLs (requires iPXE compiled with TLS support):

{
  "kernel": "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-kernel-x86_64"
}

OS Support

Fedora CoreOS

Boot types:

  1. Live PXE (RAM-only, ephemeral)
  2. Install to disk (persistent, recommended)

Required kernel args:

  • coreos.inst.install_dev=/dev/sda - Target disk for install
  • coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Provisioning config
  • coreos.live.rootfs_url=... - Root filesystem image

Ignition fetch: During first boot, ignition.service fetches config from Matchbox
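
Putting those kernel args together, a profile for an install-to-disk Fedora CoreOS node might look like the sketch below (written as a shell heredoc; the version, paths, and Matchbox hostname reuse earlier examples and are placeholders). The quoted heredoc keeps ${uuid} and ${mac:hexhyp} literal so iPXE substitutes them at boot time:

cat > /var/lib/matchbox/profiles/fcos-install.json <<'EOF'
{
  "id": "fcos-install",
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": ["--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox.example.com:8080/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  },
  "ignition_id": "worker.yaml"
}
EOF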

Flatcar Linux

Boot types:

  1. Live PXE (RAM-only)
  2. Install to disk

Required kernel args:

  • flatcar.first_boot=yes - Marks first boot
  • flatcar.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Ignition config URL
  • flatcar.autologin - Auto-login to console (optional, dev/debug)

Ignition support: Flatcar uses Ignition v3.x for provisioning

RHEL CoreOS

Supported as it uses Ignition like Fedora CoreOS. Requires Red Hat-specific image sources.

Machine Matching & Labels

Matchbox matches machines to profiles using labels extracted during boot:

Reserved Label Selectors

Label    | Source               | Example                              | Normalized
uuid     | SMBIOS UUID          | 550e8400-e29b-41d4-a716-446655440000 | Lowercase
mac      | NIC MAC address      | 52:54:00:89:d8:10                    | Normalized to colons
hostname | Network boot program | node1.example.com                    | As-is
serial   | Hardware serial      | VMware-42 1a...                      | As-is

Custom Labels

Groups can match on arbitrary labels passed as query params:

/ipxe?mac=52:54:00:89:d8:10&region=us-west&env=prod

Matching precedence: Most specific group wins (most selector matches)

Firmware Compatibility

Firmware Type      | Client Arch | Boot File            | Protocol    | Matchbox Support
BIOS (legacy PXE)  | 0           | undionly.kpxe → iPXE | TFTP → HTTP | ✅ Via chainload
UEFI 32-bit        | 6           | ipxe.efi             | TFTP → HTTP | ✅
UEFI (BIOS compat) | 7           | ipxe.efi             | TFTP → HTTP | ✅
UEFI 64-bit        | 9           | ipxe.efi             | TFTP → HTTP | ✅
Native iPXE        | -           | N/A                  | HTTP        | ✅ Direct
GRUB (UEFI)        | -           | grub.efi             | TFTP → HTTP | ✅ /grub endpoint

Network Requirements

Firewall rules on Matchbox host:

# HTTP API (read-only)
firewall-cmd --add-port=8080/tcp --permanent

# gRPC API (authenticated, Terraform)
firewall-cmd --add-port=8081/tcp --permanent

DNS requirement:

  • matchbox.example.com must resolve to Matchbox server IP
  • Can be configured in dnsmasq, corporate DNS, or /etc/hosts on DHCP server

DHCP/TFTP host (if using dnsmasq):

firewall-cmd --add-service=dhcp --permanent
firewall-cmd --add-service=tftp --permanent
firewall-cmd --add-service=dns --permanent  # optional

Troubleshooting Tips

  1. Verify Matchbox endpoints:

    curl http://matchbox.example.com:8080
    # Should return: matchbox
    
    curl http://matchbox.example.com:8080/boot.ipxe
    # Should return iPXE script
    
  2. Test machine matching:

    curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10'
    # Should return rendered iPXE script with kernel/initrd
    
  3. Check TFTP files:

    ls -la /var/lib/tftpboot/
    # Should contain: undionly.kpxe, ipxe.efi, grub.efi
    
  4. Verify DHCP responses:

    tcpdump -i eth0 -n port 67 and port 68
    # Watch for DHCP offers with PXE options
    
  5. iPXE console debugging:

    • Press Ctrl+B during iPXE boot to enter console
    • Commands: dhcp, ifstat, show net0/ip, chain http://...

Limitations

  1. HTTPS support: iPXE must be compiled with crypto support (larger binary, ~80KB vs ~45KB)
  2. TFTP dependency: Legacy PXE requires TFTP for initial chainload (can’t skip)
  3. No DHCP/TFTP built-in: Must use external services or dnsmasq container
  4. Boot firmware variations: Some vendor PXE implementations have quirks
  5. SecureBoot: iPXE and GRUB must be signed (or SecureBoot disabled)

Reference Implementation: dnsmasq Container

Matchbox project provides quay.io/poseidon/dnsmasq with:

  • Pre-configured DHCP/TFTP/DNS service
  • Bundled ipxe.efi, undionly.kpxe, grub.efi
  • Example configs for PXE-DHCP and proxy-DHCP modes

Quick start (full DHCP):

docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries --log-dhcp

Quick start (proxy-DHCP):

docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.1,proxy,255.255.255.0 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
  --pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
  --log-queries --log-dhcp

Summary

Matchbox provides robust network boot support through:

  • Protocol flexibility: iPXE (primary), GRUB2, legacy PXE (via chainload)
  • Firmware compatibility: BIOS and UEFI
  • Modern approach: HTTP-based with optional local asset caching
  • Clean separation: Matchbox handles config rendering; external services handle DHCP/TFTP
  • Production-ready: Used by Typhoon Kubernetes distributions for bare-metal provisioning

5.4 - Use Case Evaluation

Evaluation of Matchbox for specific use cases and comparison with alternatives

Matchbox Use Case Evaluation

Analysis of Matchbox’s suitability for various use cases, strengths, limitations, and comparison with alternative provisioning solutions.

Use Case Fit Analysis

✅ Ideal Use Cases

1. Bare-Metal Kubernetes Clusters

Scenario: Provisioning 10-1000 physical servers for Kubernetes nodes

Why Matchbox Excels:

  • Ignition-native (perfect for Fedora CoreOS/Flatcar)
  • Declarative machine provisioning via Terraform
  • Label-based matching (region, role, hardware type)
  • Integration with Typhoon Kubernetes distribution
  • Minimal OS surface (immutable, container-optimized)

Example workflow:

resource "matchbox_profile" "k8s_controller" {
  name   = "k8s-controller"
  kernel = "/assets/fedora-coreos/.../kernel"
  raw_ignition = data.ct_config.controller.rendered
}

resource "matchbox_group" "controllers" {
  profile = matchbox_profile.k8s_controller.name
  selector = {
    role = "controller"
  }
}

Alternatives considered:

  • Cloud-init + netboot.xyz: Less declarative, no native Ignition support
  • Foreman: Heavier, more complex for container-centric workloads
  • Metal³: Kubernetes-native but requires existing cluster

Verdict: ⭐⭐⭐⭐⭐ Matchbox is purpose-built for this


2. Lab/Development Environments

Scenario: Rapid PXE boot testing with QEMU/KVM VMs or homelab servers

Why Matchbox Excels:

  • Quick setup (binary + dnsmasq container)
  • No DHCP infrastructure required (proxy-DHCP mode)
  • Localhost deployment (no external dependencies)
  • Fast iteration (change configs, re-PXE)
  • Included examples and scripts

Example setup:

# Start Matchbox locally
docker run -d --net=host -v /var/lib/matchbox:/var/lib/matchbox \
  quay.io/poseidon/matchbox:latest -address=0.0.0.0:8080

# Start dnsmasq on same host
docker run -d --net=host --cap-add=NET_ADMIN \
  quay.io/poseidon/dnsmasq ...

Alternatives considered:

  • netboot.xyz: Great for manual OS selection, no automation
  • PiXE server: Simpler but less flexible matching logic
  • Manual iPXE scripts: No dynamic matching, manual maintenance

Verdict: ⭐⭐⭐⭐⭐ Minimal setup, maximum flexibility


3. Edge/Remote Site Provisioning

Scenario: Provision machines at 10+ remote datacenters or edge locations

Why Matchbox Excels:

  • Lightweight (single binary, ~20MB)
  • Declarative region-based matching
  • Centralized config management (Terraform)
  • Can run on minimal hardware (ARM support)
  • HTTP-based (works over WAN with reverse proxy)

Architecture:

Central Matchbox (via Terraform)
  ↓ gRPC API
Regional Matchbox Instances (read-only cache)
  ↓ HTTP
Edge Machines (PXE boot)

Label-based routing:

{
  "selector": {
    "region": "us-west",
    "site": "pdx-1"
  },
  "metadata": {
    "ntp_servers": ["10.100.1.1", "10.100.1.2"]
  }
}

Alternatives considered:

  • Foreman: Requires more resources per site
  • Ansible + netboot: No declarative PXE boot, post-install only
  • Cloud-init datasources: Requires cloud metadata service per site

Verdict: ⭐⭐⭐⭐☆ Good fit, but consider caching strategy for WAN


⚠️ Moderate Fit Use Cases

4. Multi-Tenant Bare-Metal Cloud

Scenario: Provide bare-metal-as-a-service to multiple customers

Matchbox challenges:

  • No built-in multi-tenancy (single namespace)
  • No RBAC (gRPC API is all-or-nothing with client certs)
  • No customer self-service portal

Workarounds:

  • Deploy separate Matchbox per tenant (isolation via separate instances)
  • Proxy gRPC API with custom RBAC layer
  • Use group selectors with customer IDs

Better alternatives:

  • Metal³ (Kubernetes-native, better multi-tenancy)
  • OpenStack Ironic (purpose-built for bare-metal cloud)
  • MAAS (Ubuntu-specific, has RBAC)

Verdict: ⭐⭐☆☆☆ Possible but architecturally challenging


5. Heterogeneous OS Provisioning

Scenario: Need to provision Fedora CoreOS, Ubuntu, RHEL, Windows

Matchbox challenges:

  • Designed for Ignition-based OSes (FCOS, Flatcar, RHCOS)
  • No native support for Kickstart (RHEL/CentOS)
  • No support for Preseed (Ubuntu/Debian)
  • No Windows unattend.xml support

What works:

  • Fedora CoreOS ✅
  • Flatcar Linux ✅
  • RHEL CoreOS ✅
  • Container Linux (deprecated but supported) ✅

What requires workarounds:

  • RHEL/CentOS: Possible via generic configs + Kickstart URLs, but not native
  • Ubuntu: Can PXE boot and point to autoinstall ISO, but loses Matchbox templating benefits
  • Debian: Similar to Ubuntu
  • Windows: Not supported (different PXE boot mechanisms)

Better alternatives for heterogeneous environments:

  • Foreman (supports Kickstart, Preseed, unattend.xml)
  • MAAS (Ubuntu-centric but extensible)
  • Cobbler (older but supports many OS types)

Verdict: ⭐⭐☆☆☆ Stick to Ignition-based OSes or use different tool


❌ Poor Fit Use Cases

6. Windows PXE Boot

Why Matchbox doesn’t fit:

  • No WinPE support
  • No unattend.xml rendering
  • Different PXE boot chain (WDS/SCCM model)

Recommendation: Use Microsoft WDS or SCCM

Verdict: ⭐☆☆☆☆ Not designed for this


7. BIOS/Firmware Updates

Why Matchbox doesn’t fit:

  • Focused on OS provisioning, not firmware
  • No vendor-specific tooling (Dell iDRAC, HP iLO integration)

Recommendation: Use vendor tools or Ansible with ipmi/redfish modules

Verdict: ⭐☆☆☆☆ Out of scope


Strengths

1. Ignition-First Design

  • Native support for modern immutable OSes
  • Declarative, atomic provisioning (no config drift)
  • First-boot partition/filesystem setup

2. Label-Based Matching

  • Flexible machine classification (MAC, UUID, region, role, custom)
  • Most-specific-match algorithm (override defaults per machine)
  • Query params for dynamic attributes

3. Terraform Integration

  • Declarative infrastructure as code
  • Plan before apply (preview changes)
  • State tracking for auditability
  • Rich templating (ct_config provider for Butane)

4. Minimal Dependencies

  • Single static binary (~20MB)
  • No database required (FileStore default)
  • No built-in DHCP/TFTP (separation of concerns)
  • Container-ready (OCI image available)

5. HTTP-Centric

  • Faster downloads than TFTP (iPXE via HTTP)
  • Proxy/CDN friendly for asset distribution
  • Standard web tooling (curl, load balancers, Ingress)

6. Production-Ready

  • Used by Typhoon Kubernetes (battle-tested)
  • Clear upgrade path (SemVer releases)
  • OpenPGP signature support for config integrity

Limitations

1. No Multi-Tenancy

  • Single namespace (all groups/profiles global)
  • No RBAC on gRPC API (client cert = full access)
  • Requires separate instances per tenant

2. Ignition-Only Focus

  • Cloud-Config deprecated (legacy support only)
  • No native Kickstart/Preseed/unattend.xml
  • Limits OS choice to CoreOS family

3. Storage Constraints

  • FileStore doesn’t scale to 10,000+ profiles
  • No built-in HA storage (requires NFS or custom backend)
  • Kubernetes deployment needs RWX PersistentVolume

4. No Machine Discovery

  • Doesn’t detect new machines (passive service)
  • No inventory management (use external CMDB)
  • No hardware introspection (use Ironic for that)

5. Limited Observability

  • No built-in metrics (Prometheus integration requires reverse proxy)
  • Logs are minimal (request logging only)
  • No audit trail for gRPC API changes (use Terraform state)

6. TFTP Still Required

  • Legacy BIOS PXE needs TFTP for chainloading to iPXE
  • Can’t fully eliminate TFTP unless all machines have native iPXE

Comparison with Alternatives

vs. Foreman

Feature      | Matchbox                | Foreman
OS Support   | Ignition-based          | Kickstart, Preseed, AutoYaST, etc.
Complexity   | Low (single binary)     | High (Rails app, DB, Puppet/Ansible)
Config Model | Declarative (Ignition)  | Imperative (post-install scripts)
API          | HTTP + gRPC             | REST API
UI           | None (API-only)         | Full web UI
Terraform    | Native provider         | Community modules
Use Case     | Container-centric infra | Traditional Linux servers

When to choose Matchbox: CoreOS-based Kubernetes clusters, minimal infrastructure
When to choose Foreman: Heterogeneous OS, need web UI, traditional config mgmt


vs. Metal³

Feature                | Matchbox                    | Metal³
Platform               | Standalone                  | Kubernetes-native (operator)
Bootstrap              | Can bootstrap k8s cluster   | Needs existing k8s cluster
Machine Lifecycle      | Provision only              | Provision + decommission + reprovision
Hardware Introspection | No (labels passed manually) | Yes (via Ironic)
Multi-tenancy          | No                          | Yes (via k8s namespaces)
Complexity             | Low                         | High (requires Ironic, DHCP, etc.)

When to choose Matchbox: Greenfield bare-metal, no existing k8s
When to choose Metal³: Existing k8s, need hardware mgmt lifecycle


vs. Cobbler

Feature         | Matchbox               | Cobbler
Age             | Modern (2016+)         | Legacy (2008+)
Config Format   | Ignition (declarative) | Kickstart/Preseed (imperative)
Templating      | Go templates (minimal) | Cheetah templates (extensive)
Language        | Go (static binary)     | Python (requires interpreter)
DHCP Management | External               | Can manage DHCP
Maintenance     | Active (Poseidon)      | Low activity

When to choose Matchbox: Modern immutable OSes, container workloads
When to choose Cobbler: Legacy infra, need DHCP management, heterogeneous OS


vs. MAAS (Ubuntu)

Feature           | Matchbox               | MAAS
OS Support        | CoreOS family          | Ubuntu (primary), others (limited)
IPAM              | No (external DHCP)     | Built-in IPAM
Power Mgmt        | No (manual or scripts) | Built-in (IPMI, AMT, etc.)
UI                | No                     | Full web UI
Declarative       | Yes (Terraform)        | Limited (CLI mostly)
Cloud Integration | No                     | Yes (libvirt, LXD, VM hosts)

When to choose Matchbox: Non-Ubuntu, Kubernetes, minimal dependencies
When to choose MAAS: Ubuntu-centric, need power mgmt, cloud integration


vs. netboot.xyz

Feature       | Matchbox               | netboot.xyz
Purpose       | Automated provisioning | Manual OS selection menu
Automation    | Full (API-driven)      | None (interactive menu)
Customization | Per-machine configs    | Global menu
Ignition      | Native support         | No
Complexity    | Medium                 | Very low

When to choose Matchbox: Automated fleet provisioning
When to choose netboot.xyz: Ad-hoc OS installation, homelab


Decision Matrix

Use this table to evaluate Matchbox for your use case:

Requirement              | Weight | Matchbox Score | Notes
Ignition/CoreOS support  | High   | ⭐⭐⭐⭐⭐     | Native, first-class
Heterogeneous OS         | High   | ⭐⭐☆☆☆      | Limited to Ignition OSes
Declarative provisioning | Medium | ⭐⭐⭐⭐⭐     | Terraform native
Multi-tenancy            | Medium | ⭐☆☆☆☆      | Requires separate instances
Web UI                   | Medium | ☆☆☆☆☆      | No UI (API-only)
Ease of deployment       | Medium | ⭐⭐⭐⭐☆     | Binary or container, minimal deps
Scalability              | Medium | ⭐⭐⭐☆☆      | FileStore limits, need shared storage for HA
Hardware mgmt            | Low    | ☆☆☆☆☆      | No power mgmt, no introspection
Cost                     | Low    | ⭐⭐⭐⭐⭐     | Open source, Apache 2.0

Scoring:

  • ⭐⭐⭐⭐⭐ Excellent
  • ⭐⭐⭐⭐☆ Good
  • ⭐⭐⭐☆☆ Adequate
  • ⭐⭐☆☆☆ Limited
  • ⭐☆☆☆☆ Poor
  • ☆☆☆☆☆ Not supported

Recommendations

Choose Matchbox if:

  1. ✅ Provisioning Fedora CoreOS, Flatcar, or RHEL CoreOS
  2. ✅ Building bare-metal Kubernetes clusters
  3. ✅ Prefer declarative infrastructure (Terraform)
  4. ✅ Want minimal dependencies (single binary)
  5. ✅ Need flexible label-based machine matching
  6. ✅ Have homogeneous OS requirements (all Ignition-based)

Avoid Matchbox if:

  1. ❌ Need multi-OS support (Windows, traditional Linux)
  2. ❌ Require web UI for operations teams
  3. ❌ Need built-in hardware management (power, BIOS config)
  4. ❌ Have strict multi-tenancy requirements
  5. ❌ Need automated hardware discovery/introspection

Hybrid Approaches

Pattern 1: Matchbox + Ansible

  • Matchbox: Initial OS provisioning
  • Ansible: Post-boot configuration, app deployment
  • Works well for stateful services on bare-metal

Pattern 2: Matchbox + Metal³

  • Matchbox: Bootstrap initial k8s cluster
  • Metal³: Ongoing cluster node lifecycle management
  • Gradual migration from Matchbox to Metal³

Pattern 3: Matchbox + Terraform + External Secrets

  • Matchbox: Base OS + minimal config
  • Ignition: Fetch secrets from Vault/GCP Secret Manager
  • Terraform: Orchestrate end-to-end provisioning
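
Whichever pattern is chosen, it is useful to confirm which group and profile Matchbox will match for a machine before it ever boots. Matchbox's read-only HTTP API serves the rendered Ignition document for a given set of labels; the sketch below fetches it with plain Python, where the matchbox.example.com:8080 endpoint and the mac label value are placeholders standing in for your deployment.

```python
# Fetch the Ignition config Matchbox would serve for a machine with the given
# labels, so group matching can be verified before the machine PXE-boots.
# Assumptions: Matchbox's HTTP API listens on port 8080 and groups match on a
# "mac" label; both are placeholders for your environment.
import requests

MATCHBOX_URL = "http://matchbox.example.com:8080"
labels = {"mac": "52:54:00:a1:9c:ae"}   # hypothetical machine identity

resp = requests.get(f"{MATCHBOX_URL}/ignition", params=labels, timeout=10)
resp.raise_for_status()
print(resp.text)   # rendered Ignition JSON for the matched group/profile
```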

Conclusion

Matchbox is a purpose-built, minimalist network boot service optimized for modern immutable operating systems (Ignition-based). It excels in container-centric bare-metal environments, particularly for Kubernetes clusters built with Fedora CoreOS or Flatcar Linux.

Best fit: Organizations adopting immutable infrastructure patterns, container orchestration, and declarative provisioning workflows.

Not ideal for: Heterogeneous OS environments, multi-tenant bare-metal clouds, or teams requiring extensive web UI and built-in hardware management.

For home labs and development, Matchbox offers an excellent balance of simplicity and power. For production Kubernetes deployments, it’s a proven, battle-tested solution (via Typhoon). For complex enterprise provisioning with mixed OS requirements, consider Foreman or MAAS instead.

6 - Ubiquiti Dream Machine Pro Analysis

Comprehensive analysis of the Ubiquiti Dream Machine Pro capabilities, focusing on network boot (PXE) support and infrastructure integration.

Overview

The Ubiquiti Dream Machine Pro (UDM Pro) is an all-in-one network gateway, router, and switch designed for enterprise and advanced home lab environments. This analysis focuses on its capabilities relevant to infrastructure automation and network boot scenarios.

Key Specifications

Hardware

  • Processor: Quad-core ARM Cortex-A57 @ 1.7 GHz
  • RAM: 4GB DDR4
  • Storage: 128GB eMMC (for UniFi OS, applications, and logs)
  • Network Interfaces:
    • 1x WAN port (RJ45, SFP, or SFP+)
    • 8x LAN ports (1 Gbps RJ45, configurable)
    • 1x SFP+ port (10 Gbps)
    • 1x SFP port (1 Gbps)
  • Additional Features:
    • 3.5" SATA HDD bay (for UniFi Protect surveillance)
    • IDS/IPS engine
    • Deep packet inspection
    • Built-in UniFi Network Controller

Software

  • OS: UniFi OS (Linux-based)
  • Controller: Built-in UniFi Network Controller
  • Services: DHCP, DNS, routing, firewall, VPN (site-to-site and remote access)

Network Boot (PXE) Support

Native DHCP PXE Capabilities

The UDM Pro provides basic PXE boot support through its DHCP server:

Supported:

  • DHCP Option 66 (next-server / TFTP server address)
  • DHCP Option 67 (filename / boot file name)
  • Basic single-architecture PXE booting

Configuration via UniFi Controller:

  1. Navigate to Settings → Networks → Select your network
  2. Scroll to DHCP section
  3. Enable DHCP
  4. Under Advanced DHCP Options:
    • TFTP Server: IP address of your TFTP/PXE server (e.g., 192.168.42.16)
    • Boot Filename: Name of the bootloader file (e.g., pxelinux.0 for BIOS or bootx64.efi for UEFI)
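
To confirm what the DHCP server actually hands out without rebooting a machine into PXE, you can broadcast a DHCPDISCOVER from a host on the same network and inspect the offer. A rough scapy sketch follows; it must run as root, the eth0 interface name is an assumption, and the exact decoding of options can vary between scapy versions.

```python
# Send a DHCPDISCOVER and print the PXE-relevant fields from the DHCPOFFER:
# next-server (siaddr), boot filename, and the raw option list.
from scapy.all import BOOTP, DHCP, IP, UDP, Ether, get_if_hwaddr, srp1

iface = "eth0"                       # assumption: NIC on the PXE-enabled VLAN
mac = get_if_hwaddr(iface)

discover = (
    Ether(src=mac, dst="ff:ff:ff:ff:ff:ff")
    / IP(src="0.0.0.0", dst="255.255.255.255")
    / UDP(sport=68, dport=67)
    / BOOTP(chaddr=bytes.fromhex(mac.replace(":", "")),
            xid=0x1234, flags=0x8000)            # 0x8000 asks for a broadcast reply
    / DHCP(options=[("message-type", "discover"), "end"])
)

offer = srp1(discover, iface=iface, timeout=5, verbose=False)
if offer and offer.haslayer(DHCP):
    print("next-server (siaddr):", offer[BOOTP].siaddr)
    print("boot filename:", offer[BOOTP].file)    # Option 67 / BOOTP file field
    print("all DHCP options:", offer[DHCP].options)
else:
    print("no DHCPOFFER received")
```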

Limitations:

  • No multi-architecture support: Cannot differentiate boot files based on client architecture (BIOS vs. UEFI, x86_64 vs. ARM64)
  • No conditional DHCP options: Cannot vary filename or next-server based on client characteristics
  • Fixed boot parameters: One boot configuration for all PXE clients
  • Single bootloader only: Must choose either BIOS or UEFI bootloader, not both

Use Cases:

  • ✅ Homogeneous environments (all BIOS or all UEFI)
  • ✅ Single OS deployment scenarios
  • ✅ Simple provisioning workflows
  • ❌ Mixed BIOS/UEFI environments (requires external DHCP server with conditional logic)

Network Segmentation & VLANs

The UDM Pro excels at network segmentation, critical for infrastructure isolation:

  • VLAN Support: Native 802.1Q tagging
  • Firewall Rules: Inter-VLAN routing with granular firewall policies
  • Network Isolation: Can create fully isolated networks or controlled inter-network traffic
  • Use Cases for Infrastructure:
    • Management VLAN (for PXE/provisioning)
    • Production VLAN (workloads)
    • IoT/OT VLAN (isolated devices)
    • DMZ (exposed services)

VPN Capabilities

Site-to-Site VPN

  • Protocols: IPsec, WireGuard (experimental)
  • Use Case: Connect home lab to cloud infrastructure (GCP, AWS, Azure)
  • Performance: Hardware-accelerated encryption on UDM Pro

Remote Access VPN

  • Protocols: L2TP, OpenVPN
  • Use Case: Remote administration of home lab infrastructure
  • Integration: Can work with Cloudflare Access for additional security layer

IDS/IPS Engine

  • Technology: Suricata-based
  • Capabilities:
    • Intrusion detection
    • Intrusion prevention (can drop malicious traffic)
    • Threat signatures updated via UniFi
  • Performance Impact: Can affect throughput on high-bandwidth connections
  • Recommendation: Enable for security-sensitive infrastructure segments

DNS & DHCP Services

DNS

  • Local DNS: Can act as caching DNS resolver
  • Custom DNS Records: Limited to UniFi controller hostname
  • Recommendation: Use external DNS (Pi-hole, Bind9) for advanced features like split-horizon DNS

DHCP

  • Static Leases: Supports MAC-based static IP assignments
  • DHCP Options: Can configure common options (NTP, DNS, domain name)
  • Reservations: Per-client reservations via GUI
  • PXE Options: Basic Option 66/67 support (as noted above)

Integration with Infrastructure-as-Code

UniFi Network API

  • REST API: Available for configuration automation
  • Python Libraries: pyunifi and others for programmatic access
  • Use Cases:
    • Terraform provider for network state management
    • Ansible modules for configuration automation
    • CI/CD integration for network-as-code
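
As a sketch of what API-driven automation looks like against a UDM Pro, the snippet below logs in to UniFi OS and lists the configured networks/VLANs using plain requests. The /api/auth/login and /proxy/network/... paths and the response field names follow the community-documented (not officially published) UniFi OS API, so treat them as assumptions to verify against your firmware version.

```python
# Log in to UniFi OS on a UDM Pro and list configured networks/VLANs.
# Assumptions: device address, credentials, endpoint paths, and field names
# are all placeholders based on the community-documented API.
import requests
import urllib3

urllib3.disable_warnings()           # UDM Pros typically use a self-signed cert

UDM = "https://192.168.1.1"
session = requests.Session()
session.verify = False

# UniFi OS login; sets an auth cookie (write calls also need the returned
# x-csrf-token header, which this read-only example does not use).
session.post(f"{UDM}/api/auth/login",
             json={"username": "admin", "password": "changeme"})

# List networks via the Network application proxy path.
resp = session.get(f"{UDM}/proxy/network/api/s/default/rest/networkconf")
for net in resp.json().get("data", []):
    print(net.get("name"), net.get("vlan"), net.get("ip_subnet"))
```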

Terraform Provider

  • Provider: paultyng/unifi
  • Capabilities: Manage networks, firewall rules, port forwarding, DHCP settings
  • Limitations: Not all UI features exposed via API

Configuration Persistence

  • Backup/Restore: JSON-based configuration export
  • Version Control: Can track config changes in Git
  • Recovery: Auto-backup to cloud (optional)

Performance Characteristics

Throughput

  • Routing/NAT: ~3.5 Gbps (without IDS/IPS)
  • IDS/IPS Enabled: ~850 Mbps - 1 Gbps
  • VPN (IPsec): ~1 Gbps
  • Inter-VLAN Routing: Wire speed (8 Gbps backplane)

Scalability

  • Concurrent Devices: 500+ clients tested
  • VLANs: Up to 32 networks/VLANs
  • Firewall Rules: Thousands (performance depends on complexity)
  • DHCP Leases: Supports large pools efficiently

Comparison to Alternatives

| Feature | UDM Pro | pfSense | OPNsense | MikroTik |
|---|---|---|---|---|
| Basic PXE | ✅ | ✅ | ✅ | ✅ |
| Conditional DHCP | ❌ | ✅ | ✅ | ✅ |
| All-in-one | ✅ | ❌ | ❌ | Varies |
| GUI ease-of-use | ✅ | ✅ | ⚠️ | ⚠️ |
| API/Automation | ⚠️ | ✅ | ✅ | ✅ |
| IDS/IPS built-in | ✅ | ⚠️ (add-on) | ⚠️ (add-on) | ❌ |
| Hardware | Fixed | Flexible | Flexible | Flexible |
| Price | $$$ | $ (+ hardware) | $ (+ hardware) | $ - $$$ |

Recommendations for Home Lab Use

Ideal Use Cases

Use the UDM Pro when:

  • You want an all-in-one solution with minimal configuration
  • You need integrated UniFi controller and network management
  • Your home lab has mixed UniFi hardware (switches, APs)
  • You want a polished GUI and mobile app management
  • Network segmentation and VLANs are critical

Consider Alternatives When

⚠️ Look elsewhere if:

  • You need conditional DHCP options or multi-architecture PXE boot
  • You require advanced routing protocols (BGP, OSPF beyond basics)
  • You need granular firewall control and scripting (pfSense/OPNsense better)
  • Budget is tight and you already have x86 hardware (pfSense on old PC)
  • You need extremely low latency (sub-1ms) routing

Recommended Home Lab Configuration

  1. Network Segmentation:

    • VLAN 10: Management (PXE, Ansible, provisioning tools)
    • VLAN 20: Kubernetes cluster
    • VLAN 30: Storage network (NFS, iSCSI)
    • VLAN 40: Public-facing services (behind Cloudflare)
  2. DHCP Strategy:

    • Use UDM Pro native DHCP with basic PXE options for single-arch PXE needs
    • Static reservations for infrastructure components
    • Consider external DHCP server if conditional options are required
  3. Firewall Rules:

    • Default deny between VLANs
    • Allow management VLAN → all (with source IP restrictions)
    • Allow cluster VLAN → storage VLAN (on specific ports)
    • NAT only on VLAN 40 (public services)
  4. VPN Configuration:

    • Site-to-Site to GCP via WireGuard (lower overhead than IPsec)
    • Remote access VPN on separate VLAN with restrictive firewall
  5. Integration:

    • Terraform for network state management
    • Ansible for DHCP/DNS servers in management VLAN
    • Cloudflare Access for secure public service exposure

Conclusion

The UDM Pro is a capable all-in-one network device ideal for home labs that prioritize ease-of-use and integration with the UniFi ecosystem. It provides basic PXE boot support suitable for single-architecture environments, though conditional DHCP options require external DHCP servers for complex scenarios.

For infrastructure automation projects, the UDM Pro serves well as a reliable network foundation that handles VLANs, routing, and basic services, allowing you to focus on higher-level infrastructure concerns like container orchestration and cloud integration.

6.1 - UDM Pro VLAN Configuration & Capabilities

Detailed analysis of VLAN support on the Ubiquiti Dream Machine Pro, including port-based VLAN assignment and VPN integration.

Overview

The Ubiquiti Dream Machine Pro (UDM Pro) provides robust VLAN support through native 802.1Q tagging, enabling network segmentation for security, performance, and organizational purposes. This document covers VLAN configuration capabilities, port assignments, and VPN integration.

VLAN Fundamentals on UDM Pro

Supported Standards

  • 802.1Q VLAN Tagging: Full support for standard VLAN tagging
  • VLAN Range: IDs 1-4094 (standard IEEE 802.1Q range)
  • Maximum VLANs: Up to 32 networks/VLANs per device
  • Native VLAN: Configurable per port (default: VLAN 1)

VLAN Types

Corporate Network

  • Default network type for general-purpose VLANs
  • Provides DHCP, inter-VLAN routing, and firewall capabilities
  • Can enable/disable guest policies, IGMP snooping, and multicast DNS

Guest Network

  • Isolated network with internet-only access
  • Automatic firewall rules preventing access to other VLANs
  • Captive portal support for guest authentication

IoT Network

  • Optimized for IoT devices with device isolation
  • Prevents lateral movement between IoT devices
  • Allows communication with controller/gateway only

Port-Based VLAN Assignment

Per-Port VLAN Configuration

The UDM Pro’s 8x 1 Gbps LAN ports and SFP/SFP+ ports support flexible VLAN assignment:

Configuration Options per Port:

  1. Native VLAN/Untagged VLAN: The default VLAN for untagged traffic on the port
  2. Tagged VLANs: Multiple VLANs that can pass through the port with 802.1Q tags
  3. Port Profile: Pre-configured VLAN assignments that can be applied to ports

Port Profile Types

All: Port accepts all VLANs (trunk mode)

  • Passes all configured VLANs with tags
  • Used for connecting managed switches or access points
  • Native VLAN for untagged traffic

Specific VLANs: Port limited to selected VLANs

  • Choose which VLANs are allowed (tagged)
  • Set native/untagged VLAN
  • Used for controlled trunk links

Single VLAN: Access port mode

  • Port carries only one VLAN (untagged)
  • All traffic on this port belongs to specified VLAN
  • Used for end devices (PCs, servers, printers)

Configuration Steps

Via UniFi Controller GUI:

  1. Create Port Profile:

    • Navigate to Settings → Profiles → Port Manager
    • Click Create New Port Profile
    • Select profile type (All, LAN, or Custom)
    • Configure VLAN settings:
      • Native VLAN/Network: Untagged VLAN
      • Tagged VLANs: Select allowed VLANs (for trunk mode)
    • Enable/disable settings: PoE, Storm Control, Port Isolation
  2. Assign Profile to Ports:

    • Navigate to UniFi Devices → Select UDM Pro
    • Go to Ports tab
    • For each LAN port (1-8) or SFP port:
      • Click port to edit
      • Select Port Profile from dropdown
      • Apply changes
  3. Quick Port Assignment (Alternative):

    • Settings → Networks → Select VLAN
    • Under Port Manager, assign specific ports to this network
    • Ports become access ports for this VLAN

Example Port Layout

UDM Pro Port Assignment Example:

Port 1: Native VLAN 10 (Management) - Access Mode
        └── Use: Ansible control server

Port 2: Native VLAN 20 (Kubernetes) - Access Mode
        └── Use: K8s master node

Port 3: Native VLAN 30 (Storage) - Access Mode
        └── Use: NAS/SAN device

Port 4: Native VLAN 1, Tagged: 10,20,30,40 - Trunk Mode
        └── Use: Managed switch uplink

Port 5-7: Native VLAN 40 (DMZ) - Access Mode
          └── Use: Public-facing servers

Port 8: Native VLAN 1 (Default/Untagged) - Access Mode
        └── Use: Management laptop (temporary)

SFP+: Native VLAN 1, Tagged: All - Trunk Mode
      └── Use: 10G uplink to core switch

VLAN Features and Capabilities

Inter-VLAN Routing

Enabled by Default:

  • Hardware-accelerated routing between VLANs
  • Wire-speed performance (8 Gbps backplane)
  • Routing decisions made at Layer 3

Firewall Control:

  • Default behavior: Allow all inter-VLAN traffic
  • Recommended: Create explicit allow/deny rules per VLAN pair
  • Granular control: Protocol, port, source/destination filtering

Example Firewall Rules:

Rule 1: Allow Management (VLAN 10) → All VLANs
        Source: 192.168.10.0/24
        Destination: Any
        Action: Accept

Rule 2: Allow K8s (VLAN 20) → Storage (VLAN 30) - NFS only
        Source: 192.168.20.0/24
        Destination: 192.168.30.0/24
        Ports: 2049 (NFS), 111 (Portmapper)
        Action: Accept

Rule 3: Block IoT (VLAN 50) → All Private Networks
        Source: 192.168.50.0/24
        Destination: 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12
        Action: Drop

Rule 4 (Implicit): Default Deny Between VLANs
        Source: Any
        Destination: Any
        Action: Drop

DHCP per VLAN

Each VLAN can have its own DHCP server:

  • Independent IP ranges per VLAN
  • Separate DHCP options (DNS, gateway, NTP, domain)
  • Static DHCP reservations per VLAN
  • PXE boot options (Option 66/67) per network

Configuration:

  • Settings → Networks → Select VLAN
  • DHCP section:
    • Enable DHCP server
    • Define IP range (e.g., 192.168.10.100-192.168.10.254)
    • Set lease time
    • Configure gateway (usually UDM Pro’s IP on this VLAN)
    • Add custom DHCP options

Example DHCP Configuration:

VLAN 10 (Management):
  Subnet: 192.168.10.0/24
  Gateway: 192.168.10.1 (UDM Pro)
  DHCP Range: 192.168.10.100-192.168.10.200
  DNS: 192.168.10.10 (local DNS server)
  TFTP Server (Option 66): 192.168.10.16
  Boot Filename (Option 67): pxelinux.0

VLAN 20 (Kubernetes):
  Subnet: 192.168.20.0/24
  Gateway: 192.168.20.1 (UDM Pro)
  DHCP Range: 192.168.20.50-192.168.20.99
  DNS: 8.8.8.8, 8.8.4.4
  Domain Name: k8s.lab.local

VLAN Isolation

Guest Portal Isolation:

  • Guest networks auto-configured with isolation rules
  • Prevents access to RFC1918 private networks
  • Internet-only access by default

Manual Isolation (Firewall Rules):

  • Create LAN In rules to block inter-VLAN traffic
  • Use groups for easier management of multiple VLANs
  • Apply port isolation for additional security

Device Isolation (IoT Networks):

  • Prevents devices on same VLAN from communicating
  • Only controller/gateway access allowed
  • Use for untrusted IoT devices (cameras, smart home)

VPN and VLAN Integration

Site-to-Site VPN VLAN Assignment

✅ VLANs CAN be assigned to site-to-site VPN connections:

WireGuard VPN:

  • Configure remote subnet to map to specific local VLAN
  • Example: GCP subnet 10.128.0.0/20 → routed through VLAN 10
  • Routing table automatically updated
  • Firewall rules apply to VPN traffic

IPsec Site-to-Site:

  • Specify local networks (can select specific VLANs)
  • Remote networks configured in tunnel settings
  • Multiple VLANs can traverse single VPN tunnel
  • Perfect Forward Secrecy supported

Configuration Steps:

  1. Settings → VPN → Site-to-Site VPN
  2. Create New VPN tunnel (WireGuard or IPsec)
  3. Under Local Networks, select VLANs to include:
    • Option 1: Select “All” networks
    • Option 2: Choose specific VLANs (e.g., VLAN 10, 20 only)
  4. Configure Remote Networks (cloud provider subnets)
  5. Set encryption parameters and pre-shared keys
  6. Create Firewall Rules for VPN traffic:
    • Allow specific VLAN → VPN tunnel
    • Control which VLANs can reach remote networks

Example Site-to-Site Config:

Home Lab → GCP WireGuard VPN

Local Networks:
  - VLAN 10 (Management): 192.168.10.0/24
  - VLAN 20 (Kubernetes): 192.168.20.0/24

Remote Networks:
  - GCP VPC: 10.128.0.0/20

Firewall Rules:
  - Allow VLAN 10 → GCP VPC (all protocols)
  - Allow VLAN 20 → GCP VPC (HTTPS, kubectl API only)
  - Block all other VLANs from VPN tunnel

Remote Access VPN VLAN Assignment

✅ VLANs CAN be assigned to remote access VPN clients:

L2TP/IPsec Remote Access:

  • VPN clients land on a specific VLAN
  • Default: All clients in same VPN subnet
  • Firewall rules control VLAN access from VPN

OpenVPN Remote Access (via UniFi Network Application addon):

  • Not natively built into UDM Pro
  • Requires UniFi Network Application 6.0+
  • Can route VPN clients to specific VLAN

Teleport VPN (UniFi’s solution):

  • Built-in remote access VPN
  • Clients route through UDM Pro
  • Can access specific VLANs based on firewall rules
  • Layer 3 routing to VLANs

Configuration:

  1. Settings → VPN → Remote Access
  2. Enable L2TP or configure Teleport
  3. Set VPN Network (e.g., 192.168.100.0/24)
  4. Advanced:
    • Enable access to specific VLANs
    • By default, VPN network is treated as separate VLAN
  5. Firewall Rules to allow VPN → VLANs:
    • Source: VPN network (192.168.100.0/24)
    • Destination: VLAN 10, VLAN 20 (or specific resources)
    • Action: Accept

Example Remote Access Config:

Remote VPN Users → Home Lab Access

VPN Network: 192.168.100.0/24
VPN Gateway: 192.168.100.1 (UDM Pro)

Firewall Rules:
  Rule 1: Allow VPN → Management VLAN (admin users)
          Source: 192.168.100.0/24
          Dest: 192.168.10.0/24
          Ports: SSH (22), HTTPS (443)
  
  Rule 2: Allow VPN → Kubernetes VLAN (developers)
          Source: 192.168.100.0/24
          Dest: 192.168.20.0/24
          Ports: kubectl (6443), app ports (8080-8090)
  
  Rule 3: Block VPN → Storage VLAN (security)
          Source: 192.168.100.0/24
          Dest: 192.168.30.0/24
          Action: Drop

VPN VLAN Routing Limitations

Current Limitations:

  • Cannot assign individual VPN clients to different VLANs dynamically
  • No VLAN assignment based on user identity (all clients in same VPN network)
  • RADIUS integration does not support per-user VLAN assignment for VPN
  • For per-user VLAN control, use firewall rules based on source IP

Workarounds:

  • Use firewall rules with VPN client IP ranges for granular access
  • Deploy separate VPN tunnels for different access levels
  • Use RADIUS for authentication + firewall rules for authorization

VLAN Best Practices for Home Lab

Network Segmentation Strategy

Recommended VLAN Layout:

VLAN 1:   Default/Management (UDM Pro access)
VLAN 10:  Infrastructure Management (Ansible, PXE, monitoring)
VLAN 20:  Kubernetes Cluster (control plane + workers)
VLAN 30:  Storage Network (NFS, iSCSI, object storage)
VLAN 40:  DMZ/Public Services (exposed to internet via Cloudflare)
VLAN 50:  IoT Devices (isolated smart home devices)
VLAN 60:  Guest Network (visitor WiFi, untrusted devices)
VLAN 100: VPN Remote Access (remote admin/dev access)

Firewall Policy Design

Default Deny Approach:

  1. Create explicit allow rules for necessary traffic
  2. Set implicit deny for all inter-VLAN traffic
  3. Log dropped packets for troubleshooting

Rule Order (top to bottom):

  1. Management VLAN → All (with source IP restrictions)
  2. Kubernetes → Storage (specific ports)
  3. DMZ → Internet (outbound only)
  4. VPN → Specific VLANs (based on role)
  5. All → Internet (NAT)
  6. Block RFC1918 from DMZ
  7. Drop all (implicit)

Performance Optimization

VLAN Routing Performance:

  • Inter-VLAN routing is hardware-accelerated
  • No performance penalty for multiple VLANs
  • Use VLAN tagging on trunk ports to reduce switch load

Multicast and Broadcast Control:

  • Enable IGMP snooping per VLAN for multicast efficiency
  • Disable multicast DNS (mDNS) between VLANs if not needed
  • Use multicast routing for cross-VLAN multicast (advanced)

Advanced VLAN Features

VLAN-Specific Services

DNS per VLAN:

  • Configure different DNS servers per VLAN via DHCP
  • Example: Management VLAN uses local DNS, DMZ uses public DNS

NTP per VLAN:

  • DHCP Option 42 for NTP server
  • Different time sources per network segment

Domain Name per VLAN:

  • DHCP Option 15 for domain name
  • Useful for split-horizon DNS setups

VLAN Tagging on WiFi

UniFi WiFi Integration:

  • Each WiFi SSID can map to a specific VLAN
  • Multiple SSIDs on same AP → different VLANs
  • Seamless VLAN tagging for wireless clients

Configuration:

  • Create WiFi network in UniFi Controller
  • Assign VLAN ID to SSID
  • Client traffic automatically tagged

VLAN Monitoring and Troubleshooting

Traffic Statistics:

  • Per-VLAN bandwidth usage visible in UniFi Controller
  • Deep Packet Inspection (DPI) provides application-level stats
  • Export data for analysis in external tools

Debugging Tools:

  • Port mirroring for packet capture
  • Flow logs for traffic analysis
  • Firewall logs show inter-VLAN blocks

Common Issues:

  1. VLAN not working: Check port profile assignment and native VLAN config
  2. No inter-VLAN routing: Verify firewall rules aren’t blocking traffic
  3. DHCP not working on VLAN: Ensure DHCP server enabled on that network
  4. VPN can’t reach VLAN: Check VPN local networks include the VLAN
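
For the first issue in particular, it helps to see which 802.1Q tags are actually arriving on a port (for example via a mirror of the trunk port under test). A small scapy sketch, run as root, that prints the VLAN ID of each tagged frame; the eth0 interface name is an assumption.

```python
# Sniff a trunk/mirrored port and report 802.1Q VLAN tags seen on the wire.
from scapy.all import Dot1Q, sniff

def show_tag(pkt):
    # Tagged frames carry a Dot1Q header with the VLAN ID; anything else is untagged.
    if pkt.haslayer(Dot1Q):
        print(f"VLAN {pkt[Dot1Q].vlan}: {pkt.summary()}")
    else:
        print(f"untagged: {pkt.summary()}")

sniff(iface="eth0", prn=show_tag, count=20)
```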

Summary

VLAN Port Assignment: ✅ YES

The UDM Pro fully supports port-based VLAN assignment:

  • Individual ports can be assigned to specific VLANs (access mode)
  • Ports can carry multiple tagged VLANs (trunk mode)
  • Native/untagged VLAN configurable per port
  • Port profiles simplify configuration across multiple devices

VPN VLAN Assignment: ✅ YES

VLANs can be assigned to VPN connections:

  • Site-to-Site VPN: Select which VLANs traverse the tunnel
  • Remote Access VPN: VPN clients route to specific VLANs via firewall rules
  • Routing Control: Full control over which VLANs are accessible via VPN
  • Limitations: No per-user VLAN assignment; use firewall rules for granular access

Key Capabilities

  • Up to 32 VLANs supported
  • Hardware-accelerated inter-VLAN routing
  • Per-VLAN DHCP, DNS, and firewall policies
  • Full integration with UniFi WiFi for SSID-to-VLAN mapping
  • Flexible port profiles for easy configuration
  • VPN integration for both site-to-site and remote access scenarios