1 - Technology Analysis

In-depth analysis of technologies and tools evaluated for home lab infrastructure

Technology Analysis

This section contains detailed research and analysis of various technologies evaluated for potential use in the home lab infrastructure.

Network Boot & Provisioning

  • Matchbox - Network boot service for bare-metal provisioning
    • Comprehensive analysis of PXE/iPXE/GRUB support
    • Configuration model (profiles, groups, templating)
    • Deployment patterns and operational considerations
    • Use case evaluation and comparison with alternatives

Cloud Providers

  • Google Cloud Platform - GCP capabilities for network boot infrastructure
    • Network boot protocol support (TFTP, HTTP, HTTPS)
    • WireGuard VPN deployment and integration
    • Cost analysis and performance considerations
  • Amazon Web Services - AWS capabilities for network boot infrastructure
    • Network boot protocol support (TFTP, HTTP, HTTPS)
    • WireGuard VPN deployment and integration
    • Cost analysis and performance considerations

Operating Systems

  • Server Operating Systems - OS evaluation for Kubernetes homelab infrastructure
    • Ubuntu Server analysis (kubeadm, k3s, MicroK8s)
    • Fedora Server analysis (kubeadm with CRI-O)
    • Talos Linux analysis (purpose-built Kubernetes OS)
    • Harvester HCI analysis (hyperconverged platform)
    • Comparison of setup complexity, maintenance, security, and resource overhead

Hardware

Future Analysis Topics

Planned technology evaluations:

  • Storage Solutions: Ceph, GlusterFS, ZFS over iSCSI
  • Container Orchestration: Kubernetes distributions (k3s, Talos, etc.)
  • Observability: Prometheus, Grafana, Loki, Tempo stack
  • Service Mesh: Istio, Linkerd, Cilium comparison
  • CI/CD: GitLab Runner, Tekton, Argo Workflows
  • Secret Management: Vault, External Secrets Operator
  • Load Balancing: MetalLB, kube-vip, Cilium LB-IPAM

1.1 - Server Operating System Analysis

Evaluation of operating systems for homelab Kubernetes infrastructure

This section provides detailed analysis of operating systems evaluated for the homelab server infrastructure, with a focus on Kubernetes cluster setup and maintenance.

Overview

The selection of a server operating system is critical for homelab infrastructure. The primary evaluation criterion is ease of Kubernetes cluster initialization and ongoing maintenance burden.

Evaluated Options

  • Ubuntu - Traditional general-purpose Linux distribution

    • Kubernetes via kubeadm, k3s, or MicroK8s
    • Strong community support and extensive documentation
    • Familiar package management and system administration
  • Fedora - Cutting-edge Linux distribution

    • Latest kernel and system components
    • Kubernetes via kubeadm or k3s
    • Shorter support lifecycle with more frequent upgrades
  • Talos Linux - Purpose-built Kubernetes OS

    • API-driven, immutable infrastructure
    • Built-in Kubernetes with minimal attack surface
    • Designed specifically for container workloads
  • Harvester - Hyperconverged infrastructure platform

    • Built on Rancher and K3s
    • Combines compute, storage, and networking
    • VM and container workloads on unified platform

Evaluation Criteria

Each option is evaluated based on:

  1. Kubernetes Installation Methods - Available tooling and installation approaches
  2. Cluster Initialization Process - Steps required to bootstrap a cluster
  3. Maintenance Requirements - OS updates, Kubernetes upgrades, security patches
  4. Resource Overhead - Memory, CPU, and storage footprint
  5. Learning Curve - Ease of adoption and operational complexity
  6. Community Support - Documentation quality and ecosystem maturity
  7. Security Posture - Attack surface and security-first design

1.1.1 - Ubuntu Analysis

Analysis of Ubuntu for Kubernetes homelab infrastructure

Overview

Ubuntu Server is a popular general-purpose Linux distribution developed by Canonical. It provides Long Term Support (LTS) releases with 5 years of standard support and optional Extended Security Maintenance (ESM).

Key Facts:

  • Latest LTS: Ubuntu 24.04 LTS (Noble Numbat)
  • Support Period: 5 years standard, 10 years with Ubuntu Pro (free for personal use)
  • Kernel: Linux 6.8+ (LTS), regular HWE updates
  • Package Manager: APT/DPKG, Snap
  • Init System: systemd

Kubernetes Installation Methods

Ubuntu supports multiple Kubernetes installation approaches:

1. kubeadm (Official Kubernetes Tool)

Installation:

# Install container runtime (containerd)
sudo apt-get update
sudo apt-get install -y containerd

# Configure containerd (use the systemd cgroup driver expected by kubelet)
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
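
The kubeadm flow above also assumes the standard node prerequisites from the Kubernetes documentation; a minimal sketch (adjust to your environment):

# Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Load required kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# Enable bridged traffic and IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system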

Cluster Initialization:

# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Configure kubectl for admin
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI (e.g., Calico, Flannel)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

# Join worker nodes
kubeadm token create --print-join-command

Pros:

  • Official Kubernetes tooling, well-documented
  • Full control over cluster configuration
  • Supports latest Kubernetes versions
  • Large community and extensive resources

Cons:

  • More manual steps than turnkey solutions
  • Requires understanding of Kubernetes architecture
  • Manual upgrade process for each component
  • More complex troubleshooting

2. k3s (Lightweight Kubernetes)

Installation:

# Single-command install on control plane
curl -sfL https://get.k3s.io | sh -

# Get node token for workers
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on worker nodes
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
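
k3s can also be customized at install time through its config file; a minimal sketch (file path per the k3s docs, values shown are illustrative):

# /etc/rancher/k3s/config.yaml (read by the k3s service on startup)
write-kubeconfig-mode: "0644"
tls-san:
  - "control-plane.homelab.local"  # extra SAN for the API certificate (example hostname)
disable:
  - traefik                        # skip the bundled ingress controller if bringing your own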

Pros:

  • Extremely simple installation (single command)
  • Lightweight (< 512MB RAM)
  • Built-in container runtime (containerd)
  • Automatic updates via Rancher System Upgrade Controller
  • Great for edge and homelab use cases

Cons:

  • Less customizable than kubeadm
  • Some features removed (e.g., in-tree storage, cloud providers)
  • Slightly different from upstream Kubernetes

3. MicroK8s (Canonical’s Distribution)

Installation:

# Install via snap
sudo snap install microk8s --classic

# Join cluster
sudo microk8s add-node
# Run output command on worker nodes

# Enable addons
microk8s enable dns storage ingress
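
After installation, the cluster can be verified and used through the bundled kubectl; a short usage sketch:

# Wait for the node to become ready and inspect it
microk8s status --wait-ready
microk8s kubectl get nodes

# Optional: use MicroK8s' kubectl as the default kubectl
alias kubectl='microk8s kubectl'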

Pros:

  • Zero-ops, single package install
  • Snap-based automatic updates
  • Addons for common services (DNS, storage, ingress)
  • Canonical support available

Cons:

  • Requires snap (not universally liked)
  • Less ecosystem compatibility than vanilla Kubernetes
  • Ubuntu-specific (less portable)

Cluster Initialization Sequence

kubeadm Approach

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system (apt update && upgrade)
    Admin->>Server: Install containerd
    Server->>Server: Configure containerd (CRI)
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, configure kernel modules
    Admin->>K8s: kubeadm init --pod-network-cidr=10.244.0.0/16
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s->>Server: Start controller-manager
    K8s->>Server: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply -f calico.yaml
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (on workers)
    K8s->>Server: Add worker nodes
    K8s-->>Admin: Cluster ready

k3s Approach

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize etcd (embedded)
    K3s->>Server: Start API server
    K3s->>Server: Start controller-manager
    K3s->>Server: Start scheduler
    K3s->>Server: Deploy built-in CNI (Flannel)
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers to cluster
    K3s-->>Admin: Cluster ready (5-10 minutes total)

Maintenance Requirements

OS Updates

Security Patches:

# Automatic security updates (recommended)
sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# Manual updates
sudo apt-get update
sudo apt-get upgrade
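
Automatic reboots after kernel updates can also be scheduled through unattended-upgrades; a minimal sketch (standard directives, the reboot window is an example):

# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";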

Frequency:

  • Security patches: Weekly to monthly
  • Kernel updates: Monthly (may require reboot)
  • Major version upgrades: Every 2 years (LTS to LTS)

Kubernetes Upgrades

kubeadm Upgrade:

# Upgrade control plane
# (first point the apt repo at the v1.32 package stream, since pkgs.k8s.io is versioned per minor release)
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm=1.32.0-*
sudo kubeadm upgrade apply v1.32.0
sudo apt-get install -y kubelet=1.32.0-* kubectl=1.32.0-*
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl restart kubelet

# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo apt-get install -y kubeadm=1.32.0-* kubelet=1.32.0-* kubectl=1.32.0-*
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>

k3s Upgrade:

# Manual upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -

# Automatic upgrade via system-upgrade-controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
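
The system-upgrade-controller is driven by Plan resources; a sketch of a control-plane plan (fields follow the k3s automated-upgrade docs, adjust selectors and channel to your cluster):

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  channel: https://update.k3s.io/v1-release/channels/stable
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: Exists}
  upgrade:
    image: rancher/k3s-upgrade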

Upgrade Frequency: Every 3-6 months (Kubernetes minor versions)

Resource Overhead

Minimal Installation (Ubuntu Server + k3s):

  • RAM: ~512MB (OS) + 512MB (k3s) = 1GB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: 10GB (OS) + 10GB (container images) = 20GB
  • Network: 1 Gbps recommended

Full Installation (Ubuntu Server + kubeadm):

  • RAM: ~512MB (OS) + 1-2GB (Kubernetes components) = 2GB+ total
  • CPU: 2 cores minimum
  • Disk: 15GB (OS) + 20GB (container images/etcd) = 35GB
  • Network: 1 Gbps recommended

Security Posture

Strengths:

  • Regular security updates via Ubuntu Security Team
  • AppArmor enabled by default
  • SELinux support available
  • Kernel hardening features (ASLR, stack protection)
  • Ubuntu Pro ESM for extended CVE coverage (free for personal use)

Attack Surface:

  • Full general-purpose OS (larger attack surface than minimal OS)
  • Many installed packages by default (can be minimized)
  • Requires manual hardening for production use

Hardening Steps:

# Disable unnecessary services
sudo systemctl disable snapd.service
sudo systemctl disable bluetooth.service

# Configure firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 6443/tcp  # Kubernetes API
sudo ufw allow 10250/tcp # Kubelet
sudo ufw enable

# CIS Kubernetes Benchmark compliance
# Use tools like kube-bench for validation
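
kube-bench can be run as a one-off Kubernetes Job; a sketch (manifest path as published in the kube-bench repository, verify before use):

# Run the CIS benchmark checks and read the results
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench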

Learning Curve

Ease of Adoption: ⭐⭐⭐⭐⭐ (Excellent)

  • Most familiar Linux distribution for many users
  • Extensive documentation and tutorials
  • Large community support (forums, Stack Overflow)
  • Straightforward package management
  • Similar to Debian-based systems

Required Knowledge:

  • Basic Linux system administration (apt, systemd, networking)
  • Kubernetes concepts (pods, services, deployments)
  • Container runtime basics (containerd, Docker)
  • Text editor (vim, nano) for configuration

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐⭐ (Excellent)

  • Documentation: Comprehensive official docs, community guides
  • Community: Massive user base, active forums
  • Commercial Support: Available from Canonical (Ubuntu Pro)
  • Third-Party Tools: Excellent compatibility with all Kubernetes tools
  • Tutorials: Abundant resources for Kubernetes on Ubuntu

Pros and Cons Summary

Pros

  • Good, because most familiar and well-documented Linux distribution
  • Good, because 5-year LTS support (10 years with Ubuntu Pro)
  • Good, because multiple Kubernetes installation options (kubeadm, k3s, MicroK8s)
  • Good, because k3s provides extremely simple setup (single command)
  • Good, because extensive package ecosystem (60,000+ packages)
  • Good, because strong community support and resources
  • Good, because automatic security updates available
  • Good, because low learning curve for most administrators
  • Good, because compatible with all Kubernetes tooling and addons
  • Good, because Ubuntu Pro free for personal use (extended security)

Cons

  • Bad, because general-purpose OS has larger attack surface than minimal OS
  • Bad, because more resource overhead than purpose-built Kubernetes OS (1-2GB RAM)
  • Bad, because requires manual OS updates and reboots
  • Bad, because kubeadm setup is complex with many manual steps
  • Bad, because snap packaging is controversial (relevant to MicroK8s)
  • Bad, because Kubernetes upgrades require manual intervention (unless using k3s auto-upgrade)
  • Bad, because managing OS + Kubernetes lifecycle separately increases complexity
  • Neutral, because many preinstalled packages (can be removed, but require effort)

Recommendations

Best for:

  • Users familiar with Ubuntu/Debian ecosystem
  • Homelabs requiring general-purpose server functionality (not just Kubernetes)
  • Teams wanting multiple Kubernetes installation options
  • Users prioritizing community support and documentation

Best Installation Method:

  • Homelab/Learning: k3s (simplest, auto-updates, lightweight)
  • Production-like: kubeadm (full control, upstream Kubernetes)
  • Ubuntu-specific: MicroK8s (Canonical support, snap-based)

Avoid if:

  • Seeking minimal attack surface (consider Talos Linux)
  • Want infrastructure-as-code for OS layer (consider Talos Linux)
  • Prefer hyperconverged platform (consider Harvester)

1.1.2 - Fedora Analysis

Analysis of Fedora Server for Kubernetes homelab infrastructure

Overview

Fedora Server is a cutting-edge Linux distribution sponsored by Red Hat, serving as the upstream for Red Hat Enterprise Linux (RHEL). It emphasizes innovation with the latest software packages and kernel versions.

Key Facts:

  • Latest Version: Fedora 41 (October 2024)
  • Support Period: ~13 months per release (shorter than Ubuntu LTS)
  • Kernel: Linux 6.11+ (latest stable)
  • Package Manager: DNF/RPM, Flatpak
  • Init System: systemd

Kubernetes Installation Methods

Fedora supports standard Kubernetes installation approaches:

1. kubeadm (Official Kubernetes Tool)

Installation:

# Install container runtime (CRI-O preferred on Fedora)
sudo dnf install -y cri-o
sudo systemctl enable --now crio

# Add Kubernetes repository
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
EOF

# Install kubeadm, kubelet, kubectl
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet

Cluster Initialization:

# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

# Join workers
kubeadm token create --print-join-command

Pros:

  • CRI-O is native to Fedora ecosystem (same as RHEL/OpenShift)
  • Latest Kubernetes versions available quickly
  • Familiar to RHEL/CentOS users
  • Fully upstream Kubernetes

Cons:

  • Manual setup process (same as Ubuntu/kubeadm)
  • Requires Kubernetes knowledge
  • More complex than turnkey solutions

2. k3s (Lightweight Kubernetes)

Installation:

# Same single-command install
curl -sfL https://get.k3s.io | sh -

# Retrieve token
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
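
On Fedora, firewalld is active by default; rather than disabling it, the ports listed in the k3s requirements can be opened (a sketch assuming the default Flannel VXLAN backend and default CIDRs):

sudo firewall-cmd --permanent --add-port=6443/tcp   # Kubernetes API server
sudo firewall-cmd --permanent --add-port=10250/tcp  # kubelet metrics
sudo firewall-cmd --permanent --add-port=8472/udp   # Flannel VXLAN
sudo firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16  # pod CIDR
sudo firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16  # service CIDR
sudo firewall-cmd --reload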

Pros:

  • Simple installation (identical to Ubuntu)
  • Lightweight and fast
  • Well-tested on Fedora/RHEL family

Cons:

  • Less customizable
  • Not using native CRI-O by default (uses embedded containerd)

3. OKD (OpenShift Kubernetes Distribution)

Installation (Single-Node):

# Download and install OKD
wget https://github.com/okd-project/okd/releases/download/4.15.0-0.okd-2024-01-27-070424/openshift-install-linux-4.15.0-0.okd-2024-01-27-070424.tar.gz
tar -xvf openshift-install-linux-*.tar.gz
sudo mv openshift-install /usr/local/bin/

# Create install config
openshift-install create install-config --dir=cluster

# Install cluster
openshift-install create cluster --dir=cluster

Pros:

  • Enterprise features (operators, web console, image registry)
  • Built-in CI/CD and developer tools
  • Based on Fedora CoreOS (immutable, auto-updating)

Cons:

  • Very heavy resource requirements (16GB+ RAM)
  • Complex installation and management
  • Overkill for simple homelab use

Cluster Initialization Sequence

kubeadm with CRI-O

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Install CRI-O
    Server->>Server: Configure CRI-O runtime
    Server->>Server: Enable crio.service
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, load kernel modules
    Server->>Server: Configure SELinux (permissive for Kubernetes)
    Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s->>Server: Start controller-manager
    K8s->>Server: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply CNI
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (workers)
    K8s->>Server: Add worker nodes
    K8s-->>Admin: Cluster ready

k3s Approach

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Disable firewalld (or configure)
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize embedded etcd
    K3s->>Server: Start API server
    K3s->>Server: Deploy built-in CNI
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers
    K3s-->>Admin: Cluster ready (5-10 minutes)

Maintenance Requirements

OS Updates

Security and System Updates:

# Automatic updates (dnf-automatic)
sudo dnf install -y dnf-automatic
sudo systemctl enable --now dnf-automatic.timer

# Manual updates
sudo dnf update -y
sudo reboot  # if kernel updated
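
dnf-automatic only applies updates when configured to do so; a sketch of the relevant settings:

# /etc/dnf/automatic.conf (excerpt)
[commands]
upgrade_type = security  # or "default" for all updates
apply_updates = yes      # download and apply, not just notify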

Frequency:

  • Security patches: Weekly to monthly
  • Kernel updates: Monthly (frequent updates)
  • Major version upgrades: Every ~13 months (Fedora releases)

Version Upgrade:

# Upgrade to next Fedora release
sudo dnf upgrade --refresh
sudo dnf install dnf-plugin-system-upgrade
sudo dnf system-upgrade download --releasever=42
sudo dnf system-upgrade reboot

Kubernetes Upgrades

kubeadm Upgrade:

# Upgrade control plane
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
sudo systemctl restart kubelet

# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo dnf update -y kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>

k3s Upgrade: Same as Ubuntu (curl script or system-upgrade-controller)

Upgrade Frequency: Kubernetes every 3-6 months, Fedora OS every ~13 months

Resource Overhead

Minimal Installation (Fedora Server + k3s):

  • RAM: ~600MB (OS) + 512MB (k3s) = 1.2GB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: 12GB (OS) + 10GB (containers) = 22GB
  • Network: 1 Gbps recommended

Full Installation (Fedora Server + kubeadm + CRI-O):

  • RAM: ~700MB (OS) + 1.5GB (Kubernetes) = 2.2GB total
  • CPU: 2 cores minimum
  • Disk: 15GB (OS) + 20GB (containers) = 35GB
  • Network: 1 Gbps recommended

Note: Slightly higher overhead than Ubuntu due to SELinux and newer components.

Security Posture

Strengths:

  • SELinux enabled by default (stronger than AppArmor)
  • Latest security patches and kernel (bleeding edge)
  • CRI-O container runtime (security-focused, used by OpenShift)
  • Shorter support window = less legacy CVEs
  • Active security team and rapid response

Attack Surface:

  • General-purpose OS (larger surface than minimal OS)
  • More installed packages than minimal server
  • SELinux can be complex to configure for Kubernetes

Hardening Steps:

# Configure firewall (firewalld default on Fedora)
sudo firewall-cmd --permanent --add-port=6443/tcp  # API server
sudo firewall-cmd --permanent --add-port=10250/tcp # Kubelet
sudo firewall-cmd --reload

# SELinux configuration for Kubernetes
sudo setenforce 0  # Permissive (Kubernetes not fully SELinux-ready)
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# Disable unnecessary services
sudo systemctl disable bluetooth.service

Learning Curve

Ease of Adoption: ⭐⭐⭐⭐ (Good)

  • Familiar for RHEL/CentOS/Alma/Rocky users
  • DNF package manager (similar to APT)
  • Excellent documentation
  • SELinux learning curve can be steep

Required Knowledge:

  • RPM-based system administration (dnf, systemd)
  • SELinux basics (or willingness to use permissive mode)
  • Kubernetes concepts
  • Firewalld configuration

Differences from Ubuntu:

  • DNF vs APT package manager
  • SELinux vs AppArmor
  • Firewalld vs UFW
  • Faster release cycle (more frequent upgrades)

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐ (Good)

  • Documentation: Excellent official docs, Red Hat resources
  • Community: Large user base, active forums
  • Commercial Support: RHEL support available (paid)
  • Third-Party Tools: Good compatibility with Kubernetes tools
  • Tutorials: Abundant resources, especially for RHEL ecosystem

Pros and Cons Summary

Pros

  • Good, because latest kernel and software packages (bleeding edge)
  • Good, because SELinux enabled by default (stronger MAC than AppArmor)
  • Good, because native CRI-O support (same as RHEL/OpenShift)
  • Good, because upstream for RHEL (enterprise compatibility)
  • Good, because multiple Kubernetes installation options
  • Good, because k3s simplifies setup dramatically
  • Good, because strong security focus and rapid CVE response
  • Good, because familiar to RHEL/CentOS ecosystem
  • Good, because automatic updates available (dnf-automatic)
  • Neutral, because shorter support cycle (13 months) ensures latest features

Cons

  • Bad, because short support cycle requires frequent OS upgrades (every ~13 months)
  • Bad, because bleeding-edge packages can introduce instability
  • Bad, because SELinux configuration for Kubernetes is complex (often set to permissive)
  • Bad, because smaller community than Ubuntu (though still large)
  • Bad, because general-purpose OS has larger attack surface than minimal OS
  • Bad, because more resource overhead than purpose-built Kubernetes OS
  • Bad, because OS upgrade every 13 months adds maintenance burden
  • Bad, because less beginner-friendly than Ubuntu
  • Bad, because managing OS + Kubernetes lifecycle separately
  • Neutral, because rapid release cycle can be pro or con depending on preference

Recommendations

Best for:

  • Users familiar with RHEL/CentOS/Rocky/Alma ecosystem
  • Teams wanting latest kernel and software features
  • Environments requiring SELinux (compliance, enterprise standards)
  • Learning OpenShift/OKD ecosystem (Fedora CoreOS foundation)
  • Users comfortable with frequent OS upgrades

Best Installation Method:

  • Homelab/Learning: k3s (simplest, lightweight)
  • Enterprise-like: kubeadm + CRI-O (OpenShift compatibility)
  • Advanced: OKD (if resources available, 16GB+ RAM)

Avoid if:

  • Prefer long-term stability (choose Ubuntu LTS)
  • Want minimal maintenance (frequent Fedora upgrades required)
  • Seeking minimal attack surface (consider Talos Linux)
  • Uncomfortable with SELinux complexity
  • Want infrastructure-as-code for OS (consider Talos Linux)

Comparison with Ubuntu

Aspect              | Fedora           | Ubuntu LTS
--------------------|------------------|----------------------
Support Period      | 13 months        | 5 years (10 with Pro)
Kernel              | Latest (6.11+)   | LTS (6.8+)
Security            | SELinux          | AppArmor
Package Manager     | DNF/RPM          | APT/DEB
Release Cycle       | 6 months         | 2 years (LTS)
Upgrade Frequency   | Every 13 months  | Every 2-5 years
Community Size      | Large            | Very Large
Enterprise Upstream | RHEL             | N/A
Stability           | Bleeding edge    | Stable/Conservative
Learning Curve      | Moderate         | Easy

Verdict: Fedora is excellent for those wanting latest features and comfortable with frequent upgrades. Ubuntu LTS is better for long-term stability and minimal maintenance.

1.1.3 - Talos Linux Analysis

Analysis of Talos Linux for Kubernetes homelab infrastructure

Overview

Talos Linux is a modern operating system designed specifically for running Kubernetes. It is API-driven, immutable, and minimal, with no SSH access, shell, or package manager. All configuration is done via a declarative API.

Key Facts:

  • Latest Version: Talos 1.9 (supports Kubernetes 1.31)
  • Support: Community-driven, commercial support available from Sidero Labs
  • Kernel: Linux 6.6+ LTS
  • Architecture: Immutable, API-driven, no shell access
  • Management: talosctl CLI + Kubernetes API

Kubernetes Installation Methods

Talos Linux has built-in Kubernetes - there is only one installation method.

Built-in Kubernetes (Only Option)

Installation Process:

  1. Boot Talos ISO/PXE (maintenance mode)
  2. Apply machine configuration via talosctl
  3. Bootstrap Kubernetes via talosctl bootstrap

Machine Configuration (YAML):

# controlplane.yaml
version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.1.10/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

Cluster Initialization:

# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443

# Apply config to control plane node (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml

# Wait for install to complete, then bootstrap
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10

# Retrieve kubeconfig
talosctl kubeconfig --nodes 192.168.1.10 --endpoints 192.168.1.10

# Apply config to worker nodes
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
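
Once the configs are applied, cluster health can be verified from the workstation; a short sketch:

# Wait for the cluster to converge and verify nodes
talosctl health --nodes 192.168.1.10 --endpoints 192.168.1.10
kubectl get nodes -o wide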

Pros:

  • Kubernetes built-in, no separate installation
  • Declarative configuration (GitOps-friendly)
  • Extremely minimal attack surface (no shell, no SSH)
  • Immutable infrastructure (config changes require reboot)
  • Automatic updates via Talos controller
  • Designed from ground up for Kubernetes

Cons:

  • Steep learning curve (completely different paradigm)
  • No SSH/shell access (all via API)
  • Troubleshooting requires different mindset
  • Limited to Kubernetes workloads only (not general-purpose)
  • Smaller community than traditional distros

Cluster Initialization Sequence

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Talos as Talos Linux
    participant K8s as Kubernetes Components
    
    Admin->>Server: Boot Talos ISO (PXE or USB)
    Server->>Talos: Start in maintenance mode
    Talos-->>Admin: API endpoint ready (no shell)
    Admin->>Admin: Generate configs (talosctl gen config)
    Admin->>Talos: talosctl apply-config (controlplane.yaml)
    Talos->>Server: Partition disk
    Talos->>Server: Install Talos to /dev/sda
    Talos->>Server: Write machine config
    Server->>Server: Reboot from disk
    Talos->>Talos: Load machine config
    Talos->>K8s: Start kubelet
    Talos->>K8s: Start etcd
    Talos->>K8s: Start API server
    Admin->>Talos: talosctl bootstrap
    Talos->>K8s: Initialize cluster
    K8s->>Talos: Start controller-manager
    K8s->>Talos: Start scheduler
    K8s-->>Admin: Control plane ready
    Admin->>K8s: Apply CNI (via talosctl or kubectl)
    K8s->>Talos: Deploy CNI pods
    Admin->>Talos: Apply worker configs
    Talos->>K8s: Join workers to cluster
    K8s-->>Admin: Cluster ready (10-15 minutes)

Maintenance Requirements

OS Updates

Declarative Upgrades:

# Upgrade Talos version (rolling upgrade)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0

# Kubernetes version upgrade (also declarative)
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0

Automatic Updates (via Talos System Extensions):

# machine config with auto-update extension
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/system-upgrade-controller

Frequency:

  • Talos releases: Every 2-3 months
  • Kubernetes upgrades: Follow upstream cadence (quarterly)
  • Security patches: Built into Talos releases
  • No traditional OS patching (immutable system)

Configuration Changes

All changes via machine config:

# Edit machine config YAML
vim controlplane.yaml

# Apply updated config (triggers reboot if needed)
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml

No manual package installs - everything declarative.
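
Smaller changes can also be applied as patches instead of re-sending the full file; a sketch (the sysctl shown is illustrative):

# patch.yaml - example change
# machine:
#   sysctls:
#     net.core.somaxconn: "65535"

# Interactively edit a node's machine config (opens $EDITOR, applies on save)
talosctl edit machineconfig --nodes 192.168.1.10

# Or apply a prepared patch file
talosctl patch machineconfig --nodes 192.168.1.10 --patch @patch.yaml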

Resource Overhead

Minimal Footprint (Talos Linux + Kubernetes):

  • RAM: ~256MB (OS) + 512MB (Kubernetes) = 768MB total
  • CPU: 1 core minimum, 2 cores recommended
  • Disk: ~500MB (OS) + 10GB (container images/etcd) = 10-15GB total
  • Network: 1 Gbps recommended

Comparison:

  • Ubuntu + k3s: ~1GB RAM
  • Talos: ~768MB RAM (lighter)
  • Ubuntu + kubeadm: ~2GB RAM
  • Talos: ~768MB RAM (much lighter)

Minimal install size: ~500MB (vs 10GB+ for Ubuntu/Fedora)

Security Posture

Strengths: ⭐⭐⭐⭐⭐ (Excellent)

  • No SSH access - attack surface eliminated
  • No shell - cannot install malware
  • No package manager - no additional software installation
  • Immutable filesystem - rootfs read-only
  • Minimal components: Only Kubernetes and essential services
  • API-only access - mTLS-authenticated talosctl
  • KSPP compliance: Kernel Self-Protection Project standards
  • Signed images: Cryptographically signed Talos images
  • Secure Boot support: UEFI Secure Boot compatible

Attack Surface:

  • Smallest possible: Only Kubernetes API, kubelet, and Talos API
  • ~30 running processes (vs 100+ on Ubuntu/Fedora)
  • ~200MB filesystem (vs 5-10GB on Ubuntu/Fedora)

No hardening needed - secure by default.

Security Features:

# Built-in security (example config)
machine:
  sysctls:
    kernel.kptr_restrict: "2"
    kernel.yama.ptrace_scope: "1"
  kernel:
    modules:
      - name: br_netfilter
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader

Learning Curve

Ease of Adoption: ⭐⭐ (Challenging)

  • Paradigm shift: No shell/SSH, API-only management
  • Requires understanding of declarative infrastructure
  • Talosctl CLI has learning curve
  • Excellent documentation helps
  • Different troubleshooting approach (logs via API)

Required Knowledge:

  • Kubernetes fundamentals (critical)
  • YAML configuration syntax
  • Networking basics (especially CNI)
  • GitOps concepts helpful
  • Comfort with “infrastructure as code”

Debugging without shell:

# View logs via API
talosctl logs --nodes 192.168.1.10 kubelet

# Interactive dashboard (system metrics, service health, logs)
talosctl dashboard --nodes 192.168.1.10

# Service status
talosctl service --nodes 192.168.1.10

Community Support

Ecosystem Maturity: ⭐⭐⭐ (Growing)

  • Documentation: Excellent official docs
  • Community: Smaller but very active (Slack, GitHub Discussions)
  • Commercial Support: Available from Sidero Labs
  • Third-Party Tools: Growing ecosystem (Cluster API, GitOps tools)
  • Tutorials: Increasing number of community guides

Community Size: Smaller than Ubuntu/Fedora, but dedicated and helpful.

Pros and Cons Summary

Pros

  • Good, because Kubernetes is built-in (no separate installation)
  • Good, because minimal attack surface (no SSH, shell, or package manager)
  • Good, because immutable infrastructure (config drift impossible)
  • Good, because API-driven management (GitOps-friendly)
  • Good, because extremely low resource overhead (~768MB RAM)
  • Good, because automatic security patches via Talos upgrades
  • Good, because declarative configuration (version-controlled)
  • Good, because secure by default (no hardening required)
  • Good, because smallest disk footprint (~500MB OS)
  • Good, because designed specifically for Kubernetes (opinionated and optimized)
  • Good, because UEFI Secure Boot support
  • Good, because upgrades are simple and declarative (talosctl upgrade)

Cons

  • Bad, because steep learning curve (no shell/SSH paradigm shift)
  • Bad, because limited to Kubernetes workloads only (not general-purpose)
  • Bad, because troubleshooting without shell requires different approach
  • Bad, because smaller community than Ubuntu/Fedora
  • Bad, because relatively new (less mature than traditional distros)
  • Bad, because no escape hatch for manual intervention
  • Bad, because requires comfort with declarative infrastructure
  • Bad, because debugging is harder for beginners
  • Neutral, because opinionated design (pro for K8s-only, con for general use)

Recommendations

Best for:

  • Kubernetes-dedicated infrastructure (no general-purpose workloads)
  • Security-focused environments (minimal attack surface)
  • GitOps workflows (declarative configuration)
  • Immutable infrastructure advocates
  • Teams comfortable with API-driven management
  • Production Kubernetes clusters (once team is trained)

Best Installation Method:

  • Only option: Built-in Kubernetes via talosctl

Avoid if:

  • Need general-purpose server functionality (SSH, cron jobs, etc.)
  • Team unfamiliar with Kubernetes (too steep a learning curve)
  • Require shell access for troubleshooting comfort
  • Want traditional package management (apt, dnf)
  • Prefer familiar Linux administration tools

Comparison with Ubuntu and Fedora

Aspect            | Talos Linux             | Ubuntu + k3s       | Fedora + kubeadm
------------------|-------------------------|--------------------|-------------------
K8s Installation  | Built-in                | Single command     | Manual (kubeadm)
Attack Surface    | Minimal (~30 processes) | Medium (~100)      | Medium (~100)
Resource Overhead | 768MB RAM               | 1GB RAM            | 2.2GB RAM
Disk Footprint    | 500MB                   | 10GB               | 15GB
Security Model    | Immutable, no shell     | AppArmor, shell    | SELinux, shell
Management        | API-only (talosctl)     | SSH + kubectl      | SSH + kubectl
Learning Curve    | Steep                   | Easy               | Moderate
Community Size    | Small (growing)         | Very Large         | Large
Support Period    | Rolling releases        | 5-10 years         | 13 months
Use Case          | Kubernetes only         | General-purpose    | General-purpose
Upgrades          | Declarative, simple     | Manual OS + K8s    | Manual OS + K8s
Configuration     | Declarative YAML        | Imperative + YAML  | Imperative + YAML
Troubleshooting   | API logs/metrics        | SSH + logs         | SSH + logs
GitOps-Friendly   | Excellent               | Good               | Good
Best for          | K8s-dedicated infra     | Homelabs, learning | RHEL ecosystem

Verdict: Talos is the most secure and efficient option for Kubernetes-only infrastructure, but requires team buy-in to API-driven, immutable paradigm. Ubuntu/Fedora better for general-purpose servers or teams wanting shell access.

Advanced Features

Talos System Extensions

Extend Talos functionality with extensions:

machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20240312
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4

Cluster API Integration

Talos works natively with Cluster API:

# Install Cluster API with the Talos bootstrap and control-plane providers
# (choose an infrastructure provider for your environment, e.g. Sidero Metal for bare metal)
clusterctl init --bootstrap talos --control-plane talos --infrastructure sidero

# Create cluster from template
clusterctl generate cluster homelab --infrastructure sidero > cluster.yaml
kubectl apply -f cluster.yaml

Image Factory

Custom Talos images with extensions are built via the Image Factory: a schematic describing the customizations is submitted, and images are then downloaded by the returned schematic ID:

# Submit a schematic (YAML listing extensions under customization.systemExtensions) and note the returned ID
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

# Download the matching image, e.g.
# https://factory.talos.dev/image/<schematic-id>/v1.9.0/metal-amd64.iso

Disaster Recovery

Talos supports etcd backup/restore:

# Backup etcd to a local snapshot file
talosctl etcd snapshot ./etcd-snapshot.db --nodes 192.168.1.10

# Restore from snapshot
talosctl bootstrap --recover-from ./etcd-snapshot.db

Production Readiness

Production Use: ✅ Yes (many companies run Talos in production)

High Availability:

  • 3+ control plane nodes recommended
  • External etcd supported
  • Load balancer for API server

Monitoring:

  • Prometheus metrics built-in
  • Talos dashboard for health
  • Standard Kubernetes observability tools

Example Production Clusters:

  • Sidero Metal (bare metal provisioning)
  • Various cloud providers (AWS, GCP, Azure)
  • Edge deployments (minimal footprint)

1.1.4 - Harvester Analysis

Analysis of Harvester HCI for Kubernetes homelab infrastructure

Overview

Harvester is a Hyperconverged Infrastructure (HCI) platform built on Kubernetes, designed to provide VM and container management on a unified platform. It combines compute, storage, and networking with built-in K3s for orchestration.

Key Facts:

  • Latest Version: Harvester 1.4 (based on K3s 1.30+)
  • Foundation: Built on RancherOS 2.0, K3s, and KubeVirt
  • Support: Supported by SUSE (acquired Rancher)
  • Architecture: HCI platform with VM + container workloads
  • Management: Web UI + kubectl + Rancher integration

Kubernetes Installation Methods

Harvester includes K3s as its foundation - Kubernetes is built-in.

Built-in K3s (Only Option)

Installation Process:

  1. Boot Harvester ISO (interactive installer or PXE)
  2. Complete installation wizard (web UI or console)
  3. Create cluster (automatic K3s deployment)
  4. Access via web UI or kubectl

Interactive Installation:

# Boot from Harvester ISO
1. Choose "Create a new Harvester cluster"
2. Configure:
   - Cluster token
   - Node role (management/worker/witness)
   - Network interface (management network)
   - VIP (Virtual IP for cluster access)
   - Storage disk (Longhorn persistent storage)
3. Install completes (15-20 minutes)
4. Access web UI at https://<VIP>

Configuration (cloud-init for automated install):

# config.yaml
token: my-cluster-token
os:
  hostname: harvester-node-1
  modules:
    - kvm
  kernel_parameters:
    - intel_iommu=on
install:
  mode: create
  device: /dev/sda
  iso_url: https://releases.rancher.com/harvester/v1.4.0/harvester-v1.4.0-amd64.iso
  vip: 192.168.1.100
  vip_mode: static
  networks:
    harvester-mgmt:
      interfaces:
        - name: eth0
      default_route: true
      ip: 192.168.1.10
      subnet_mask: 255.255.255.0
      gateway: 192.168.1.1

Pros:

  • Complete HCI solution (VMs + containers)
  • Web UI for management (no CLI required)
  • Built-in storage (Longhorn CSI)
  • Built-in networking (multus, SR-IOV)
  • VM live migration
  • Rancher integration for multi-cluster management
  • K3s built-in (no separate Kubernetes install)

Cons:

  • Heavy resource requirements (8GB+ RAM per node)
  • Complex architecture (steep learning curve)
  • Larger attack surface than minimal OS
  • Overkill for container-only workloads
  • Requires 3+ nodes for production HA

Cluster Initialization Sequence

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Harvester as Harvester HCI
    participant K3s as K3s / KubeVirt
    participant Storage as Longhorn Storage
    
    Admin->>Server: Boot Harvester ISO
    Server->>Harvester: Start installation wizard
    Harvester-->>Admin: Interactive console/web UI
    Admin->>Harvester: Configure cluster (token, VIP, storage)
    Harvester->>Server: Partition disks (OS + Longhorn storage)
    Harvester->>Server: Install RancherOS 2.0 base
    Harvester->>Server: Install K3s components
    Server->>Server: Reboot
    Harvester->>K3s: Start K3s server
    K3s->>Server: Initialize control plane
    K3s->>Server: Deploy Harvester operators
    K3s->>Storage: Deploy Longhorn for persistent storage
    K3s->>Server: Deploy KubeVirt for VM management
    K3s->>Server: Deploy multus CNI (multi-network)
    Harvester-->>Admin: Web UI ready at https://<VIP>
    Admin->>Harvester: Add additional nodes (join cluster)
    Harvester->>K3s: Join nodes to cluster
    K3s->>Storage: Replicate storage across nodes
    Harvester-->>Admin: Cluster ready (20-30 minutes)
    Admin->>Harvester: Create VMs or deploy containers

Maintenance Requirements

OS Updates

Harvester Upgrades (includes OS + K3s):

# Via Web UI:
# Settings → Upgrade → Select version → Start upgrade

# Via kubectl (after downloading upgrade image):
kubectl apply -f https://releases.rancher.com/harvester/v1.4.0/version.yaml

# Monitor upgrade progress
kubectl get upgrades -n harvester-system

Frequency:

  • Harvester releases: Every 2-3 months (minor versions)
  • Security patches: Included in Harvester releases
  • K3s upgrades: Bundled with Harvester upgrades
  • No separate OS patching (managed by Harvester)

Kubernetes Upgrades

K3s is upgraded with Harvester - no separate upgrade process.

Version Compatibility:

  • Harvester 1.4.x → K3s 1.30+
  • Harvester 1.3.x → K3s 1.28+
  • Harvester 1.2.x → K3s 1.26+

Upgrade Process:

  1. Web UI or kubectl to trigger upgrade
  2. Rolling upgrade of nodes (one at a time)
  3. VM live migration during node upgrades
  4. Automatic rollback on failure

Resource Overhead

Single Node (Harvester HCI):

  • RAM: 8GB minimum (16GB recommended for VMs)
  • CPU: 4 cores minimum (8 cores recommended)
  • Disk: 250GB minimum (SSD recommended)
    • 100GB for OS/Harvester components
    • 150GB+ for Longhorn storage (VM disks)
  • Network: 1 Gbps minimum (10 Gbps for production)

Three-Node Cluster (Production HA):

  • RAM: 32GB per node (64GB for VM-heavy workloads)
  • CPU: 8 cores per node minimum
  • Disk: 500GB+ per node (NVMe SSD recommended)
  • Network: 10 Gbps recommended (separate storage network ideal)

Comparison:

  • Ubuntu + k3s: 1GB RAM
  • Talos: 768MB RAM
  • Harvester: 8GB+ RAM (much heavier)

Note: Harvester is designed for multi-node HCI, not single-node homelabs.

Security Posture

Strengths:

  • SELinux-based (RancherOS 2.0 foundation)
  • Immutable OS layer (similar to Talos)
  • RBAC built-in (Kubernetes + Rancher)
  • Network segmentation (multus CNI)
  • VM isolation (KubeVirt)
  • Signed images and secure boot support

Attack Surface:

  • Larger than Talos/k3s: Includes web UI, VM management, storage layer
  • KubeVirt adds additional components
  • Web UI is additional attack vector
  • More processes than minimal OS (~50+ services)

Security Features:

# VM network isolation example
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: production-vlan
spec:
  vlanID: 100
  uplink:
    linkAttributes:
      mtu: 1500

Hardening:

  • Firewall rules (web UI or kubectl)
  • RBAC policies (restrict VM/namespace access)
  • Network policies (isolate workloads)
  • Rancher authentication integration (LDAP, SAML)

Learning Curve

Ease of Adoption: ⭐⭐⭐ (Moderate)

  • Web UI simplifies management (no CLI required for basic tasks)
  • Requires understanding of VMs + containers
  • Kubernetes knowledge helpful but not required initially
  • Longhorn storage concepts (replicas, snapshots)
  • KubeVirt for VM management (learning curve)

Required Knowledge:

  • Basic Kubernetes concepts (pods, services)
  • VM management (KubeVirt/libvirt)
  • Storage concepts (Longhorn, CSI)
  • Networking (VLANs, SR-IOV optional)
  • Web UI navigation

Debugging:

# Access via kubectl (kubeconfig downloaded from the web UI)
kubectl get nodes
kubectl get pods -n harvester-system

# View Harvester logs
kubectl logs -n harvester-system <pod-name>

# VM console access (via web UI or virtctl)
virtctl console <vm-name>

# Storage debugging (Longhorn volumes)
kubectl get volumes.longhorn.io -A

Community Support

Ecosystem Maturity: ⭐⭐⭐⭐ (Good)

  • Documentation: Excellent official docs
  • Community: Active Slack, GitHub Discussions, forums
  • Commercial Support: Available from SUSE/Rancher
  • Third-Party Tools: Rancher ecosystem integration
  • Tutorials: Growing number of guides and videos

Pros and Cons Summary

Pros

  • Good, because unified platform for VMs + containers (no separate hypervisor)
  • Good, because built-in K3s (Kubernetes included)
  • Good, because web UI simplifies management (no CLI required)
  • Good, because built-in persistent storage (Longhorn CSI)
  • Good, because VM live migration (no downtime during maintenance)
  • Good, because multi-network support (multus CNI, SR-IOV)
  • Good, because Rancher integration (multi-cluster management)
  • Good, because automatic upgrades (OS + K3s + components)
  • Good, because commercial support available (SUSE)
  • Good, because designed for bare-metal HCI (no cloud dependencies)
  • Neutral, because immutable OS layer (similar to Talos benefits)

Cons

  • Bad, because very heavy resource requirements (8GB+ RAM minimum)
  • Bad, because complex architecture (KubeVirt, Longhorn, multus, etc.)
  • Bad, because overkill for container-only workloads (use k3s/Talos instead)
  • Bad, because larger attack surface than minimal OS (web UI, VM layer)
  • Bad, because requires 3+ nodes for production HA (not single-node friendly)
  • Bad, because steep learning curve for full feature set (VMs + storage + networking)
  • Bad, because relatively new platform (less mature than Ubuntu/Fedora)
  • Bad, because limited to Rancher ecosystem (vendor lock-in)
  • Bad, because slower to adopt latest Kubernetes versions (depends on K3s bundle)
  • Neutral, because opinionated HCI design (pro for VM use cases, con for simplicity)

Recommendations

Best for:

  • Hybrid workloads (VMs + containers on same platform)
  • Homelab users wanting to consolidate VM hypervisor + Kubernetes
  • Teams familiar with Rancher ecosystem
  • Multi-node clusters (3+ nodes)
  • Environments requiring VM live migration
  • Users wanting web UI for infrastructure management
  • Replacing VMware/Proxmox + Kubernetes with unified platform

Best Installation Method:

  • Only option: Interactive ISO install or PXE with cloud-init

Avoid if:

  • Running container-only workloads (use k3s or Talos instead)
  • Limited resources (< 8GB RAM per node)
  • Single-node homelab (Harvester designed for multi-node)
  • Want minimal attack surface (use Talos)
  • Prefer traditional Linux shell access (use Ubuntu/Fedora)
  • Need latest Kubernetes versions immediately (Harvester lags upstream)

Comparison with Other Options

Aspect            | Harvester         | Talos Linux         | Ubuntu + k3s     | Fedora + kubeadm
------------------|-------------------|---------------------|------------------|------------------
Primary Use Case  | VMs + Containers  | Containers only     | General-purpose  | General-purpose
Resource Overhead | 8GB+ RAM          | 768MB RAM           | 1GB RAM          | 2.2GB RAM
Kubernetes        | Built-in K3s      | Built-in            | Install k3s      | Install kubeadm
Management        | Web UI + kubectl  | API-only (talosctl) | SSH + kubectl    | SSH + kubectl
Storage           | Built-in Longhorn | External CSI        | External CSI     | External CSI
VM Support        | Native (KubeVirt) | No                  | Via KubeVirt     | Via KubeVirt
Learning Curve    | Moderate          | Steep               | Easy             | Moderate
Attack Surface    | Large             | Minimal             | Medium           | Medium
Multi-Node        | Designed for      | Supports            | Supports         | Supports
Single-Node       | Not ideal         | Excellent           | Excellent        | Good
Best for          | VM + K8s hybrid   | K8s-only            | Homelab/learning | RHEL ecosystem

Verdict: Harvester is excellent for VM + container hybrid workloads with 3+ nodes, but overkill for container-only infrastructure. Use Talos or k3s for Kubernetes-only clusters, Ubuntu/Fedora for general-purpose servers.

Advanced Features

VM Management (KubeVirt)

Create VMs via YAML:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ubuntu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
        resources:
          requests:
            memory: 4Gi
            cpu: 2
      volumes:
        - name: root
          containerDisk:
            image: docker.io/harvester/ubuntu:22.04
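
A brief usage sketch, assuming the manifest above is saved as ubuntu-vm.yaml (virtctl is the KubeVirt CLI referenced earlier):

# Create the VM and open its serial console
kubectl apply -f ubuntu-vm.yaml
virtctl console ubuntu-vm

# Stop / start the VM
virtctl stop ubuntu-vm
virtctl start ubuntu-vm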

Live Migration

Move VMs between nodes:

# Via web UI: VM → Actions → Migrate

# Via virtctl (triggers a live migration; patching spec.running would instead stop/start the VM)
virtctl migrate ubuntu-vm

Backup and Restore

Harvester supports VM backups:

# Configure S3 backup target (web UI)
# Create VM snapshot
# Restore from snapshot or backup

Rancher Integration

Manage multiple clusters:

# Import Harvester cluster into Rancher
# Deploy workloads across clusters
# Central authentication and RBAC

Use Case Examples

Use Case 1: Replace VMware + Kubernetes

Scenario: Currently running VMware ESXi for VMs + separate Kubernetes cluster

Harvester Solution:

  • Consolidate to 3-node Harvester cluster
  • Migrate VMs to KubeVirt
  • Deploy containers on same cluster
  • Save VMware licensing costs

Benefits:

  • Single platform for VMs + containers
  • Unified management (web UI + kubectl)
  • Built-in HA and live migration

Use Case 2: Homelab with Mixed Workloads

Scenario: Need Windows VMs + Linux containers + storage server

Harvester Solution:

  • Windows VMs via KubeVirt (GPU passthrough supported)
  • Linux containers via K3s workloads
  • Longhorn for persistent storage (NFS export supported)

Benefits:

  • No need for separate Proxmox/ESXi
  • Kubernetes-native management
  • Learn enterprise HCI platform

Use Case 3: Edge Computing

Scenario: Deploy compute at remote sites (3-5 nodes each)

Harvester Solution:

  • Harvester cluster at each edge location
  • Rancher for central management
  • VM + container workloads

Benefits:

  • Autonomous operation (no cloud dependency)
  • Rancher multi-cluster management
  • Built-in storage and networking

Production Readiness

Production Use: ✅ Yes (used in enterprise environments)

High Availability:

  • 3+ nodes required for HA
  • Witness node for even-node clusters
  • VM live migration during maintenance
  • Longhorn 3-replica storage

Monitoring:

  • Built-in Prometheus + Grafana
  • Rancher monitoring integration
  • Alerting and notifications

Disaster Recovery:

  • VM backups to S3
  • Cluster backups (etcd + config)
  • Restore to new cluster

Enterprise Features:

  • Rancher authentication (LDAP, SAML, OAuth)
  • Multi-tenancy (namespaces, RBAC)
  • Audit logging
  • Network policies

1.2 - Amazon Web Services Analysis

Technical analysis of Amazon Web Services capabilities for hosting network boot infrastructure

This section contains detailed analysis of Amazon Web Services (AWS) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Amazon Web Services is Amazon’s comprehensive cloud computing platform, offering compute, storage, networking, and managed services. This analysis focuses on AWS’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • EC2: Virtual machine instances for hosting boot server
  • VPN: Managed VPN options and self-managed WireGuard connectivity
  • Elastic Load Balancing: Application and Network Load Balancers
  • NAT Gateway: Network address translation for outbound connectivity
  • VPC: Virtual Private Cloud networking and routing

1.2.1 - AWS Network Boot Protocol Support

Analysis of Amazon Web Services support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Amazon Web Services

This document analyzes AWS’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Elastic Load Balancing

AWS’s Elastic Load Balancing services do not support TFTP protocol natively:

  • Application Load Balancer (ALB): HTTP/HTTPS only (Layer 7)
  • Network Load Balancer (NLB): TCP/UDP support, but not TFTP-aware
  • Classic Load Balancer: Previous generation (legacy), similar limitations

TFTP operates on UDP port 69 with unique protocol semantics (variable block sizes, retransmissions, port negotiation) that standard load balancers cannot parse.

Implementation Options

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from an EC2 instance:

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on an EC2 instance
  • Access: Home lab connects via VPN tunnel to instance’s private IP
  • Security Group: Allow UDP/69 from VPN subnet/security group
  • Pros:
    • Simple implementation
    • No load balancer needed (single boot server sufficient for home lab)
    • TFTP traffic encrypted through VPN tunnel
    • Direct instance-to-client communication
  • Cons:
    • Single point of failure (no HA)
    • Manual failover if instance fails

Option 2: Network Load Balancer (NLB) UDP Passthrough

While NLB doesn’t understand TFTP protocol, it can forward UDP traffic:

  • Approach: Configure NLB to forward UDP/69 to target group
  • Limitations:
    • No TFTP-specific health checks
    • Health checks would use TCP or different protocol
    • Adds cost and complexity without significant benefit for single server
  • Use Case: Only relevant for multi-AZ HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP itself is unencrypted, but VPN tunnel provides encryption
  • Security Groups: Restrict UDP/69 to VPN security group or CIDR only
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
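
As a concrete example of the read-only, VPN-restricted setup above, a dnsmasq-based TFTP server on the EC2 instance could be configured roughly as follows (interface name and paths are assumptions for illustration):

# /etc/dnsmasq.d/tftp.conf - TFTP only, bound to the WireGuard interface
port=0               # disable dnsmasq's DNS function
interface=wg0        # assumed VPN interface name
enable-tftp
tftp-root=/srv/tftp  # serve files from this directory only
tftp-secure          # restrict to files owned by the dnsmasq user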

HTTP Support

Native Support

Status: ✅ Fully supported

AWS provides comprehensive HTTP support through multiple services:

Elastic Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, and gRPC (HTTP/3 is offered by CloudFront, not ALB)
  • Port: Any port (typically 80 for HTTP)
  • Routing: Path-based, host-based, query string, header-based routing
  • Health Checks: HTTP health checks with configurable paths and response codes
  • SSL Offloading: Terminate SSL at ALB and use HTTP to backend
  • Backend: EC2 instances, ECS, EKS, Lambda

EC2 Direct Access

For VPN scenario, HTTP can be served directly from EC2 instance:

  • Approach: Run HTTP server (nginx, Apache, custom service) on EC2
  • Access: Home lab accesses via VPN tunnel to private IP
  • Security Group: Allow TCP/80 from VPN security group
  • Pros: Simpler than ALB for single boot server
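
A minimal nginx configuration for serving boot artifacts over the VPN might look like the following (listen address, server name, and directory are assumptions):

# /etc/nginx/conf.d/boot.conf - serve kernels/initrds to VPN clients
server {
    listen 10.8.0.1:80;          # instance's VPN-facing address (example)
    server_name boot.homelab.internal;

    root /srv/boot;              # kernels, initrds, iPXE scripts
    autoindex on;                # convenient for debugging
    access_log /var/log/nginx/boot_access.log;
}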

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads kernel/initrd via HTTP
  3. Kernel/Initrd: Large boot files served efficiently over HTTP
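
For step 2, the iPXE script that chainloads the kernel and initrd over HTTP is small; a sketch (hostname, file names, and kernel arguments are placeholders):

#!ipxe
# boot.ipxe - fetched by iPXE after the TFTP stage
kernel http://boot.homelab.internal/vmlinuz console=tty0
initrd http://boot.homelab.internal/initrd.img
boot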

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based configs
  • CloudFront: Optional CDN for caching boot files (probably overkill for VPN scenario)
  • TCP Optimization: AWS network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

AWS provides enterprise-grade HTTPS support:

Elastic Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1 and HTTP/2 over TLS
  • SSL/TLS Termination: Terminate SSL at ALB
  • Certificate Management:
    • AWS Certificate Manager (ACM) - free SSL certificates with automatic renewal
    • Import custom certificates
    • Integration with private CA via ACM Private CA
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable via security policy)
  • Cipher Suites: Predefined security policies (modern, compatible, legacy)
  • SNI Support: Multiple certificates on single load balancer

AWS Certificate Manager (ACM)

  • Free Certificates: No cost for public SSL certificates used with AWS services
  • Automatic Renewal: ACM automatically renews certificates before expiration
  • Private CA: ACM Private CA for internal PKI (additional cost)
  • Integration: Native integration with ALB, CloudFront, API Gateway

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot
  • Security: Boot file integrity verified via HTTPS chain of trust

Implementation on AWS

  1. Certificate Provisioning:

    • Use ACM certificate for public domain (free, auto-renewed)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use ACM Private CA for internal PKI ($400/month - expensive for home lab)
  2. ALB Configuration:

    • HTTPS listener on port 443
    • Target group pointing to EC2 boot server
    • Security policy with TLS 1.2+ minimum
  3. Alternative: Direct EC2 HTTPS:

    • Run nginx/Apache with TLS on EC2 instance
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario
    • Use Let’s Encrypt or a self-signed certificate (see the nginx sketch below)
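
A sketch of the direct-EC2 nginx approach (the certificate paths, hostname, and asset directory are assumptions):

# /etc/nginx/conf.d/boot-server.conf
server {
    listen 443 ssl;
    server_name boot.homelab.internal;          # placeholder hostname

    ssl_certificate     /etc/ssl/certs/boot-server.crt;
    ssl_certificate_key /etc/ssl/private/boot-server.key;
    ssl_protocols       TLSv1.2 TLSv1.3;

    location / {
        root /var/lib/boot-assets;              # kernels, initrds, iPXE scripts
        autoindex off;
    }
}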

Mutual TLS (mTLS) Support

AWS ALB supports mutual TLS authentication (introduced in 2023):

  • Client Certificates: Require client certificates for authentication
  • Trust Store: Upload trusted CA certificates to ALB
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth
  • Passthrough Mode: ALB can pass client cert to backend for validation

Routing and Load Balancing Capabilities

VPC Routing

  • Route Tables: Define routes to direct traffic through VPN gateway
  • Route Propagation: BGP route propagation for VPN connections
  • Transit Gateway: Advanced multi-VPC/VPN routing (overkill for home lab)

Security Groups

  • Stateful Firewall: Automatic return traffic handling
  • Ingress/Egress Rules: Fine-grained control by protocol, port, source/destination
  • Security Group Chaining: Reference security groups in rules (elegant for VPN setup)
  • VPN Subnet Restriction: Allow traffic only from VPN-connected subnet

Network ACLs (Optional)

  • Stateless Firewall: Subnet-level access control
  • Defense in Depth: Additional layer beyond security groups
  • Use Case: Probably unnecessary for simple VPN boot server

Cost Implications

Data Transfer Costs

  • VPN Traffic: Data transfer through VPN gateway charged at standard rates
  • Intra-Region: Free for traffic within same region/VPC
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.14/month (US East egress)

Load Balancing Costs

  • Application Load Balancer: $0.0225/hour + $0.008 per LCU-hour ($16-20/month minimum)
  • Network Load Balancer: $0.0225/hour + $0.006 per NLCU-hour ($16-18/month minimum)
  • For VPN Scenario: Load balancer unnecessary (single EC2 instance sufficient)

Compute Costs

  • t3.micro Instance: ~$7.50/month (on-demand pricing, US East)
  • t4g.micro Instance: ~$6.00/month (ARM-based, cheaper, sufficient for boot server)
  • Reserved Instances: Up to 72% savings with 1-year or 3-year commitment
  • Savings Plans: Flexible discounts for consistent compute usage

ACM Certificate Costs

  • Public Certificates: Free when used with AWS services
  • Private CA: $400/month (too expensive for home lab)

Comparison with Requirements

Requirement | AWS Support | Implementation
TFTP | ⚠️ Via EC2, not ELB | Direct EC2 access via VPN
HTTP | ✅ Full support | EC2 or ALB
HTTPS | ✅ Full support | EC2 or ALB with ACM
VPN Integration | ✅ Native VPN | Site-to-Site VPN or self-managed
Load Balancing | ✅ ALB, NLB | Optional for HA
Certificate Mgmt | ✅ ACM (free) | Automatic renewal
Cost Efficiency | ✅ Low-cost instances | t4g.micro sufficient

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. EC2 Instance: Deploy single t4g.micro or t3.micro instance with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with Let’s Encrypt or self-signed certificate
  2. VPN Connection: Connect home lab to AWS via:

    • Site-to-Site VPN (IPsec) - managed service, higher cost (~$36/month)
    • Self-managed WireGuard on EC2 - lower cost, more control
  3. Security Groups: Restrict access to:

    • UDP/69 (TFTP) from VPN security group only
    • TCP/80 (HTTP) from VPN security group only
    • TCP/443 (HTTPS) from VPN security group only
  4. No Load Balancer: For home lab scale, direct EC2 access is sufficient

  5. Health Monitoring: Use CloudWatch for instance and service health

If HA Required (Future Enhancement)

  • Deploy multi-AZ EC2 instances with Network Load Balancer
  • Use S3 as backend for boot files with EC2 serving as cache
  • Implement auto-recovery with Auto Scaling Group (min=max=1)

References

1.2.2 - AWS WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Amazon Web Services for secure site-to-site connectivity

WireGuard VPN Support on Amazon Web Services

This document analyzes options for deploying WireGuard VPN on AWS to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

AWS Native VPN Support

Site-to-Site VPN (IPsec)

Status: ❌ WireGuard not natively supported

AWS’s managed Site-to-Site VPN supports:

  • IPsec VPN: IKEv1, IKEv2 with pre-shared keys
  • Redundancy: Two VPN tunnels per connection for high availability
  • BGP Support: Dynamic routing via BGP
  • Transit Gateway: Scalable multi-VPC VPN hub

Limitation: Site-to-Site VPN does not support WireGuard protocol natively.

Cost: Site-to-Site VPN

  • VPN Connection: ~$0.05/hour = ~$36/month
  • Data Transfer: Standard data transfer out rates (~$0.09/GB for first 10TB)
  • Total Estimate: ~$36-50/month for managed IPsec VPN

Self-Managed WireGuard on EC2

Implementation Approach

Since AWS doesn’t offer managed WireGuard, deploy WireGuard on an EC2 instance:

Status: ✅ Fully supported via EC2

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[AWS EC2 Instance]
    B -->|VPC Network| C[Boot Server EC2]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "AWS VPC"
        B[WireGuard Gateway EC2]
        C[Boot Server EC2]
    end

EC2 Configuration

  1. WireGuard Gateway Instance:

    • Instance Type: t4g.micro or t3.micro ($6-7.50/month)
    • OS: Ubuntu 22.04 LTS or Amazon Linux 2023 (native WireGuard support)
    • Source/Dest Check: Disable to allow IP forwarding
    • Elastic IP: Allocate Elastic IP for stable WireGuard endpoint
    • Security Group: Allow UDP port 51820 from home lab public IP
  2. Boot Server Instance:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No Elastic IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway instance

Installation Steps

# On EC2 Instance (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on AWS EC2:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <AWS_ELASTIC_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.0.0.0/16
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

AWS VPC Configuration

Security Groups

Create security group for WireGuard gateway:

aws ec2 create-security-group \
    --group-name wireguard-gateway-sg \
    --description "WireGuard VPN gateway" \
    --vpc-id vpc-xxxxxx

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxx \
    --protocol udp \
    --port 51820 \
    --cidr <HOME_LAB_PUBLIC_IP>/32

Allow SSH for management (optional, restrict to trusted IP):

aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxx \
    --protocol tcp \
    --port 22 \
    --cidr <TRUSTED_IP>/32

Disable Source/Destination Check

Required for IP forwarding to work:

aws ec2 modify-instance-attribute \
    --instance-id i-xxxxxx \
    --no-source-dest-check

Elastic IP Allocation

Allocate and associate Elastic IP for stable endpoint:

aws ec2 allocate-address --domain vpc

aws ec2 associate-address \
    --instance-id i-xxxxxx \
    --allocation-id eipalloc-xxxxxx

Cost: An Elastic IP was historically free while associated with a running instance (and ~$3.60/month if unattached); since February 2024, AWS bills all public IPv4 addresses at ~$0.005/hour (~$3.65/month), which adds slightly to the estimates below.

Route Table Configuration

Add route to direct home lab subnet traffic through WireGuard gateway:

aws ec2 create-route \
    --route-table-id rtb-xxxxxx \
    --destination-cidr-block 192.168.1.0/24 \
    --instance-id i-xxxxxx

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway instance.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: AWS EC2 WireGuard instance’s public key
    • Endpoint: AWS Elastic IP address
    • Port: 51820
    • Allowed IPs: AWS VPC CIDR (e.g., 10.0.0.0/16)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to AWS subnets
    • Home lab servers can reach AWS boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard on boot
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on EC2 performance varies by instance type:

  • t4g.micro (2 vCPU, ARM): ~100-300 Mbps
  • t3.micro (2 vCPU, x86): ~100-300 Mbps
  • t3.small (2 vCPU): ~500-800 Mbps
  • t3.medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even t4g.micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: t4g.micro adequate and most cost-effective

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms)
  • AWS Network: Low-latency network infrastructure
  • Total Latency: Primarily dependent on home ISP and AWS region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • t4g.micro: Sufficient CPU for home lab VPN throughput
  • ARM Advantage: t4g instances use Graviton processors (better price/performance)

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secrets Manager: Store WireGuard private keys in AWS Secrets Manager
    • Retrieve at instance startup via user data script
    • Avoid storing in AMIs or instance metadata
  • IAM Role: Grant EC2 instance IAM role to read secret

Firewall Hardening

  • Security Group Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server security group allows only VPN security group
  • No Public Access: Boot server has no Elastic IP or public route

Monitoring and Alerts

  • CloudWatch Logs: Stream WireGuard logs to CloudWatch
  • CloudWatch Alarms: Alert on VPN tunnel down (no recent handshakes)
  • VPC Flow Logs: Monitor VPN traffic patterns

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification attacks
  • AWS Shield: Basic DDoS protection included free on all AWS resources
  • Shield Advanced: Optional ($3,000/month - overkill for VPN endpoint)

High Availability Options

Multi-AZ Failover

Deploy WireGuard gateways in multiple Availability Zones:

  • Primary: us-east-1a WireGuard instance
  • Secondary: us-east-1b WireGuard instance
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles instance costs (~$12-15/month for 2 instances)

Auto Scaling Group (Single Instance)

Use Auto Scaling Group with min=max=1 for auto-recovery:

  • Health Checks: EC2 status checks
  • Auto-Recovery: ASG replaces failed instance automatically
  • Elastic IP: Reassociate the Elastic IP to the new instance via Lambda or a script (see the sketch after this list)
  • Limitation: Brief downtime during recovery (~2-5 minutes)
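
A sketch of the Elastic IP reassociation step, for example run from the replacement instance's user data (the allocation ID is a placeholder):

#!/bin/bash
# Claim the shared Elastic IP when the replacement instance launches
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 associate-address \
    --instance-id "$INSTANCE_ID" \
    --allocation-id eipalloc-xxxxxx \
    --allow-reassociation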

Health Monitoring

Monitor WireGuard tunnel health with CloudWatch custom metrics:

#!/bin/bash
# On the EC2 instance, run periodically via cron
HANDSHAKE=$(wg show wg0 latest-handshakes | awk '{print $2}')
NOW=$(date +%s)
AGE=$((NOW - HANDSHAKE))

aws cloudwatch put-metric-data \
    --namespace WireGuard \
    --metric-name TunnelAge \
    --value $AGE \
    --unit Seconds

Alert if handshake age exceeds threshold (e.g., 180 seconds).
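
The corresponding alarm could be created once with the AWS CLI (the SNS topic ARN is a placeholder):

aws cloudwatch put-metric-alarm \
    --alarm-name wireguard-tunnel-stale \
    --namespace WireGuard \
    --metric-name TunnelAge \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 180 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:homelab-alerts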

User Data Script for Auto-Configuration

EC2 user data script to configure WireGuard on launch:

#!/bin/bash
# Install WireGuard and the AWS CLI (used below to retrieve the key from Secrets Manager)
apt update && apt install -y wireguard wireguard-tools awscli

# Retrieve private key from Secrets Manager
aws secretsmanager get-secret-value \
    --secret-id wireguard-server-key \
    --query SecretString \
    --output text > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Configure interface (full config omitted for brevity)
# ...

# Enable and start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Requires IAM instance role with secretsmanager:GetSecretValue permission.
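
A minimal IAM policy sketch for that role (account ID, region, and the secret name suffix are placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:wireguard-server-key-*"
    }
  ]
}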

Cost Analysis

Self-Managed WireGuard on EC2

Component | Cost (US East)
t4g.micro instance (730 hrs/month) | ~$6.00
Elastic IP (attached) | $0.00
Data transfer out (1GB/month) | ~$0.09
Monthly Total | ~$6.09
Annual Total | ~$73

With Reserved Instance (1-year, no upfront):

Component | Cost
t4g.micro RI (1-year) | ~$3.50/month
Elastic IP | $0.00
Data transfer | ~$0.09
Monthly Total | ~$3.59
Annual Total | ~$43

Site-to-Site VPN (IPsec - if WireGuard not used)

Component | Cost
VPN Connection (2 tunnels) | ~$36
Data transfer (1GB/month) | ~$0.09
Monthly Total | ~$36
Annual Total | ~$432

Cost Savings: Self-managed WireGuard saves ~$360/year vs Site-to-Site VPN (or ~$390/year with Reserved Instance).

Comparison with Requirements

Requirement | AWS Support | Implementation
WireGuard Protocol | ✅ Via EC2 | Self-managed on instance
Site-to-Site VPN | ✅ Yes | WireGuard tunnel
UDM Pro Integration | ✅ Native support | WireGuard peer config
Cost Efficiency | ✅ Very low cost | t4g.micro ~$6/month (on-demand)
Performance | ✅ Sufficient | 100+ Mbps on t4g.micro
Security | ✅ Modern crypto | ChaCha20, Curve25519
HA (optional) | ⚠️ Manual setup | Multi-AZ or ASG

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on EC2 t4g.micro instance

    • Cost: ~$6/month on-demand, ~$3.50/month with Reserved Instance
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single AZ Deployment: Unless HA required, single instance adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • AZ: Single AZ sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add AWS instance as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes AWS subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secrets Manager
    • Restrict security group to home lab public IP only
    • Use user data script to retrieve key and configure on boot
    • Enable CloudWatch logging for VPN events
    • Assign IAM instance role with minimal permissions
  5. Monitoring: Set up CloudWatch alarms for:

    • Instance status check failures
    • High CPU usage
    • VPN tunnel age (custom metric)

Cost Optimization

  • Reserved Instance: Commit to 1-year Reserved Instance for ~40% savings
  • Spot Instance: Consider Spot for even lower cost (~70% savings), but adds complexity (handle interruptions)
  • ARM Architecture: Use t4g (Graviton) for 20% better price/performance vs t3

Future Enhancements

  • HA Setup: Deploy secondary WireGuard instance in different AZ
  • Automated Failover: Lambda function to reassociate Elastic IP on failure
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added

References

1.3 - Google Cloud Platform Analysis

Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure

This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • Compute Engine: Virtual machine instances for hosting boot server
  • Cloud VPN / VPC: Network connectivity and VPN capabilities
  • Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
  • Cloud NAT: Network address translation for outbound connectivity
  • VPC Network: Software-defined networking and routing

Documentation Sections

1.3.1 - Cloud Storage FUSE (gcsfuse)

Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure

Overview

Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.

Project: GoogleCloudPlatform/gcsfuse
License: Apache 2.0
Status: Generally Available (GA)
Latest Version: v2.x (as of 2024)

How gcsfuse Works

gcsfuse translates filesystem operations into GCS API calls:

  1. Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
  2. Directory Structure: Interprets / in object names as directory separators
  3. File Operations: Translates read(), write(), open(), etc. into GCS API requests
  4. Metadata: Maintains file attributes (size, modification time) via GCS metadata
  5. Caching: Optional stat, type, list, and file caching to reduce API calls

Example:

  • GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
  • Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img

Relevance to Network Boot Infrastructure

In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.

Potential Use Cases

  1. Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
  2. Configuration Sync: Access boot profiles and machine mappings from GCS as local files
  3. Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
  4. Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
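
As an example of use case 1, the mount could be made persistent on a Compute Engine VM with an /etc/fstab entry using the gcsfuse mount helper (bucket name and options are illustrative):

# /etc/fstab
boot-assets /var/lib/boot-server/assets gcsfuse rw,_netdev,allow_other,implicit_dirs,file_mode=644,dir_mode=755 0 0

After adding the entry, sudo mount -a mounts the bucket without a manual gcsfuse invocation.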

Architecture Pattern

┌─────────────────────────┐
│   Boot Server Process   │
│  (Cloud Run/Compute)    │
└───────────┬─────────────┘
            │ filesystem operations
            │ (read, open, stat)
            ▼
┌─────────────────────────┐
│   gcsfuse mount point   │
│   /var/lib/boot-assets  │
└───────────┬─────────────┘
            │ FUSE layer
            │ (translates to GCS API)
            ▼
┌─────────────────────────┐
│  Cloud Storage Bucket   │
│   gs://boot-assets/     │
└─────────────────────────┘

Performance Characteristics

Latency

  • Much higher latency than local filesystem: Every operation requires GCS API call(s)
  • No file caching by default: without the file cache enabled, every read re-fetches object content from GCS
  • Network round-trip: Minimum ~10-50ms latency per operation (depending on region)

Throughput

Single Large File:

  • Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
  • Write: Comparable to gsutil cp for large files
  • With parallel downloads: Up to 9x faster for single-threaded reads of large files

Small Files:

  • Poor performance for random I/O on small files
  • Bulk operations on many small files create significant bottlenecks
  • ls on directories with thousands of objects can take minutes

Concurrent Access:

  • Performance degrades significantly with parallel readers (in one reported benchmark, a job running 8 parallel instances took ~30 hours via gcsfuse versus ~16 minutes with local data)
  • Not recommended for high-concurrency scenarios (web servers, NAS)

Performance Improvements (Recent Features)

  1. Streaming Writes (default): Upload data directly to GCS as written

    • Up to 40% faster for large sequential writes
    • Reduces local disk usage (no staging file)
  2. Parallel Downloads: Download large files using multiple workers

    • Up to 9x faster model load times
    • Best for single-threaded reads of large files
  3. File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)

    • Up to 2.3x faster training time (AI/ML workloads)
    • Up to 3.4x higher throughput
    • Requires explicit cache directory configuration
  4. Metadata Cache: Cache stat, type, and list operations

    • Stat and type caches enabled by default
    • Configurable TTL (default: 60s, set -1 for unlimited)

Caching Configuration

gcsfuse provides four types of caching:

1. Stat Cache

Caches file attributes (size, modification time, existence).

# Enable with unlimited size and TTL
gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).

2. Type Cache

Caches file vs directory type information.

gcsfuse \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Speeds up directory traversal and ls operations.

3. List Cache

Caches directory listing results.

gcsfuse \
  --max-conns-per-host=100 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Improves performance for applications that repeatedly list directory contents.

4. File Cache

Caches actual file contents locally.

gcsfuse \
  --file-cache-max-size-mb=-1 \
  --cache-dir=/mnt/local-ssd \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  bucket-name /mount/point

Use case: Essential for AI/ML training, repeated reads of large files.

Recommended cache storage:

  • Local SSD: Fastest, but ephemeral (data lost on restart)
  • Persistent Disk: Persistent but slower than Local SSD
  • tmpfs (RAM disk): Fastest but limited by memory

Production Configuration Example

# config.yaml for gcsfuse
metadata-cache:
  ttl-secs: -1  # Never expire (use only if bucket is read-only or single-writer)
  stat-cache-max-size-mb: -1
  type-cache-max-size-mb: -1

file-cache:
  max-size-mb: -1  # Unlimited (limited by disk space)
  cache-file-for-range-read: true
  enable-parallel-downloads: true
  parallel-downloads-per-file: 16
  download-chunk-size-mb: 50

write:
  create-empty-file: false  # Streaming writes (default)

logging:
  severity: info
  format: json

Mount using the config file:

gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets

Limitations and Considerations

Filesystem Semantics

gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:

  • No atomic rename: Rename operations are copy-then-delete (not atomic)
  • No hard links: GCS doesn’t support hard links
  • No file locking: flock() is a no-op
  • Limited permissions: GCS has simpler ACLs than POSIX permissions
  • No sparse files: Writes always materialize full file content

Performance Anti-Patterns

Avoid:

  • Serving web content or acting as NAS (concurrent connections)
  • Random I/O on many small files (image datasets, text corpora)
  • Reading during ML training loops (download first, then train)
  • High-concurrency workloads (multiple parallel readers/writers)

Good for:

  • Sequential reads of large files (models, checkpoints, kernels)
  • Infrequent writes of entire files
  • Read-mostly workloads with caching enabled
  • Single-writer scenarios

Consistency Trade-offs

With caching enabled:

  • Stale reads possible if cache TTL > 0 and external modifications occur
  • Safe only for:
    • Read-only buckets
    • Single-writer, single-mount scenarios
    • Workloads tolerant of eventual consistency

Without caching:

  • Strong consistency (every read fetches latest from GCS)
  • Much slower performance

Resource Requirements

  • Disk space: File cache and streaming writes require local storage
    • File cache: Size of cached files (can be large for ML datasets)
    • Streaming writes: Temporary staging (proportional to concurrent writes)
  • Memory: Metadata caches consume RAM
  • File handles: Can exceed system limits with high concurrency
  • Network bandwidth: All data transfers via GCS API

Installation

On Compute Engine (manual .deb install)

# Install gcsfuse from the release .deb (works on Debian/Ubuntu images; Container-Optimized OS
# has no package manager and would instead run gcsfuse inside a container)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb

On Debian/Ubuntu

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-get update
sudo apt-get install gcsfuse

In Docker/Cloud Run

FROM ubuntu:22.04

# Install gcsfuse
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    lsb-release \
  && export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
  && echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install -y gcsfuse \
  && rm -rf /var/lib/apt/lists/*

# Create mount point
RUN mkdir -p /mnt/boot-assets

# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
    /usr/local/bin/boot-server

Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.

Network Boot Infrastructure Evaluation

Applicability to ADR-0005

Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:

❌ Cloud Run Incompatibility

  • gcsfuse requires FUSE kernel module and privileged containers
  • Cloud Run does not support FUSE or privileged mode
  • ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
  • Impact: Blocks Cloud Run deployment, forcing Compute Engine VM

❌ Boot Latency Requirements

  • Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
  • gcsfuse adds 10-50ms+ latency per operation (network round-trips)
  • Kernel/initrd downloads are latency-sensitive (network boot timeout)
  • Impact: May exceed boot timeout thresholds

❌ No Caching for Read-Write Workloads

  • Boot server needs to write new assets and read existing ones
  • File cache with unlimited TTL requires read-only or single-writer assumption
  • Multiple boot server instances (autoscaling) violate single-writer constraint
  • Impact: Either accept stale reads or disable caching (slow)

❌ Small File Performance

  • Machine mapping configs, boot scripts, profiles are small files (KB range)
  • gcsfuse performs poorly on small, random I/O
  • ls operations on directories with many profiles can be slow
  • Impact: Slow boot configuration lookups

✅ Alternative: Direct Cloud Storage SDK

Using cloud.google.com/go/storage SDK directly offers:

  • Lower latency: Direct API calls without FUSE overhead
  • Cloud Run compatible: No kernel module or privileged mode required
  • Better control: Explicit caching, parallel downloads, streaming
  • Simpler deployment: No mount management, no FUSE dependencies
  • Cost: Similar API call costs to gcsfuse

Recommended approach (from ADR-0005):

// Custom boot server handler using the Cloud Storage SDK
client, err := storage.NewClient(ctx)
if err != nil {
    log.Fatalf("storage.NewClient: %v", err)
}
bucket := client.Bucket("boot-assets")

// Stream kernel to boot client
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
    http.Error(w, "boot asset not found", http.StatusNotFound)
    return
}
defer reader.Close()
io.Copy(w, reader) // Stream object contents directly to the HTTP response

When gcsfuse MIGHT Be Useful

Despite the above limitations, gcsfuse could be considered for:

  1. Matchbox on Compute Engine:

    • Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
    • Compute Engine VM supports FUSE
    • Read-heavy workload (boot assets rarely change)
    • Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
  2. Development/Testing:

    • Quick prototyping without writing Cloud Storage integration
    • Local development with production bucket access
    • Not recommended for production deployment
  3. Low-Throughput Scenarios:

    • Home lab scale (< 10 boots/hour)
    • File cache enabled with Local SSD
    • Single Compute Engine VM (not autoscaled)

Configuration for Matchbox + gcsfuse:

#!/bin/bash
# Mount boot assets for Matchbox

BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"

mkdir -p "$MOUNT_POINT" "$CACHE_DIR"

gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  --file-cache-max-size-mb=-1 \
  --cache-dir="$CACHE_DIR" \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  --implicit-dirs \
  --foreground \
  "$BUCKET" "$MOUNT_POINT"

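To run this mount automatically at boot, the script above could be wrapped in a simple systemd service (a sketch; the script path /usr/local/bin/mount-boot-assets.sh is an assumption):

# /etc/systemd/system/gcsfuse-matchbox.service
[Unit]
Description=Mount Matchbox boot assets via gcsfuse
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/mount-boot-assets.sh
ExecStop=/bin/fusermount -u /var/lib/matchbox/assets
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now gcsfuse-matchbox.service.
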
Monitoring and Troubleshooting

Metrics

gcsfuse exposes Prometheus metrics:

gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point

Key metrics:

  • gcs_read_count: Number of GCS read operations
  • gcs_write_count: Number of GCS write operations
  • gcs_read_bytes: Bytes read from GCS
  • gcs_write_bytes: Bytes written to GCS
  • fs_ops_count: Filesystem operations by type (open, read, write, etc.)
  • fs_ops_error_count: Filesystem operation errors

Logging

# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point

Common Issues

Issue: ls on large directories takes minutes

Solution:

  • Enable list caching with --metadata-cache-ttl-secs=-1
  • Reduce directory depth (flatten object hierarchy)
  • Consider prefix-based filtering instead of full listings

Issue: Stale reads after external bucket modifications

Solution:

  • Reduce --metadata-cache-ttl-secs (default 60s)
  • Disable caching entirely for strong consistency
  • Use versioned object names (immutable assets)

Issue: Transport endpoint is not connected errors

Solution:

  • Unmount cleanly before remounting: fusermount -u /mnt/point
  • Check GCS bucket permissions (IAM roles)
  • Verify network connectivity to storage.googleapis.com

Issue: High memory usage

Solution:

  • Limit metadata cache sizes: --stat-cache-max-size-mb=1024
  • Disable file cache if not needed
  • Monitor with --prometheus metrics

Comparison to Alternatives

gcsfuse vs Direct Cloud Storage SDK

Aspect | gcsfuse | Cloud Storage SDK
Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API)
Cloud Run | ❌ Not supported | ✅ Fully supported
Development Effort | Low (standard filesystem code) | Medium (SDK integration)
Performance | Slower (filesystem abstraction) | Faster (optimized for use case)
Caching | Built-in (stat, type, list, file) | Manual (application-level)
Streaming | Automatic | Explicit (io.Copy)
Dependencies | FUSE kernel module, privileged mode | None (pure Go library)

Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.

gcsfuse vs rsync/gsutil Sync

Periodic sync pattern:

# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets

Aspect | gcsfuse | rsync/gsutil sync
Consistency | Eventual (with caching) | Strong (within sync interval)
Disk Usage | Minimal (file cache optional) | Full copy of assets
Latency | GCS API per request | Local disk (fast)
Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes)
Deployment | Requires FUSE | Simple cron job

Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.

Conclusion

Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:

  1. Cloud Run incompatibility (requires FUSE kernel module)
  2. Added latency (FUSE overhead + network round-trips)
  3. Poor performance for small files and concurrent access
  4. Caching trade-offs (consistency vs performance)

Recommended alternatives:

  • Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
  • Matchbox on Compute Engine: rsync/gsutil sync to local disk
  • Cloud Run Deployment: Direct SDK (no gcsfuse possible)

gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.

References

1.3.2 - GCP Network Boot Protocol Support

Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Google Cloud Platform

This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Cloud Load Balancing

GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.

Implementation Options

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing:

Option 1: Direct Compute Engine VM Access (Recommended)

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
  • Access: Home lab connects via VPN tunnel to the VM’s private IP
  • Routing: VPC firewall rules allow UDP/69 from the VPN subnet (see the firewall example after this list)
  • Pros:
    • Simple implementation
    • No need for load balancing (single boot server sufficient)
    • TFTP traffic encrypted through VPN tunnel
    • Direct VM-to-client communication
  • Cons:
    • Single point of failure (no load balancing/HA)
    • Manual failover required if VM fails
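
A sketch of the corresponding VPC firewall rule (network name, target tag, and VPN subnet CIDR are placeholders):

gcloud compute firewall-rules create allow-boot-from-vpn \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=udp:69,tcp:80,tcp:443 \
    --source-ranges=10.200.0.0/24 \
    --target-tags=boot-server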

Option 2: Network Load Balancer (NLB) Passthrough

While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:

  • Approach: Configure Network Load Balancer for UDP/69 passthrough
  • Limitations:
    • No protocol-aware health checks for TFTP
    • Health checks would use TCP or HTTP on alternate port
    • Adds complexity without significant benefit for single boot server
  • Use Case: Only relevant for multi-region HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
  • Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads

HTTP Support

Native Support

Status: ✅ Fully supported

GCP provides comprehensive HTTP support through multiple services:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
  • Port: Any port (typically 80 for HTTP)
  • Routing: URL-based routing, host-based routing, path-based routing
  • Health Checks: HTTP health checks with configurable paths
  • SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
  • Backend: Compute Engine VMs, instance groups, Cloud Run, GKE

Compute Engine Direct Access

For VPN scenario, HTTP can be served directly from VM:

  • Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
  • Access: Home lab accesses via VPN tunnel to private IP
  • Firewall: VPC firewall rules allow TCP/80 from VPN subnet
  • Pros: Simpler than load balancer for single boot server

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
  3. Kernel/Initrd: Large boot files served efficiently over HTTP

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based boot configs
  • Caching: Cloud CDN can cache boot files for faster delivery
  • TCP Optimization: GCP’s network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

GCP provides enterprise-grade HTTPS support:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
  • SSL/TLS Termination: Terminate SSL at load balancer
  • Certificate Management:
    • Google-managed SSL certificates (automatic renewal)
    • Self-managed certificates (bring your own)
    • Certificate Map for multiple domains
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
  • Cipher Suites: Modern, compatible, or custom cipher suites
  • mTLS Support: Mutual TLS authentication (client certificates)

Certificate Manager

  • Managed Certificates: Automatic provisioning and renewal via Let’s Encrypt integration
  • Private CA: Integration with Google Cloud Certificate Authority Service
  • Certificate Maps: Route different domains to different backends based on SNI
  • Certificate Monitoring: Automatic alerts before expiration

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (defined in UEFI specification 2.5 and later)
  • Security: Boot file integrity verified via HTTPS chain of trust

Implementation on GCP

  1. Certificate Provisioning:

    • Use Google-managed certificate for public domain (if boot server has public DNS)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use private CA for internal PKI
  2. Load Balancer Configuration:

    • HTTPS frontend (port 443)
    • Backend service to Compute Engine VM running boot server
    • SSL policy with TLS 1.2+ minimum
  3. Alternative: Direct VM HTTPS:

    • Run nginx/Apache with TLS on Compute Engine VM
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario

mTLS Support for Enhanced Security

GCP’s Application Load Balancer supports mutual TLS authentication:

  • Client Certificates: Require client certificates for additional authentication
  • Certificate Validation: Validate client certificates against trusted CA
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth

Routing and Load Balancing Capabilities

VPC Routing

  • Custom Routes: Define routes to direct traffic through VPN gateway
  • Route Priority: Configure route priorities for failover scenarios
  • BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)

Firewall Rules

  • Ingress/Egress Rules: Fine-grained control over traffic
  • Source/Destination Filters: IP ranges, tags, service accounts
  • Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
  • VPN Subnet Restriction: Limit access to VPN-connected home lab subnet

Cloud Armor (Optional)

For additional security if boot server has public access:

  • DDoS Protection: Layer 3/4 DDoS mitigation
  • WAF Rules: Application-level filtering
  • IP Allowlisting: Restrict to known public IPs
  • Rate Limiting: Prevent abuse

Cost Implications

Network Egress Costs

  • VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
  • Intra-Region: Free for traffic within same region
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)

Load Balancing Costs

  • Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
  • Network Load Balancer: ~$0.025/hour + data processing charges
  • For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)

Compute Costs

  • e2-micro Instance: ~$6-7/month (suitable for boot server)
  • f1-micro Instance: ~$4-5/month (even smaller, might suffice)
  • Reserved/Committed Use: Discounts for long-term commitment

Comparison with Requirements

Requirement | GCP Support | Implementation
TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN
HTTP | ✅ Full support | VM or ALB
HTTPS | ✅ Full support | VM or ALB with Certificate Manager
VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard
Load Balancing | ✅ ALB, NLB | Optional for HA
Certificate Mgmt | ✅ Managed certs | Certificate Manager
Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. Compute Engine VM: Deploy single e2-micro VM with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with self-signed certificate
  2. VPN Tunnel: Connect home lab to GCP via:

    • Cloud VPN (IPsec) - easier setup, higher cost
    • Self-managed WireGuard on Compute Engine - lower cost, more control
  3. VPC Firewall: Restrict access to:

    • UDP/69 (TFTP) from VPN subnet only
    • TCP/80 (HTTP) from VPN subnet only
    • TCP/443 (HTTPS) from VPN subnet only
  4. No Load Balancer: For home lab scale, direct VM access is sufficient

  5. Health Monitoring: Use Cloud Monitoring for VM and service health

If HA Required (Future Enhancement)

  • Deploy multi-zone VMs with Network Load Balancer
  • Use Cloud Storage as backend for boot files with VM serving as cache
  • Implement failover automation with Cloud Functions

References

1.3.3 - GCP WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity

WireGuard VPN Support on Google Cloud Platform

This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

GCP Native VPN Support

Cloud VPN (IPsec)

Status: ❌ WireGuard not natively supported

GCP’s managed Cloud VPN service supports:

  • IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
  • HA VPN: Highly available VPN with 99.99% SLA
  • Classic VPN: Single-tunnel VPN (deprecated)

Limitation: Cloud VPN does not support WireGuard protocol natively.

Cost: Cloud VPN

  • HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
  • Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
  • Total Estimate: ~$75-100/month for managed VPN

Self-Managed WireGuard on Compute Engine

Implementation Approach

Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:

Status: ✅ Fully supported via Compute Engine

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
    B -->|Private VPC Network| C[Boot Server VM]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "GCP VPC"
        B[WireGuard Gateway VM]
        C[Boot Server VM]
    end

VM Configuration

  1. WireGuard Gateway VM:

    • Instance Type: e2-micro or f1-micro ($4-7/month)
    • OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
    • IP Forwarding: Enable IP forwarding to route traffic to other VMs
    • External IP: Static external IP for stable WireGuard endpoint
    • Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
  2. Boot Server VM:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No external IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway VM

Installation Steps

# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on GCP VM:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

GCP VPC Configuration

Firewall Rules

Create VPC firewall rule to allow WireGuard:

gcloud compute firewall-rules create allow-wireguard \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=udp:51820 \
    --source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
    --target-tags=wireguard-gateway

Tag the WireGuard VM:

gcloud compute instances add-tags wireguard-gateway-vm \
    --tags=wireguard-gateway \
    --zone=us-central1-a

Static External IP

Reserve static IP for stable WireGuard endpoint:

gcloud compute addresses create wireguard-gateway-ip \
    --region=us-central1

gcloud compute instances delete-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --zone=us-central1-a

gcloud compute instances add-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --address=wireguard-gateway-ip \
    --zone=us-central1-a

Cost: A static external IP attached to a running VM costs roughly $3-4/month; a reserved address left unattached is billed at a higher rate.

Route Configuration

For traffic from boot server to reach home lab via WireGuard VM:

gcloud compute routes create route-to-homelab \
    --network=default \
    --priority=100 \
    --destination-range=192.168.1.0/24 \
    --next-hop-instance=wireguard-gateway-vm \
    --next-hop-instance-zone=us-central1-a

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: GCP WireGuard VM’s public key
    • Endpoint: GCP VM’s static external IP
    • Port: 51820
    • Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to GCP subnets
    • Home lab servers can reach GCP boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on Compute Engine performance:

  • e2-micro (2 vCPU, shared core): ~100-300 Mbps
  • e2-small (2 vCPU): ~500-800 Mbps
  • e2-medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even e2-micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: e2-micro adequate for home lab scale

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
  • GCP Network: Low-latency network to most regions
  • Total Latency: Primarily dependent on home ISP and GCP region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • e2-micro: Sufficient CPU for home lab VPN throughput

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secret Manager: Store WireGuard private keys in GCP Secret Manager
    • Retrieve at VM startup via startup script
    • Avoid storing in VM metadata or disk images

Firewall Hardening

  • Source IP Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server firewall allows only VPN subnet
  • No Public Access: Boot server has no external IP

Monitoring and Alerts

  • Cloud Logging: Log WireGuard connection events
  • Cloud Monitoring: Alert on VPN tunnel down
  • Metrics: Monitor handshake failures, data transfer

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification
  • Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)

High Availability Options

Multi-Region Failover

Deploy WireGuard gateways in multiple regions:

  • Primary: us-central1 WireGuard VM
  • Secondary: us-east1 WireGuard VM
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles VM costs (~$8-14/month for 2 VMs)

Health Checks

Monitor WireGuard tunnel health:

# On UDM Pro (via SSH)
wg show wg0 latest-handshakes

# If handshake timestamp old (>3 minutes), tunnel may be down

Automate failover with script on UDM Pro or external monitoring.
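
A minimal failover sketch that could run periodically on the UDM Pro (the peer key and secondary endpoint are placeholders, and it assumes shell access to the UDM Pro):

#!/bin/bash
# Switch to the secondary WireGuard endpoint if no handshake for more than 3 minutes
PEER_KEY="<SERVER_PUBLIC_KEY>"
SECONDARY_ENDPOINT="<SECONDARY_VM_IP>:51820"

LAST=$(wg show wg0 latest-handshakes | awk -v k="$PEER_KEY" '$1 == k {print $2}')
NOW=$(date +%s)

if [ -n "$LAST" ] && [ $((NOW - LAST)) -gt 180 ]; then
    wg set wg0 peer "$PEER_KEY" endpoint "$SECONDARY_ENDPOINT"
fi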

Startup Scripts for Auto-Healing

GCP VM startup script to ensure WireGuard starts on boot:

#!/bin/bash
# /etc/startup-script.sh

# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Attach as metadata:

gcloud compute instances add-metadata wireguard-gateway-vm \
    --metadata-from-file startup-script=/path/to/startup-script.sh \
    --zone=us-central1-a

Cost Analysis

Self-Managed WireGuard on Compute Engine

Component | Cost
e2-micro VM (730 hrs/month) | ~$6.50
Static External IP | ~$3.50
Egress (1GB/month boot traffic) | ~$0.12
Monthly Total | ~$10.12
Annual Total | ~$121

Cloud VPN (IPsec - if WireGuard not used)

Component | Cost
HA VPN Gateway (2 tunnels) | ~$73
Egress (1GB/month) | ~$0.12
Monthly Total | ~$73
Annual Total | ~$876

Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.

Comparison with Requirements

Requirement | GCP Support | Implementation
WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM
Site-to-Site VPN | ✅ Yes | WireGuard tunnel
UDM Pro Integration | ✅ Native support | WireGuard peer config
Cost Efficiency | ✅ Low cost | e2-micro ~$10/month
Performance | ✅ Sufficient | 100+ Mbps on e2-micro
Security | ✅ Modern crypto | ChaCha20, Curve25519
HA (optional) | ⚠️ Manual setup | Multi-region VMs

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM

    • Cost: ~$10/month (vs ~$73/month for Cloud VPN)
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single Region Deployment: Unless HA required, single VM adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • Zone: Single zone sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes GCP subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secret Manager
    • Restrict WireGuard port to home public IP only
    • Use startup script to configure VM on boot
    • Enable Cloud Logging for VPN events
  5. Monitoring: Set up Cloud Monitoring alerts for:

    • VM down
    • High CPU usage (indicates traffic spike or issue)
    • Firewall rule blocks (indicates misconfiguration)

Future Enhancements

  • HA Setup: Deploy secondary WireGuard VM in different region
  • Automated Failover: Script on UDM Pro to switch endpoints
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added

References

1.4 - HP ProLiant DL360 Gen9 Analysis

Technical analysis of HP ProLiant DL360 Gen9 server capabilities with focus on network boot support

This section contains detailed analysis of the HP ProLiant DL360 Gen9 server platform, including hardware specifications, network boot capabilities, and configuration guidance for home lab deployments.

Overview

The HP ProLiant DL360 Gen9 is a 1U rack-mountable server released by HPE as part of their Generation 9 (Gen9) product line, introduced in 2014. It’s a popular choice for home labs due to its balance of performance, density, and relative power efficiency compared to earlier generations.

Key Features

  • Form Factor: 1U rack-mountable
  • Processor Support: Dual Intel Xeon E5-2600 v3/v4 processors (Haswell/Broadwell)
  • Memory: Up to 768GB DDR4 RAM (24 DIMM slots)
  • Storage: Flexible SFF/LFF drive configurations
  • Network: Integrated quad-port 1GbE or 10GbE FlexibleLOM options
  • Management: iLO 4 (Integrated Lights-Out) with remote KVM and virtual media
  • Boot Options: UEFI and Legacy BIOS support with extensive network boot capabilities

Documentation Sections

1.4.1 - Configuration Guide

Setup, optimization, and configuration recommendations for HP ProLiant DL360 Gen9 in home lab environments

Initial Setup

Hardware Assembly

  1. Install Processors:

    • Use thermal paste (HPE thermal grease recommended)
    • Align CPU carefully with socket (LGA 2011-3)
    • Secure heatsink with proper torque (hand-tighten screws in cross pattern)
    • Install both CPUs for dual-socket configuration
  2. Install Memory:

    • Populate channels evenly (see Memory Configuration below)
    • Seat DIMMs firmly until retention clips engage
    • Verify all DIMMs recognized in POST
  3. Install Storage:

    • Insert drives into hot-swap caddies
    • Label drives clearly for identification
    • Configure RAID controller (see Storage Configuration below)
  4. Install Network Cards:

    • FlexibleLOM: Slide into dedicated slot until seated
    • PCIe cards: Ensure low-profile brackets, secure with screw
    • Note MAC addresses for DHCP reservations
  5. Connect Power:

    • Install PSUs (both for redundancy)
    • Connect power cords
    • Verify PSU LEDs indicate proper operation
  6. Initial Power-On:

    • Press power button
    • Monitor POST on screen or via iLO remote console
    • Address any POST errors before proceeding

iLO 4 Initial Configuration

Physical iLO Connection

  1. Connect Ethernet cable to dedicated iLO port (not FlexibleLOM)
  2. Default iLO IP: Obtains via DHCP, or use temporary address via RBSU
  3. Check DHCP server logs for iLO MAC and assigned IP

First Login

  1. Access iLO web interface: https://<ilo-ip>
  2. Default credentials:
    • Username: Administrator
    • Password: On label on server pull-out tab (or rear label)
  3. Immediately change default password (Administration > Access Settings)

Essential iLO Settings

Network Configuration (Administration > Network):

  • Set static IP or DHCP reservation
  • Configure DNS servers
  • Set hostname (e.g., ilo-dl360-01)
  • Enable SNTP time sync

Security (Administration > Security):

  • Enforce HTTPS only (disable HTTP)
  • Configure SSH key authentication if using CLI
  • Set strong password policy
  • Enable iLO Security features

Access (Administration > Access Settings):

  • Configure iLO username/password for automation
  • Create additional user accounts (separation of duties)
  • Set session timeout (default: 30 minutes)

Date and Time (Administration > Date and Time):

  • Set NTP servers for accurate timestamps
  • Configure timezone

Licenses (Administration > Licensing):

  • Install iLO Advanced license key (required for full virtual media)
  • License can be purchased or acquired from secondary market

iLO Firmware Update

Before production use, update iLO to latest version:

  1. Download latest iLO 4 firmware from HPE Support Portal
  2. Administration > Firmware > Update Firmware
  3. Upload .bin file, apply update
  4. iLO will reboot automatically (system stays running)

System ROM (BIOS/UEFI) Configuration

Accessing RBSU

  • Local: Press F9 during POST
  • Remote: iLO Remote Console > Power > Momentary Press > Press F9 when prompted

Boot Mode Selection

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Mode:

  • UEFI Mode (recommended for modern OS):

    • Supports GPT partitions (>2TB disks)
    • Required for Secure Boot
    • Better UEFI HTTP boot support
    • IPv6 PXE boot support
  • Legacy BIOS Mode:

    • For older OS or compatibility
    • MBR partition tables only
    • Traditional PXE boot

Recommendation: Use UEFI Mode unless legacy compatibility required

Boot Order Configuration

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > UEFI Boot Order:

Recommended order for network boot deployment:

  1. Network Boot: FlexibleLOM or PCIe NIC
  2. Internal Storage: RAID controller or disk
  3. Virtual Media: iLO virtual CD/DVD (for installation media)
  4. USB: For rescue/recovery

Enable Network Boot:

  • System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > Network Boot
  • Set to “Enabled”

Performance and Power Settings

System Configuration > BIOS/Platform Configuration (RBSU) > Power Management:

  • Power Regulator Mode:

    • HP Dynamic Power Savings: Balanced power/performance (recommended for home lab)
    • HP Static High Performance: Maximum performance, higher power draw
    • HP Static Low Power: Minimize power, reduced performance
    • OS Control: Let OS manage (e.g., Linux cpufreq)
  • Collaborative Power Control: Disabled (for standalone servers)

  • Minimum Processor Idle Power Core C-State: C6 (lower idle power)

  • Energy/Performance Bias: Balanced Performance (or Maximum Performance for compute workloads)

Recommendation: Start with “Dynamic Power Savings” and adjust based on workload

Memory Configuration

Optimal Population (dual-CPU configuration):

For maximum performance, populate all channels before adding second DIMM per channel:

64GB (8x 8GB):

  • CPU1: Slots 1, 4, 7, 10 and CPU2: Slots 1, 4, 7, 10
  • Result: 4 channels per CPU, 1 DIMM per channel

128GB (8x 16GB):

  • Same as above with 16GB DIMMs

192GB (12x 16GB):

  • CPU1: Slots 1, 4, 7, 10, 2, 5 and CPU2: Slots 1, 4, 7, 10, 2, 5
  • Result: 4 channels per CPU, some with 2 DIMMs per channel

768GB (24x 32GB):

  • All slots populated

Check Configuration: RBSU > System Information > Memory Information

Processor Options

System Configuration > BIOS/Platform Configuration (RBSU) > Processor Options:

  • Intel Hyperthreading: Enabled (recommended for most workloads)

    • Doubles logical cores (e.g., 12-core CPU shows as 24 cores)
    • Benefits most virtualization and multi-threaded workloads
    • Disable only for specific security compliance (e.g., some cloud providers)
  • Intel Virtualization Technology (VT-x): Enabled (required for hypervisors)

  • Intel VT-d (IOMMU): Enabled (required for PCI passthrough, SR-IOV)

  • Turbo Boost: Enabled (allows CPU to exceed base clock)

  • Cores Enabled: All (or reduce to lower power/heat if needed)

Integrated Devices

System Configuration > BIOS/Platform Configuration (RBSU) > System Options > Integrated Devices:

  • Embedded SATA Controller: Enabled (if using SATA drives)
  • Embedded RAID Controller: Enabled (for Smart Array controllers)
  • SR-IOV: Enabled (if using virtual network interfaces with VMs)

Network Controller Options

For each NIC (FlexibleLOM, PCIe):

System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > [Adapter]:

  • Network Boot: Enabled (for network boot on that NIC)
  • PXE/iSCSI: Select PXE for standard network boot
  • Link Speed: Auto-Negotiation (recommended) or force 1G/10G
  • IPv4: Enabled (for IPv4 PXE boot)
  • IPv6: Enabled (if using IPv6 PXE boot)

Boot Order: Configure which NIC boots first if multiple are enabled

Secure Boot Configuration

System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > Secure Boot:

  • Secure Boot: Disabled (for unsigned boot loaders, custom kernels)
  • Secure Boot: Enabled (for signed boot loaders, Windows, some Linux distros)

Note: If using PXE with unsigned images (e.g., custom iPXE), Secure Boot must be disabled

Firmware Updates

Update System ROM to latest version:

  1. Via iLO:

    • iLO web > Administration > Firmware > Update Firmware
    • Upload System ROM .fwpkg or .bin file
    • Server reboots automatically to apply
  2. Via Service Pack for ProLiant (SPP):

    • Download SPP ISO from HPE Support Portal
    • Mount via iLO Virtual Media
    • Boot server from SPP ISO
    • Smart Update Manager (SUM) runs in Linux environment
    • Select components to update (System ROM, iLO, controller firmware, NIC firmware)
    • Apply updates, reboot

Recommendation: Use SPP for comprehensive updates on initial setup, then iLO for individual component updates

Storage Configuration

Smart Array Controller Setup

Access Smart Array Configuration

  • During POST: Press F5 when “Smart Array Configuration Utility” message appears
  • Via RBSU: System Configuration > BIOS/Platform Configuration (RBSU) > System Options > ROM-Based Setup Utility > Smart Array Configuration

Create RAID Arrays

  1. Delete Existing Arrays (if reconfiguring):

    • Select controller > Configuration > Delete Array
    • Confirm deletion (data loss warning)
  2. Create New Array:

    • Select controller > Configuration > Create Array
    • Select physical drives to include
    • Choose RAID level:
      • RAID 0: Striping, no redundancy (maximum performance, maximum capacity)
      • RAID 1: Mirroring (redundancy, half capacity, good for boot drives)
      • RAID 5: Striping + parity (redundancy, n-1 capacity, balanced)
      • RAID 6: Striping + double parity (dual-drive failure tolerance, n-2 capacity)
      • RAID 10: Mirror + stripe (high performance + redundancy, half capacity)
    • Configure spare drives (hot spares for automatic rebuild)
    • Create logical drive
    • Set bootable flag if boot drive
  3. Recommended Configurations:

    • Boot/OS: 2x SSD in RAID 1 (redundancy, fast boot)
    • Data (performance): 4-6x SSD in RAID 10 (fast, redundant)
    • Data (capacity): 4-8x HDD in RAID 6 (capacity, dual-drive tolerance)

Controller Settings

  • Cache Settings:

    • Write Cache: Enabled (requires battery/flash-backed cache)
    • Read Cache: Enabled
    • No-Battery Write Cache: Disabled (data safety) or Enabled (performance, risk)
  • Rebuild Priority: Medium or High (faster rebuild, may impact performance)

  • Surface Scan Delay: 3-7 days (periodic integrity check)

HBA Mode (Non-RAID)

For software RAID (ZFS, mdadm, Ceph):

  1. Access Smart Array Configuration (F5 during POST)
  2. Controller > Configuration > Enable HBA Mode
  3. Confirm (RAID arrays will be deleted)
  4. Reboot

Note: Not all Smart Array controllers support HBA mode. Check compatibility. Alternative: Use separate LSI HBA in PCIe slot.

Network Configuration for Boot

DHCP Server Setup

For PXE/UEFI network boot, configure DHCP server with appropriate options:

ISC DHCP Example (/etc/dhcp/dhcpd.conf):

# Client system architecture (RFC 4578, option 93), used below to tell UEFI from BIOS clients
option arch code 93 = unsigned integer 16;

# Define subnet
subnet 192.168.10.0 netmask 255.255.255.0 {
    range 192.168.10.100 192.168.10.200;
    option routers 192.168.10.1;
    option domain-name-servers 192.168.10.1;
    
    # PXE boot options
    next-server 192.168.10.5;  # TFTP server IP
    
    # Differentiate UEFI vs BIOS
    if exists user-class and option user-class = "iPXE" {
        # iPXE boot script
        filename "http://boot.example.com/boot.ipxe";
    } elsif option arch = 00:07 or option arch = 00:09 {
        # UEFI (x86-64)
        filename "bootx64.efi";
    } else {
        # Legacy BIOS
        filename "undionly.kpxe";
    }
}

# Static reservation for DL360
host dl360-01 {
    hardware ethernet xx:xx:xx:xx:xx:xx;  # FlexibleLOM MAC
    fixed-address 192.168.10.50;
    option host-name "dl360-01";
}
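After editing, the configuration can be syntax-checked before restarting the daemon; the service name below assumes a Debian/Ubuntu-style isc-dhcp-server installation.

dhcpd -t -cf /etc/dhcp/dhcpd.conf      # parse the config without starting the daemon
systemctl restart isc-dhcp-server      # service name on Debian/Ubuntu
journalctl -u isc-dhcp-server -f       # watch for DISCOVER/OFFER exchanges from the DL360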

FlexibleLOM Configuration

Configure FlexibleLOM NIC for network boot:

  1. RBSU > Network Options > FlexibleLOM
  2. Enable “Network Boot”
  3. Select PXE or iSCSI
  4. Configure IPv4/IPv6 as needed
  5. Set as first boot device in boot order

Multi-NIC Boot Priority

If multiple NICs have network boot enabled:

  1. RBSU > Network Options > Network Boot Order
  2. Drag/drop to prioritize NIC boot order
  3. First NIC in list attempts boot first

Recommendation: Enable network boot on one NIC (typically FlexibleLOM port 1) to avoid confusion

Operating System Installation

Traditional Installation (Virtual Media)

  1. Download OS ISO (e.g., Ubuntu Server, ESXi, Proxmox)
  2. Upload ISO to HTTP/HTTPS server or local file
  3. iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
  4. Browse to ISO location, click “Insert Media”
  5. Set boot order to prioritize virtual media
  6. Reboot server, boot from virtual CD/DVD
  7. Proceed with OS installation

Network Installation (PXE)

See Network Boot Capabilities for detailed PXE/UEFI boot setup

Quick workflow:

  1. Configure DHCP server with PXE options
  2. Setup TFTP server with boot files
  3. Enable network boot in BIOS
  4. Reboot, server PXE boots
  5. Select OS installer from PXE menu
  6. Automated installation proceeds (Kickstart/Preseed/Ignition)

Optimization for Specific Workloads

Virtualization (ESXi, Proxmox, Hyper-V)

BIOS Settings:

  • Hyperthreading: Enabled
  • VT-x: Enabled
  • VT-d: Enabled
  • Power Management: Dynamic or OS Control
  • Turbo Boost: Enabled

Hardware:

  • Maximum memory (384GB+ recommended)
  • Fast storage (SSD RAID 10 for VM storage)
  • 10GbE networking for VM traffic

Configuration:

  • Pass through NICs to VMs (SR-IOV or PCI passthrough)
  • Use storage controller in HBA mode for direct disk access to VM storage (ZFS, Ceph)

Kubernetes/Container Platforms

BIOS Settings:

  • Hyperthreading: Enabled
  • VT-x/VT-d: Enabled (for nested virtualization, kata containers)
  • Power Management: Dynamic or High Performance

Hardware:

  • 128GB+ RAM for multi-tenant workloads
  • Fast local NVMe/SSD for container image cache and ephemeral storage
  • 10GbE for pod networking

OS Recommendations:

  • Talos Linux: Network-bootable, immutable k8s OS
  • Flatcar Container Linux: Auto-updating, minimal OS
  • Ubuntu Server: Broad compatibility, snap/docker native

Storage Server (NAS, SAN)

BIOS Settings:

  • Disable Hyperthreading (slight performance improvement for ZFS)
  • VT-d: Enabled (if passing through HBA to VM)
  • Power Management: High Performance

Hardware:

  • Maximum drive bays (8-10 SFF)
  • HBA mode or separate LSI HBA controller
  • 10GbE or bonded 1GbE for network storage traffic
  • ECC memory (critical for ZFS)

Software:

  • TrueNAS SCALE (Linux-based, k8s apps)
  • OpenMediaVault (Debian-based, plugins)
  • Ubuntu + ZFS (custom setup)

Compute/HPC Workloads

BIOS Settings:

  • Hyperthreading: Depends on workload (test both)
  • Turbo Boost: Enabled
  • Power Management: Maximum Performance
  • C-States: Disabled (reduce latency)

Hardware:

  • High core count CPUs (E5-2680 v4, 2690 v4)
  • Maximum memory bandwidth (populate all channels)
  • Fast local scratch storage (NVMe)

Monitoring and Maintenance

iLO Health Monitoring

Information > System Information:

  • CPU temperature and status
  • Memory status
  • Drive status (via controller)
  • Fan speeds
  • PSU status
  • Overall system health LED status

Alerting (Administration > Alerting):

  • Configure email alerts for:
    • Fan failures
    • Temperature warnings
    • Drive failures
    • Memory errors
    • PSU failures
  • Set up SNMP traps for integration with monitoring systems (Nagios, Zabbix, Prometheus)

Integrated Management Log (IML)

Information > Integrated Management Log:

  • View hardware events and errors
  • Filter by severity (Informational, Caution, Critical)
  • Export log for troubleshooting

Regular Checks:

  • Review IML weekly for early warning signs
  • Address caution-level events before they become critical

Firmware Update Cadence

Recommendation:

  • iLO: Update quarterly or when security advisories released
  • System ROM: Update annually or for bug fixes
  • Storage Controller: Update when issues arise or annually
  • NIC Firmware: Update when issues arise

Method: Use SPP for annual comprehensive updates, iLO web interface for individual component updates

Physical Maintenance

Monthly:

  • Check fan noise (increased noise may indicate clogged air filters or failing fan)
  • Verify PSU and drive LEDs (no amber lights)
  • Check iLO for alerts

Quarterly:

  • Clean air filters (if accessible, depends on rack airflow)
  • Verify backup of iLO configuration
  • Test iLO Virtual Media functionality

Annually:

  • Update all firmware via SPP
  • Verify RAID battery/flash-backed cache status
  • Review and update BIOS settings as workload evolves

Troubleshooting Common Issues

Server Won’t Power On

  1. Check PSU power cords connected
  2. Verify PSU LEDs indicate power
  3. Press iLO power button via web interface
  4. Check iLO IML for power-related errors
  5. Reseat PSUs, check for blown fuses

POST Errors

Memory Errors:

  • Reseat memory DIMMs
  • Test with minimal configuration (1 DIMM per CPU)
  • Replace failing DIMMs identified in POST

CPU Errors:

  • Verify heatsink properly seated
  • Check thermal paste application
  • Reseat CPU (careful with pins)

Drive Errors:

  • Check drive connection to caddy
  • Verify controller recognizes drive
  • Replace failing drive

No Network Boot

See Network Boot Troubleshooting for detailed diagnostics

Quick checks:

  1. Verify NIC link light
  2. Confirm network boot enabled in BIOS
  3. Check DHCP server logs for PXE request
  4. Test TFTP server accessibility
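Checks 3 and 4 can be run from an admin host on the boot VLAN; the interface name, server IP, and filename below are illustrative.

# Watch for the server's DHCPDISCOVER (PXE requests include the option 93 architecture)
sudo tcpdump -ni eth0 'port 67 or port 68'

# Confirm the TFTP server serves the boot file (tftp-hpa client syntax)
tftp 192.168.10.5 -c get undionly.kpxe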

iLO Not Accessible

  1. Check physical Ethernet connection to iLO port
  2. Verify switch port active
  3. Reset iLO: Press and hold iLO NMI button (rear) for 5 seconds
  4. Factory reset iLO via jumper (see maintenance guide)
  5. Check iLO firmware version, update if outdated

High Fan Noise

  1. Check ambient temperature (<25°C recommended)
  2. Verify airflow not blocked (front/rear clearance)
  3. Clean dust from intake (compressed air)
  4. Check iLO temperature sensors for elevated temps
  5. Lower CPU TDP if temperatures excessive (lower power CPUs)
  6. Verify all fans operational (replace failed fans)

Security Hardening

iLO Security

  1. Change Default Credentials: Immediately on first boot
  2. Disable Unused Services: SSH, IPMI if not needed
  3. Use HTTPS Only: Disable HTTP (Administration > Network > HTTP Port)
  4. Network Isolation: Dedicated management VLAN, firewall iLO access
  5. Update Firmware: Apply security patches promptly
  6. Account Management: Use separate accounts, least privilege

BIOS/UEFI Security

  1. BIOS Password: Set administrator password (RBSU > System Options > BIOS Admin Password)
  2. Secure Boot: Enable if using signed boot loaders
  3. Boot Order Lock: Prevent unauthorized boot device changes
  4. TPM: Enable if using BitLocker or LUKS disk encryption

Operating System Security

  1. Minimal Installation: Install only required packages
  2. Firewall: Enable host firewall (iptables, firewalld, ufw)
  3. SSH Hardening: Key-based auth, disable password auth, non-standard port
  4. Automatic Updates: Enable for security patches
  5. Monitoring: Deploy intrusion detection (fail2ban, OSSEC)

Conclusion

Proper configuration of the HP ProLiant DL360 Gen9 ensures optimal performance, reliability, and manageability for home lab and production deployments. The combination of UEFI boot capabilities, iLO remote management, and flexible hardware configuration makes the DL360 Gen9 a versatile platform for virtualization, containerization, storage, and compute workloads.

Key takeaways:

  • Update firmware early (iLO, System ROM, controllers)
  • Configure iLO for remote management and monitoring
  • Choose boot mode (UEFI recommended) and configure network boot appropriately
  • Optimize BIOS settings for specific workload (virtualization, storage, compute)
  • Implement security hardening (iLO, BIOS, OS)
  • Establish monitoring and maintenance schedule

For network boot-specific configuration, refer to the Network Boot Capabilities guide.

1.4.2 - Hardware Specifications

Detailed hardware specifications and configuration options for HP ProLiant DL360 Gen9

System Overview

The HP ProLiant DL360 Gen9 is a dual-socket 1U rack server designed for data center and enterprise deployments, also popular in home lab environments due to its performance and manageability.

Generation: Gen9 (2014-2017 product cycle)
Form Factor: 1U rack-mountable (19-inch standard rack)
Dimensions: 43.46 x 67.31 x 4.29 cm (17.1 x 26.5 x 1.69 in)

Processor Support

Supported CPU Families

The DL360 Gen9 supports Intel Xeon E5-2600 v3 and v4 series processors:

  • E5-2600 v3 (Haswell-EP): Released Q3 2014

    • Process: 22nm
    • Cores: 4-18 per socket
    • TDP: 55W-145W
    • Max Memory Speed: DDR4-2133
  • E5-2600 v4 (Broadwell-EP): Released Q1 2016

    • Process: 14nm
    • Cores: 4-22 per socket
    • TDP: 55W-145W
    • Max Memory Speed: DDR4-2400

Common home lab CPU choices:

  • Value: E5-2620 v3/v4 (6-8 cores, 15-20MB cache, 85W)
  • Balanced: E5-2650 v3/v4 (10-12 cores, 25-30MB cache, 105W)
  • Performance: E5-2680 v3/v4 (12-14 cores, 30-35MB cache, 120W)
  • High Core Count: E5-2699 v4 (22 cores, 55MB cache, 145W)

Configuration Options

  • Single Processor: One CPU socket populated (budget option)
  • Dual Processor: Both sockets populated (full performance)

Note: Memory and I/O performance scales with processor count. Single-CPU configuration limits memory channels and PCIe lanes.

Memory Architecture

Memory Specifications

  • Type: DDR4 RDIMM or LRDIMM
  • Speed: DDR4-2133 (v3) or DDR4-2400 (v4)
  • Slots: 24 DIMM slots (12 per processor)
  • Maximum Capacity:
    • 768GB with 32GB RDIMMs
    • 1.5TB with 64GB LRDIMMs (v4 processors)
  • Minimum: 8GB (1x 8GB DIMM)

Memory Configuration Rules

  • Channels per CPU: 4 channels, 3 DIMMs per channel
  • Population: Populate channels evenly for optimal bandwidth
  • Mixing: Do not mix RDIMM and LRDIMM types
  • Speed: All DIMMs run at speed of slowest DIMM

Example configurations:

Basic Home Lab (Single CPU):

  • 4x 16GB = 64GB (one DIMM in each of the four channels)

Standard (Dual CPU):

  • 8x 16GB = 128GB (one DIMM per channel)
  • 12x 16GB = 192GB (two DIMMs per channel on primary channels)

High Capacity (Dual CPU):

  • 24x 32GB = 768GB (all slots populated, RDIMM)

Performance Priority: Populate all channels before adding second DIMM per channel

Storage Options

Drive Bay Configurations

The DL360 Gen9 offers multiple drive bay configurations:

  1. 8 SFF (2.5-inch): Most common configuration
  2. 10 SFF: Extended bay version
  3. 4 LFF (3.5-inch): Less common in 1U form factor

Drive Types Supported

  • SAS: 12Gb/s, 6Gb/s (enterprise-grade)
  • SATA: 6Gb/s, 3Gb/s (value option)
  • SSD: SAS/SATA SSD, NVMe (with appropriate controller)

Storage Controllers

Smart Array Controllers (HPE proprietary RAID):

  • P440ar: Entry-level, 2GB FBWC (Flash-Backed Write Cache), RAID 0/1/5/6/10
  • P840ar: High-performance, 4GB FBWC, RAID 0/1/5/6/10/50/60
  • P440: PCIe card version, 2GB FBWC
  • P840: PCIe card version, 4GB FBWC

HBA Mode (non-RAID pass-through):

  • Smart Array controllers in HBA mode for software RAID (ZFS, mdadm)
  • Limited support; check firmware version

Alternative Controllers:

  • LSI/Broadcom HBA controllers in PCIe slots
  • H240ar (12Gb/s HBA mode)

Boot Drive Options

For network-focused deployments:

  • Minimal Local Storage: 2x SSD in RAID 1 for hypervisor/OS
  • USB/SD Boot: iLO supports USB boot, SD card (internal USB)
  • Diskless: Pure network boot (subject of network-boot.md)

Network Connectivity

Integrated FlexibleLOM

The DL360 Gen9 includes a FlexibleLOM slot for swappable network adapters:

Common FlexibleLOM Options:

  • HPE 366FLR: 4x 1GbE (Broadcom BCM5719)

    • Most common, good for general use
    • Supports PXE, UEFI network boot, SR-IOV
  • HPE 560FLR-SFP+: 2x 10GbE SFP+ (Intel X710)

    • High performance, fiber or DAC
    • Supports PXE, UEFI boot, SR-IOV, RDMA (RoCE)
  • HPE 361i: 2x 1GbE (Intel I350)

    • Entry-level, good driver support

PCIe Expansion Slots

Slot Configuration:

  • Slot 1: PCIe 3.0 x16 (low-profile)
  • Slot 2: PCIe 3.0 x8 (low-profile)
  • Slot 3: PCIe 3.0 x8 (low-profile) - optional, depends on riser

Network Card Options:

  • Intel X520/X710 (10GbE)
  • Mellanox ConnectX-3/ConnectX-4 (10/25/40GbE, InfiniBand)
  • Broadcom NetXtreme (1/10/25GbE)

Note: Ensure cards are low-profile for 1U chassis compatibility

Power Supply

PSU Options

  • 500W: Single PSU, non-redundant (not recommended)
  • 800W: Common, supports dual CPU + moderate expansion
  • 1400W: High-power, dual CPU with high TDP + GPUs
  • Redundancy: 1+1 redundant hot-plug recommended

Power Configuration

  • Platinum Efficiency: 94%+ at 50% load
  • Hot-Plug: Replace without powering down
  • Auto-Switching: 100-240V AC, 50/60Hz

Home Lab Power Draw (typical):

  • Idle (dual E5-2650 v3, 128GB RAM): 100-130W
  • Load: 200-350W depending on CPU and drive configuration

Power Management

  • HPE Dynamic Power Capping: Limit max power via iLO
  • Collaborative Power: Share power budget across chassis in blade environments
  • Energy Efficient Ethernet (EEE): Reduce NIC power during low utilization

Cooling and Acoustics

Fan Configuration

  • 6x Hot-Plug Fans: Front-mounted, redundant (N+1)
  • Variable Speed: Controlled by System ROM based on thermal sensors
  • iLO Management: Monitor fan speed, temperature via iLO

Thermal Management

  • Temperature Range: 10-35°C (50-95°F) operating
  • Altitude: Up to 3,050m (10,000 ft) at reduced temperature
  • Airflow: Front-to-back, ensure clear intake and exhaust

Noise Level

  • Idle: ~45 dBA (quiet for 1U server)
  • Load: 55-70 dBA depending on thermal demand
  • Home Lab Consideration: Audible but acceptable in dedicated space; louder than desktop workstation

Noise Reduction:

  • Run lower TDP CPUs (e.g., E5-2620 series)
  • Maintain ambient temperature <25°C
  • Ensure adequate airflow (not in enclosed cabinet without ventilation)

Management - iLO 4

iLO 4 Features

The Integrated Lights-Out 4 (iLO 4) provides out-of-band management:

  • Web Interface: HTTPS management console
  • Remote Console: HTML5 or Java-based KVM
  • Virtual Media: Mount ISOs/images remotely
  • Power Control: Power on/off, reset, cold boot
  • Monitoring: Sensors, event logs, hardware health
  • Alerting: Email alerts, SNMP traps, syslog
  • Scripting: RESTful API (Redfish standard)

iLO Licensing

  • iLO Standard (included): Basic management, remote console
  • iLO Advanced (license required):
    • Virtual media
    • Remote console performance improvements
    • Directory integration (LDAP/AD)
    • Graphical remote console
  • iLO Advanced Premium (license required):
    • Insight Remote Support
    • Federation
    • Jitter smoothing

Home Lab: iLO Advanced license highly recommended for virtual media and full remote console features

iLO Network Configuration

  • Dedicated iLO Port: Separate 1GbE management port (recommended)
  • Shared LOM: Share FlexibleLOM port with OS (not recommended for isolation)

Security: Isolate iLO on dedicated management VLAN, disable if not needed

BIOS and Firmware

System ROM (BIOS/UEFI)

  • Firmware Type: UEFI 2.31 or later
  • Boot Modes: UEFI, Legacy BIOS, or hybrid
  • Configuration: RBSU (ROM-Based Setup Utility) accessible via F9

Firmware Update Methods

  1. Service Pack for ProLiant (SPP): Comprehensive bundle of all firmware
  2. iLO Online Flash: Update via web interface
  3. Online ROM Flash: Linux utility for online updates
  4. USB Flash: Boot from USB with firmware update utility

Recommended Practice: Update to latest SPP for security patches and feature improvements

Secure Boot

  • UEFI Secure Boot: Supported, validates boot loader signatures
  • TPM: Optional Trusted Platform Module 1.2 or 2.0
  • Boot Order Protection: Prevent unauthorized boot device changes

Expansion and Modularity

GPU Support

Limited GPU support due to 1U form factor and power constraints:

  • Low-Profile GPUs: NVIDIA T4 or similar low-profile, passively cooled cards (adequate chassis airflow required)
  • Power: Consider 1400W PSU for high-power GPUs
  • Not Ideal: For GPU-heavy workloads, consider 2U+ servers (e.g., DL380 Gen9)

USB Ports

  • Front: 1x USB 3.0
  • Rear: 2x USB 3.0
  • Internal: 1x USB 2.0 (for SD/USB boot device)

Serial Port

  • Rear serial port for legacy console access
  • Useful for network equipment serial console, debug

Home Lab Considerations

Pros for Home Lab

  1. Density: 1U form factor saves rack space
  2. iLO Management: Enterprise remote management without KVM
  3. Network Boot: Excellent PXE/UEFI boot support (see network-boot.md)
  4. Serviceability: Hot-swap drives, PSU, fans
  5. Documentation: Extensive HPE documentation and community support
  6. Parts Availability: Common on secondary market, affordable

Cons for Home Lab

  1. Noise: Louder than tower servers or workstations
  2. Power: Higher idle power than consumer hardware (100-130W idle)
  3. 1U Limitations: Limited GPU, PCIe expansion vs 2U/4U chassis
  4. Firmware: Requires HPE account for SPP downloads (free but registration required)

Budget (~$500-800 used):

  • Dual E5-2620 v3 or v4 (6 cores each, 85W TDP)
  • 128GB RAM (8x 16GB DDR4)
  • 2x SSD (boot), 4-6x HDD/SSD (data)
  • HPE 366FLR (4x 1GbE)
  • Dual 500W or 800W PSU (redundant)
  • iLO Advanced license

Performance (~$1000-1500 used):

  • Dual E5-2680 v4 (14 cores each, 120W TDP)
  • 256GB RAM (16x 16GB DDR4)
  • 2x NVMe SSD (boot/cache), 6-8x SSD (data)
  • HPE 560FLR-SFP+ (2x 10GbE) + PCIe 4x1GbE card
  • Dual 800W PSU
  • iLO Advanced license

Comparison with Other Generations

vs Gen8 (Previous)

Gen9 Advantages:

  • DDR4 vs DDR3 (lower power, higher capacity)
  • Better UEFI support and HTTP boot
  • Newer processor architecture (Haswell/Broadwell vs Sandy Bridge/Ivy Bridge)
  • iLO 4 vs iLO 3 (better HTML5 console)

Gen8 Advantages:

  • Lower cost on secondary market
  • Adequate for light workloads

vs Gen10 (Next)

Gen10 Advantages:

  • Newer CPUs (Skylake-SP/Cascade Lake)
  • More PCIe lanes
  • Better UEFI firmware and security features
  • DDR4-2666/2933 support

Gen9 Advantages:

  • Lower cost (mature product cycle)
  • Excellent value for performance/dollar
  • Still well-supported by modern OS and firmware

Technical Resources

  • QuickSpecs: HPE ProLiant DL360 Gen9 Server QuickSpecs
  • User Guide: HPE ProLiant DL360 Gen9 Server User Guide
  • Maintenance and Service Guide: Detailed disassembly and part replacement
  • Firmware Downloads: HPE Support Portal (requires free account)

Summary

The HP ProLiant DL360 Gen9 remains an excellent choice for home labs and small deployments in 2024-2025. Its balance of performance (dual Xeon v4, 768GB RAM capacity), manageability (iLO 4), and network boot capabilities make it particularly well-suited for virtualization, container hosting, and infrastructure automation workflows. While not the latest generation, it offers strong value with robust firmware support and wide secondary market availability.

Best For:

  • Virtualization hosts (ESXi, Proxmox, Hyper-V)
  • Kubernetes/container platforms
  • Network boot/diskless deployments
  • Storage servers (with appropriate controller)
  • General compute workloads

Avoid For:

  • GPU-intensive workloads (1U constraints)
  • Noise-sensitive environments (unless isolated)
  • Extreme low-power requirements (100W+ idle)

1.4.3 - Network Boot Capabilities

Comprehensive analysis of network boot support on HP ProLiant DL360 Gen9

Overview

The HP ProLiant DL360 Gen9 provides robust network boot capabilities through multiple protocols and firmware interfaces. This makes it particularly well-suited for diskless deployments, automated provisioning, and infrastructure-as-code workflows.

Supported Network Boot Protocols

PXE (Preboot Execution Environment)

The DL360 Gen9 fully supports PXE boot via both legacy BIOS and UEFI firmware modes:

  • Legacy BIOS PXE: Traditional PXE implementation using TFTP

    • Protocol: PXEv2 (PXE 2.1)
    • Network Stack: IPv4 only in legacy mode
    • Boot files: pxelinux.0, undionly.kpxe, or custom NBP
    • DHCP options: Standard options 66 (TFTP server) and 67 (boot filename)
  • UEFI PXE: Modern UEFI network boot implementation

    • Protocol: PXEv2 with UEFI extensions
    • Network Stack: IPv4 and IPv6 support
    • Boot files: bootx64.efi, grubx64.efi, shimx64.efi
    • Architecture: x64 (EFI BC)
    • DHCP Architecture ID: 0x0007 (EFI BC) or 0x0009 (EFI x86-64)

iPXE Support

The DL360 Gen9 can boot iPXE, enabling advanced features:

  • Chainloading: Boot standard PXE, then chainload iPXE for enhanced capabilities
  • HTTP/HTTPS Boot: Download kernels and images over HTTP(S) instead of TFTP
  • SAN Boot: iSCSI and AoE (ATA over Ethernet) support
  • Scripting: Conditional boot logic and dynamic configuration
  • Embedded Scripts: iPXE can be compiled with embedded boot scripts

Implementation Methods:

  1. Chainload from standard PXE: DHCP points to undionly.kpxe or ipxe.efi
  2. Flash iPXE to FlexibleLOM option ROM (advanced, requires care)
  3. Boot iPXE from USB, then continue network boot
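As a rough sketch of method 1, the chainloaded URL usually points at a small iPXE script; the example below writes one into a web root, with placeholder URLs and kernel arguments.

# Write a minimal iPXE boot script into the HTTP server's web root (paths/URLs are placeholders)
cat > /var/www/html/boot.ipxe <<'EOF'
#!ipxe
kernel http://boot.example.com/assets/vmlinuz initrd=initrd.img console=tty0
initrd http://boot.example.com/assets/initrd.img
boot
EOF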

UEFI HTTP Boot

Native UEFI HTTP boot is supported on Gen9 servers with recent firmware:

  • Protocol: RFC 7230 HTTP/1.1
  • Requirements:
    • UEFI firmware version 2.40 or later (check via iLO)
    • DHCP option 60 (vendor class identifier) = “HTTPClient”
    • DHCP option 67 pointing to HTTP(S) URL
  • Advantages:
    • No TFTP server required
    • Faster transfers than TFTP
    • Support for HTTPS with certificate validation
    • Better suited for large images (kernels, initramfs)
  • Limitations:
    • UEFI mode only (not available in legacy BIOS)
    • Requires DHCP server with HTTP URL support

HTTP(S) Boot Configuration

For UEFI HTTP boot on DL360 Gen9:

# Example ISC DHCP configuration for UEFI HTTP boot
class "httpclients" {
    match if substring(option vendor-class-identifier, 0, 10) = "HTTPClient";
}

pool {
    allow members of "httpclients";
    option vendor-class-identifier "HTTPClient";
    # Point to HTTP boot URI
    filename "http://boot.example.com/boot/efi/bootx64.efi";
}

Network Interface Options

The DL360 Gen9 supports multiple network adapter configurations for boot:

FlexibleLOM (LOM = LAN on Motherboard)

HPE FlexibleLOM slot supports:

  • HPE 366FLR: Quad-port 1GbE (Broadcom BCM5719)
  • HPE 560FLR-SFP+: Dual-port 10GbE (Intel X710)
  • HPE 361i: Dual-port 1GbE (Intel I350)

All FlexibleLOM adapters support PXE and UEFI network boot. The option ROM can be configured via BIOS/UEFI settings.

PCIe Network Adapters

Standard PCIe network cards with PXE/UEFI boot ROM support:

  • Intel X520, X710 series (10GbE)
  • Broadcom NetXtreme series
  • Mellanox ConnectX-3/4 (with appropriate firmware)

Boot Priority: Configure via System ROM > Network Boot Options to select which NIC boots first.

Firmware Configuration

Accessing Boot Configuration

  1. RBSU (ROM-Based Setup Utility): Press F9 during POST
  2. iLO 4 Remote Console: Access via network, then virtual F9
  3. UEFI System Utilities: Modern interface for UEFI firmware settings

Key Settings

Navigate to: System Configuration > BIOS/Platform Configuration (RBSU) > Network Boot Options

  • Network Boot: Enable/Disable
  • Boot Mode: UEFI or Legacy BIOS
  • IPv4/IPv6: Enable protocol support
  • Boot Retry: Number of attempts before falling back to next boot device
  • Boot Order: Prioritize network boot in boot sequence

Per-NIC Configuration

In RBSU > Network Options:

  • Option ROM: Enable/Disable per adapter
  • Link Speed: Force speed/duplex or auto-negotiate
  • VLAN: VLAN tagging for boot (if supported by DHCP/PXE environment)
  • PXE Menu: Enable interactive PXE menu (Ctrl+S during PXE boot)

iLO 4 Integration

The DL360 Gen9’s iLO 4 provides additional network boot features:

Virtual Media Network Boot

  • Mount ISO images remotely via iLO Virtual Media
  • Boot from network-attached ISO without physical media
  • Useful for OS installation or diagnostics

Workflow:

  1. Upload ISO to HTTP/HTTPS server or use SMB/NFS share
  2. iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
  3. Set boot order to prioritize virtual optical drive
  4. Reboot server

Scripted Deployment via iLO

iLO 4 RESTful API allows:

  • Setting one-time boot to network via API call
  • Automating PXE boot for provisioning pipelines
  • Integration with tools like Terraform, Ansible

Example using iLO RESTful API:

curl -k -u admin:password -X PATCH \
  https://ilo-hostname/redfish/v1/Systems/1/ \
  -d '{"Boot":{"BootSourceOverrideTarget":"Pxe","BootSourceOverrideEnabled":"Once"}}'

Boot Process Flow

Legacy BIOS PXE Boot

  1. Server powers on, initializes NICs
  2. NIC sends DHCPDISCOVER with PXE vendor options
  3. DHCP server responds with IP, TFTP server (option 66), boot file (option 67)
  4. NIC downloads NBP (Network Bootstrap Program) via TFTP
  5. NBP executes (e.g., pxelinux.0 loads syslinux menu)
  6. User selects boot target or automated script continues
  7. Kernel and initramfs download and boot

UEFI PXE Boot

  1. UEFI firmware initializes network stack
  2. UEFI PXE driver sends DHCPv4/v6 DISCOVER
  3. DHCP responds with boot file (e.g., bootx64.efi)
  4. UEFI downloads boot file via TFTP
  5. UEFI loads and executes boot loader (GRUB2, systemd-boot, iPXE)
  6. Boot loader may download additional files (kernel, initrd, config)
  7. OS boots

UEFI HTTP Boot

  1. UEFI firmware with HTTP Boot support enabled
  2. DHCP request includes “HTTPClient” vendor class
  3. DHCP responds with HTTP(S) URL in option 67
  4. UEFI HTTP client downloads boot file over HTTP(S)
  5. Execution continues as with UEFI PXE

Performance Considerations

TFTP vs HTTP

  • TFTP: Slow for large files (typical: 1-5 MB/s)
    • Use for small boot loaders only
    • Chainload to iPXE or HTTP boot for better performance
  • HTTP: 10-100x faster depending on network and server
    • Recommended for kernels, initramfs, live OS images
    • iPXE or UEFI HTTP boot required

Network Speed Impact

DL360 Gen9 boot performance by NIC speed:

  • 1GbE: Adequate for most PXE deployments (100-125 MB/s theoretical max)
  • 10GbE: Significant improvement for large image downloads (1-2 GB/s)
  • Bonding/Teaming: Not typically used for boot (single NIC boots)

Recommendation: For production diskless nodes or frequent re-provisioning, 10GbE with HTTP boot provides best performance.

Common Use Cases

1. Automated OS Provisioning

Boot into installer via PXE:

  • Kickstart (RHEL/CentOS/Rocky)
  • Preseed (Debian/Ubuntu)
  • Ignition (Fedora CoreOS, Flatcar)

2. Diskless Boot

Boot OS entirely from network/RAM:

  • Network root: NFS or iSCSI root filesystem
  • Overlay: Persistent storage via network overlay
  • Stateless: Boot identical image, no local state

3. Rescue and Diagnostics

Boot live environments:

  • SystemRescue
  • Clonezilla
  • Memtest86+
  • Hardware diagnostics (HPE Service Pack for ProLiant)

4. Kubernetes/Container Hosts

PXE boot immutable OS images:

  • Talos Linux: API-driven, diskless k8s nodes
  • Flatcar Container Linux: Automated updates
  • k3OS: Lightweight k8s OS

Troubleshooting

PXE Boot Fails

Symptoms: “PXE-E51: No DHCP or proxy DHCP offers received” or timeout

Checks:

  1. Verify NIC link light and switch port status
  2. Confirm DHCP server is responding (check DHCP logs)
  3. Ensure DHCP options 66 and 67 are set correctly
  4. Test TFTP server accessibility (tftp -i <server> GET <file>)
  5. Check BIOS/UEFI network boot is enabled
  6. Verify boot order prioritizes network boot
  7. Disable Secure Boot if using unsigned boot files

UEFI Network Boot Not Available

Symptoms: Network boot option missing in UEFI boot menu

Resolution:

  1. Enter RBSU (F9), navigate to Network Options
  2. Ensure at least one NIC has “Option ROM” enabled
  3. Verify Boot Mode is set to UEFI (not Legacy)
  4. Update System ROM to latest version if option is missing
  5. Some FlexibleLOM cards require firmware update for UEFI boot support

HTTP Boot Fails

Symptoms: UEFI HTTP boot option present but fails to download

Checks:

  1. Verify firmware version supports HTTP boot (>=2.40)
  2. Ensure DHCP option 67 contains valid HTTP(S) URL
  3. Test URL accessibility from another client
  4. Check DNS resolution if using hostname in URL
  5. For HTTPS: Verify certificate is trusted (or disable cert validation in test)
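Checks 2-4 can be approximated from another host on the boot VLAN; the URL below reuses the earlier example configuration.

# HEAD request against the URL handed out in DHCP option 67
curl -fsSI http://boot.example.com/boot/efi/bootx64.efi

# If the URL uses a hostname, confirm it resolves on the boot network
dig +short boot.example.com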

Slow PXE Boot

Symptoms: Boot process takes minutes instead of seconds

Optimizations:

  1. Switch from TFTP to HTTP (chainload iPXE or use UEFI HTTP boot)
  2. Increase TFTP server block size (tftp-hpa --blocksize 1468)
  3. Tune DHCP response times (reduce lease query delays)
  4. Use local network segment for boot server (avoid WAN/VPN)
  5. Enable NIC interrupt coalescing in BIOS for 10GbE

Security Considerations

Secure Boot

DL360 Gen9 supports UEFI Secure Boot:

  • Validates signed boot loaders (shim, GRUB, kernel)
  • Prevents unsigned code execution during boot
  • Required for some compliance scenarios

Configuration: RBSU > Boot Options > Secure Boot = Enabled

Implications for Network Boot:

  • Must use signed boot loaders (e.g., shim.efi signed by Microsoft/vendor)
  • Custom kernels require signing or disabling Secure Boot
  • iPXE must be signed or chainloaded from signed shim

Network Security

Risks:

  • PXE/TFTP is unencrypted and unauthenticated
  • Attacker on network can serve malicious boot images
  • DHCP spoofing can redirect to malicious boot server

Mitigations:

  1. Network Segmentation: Isolate PXE boot to management VLAN
  2. DHCP Snooping: Prevent rogue DHCP servers on switch
  3. HTTPS Boot: Use UEFI HTTP boot with TLS and certificate validation
  4. iPXE with HTTPS: Chainload iPXE, then use HTTPS for all downloads
  5. Signed Images: Use Secure Boot with signed boot chain
  6. 802.1X: Require network authentication before DHCP (complex for PXE)

iLO Security

  • Change default iLO password immediately
  • Use TLS for iLO web interface and API
  • Restrict iLO network access (firewall, separate VLAN)
  • Disable iLO Virtual Media if not needed
  • Leave the iLO Security Override switch off (it bypasses iLO authentication and is intended only for recovery)

Firmware and Driver Resources

Required Firmware Versions

For optimal network boot support:

  • System ROM: v2.60 or later (latest recommended)
  • iLO 4 Firmware: v2.80 or later
  • NIC Firmware: Latest for specific FlexibleLOM/PCIe card

Check current versions: iLO web interface > Information > Firmware Information

Updating Firmware

Methods:

  1. HPE Service Pack for ProLiant (SPP): Comprehensive update bundle

    • Boot from SPP ISO (via iLO Virtual Media or USB)
    • Runs Smart Update Manager (SUM) in Linux environment
    • Updates all firmware, drivers, system ROM automatically
  2. iLO Web Interface: Individual component updates

    • System ROM: Administration > Firmware > Update Firmware
    • Upload .fwpkg or .bin files from HPE support site
  3. Online Flash Component: Linux Online ROM Flash utility

    • Install hp-firmware-* packages
    • Run updates while OS is running (requires reboot to apply)

Download Source: https://support.hpe.com/connect/s/product?language=en_US&kmpmoid=1010026910 (requires HPE Passport account, free registration)

Best Practices

  1. Use UEFI Mode: Better security, IPv6 support, larger disk support
  2. Enable HTTP Boot: Faster and more reliable than TFTP for large files
  3. Chainload iPXE: Flexibility of iPXE with standard PXE infrastructure
  4. Update Firmware: Keep System ROM and iLO current for bug fixes and features
  5. Isolate Boot Network: Use dedicated management VLAN for PXE/provisioning
  6. Test Failover: Configure multiple DHCP servers and boot mirrors for redundancy
  7. Document Configuration: Record BIOS settings, DHCP config, and boot infrastructure
  8. Monitor iLO Logs: Track boot failures and hardware issues via iLO event log

References

  • HPE ProLiant DL360 Gen9 Server User Guide
  • HPE UEFI System Utilities User Guide
  • iLO 4 User Guide (firmware version 2.80)
  • Intel PXE Specification v2.1
  • UEFI Specification v2.8 (HTTP Boot)
  • iPXE Documentation: https://ipxe.org/

Conclusion

The HP ProLiant DL360 Gen9 provides enterprise-grade network boot capabilities suitable for both traditional PXE deployments and modern UEFI HTTP boot scenarios. Its flexible configuration options, mature firmware support, and iLO integration make it an excellent platform for automated provisioning, diskless computing, and infrastructure-as-code workflows in home lab environments.

For home lab use, the recommended configuration is:

  • UEFI boot mode with Secure Boot disabled (unless required)
  • iPXE chainloading for flexibility and HTTP performance
  • iLO 4 configured for remote management and scripted provisioning
  • Latest firmware for stability and feature support

1.5 - Matchbox Analysis

Analysis of Matchbox network boot service capabilities and architecture

Matchbox Network Boot Analysis

This section contains a comprehensive analysis of Matchbox, a network boot service for provisioning bare-metal machines.

Overview

Matchbox is an HTTP and gRPC service developed by Poseidon that automates bare-metal machine provisioning through network booting. It matches machines to configuration profiles based on hardware attributes and serves boot configurations, kernel images, and provisioning configs.

Primary Repository: poseidon/matchbox
Documentation: https://matchbox.psdn.io/
License: Apache 2.0

Key Features

  • Network Boot Support: iPXE, PXELINUX, GRUB2 chainloading
  • OS Provisioning: Fedora CoreOS, Flatcar Linux, RHEL CoreOS
  • Configuration Management: Ignition v3.x configs, Butane transpilation
  • Machine Matching: Label-based matching (MAC, UUID, hostname, serial, custom)
  • API: Read-only HTTP API + authenticated gRPC API
  • Asset Serving: Local caching of OS images for faster deployment
  • Templating: Go template support for dynamic configuration

Use Cases

  1. Bare-metal Kubernetes clusters - Provision CoreOS nodes for k8s
  2. Lab/development environments - Quick PXE boot for testing
  3. Datacenter provisioning - Automate OS installation across fleets
  4. Immutable infrastructure - Declarative machine provisioning via Terraform

Analysis Contents

Quick Architecture

┌─────────────┐
│   Machine   │ PXE Boot
│  (BIOS/UEFI)│───┐
└─────────────┘   │
                  │
┌─────────────┐   │ DHCP/TFTP
│   dnsmasq   │◄──┘ (chainload to iPXE)
│  DHCP+TFTP  │
└─────────────┘
       │
       │ HTTP
       ▼
┌─────────────────────────┐
│      Matchbox           │
│  ┌──────────────────┐   │
│  │  HTTP Endpoints  │   │ /boot.ipxe, /ignition
│  └──────────────────┘   │
│  ┌──────────────────┐   │
│  │   gRPC API       │   │ Terraform provider
│  └──────────────────┘   │
│  ┌──────────────────┐   │
│  │ Profile/Group    │   │ Match machines
│  │   Matcher        │   │ to configs
│  └──────────────────┘   │
└─────────────────────────┘

Technology Stack

  • Language: Go
  • Config Formats: Ignition JSON, Butane YAML
  • Boot Protocols: PXE, iPXE, GRUB2
  • APIs: HTTP (read-only), gRPC (authenticated)
  • Deployment: Binary, container (Podman/Docker), Kubernetes

Integration Points

  • Terraform: terraform-provider-matchbox for declarative provisioning
  • Ignition/Butane: CoreOS provisioning configs
  • dnsmasq: Reference DHCP/TFTP/DNS implementation (quay.io/poseidon/dnsmasq)
  • Asset sources: Can serve local or remote (HTTPS) OS images

1.5.1 - Configuration Model

Analysis of Matchbox’s profile, group, and templating system

Matchbox Configuration Model

Matchbox uses a flexible configuration model based on Profiles (what to provision) and Groups (which machines get which profile), with support for templating and metadata.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    Matchbox Store                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐            │
│  │  Profiles  │  │   Groups   │  │   Assets   │            │
│  └────────────┘  └────────────┘  └────────────┘            │
│        │               │                                    │
│        │               │                                    │
│        ▼               ▼                                    │
│  ┌─────────────────────────────────────┐                   │
│  │       Matcher Engine                │                   │
│  │  (Label-based group selection)      │                   │
│  └─────────────────────────────────────┘                   │
│                    │                                        │
│                    ▼                                        │
│  ┌─────────────────────────────────────┐                   │
│  │    Template Renderer                │                   │
│  │  (Go templates + metadata)          │                   │
│  └─────────────────────────────────────┘                   │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
            Rendered Config (iPXE, Ignition, etc.)

Data Directory Structure

Matchbox uses a FileStore (default) that reads from -data-path (default: /var/lib/matchbox):

/var/lib/matchbox/
├── groups/              # Machine group definitions (JSON)
│   ├── default.json
│   ├── node1.json
│   └── us-west.json
├── profiles/            # Profile definitions (JSON)
│   ├── worker.json
│   ├── controller.json
│   └── etcd.json
├── ignition/            # Ignition configs (.ign) or Butane (.yaml)
│   ├── worker.ign
│   ├── controller.ign
│   └── butane-example.yaml
├── cloud/               # Cloud-Config templates (DEPRECATED)
│   └── legacy.yaml.tmpl
├── generic/             # Arbitrary config templates
│   ├── setup.cfg
│   └── metadata.yaml.tmpl
└── assets/              # Static files (kernel, initrd)
    ├── fedora-coreos/
    └── flatcar/

Version control: Poseidon recommends keeping /var/lib/matchbox under git for auditability and rollback.
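A minimal way to follow that recommendation (paths as above):

cd /var/lib/matchbox
git init
git add groups/ profiles/ ignition/ generic/
git commit -m "Track matchbox provisioning data"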

Profiles

Profiles define what to provision: network boot settings (kernel, initrd, args) and config references (Ignition, Cloud-Config, generic).

Profile Schema

{
  "id": "worker",
  "name": "Fedora CoreOS Worker Node",
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox.example.com:8080/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  },
  "ignition_id": "worker.ign",
  "cloud_id": "",
  "generic_id": ""
}

Profile Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique profile identifier (referenced by groups) |
| name | string | Human-readable description |
| boot | object | Network boot configuration |
| boot.kernel | string | Kernel URL (HTTP/HTTPS or /assets path) |
| boot.initrd | array | Initrd URLs (can specify --name for multi-initrd) |
| boot.args | array | Kernel command-line arguments |
| ignition_id | string | Ignition/Butane config filename in ignition/ |
| cloud_id | string | Cloud-Config filename in cloud/ (deprecated) |
| generic_id | string | Generic config filename in generic/ |

Boot Configuration Patterns

Pattern 1: Live PXE (RAM-based, ephemeral)

Boot and run OS entirely from RAM, no disk install:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
      "ignition.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  }
}

Use case: Diskless workers, testing, ephemeral compute

Pattern 2: Disk Install (persistent)

PXE boot live image, install to disk, reboot to disk:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
    "initrd": [
      "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
    ],
    "args": [
      "initrd=main",
      "coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
      "coreos.inst.install_dev=/dev/sda",
      "coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
    ]
  }
}

Key difference: coreos.inst.install_dev triggers disk install before reboot

Pattern 3: Multi-initrd (layered)

Multiple initrds can be loaded (e.g., base + drivers):

{
  "initrd": [
    "--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img",
    "--name drivers /assets/drivers/custom-drivers.img"
  ],
  "args": [
    "initrd=main,drivers",
    "..."
  ]
}

Config References

Ignition Configs

Direct Ignition (.ign files):

{
  "ignition_id": "worker.ign"
}

File: /var/lib/matchbox/ignition/worker.ign

{
  "ignition": { "version": "3.3.0" },
  "systemd": {
    "units": [{
      "name": "example.service",
      "enabled": true,
      "contents": "[Service]\nType=oneshot\nExecStart=/usr/bin/echo Hello\n\n[Install]\nWantedBy=multi-user.target"
    }]
  }
}

Butane Configs (transpiled to Ignition):

{
  "ignition_id": "worker.yaml"
}

File: /var/lib/matchbox/ignition/worker.yaml

variant: fcos
version: 1.5.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...
systemd:
  units:
    - name: etcd.service
      enabled: true

Matchbox automatically:

  1. Detects Butane format (file doesn’t end in .ign or .ignition)
  2. Transpiles Butane → Ignition using embedded library
  3. Renders templates with group metadata
  4. Serves as Ignition v3.3.0
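Butane files can also be validated locally before being dropped into ignition/, using the standalone butane tool (a pre-check only; Matchbox performs its own transpilation at serve time):

butane --strict < /var/lib/matchbox/ignition/worker.yaml > /tmp/worker.ign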

Generic Configs

For non-Ignition configs (scripts, YAML, arbitrary data):

{
  "generic_id": "setup-script.sh.tmpl"
}

File: /var/lib/matchbox/generic/setup-script.sh.tmpl

#!/bin/bash
# Rendered with group metadata
NODE_NAME={{.node_name}}
CLUSTER_ID={{.cluster_id}}
echo "Provisioning ${NODE_NAME} in cluster ${CLUSTER_ID}"

Access via: GET /generic?uuid=...&mac=...

Groups

Groups match machines to profiles using selectors (label matching) and provide metadata for template rendering.

Group Schema

{
  "id": "node1-worker",
  "name": "Worker Node 1",
  "profile": "worker",
  "selector": {
    "mac": "52:54:00:89:d8:10",
    "uuid": "550e8400-e29b-41d4-a716-446655440000"
  },
  "metadata": {
    "node_name": "worker-01",
    "cluster_id": "prod-cluster",
    "etcd_endpoints": "https://10.0.1.10:2379,https://10.0.1.11:2379",
    "ssh_authorized_keys": [
      "ssh-ed25519 AAAA...",
      "ssh-rsa AAAA..."
    ]
  }
}

Group Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique group identifier |
| name | string | Human-readable description |
| profile | string | Profile ID to apply |
| selector | object | Label match criteria (omit for default group) |
| metadata | object | Key-value data for template rendering |

Selector Matching

Reserved selectors (automatically populated from machine attributes):

| Selector | Source | Example | Normalized |
|---|---|---|---|
| uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase |
| mac | Primary NIC MAC | 52:54:00:89:d8:10 | Colon-separated |
| hostname | Network hostname | node1.example.com | As reported |
| serial | Hardware serial | VMware-42 1a... | As reported |
Custom selectors (passed as query params):

{
  "selector": {
    "region": "us-west",
    "environment": "production",
    "rack": "A23"
  }
}

Matching request: /ipxe?mac=52:54:00:89:d8:10&region=us-west&environment=production&rack=A23

Matching logic:

  1. All selector key-value pairs must match request labels (AND logic)
  2. Most specific group wins (most selector matches)
  3. If multiple groups have the same specificity, the selection is non-deterministic (group ordering is undefined), so avoid overlapping selectors of equal specificity
  4. Groups with no selectors = default group (matches all)

Default Groups

Group with empty selector matches all machines:

{
  "id": "default-worker",
  "name": "Default Worker",
  "profile": "worker",
  "metadata": {
    "environment": "dev"
  }
}

⚠️ Warning: Avoid multiple default groups (non-deterministic matching)

Example: Region-based Matching

Group 1: US-West Workers

{
  "id": "us-west-workers",
  "profile": "worker",
  "selector": {
    "region": "us-west"
  },
  "metadata": {
    "etcd_endpoints": "https://etcd-usw.example.com:2379"
  }
}

Group 2: EU Workers

{
  "id": "eu-workers",
  "profile": "worker",
  "selector": {
    "region": "eu"
  },
  "metadata": {
    "etcd_endpoints": "https://etcd-eu.example.com:2379"
  }
}

Group 3: Specific Machine Override

{
  "id": "node-special",
  "profile": "controller",
  "selector": {
    "mac": "52:54:00:89:d8:10",
    "region": "us-west"
  },
  "metadata": {
    "role": "controller"
  }
}

Matching precedence:

  • Machine with mac=52:54:00:89:d8:10&region=us-westnode-special (2 selectors)
  • Machine with region=us-westus-west-workers (1 selector)
  • Machine with region=eueu-workers (1 selector)

Templating System

Matchbox uses Go’s text/template for rendering configs with group metadata.

Template Context

Available variables in Ignition/Butane/Cloud-Config/generic templates:

// Group metadata (all keys from group.metadata)
{{.node_name}}
{{.cluster_id}}
{{.etcd_endpoints}}

// Group selectors (normalized)
{{.mac}}      // e.g., "52:54:00:89:d8:10"
{{.uuid}}     // e.g., "550e8400-..."
{{.region}}   // Custom selector

// Request query params (raw)
{{.request.query.mac}}     // As passed in URL
{{.request.query.foo}}     // Custom query param
{{.request.raw_query}}     // Full query string

// Special functions
{{if index . "ssh_authorized_keys"}}  // Check if key exists
{{range $element := .ssh_authorized_keys}}  // Iterate arrays

Example: Templated Butane Config

Group metadata:

{
  "metadata": {
    "node_name": "worker-01",
    "ssh_authorized_keys": [
      "ssh-ed25519 AAA...",
      "ssh-rsa BBB..."
    ],
    "ntp_servers": ["time1.google.com", "time2.google.com"]
  }
}

Butane template: /var/lib/matchbox/ignition/worker.yaml

variant: fcos
version: 1.5.0

storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: {{.node_name}}

    - path: /etc/systemd/timesyncd.conf
      mode: 0644
      contents:
        inline: |
          [Time]
          {{range $server := .ntp_servers}}
          NTP={{$server}}
          {{end}}

{{if index . "ssh_authorized_keys"}}
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        {{range $key := .ssh_authorized_keys}}
        - {{$key}}
        {{end}}
{{end}}

Rendered Ignition (simplified):

{
  "ignition": {"version": "3.3.0"},
  "storage": {
    "files": [
      {
        "path": "/etc/hostname",
        "contents": {"source": "data:,worker-01"},
        "mode": 420
      },
      {
        "path": "/etc/systemd/timesyncd.conf",
        "contents": {"source": "data:,%5BTime%5D%0ANTP%3Dtime1.google.com%0ANTP%3Dtime2.google.com"},
        "mode": 420
      }
    ]
  },
  "passwd": {
    "users": [{
      "name": "core",
      "sshAuthorizedKeys": ["ssh-ed25519 AAA...", "ssh-rsa BBB..."]
    }]
  }
}

Template Best Practices

  1. Prefer external rendering: Use Terraform + ct_config provider for complex templates
  2. Validate Butane: Use strict: true in Terraform or fcct --strict
  3. Escape carefully: Go templates use {{}}, Butane uses YAML - mind the interaction
  4. Test rendering: Request /ignition?mac=... directly to inspect output
  5. Version control: Keep templates + groups in git for auditability

Reserved Metadata Keys

Warning: .request is reserved for query param access. Group metadata with "request": {...} will be overwritten.

Reserved keys:

  • request.query.* - Query parameters
  • request.raw_query - Raw query string

API Integration

HTTP Endpoints (Read-only)

Endpoint | Purpose | Template Context
/ipxe | iPXE boot script | Profile boot section
/grub | GRUB config | Profile boot section
/ignition | Ignition config | Group metadata + selectors + query
/cloud | Cloud-Config (deprecated) | Group metadata + selectors + query
/generic | Generic config | Group metadata + selectors + query
/metadata | Key-value env format | Group metadata + selectors + query

Example metadata endpoint response:

GET /metadata?mac=52:54:00:89:d8:10&foo=bar

NODE_NAME=worker-01
CLUSTER_ID=prod
MAC=52:54:00:89:d8:10
REQUEST_QUERY_MAC=52:54:00:89:d8:10
REQUEST_QUERY_FOO=bar
REQUEST_RAW_QUERY=mac=52:54:00:89:d8:10&foo=bar
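
Because the response is plain KEY=VALUE lines, a post-boot script can pull individual values out of it; a minimal sketch, assuming the illustrative hostname and MAC above:

#!/bin/bash
# Read selected keys from the env-format metadata endpoint
META_URL='http://matchbox.example.com:8080/metadata?mac=52:54:00:89:d8:10'
NODE_NAME=$(curl -fsS "$META_URL" | grep '^NODE_NAME=' | cut -d= -f2-)
CLUSTER_ID=$(curl -fsS "$META_URL" | grep '^CLUSTER_ID=' | cut -d= -f2-)
echo "Provisioning ${NODE_NAME} for cluster ${CLUSTER_ID}"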

gRPC API (Authenticated, mutable)

Used by terraform-provider-matchbox for declarative infrastructure:

Terraform example:

provider "matchbox" {
  endpoint    = "matchbox.example.com:8081"
  client_cert = file("~/.matchbox/client.crt")
  client_key  = file("~/.matchbox/client.key")
  ca          = file("~/.matchbox/ca.crt")
}

resource "matchbox_profile" "worker" {
  name   = "worker"
  kernel = "/assets/fedora-coreos/.../kernel"
  initrd = ["--name main /assets/fedora-coreos/.../initramfs.img"]
  args   = [
    "initrd=main",
    "coreos.inst.install_dev=/dev/sda",
    "coreos.inst.ignition_url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}"
  ]
  raw_ignition = data.ct_config.worker.rendered
}

resource "matchbox_group" "node1" {
  name    = "node1"
  profile = matchbox_profile.worker.name
  selector = {
    mac = "52:54:00:89:d8:10"
  }
  metadata = {
    node_name = "worker-01"
  }
}

Operations:

  • CreateProfile, GetProfile, UpdateProfile, DeleteProfile
  • CreateGroup, GetGroup, UpdateGroup, DeleteGroup

TLS client authentication required (see deployment docs)
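
With the provider block above in place, profile and group changes go through the normal Terraform cycle; a short sketch (directory name is illustrative):

cd matchbox-infra/
terraform init    # downloads the matchbox (and ct) providers
terraform plan    # preview the profile/group changes before they are pushed
terraform apply   # creates/updates profiles and groups over the gRPC API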

Configuration Workflow

┌─────────────────────────────────────────────────────────────┐
│ 1. Write Butane configs (YAML)                             │
│    - worker.yaml, controller.yaml                          │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Terraform ct_config transpiles Butane → Ignition        │
│    data "ct_config" "worker" {                             │
│      content = file("worker.yaml")                         │
│      strict  = true                                        │
│    }                                                        │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Terraform creates profiles + groups in Matchbox         │
│    matchbox_profile.worker → gRPC CreateProfile()          │
│    matchbox_group.node1 → gRPC CreateGroup()               │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Machine PXE boots, queries Matchbox                     │
│    GET /ipxe?mac=... → matches group → returns profile     │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Ignition fetches rendered config                        │
│    GET /ignition?mac=... → Matchbox returns Ignition       │
└─────────────────────────────────────────────────────────────┘

Benefits:

  • Rich Terraform templating (loops, conditionals, external data sources)
  • Butane validation before deployment
  • Declarative infrastructure (can terraform plan before apply)
  • Version control workflow (git + CI/CD)

Alternative: Manual FileStore

┌─────────────────────────────────────────────────────────────┐
│ 1. Create profile JSON manually                            │
│    /var/lib/matchbox/profiles/worker.json                  │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Create group JSON manually                              │
│    /var/lib/matchbox/groups/node1.json                     │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Write Ignition/Butane config                            │
│    /var/lib/matchbox/ignition/worker.ign                   │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Restart matchbox (to reload FileStore)                  │
│    systemctl restart matchbox                              │
└─────────────────────────────────────────────────────────────┘

Drawbacks:

  • Manual file management
  • No validation before deployment
  • Requires matchbox restart to pick up changes
  • Error-prone for large fleets
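
For completeness, a small lab can still script the manual flow; a hedged sketch reusing the example profile and group fields from this document (adjust IDs and paths to taste):

# 1. Profile: what to boot / which Ignition file to serve
sudo tee /var/lib/matchbox/profiles/worker.json <<'EOF'
{
  "id": "worker",
  "ignition_id": "worker.ign"
}
EOF

# 2. Group: which machines receive that profile
sudo tee /var/lib/matchbox/groups/node1.json <<'EOF'
{
  "id": "node1-worker",
  "profile": "worker",
  "selector": {"mac": "52:54:00:89:d8:10"},
  "metadata": {"node_name": "worker-01"}
}
EOF

# 3. Restart to reload the FileStore
sudo systemctl restart matchbox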

Storage Backends

FileStore (Default)

Config: -data-path=/var/lib/matchbox

Pros:

  • Simple file-based storage
  • Easy to version control (git)
  • Human-readable JSON

Cons:

  • Requires file system access
  • Manual reload for gRPC-created resources

Custom Store (Extensible)

Matchbox’s Store interface allows custom backends:

type Store interface {
  ProfileGet(id string) (*Profile, error)
  GroupGet(id string) (*Group, error)
  IgnitionGet(name string) (string, error)
  // ... other methods
}

Potential custom stores:

  • etcd backend (for HA Matchbox)
  • Database backend (PostgreSQL, MySQL)
  • S3/object storage backend

Note: Not officially provided by Matchbox project; requires custom implementation

Security Considerations

  1. gRPC API authentication: Requires TLS client certificates

    • ca.crt - CA that signed client certs
    • server.crt/server.key - Server TLS identity
    • client.crt/client.key - Client credentials (Terraform)
  2. HTTP endpoints are read-only: No auth, machines fetch configs

    • Do NOT put secrets in Ignition configs
    • Use external secret stores (Vault, GCP Secret Manager)
    • Reference secrets via Ignition files.source with auth headers
  3. Network segmentation: Matchbox on provisioning VLAN, isolate from production

  4. Config validation: Validate Ignition/Butane before deployment to avoid boot failures

  5. Audit logging: Version control groups/profiles; log gRPC API changes

Operational Tips

  1. Test groups with curl:

    curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
    
  2. List profiles:

    ls -la /var/lib/matchbox/profiles/
    
  3. Validate Butane:

    podman run -i --rm quay.io/coreos/fcct:release --strict < worker.yaml
    
  4. Check group matching:

    # Default group (no selectors)
    curl http://matchbox.example.com:8080/ignition
    
    # Specific machine
    curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&uuid=550e8400-e29b-41d4-a716-446655440000'
    
  5. Backup configs:

    tar -czf matchbox-backup-$(date +%F).tar.gz /var/lib/matchbox/{groups,profiles,ignition}
    

Summary

Matchbox’s configuration model provides:

  • Separation of concerns: Profiles (what) vs Groups (who/where)
  • Flexible matching: Label-based, multi-attribute, custom selectors
  • Template support: Go templates for dynamic configs (but prefer external rendering)
  • API-driven: Terraform integration for GitOps workflows
  • Storage options: FileStore (simple) or custom backends (extensible)
  • OS-agnostic: Works with any Ignition-based distro (FCOS, Flatcar, RHCOS)

Best practice: Use Terraform + external Butane configs for production; manual FileStore for labs/development.

1.5.2 - Deployment Patterns

Matchbox deployment options and operational considerations

Matchbox Deployment Patterns

Analysis of deployment architectures, installation methods, and operational considerations for running Matchbox in production.

Deployment Architectures

Single-Host Deployment

┌─────────────────────────────────────────────────────┐
│           Provisioning Host                         │
│  ┌─────────────┐        ┌─────────────┐            │
│  │  Matchbox   │        │  dnsmasq    │            │
│  │  :8080 HTTP │        │  DHCP/TFTP  │            │
│  │  :8081 gRPC │        │  :67,:69    │            │
│  └─────────────┘        └─────────────┘            │
│         │                      │                    │
│         └──────────┬───────────┘                    │
│                    │                                │
│  /var/lib/matchbox/                                 │
│  ├── groups/                                        │
│  ├── profiles/                                      │
│  ├── ignition/                                      │
│  └── assets/                                        │
└─────────────────────────────────────────────────────┘
              │
              │ Network
              ▼
     ┌──────────────┐
     │ PXE Clients  │
     └──────────────┘

Use case: Lab, development, small deployments (<50 machines)

Pros:

  • Simple setup
  • Single service to manage
  • Minimal resource requirements

Cons:

  • Single point of failure
  • No scalability
  • Downtime during updates

HA Deployment (Multiple Matchbox Instances)

┌─────────────────────────────────────────────────────┐
│              Load Balancer (Ingress/HAProxy)        │
│           :8080 HTTP        :8081 gRPC              │
└─────────────────────────────────────────────────────┘
       │                              │
       ├─────────────┬────────────────┤
       ▼             ▼                ▼
┌──────────┐  ┌──────────┐    ┌──────────┐
│Matchbox 1│  │Matchbox 2│    │Matchbox N│
│ (Pod/VM) │  │ (Pod/VM) │    │ (Pod/VM) │
└──────────┘  └──────────┘    └──────────┘
       │             │                │
       └─────────────┴────────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Shared Storage        │
         │  /var/lib/matchbox     │
         │  (NFS, PV, ConfigMap)  │
         └────────────────────────┘

Use case: Production, datacenter-scale (100+ machines)

Pros:

  • High availability (no single point of failure)
  • Rolling updates (zero downtime)
  • Load distribution

Cons:

  • Complex storage (shared volume or etcd backend)
  • More infrastructure required

Storage options:

  1. Kubernetes PersistentVolume (RWX mode)
  2. NFS share mounted on multiple hosts
  3. Custom etcd-backed Store (requires custom implementation)
  4. Git-sync sidecar (read-only, periodic pull)
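
As one concrete form of option 2, every instance mounts the same export before matchbox starts; a minimal sketch with a placeholder NFS server and export path:

# On each Matchbox host (nfs.example.com:/srv/matchbox is a placeholder)
sudo mkdir -p /var/lib/matchbox
echo 'nfs.example.com:/srv/matchbox  /var/lib/matchbox  nfs  defaults  0 0' | sudo tee -a /etc/fstab
sudo mount /var/lib/matchbox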

Kubernetes Deployment

┌─────────────────────────────────────────────────────┐
│              Ingress Controller                     │
│  matchbox.example.com → Service matchbox:8080       │
│  matchbox-rpc.example.com → Service matchbox:8081   │
└─────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│          Service: matchbox (ClusterIP)              │
│            ports: 8080/TCP, 8081/TCP                │
└─────────────────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│  Pod: matchbox  │     │  Pod: matchbox  │
│  replicas: 2+   │     │  replicas: 2+   │
└─────────────────┘     └─────────────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
┌─────────────────────────────────────────────────────┐
│    PersistentVolumeClaim: matchbox-data             │
│    /var/lib/matchbox (RWX mode)                     │
└─────────────────────────────────────────────────────┘

Manifest structure:

contrib/k8s/
├── matchbox-deployment.yaml  # Deployment + replicas
├── matchbox-service.yaml     # Service (8080, 8081)
├── matchbox-ingress.yaml     # Ingress (HTTP + gRPC TLS)
└── matchbox-pvc.yaml         # PersistentVolumeClaim

Key configurations:

  1. Secret for gRPC TLS:

    kubectl create secret generic matchbox-rpc \
      --from-file=ca.crt \
      --from-file=server.crt \
      --from-file=server.key
    
  2. Ingress for gRPC (TLS passthrough):

    metadata:
      annotations:
        nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    
  3. Volume mount:

    volumes:
      - name: data
        persistentVolumeClaim:
          claimName: matchbox-data
    volumeMounts:
      - name: data
        mountPath: /var/lib/matchbox
    

Use case: Cloud-native deployments, Kubernetes-based infrastructure

Pros:

  • Native Kubernetes primitives (Deployments, Services, Ingress)
  • Rolling updates via Deployment strategy
  • Easy scaling (kubectl scale)
  • Health checks + auto-restart

Cons:

  • Requires RWX PersistentVolume or shared storage
  • Ingress TLS configuration complexity (gRPC passthrough)
  • Cluster dependency (can’t provision cluster bootstrap nodes)

⚠️ Bootstrap problem: Kubernetes-hosted Matchbox can’t PXE boot its own cluster nodes (chicken-and-egg). Use external Matchbox for initial cluster bootstrap, then migrate.

Installation Methods

1. Binary Installation (systemd)

Recommended for: Bare-metal hosts, VMs, traditional Linux servers

Steps:

  1. Download and verify:

    wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz
    wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz.asc
    gpg --verify matchbox-v0.10.0-linux-amd64.tar.gz.asc
    
  2. Extract and install:

    tar xzf matchbox-v0.10.0-linux-amd64.tar.gz
    sudo cp matchbox-v0.10.0-linux-amd64/matchbox /usr/local/bin/
    
  3. Create user and directories:

    sudo useradd -U matchbox
    sudo mkdir -p /var/lib/matchbox/{assets,groups,profiles,ignition}
    sudo chown -R matchbox:matchbox /var/lib/matchbox
    
  4. Install systemd unit:

    sudo cp contrib/systemd/matchbox.service /etc/systemd/system/
    
  5. Configure via systemd dropin:

    sudo systemctl edit matchbox
    
    [Service]
    Environment="MATCHBOX_ADDRESS=0.0.0.0:8080"
    Environment="MATCHBOX_RPC_ADDRESS=0.0.0.0:8081"
    Environment="MATCHBOX_LOG_LEVEL=debug"
    
  6. Start service:

    sudo systemctl daemon-reload
    sudo systemctl start matchbox
    sudo systemctl enable matchbox
    
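
A quick post-install sanity check; the HTTP root should answer with the literal string matchbox (see the monitoring section below):

systemctl status matchbox --no-pager
curl http://127.0.0.1:8080          # expect: matchbox
journalctl -u matchbox -n 50 --no-pager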

Pros:

  • Direct control over service
  • Easy log access (journalctl -u matchbox)
  • Native OS integration

Cons:

  • Manual updates required
  • OS dependency (package compatibility)

2. Container Deployment (Docker/Podman)

Recommended for: Docker hosts, quick testing, immutable infrastructure

Docker:

mkdir -p /var/lib/matchbox/assets
docker run -d --name matchbox \
  --net=host \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug

Podman:

podman run -d --name matchbox \
  --net=host \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -log-level=debug

Volume mounts:

  • /var/lib/matchbox - Data directory (groups, profiles, configs, assets)
  • /etc/matchbox - TLS certificates (ca.crt, server.crt, server.key)

Network mode:

  • --net=host - Required for DHCP/TFTP interaction on same host
  • Bridge mode possible if Matchbox is on separate host from dnsmasq

Pros:

  • Immutable deployments
  • Easy updates (pull new image)
  • Portable across hosts

Cons:

  • Volume management complexity
  • SELinux considerations (:Z flag)

3. Kubernetes Deployment

Recommended for: Kubernetes environments, cloud platforms

Quick start:

# Create TLS secret for gRPC
kubectl create secret generic matchbox-rpc \
  --from-file=ca.crt=~/.matchbox/ca.crt \
  --from-file=server.crt=~/.matchbox/server.crt \
  --from-file=server.key=~/.matchbox/server.key

# Deploy manifests
kubectl apply -R -f contrib/k8s/

# Check status
kubectl get pods -l app=matchbox
kubectl get svc matchbox
kubectl get ingress matchbox matchbox-rpc

Persistence options:

Option 1: emptyDir (ephemeral, dev only):

volumes:
  - name: data
    emptyDir: {}

Option 2: PersistentVolumeClaim (production):

volumes:
  - name: data
    persistentVolumeClaim:
      claimName: matchbox-data

Option 3: ConfigMap (static configs):

volumes:
  - name: groups
    configMap:
      name: matchbox-groups
  - name: profiles
    configMap:
      name: matchbox-profiles

Option 4: Git-sync sidecar (GitOps):

initContainers:
  - name: git-sync
    image: k8s.gcr.io/git-sync:v3.6.3
    env:
      - name: GIT_SYNC_REPO
        value: https://github.com/example/matchbox-configs
      - name: GIT_SYNC_DEST
        value: /var/lib/matchbox
    volumeMounts:
      - name: data
        mountPath: /var/lib/matchbox

Pros:

  • Native k8s features (scaling, health checks, rolling updates)
  • Ingress integration
  • GitOps workflows

Cons:

  • Complexity (Ingress, PVC, TLS)
  • Can’t bootstrap own cluster

Network Boot Environment Setup

Matchbox requires separate DHCP/TFTP/DNS services. Options:

Option 1: dnsmasq Container (Quickest)

Use case: Lab, testing, environments without existing DHCP

Full DHCP + TFTP + DNS:

docker run -d --name dnsmasq \
  --cap-add=NET_ADMIN \
  --net=host \
  quay.io/poseidon/dnsmasq:latest \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254,30m \
  --enable-tftp \
  --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries \
  --log-dhcp

Proxy DHCP (alongside existing DHCP):

docker run -d --name dnsmasq \
  --cap-add=NET_ADMIN \
  --net=host \
  quay.io/poseidon/dnsmasq:latest \
  -d -q \
  --dhcp-range=192.168.1.1,proxy,255.255.255.0 \
  --enable-tftp \
  --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
  --pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
  --log-queries \
  --log-dhcp

Included files: undionly.kpxe, ipxe.efi, grub.efi (bundled in image)

Option 2: Existing DHCP/TFTP Infrastructure

Use case: Enterprise environments with network admin policies

Required DHCP options (ISC DHCP example):

subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.10 192.168.1.250;
  
  # BIOS clients
  if option architecture-type = 00:00 {
    filename "undionly.kpxe";
  }
  # UEFI clients
  elsif option architecture-type = 00:09 {
    filename "ipxe.efi";
  }
  # iPXE clients
  elsif exists user-class and option user-class = "iPXE" {
    filename "http://matchbox.example.com:8080/boot.ipxe";
  }
  
  next-server 192.168.1.100;  # TFTP server IP
}

TFTP files (place in the TFTP root): undionly.kpxe, ipxe.efi, and grub.efi (the same bootloaders bundled in the poseidon/dnsmasq image)

Option 3: iPXE-only (No PXE Chainload)

Use case: Modern hardware with native iPXE firmware

DHCP config (simpler):

filename "http://matchbox.example.com:8080/boot.ipxe";

No TFTP server needed (iPXE fetches directly via HTTP)

Limitation: Doesn’t support legacy BIOS with basic PXE ROM

TLS Certificate Setup

gRPC API requires TLS client certificates for authentication.

Option 1: Provided cert-gen Script

cd scripts/tls
export SAN=DNS.1:matchbox.example.com,IP.1:192.168.1.100
./cert-gen

Generates:

  • ca.crt - Self-signed CA
  • server.crt, server.key - Server credentials
  • client.crt, client.key - Client credentials (for Terraform)

Install server certs:

sudo mkdir -p /etc/matchbox
sudo cp ca.crt server.crt server.key /etc/matchbox/
sudo chown -R matchbox:matchbox /etc/matchbox

Save client certs for Terraform:

mkdir -p ~/.matchbox
cp client.crt client.key ca.crt ~/.matchbox/
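
Before distributing the generated certificates, it may be worth confirming the SAN and validity dates landed in the server certificate; a short sketch using the paths above:

openssl x509 -in /etc/matchbox/server.crt -noout -subject -dates
openssl x509 -in /etc/matchbox/server.crt -noout -text | grep -A1 'Subject Alternative Name'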

Option 2: Corporate PKI

Preferred for production: Use organization’s certificate authority

Requirements:

  • Server cert with SAN: DNS:matchbox.example.com
  • Client cert issued by same CA
  • CA cert for validation

Matchbox flags:

-ca-file=/etc/matchbox/ca.crt
-cert-file=/etc/matchbox/server.crt
-key-file=/etc/matchbox/server.key

Terraform provider config:

provider "matchbox" {
  endpoint    = "matchbox.example.com:8081"
  client_cert = file("/path/to/client.crt")
  client_key  = file("/path/to/client.key")
  ca          = file("/path/to/ca.crt")
}

Option 3: Let’s Encrypt (HTTP API only)

Note: gRPC requires client cert auth (incompatible with Let’s Encrypt)

Use case: TLS for HTTP endpoints only (read-only API)

Matchbox flags:

-web-ssl=true
-web-cert-file=/etc/letsencrypt/live/matchbox.example.com/fullchain.pem
-web-key-file=/etc/letsencrypt/live/matchbox.example.com/privkey.pem

Limitation: Still need self-signed certs for gRPC API

Configuration Flags

Core Flags

Flag | Default | Description
-address | 127.0.0.1:8080 | HTTP API listen address
-rpc-address | (empty) | gRPC API listen address (empty = disabled)
-data-path | /var/lib/matchbox | Data directory (FileStore)
-assets-path | /var/lib/matchbox/assets | Static assets directory
-log-level | info | Logging level (debug, info, warn, error)

TLS Flags (gRPC)

Flag | Default | Description
-ca-file | /etc/matchbox/ca.crt | CA certificate for client verification
-cert-file | /etc/matchbox/server.crt | Server TLS certificate
-key-file | /etc/matchbox/server.key | Server TLS private key

TLS Flags (HTTP, optional)

Flag | Default | Description
-web-ssl | false | Enable TLS for HTTP API
-web-cert-file | (empty) | HTTP server TLS certificate
-web-key-file | (empty) | HTTP server TLS private key

Environment Variables

All flags can be set via environment variables with MATCHBOX_ prefix:

export MATCHBOX_ADDRESS=0.0.0.0:8080
export MATCHBOX_RPC_ADDRESS=0.0.0.0:8081
export MATCHBOX_LOG_LEVEL=debug
export MATCHBOX_DATA_PATH=/custom/path

Operational Considerations

Firewall Configuration

Matchbox host:

firewall-cmd --permanent --add-port=8080/tcp  # HTTP API
firewall-cmd --permanent --add-port=8081/tcp  # gRPC API
firewall-cmd --reload

dnsmasq host (if separate):

firewall-cmd --permanent --add-service=dhcp
firewall-cmd --permanent --add-service=tftp
firewall-cmd --permanent --add-service=dns  # optional
firewall-cmd --reload

Monitoring

Health check endpoints:

# HTTP API
curl http://matchbox.example.com:8080
# Should return: matchbox

# gRPC API
openssl s_client -connect matchbox.example.com:8081 \
  -CAfile ~/.matchbox/ca.crt \
  -cert ~/.matchbox/client.crt \
  -key ~/.matchbox/client.key

Prometheus metrics: not built in; consider fronting Matchbox with a reverse proxy (e.g., nginx) that exposes metrics
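
In the absence of native metrics, a scheduled external probe of the HTTP root is a common stopgap; a minimal sketch (run from cron or a systemd timer; the hostname is the example one used throughout):

#!/bin/bash
# Log a warning and exit non-zero if the Matchbox HTTP API stops answering
if ! curl -fsS --max-time 5 http://matchbox.example.com:8080 >/dev/null; then
  logger -t matchbox-probe "Matchbox HTTP API health check failed"
  exit 1
fi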

Logs (systemd):

journalctl -u matchbox -f

Logs (container):

docker logs -f matchbox

Backup Strategy

What to backup:

  1. /var/lib/matchbox/{groups,profiles,ignition} - Configs
  2. /etc/matchbox/*.{crt,key} - TLS certificates
  3. Terraform state (if using Terraform provider)

Backup command:

tar -czf matchbox-backup-$(date +%F).tar.gz \
  /var/lib/matchbox/{groups,profiles,ignition} \
  /etc/matchbox

Restore:

tar -xzf matchbox-backup-YYYY-MM-DD.tar.gz -C /
sudo chown -R matchbox:matchbox /var/lib/matchbox
sudo systemctl restart matchbox

GitOps approach: Store configs in git repository for versioning and auditability

Updates

Binary deployment:

# Download new version
wget https://github.com/poseidon/matchbox/releases/download/vX.Y.Z/matchbox-vX.Y.Z-linux-amd64.tar.gz
tar xzf matchbox-vX.Y.Z-linux-amd64.tar.gz

# Replace binary
sudo systemctl stop matchbox
sudo cp matchbox-vX.Y.Z-linux-amd64/matchbox /usr/local/bin/
sudo systemctl start matchbox

Container deployment:

docker pull quay.io/poseidon/matchbox:vX.Y.Z
docker stop matchbox
docker rm matchbox
docker run -d --name matchbox ... quay.io/poseidon/matchbox:vX.Y.Z ...

Kubernetes deployment:

kubectl set image deployment/matchbox matchbox=quay.io/poseidon/matchbox:vX.Y.Z
kubectl rollout status deployment/matchbox

Scaling Considerations

Vertical scaling (single instance):

  • CPU: Minimal (config rendering is lightweight)
  • Memory: ~50MB base + asset cache
  • Disk: Depends on cached assets (100MB - 10GB+)

Horizontal scaling (multiple instances):

  • Stateless HTTP API (load balance round-robin)
  • Shared storage required (RWX PV, NFS, or custom backend)
  • gRPC API can be load-balanced with gRPC-aware LB

Asset serving optimization:

  • Use CDN or cache proxy for remote assets
  • Local asset caching for <100 machines
  • Dedicated HTTP server (nginx) for large deployments (1000+ machines)

Security Best Practices

  1. Don’t store secrets in Ignition configs

    • Use Ignition files.source with auth headers to fetch from Vault
    • Or provision minimal config, fetch secrets post-boot
  2. Network segmentation

    • Provision VLAN isolated from production
    • Firewall rules: only allow provisioning traffic
  3. gRPC API access control

    • Client cert authentication (mandatory)
    • Restrict cert issuance to authorized personnel/systems
    • Rotate certs periodically
  4. Audit logging

    • Version control groups/profiles (git)
    • Log gRPC API changes (Terraform state tracking)
    • Monitor HTTP endpoint access
  5. Validate configs before deployment

    • fcct --strict for Butane configs
    • Terraform plan before apply
    • Test in dev environment first

Troubleshooting

Common Issues

1. Machines not PXE booting:

# Check DHCP responses
tcpdump -i eth0 port 67 and port 68

# Verify TFTP files
ls -la /var/lib/tftpboot/
curl tftp://192.168.1.100/undionly.kpxe

# Check Matchbox accessibility
curl http://matchbox.example.com:8080/boot.ipxe

2. 404 Not Found on /ignition:

# Test group matching
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'

# Check group exists
ls -la /var/lib/matchbox/groups/

# Check profile referenced by group exists
ls -la /var/lib/matchbox/profiles/

# Verify ignition_id file exists
ls -la /var/lib/matchbox/ignition/

3. gRPC connection refused (Terraform):

# Test TLS connection
openssl s_client -connect matchbox.example.com:8081 \
  -CAfile ~/.matchbox/ca.crt \
  -cert ~/.matchbox/client.crt \
  -key ~/.matchbox/client.key

# Check Matchbox gRPC is listening
sudo ss -tlnp | grep 8081

# Verify firewall
sudo firewall-cmd --list-ports

4. Ignition config validation errors:

# Validate Butane locally
podman run -i --rm quay.io/coreos/fcct:release --strict < config.yaml

# Fetch rendered Ignition
curl 'http://matchbox.example.com:8080/ignition?mac=...' | jq .

# Validate Ignition spec
curl 'http://matchbox.example.com:8080/ignition?mac=...' | \
  podman run -i --rm quay.io/coreos/ignition-validate:latest

Summary

Matchbox deployment considerations:

  • Architecture: Single-host (dev/lab) vs HA (production) vs Kubernetes
  • Installation: Binary (systemd), container (Docker/Podman), or Kubernetes manifests
  • Network boot: dnsmasq container (quick), existing infrastructure (enterprise), or iPXE-only (modern)
  • TLS: Self-signed (dev), corporate PKI (production), Let’s Encrypt (HTTP only)
  • Scaling: Vertical (simple) vs horizontal (requires shared storage)
  • Security: Client cert auth, network segmentation, no secrets in configs
  • Operations: Backup configs, GitOps workflow, monitoring/logging

Recommendation for production:

  • HA deployment (2+ instances) with load balancer
  • Shared storage (NFS or RWX PV on Kubernetes)
  • Corporate PKI for TLS certificates
  • GitOps workflow (Terraform + git-controlled configs)
  • Network segmentation (dedicated provisioning VLAN)
  • Prometheus/Grafana monitoring

1.5.3 - Network Boot Support

Detailed analysis of Matchbox’s network boot capabilities

Network Boot Support in Matchbox

Matchbox provides comprehensive network boot support for bare-metal provisioning, supporting multiple boot firmware types and protocols.

Overview

Matchbox serves as an HTTP entrypoint for network-booted machines but does not implement DHCP, TFTP, or DNS services itself. Instead, it integrates with existing network infrastructure (or companion services like dnsmasq) to provide a complete PXE boot solution.

Boot Protocol Support

1. PXE (Preboot Execution Environment)

Legacy BIOS support via chainloading to iPXE:

Machine BIOS → DHCP (gets TFTP server) → TFTP (gets undionly.kpxe) 
→ iPXE firmware → HTTP (Matchbox /boot.ipxe)

Key characteristics:

  • Requires TFTP server to serve undionly.kpxe (iPXE bootloader)
  • Chainloads from legacy PXE ROM to modern iPXE
  • Supports older hardware with basic PXE firmware
  • TFTP only used for initial iPXE bootstrap; subsequent downloads via HTTP

2. iPXE (Enhanced PXE)

Primary boot method supported by Matchbox:

iPXE Client → DHCP (gets boot script URL) → HTTP (Matchbox endpoints)
→ Kernel/initrd download → Boot with Ignition config

Endpoints served by Matchbox:

Endpoint | Purpose
/boot.ipxe | Static script that gathers machine attributes (UUID, MAC, hostname, serial)
/ipxe?<labels> | Rendered iPXE script with kernel, initrd, and boot args for matched machine
/assets/ | Optional local caching of kernel/initrd images

Example iPXE flow:

  1. Machine boots with iPXE firmware
  2. DHCP response points to http://matchbox.example.com:8080/boot.ipxe
  3. iPXE fetches /boot.ipxe:
    #!ipxe
    chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&domain=${domain}&hostname=${hostname}&serial=${serial}
    
  4. iPXE makes request to /ipxe?uuid=...&mac=... with machine attributes
  5. Matchbox matches machine to group/profile and renders iPXE script:
    #!ipxe
    kernel /assets/coreos/VERSION/coreos_production_pxe.vmlinuz \
      coreos.config.url=http://matchbox.foo:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp} \
      coreos.first_boot=1
    initrd /assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz
    boot
    

Advantages:

  • HTTP downloads (faster than TFTP)
  • Scriptable boot logic
  • Can fetch configs from HTTP endpoints
  • Supports HTTPS (if compiled with TLS support)

3. GRUB2

UEFI firmware support:

UEFI Firmware → DHCP (gets GRUB bootloader) → TFTP (grub.efi)
→ GRUB → HTTP (Matchbox /grub endpoint)

Matchbox endpoint: /grub?<labels>

Example GRUB config rendered by Matchbox:

default=0
timeout=1
menuentry "CoreOS" {
  echo "Loading kernel"
  linuxefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe.vmlinuz" \
    "coreos.config.url=http://matchbox.foo:8080/ignition" "coreos.first_boot"
  echo "Loading initrd"
  initrdefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz"
}

Use case:

  • UEFI systems that prefer GRUB over iPXE
  • Environments with existing GRUB network boot infrastructure

4. PXELINUX (Legacy, via TFTP)

While not a primary Matchbox target, PXELINUX clients can be configured to chainload iPXE:

# /var/lib/tftpboot/pxelinux.cfg/default
timeout 10
default iPXE
LABEL iPXE
KERNEL ipxe.lkrn
APPEND dhcp && chain http://matchbox.example.com:8080/boot.ipxe

DHCP Configuration Patterns

Matchbox supports two DHCP deployment models:

Pattern 1: PXE-Enabled DHCP

Full DHCP server provides IP allocation + PXE boot options.

Example dnsmasq configuration:

dhcp-range=192.168.1.1,192.168.1.254,30m
enable-tftp
tftp-root=/var/lib/tftpboot

# Legacy BIOS → chainload to iPXE
dhcp-match=set:bios,option:client-arch,0
dhcp-boot=tag:bios,undionly.kpxe

# UEFI → iPXE
dhcp-match=set:efi32,option:client-arch,6
dhcp-boot=tag:efi32,ipxe.efi
dhcp-match=set:efi64,option:client-arch,9
dhcp-boot=tag:efi64,ipxe.efi

# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe

# DNS for Matchbox
address=/matchbox.example.com/192.168.1.100

Client architecture detection:

  • Option 93 (client-arch): Identifies BIOS (0), UEFI32 (6), UEFI64 (9)
  • User class: Detects iPXE clients to skip TFTP chainloading

Pattern 2: Proxy DHCP

Runs alongside existing DHCP server; provides only boot options (no IP allocation).

Example dnsmasq proxy-DHCP:

dhcp-range=192.168.1.1,proxy,255.255.255.0
enable-tftp
tftp-root=/var/lib/tftpboot

# Chainload legacy PXE to iPXE
pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe

Benefits:

  • Non-invasive: doesn’t replace existing DHCP
  • PXE clients receive merged responses from both DHCP servers
  • Ideal for environments where main DHCP cannot be modified

Network Boot Flow (Complete)

Scenario: BIOS machine with legacy PXE firmware

┌──────────────────────────────────────────────────────────────────┐
│ 1. Machine powers on, BIOS set to network boot                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. NIC PXE firmware broadcasts DHCPDISCOVER (PXEClient)          │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. DHCP/proxyDHCP responds with:                                 │
│    - IP address (if full DHCP)                                   │
│    - Next-server: TFTP server IP                                 │
│    - Filename: undionly.kpxe (based on arch=0)                   │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. PXE firmware downloads undionly.kpxe via TFTP                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. Execute iPXE (undionly.kpxe)                                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. iPXE requests DHCP again, identifies as iPXE (user-class)     │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. DHCP responds with boot URL (not TFTP):                       │
│    http://matchbox.example.com:8080/boot.ipxe                    │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 8. iPXE fetches /boot.ipxe via HTTP:                             │
│    #!ipxe                                                        │
│    chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&...                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 9. iPXE chains to /ipxe?uuid=XXX&mac=YYY (introspected labels)   │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 10. Matchbox matches machine to group/profile                    │
│     - Finds most specific group matching labels                  │
│     - Retrieves profile (kernel, initrd, args, configs)          │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 11. Matchbox renders iPXE script with:                           │
│     - kernel URL (local asset or remote HTTPS)                   │
│     - initrd URL                                                 │
│     - kernel args (including ignition.config.url)                │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 12. iPXE downloads kernel + initrd (HTTP/HTTPS)                  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 13. iPXE boots kernel with specified args                        │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 14. Fedora CoreOS/Flatcar boots, Ignition runs                   │
│     - Fetches /ignition?uuid=XXX&mac=YYY from Matchbox           │
│     - Matchbox renders Ignition config with group metadata       │
│     - Ignition partitions disk, writes files, creates users      │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│ 15. System reboots (if disk install), boots from disk            │
└──────────────────────────────────────────────────────────────────┘

Asset Serving

Matchbox can serve static assets (kernel, initrd images) from a local directory to reduce bandwidth and increase speed:

Asset directory structure:

/var/lib/matchbox/assets/
├── fedora-coreos/
│   └── 36.20220906.3.2/
│       ├── fedora-coreos-36.20220906.3.2-live-kernel-x86_64
│       ├── fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img
│       └── fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img
└── flatcar/
    └── 3227.2.0/
        ├── flatcar_production_pxe.vmlinuz
        ├── flatcar_production_pxe_image.cpio.gz
        └── version.txt

HTTP endpoint: http://matchbox.example.com:8080/assets/

Scripts provided:

  • scripts/get-fedora-coreos - Download/verify Fedora CoreOS images
  • scripts/get-flatcar - Download/verify Flatcar Linux images
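
If the helper scripts are not used, assets can also be staged by hand under the assets directory; a hedged sketch following the Fedora CoreOS release URL pattern shown further below (version and stream are illustrative):

VER=36.20220906.3.2
BASE=https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/${VER}/x86_64
DEST=/var/lib/matchbox/assets/fedora-coreos/${VER}
sudo mkdir -p "${DEST}"
for f in live-kernel-x86_64 live-initramfs.x86_64.img live-rootfs.x86_64.img; do
  sudo curl -L -o "${DEST}/fedora-coreos-${VER}-${f}" "${BASE}/fedora-coreos-${VER}-${f}"
done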

Profile reference:

{
  "boot": {
    "kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
    "initrd": ["--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"]
  }
}

Alternative: Profiles can reference remote HTTPS URLs (requires iPXE compiled with TLS support):

{
  "kernel": "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-kernel-x86_64"
}

OS Support

Fedora CoreOS

Boot types:

  1. Live PXE (RAM-only, ephemeral)
  2. Install to disk (persistent, recommended)

Required kernel args:

  • coreos.inst.install_dev=/dev/sda - Target disk for install
  • coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Provisioning config
  • coreos.live.rootfs_url=... - Root filesystem image

Ignition fetch: During first boot, ignition.service fetches config from Matchbox

Flatcar Linux

Boot types:

  1. Live PXE (RAM-only)
  2. Install to disk

Required kernel args:

  • flatcar.first_boot=yes - Marks first boot
  • flatcar.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Ignition config URL
  • flatcar.autologin - Auto-login to console (optional, dev/debug)

Ignition support: Flatcar uses Ignition v3.x for provisioning

RHEL CoreOS

Supported as it uses Ignition like Fedora CoreOS. Requires Red Hat-specific image sources.

Machine Matching & Labels

Matchbox matches machines to profiles using labels extracted during boot:

Reserved Label Selectors

Label | Source | Example | Normalized
uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase
mac | NIC MAC address | 52:54:00:89:d8:10 | Normalized to colons
hostname | Network boot program | node1.example.com | As-is
serial | Hardware serial | VMware-42 1a... | As-is

Custom Labels

Groups can match on arbitrary labels passed as query params:

/ipxe?mac=52:54:00:89:d8:10&region=us-west&env=prod

Matching precedence: Most specific group wins (most selector matches)

Firmware Compatibility

Firmware Type | Client Arch | Boot File | Protocol | Matchbox Support
BIOS (legacy PXE) | 0 | undionly.kpxe → iPXE | TFTP → HTTP | ✅ Via chainload
UEFI 32-bit | 6 | ipxe.efi | TFTP → HTTP | ✅
UEFI (BIOS compat) | 7 | ipxe.efi | TFTP → HTTP | ✅
UEFI 64-bit | 9 | ipxe.efi | TFTP → HTTP | ✅
Native iPXE | - | N/A | HTTP | ✅ Direct
GRUB (UEFI) | - | grub.efi | TFTP → HTTP | ✅ /grub endpoint

Network Requirements

Firewall rules on Matchbox host:

# HTTP API (read-only)
firewall-cmd --add-port=8080/tcp --permanent

# gRPC API (authenticated, Terraform)
firewall-cmd --add-port=8081/tcp --permanent

DNS requirement:

  • matchbox.example.com must resolve to Matchbox server IP
  • Can be configured in dnsmasq, corporate DNS, or /etc/hosts on DHCP server

DHCP/TFTP host (if using dnsmasq):

firewall-cmd --add-service=dhcp --permanent
firewall-cmd --add-service=tftp --permanent
firewall-cmd --add-service=dns --permanent  # optional

Troubleshooting Tips

  1. Verify Matchbox endpoints:

    curl http://matchbox.example.com:8080
    # Should return: matchbox
    
    curl http://matchbox.example.com:8080/boot.ipxe
    # Should return iPXE script
    
  2. Test machine matching:

    curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10'
    # Should return rendered iPXE script with kernel/initrd
    
  3. Check TFTP files:

    ls -la /var/lib/tftpboot/
    # Should contain: undionly.kpxe, ipxe.efi, grub.efi
    
  4. Verify DHCP responses:

    tcpdump -i eth0 -n port 67 and port 68
    # Watch for DHCP offers with PXE options
    
  5. iPXE console debugging:

    • Press Ctrl+B during iPXE boot to enter console
    • Commands: dhcp, ifstat, show net0/ip, chain http://...

Limitations

  1. HTTPS support: iPXE must be compiled with crypto support (larger binary, ~80KB vs ~45KB)
  2. TFTP dependency: Legacy PXE requires TFTP for initial chainload (can’t skip)
  3. No DHCP/TFTP built-in: Must use external services or dnsmasq container
  4. Boot firmware variations: Some vendor PXE implementations have quirks
  5. SecureBoot: iPXE and GRUB must be signed (or SecureBoot disabled)

Reference Implementation: dnsmasq Container

Matchbox project provides quay.io/poseidon/dnsmasq with:

  • Pre-configured DHCP/TFTP/DNS service
  • Bundled ipxe.efi, undionly.kpxe, grub.efi
  • Example configs for PXE-DHCP and proxy-DHCP modes

Quick start (full DHCP):

docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries --log-dhcp

Quick start (proxy-DHCP):

docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.1,proxy,255.255.255.0 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
  --pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
  --log-queries --log-dhcp

Summary

Matchbox provides robust network boot support through:

  • Protocol flexibility: iPXE (primary), GRUB2, legacy PXE (via chainload)
  • Firmware compatibility: BIOS and UEFI
  • Modern approach: HTTP-based with optional local asset caching
  • Clean separation: Matchbox handles config rendering; external services handle DHCP/TFTP
  • Production-ready: Used by Typhoon Kubernetes distributions for bare-metal provisioning

1.5.4 - Use Case Evaluation

Evaluation of Matchbox for specific use cases and comparison with alternatives

Matchbox Use Case Evaluation

Analysis of Matchbox’s suitability for various use cases, strengths, limitations, and comparison with alternative provisioning solutions.

Use Case Fit Analysis

✅ Ideal Use Cases

1. Bare-Metal Kubernetes Clusters

Scenario: Provisioning 10-1000 physical servers for Kubernetes nodes

Why Matchbox Excels:

  • Ignition-native (perfect for Fedora CoreOS/Flatcar)
  • Declarative machine provisioning via Terraform
  • Label-based matching (region, role, hardware type)
  • Integration with Typhoon Kubernetes distribution
  • Minimal OS surface (immutable, container-optimized)

Example workflow:

resource "matchbox_profile" "k8s_controller" {
  name   = "k8s-controller"
  kernel = "/assets/fedora-coreos/.../kernel"
  raw_ignition = data.ct_config.controller.rendered
}

resource "matchbox_group" "controllers" {
  profile = matchbox_profile.k8s_controller.name
  selector = {
    role = "controller"
  }
}

Alternatives considered:

  • Cloud-init + netboot.xyz: Less declarative, no native Ignition support
  • Foreman: Heavier, more complex for container-centric workloads
  • Metal³: Kubernetes-native but requires existing cluster

Verdict: ⭐⭐⭐⭐⭐ Matchbox is purpose-built for this


2. Lab/Development Environments

Scenario: Rapid PXE boot testing with QEMU/KVM VMs or homelab servers

Why Matchbox Excels:

  • Quick setup (binary + dnsmasq container)
  • No DHCP infrastructure required (proxy-DHCP mode)
  • Localhost deployment (no external dependencies)
  • Fast iteration (change configs, re-PXE)
  • Included examples and scripts

Example setup:

# Start Matchbox locally
docker run -d --net=host -v /var/lib/matchbox:/var/lib/matchbox \
  quay.io/poseidon/matchbox:latest -address=0.0.0.0:8080

# Start dnsmasq on same host
docker run -d --net=host --cap-add=NET_ADMIN \
  quay.io/poseidon/dnsmasq ...

Alternatives considered:

  • netboot.xyz: Great for manual OS selection, no automation
  • PiXE server: Simpler but less flexible matching logic
  • Manual iPXE scripts: No dynamic matching, manual maintenance

Verdict: ⭐⭐⭐⭐⭐ Minimal setup, maximum flexibility


3. Edge/Remote Site Provisioning

Scenario: Provision machines at 10+ remote datacenters or edge locations

Why Matchbox Excels:

  • Lightweight (single binary, ~20MB)
  • Declarative region-based matching
  • Centralized config management (Terraform)
  • Can run on minimal hardware (ARM support)
  • HTTP-based (works over WAN with reverse proxy)

Architecture:

Central Matchbox (via Terraform)
  ↓ gRPC API
Regional Matchbox Instances (read-only cache)
  ↓ HTTP
Edge Machines (PXE boot)

Label-based routing:

{
  "selector": {
    "region": "us-west",
    "site": "pdx-1"
  },
  "metadata": {
    "ntp_servers": ["10.100.1.1", "10.100.1.2"]
  }
}

Alternatives considered:

  • Foreman: Requires more resources per site
  • Ansible + netboot: No declarative PXE boot, post-install only
  • Cloud-init datasources: Requires cloud metadata service per site

Verdict: ⭐⭐⭐⭐☆ Good fit, but consider caching strategy for WAN


⚠️ Moderate Fit Use Cases

4. Multi-Tenant Bare-Metal Cloud

Scenario: Provide bare-metal-as-a-service to multiple customers

Matchbox challenges:

  • No built-in multi-tenancy (single namespace)
  • No RBAC (gRPC API is all-or-nothing with client certs)
  • No customer self-service portal

Workarounds:

  • Deploy separate Matchbox per tenant (isolation via separate instances)
  • Proxy gRPC API with custom RBAC layer
  • Use group selectors with customer IDs

Better alternatives:

  • Metal³ (Kubernetes-native, better multi-tenancy)
  • OpenStack Ironic (purpose-built for bare-metal cloud)
  • MAAS (Ubuntu-specific, has RBAC)

Verdict: ⭐⭐☆☆☆ Possible but architecturally challenging


5. Heterogeneous OS Provisioning

Scenario: Need to provision Fedora CoreOS, Ubuntu, RHEL, Windows

Matchbox challenges:

  • Designed for Ignition-based OSes (FCOS, Flatcar, RHCOS)
  • No native support for Kickstart (RHEL/CentOS)
  • No support for Preseed (Ubuntu/Debian)
  • No Windows unattend.xml support

What works:

  • Fedora CoreOS ✅
  • Flatcar Linux ✅
  • RHEL CoreOS ✅
  • Container Linux (deprecated but supported) ✅

What requires workarounds:

  • RHEL/CentOS: Possible via generic configs + Kickstart URLs, but not native
  • Ubuntu: Can PXE boot and point to autoinstall ISO, but loses Matchbox templating benefits
  • Debian: Similar to Ubuntu
  • Windows: Not supported (different PXE boot mechanisms)

Better alternatives for heterogeneous environments:

  • Foreman (supports Kickstart, Preseed, unattend.xml)
  • MAAS (Ubuntu-centric but extensible)
  • Cobbler (older but supports many OS types)

Verdict: ⭐⭐☆☆☆ Stick to Ignition-based OSes or use different tool


❌ Poor Fit Use Cases

6. Windows PXE Boot

Why Matchbox doesn’t fit:

  • No WinPE support
  • No unattend.xml rendering
  • Different PXE boot chain (WDS/SCCM model)

Recommendation: Use Microsoft WDS or SCCM

Verdict: ⭐☆☆☆☆ Not designed for this


7. BIOS/Firmware Updates

Why Matchbox doesn’t fit:

  • Focused on OS provisioning, not firmware
  • No vendor-specific tooling (Dell iDRAC, HP iLO integration)

Recommendation: Use vendor tools or Ansible with ipmi/redfish modules

Verdict: ⭐☆☆☆☆ Out of scope


Strengths

1. Ignition-First Design

  • Native support for modern immutable OSes
  • Declarative, atomic provisioning (no config drift)
  • First-boot partition/filesystem setup

2. Label-Based Matching

  • Flexible machine classification (MAC, UUID, region, role, custom)
  • Most-specific-match algorithm (override defaults per machine)
  • Query params for dynamic attributes

3. Terraform Integration

  • Declarative infrastructure as code
  • Plan before apply (preview changes)
  • State tracking for auditability
  • Rich templating (ct_config provider for Butane)

4. Minimal Dependencies

  • Single static binary (~20MB)
  • No database required (FileStore default)
  • No built-in DHCP/TFTP (separation of concerns)
  • Container-ready (OCI image available)

5. HTTP-Centric

  • Faster downloads than TFTP (iPXE via HTTP)
  • Proxy/CDN friendly for asset distribution
  • Standard web tooling (curl, load balancers, Ingress)

6. Production-Ready

  • Used by Typhoon Kubernetes (battle-tested)
  • Clear upgrade path (SemVer releases)
  • OpenPGP signature support for config integrity

Limitations

1. No Multi-Tenancy

  • Single namespace (all groups/profiles global)
  • No RBAC on gRPC API (client cert = full access)
  • Requires separate instances per tenant

2. Ignition-Only Focus

  • Cloud-Config deprecated (legacy support only)
  • No native Kickstart/Preseed/unattend.xml
  • Limits OS choice to CoreOS family

3. Storage Constraints

  • FileStore doesn’t scale to 10,000+ profiles
  • No built-in HA storage (requires NFS or custom backend)
  • Kubernetes deployment needs RWX PersistentVolume

4. No Machine Discovery

  • Doesn’t detect new machines (passive service)
  • No inventory management (use external CMDB)
  • No hardware introspection (use Ironic for that)

5. Limited Observability

  • No built-in metrics (Prometheus integration requires reverse proxy)
  • Logs are minimal (request logging only)
  • No audit trail for gRPC API changes (use Terraform state)

6. TFTP Still Required

  • Legacy BIOS PXE needs TFTP for chainloading to iPXE
  • Can’t fully eliminate TFTP unless all machines have native iPXE

Comparison with Alternatives

vs. Foreman

Feature | Matchbox | Foreman
OS Support | Ignition-based | Kickstart, Preseed, AutoYaST, etc.
Complexity | Low (single binary) | High (Rails app, DB, Puppet/Ansible)
Config Model | Declarative (Ignition) | Imperative (post-install scripts)
API | HTTP + gRPC | REST API
UI | None (API-only) | Full web UI
Terraform | Native provider | Community modules
Use Case | Container-centric infra | Traditional Linux servers

When to choose Matchbox: CoreOS-based Kubernetes clusters, minimal infrastructure
When to choose Foreman: Heterogeneous OS, need web UI, traditional config mgmt


vs. Metal³

Feature | Matchbox | Metal³
Platform | Standalone | Kubernetes-native (operator)
Bootstrap | Can bootstrap k8s cluster | Needs existing k8s cluster
Machine Lifecycle | Provision only | Provision + decommission + reprovision
Hardware Introspection | No (labels passed manually) | Yes (via Ironic)
Multi-tenancy | No | Yes (via k8s namespaces)
Complexity | Low | High (requires Ironic, DHCP, etc.)

When to choose Matchbox: Greenfield bare-metal, no existing k8s
When to choose Metal³: Existing k8s, need hardware mgmt lifecycle


vs. Cobbler

Feature | Matchbox | Cobbler
Age | Modern (2016+) | Legacy (2008+)
Config Format | Ignition (declarative) | Kickstart/Preseed (imperative)
Templating | Go templates (minimal) | Cheetah templates (extensive)
Language | Go (static binary) | Python (requires interpreter)
DHCP Management | External | Can manage DHCP
Maintenance | Active (Poseidon) | Low activity

When to choose Matchbox: Modern immutable OSes, container workloads
When to choose Cobbler: Legacy infra, need DHCP management, heterogeneous OS


vs. MAAS (Ubuntu)

Feature | Matchbox | MAAS
OS Support | CoreOS family | Ubuntu (primary), others (limited)
IPAM | No (external DHCP) | Built-in IPAM
Power Mgmt | No (manual or scripts) | Built-in (IPMI, AMT, etc.)
UI | No | Full web UI
Declarative | Yes (Terraform) | Limited (CLI mostly)
Cloud Integration | No | Yes (libvirt, LXD, VM hosts)

When to choose Matchbox: Non-Ubuntu, Kubernetes, minimal dependencies
When to choose MAAS: Ubuntu-centric, need power mgmt, cloud integration


vs. netboot.xyz

Feature | Matchbox | netboot.xyz
Purpose | Automated provisioning | Manual OS selection menu
Automation | Full (API-driven) | None (interactive menu)
Customization | Per-machine configs | Global menu
Ignition | Native support | No
Complexity | Medium | Very low

When to choose Matchbox: Automated fleet provisioning
When to choose netboot.xyz: Ad-hoc OS installation, homelab


Decision Matrix

Use this table to evaluate Matchbox for your use case:

Requirement | Weight | Matchbox Score | Notes
Ignition/CoreOS support | High | ⭐⭐⭐⭐⭐ | Native, first-class
Heterogeneous OS | High | ⭐⭐☆☆☆ | Limited to Ignition OSes
Declarative provisioning | Medium | ⭐⭐⭐⭐⭐ | Terraform native
Multi-tenancy | Medium | ⭐☆☆☆☆ | Requires separate instances
Web UI | Medium | ☆☆☆☆☆ | No UI (API-only)
Ease of deployment | Medium | ⭐⭐⭐⭐☆ | Binary or container, minimal deps
Scalability | Medium | ⭐⭐⭐☆☆ | FileStore limits, need shared storage for HA
Hardware mgmt | Low | ☆☆☆☆☆ | No power mgmt, no introspection
Cost | Low | ⭐⭐⭐⭐⭐ | Open source, Apache 2.0

Scoring:

  • ⭐⭐⭐⭐⭐ Excellent
  • ⭐⭐⭐⭐☆ Good
  • ⭐⭐⭐☆☆ Adequate
  • ⭐⭐☆☆☆ Limited
  • ⭐☆☆☆☆ Poor
  • ☆☆☆☆☆ Not supported

Recommendations

Choose Matchbox if:

  1. ✅ Provisioning Fedora CoreOS, Flatcar, or RHEL CoreOS
  2. ✅ Building bare-metal Kubernetes clusters
  3. ✅ Prefer declarative infrastructure (Terraform)
  4. ✅ Want minimal dependencies (single binary)
  5. ✅ Need flexible label-based machine matching
  6. ✅ Have homogeneous OS requirements (all Ignition-based)

Avoid Matchbox if:

  1. ❌ Need multi-OS support (Windows, traditional Linux)
  2. ❌ Require web UI for operations teams
  3. ❌ Need built-in hardware management (power, BIOS config)
  4. ❌ Have strict multi-tenancy requirements
  5. ❌ Need automated hardware discovery/introspection

Hybrid Approaches

Pattern 1: Matchbox + Ansible

  • Matchbox: Initial OS provisioning
  • Ansible: Post-boot configuration, app deployment
  • Works well for stateful services on bare-metal

Pattern 2: Matchbox + Metal³

  • Matchbox: Bootstrap initial k8s cluster
  • Metal³: Ongoing cluster node lifecycle management
  • Gradual migration from Matchbox to Metal³

Pattern 3: Matchbox + Terraform + External Secrets

  • Matchbox: Base OS + minimal config
  • Ignition: Fetch secrets from Vault/GCP Secret Manager
  • Terraform: Orchestrate end-to-end provisioning

Conclusion

Matchbox is a purpose-built, minimalist network boot service optimized for modern immutable operating systems (Ignition-based). It excels in container-centric bare-metal environments, particularly for Kubernetes clusters built with Fedora CoreOS or Flatcar Linux.

Best fit: Organizations adopting immutable infrastructure patterns, container orchestration, and declarative provisioning workflows.

Not ideal for: Heterogeneous OS environments, multi-tenant bare-metal clouds, or teams requiring extensive web UI and built-in hardware management.

For home labs and development, Matchbox offers an excellent balance of simplicity and power. For production Kubernetes deployments, it’s a proven, battle-tested solution (via Typhoon). For complex enterprise provisioning with mixed OS requirements, consider Foreman or MAAS instead.

1.6 - Ubiquiti Dream Machine Pro Analysis

Comprehensive analysis of the Ubiquiti Dream Machine Pro capabilities, focusing on network boot (PXE) support and infrastructure integration.

Overview

The Ubiquiti Dream Machine Pro (UDM Pro) is an all-in-one network gateway, router, and switch designed for enterprise and advanced home lab environments. This analysis focuses on its capabilities relevant to infrastructure automation and network boot scenarios.

Key Specifications

Hardware

  • Processor: Quad-core ARM Cortex-A57 @ 1.7 GHz
  • RAM: 4GB DDR4
  • Storage: 128GB eMMC (for UniFi OS, applications, and logs)
  • Network Interfaces:
    • 1x WAN port (RJ45, SFP, or SFP+)
    • 8x LAN ports (1 Gbps RJ45, configurable)
    • 1x SFP+ port (10 Gbps)
    • 1x SFP port (1 Gbps)
  • Additional Features:
    • 3.5" SATA HDD bay (for UniFi Protect surveillance)
    • IDS/IPS engine
    • Deep packet inspection
    • Built-in UniFi Network Controller

Software

  • OS: UniFi OS (Linux-based)
  • Controller: Built-in UniFi Network Controller
  • Services: DHCP, DNS, routing, firewall, VPN (site-to-site and remote access)

Network Boot (PXE) Support

Native DHCP PXE Capabilities

The UDM Pro provides basic PXE boot support through its DHCP server:

Supported:

  • DHCP Option 66 (next-server / TFTP server address)
  • DHCP Option 67 (filename / boot file name)
  • Basic single-architecture PXE booting

Configuration via UniFi Controller:

  1. Navigate to SettingsNetworks → Select your network
  2. Scroll to DHCP section
  3. Enable DHCP
  4. Under Advanced DHCP Options:
    • TFTP Server: IP address of your TFTP/PXE server (e.g., 192.168.42.16)
    • Boot Filename: Name of the bootloader file (e.g., pxelinux.0 for BIOS or bootx64.efi for UEFI)

Limitations:

  • No multi-architecture support: Cannot differentiate boot files based on client architecture (BIOS vs. UEFI, x86_64 vs. ARM64)
  • No conditional DHCP options: Cannot vary filename or next-server based on client characteristics
  • Fixed boot parameters: One boot configuration for all PXE clients
  • Single bootloader only: Must choose either BIOS or UEFI bootloader, not both

Use Cases:

  • ✅ Homogeneous environments (all BIOS or all UEFI)
  • ✅ Single OS deployment scenarios
  • ✅ Simple provisioning workflows
  • ❌ Mixed BIOS/UEFI environments (requires external DHCP server with conditional logic)
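When conditional logic is required, the usual workaround is a small external DHCP/TFTP service on the provisioning VLAN (with DHCP disabled on the UDM Pro for that network). A minimal dnsmasq sketch; the config path, IP ranges, and TFTP server address (192.168.42.16) are illustrative:

```bash
# Sketch: external dnsmasq replacing UDM Pro DHCP on the provisioning VLAN,
# choosing a bootloader per client architecture (all values are illustrative)
sudo tee /etc/dnsmasq.d/pxe.conf >/dev/null <<'EOF'
port=0                                    # disable DNS; DHCP + TFTP only
interface=eth0
dhcp-range=192.168.42.100,192.168.42.200,12h
# Tag clients by PXE client architecture (DHCP option 93)
dhcp-match=set:bios,option:client-arch,0
dhcp-match=set:efi64,option:client-arch,7
dhcp-match=set:efi64,option:client-arch,9
# Hand out a different boot file per tag from the same TFTP server
dhcp-boot=tag:bios,pxelinux.0,,192.168.42.16
dhcp-boot=tag:efi64,bootx64.efi,,192.168.42.16
enable-tftp
tftp-root=/srv/tftp
EOF
sudo systemctl restart dnsmasq
```

The dhcp-match lines key off DHCP option 93 (client system architecture), which is exactly the differentiation the UDM Pro's DHCP server cannot express.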

Network Segmentation & VLANs

The UDM Pro excels at network segmentation, critical for infrastructure isolation:

  • VLAN Support: Native 802.1Q tagging
  • Firewall Rules: Inter-VLAN routing with granular firewall policies
  • Network Isolation: Can create fully isolated networks or controlled inter-network traffic
  • Use Cases for Infrastructure:
    • Management VLAN (for PXE/provisioning)
    • Production VLAN (workloads)
    • IoT/OT VLAN (isolated devices)
    • DMZ (exposed services)

VPN Capabilities

Site-to-Site VPN

  • Protocols: IPsec, WireGuard (experimental)
  • Use Case: Connect home lab to cloud infrastructure (GCP, AWS, Azure)
  • Performance: Hardware-accelerated encryption on UDM Pro

Remote Access VPN

  • Protocols: L2TP, OpenVPN
  • Use Case: Remote administration of home lab infrastructure
  • Integration: Can work with Cloudflare Access for additional security layer

IDS/IPS Engine

  • Technology: Suricata-based
  • Capabilities:
    • Intrusion detection
    • Intrusion prevention (can drop malicious traffic)
    • Threat signatures updated via UniFi
  • Performance Impact: Can affect throughput on high-bandwidth connections
  • Recommendation: Enable for security-sensitive infrastructure segments

DNS & DHCP Services

DNS

  • Local DNS: Can act as caching DNS resolver
  • Custom DNS Records: Limited to UniFi controller hostname
  • Recommendation: Use external DNS (Pi-hole, Bind9) for advanced features like split-horizon DNS

DHCP

  • Static Leases: Supports MAC-based static IP assignments
  • DHCP Options: Can configure common options (NTP, DNS, domain name)
  • Reservations: Per-client reservations via GUI
  • PXE Options: Basic Option 66/67 support (as noted above)

Integration with Infrastructure-as-Code

UniFi Network API

  • REST API: Available for configuration automation
  • Python Libraries: pyunifi and others for programmatic access
  • Use Cases:
    • Terraform provider for network state management
    • Ansible modules for configuration automation
    • CI/CD integration for network-as-code

Terraform Provider

  • Provider: paultyng/unifi
  • Capabilities: Manage networks, firewall rules, port forwarding, DHCP settings
  • Limitations: Not all UI features exposed via API

Configuration Persistence

  • Backup/Restore: JSON-based configuration export
  • Version Control: Can track config changes in Git
  • Recovery: Auto-backup to cloud (optional)

Performance Characteristics

Throughput

  • Routing/NAT: ~3.5 Gbps (without IDS/IPS)
  • IDS/IPS Enabled: ~850 Mbps - 1 Gbps
  • VPN (IPsec): ~1 Gbps
  • Inter-VLAN Routing: Wire speed (8 Gbps backplane)

Scalability

  • Concurrent Devices: 500+ clients tested
  • VLANs: Up to 32 networks/VLANs
  • Firewall Rules: Thousands (performance depends on complexity)
  • DHCP Leases: Supports large pools efficiently

Comparison to Alternatives

| Feature | UDM Pro | pfSense | OPNsense | MikroTik |
|---|---|---|---|---|
| Basic PXE | ✅ | ✅ | ✅ | ✅ |
| Conditional DHCP | ❌ | ✅ | ✅ | ✅ |
| All-in-one | ✅ | ❌ | ❌ | Varies |
| GUI Ease-of-use | ✅ | ✅ | ⚠️ | ⚠️ |
| API/Automation | ⚠️ | ✅ | ✅ | ✅ |
| IDS/IPS Built-in | ✅ | ⚠️ (addon) | ⚠️ (addon) | ❌ |
| Hardware | Fixed | Flexible | Flexible | Flexible |
| Price | $$$ | $ (+ hardware) | $ (+ hardware) | $ - $$$ |

Recommendations for Home Lab Use

Ideal Use Cases

Use the UDM Pro when:

  • You want an all-in-one solution with minimal configuration
  • You need integrated UniFi controller and network management
  • Your home lab has mixed UniFi hardware (switches, APs)
  • You want a polished GUI and mobile app management
  • Network segmentation and VLANs are critical

Consider Alternatives When

⚠️ Look elsewhere if:

  • You need conditional DHCP options or multi-architecture PXE boot
  • You require advanced routing protocols (BGP, OSPF beyond basics)
  • You need granular firewall control and scripting (pfSense/OPNsense better)
  • Budget is tight and you already have x86 hardware (pfSense on old PC)
  • You need extremely low latency (sub-1ms) routing
Recommended Configuration

  1. Network Segmentation:

    • VLAN 10: Management (PXE, Ansible, provisioning tools)
    • VLAN 20: Kubernetes cluster
    • VLAN 30: Storage network (NFS, iSCSI)
    • VLAN 40: Public-facing services (behind Cloudflare)
  2. DHCP Strategy:

    • Use UDM Pro native DHCP with basic PXE options for single-arch PXE needs
    • Static reservations for infrastructure components
    • Consider external DHCP server if conditional options are required
  3. Firewall Rules:

    • Default deny between VLANs
    • Allow management VLAN → all (with source IP restrictions)
    • Allow cluster VLAN → storage VLAN (on specific ports)
    • NAT only on VLAN 40 (public services)
  4. VPN Configuration:

    • Site-to-Site to GCP via WireGuard (lower overhead than IPsec)
    • Remote access VPN on separate VLAN with restrictive firewall
  5. Integration:

    • Terraform for network state management
    • Ansible for DHCP/DNS servers in management VLAN
    • Cloudflare Access for secure public service exposure

Conclusion

The UDM Pro is a capable all-in-one network device ideal for home labs that prioritize ease-of-use and integration with the UniFi ecosystem. It provides basic PXE boot support suitable for single-architecture environments, though conditional DHCP options require external DHCP servers for complex scenarios.

For infrastructure automation projects, the UDM Pro serves well as a reliable network foundation that handles VLANs, routing, and basic services, allowing you to focus on higher-level infrastructure concerns like container orchestration and cloud integration.

1.6.1 - UDM Pro VLAN Configuration & Capabilities

Detailed analysis of VLAN support on the Ubiquiti Dream Machine Pro, including port-based VLAN assignment and VPN integration.

Overview

The Ubiquiti Dream Machine Pro (UDM Pro) provides robust VLAN support through native 802.1Q tagging, enabling network segmentation for security, performance, and organizational purposes. This document covers VLAN configuration capabilities, port assignments, and VPN integration.

VLAN Fundamentals on UDM Pro

Supported Standards

  • 802.1Q VLAN Tagging: Full support for standard VLAN tagging
  • VLAN Range: IDs 1-4094 (standard IEEE 802.1Q range)
  • Maximum VLANs: Up to 32 networks/VLANs per device
  • Native VLAN: Configurable per port (default: VLAN 1)

VLAN Types

Corporate Network

  • Default network type for general-purpose VLANs
  • Provides DHCP, inter-VLAN routing, and firewall capabilities
  • Can enable/disable guest policies, IGMP snooping, and multicast DNS

Guest Network

  • Isolated network with internet-only access
  • Automatic firewall rules preventing access to other VLANs
  • Captive portal support for guest authentication

IoT Network

  • Optimized for IoT devices with device isolation
  • Prevents lateral movement between IoT devices
  • Allows communication with controller/gateway only

Port-Based VLAN Assignment

Per-Port VLAN Configuration

The UDM Pro’s 8x 1 Gbps LAN ports and SFP/SFP+ ports support flexible VLAN assignment:

Configuration Options per Port:

  1. Native VLAN/Untagged VLAN: The default VLAN for untagged traffic on the port
  2. Tagged VLANs: Multiple VLANs that can pass through the port with 802.1Q tags
  3. Port Profile: Pre-configured VLAN assignments that can be applied to ports

Port Profile Types

All: Port accepts all VLANs (trunk mode)

  • Passes all configured VLANs with tags
  • Used for connecting managed switches or access points
  • Native VLAN for untagged traffic

Specific VLANs: Port limited to selected VLANs

  • Choose which VLANs are allowed (tagged)
  • Set native/untagged VLAN
  • Used for controlled trunk links

Single VLAN: Access port mode

  • Port carries only one VLAN (untagged)
  • All traffic on this port belongs to specified VLAN
  • Used for end devices (PCs, servers, printers)

Configuration Steps

Via UniFi Controller GUI:

  1. Create Port Profile:

    • Navigate to SettingsProfilesPort Manager
    • Click Create New Port Profile
    • Select profile type (All, LAN, or Custom)
    • Configure VLAN settings:
      • Native VLAN/Network: Untagged VLAN
      • Tagged VLANs: Select allowed VLANs (for trunk mode)
    • Enable/disable settings: PoE, Storm Control, Port Isolation
  2. Assign Profile to Ports:

    • Navigate to UniFi Devices → Select UDM Pro
    • Go to Ports tab
    • For each LAN port (1-8) or SFP port:
      • Click port to edit
      • Select Port Profile from dropdown
      • Apply changes
  3. Quick Port Assignment (Alternative):

    • SettingsNetworks → Select VLAN
    • Under Port Manager, assign specific ports to this network
    • Ports become access ports for this VLAN

Example Port Layout

UDM Pro Port Assignment Example:

Port 1: Native VLAN 10 (Management) - Access Mode
        └── Use: Ansible control server

Port 2: Native VLAN 20 (Kubernetes) - Access Mode
        └── Use: K8s master node

Port 3: Native VLAN 30 (Storage) - Access Mode
        └── Use: NAS/SAN device

Port 4: Native VLAN 1, Tagged: 10,20,30,40 - Trunk Mode
        └── Use: Managed switch uplink

Port 5-7: Native VLAN 40 (DMZ) - Access Mode
          └── Use: Public-facing servers

Port 8: Native VLAN 1 (Default/Untagged) - Access Mode
        └── Use: Management laptop (temporary)

SFP+: Native VLAN 1, Tagged: All - Trunk Mode
      └── Use: 10G uplink to core switch

VLAN Features and Capabilities

Inter-VLAN Routing

Enabled by Default:

  • Hardware-accelerated routing between VLANs
  • Wire-speed performance (8 Gbps backplane)
  • Routing decisions made at Layer 3

Firewall Control:

  • Default behavior: Allow all inter-VLAN traffic
  • Recommended: Create explicit allow/deny rules per VLAN pair
  • Granular control: Protocol, port, source/destination filtering

Example Firewall Rules:

Rule 1: Allow Management (VLAN 10) → All VLANs
        Source: 192.168.10.0/24
        Destination: Any
        Action: Accept

Rule 2: Allow K8s (VLAN 20) → Storage (VLAN 30) - NFS only
        Source: 192.168.20.0/24
        Destination: 192.168.30.0/24
        Ports: 2049 (NFS), 111 (Portmapper)
        Action: Accept

Rule 3: Block IoT (VLAN 50) → All Private Networks
        Source: 192.168.50.0/24
        Destination: 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12
        Action: Drop

Rule 4 (Implicit): Default Deny Between VLANs
        Source: Any
        Destination: Any
        Action: Drop

DHCP per VLAN

Each VLAN can have its own DHCP server:

  • Independent IP ranges per VLAN
  • Separate DHCP options (DNS, gateway, NTP, domain)
  • Static DHCP reservations per VLAN
  • PXE boot options (Option 66/67) per network

Configuration:

  • SettingsNetworks → Select VLAN
  • DHCP section:
    • Enable DHCP server
    • Define IP range (e.g., 192.168.10.100-192.168.10.254)
    • Set lease time
    • Configure gateway (usually UDM Pro’s IP on this VLAN)
    • Add custom DHCP options

Example DHCP Configuration:

VLAN 10 (Management):
  Subnet: 192.168.10.0/24
  Gateway: 192.168.10.1 (UDM Pro)
  DHCP Range: 192.168.10.100-192.168.10.200
  DNS: 192.168.10.10 (local DNS server)
  TFTP Server (Option 66): 192.168.10.16
  Boot Filename (Option 67): pxelinux.0

VLAN 20 (Kubernetes):
  Subnet: 192.168.20.0/24
  Gateway: 192.168.20.1 (UDM Pro)
  DHCP Range: 192.168.20.50-192.168.20.99
  DNS: 8.8.8.8, 8.8.4.4
  Domain Name: k8s.lab.local

VLAN Isolation

Guest Portal Isolation:

  • Guest networks auto-configured with isolation rules
  • Prevents access to RFC1918 private networks
  • Internet-only access by default

Manual Isolation (Firewall Rules):

  • Create LAN In rules to block inter-VLAN traffic
  • Use groups for easier management of multiple VLANs
  • Apply port isolation for additional security

Device Isolation (IoT Networks):

  • Prevents devices on same VLAN from communicating
  • Only controller/gateway access allowed
  • Use for untrusted IoT devices (cameras, smart home)

VPN and VLAN Integration

Site-to-Site VPN VLAN Assignment

✅ VLANs CAN be assigned to site-to-site VPN connections:

WireGuard VPN:

  • Configure remote subnet to map to specific local VLAN
  • Example: GCP subnet 10.128.0.0/20 → routed through VLAN 10
  • Routing table automatically updated
  • Firewall rules apply to VPN traffic

IPsec Site-to-Site:

  • Specify local networks (can select specific VLANs)
  • Remote networks configured in tunnel settings
  • Multiple VLANs can traverse single VPN tunnel
  • Perfect Forward Secrecy supported

Configuration Steps:

  1. SettingsVPNSite-to-Site VPN
  2. Create New VPN tunnel (WireGuard or IPsec)
  3. Under Local Networks, select VLANs to include:
    • Option 1: Select “All” networks
    • Option 2: Choose specific VLANs (e.g., VLAN 10, 20 only)
  4. Configure Remote Networks (cloud provider subnets)
  5. Set encryption parameters and pre-shared keys
  6. Create Firewall Rules for VPN traffic:
    • Allow specific VLAN → VPN tunnel
    • Control which VLANs can reach remote networks

Example Site-to-Site Config:

Home Lab → GCP WireGuard VPN

Local Networks:
  - VLAN 10 (Management): 192.168.10.0/24
  - VLAN 20 (Kubernetes): 192.168.20.0/24

Remote Networks:
  - GCP VPC: 10.128.0.0/20

Firewall Rules:
  - Allow VLAN 10 → GCP VPC (all protocols)
  - Allow VLAN 20 → GCP VPC (HTTPS, kubectl API only)
  - Block all other VLANs from VPN tunnel

Remote Access VPN VLAN Assignment

✅ VLANs CAN be assigned to remote access VPN clients:

L2TP/IPsec Remote Access:

  • VPN clients land on a specific VLAN
  • Default: All clients in same VPN subnet
  • Firewall rules control VLAN access from VPN

OpenVPN Remote Access (via UniFi Network Application addon):

  • Not natively built into UDM Pro
  • Requires UniFi Network Application 6.0+
  • Can route VPN clients to specific VLAN

Teleport VPN (UniFi’s solution):

  • Built-in remote access VPN
  • Clients route through UDM Pro
  • Can access specific VLANs based on firewall rules
  • Layer 3 routing to VLANs

Configuration:

  1. SettingsVPNRemote Access
  2. Enable L2TP or configure Teleport
  3. Set VPN Network (e.g., 192.168.100.0/24)
  4. Advanced:
    • Enable access to specific VLANs
    • By default, VPN network is treated as separate VLAN
  5. Firewall Rules to allow VPN → VLANs:
    • Source: VPN network (192.168.100.0/24)
    • Destination: VLAN 10, VLAN 20 (or specific resources)
    • Action: Accept

Example Remote Access Config:

Remote VPN Users → Home Lab Access

VPN Network: 192.168.100.0/24
VPN Gateway: 192.168.100.1 (UDM Pro)

Firewall Rules:
  Rule 1: Allow VPN → Management VLAN (admin users)
          Source: 192.168.100.0/24
          Dest: 192.168.10.0/24
          Ports: SSH (22), HTTPS (443)
  
  Rule 2: Allow VPN → Kubernetes VLAN (developers)
          Source: 192.168.100.0/24
          Dest: 192.168.20.0/24
          Ports: kubectl (6443), app ports (8080-8090)
  
  Rule 3: Block VPN → Storage VLAN (security)
          Source: 192.168.100.0/24
          Dest: 192.168.30.0/24
          Action: Drop

VPN VLAN Routing Limitations

Current Limitations:

  • Cannot assign individual VPN clients to different VLANs dynamically
  • No VLAN assignment based on user identity (all clients in same VPN network)
  • RADIUS integration does not support per-user VLAN assignment for VPN
  • For per-user VLAN control, use firewall rules based on source IP

Workarounds:

  • Use firewall rules with VPN client IP ranges for granular access
  • Deploy separate VPN tunnels for different access levels
  • Use RADIUS for authentication + firewall rules for authorization

VLAN Best Practices for Home Lab

Network Segmentation Strategy

Recommended VLAN Layout:

VLAN 1:   Default/Management (UDM Pro access)
VLAN 10:  Infrastructure Management (Ansible, PXE, monitoring)
VLAN 20:  Kubernetes Cluster (control plane + workers)
VLAN 30:  Storage Network (NFS, iSCSI, object storage)
VLAN 40:  DMZ/Public Services (exposed to internet via Cloudflare)
VLAN 50:  IoT Devices (isolated smart home devices)
VLAN 60:  Guest Network (visitor WiFi, untrusted devices)
VLAN 100: VPN Remote Access (remote admin/dev access)

Firewall Policy Design

Default Deny Approach:

  1. Create explicit allow rules for necessary traffic
  2. Set implicit deny for all inter-VLAN traffic
  3. Log dropped packets for troubleshooting

Rule Order (top to bottom):

  1. Management VLAN → All (with source IP restrictions)
  2. Kubernetes → Storage (specific ports)
  3. DMZ → Internet (outbound only)
  4. VPN → Specific VLANs (based on role)
  5. All → Internet (NAT)
  6. Block RFC1918 from DMZ
  7. Drop all (implicit)

Performance Optimization

VLAN Routing Performance:

  • Inter-VLAN routing is hardware-accelerated
  • No performance penalty for multiple VLANs
  • Use VLAN tagging on trunk ports to reduce switch load

Multicast and Broadcast Control:

  • Enable IGMP snooping per VLAN for multicast efficiency
  • Disable multicast DNS (mDNS) between VLANs if not needed
  • Use multicast routing for cross-VLAN multicast (advanced)

Advanced VLAN Features

VLAN-Specific Services

DNS per VLAN:

  • Configure different DNS servers per VLAN via DHCP
  • Example: Management VLAN uses local DNS, DMZ uses public DNS

NTP per VLAN:

  • DHCP Option 42 for NTP server
  • Different time sources per network segment

Domain Name per VLAN:

  • DHCP Option 15 for domain name
  • Useful for split-horizon DNS setups

VLAN Tagging on WiFi

UniFi WiFi Integration:

  • Each WiFi SSID can map to a specific VLAN
  • Multiple SSIDs on same AP → different VLANs
  • Seamless VLAN tagging for wireless clients

Configuration:

  • Create WiFi network in UniFi Controller
  • Assign VLAN ID to SSID
  • Client traffic automatically tagged

VLAN Monitoring and Troubleshooting

Traffic Statistics:

  • Per-VLAN bandwidth usage visible in UniFi Controller
  • Deep Packet Inspection (DPI) provides application-level stats
  • Export data for analysis in external tools

Debugging Tools:

  • Port mirroring for packet capture
  • Flow logs for traffic analysis
  • Firewall logs show inter-VLAN blocks
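For quick checks from a Linux host attached to a trunk or access port, a packet capture confirms whether tagging and DHCP behave as expected; a small sketch (interface name and VLAN ID are illustrative):

```bash
# Show 802.1Q-tagged frames for VLAN 20 arriving on a trunk port
sudo tcpdump -i eth0 -e -nn vlan 20
# Watch DHCP traffic on the port's native VLAN
sudo tcpdump -i eth0 -nn udp port 67 or udp port 68
```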

Common Issues:

  1. VLAN not working: Check port profile assignment and native VLAN config
  2. No inter-VLAN routing: Verify firewall rules aren’t blocking traffic
  3. DHCP not working on VLAN: Ensure DHCP server enabled on that network
  4. VPN can’t reach VLAN: Check VPN local networks include the VLAN

Summary

VLAN Port Assignment: ✅ YES

The UDM Pro fully supports port-based VLAN assignment:

  • Individual ports can be assigned to specific VLANs (access mode)
  • Ports can carry multiple tagged VLANs (trunk mode)
  • Native/untagged VLAN configurable per port
  • Port profiles simplify configuration across multiple devices

VPN VLAN Assignment: ✅ YES

VLANs can be assigned to VPN connections:

  • Site-to-Site VPN: Select which VLANs traverse the tunnel
  • Remote Access VPN: VPN clients route to specific VLANs via firewall rules
  • Routing Control: Full control over which VLANs are accessible via VPN
  • Limitations: No per-user VLAN assignment; use firewall rules for granular access

Key Capabilities

  • Up to 32 VLANs supported
  • Hardware-accelerated inter-VLAN routing
  • Per-VLAN DHCP, DNS, and firewall policies
  • Full integration with UniFi WiFi for SSID-to-VLAN mapping
  • Flexible port profiles for easy configuration
  • VPN integration for both site-to-site and remote access scenarios

2 - Architecture Decision Records

Documentation of architectural decisions made using MADR 4.0.0 standard

Architecture Decision Records (ADRs)

This section contains architectural decision records that document the key design choices made. Each ADR follows the MADR 4.0.0 format and includes:

  • Context and problem statement
  • Decision drivers and constraints
  • Considered options with pros and cons
  • Decision outcome and rationale
  • Consequences (positive and negative)
  • Confirmation methods

ADR Categories

ADRs are classified into three categories:

  • Strategic - High-level architectural decisions affecting the entire system (frameworks, authentication strategies, cross-cutting patterns). Use for foundational technology choices.
  • User Journey - Decisions solving specific user journey problems. More tactical than strategic, but still architectural. Use when evaluating approaches to implement user-facing features.
  • API Design - API endpoint implementation decisions (pagination, filtering, bulk operations). Use for significant API design trade-offs that warrant documentation.

Status Values

Each ADR has a status that reflects its current state:

  • proposed - Decision is under consideration
  • accepted - Decision has been approved and should be implemented
  • rejected - Decision was considered but not approved
  • deprecated - Decision is no longer relevant or has been superseded
  • superseded by ADR-XXXX - Decision has been replaced by a newer ADR

These records provide historical context for architectural decisions and help ensure consistency across the platform.

2.1 - [0001] Use MADR for Architecture Decision Records

Adopt Markdown Architectural Decision Records (MADR) as the standard format for documenting architectural decisions in the project.

Context and Problem Statement

As the project grows, architectural decisions are made that have long-term impacts on the system’s design, maintainability, and scalability. Without a structured way to document these decisions, we risk losing the context and rationale behind important choices, making it difficult for current and future team members to understand why certain approaches were taken.

How should we document architectural decisions in a way that is accessible, maintainable, and provides sufficient context for future reference?

Decision Drivers

  • Need for clear documentation of architectural decisions and their rationale
  • Easy accessibility and searchability of past decisions
  • Low barrier to entry for creating and maintaining decision records
  • Integration with existing documentation workflow
  • Version control friendly format
  • Industry-standard approach that team members may already be familiar with

Considered Options

  • MADR (Markdown Architectural Decision Records)
  • ADR using custom format
  • Wiki-based documentation
  • No formal ADR process

Decision Outcome

Chosen option: “MADR (Markdown Architectural Decision Records)”, because it provides a well-established, standardized format that is lightweight, version-controlled, and integrates seamlessly with our existing documentation structure. MADR 4.0.0 offers a clear template that captures all necessary information while remaining flexible enough for different types of decisions.

Consequences

  • Good, because MADR is a widely adopted standard with clear documentation and examples
  • Good, because markdown files are easy to create, edit, and review through pull requests
  • Good, because ADRs will be version-controlled alongside code, maintaining historical context
  • Good, because the format is flexible enough to accommodate strategic, user-journey, and API design decisions
  • Good, because team members can easily search and reference past decisions
  • Neutral, because requires discipline to maintain and update ADR status as decisions evolve
  • Bad, because team members need to learn and follow the MADR format conventions

Confirmation

Compliance will be confirmed through:

  • Code reviews ensuring new architectural decisions are documented as ADRs
  • ADRs are stored in docs/content/r&d/adrs/ following the naming convention NNNN-title-with-dashes.md
  • Regular reviews during architecture discussions to reference and update existing ADRs

Pros and Cons of the Options

MADR (Markdown Architectural Decision Records)

MADR 4.0.0 is a standardized format for documenting architectural decisions using markdown.

  • Good, because it’s a well-established standard with extensive documentation
  • Good, because markdown is simple, portable, and version-control friendly
  • Good, because it provides a clear structure while remaining flexible
  • Good, because it integrates with static site generators and documentation tools
  • Good, because it’s lightweight and doesn’t require special tools
  • Neutral, because it requires some initial learning of the format
  • Neutral, because maintaining consistency requires discipline

ADR using custom format

Create our own custom format for architectural decision records.

  • Good, because we can tailor it exactly to our needs
  • Bad, because it requires defining and maintaining our own standard
  • Bad, because new team members won’t be familiar with the format
  • Bad, because we lose the benefits of community knowledge and tooling
  • Bad, because it may evolve inconsistently over time

Wiki-based documentation

Use a wiki system (like Confluence, Notion, or GitHub Wiki) to document decisions.

  • Good, because wikis provide easy editing and hyperlinking
  • Good, because some team members may be familiar with wiki tools
  • Neutral, because it may or may not integrate with version control
  • Bad, because content may not be version-controlled alongside code
  • Bad, because it creates a separate system to maintain
  • Bad, because it’s harder to review changes through standard PR process
  • Bad, because portability and long-term accessibility may be concerns

No formal ADR process

Continue without a structured approach to documenting architectural decisions.

  • Good, because it requires no additional overhead
  • Bad, because context and rationale for decisions are lost over time
  • Bad, because new team members struggle to understand why decisions were made
  • Bad, because it leads to repeated discussions of previously settled questions
  • Bad, because it makes it difficult to track when decisions should be revisited

More Information

  • MADR 4.0.0 specification: https://adr.github.io/madr/
  • ADRs will be categorized as: strategic, user-journey, or api-design
  • ADR status values: proposed | accepted | rejected | deprecated | superseded by ADR-XXXX
  • All ADRs are stored in docs/content/r&d/adrs/ directory

2.2 - [0002] Network Boot Architecture for Home Lab

Evaluate options for network booting servers in a home lab environment, considering local vs cloud-hosted boot servers.

Context and Problem Statement

When setting up a home lab infrastructure, servers need to be provisioned and booted over the network using PXE (Preboot Execution Environment). This requires a TFTP/HTTP server to serve boot files to requesting machines. The question is: where should this boot server be hosted to balance security, reliability, cost, and operational complexity?

Decision Drivers

  • Security: Minimize attack surface and ensure only authorized servers receive boot files
  • Reliability: Boot process should be resilient and not dependent on external network connectivity
  • Cost: Minimize ongoing infrastructure costs
  • Complexity: Keep the operational burden manageable
  • Trust Model: Clear verification of requesting server identity

Considered Options

  • Option 1: TFTP/HTTP server locally on home lab network
  • Option 2: TFTP/HTTP server on public cloud (without VPN)
  • Option 3: TFTP/HTTP server on public cloud (with VPN)

Decision Outcome

Chosen option: “Option 3: TFTP/HTTP server on public cloud (with VPN)”, because:

  1. No local machine management: Unlike Option 1, this avoids the need to maintain dedicated local hardware for the boot server, reducing operational overhead
  2. Secure protocol support: The VPN tunnel encrypts all traffic, allowing unsecured protocols like TFTP to be used without risk of data exposure over public internet routes (unlike Option 2)
  3. Cost-effective VPN: The UDM Pro natively supports WireGuard, enabling a self-managed VPN solution that avoids expensive managed VPN services (~$180-300/year vs ~$540-900/year)

Consequences

  • Good, because all traffic is encrypted through WireGuard VPN tunnel
  • Good, because boot server is not exposed to public internet (no public attack surface)
  • Good, because trust model is simple - subnet validation similar to local option
  • Good, because centralized cloud management reduces local maintenance burden
  • Good, because boot server remains available even if home lab storage fails
  • Good, because UDM Pro’s native WireGuard support keeps costs at ~$180-300/year
  • Bad, because boot process depends on both internet connectivity and VPN availability
  • Bad, because VPN adds latency to boot file transfers
  • Bad, because VPN gateway becomes an additional failure point
  • Bad, because higher ongoing cost compared to local-only option (~$180-300/year vs ~$10/year)

Confirmation

The implementation will be confirmed by:

  • Successfully network booting a test server using the chosen architecture
  • Validating the trust model prevents unauthorized boot requests
  • Measuring actual costs against estimates

Pros and Cons of the Options

Option 1: TFTP/HTTP server locally on home lab network

Run the boot server on local infrastructure (e.g., Raspberry Pi, dedicated VM, or container) within the home lab network.

Boot Flow Sequence

sequenceDiagram
    participant Server as Home Lab Server
    participant DHCP as Local DHCP Server
    participant Boot as Local TFTP/HTTP Server

    Server->>DHCP: PXE Boot Request (DHCP Discover)
    DHCP->>Server: DHCP Offer with Boot Server IP
    Server->>Boot: TFTP Request for Boot File
    Boot->>Boot: Verify MAC/IP against allowlist
    Boot->>Server: Send iPXE/Boot Loader
    Server->>Boot: HTTP Request for Kernel/Initrd
    Boot->>Server: Send Boot Files
    Server->>Server: Boot into OS

Trust Model

  • MAC Address Allowlist: Maintain a list of known server MAC addresses (see the dnsmasq sketch after this list)
  • Network Isolation: Boot server only accessible from home lab VLAN
  • No external exposure: Traffic never leaves local network
  • Physical security: Relies on physical access control to home lab
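If the local boot server runs dnsmasq, the MAC allowlist can live directly in its configuration; a sketch with illustrative MAC addresses and IPs:

```bash
# Sketch: dnsmasq MAC allowlist on the local boot server (values are illustrative)
sudo tee /etc/dnsmasq.d/allowlist.conf >/dev/null <<'EOF'
# Hosts matched by dhcp-host receive dnsmasq's built-in "known" tag
dhcp-host=aa:bb:cc:dd:ee:01,192.168.10.21
dhcp-host=aa:bb:cc:dd:ee:02,192.168.10.22
# Ignore DHCP requests from anything not on the allowlist
dhcp-ignore=tag:!known
dhcp-boot=pxelinux.0
EOF
sudo systemctl restart dnsmasq
```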

Cost Estimate

  • Hardware: ~$50-100 one-time (Raspberry Pi or repurposed hardware)
  • Power: ~$5-10/year (low power consumption)
  • Total: ~$55-110 initial + ~$10/year ongoing

Pros and Cons

  • Good, because no dependency on internet connectivity for booting
  • Good, because lowest latency for boot file transfers
  • Good, because all data stays within local network (maximum privacy)
  • Good, because lowest ongoing cost
  • Good, because simple trust model based on network isolation
  • Neutral, because requires dedicated local hardware or resources
  • Bad, because single point of failure if boot server goes down
  • Bad, because requires local maintenance and updates

Option 2: TFTP/HTTP server on public cloud (without VPN)

Host the boot server on a cloud provider (AWS, GCP, Azure) and expose it directly to the internet.

Boot Flow Sequence

sequenceDiagram
    participant Server as Home Lab Server
    participant DHCP as Local DHCP Server
    participant Router as Home Router/NAT
    participant Internet as Internet
    participant Boot as Cloud TFTP/HTTP Server

    Server->>DHCP: PXE Boot Request (DHCP Discover)
    DHCP->>Server: DHCP Offer with Cloud Boot Server IP
    Server->>Router: TFTP Request
    Router->>Internet: NAT Translation
    Internet->>Boot: TFTP Request from Home IP
    Boot->>Boot: Verify source IP + token/certificate
    Boot->>Internet: Send iPXE/Boot Loader
    Internet->>Router: Response
    Router->>Server: Boot Loader
    Server->>Router: HTTP Request for Kernel/Initrd
    Router->>Internet: NAT Translation
    Internet->>Boot: HTTP Request with auth headers
    Boot->>Boot: Validate request authenticity
    Boot->>Internet: Send Boot Files
    Internet->>Router: Response
    Router->>Server: Boot Files
    Server->>Server: Boot into OS

Trust Model

  • Source IP Validation: Restrict to home lab’s public IP (dynamic IP is problematic)
  • Certificate/Token Authentication: Embed certificates in initial bootloader
  • TLS for HTTP: All HTTP traffic encrypted
  • Challenge-Response: Boot server can challenge requesting server
  • Risk: TFTP typically unencrypted, vulnerable to interception

Cost Estimate

  • Cloud VM (t3.micro or equivalent): ~$10-15/month
  • Data Transfer: ~$1-5/month (boot files are typically small)
  • Static IP: ~$3-5/month
  • Total: ~$170-300/year

Pros and Cons

  • Good, because boot server remains available even if home lab has issues
  • Good, because centralized management in cloud console
  • Good, because easy to scale or replicate
  • Neutral, because requires internet connectivity for every boot
  • Bad, because significantly higher ongoing cost
  • Bad, because TFTP protocol is inherently insecure over public internet
  • Bad, because complex trust model required (IP validation, certificates)
  • Bad, because boot process depends on internet availability
  • Bad, because higher latency for boot file transfers
  • Bad, because public exposure increases attack surface

Option 3: TFTP/HTTP server on public cloud (with VPN)

Host the boot server in the cloud but connect the home lab to the cloud via a site-to-site VPN tunnel.

Boot Flow Sequence

sequenceDiagram
    participant Server as Home Lab Server
    participant DHCP as Local DHCP Server
    participant VPN as VPN Gateway (Home)
    participant CloudVPN as VPN Gateway (Cloud)
    participant Boot as Cloud TFTP/HTTP Server

    Note over VPN,CloudVPN: Site-to-Site VPN Tunnel Established

    Server->>DHCP: PXE Boot Request (DHCP Discover)
    DHCP->>Server: DHCP Offer with Boot Server Private IP
    Server->>VPN: TFTP Request to Private IP
    VPN->>CloudVPN: Encrypted VPN Tunnel
    CloudVPN->>Boot: TFTP Request (appears local)
    Boot->>Boot: Verify source IP from home lab subnet
    Boot->>CloudVPN: Send iPXE/Boot Loader
    CloudVPN->>VPN: Encrypted Response
    VPN->>Server: Boot Loader
    Server->>VPN: HTTP Request for Kernel/Initrd
    VPN->>CloudVPN: Encrypted VPN Tunnel
    CloudVPN->>Boot: HTTP Request
    Boot->>Boot: Validate subnet membership
    Boot->>CloudVPN: Send Boot Files
    CloudVPN->>VPN: Encrypted Response
    VPN->>Server: Boot Files
    Server->>Server: Boot into OS

Trust Model

  • VPN Tunnel Encryption: All traffic encrypted end-to-end
  • Private IP Addressing: Boot server only accessible via VPN
  • Subnet Validation: Verify requests come from trusted home lab subnet (see the nginx sketch after this list)
  • VPN Authentication: Strong auth at tunnel level (certificates, pre-shared keys)
  • No public exposure: Boot server has no public IP
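The subnet check can be enforced where the boot files are served; a minimal nginx sketch for the HTTP side (paths and subnets are illustrative):

```bash
# Sketch: restrict boot file downloads to home lab subnets arriving via the VPN
sudo tee /etc/nginx/conf.d/boot.conf >/dev/null <<'EOF'
server {
    listen 80;
    root /var/www/html/boot;
    allow 192.168.10.0/24;   # management VLAN
    allow 192.168.20.0/24;   # kubernetes VLAN
    deny  all;               # everything else is rejected
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```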

Cost Estimate

  • Cloud VM (t3.micro or equivalent): ~$10-15/month
  • Data Transfer (VPN): ~$5-10/month
  • VPN Gateway Service (if using managed): ~$30-50/month OR
  • Self-managed VPN (WireGuard/OpenVPN): ~$0 additional
  • Total (self-managed VPN): ~$180-300/year
  • Total (managed VPN): ~$540-900/year

Pros and Cons

  • Good, because all traffic encrypted through VPN tunnel
  • Good, because boot server not exposed to public internet
  • Good, because trust model similar to local option (subnet validation)
  • Good, because centralized cloud management benefits
  • Good, because boot server available if home lab storage fails
  • Neutral, because moderate complexity (VPN setup and maintenance)
  • Bad, because higher cost than local option
  • Bad, because boot process still depends on internet + VPN availability
  • Bad, because VPN adds latency to boot process
  • Bad, because VPN gateway becomes additional failure point
  • Bad, because most expensive option if using managed VPN service

More Information

Key Questions for Decision

  1. How critical is boot availability during internet outages?
  2. Is the home lab public IP static or dynamic?
  3. What is the acceptable boot time latency?
  4. How many servers need to be supported?
  5. Is there existing VPN infrastructure?
  • Issue #595 - story(docs): create adr for network boot architecture

2.3 - [0003] Cloud Provider Selection for Network Boot Infrastructure

Evaluate Google Cloud Platform vs Amazon Web Services for hosting network boot server infrastructure as required by ADR-0002.

Context and Problem Statement

ADR-0002 established that network boot infrastructure will be hosted on a cloud provider and accessed via VPN (specifically WireGuard from the UDM Pro). The decision to use cloud hosting provides resilience against local hardware failures while maintaining security through encrypted VPN tunnels.

The question now is: Which cloud provider should host the network boot infrastructure?

This decision will affect:

  • Cost: Ongoing monthly/annual infrastructure costs
  • Protocol Support: Ability to serve TFTP, HTTP, and HTTPS boot files
  • VPN Integration: Ease of WireGuard deployment and management
  • Operational Complexity: Management overhead and maintenance burden
  • Performance: Boot file transfer latency and throughput
  • Vendor Lock-in: Future flexibility to migrate or multi-cloud

Decision Drivers

  • Cost Efficiency: Minimize ongoing infrastructure costs for home lab scale
  • Protocol Support: Must support TFTP (UDP/69), HTTP (TCP/80), and HTTPS (TCP/443) for network boot workflows
  • WireGuard Compatibility: Must support self-managed WireGuard VPN with reasonable effort
  • UDM Pro Integration: Should work seamlessly with UniFi Dream Machine Pro’s native WireGuard client
  • Simplicity: Minimize operational complexity for a single-person home lab
  • Existing Expertise: Leverage existing team knowledge and infrastructure
  • Performance: Sufficient throughput and low latency for boot file transfers (50-200MB per boot)

Considered Options

  • Option 1: Google Cloud Platform (GCP)
  • Option 2: Amazon Web Services (AWS)

Decision Outcome

Chosen option: “Option 1: Google Cloud Platform (GCP)”, because:

  1. Existing Infrastructure: The home lab already uses GCP extensively (Cloud Run services, load balancers, mTLS infrastructure per existing codebase), reducing operational overhead and leveraging existing expertise
  2. Comparable Costs: Both providers offer similar costs for the required infrastructure (~$6-12/month for compute + VPN), with GCP’s e2-micro being sufficient
  3. Equivalent Protocol Support: Both support TFTP/HTTP/HTTPS via direct VM access (load balancers unnecessary for single boot server), meeting all protocol requirements
  4. WireGuard Compatibility: Both require self-managed WireGuard deployment (neither has native WireGuard support), with nearly identical implementation complexity
  5. Unified Management: Consolidating all cloud infrastructure on GCP simplifies monitoring, billing, IAM, and operational workflows

While AWS would be a viable alternative (especially with t4g.micro ARM instances offering slightly better price/performance), the existing GCP investment makes GCP the pragmatic choice and avoids multi-cloud complexity.

Consequences

  • Good, because consolidates all cloud infrastructure on a single provider (reduced operational complexity)
  • Good, because leverages existing GCP expertise and IAM configurations
  • Good, because unified Cloud Monitoring/Logging across all services
  • Good, because single cloud bill simplifies cost tracking
  • Good, because existing Terraform modules and patterns can be reused
  • Good, because GCP’s e2-micro instances (~$6.50/month) are cost-effective for the workload
  • Good, because self-managed WireGuard provides flexibility and low cost (~$10/month total)
  • Neutral, because both providers have comparable protocol support (TFTP/HTTP/HTTPS via VM)
  • Neutral, because both require self-managed WireGuard (no native support)
  • Bad, because creates vendor lock-in to GCP (migration would require relearning and reconfiguration)
  • Bad, because foregoes AWS’s slightly cheaper t4g.micro ARM instances (~$6/month vs GCP’s ~$6.50/month)
  • Bad, because multi-cloud strategy could provide redundancy (accepted trade-off for simplicity)

Confirmation

The implementation will be confirmed by:

  • Successfully deploying WireGuard VPN gateway on GCP Compute Engine
  • Establishing site-to-site VPN tunnel between UDM Pro and GCP
  • Network booting a test server via VPN using TFTP and HTTP protocols
  • Measuring actual costs against estimates (~$10-15/month)
  • Validating boot performance (transfer time < 30 seconds for typical boot)
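A few commands from a home lab host cover the first confirmation pass; a sketch assuming the boot server answers on 10.100.0.1 inside the WireGuard tunnel (address and file names are illustrative):

```bash
ping -c 3 10.100.0.1                              # is the WireGuard tunnel up?
tftp 10.100.0.1 -c get ipxe.efi                   # TFTP fetch through the tunnel
curl -o /dev/null -w '%{time_total}s %{size_download} bytes\n' \
  http://10.100.0.1/boot/vmlinuz                  # HTTP fetch with rough timing
```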

Pros and Cons of the Options

Option 1: Google Cloud Platform (GCP)

Host network boot infrastructure on Google Cloud Platform.

Architecture Overview

graph TB
    subgraph "Home Lab Network"
        A[Home Lab Servers]
        B[UDM Pro - WireGuard Client]
    end
    
    subgraph "GCP VPC"
        C[WireGuard Gateway VM<br/>e2-micro]
        D[Boot Server VM<br/>e2-micro]
        C -->|VPC Routing| D
    end
    
    A -->|PXE Boot Request| B
    B -->|WireGuard Tunnel| C
    C -->|TFTP/HTTP/HTTPS| D
    D -->|Boot Files| C
    C -->|Encrypted Response| B
    B -->|Boot Files| A

Implementation Details

Compute:

  • WireGuard Gateway: e2-micro VM (~$6.50/month) running Ubuntu 22.04
    • Self-managed WireGuard server
    • IP forwarding enabled
    • Static external IP (~$3.50/month if VM ever stops)
  • Boot Server: e2-micro VM (same or consolidated with gateway)
    • TFTP server (tftpd-hpa)
    • HTTP server (nginx or simple Python server)
    • Optional HTTPS with self-signed cert or Let’s Encrypt
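A sketch of the boot-server half of that VM, assuming Ubuntu 22.04 and illustrative file names (tftpd-hpa serves /srv/tftp by default on Ubuntu):

```bash
sudo apt-get update
sudo apt-get install -y tftpd-hpa nginx
sudo mkdir -p /srv/tftp /var/www/html/boot
# Chainload binaries over TFTP; kernel/initrd over HTTP (example file names)
sudo cp undionly.kpxe ipxe.efi /srv/tftp/
sudo cp vmlinuz initramfs.img /var/www/html/boot/
sudo systemctl enable --now tftpd-hpa nginx
```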

Networking:

  • VPC: Default VPC or custom VPC with private subnets
  • Firewall Rules (see the gcloud sketch after this list):
    • Allow UDP/51820 from home lab public IP (WireGuard)
    • Allow UDP/69, TCP/80, TCP/443 from VPN subnet (boot protocols)
  • Routes: Custom route to direct home lab subnet through WireGuard gateway
  • Cloud VPN: Not used (self-managed WireGuard instead to save ~$65/month)
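The firewall rules above translate into two gcloud commands; a sketch where the rule names, target tags, home lab public IP (203.0.113.10), and home lab subnets are illustrative:

```bash
gcloud compute firewall-rules create allow-wireguard \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=udp:51820 --source-ranges=203.0.113.10/32 \
  --target-tags=wireguard-gateway

gcloud compute firewall-rules create allow-boot-protocols \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=udp:69,tcp:80,tcp:443 \
  --source-ranges=192.168.10.0/24,192.168.20.0/24 \
  --target-tags=boot-server
```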

WireGuard Setup:

  • Install WireGuard on Compute Engine VM
  • Configure wg0 interface with PostUp/PostDown iptables rules
  • Store private key in Secret Manager
  • UDM Pro connects as WireGuard peer
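A condensed sketch of that setup, assuming Ubuntu 22.04 on the VM (ens4 is the typical primary NIC on GCP Ubuntu images); addresses, the secret name, and the peer key are illustrative:

```bash
sudo apt-get update && sudo apt-get install -y wireguard

# Allow the VM to route boot traffic beyond the tunnel
echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-wireguard.conf
sudo sysctl --system

# Generate keys and keep the private key in Secret Manager
wg genkey | tee server.key | wg pubkey > server.pub
gcloud secrets create wireguard-server-key --data-file=server.key

sudo tee /etc/wireguard/wg0.conf >/dev/null <<EOF
[Interface]
Address    = 10.100.0.1/24
ListenPort = 51820
PrivateKey = $(cat server.key)
PostUp     = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown   = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# UDM Pro peer; AllowedIPs lists the home lab subnets behind it
PublicKey  = <udm-pro-public-key>
AllowedIPs = 192.168.10.0/24, 192.168.20.0/24
EOF

sudo systemctl enable --now wg-quick@wg0
```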

Cost Breakdown (US regions):

| Component | Monthly Cost |
|---|---|
| e2-micro VM (WireGuard + Boot) | ~$6.50 |
| Static External IP (if attached) | ~$3.50 |
| Egress (10 boots × 150MB) | ~$0.18 |
| Total | ~$10.18 |
| Annual | ~$122 |

Pros and Cons

  • Good, because existing home lab infrastructure already uses GCP extensively
  • Good, because consolidates all cloud resources on single provider (unified billing, IAM, monitoring)
  • Good, because leverages existing GCP expertise and Terraform modules
  • Good, because Cloud Monitoring/Logging already configured for other services
  • Good, because Secret Manager integration for WireGuard key storage
  • Good, because e2-micro instance size is sufficient for network boot workload
  • Good, because low cost (~$10/month for self-managed WireGuard)
  • Good, because VPC networking is familiar and well-documented
  • Neutral, because requires self-managed WireGuard (no native support, same as AWS)
  • Neutral, because TFTP/HTTP/HTTPS served directly from VM (no special GCP features needed)
  • Bad, because slightly more expensive than AWS t4g.micro (~$6.50/month vs ~$6/month)
  • Bad, because creates vendor lock-in to GCP ecosystem
  • Bad, because Cloud VPN (managed IPsec) is expensive (~$73/month), so must use self-managed WireGuard

Option 2: Amazon Web Services (AWS)

Host network boot infrastructure on Amazon Web Services.

Architecture Overview

graph TB
    subgraph "Home Lab Network"
        A[Home Lab Servers]
        B[UDM Pro - WireGuard Client]
    end
    
    subgraph "AWS VPC"
        C[WireGuard Gateway EC2<br/>t4g.micro]
        D[Boot Server EC2<br/>t4g.micro]
        C -->|VPC Routing| D
    end
    
    A -->|PXE Boot Request| B
    B -->|WireGuard Tunnel| C
    C -->|TFTP/HTTP/HTTPS| D
    D -->|Boot Files| C
    C -->|Encrypted Response| B
    B -->|Boot Files| A

Implementation Details

Compute:

  • WireGuard Gateway: t4g.micro EC2 (~$6/month, ARM-based Graviton)
    • Self-managed WireGuard server
    • Source/Dest check disabled for IP forwarding
    • Elastic IP (free when attached to running instance)
  • Boot Server: t4g.micro EC2 (same or consolidated with gateway)
    • TFTP server (tftpd-hpa)
    • HTTP server (nginx)
    • Optional HTTPS with Let’s Encrypt or self-signed cert

Networking:

  • VPC: Default VPC or custom VPC with private subnets
  • Security Groups:
    • WireGuard SG: Allow UDP/51820 from home lab public IP
    • Boot Server SG: Allow UDP/69, TCP/80, TCP/443 from WireGuard SG
  • Route Table: Add route for home lab subnet via WireGuard instance
  • Site-to-Site VPN: Not used (self-managed WireGuard saves ~$30/month)

WireGuard Setup:

  • Install WireGuard on Ubuntu 22.04 or Amazon Linux 2023 EC2
  • Configure wg0 with iptables MASQUERADE
  • Store private key in Secrets Manager
  • UDM Pro connects as WireGuard peer
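The WireGuard installation itself mirrors the GCP sketch above; the AWS-specific pieces are the source/destination check, key storage, and routing. A sketch with illustrative resource IDs:

```bash
# IP forwarding through an EC2 instance requires disabling the source/dest check
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check

# Keep the WireGuard private key in Secrets Manager
aws secretsmanager create-secret \
  --name wireguard-server-key \
  --secret-string "$(cat server.key)"

# Route the home lab subnet back through the WireGuard instance
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 192.168.10.0/24 \
  --instance-id i-0123456789abcdef0
```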

Cost Breakdown (US East):

| Component | Monthly Cost |
|---|---|
| t4g.micro EC2 (WireGuard + Boot) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Egress (10 boots × 150MB) | ~$0.09 |
| Total (On-Demand) | ~$6.09 |
| Total (1-yr Reserved) | ~$3.59 |
| Annual (On-Demand) | ~$73 |
| Annual (Reserved) | ~$43 |

Pros and Cons

  • Good, because t4g.micro ARM instances offer best price/performance (~$6/month on-demand)
  • Good, because Reserved Instances provide significant savings (~40% with 1-year commitment)
  • Good, because Elastic IP is free when attached to running instance
  • Good, because AWS has extensive documentation and community support
  • Good, because potential for future multi-cloud strategy
  • Good, because ACM provides free SSL certificates (if public domain used)
  • Good, because Secrets Manager for WireGuard key storage
  • Good, because low cost (~$6/month on-demand, ~$3.50/month with RI)
  • Neutral, because requires self-managed WireGuard (no native support, same as GCP)
  • Neutral, because TFTP/HTTP/HTTPS served directly from EC2 (no special AWS features)
  • Bad, because introduces multi-cloud complexity (separate billing, IAM, monitoring)
  • Bad, because no existing AWS infrastructure in home lab (new learning curve)
  • Bad, because requires separate monitoring/logging setup (CloudWatch vs Cloud Monitoring)
  • Bad, because separate Terraform state and modules needed
  • Bad, because Site-to-Site VPN is expensive (~$36/month), so must use self-managed WireGuard

More Information

Detailed Analysis

For in-depth analysis of each provider's capabilities, see the dedicated Google Cloud Platform and Amazon Web Services analyses in the Technology Analysis section.

Key Findings Summary

Both providers offer:

  • TFTP Support: Via direct VM/EC2 access (load balancers don’t support TFTP)
  • HTTP/HTTPS Support: Full support via direct VM/EC2 or load balancers
  • WireGuard Compatibility: Self-managed deployment on VM/EC2 (neither has native support)
  • UDM Pro Integration: Native WireGuard client works with both
  • Low Cost: $6-12/month for compute + VPN infrastructure
  • Sufficient Performance: 100+ Mbps throughput on smallest instances

Key differences:

  • GCP: Slightly higher cost (~$10/month), but consolidates with existing infrastructure
  • AWS: Slightly lower cost (~$6/month on-demand, ~$3.50/month Reserved), but introduces multi-cloud complexity

Cost Comparison Table

| Component | GCP (e2-micro) | AWS (t4g.micro On-Demand) | AWS (t4g.micro 1-yr RI) |
|---|---|---|---|
| Compute | $6.50/month | $6.00/month | $3.50/month |
| Static IP | $3.50/month | $0.00 (Elastic IP free when attached) | $0.00 |
| Egress (1.5GB) | $0.18/month | $0.09/month | $0.09/month |
| Monthly | $10.18 | $6.09 | $3.59 |
| Annual | $122 | $73 | $43 |

Savings Analysis: AWS is ~$49-79/year cheaper, but introduces operational complexity.

Protocol Support Comparison

| Protocol | GCP Support | AWS Support | Implementation |
|---|---|---|---|
| TFTP (UDP/69) | ⚠️ Via VM | ⚠️ Via EC2 | Direct VM/EC2 access (no LB support) |
| HTTP (TCP/80) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer |
| HTTPS (TCP/443) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer + cert |
| WireGuard | ⚠️ Self-managed | ⚠️ Self-managed | Install on VM/EC2 |

WireGuard Deployment Comparison

| Aspect | GCP | AWS |
|---|---|---|
| Native Support | ❌ No (IPsec Cloud VPN only) | ❌ No (IPsec Site-to-Site VPN only) |
| Self-Managed | ✅ Compute Engine | ✅ EC2 |
| Setup Complexity | Similar (install, configure, firewall) | Similar (install, configure, SG) |
| IP Forwarding | Enable on VM | Disable Source/Dest check |
| Firewall | VPC Firewall rules | Security Groups |
| Key Storage | Secret Manager | Secrets Manager |
| Cost | ~$10/month total | ~$6/month total |

Trade-offs Analysis

Choosing GCP:

  • Wins: Operational simplicity, unified infrastructure, existing expertise
  • Loses: ~$50-80/year higher cost, vendor lock-in

Choosing AWS:

  • Wins: Lower cost, Reserved Instance savings, multi-cloud optionality
  • Loses: Multi-cloud complexity, separate monitoring/billing, new tooling

For a home lab prioritizing simplicity over cost optimization, GCP’s consolidation benefits outweigh the modest cost difference.

Future Considerations

  1. Cost Reevaluation: If annual costs become significant, reconsider AWS Reserved Instances
  2. Multi-Cloud: If multi-cloud strategy emerges, migrate boot server to AWS
  3. Managed WireGuard: If GCP or AWS adds native WireGuard support, reevaluate managed option
  4. High Availability: If HA required, evaluate multi-region deployment costs on both providers
  • Issue #597 - story(docs): create adr for cloud provider selection

2.4 - [0004] Server Operating System Selection

Evaluate operating systems for homelab server infrastructure with focus on Kubernetes cluster setup and maintenance.

Context and Problem Statement

The homelab infrastructure requires a server operating system to run Kubernetes clusters for container workloads. The choice of operating system significantly impacts ease of cluster initialization, ongoing maintenance burden, security posture, and operational complexity.

The question is: Which operating system should be used for homelab Kubernetes servers?

This decision will affect:

  • Cluster Initialization: Complexity and time required to bootstrap Kubernetes
  • Maintenance Burden: Frequency and complexity of OS updates, Kubernetes upgrades, and patching
  • Security Posture: Attack surface, built-in security features, and hardening requirements
  • Resource Efficiency: RAM, CPU, and disk overhead
  • Operational Complexity: Day-to-day management, troubleshooting, and debugging
  • Learning Curve: Time required for team to become proficient

Decision Drivers

  • Ease of Kubernetes Setup: Minimize steps and complexity for cluster initialization
  • Maintenance Simplicity: Reduce ongoing operational burden for updates and upgrades
  • Security-First Design: Minimal attack surface and strong security defaults
  • Resource Efficiency: Low RAM/CPU/disk overhead for cost-effective homelab
  • Learning Curve: Reasonable adoption time for single-person homelab
  • Community Support: Strong documentation and active community
  • Immutability: Prefer declarative, version-controlled configuration (GitOps-friendly)
  • Purpose-Built: OS optimized specifically for Kubernetes vs general-purpose

Considered Options

  • Option 1: Ubuntu Server with k3s
  • Option 2: Fedora Server with kubeadm
  • Option 3: Talos Linux (purpose-built Kubernetes OS)
  • Option 4: Harvester HCI (hyperconverged platform)

Decision Outcome

Chosen option: “Option 3: Talos Linux”, because:

  1. Minimal Attack Surface: No SSH, shell, or package manager eliminates entire classes of vulnerabilities, providing the strongest security posture
  2. Built-in Kubernetes: No separate installation or configuration complexity - Kubernetes is included and optimized
  3. Declarative Configuration: API-driven, immutable infrastructure aligns with GitOps principles and prevents configuration drift
  4. Lowest Resource Overhead: ~768MB RAM vs 1-2GB+ for traditional distros, maximizing homelab hardware efficiency
  5. Simplified Maintenance: Declarative upgrades (talosctl upgrade) for both OS and Kubernetes reduce operational burden
  6. Security by Default: Immutable filesystem, no shell, KSPP compliance - secure without manual hardening

While the learning curve is steeper than traditional Linux distributions, the benefits of purpose-built Kubernetes infrastructure, minimal maintenance, and superior security outweigh the initial learning investment for a dedicated Kubernetes homelab.

Consequences

  • Good, because minimal attack surface (no SSH/shell) provides strongest security posture
  • Good, because declarative configuration enables GitOps workflows and prevents drift
  • Good, because lowest resource overhead (~768MB RAM) maximizes homelab efficiency
  • Good, because built-in Kubernetes eliminates installation complexity
  • Good, because immutable infrastructure prevents configuration drift
  • Good, because simplified upgrades (single command for OS + K8s) reduce maintenance burden
  • Good, because smallest disk footprint (~500MB) vs 10GB+ for traditional distros
  • Good, because secure by default (no manual hardening required)
  • Good, because purpose-built design optimized specifically for Kubernetes
  • Good, because API-driven management (talosctl) enables automation
  • Neutral, because steeper learning curve (paradigm shift from shell-based management)
  • Neutral, because smaller community than Ubuntu/Fedora (but active and helpful)
  • Bad, because limited to Kubernetes workloads only (not general-purpose)
  • Bad, because no shell access requires different troubleshooting approach
  • Bad, because newer platform (less mature than Ubuntu/Fedora)
  • Bad, because no escape hatch for manual intervention when needed

Confirmation

The implementation will be confirmed by the following checks (sketched as commands after the list):

  • Successfully bootstrapping a Talos cluster using talosctl
  • Deploying test workloads and validating functionality
  • Performing declarative OS and Kubernetes upgrades
  • Measuring actual resource usage (RAM < 1GB per node)
  • Validating security posture (no SSH/shell, immutable filesystem)
  • Testing GitOps workflow (machine configs in version control)

Pros and Cons of the Options

Option 1: Ubuntu Server with k3s

Host Kubernetes on Ubuntu Server 24.04 LTS with the k3s lightweight Kubernetes distribution.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Ubuntu Server
    participant K3s as k3s Components
    
    Admin->>Server: Install Ubuntu 24.04 LTS
    Server->>Server: Configure network (static IP)
    Admin->>Server: Update system
    Admin->>Server: curl -sfL https://get.k3s.io | sh -
    Server->>K3s: Download k3s binary
    K3s->>Server: Configure containerd
    K3s->>Server: Start k3s service
    K3s->>Server: Initialize etcd (embedded)
    K3s->>Server: Start API server
    K3s->>Server: Deploy built-in CNI (Flannel)
    K3s-->>Admin: Control plane ready
    Admin->>Server: Retrieve node token
    Admin->>Server: Install k3s agent on workers
    K3s->>Server: Join workers to cluster
    K3s-->>Admin: Cluster ready (5-10 minutes)

Implementation Details

Installation:

# Single-command k3s install
curl -sfL https://get.k3s.io | sh -

# Get token for workers
sudo cat /var/lib/rancher/k3s/server/node-token

# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -

Resource Requirements:

  • RAM: 1GB total (512MB OS + 512MB k3s)
  • CPU: 1-2 cores
  • Disk: 20GB (10GB OS + 10GB containers)

Maintenance:

# OS updates
sudo apt update && sudo apt upgrade

# k3s upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -

# Or automatic via system-upgrade-controller
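
For the automatic path, Rancher's system-upgrade-controller watches Plan resources and upgrades k3s node by node. A hedged sketch follows; the manifest URL and Plan fields mirror the upstream examples and should be verified against the current k3s documentation:

# Install the controller (upstream manifest; recent releases also ship a separate CRD manifest)
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml

# Upgrade control-plane nodes one at a time, tracking the stable k3s channel
kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  channel: https://update.k3s.io/v1-release/channels/stable
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  upgrade:
    image: rancher/k3s-upgrade
EOF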

Pros and Cons

  • Good, because most familiar Linux distribution (easy adoption)
  • Good, because 5-year LTS support (10 years with Ubuntu Pro)
  • Good, because k3s provides single-command setup
  • Good, because extensive documentation and community support
  • Good, because compatible with all Kubernetes tooling
  • Good, because automatic security updates available
  • Good, because general-purpose (can run non-K8s workloads)
  • Good, because low learning curve
  • Neutral, because moderate resource overhead (1GB RAM)
  • Bad, because general-purpose OS has larger attack surface
  • Bad, because requires manual OS updates and reboots
  • Bad, because managing OS + Kubernetes lifecycle separately
  • Bad, because imperative configuration (not GitOps-native)
  • Bad, because mutable filesystem (configuration drift possible)

Option 2: Fedora Server with kubeadm

Host Kubernetes on Fedora Server with kubeadm (the official Kubernetes bootstrapping tool) and the CRI-O container runtime.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Fedora Server
    participant K8s as Kubernetes Components
    
    Admin->>Server: Install Fedora 41
    Server->>Server: Configure network
    Admin->>Server: Update system (dnf update)
    Admin->>Server: Install CRI-O
    Server->>Server: Configure CRI-O runtime
    Admin->>Server: Install kubeadm/kubelet/kubectl
    Server->>Server: Disable swap, load kernel modules
    Server->>Server: Configure SELinux
    Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
    K8s->>Server: Generate certificates
    K8s->>Server: Start etcd
    K8s->>Server: Start API server
    K8s-->>Admin: Control plane ready
    Admin->>K8s: kubectl apply CNI
    K8s->>Server: Deploy CNI pods
    Admin->>K8s: kubeadm join (workers)
    K8s-->>Admin: Cluster ready (15-20 minutes)

Implementation Details

Installation:

# Install CRI-O
sudo dnf install -y cri-o
sudo systemctl enable --now crio

# Install kubeadm components and enable the kubelet
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet

# Initialize cluster
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock

# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
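
The commands above assume the standard kubeadm host preparation has already been done; a sketch of those prerequisite steps, following the upstream kubeadm documentation (adjust for the Fedora release in use):

# Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Load required kernel modules
sudo modprobe overlay
sudo modprobe br_netfilter
printf 'overlay\nbr_netfilter\n' | sudo tee /etc/modules-load.d/k8s.conf

# Networking sysctls for pod traffic
sudo tee /etc/sysctl.d/k8s.conf <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

# SELinux: CRI-O supports enforcing mode; consult the Fedora/CRI-O docs before relaxing it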

Resource Requirements:

  • RAM: 2.2GB total (700MB OS + 1.5GB Kubernetes)
  • CPU: 2+ cores
  • Disk: 35GB (15GB OS + 20GB containers)

Maintenance:

# OS updates (each release is supported ~13 months, so major version upgrades are frequent)
sudo dnf update -y

# Kubernetes upgrade
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl

Pros and Cons

  • Good, because SELinux enabled by default (stronger than AppArmor)
  • Good, because latest kernel and packages (bleeding edge)
  • Good, because native CRI-O support (OpenShift compatibility)
  • Good, because upstream for RHEL (enterprise patterns)
  • Good, because kubeadm provides full control over cluster
  • Neutral, because faster release cycle (latest features, but more upgrades)
  • Bad, because short support cycle (13 months per release)
  • Bad, because bleeding-edge can introduce instability
  • Bad, because complex kubeadm setup (many manual steps)
  • Bad, because higher resource overhead (2.2GB RAM)
  • Bad, because SELinux configuration for Kubernetes is complex
  • Bad, because frequent OS upgrades required (every 13 months)
  • Bad, because managing OS + Kubernetes separately
  • Bad, because imperative configuration (not GitOps-native)

Option 3: Talos Linux (purpose-built Kubernetes OS)

Use Talos Linux, an immutable, API-driven operating system designed specifically for Kubernetes with built-in cluster management.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Talos as Talos Linux
    participant K8s as Kubernetes Components
    
    Admin->>Server: Boot Talos ISO (PXE or USB)
    Server->>Talos: Start in maintenance mode
    Talos-->>Admin: API endpoint ready
    Admin->>Admin: Generate configs (talosctl gen config)
    Admin->>Talos: talosctl apply-config (controlplane.yaml)
    Talos->>Server: Install Talos to disk
    Server->>Server: Reboot from disk
    Talos->>K8s: Start kubelet
    Talos->>K8s: Start etcd
    Talos->>K8s: Start API server
    Admin->>Talos: talosctl bootstrap
    Talos->>K8s: Initialize cluster
    K8s->>Talos: Start controller-manager
    K8s-->>Admin: Control plane ready
    Admin->>K8s: Apply CNI
    Admin->>Talos: Apply worker configs
    Talos->>K8s: Join workers
    K8s-->>Admin: Cluster ready (10-15 minutes)

Implementation Details

Installation:

# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443

# Apply config to control plane (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml

# Bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10

# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.10

# Add workers
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml

Machine Configuration (declarative YAML):

version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.10/24
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml

Resource Requirements:

  • RAM: 768MB total (256MB OS + 512MB Kubernetes)
  • CPU: 1-2 cores
  • Disk: 10-15GB (500MB OS + 10GB containers)

Maintenance:

# Upgrade Talos (OS + Kubernetes)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0

# Upgrade Kubernetes version
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0

# Apply config changes
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml
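
For clusters with more than one node, upgrades are typically rolled out one node at a time, waiting for the cluster to report healthy in between; a minimal sketch with illustrative node IPs:

# Rolling upgrade across nodes, one at a time
for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
  talosctl upgrade --nodes "$node" --image ghcr.io/siderolabs/installer:v1.9.0
  talosctl --nodes "$node" --endpoints 192.168.1.10 health
done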

Pros and Cons

  • Good, because Kubernetes built-in (no separate installation)
  • Good, because minimal attack surface (no SSH, shell, package manager)
  • Good, because immutable infrastructure (config drift impossible)
  • Good, because API-driven management (GitOps-friendly)
  • Good, because lowest resource overhead (~768MB RAM)
  • Good, because declarative configuration (YAML in version control)
  • Good, because secure by default (no manual hardening)
  • Good, because smallest disk footprint (~500MB OS)
  • Good, because designed specifically for Kubernetes
  • Good, because simple declarative upgrades (OS + K8s)
  • Good, because UEFI Secure Boot support
  • Neutral, because smaller community (but active and helpful)
  • Bad, because steep learning curve (paradigm shift)
  • Bad, because limited to Kubernetes workloads only
  • Bad, because troubleshooting without shell requires different approach
  • Bad, because relatively new (less mature than Ubuntu/Fedora)
  • Bad, because no escape hatch for manual intervention

Option 4: Harvester HCI (hyperconverged platform)

Use Harvester, a hyperconverged infrastructure platform built on K3s and KubeVirt for unified VM + container management.

Architecture Overview

sequenceDiagram
    participant Admin
    participant Server as Bare Metal Server
    participant Harvester as Harvester HCI
    participant K3s as K3s / KubeVirt
    participant Storage as Longhorn Storage
    
    Admin->>Server: Boot Harvester ISO
    Server->>Harvester: Installation wizard
    Admin->>Harvester: Configure cluster (VIP, storage)
    Harvester->>Server: Install RancherOS 2.0
    Harvester->>Server: Install K3s
    Server->>Server: Reboot
    Harvester->>K3s: Start K3s server
    K3s->>Storage: Deploy Longhorn
    K3s->>Server: Deploy KubeVirt
    K3s->>Server: Deploy multus CNI
    Harvester-->>Admin: Web UI ready
    Admin->>Harvester: Add nodes
    Harvester->>K3s: Join cluster
    K3s-->>Admin: Cluster ready (20-30 minutes)

Implementation Details

Installation: Interactive ISO wizard or cloud-init config

Resource Requirements:

  • RAM: 8GB minimum per node (16GB+ recommended)
  • CPU: 4+ cores per node
  • Disk: 250GB+ per node (100GB OS + 150GB storage)
  • Nodes: 3+ for production HA

Features:

  • Web UI management
  • Built-in storage (Longhorn)
  • VM support (KubeVirt)
  • Live migration
  • Rancher integration

Pros and Cons

  • Good, because unified VM + container platform
  • Good, because built-in K3s (Kubernetes included)
  • Good, because web UI simplifies management
  • Good, because built-in persistent storage (Longhorn)
  • Good, because VM live migration
  • Good, because Rancher integration
  • Neutral, because immutable OS layer
  • Bad, because very heavy resource requirements (8GB+ RAM)
  • Bad, because complex architecture (KubeVirt, Longhorn, multus)
  • Bad, because overkill for container-only workloads
  • Bad, because larger attack surface (web UI, VM layer)
  • Bad, because requires 3+ nodes for HA (not single-node friendly)
  • Bad, because steep learning curve for full feature set

More Information

Detailed Analysis

For in-depth analysis of each operating system:

  • Ubuntu Server Analysis

    • Installation methods (kubeadm, k3s, MicroK8s)
    • Cluster initialization sequences
    • Maintenance requirements and upgrade procedures
    • Resource overhead and security posture
  • Fedora Server Analysis

    • kubeadm with CRI-O installation
    • SELinux configuration for Kubernetes
    • Rapid release cycle implications
    • RHEL ecosystem compatibility
  • Talos Linux Analysis

    • API-driven, immutable architecture
    • Declarative configuration model
    • Security-first design principles
    • Production readiness and advanced features
  • Harvester HCI Analysis

    • Hyperconverged infrastructure capabilities
    • VM + container unified platform
    • KubeVirt and Longhorn integration
    • Multi-node cluster requirements

Key Findings Summary

Resource efficiency comparison:

  • Talos: 768MB RAM, 500MB disk (most efficient)
  • Ubuntu + k3s: 1GB RAM, 20GB disk (efficient)
  • ⚠️ Fedora + kubeadm: 2.2GB RAM, 35GB disk (moderate)
  • Harvester: 8GB+ RAM, 250GB+ disk (heavy)

Security posture comparison:

  • Talos: Minimal attack surface (no SSH/shell, immutable)
  • Fedora: SELinux by default (strong MAC)
  • ⚠️ Ubuntu: AppArmor (moderate security)
  • ⚠️ Harvester: Larger attack surface (web UI, VM layer)

Operational complexity comparison:

  • Ubuntu + k3s: Single command install, familiar management
  • Talos: Declarative, automated (after learning curve)
  • ⚠️ Fedora + kubeadm: Manual kubeadm steps, frequent OS upgrades
  • Harvester: Complex HCI architecture, heavy requirements

Decision Matrix

CriterionUbuntu + k3sFedora + kubeadmTalos LinuxHarvester
Setup Simplicity⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Maintenance Burden⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Security Posture⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Resource Efficiency⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Learning Curve⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Community Support⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Immutability⭐⭐⭐⭐⭐⭐⭐⭐⭐
GitOps-Friendly⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Purpose-Built⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Overall Score29/4524/4538/4528/45

Talos Linux scores highest for Kubernetes-dedicated homelab infrastructure prioritizing security, efficiency, and GitOps workflows.

Trade-offs Analysis

Choosing Talos Linux:

  • Wins: Best security, lowest overhead, declarative configuration, minimal maintenance
  • Loses: Steeper learning curve, no shell access, smaller community

Choosing Ubuntu + k3s:

  • Wins: Easiest adoption, largest community, general-purpose flexibility
  • Loses: Higher attack surface, manual OS management, imperative config

Choosing Fedora + kubeadm:

  • Wins: Latest features, SELinux, enterprise compatibility
  • Loses: Frequent OS upgrades, complex setup, higher overhead

Choosing Harvester:

  • Wins: VM + container unified platform, web UI
  • Loses: Heavy resources, complex architecture, overkill for K8s-only

For a Kubernetes-dedicated homelab prioritizing security and efficiency, Talos Linux’s benefits outweigh the learning curve investment.

Future Considerations

  1. Team Growth: If team grows beyond single person, reassess Ubuntu for familiarity
  2. VM Requirements: If VM workloads emerge, consider Harvester or KubeVirt on Talos
  3. Enterprise Patterns: If RHEL compatibility needed, reconsider Fedora/CentOS Stream
  4. Maintenance Burden: If Talos learning curve proves too steep, fallback to k3s
  5. Talos Maturity: Monitor Talos ecosystem growth and production adoption
  • Issue #598 - story(docs): create adr for server operating system

2.5 - [0005] Network Boot Infrastructure Implementation on Google Cloud

Evaluate implementation approaches for deploying network boot infrastructure on Google Cloud Platform using UEFI HTTP boot, comparing custom server implementation versus Matchbox-based solution.

Context and Problem Statement

ADR-0002 established that network boot infrastructure will be hosted on a cloud provider accessed via WireGuard VPN. ADR-0003 selected Google Cloud Platform as the hosting provider to consolidate infrastructure and leverage existing expertise.

The remaining question is: How should the network boot server itself be implemented?

This decision affects:

  • Development Effort: Time required to build, test, and maintain the solution
  • Feature Completeness: Capabilities for boot image management, machine mapping, and provisioning workflows
  • Operational Complexity: Deployment, monitoring, and troubleshooting burden
  • Security: Boot image integrity, access control, and audit capabilities
  • Scalability: Ability to grow from single home lab to multiple environments

The boot server must handle:

  1. HTTP/HTTPS requests for UEFI boot scripts, kernels, initrd images, and cloud-init configurations
  2. Machine-to-image mapping to serve appropriate boot files based on MAC address, hardware profile, or tags
  3. Boot image lifecycle management including upload, versioning, and rollback capabilities

Hardware-Specific Context

The target bare metal servers (HP DL360 Gen 9) have the following network boot capabilities:

  • UEFI HTTP Boot: Supported in iLO 4 firmware v2.40+ (released 2016)
  • TLS Support: Server-side TLS only (no client certificate authentication)
  • Boot Process: Firmware handles initial HTTP requests directly (no PXE/TFTP chain loading required)
  • Configuration: Boot URL configured via iLO RBSU or UEFI System Utilities

Security Implications: Since the servers cannot present client certificates for mTLS authentication with Cloudflare, the WireGuard VPN serves as the secure transport layer for boot traffic. The HTTP boot server is only accessible through the VPN tunnel.

Reference: HP DL360 Gen 9 Network Boot Analysis

Decision Drivers

  • Time to Production: Minimize time to get a working network boot infrastructure
  • Feature Requirements: Must support machine-specific boot configurations, image versioning, and cloud-init integration
  • Maintenance Burden: Prefer solutions that minimize ongoing maintenance and updates
  • GCP Integration: Should leverage GCP services (Cloud Storage, Secret Manager, IAM)
  • Security: Boot images must be served securely with access control and integrity verification
  • Observability: Comprehensive logging and monitoring for troubleshooting boot failures
  • Cost: Minimize infrastructure costs while meeting functional requirements
  • Future Flexibility: Ability to extend or customize as needs evolve

Considered Options

  • Option 1: Custom server implementation (Go-based)
  • Option 2: Matchbox-based solution

Decision Outcome

Chosen option: “Option 1: Custom implementation”, because:

  1. UEFI HTTP Boot Simplification: Elimination of TFTP/PXE dramatically reduces implementation complexity
  2. Cloud Run Deployment: HTTP-only boot enables serverless deployment (~$5/month vs $8-17/month)
  3. Development Time Manageable: UEFI HTTP boot reduces custom development to 2-3 weeks
  4. Full Control: Custom implementation maintains flexibility for future home lab requirements
  5. GCP Native Integration: Direct Cloud Storage, Firestore, Secret Manager, and IAM integration
  6. Existing Framework: Leverages z5labs/humus patterns already in use across services
  7. HTTP REST API: Native HTTP REST admin API via z5labs/humus framework provides better integration with existing tooling

Consequences

  • Good, because UEFI HTTP boot eliminates TFTP complexity entirely
  • Good, because Cloud Run deployment reduces operational overhead and cost
  • Good, because leverages existing z5labs/humus framework and Go expertise
  • Good, because GCP native integration (Cloud Storage, Firestore, Secret Manager, IAM)
  • Good, because full control over implementation enables future customization
  • Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
  • Good, because OpenTelemetry observability built-in from existing patterns
  • Neutral, because requires 2-3 weeks development time vs 1 week for Matchbox setup
  • Neutral, because ongoing maintenance responsibility (no upstream project support)
  • Bad, because custom implementation may miss edge cases that Matchbox handles
  • Bad, because reinvents machine matching and boot configuration patterns
  • Bad, because Cloud Run cold start latency needs monitoring (mitigated with min instances = 1)

Confirmation

The implementation success will be validated by:

  • Successfully deploying custom boot server on GCP Cloud Run
  • Successfully network booting HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
  • Confirming iLO 4 firmware v2.40+ compatibility with HTTP boot workflow
  • Validating boot image upload and versioning workflows via HTTP REST API
  • Measuring Cloud Run cold start latency for boot requests (target: < 100ms)
  • Measuring boot file request latency for kernel/initrd downloads (target: < 100ms)
  • Confirming Cloud Storage integration for boot asset storage
  • Testing machine-to-image mapping based on MAC address using Firestore
  • Validating WireGuard VPN security for boot traffic (compensating for lack of client cert support)
  • Verifying OpenTelemetry observability integration with Cloud Monitoring

Pros and Cons of the Options

Option 1: Custom Server Implementation (Go-based)

Build a custom network boot server in Go, leveraging the existing z5labs/humus framework for HTTP services.

Architecture Overview

architecture-beta
    group gcp(cloud)[GCP VPC]

    service wg_nlb(internet)[Network LB] in gcp
    service wireguard(server)[WireGuard Gateway] in gcp
    service https_lb(internet)[HTTPS LB] in gcp
    service compute(server)[Compute Engine] in gcp
    service storage(database)[Cloud Storage] in gcp
    service firestore(database)[Firestore] in gcp
    service secrets(disk)[Secret Manager] in gcp
    service monitoring(internet)[Cloud Monitoring] in gcp

    group homelab(cloud)[Home Lab]
    service udm(server)[UDM Pro] in homelab
    service servers(server)[Bare Metal Servers] in homelab

    servers:L -- R:udm
    udm:R -- L:wg_nlb
    wg_nlb:R -- L:wireguard
    wireguard:R -- L:https_lb
    https_lb:R -- L:compute
    compute:B --> T:storage
    compute:B --> T:firestore
    compute:R --> L:secrets
    compute:T --> B:monitoring

Components:

  • Boot Server: Go service deployed to Cloud Run (or Compute Engine VM as fallback)
    • HTTP/HTTPS server (using z5labs/humus framework with OpenAPI)
    • UEFI HTTP boot endpoint serving boot scripts and assets
    • HTTP REST admin API for boot configuration management
  • Cloud Storage: Buckets for boot images, boot scripts, kernels, initrd files
  • Firestore/Datastore: Machine-to-image mapping database (MAC → boot profile)
  • Secret Manager: WireGuard keys, TLS certificates (optional for HTTPS boot)
  • Cloud Monitoring: Metrics for boot requests, success/failure rates, latency

Boot Image Lifecycle

sequenceDiagram
    participant Admin
    participant API as Boot Server API
    participant Storage as Cloud Storage
    participant DB as Firestore
    participant Monitor as Cloud Monitoring

    Note over Admin,Monitor: Upload Boot Image
    Admin->>API: POST /api/v1/images (kernel, initrd, metadata)
    API->>API: Validate image integrity (checksum)
    API->>Storage: Upload kernel to gs://boot-images/kernels/
    API->>Storage: Upload initrd to gs://boot-images/initrd/
    API->>DB: Store metadata (version, checksum, tags)
    API->>Monitor: Log upload event
    API->>Admin: 201 Created (image ID)

    Note over Admin,Monitor: Map Machine to Image
    Admin->>API: POST /api/v1/machines (MAC, image_id, profile)
    API->>DB: Store machine mapping
    API->>Admin: 201 Created

    Note over Admin,Monitor: UEFI HTTP Boot Request
    participant Server as Home Lab Server
    Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
    Server->>API: HTTP GET /boot?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
    API->>DB: Query machine mapping by MAC
    API->>API: Generate iPXE script (kernel, initrd URLs)
    API->>Monitor: Log boot script request
    API->>Server: Send iPXE script
    
    Server->>API: HTTP GET /kernels/ubuntu-22.04.img
    API->>Storage: Fetch kernel from Cloud Storage
    API->>Monitor: Log kernel download (size, duration)
    API->>Server: Stream kernel file
    
    Server->>API: HTTP GET /initrd/ubuntu-22.04.img
    API->>Storage: Fetch initrd from Cloud Storage
    API->>Monitor: Log initrd download
    API->>Server: Stream initrd file
    
    Server->>Server: Boot into OS
    
    Note over Admin,Monitor: Rollback Image Version
    Admin->>API: POST /api/v1/machines/{mac}/rollback
    API->>DB: Update machine mapping to previous image_id
    API->>Monitor: Log rollback event
    API->>Admin: 200 OK
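
The same lifecycle exercised end to end with curl; the endpoint paths mirror the sequence diagram above, while the hostname, payload fields, and file names are illustrative assumptions rather than a finalized API contract:

# Upload a boot image (kernel + initrd + metadata)
curl -X POST http://boot.internal/api/v1/images \
  -F kernel=@vmlinuz -F initrd=@initrd.img \
  -F 'metadata={"name":"ubuntu-22.04","version":"1"}'

# Map a machine (by MAC) to the uploaded image
curl -X POST http://boot.internal/api/v1/machines \
  -H 'Content-Type: application/json' \
  -d '{"mac":"aa:bb:cc:dd:ee:ff","image_id":"ubuntu-22.04","profile":"ubuntu-22.04-server"}'

# Simulate the firmware's boot request (returns an iPXE script)
curl 'http://boot.internal/boot?mac=aa:bb:cc:dd:ee:ff'

# Roll the machine back to its previous image
curl -X POST http://boot.internal/api/v1/machines/aa:bb:cc:dd:ee:ff/rollback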

Implementation Details

Development Stack:

  • Language: Go 1.24 (leverage existing Go expertise)
  • HTTP Framework: z5labs/humus (consistent with existing services)
  • UEFI Boot: Standard HTTP handlers (no special libraries needed)
  • Storage Client: cloud.google.com/go/storage
  • Database: Firestore for machine mappings (or simple JSON config in Cloud Storage)
  • Observability: OpenTelemetry (metrics, traces, logs to Cloud Monitoring/Trace)

Deployment (sketched as gcloud commands after the list):

  • Cloud Run (preferred - HTTP-only boot enables serverless deployment):
    • Min instances: 1 (ensures fast boot response, avoids cold start delays)
    • Max instances: 2 (home lab scale)
    • Memory: 512MB
    • CPU: 1 vCPU
    • Health checks: /health/startup, /health/liveness
    • Concurrency: 10 requests per instance
  • Alternative - Compute Engine VM (if Cloud Run latency unacceptable):
    • e2-micro instance ($6.50/month)
    • Container-Optimized OS with Docker
    • systemd service for boot server
    • Health checks: /health/startup, /health/liveness
  • Networking:
    • VPC firewall: Allow TCP/80, TCP/443 from WireGuard subnet (no UDP/69 needed)
    • Static internal IP for boot server (Compute Engine) or HTTPS Load Balancer (Cloud Run)
    • Cloud NAT for outbound connectivity (Cloud Storage access)
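
A deployment sketch matching the parameters above; the project, image path, region, network name, and WireGuard subnet range are placeholders, and IAM is left open because the firmware cannot present credentials (access is restricted at the network layer instead):

# Deploy the boot server to Cloud Run (min instances = 1 avoids cold starts)
gcloud run deploy boot-server \
  --image=us-docker.pkg.dev/PROJECT_ID/homelab/boot-server:latest \
  --region=us-central1 \
  --min-instances=1 --max-instances=2 \
  --memory=512Mi --cpu=1 --concurrency=10 \
  --ingress=internal \
  --allow-unauthenticated

# Allow boot traffic from the WireGuard subnet only (TCP only, no UDP/69)
gcloud compute firewall-rules create allow-boot-http \
  --network=homelab-vpc \
  --allow=tcp:80,tcp:443 \
  --source-ranges=10.8.0.0/24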

Configuration Management:

  • Machine mappings stored in Firestore or Cloud Storage JSON files
  • Boot profiles defined in YAML (similar to Matchbox groups):
    profiles:
      - name: ubuntu-22.04-server
        kernel: gs://boot-images/kernels/ubuntu-22.04.img
        initrd: gs://boot-images/initrd/ubuntu-22.04.img
        cmdline: "console=tty0 console=ttyS0"
        cloud_init: gs://boot-images/cloud-init/ubuntu-base.yaml
    
    machines:
      - mac: "aa:bb:cc:dd:ee:ff"
        profile: ubuntu-22.04-server
        hostname: node-01
    

Cost Breakdown:

Option A: Cloud Run Deployment (Preferred):

| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 512MB, always-on) | $3.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$5.18 |

Option B: Compute Engine Deployment (If Cloud Run latency unacceptable):

| Component | Monthly Cost |
|---|---|
| e2-micro VM (boot server) | $6.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |

Pros and Cons

  • Good, because UEFI HTTP boot eliminates TFTP complexity entirely
  • Good, because Cloud Run deployment option reduces operational overhead and infrastructure cost
  • Good, because full control over boot server implementation and features
  • Good, because leverages existing Go expertise and z5labs/humus framework patterns
  • Good, because seamless GCP integration (Cloud Storage, Firestore, Secret Manager, IAM)
  • Good, because minimal dependencies (no external projects to track)
  • Good, because customizable to specific home lab requirements
  • Good, because OpenTelemetry observability built-in from existing patterns
  • Good, because can optimize for home lab scale (< 20 machines)
  • Good, because lightweight implementation (no unnecessary features)
  • Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
  • Good, because standard HTTP serving is well-understood (lower risk than TFTP)
  • Neutral, because development effort required (2-3 weeks for MVP, reduced from 3-4 weeks)
  • Neutral, because requires ongoing maintenance and security updates
  • Neutral, because Cloud Run cold start latency needs validation (POC required)
  • Bad, because reinvents machine matching and boot configuration patterns
  • Bad, because testing network boot scenarios still requires hardware
  • Bad, because potential for bugs in custom implementation
  • Bad, because no community support or established best practices
  • Bad, because development time still longer than Matchbox (2-3 weeks vs 1 week)

Option 2: Matchbox-Based Solution

Deploy Matchbox, an open-source network boot server developed by CoreOS (now part of Red Hat), to handle UEFI HTTP boot workflows.

Architecture Overview

architecture-beta
    group gcp(cloud)[GCP VPC]
    
    service wg_nlb(internet)[Network LB] in gcp
    service wireguard(server)[WireGuard Gateway] in gcp
    service https_lb(internet)[HTTPS LB] in gcp
    service compute(server)[Compute Engine] in gcp
    service storage(database)[Cloud Storage] in gcp
    service secrets(disk)[Secret Manager] in gcp
    service monitoring(internet)[Cloud Monitoring] in gcp
    
    group homelab(cloud)[Home Lab]
    service udm(server)[UDM Pro] in homelab
    service servers(server)[Bare Metal Servers] in homelab
    
    servers:L -- R:udm
    udm:R -- L:wg_nlb
    wg_nlb:R -- L:wireguard
    wireguard:R -- L:https_lb
    https_lb:R -- L:compute
    compute:B --> T:storage
    compute:R --> L:secrets
    compute:T --> B:monitoring

Components:

  • Matchbox Server: Container deployed to Cloud Run or Compute Engine VM
    • HTTP/gRPC APIs for boot workflows and configuration
    • UEFI HTTP boot support (TFTP disabled)
    • Machine grouping and profile templating
    • Ignition, Cloud-Init, and generic boot support
  • Cloud Storage: Backend for boot assets (mounted via gcsfuse or synced periodically)
  • Local Storage (Compute Engine only): /var/lib/matchbox for assets and configuration (synced from Cloud Storage)
  • Secret Manager: WireGuard keys, Matchbox TLS certificates
  • Cloud Monitoring: Logs from Matchbox container, custom metrics via log parsing

Boot Image Lifecycle

sequenceDiagram
    participant Admin
    participant CLI as matchbox CLI / API
    participant Matchbox as Matchbox Server
    participant Storage as Cloud Storage
    participant Monitor as Cloud Monitoring

    Note over Admin,Monitor: Upload Boot Image
    Admin->>CLI: Upload kernel/initrd via gRPC API
    CLI->>Matchbox: gRPC CreateAsset(kernel, initrd)
    Matchbox->>Matchbox: Validate asset integrity
    Matchbox->>Matchbox: Store to /var/lib/matchbox/assets/
    Matchbox->>Storage: Sync to gs://boot-assets/ (via sidecar script)
    Matchbox->>Monitor: Log asset upload event
    Matchbox->>CLI: Asset ID, checksum

    Note over Admin,Monitor: Create Boot Profile
    Admin->>CLI: Create profile YAML (kernel, initrd, cmdline)
    CLI->>Matchbox: gRPC CreateProfile(profile.yaml)
    Matchbox->>Matchbox: Store to /var/lib/matchbox/profiles/
    Matchbox->>Storage: Sync profiles to gs://boot-config/
    Matchbox->>CLI: Profile ID

    Note over Admin,Monitor: Create Machine Group
    Admin->>CLI: Create group YAML (MAC selector, profile mapping)
    CLI->>Matchbox: gRPC CreateGroup(group.yaml)
    Matchbox->>Matchbox: Store to /var/lib/matchbox/groups/
    Matchbox->>Storage: Sync groups to gs://boot-config/
    Matchbox->>CLI: Group ID

    Note over Admin,Monitor: UEFI HTTP Boot Request
    participant Server as Home Lab Server
    Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
    Server->>Matchbox: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
    Matchbox->>Matchbox: Match MAC to group
    Matchbox->>Matchbox: Render iPXE template with profile
    Matchbox->>Monitor: Log boot request (MAC, group, profile)
    Matchbox->>Server: Send iPXE script
    
    Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-kernel.img
    Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
    Matchbox->>Monitor: Log asset download (size, duration)
    Matchbox->>Server: Stream kernel file
    
    Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-initrd.img
    Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
    Matchbox->>Monitor: Log asset download
    Matchbox->>Server: Stream initrd file
    
    Server->>Server: Boot into OS
    
    Note over Admin,Monitor: Rollback Machine Group
    Admin->>CLI: Update group YAML (change profile reference)
    CLI->>Matchbox: gRPC UpdateGroup(group.yaml)
    Matchbox->>Matchbox: Update /var/lib/matchbox/groups/
    Matchbox->>Storage: Sync updated group config
    Matchbox->>Monitor: Log group update
    Matchbox->>CLI: Success

Implementation Details

Matchbox Deployment (a container run sketch follows the list):

  • Container: quay.io/poseidon/matchbox:latest (official image)
  • Deployment Options:
    • Cloud Run (preferred - HTTP-only boot enables serverless deployment):
      • Min instances: 1 (ensures fast boot response)
      • Memory: 1GB RAM (Matchbox recommendation)
      • CPU: 1 vCPU
      • Storage: Cloud Storage for assets/profiles/groups (via HTTP API)
    • Compute Engine VM (if persistent local storage preferred):
      • e2-small instance ($14/month, 2GB RAM recommended for Matchbox)
      • /var/lib/matchbox: Persistent disk (10GB SSD, $1.70/month)
      • Cloud Storage sync: Periodic backup of assets/profiles/groups to gs://matchbox-config/
      • Option: Use gcsfuse to mount Cloud Storage directly (adds latency but simplifies backups)
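
A minimal sketch of running the official container on a Compute Engine VM; the flags follow the Matchbox documentation's examples and should be verified against the deployed version, and /etc/matchbox is assumed to hold the TLS certificates for the gRPC API:

# Run Matchbox with the HTTP endpoint on 8080 and the gRPC API on 8081
docker run -d --name matchbox \
  -p 8080:8080 -p 8081:8081 \
  -v /var/lib/matchbox:/var/lib/matchbox \
  -v /etc/matchbox:/etc/matchbox:ro \
  quay.io/poseidon/matchbox:latest \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -data-path=/var/lib/matchbox \
  -assets-path=/var/lib/matchbox/assets \
  -log-level=info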

Configuration Structure:

/var/lib/matchbox/
├── assets/           # Boot images (kernels, initrds, ISOs)
│   ├── ubuntu-22.04-kernel.img
│   ├── ubuntu-22.04-initrd.img
│   └── flatcar-stable.img.gz
├── profiles/         # Boot profiles (YAML)
│   ├── ubuntu-server.yaml
│   └── flatcar-container.yaml
└── groups/           # Machine groups (YAML)
    ├── default.yaml
    ├── node-01.yaml
    └── storage-nodes.yaml

Example Profile (profiles/ubuntu-server.yaml):

id: ubuntu-22.04-server
name: Ubuntu 22.04 LTS Server
boot:
  kernel: /assets/ubuntu-22.04-kernel.img
  initrd:
    - /assets/ubuntu-22.04-initrd.img
  args:
    - console=tty0
    - console=ttyS0
    - ip=dhcp
ignition_id: ubuntu-base.yaml

Example Group (groups/node-01.yaml):

id: node-01
name: Node 01 - Ubuntu Server
profile: ubuntu-22.04-server
selector:
  mac: "aa:bb:cc:dd:ee:ff"
metadata:
  hostname: node-01.homelab.local
  ssh_authorized_keys:
    - "ssh-ed25519 AAAA..."

GCP Integration:

  • Cloud Storage Sync: Cron job or sidecar container to sync /var/lib/matchbox to Cloud Storage
    # Sync every 5 minutes
    */5 * * * * gsutil -m rsync -r /var/lib/matchbox gs://matchbox-config/
    
  • Secret Manager: Store Matchbox TLS certificates for gRPC API authentication
  • Cloud Monitoring: Ship Matchbox logs to Cloud Logging, parse for metrics:
    • Boot request count by MAC/group
    • Asset download success/failure rates
    • Boot script vs asset request distribution (a log-based metric sketch follows this list)
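
As a sketch, boot activity in the Matchbox logs can be turned into Cloud Monitoring metrics via log-based metrics; the resource type and filter below are assumptions about the log shape and will need tuning against real output:

# Count boot script requests from Matchbox container logs
gcloud logging metrics create matchbox_boot_requests \
  --description="Matchbox boot script requests" \
  --log-filter='resource.type="gce_instance" AND textPayload:"boot.ipxe"'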

Networking:

  • VPC firewall: Allow TCP/8080 (HTTP), TCP/8081 (gRPC) from WireGuard subnet (no UDP/69 needed)
  • Optional: Internal load balancer if high availability required (adds ~$18/month)
  • Note: Cloud Run deployment includes integrated HTTPS load balancing

Cost Breakdown:

Option A: Cloud Run Deployment (Preferred):

| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 1GB RAM, always-on) | $7.00 |
| Cloud Storage (50GB boot images) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |

Option B: Compute Engine Deployment (If persistent local storage preferred):

| Component | Monthly Cost |
|---|---|
| e2-small VM (Matchbox server) | $14.00 |
| Persistent SSD (10GB) | $1.70 |
| Cloud Storage (50GB backups) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$16.88 |

Pros and Cons

  • Good, because HTTP-only boot enables Cloud Run deployment (reduces cost significantly)
  • Good, because UEFI HTTP boot eliminates TFTP complexity and potential failure points
  • Good, because production-ready boot server with extensive real-world usage
  • Good, because feature-complete with machine grouping, templating, and multi-OS support
  • Good, because gRPC API for programmatic boot configuration management
  • Good, because supports Ignition (Flatcar, CoreOS), Cloud-Init, and generic boot workflows
  • Good, because well-documented with established best practices
  • Good, because active community and upstream maintenance (Red Hat/CoreOS)
  • Good, because reduces development time to days (deploy + configure vs weeks of coding)
  • Good, because avoids reinventing network boot patterns (machine matching, boot configuration)
  • Good, because proven security model (TLS for gRPC, asset integrity checks)
  • Neutral, because requires learning Matchbox configuration patterns (YAML profiles/groups)
  • Neutral, because containerized deployment (Docker on Compute Engine or Cloud Run)
  • Neutral, because Cloud Run deployment option competitive with custom implementation cost
  • Bad, because introduces external dependency (Matchbox project maintenance)
  • Bad, because some features unnecessary for home lab scale (large-scale provisioning, etcd backend)
  • Bad, because less control over implementation details (limited customization)
  • Bad, because Cloud Storage integration requires custom sync scripts (Matchbox doesn’t natively support GCS backend)
  • Bad, because dependency on upstream for security patches and bug fixes

UEFI HTTP Boot Architecture

This section documents the UEFI HTTP boot capability that fundamentally changes the network boot infrastructure design.

Boot Process Overview

Traditional PXE Boot (NOT USED - shown for comparison):

sequenceDiagram
    participant Server as Bare Metal Server
    participant DHCP as DHCP Server
    participant TFTP as TFTP Server
    participant HTTP as HTTP Server

    Note over Server,HTTP: Traditional PXE Boot Chain (NOT USED)
    Server->>DHCP: DHCP Discover
    DHCP->>Server: DHCP Offer (TFTP server, boot filename)
    Server->>TFTP: TFTP GET /pxelinux.0
    TFTP->>Server: Send PXE bootloader
    Server->>TFTP: TFTP GET /ipxe.efi
    TFTP->>Server: Send iPXE binary
    Server->>HTTP: HTTP GET /boot.ipxe
    HTTP->>Server: Send boot script
    Server->>HTTP: HTTP GET /kernel, /initrd
    HTTP->>Server: Stream boot files

UEFI HTTP Boot (ACTUAL IMPLEMENTATION):

sequenceDiagram
    participant Server as HP DL360 Gen 9<br/>(iLO 4 v2.40+)
    participant DHCP as DHCP Server<br/>(UDM Pro)
    participant VPN as WireGuard VPN
    participant HTTP as HTTP Boot Server<br/>(GCP Cloud Run)

    Note over Server,HTTP: UEFI HTTP Boot (ACTUAL IMPLEMENTATION)
    Server->>DHCP: DHCP Discover
    DHCP->>Server: DHCP Offer (boot URL: http://boot.internal/boot.ipxe?mac=...)
    Note right of Server: Firmware initiates HTTP request directly<br/>(no TFTP/PXE chain loading)
    Server->>VPN: WireGuard tunnel established
    Server->>HTTP: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff
    HTTP->>Server: Send boot script with kernel/initrd URLs
    Server->>HTTP: HTTP GET /assets/talos-kernel.img
    HTTP->>Server: Stream kernel (via WireGuard)
    Server->>HTTP: HTTP GET /assets/talos-initrd.img
    HTTP->>Server: Stream initrd (via WireGuard)
    Server->>Server: Boot into OS

Key Differences

| Aspect | Traditional PXE | UEFI HTTP Boot |
|---|---|---|
| Initial Protocol | TFTP (UDP/69) | HTTP (TCP/80) or HTTPS (TCP/443) |
| Boot Loader | Requires TFTP transfer of iPXE binary | Firmware has HTTP client built-in |
| Chain Loading | PXE → TFTP → iPXE → HTTP | Direct HTTP boot (no chain) |
| Firewall Rules | UDP/69, TCP/80, TCP/443 | TCP/80, TCP/443 only |
| Cloud Run Support | ❌ (UDP not supported) | ✅ (HTTP-only) |
| Transfer Speed | ~1-5 Mbps (TFTP) | 10-100 Mbps (HTTP) |
| Complexity | High (multiple protocols) | Low (HTTP-only) |

Security Architecture

Challenge: HP DL360 Gen 9 UEFI HTTP boot does not support client-side TLS certificates (mTLS).

Solution: WireGuard VPN provides transport-layer security:

flowchart LR
    subgraph homelab[Home Lab]
        server[HP DL360 Gen 9<br/>UEFI HTTP Boot<br/>iLO 4 v2.40+]
        udm[UDM Pro<br/>WireGuard Client]
    end

    subgraph gcp[Google Cloud Platform]
        wg_gw[WireGuard Gateway<br/>Compute Engine]
        cr[Boot Server<br/>Cloud Run]
    end

    server -->|HTTP| udm
    udm -->|Encrypted WireGuard Tunnel| wg_gw
    wg_gw -->|HTTP| cr

    style server fill:#f9f,stroke:#333
    style udm fill:#bbf,stroke:#333
    style wg_gw fill:#bfb,stroke:#333
    style cr fill:#fbb,stroke:#333

Why WireGuard instead of Cloudflare mTLS?

  • Cloudflare mTLS Limitation: Requires client certificates at TLS layer
  • UEFI Firmware Limitation: Cannot present client certificates during TLS handshake
  • WireGuard Solution: Provides mutual authentication at the network layer via exchanged peer public keys (key setup sketched after this list)
  • Security Equivalent: WireGuard offers same security properties as mTLS:
    • Mutual authentication (both endpoints authenticated)
    • Confidentiality (all traffic encrypted)
    • Integrity (authenticated encryption via ChaCha20-Poly1305)
    • No Internet exposure (boot server only accessible via VPN)
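
A brief sketch of the key material involved; interface and peer addresses are placeholders:

# Generate a key pair on each endpoint (UDM Pro and GCP gateway)
wg genkey | tee privatekey | wg pubkey > publickey

# Optional additional symmetric pre-shared key per peer pair
wg genpsk > presharedkey

# After both peers are configured, confirm handshakes and connectivity
sudo wg show
ping -c 3 10.8.0.1   # WireGuard gateway address (illustrative)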

Firmware Configuration

HP iLO 4 UEFI HTTP Boot Setup:

  1. Access Configuration:

    • iLO web interface → Remote Console → Power On → Press F9 (RBSU)
    • Or: Direct RBSU access during POST (Press F9)
  2. Enable UEFI HTTP Boot:

    • Navigate: System Configuration → BIOS/Platform Configuration (RBSU) → Network Options
    • Set Network Boot to Enabled
    • Set Boot Mode to UEFI (not Legacy BIOS)
    • Enable UEFI HTTP Boot Support
  3. Configure NIC:

    • Navigate: RBSU → Network Options → [FlexibleLOM/PCIe NIC]
    • Set Option ROM to Enabled (required for UEFI boot option to appear)
    • Set Network Boot to Enabled
    • Configure IPv4/IPv6 settings (DHCP or static)
  4. Set Boot Order:

    • Navigate: RBSU → Boot Options → UEFI Boot Order
    • Move network device to top priority
  5. Configure Boot URL (via DHCP or static):

    • DHCP option 67: http://10.x.x.x/boot.ipxe?mac=${net0/mac}
    • Or: Static configuration in UEFI System Utilities

Required Firmware Versions:

  • iLO 4: v2.40 or later (for UEFI HTTP boot support)
  • System ROM: P89 v2.60 or later (recommended)

Verification:

# Check iLO firmware version via REST API
curl -k -u admin:password https://ilo-address/redfish/v1/Managers/1/ | jq '.FirmwareVersion'

# Expected output: "2.40" or higher

Architectural Implications

TFTP Elimination Impact:

  1. Deployment: Cloud Run becomes viable (no UDP/TFTP requirement)
  2. Cost: Reduced infrastructure costs (~$5-8/month vs $8-17/month)
  3. Complexity: Simplified networking (TCP-only firewall rules)
  4. Development: Reduced effort (no TFTP library, testing, edge cases)
  5. Scalability: Cloud Run autoscaling vs fixed VM capacity
  6. Maintenance: Serverless reduces operational overhead

Decision Impact:

The removal of TFTP complexity fundamentally shifts the cost/benefit analysis:

  • Custom Implementation: More attractive (Cloud Run, reduced development time)
  • Matchbox: Still valid but cost/complexity advantage reduced
  • TCO Gap: Narrowed from ~$8,000-12,000 to ~$4,000-8,000 (Year 1)
  • Development Gap: Reduced from 2-3 weeks to 1-2 weeks

Detailed Comparison

Feature Comparison

| Feature | Custom Implementation | Matchbox |
|---|---|---|
| UEFI HTTP Boot | ✅ Native (standard HTTP) | ✅ Built-in |
| HTTP/HTTPS Boot | ✅ Via z5labs/humus | ✅ Built-in |
| Cloud Run Deployment | ✅ Preferred option | ✅ Enabled by HTTP-only |
| Boot Scripting | ✅ Custom templates | ✅ Go templates |
| Machine-to-Image Mapping | ✅ Firestore/JSON | ✅ YAML groups with selectors |
| Boot Profile Management | ✅ Custom API | ✅ gRPC API + YAML |
| Cloud-Init Support | ⚠️ Requires implementation | ✅ Native support |
| Ignition Support | ❌ Not planned | ✅ Native support (Flatcar, CoreOS) |
| Asset Versioning | ⚠️ Requires implementation | ⚠️ Manual (via Cloud Storage versioning) |
| Rollback Capability | ⚠️ Requires implementation | ✅ Update group to previous profile |
| OpenTelemetry Observability | ✅ Built-in | ⚠️ Logs only (requires parsing) |
| GCP Cloud Storage Integration | ✅ Native SDK | ⚠️ Requires sync scripts |
| HTTP REST Admin API | ✅ Native (z5labs/humus) | ⚠️ gRPC only |
| Multi-Environment Support | ⚠️ Requires implementation | ✅ Groups + metadata |

Development Effort Comparison

| Task | Custom Implementation | Matchbox |
|---|---|---|
| Initial Setup | 1-2 days (project scaffolding) | 4-8 hours (deployment + config) |
| UEFI HTTP Boot | 1-2 days (standard HTTP endpoints) | ✅ Included |
| HTTP Boot API | 2-3 days (z5labs/humus endpoints) | ✅ Included |
| Machine Matching Logic | 2-3 days (database queries, selectors) | ✅ Included |
| Boot Script Templates | 2-3 days (boot script templating) | ✅ Included |
| Cloud-Init Support | 3-5 days (parsing, injection) | ✅ Included |
| Asset Management | 2-3 days (upload, storage) | ✅ Included |
| HTTP REST Admin API | 2-3 days (OpenAPI endpoints) | ✅ Included (gRPC) |
| Cloud Run Deployment | 1 day (Cloud Run config) | 1 day (Cloud Run config) |
| Testing | 3-5 days (unit, integration, E2E - simplified) | 2-3 days (integration only) |
| Documentation | 2-3 days | 1 day (reference existing docs) |
| Total Effort | 2-3 weeks | 1 week |

Operational Complexity

| Aspect | Custom Implementation | Matchbox |
|---|---|---|
| Deployment | Docker container on Compute Engine | Docker container on Compute Engine |
| Configuration Updates | API calls or Terraform updates | YAML file updates + API/filesystem sync |
| Monitoring | OpenTelemetry metrics to Cloud Monitoring | Log parsing + custom metrics |
| Troubleshooting | Full access to code, custom logging | Matchbox logs + gRPC API inspection |
| Security Patches | Manual code updates | Upstream container image updates |
| Dependency Updates | Manual Go module updates | Upstream Matchbox updates |
| Backup/Restore | Cloud Storage + Firestore backups | Sync /var/lib/matchbox to Cloud Storage |

Cost Comparison Summary

Comparing Cloud Run Deployments (Preferred for both options):

| Item | Custom (Cloud Run) | Matchbox (Cloud Run) | Difference |
|---|---|---|---|
| Compute | Cloud Run ($3.50/month) | Cloud Run ($7/month) | +$3.50/month |
| Storage | Cloud Storage ($1/month) | Cloud Storage ($1/month) | $0 |
| Development | 2-3 weeks @ $100/hour = $8,000-12,000 | 1 week @ $100/hour = $4,000 | -$4,000-8,000 |
| Annual Infrastructure | ~$54 | ~$96 | +$42/year |
| TCO (Year 1) | ~$8,054-12,054 | ~$4,096 | -$3,958-7,958 |
| TCO (Year 3) | ~$8,162-12,162 | ~$4,288 | -$3,874-7,874 |

Key Insights:

  • UEFI HTTP boot enables Cloud Run deployment for both options, dramatically reducing infrastructure costs
  • Custom implementation TCO gap narrowed from $7,895-11,895 to $3,958-7,958 (Year 1)
  • Both options now cost ~$5-8/month for infrastructure (vs $8-17/month with TFTP)
  • Development time difference reduced from 2-3 weeks to 1-2 weeks
  • Decision is much closer than originally assessed

Risk Analysis

| Risk | Custom Implementation | Matchbox | Mitigation |
|---|---|---|---|
| Security Vulnerabilities | Medium (standard HTTP code, well-understood) | Medium (upstream dependency) | Both: Monitor for security updates, automated deployments |
| Boot Failures | Medium (HTTP-only reduces complexity) | Low (battle-tested) | Custom: Comprehensive E2E testing with real hardware |
| Cloud Run Cold Starts | Medium (needs validation) | Medium (needs validation) | Both: Min instances = 1 (always-on) |
| Maintenance Burden | Medium (ongoing code maintenance) | Low (upstream handles updates) | Both: Automated deployment pipelines |
| GCP Integration Issues | Low (native SDK) | Medium (sync scripts) | Matchbox: Robust sync with error handling |
| Scalability Limits | Low (Cloud Run autoscaling) | Low (handles thousands of nodes) | Both: Monitor boot request latency |
| Dependency Abandonment | N/A (no external deps) | Low (Red Hat backing) | Matchbox: Can fork if necessary |

Implementation Plan

Phase 1: Core Boot Server (Week 1)

  1. Project Setup (1-2 days)

    • Create Go project with z5labs/humus framework
    • Set up OpenAPI specification for HTTP REST admin API
    • Configure Cloud Storage and Firestore clients
    • Implement basic health check endpoints
  2. UEFI HTTP Boot Endpoints (2-3 days)

    • HTTP endpoint serving boot scripts (iPXE format)
    • Kernel and initrd streaming from Cloud Storage
    • MAC-based machine matching using Firestore
    • Boot script templating with machine-specific parameters
  3. Testing & Deployment (2-3 days)

    • Deploy to Cloud Run with min instances = 1
    • Configure WireGuard VPN connectivity
    • Test UEFI HTTP boot from HP DL360 Gen 9 (iLO 4 v2.40+)
    • Validate boot latency and Cloud Run cold start metrics (see the curl sketch after this list)
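
A quick latency smoke test for this phase, run from a host inside the WireGuard tunnel; the hostname and asset path are illustrative:

# Time the boot script request (includes any Cloud Run cold start)
curl -o /dev/null -s -w 'boot script: %{time_total}s\n' \
  'http://boot.internal/boot?mac=aa:bb:cc:dd:ee:ff'

# Time a kernel download end to end
curl -o /dev/null -s -w 'kernel: %{time_total}s (%{size_download} bytes)\n' \
  'http://boot.internal/kernels/ubuntu-22.04.img'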

Phase 2: Admin API & Management (Week 2)

  1. HTTP REST Admin API (2-3 days)

    • Boot image upload endpoints (kernel, initrd, metadata)
    • Machine-to-image mapping management
    • Boot profile CRUD operations
    • Asset versioning and integrity validation
  2. Cloud-Init Integration (2-3 days)

    • Cloud-init configuration templating
    • Metadata injection for machine-specific settings
    • Integration with boot workflow
  3. Observability & Documentation (2-3 days)

    • OpenTelemetry metrics integration
    • Cloud Monitoring dashboards
    • API documentation
    • Operational runbooks

Success Criteria

  • ✅ Successfully boot HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
  • ✅ Boot latency < 100ms for HTTP requests (kernel/initrd downloads)
  • ✅ Cloud Run cold start latency < 100ms (with min instances = 1)
  • ✅ Machine-to-image mapping works correctly based on MAC address
  • ✅ Cloud Storage integration functional (upload, retrieve boot assets)
  • ✅ HTTP REST API fully functional for boot configuration management
  • ✅ Firestore stores machine mappings and boot profiles correctly
  • ✅ OpenTelemetry metrics available in Cloud Monitoring
  • ✅ Configuration update workflow clear and documented
  • ✅ Firmware compatibility confirmed (no TFTP fallback needed)

More Information

Future Considerations

  1. High Availability: If boot server uptime becomes critical, evaluate multi-region deployment or failover strategies
  2. Multi-Cloud: If multi-cloud strategy emerges, custom implementation provides better portability
  3. Enterprise Features: If advanced provisioning workflows required (bare metal Kubernetes, Ignition support, etc.), evaluate adding features to custom implementation
  4. Asset Versioning: Implement comprehensive boot image versioning and rollback capabilities beyond basic Cloud Storage versioning
  5. Multi-Environment Support: Add support for multiple environments (dev, staging, prod) with environment-specific boot profiles
  • Issue #601 - story(docs): create adr for network boot infrastructure on google cloud
  • Issue #595 - story(docs): create adr for network boot architecture
  • Issue #597 - story(docs): create adr for cloud provider selection