Research and Development
- 1: Technology Analysis
- 1.1: Server Operating System Analysis
- 1.1.1: Ubuntu Analysis
- 1.1.2: Fedora Analysis
- 1.1.3: Talos Linux Analysis
- 1.1.4: Harvester Analysis
- 1.2: Amazon Web Services Analysis
- 1.3: Google Cloud Platform Analysis
- 1.3.1: Cloud Storage FUSE (gcsfuse)
- 1.3.2: GCP Network Boot Protocol Support
- 1.3.3: GCP WireGuard VPN Support
- 1.4: HP ProLiant DL360 Gen9 Analysis
- 1.4.1: Configuration Guide
- 1.4.2: Hardware Specifications
- 1.4.3: Network Boot Capabilities
- 1.5: Matchbox Analysis
- 1.5.1: Configuration Model
- 1.5.2: Deployment Patterns
- 1.5.3: Network Boot Support
- 1.5.4: Use Case Evaluation
- 1.6: Ubiquiti Dream Machine Pro Analysis
- 2: Architecture Decision Records
1 - Technology Analysis
Technology Analysis
This section contains detailed research and analysis of various technologies evaluated for potential use in the home lab infrastructure.
Network Boot & Provisioning
- Matchbox - Network boot service for bare-metal provisioning
- Comprehensive analysis of PXE/iPXE/GRUB support
- Configuration model (profiles, groups, templating)
- Deployment patterns and operational considerations
- Use case evaluation and comparison with alternatives
Cloud Providers
- Google Cloud Platform - GCP capabilities for network boot infrastructure
- Network boot protocol support (TFTP, HTTP, HTTPS)
- WireGuard VPN deployment and integration
- Cost analysis and performance considerations
- Amazon Web Services - AWS capabilities for network boot infrastructure
- Network boot protocol support (TFTP, HTTP, HTTPS)
- WireGuard VPN deployment and integration
- Cost analysis and performance considerations
Operating Systems
- Server Operating Systems - OS evaluation for Kubernetes homelab infrastructure
- Ubuntu Server analysis (kubeadm, k3s, MicroK8s)
- Fedora Server analysis (kubeadm with CRI-O)
- Talos Linux analysis (purpose-built Kubernetes OS)
- Harvester HCI analysis (hyperconverged platform)
- Comparison of setup complexity, maintenance, security, and resource overhead
Hardware
- HP DL360 Gen9 - Enterprise server hardware analysis
- UniFi Dream Machine Pro - Network gateway and controller
Future Analysis Topics
Planned technology evaluations:
- Storage Solutions: Ceph, GlusterFS, ZFS over iSCSI
- Container Orchestration: Kubernetes distributions (k3s, Talos, etc.)
- Observability: Prometheus, Grafana, Loki, Tempo stack
- Service Mesh: Istio, Linkerd, Cilium comparison
- CI/CD: GitLab Runner, Tekton, Argo Workflows
- Secret Management: Vault, External Secrets Operator
- Load Balancing: MetalLB, kube-vip, Cilium LB-IPAM
1.1 - Server Operating System Analysis
This section provides detailed analysis of operating systems evaluated for the homelab server infrastructure, with a focus on Kubernetes cluster setup and maintenance.
Overview
The selection of a server operating system is critical for homelab infrastructure. The primary evaluation criterion is ease of Kubernetes cluster initialization and ongoing maintenance burden.
Evaluated Options
Ubuntu - Traditional general-purpose Linux distribution
- Kubernetes via kubeadm, k3s, or MicroK8s
- Strong community support and extensive documentation
- Familiar package management and system administration
Fedora - Cutting-edge Linux distribution
- Latest kernel and system components
- Kubernetes via kubeadm or k3s
- Shorter support lifecycle with more frequent upgrades
Talos Linux - Purpose-built Kubernetes OS
- API-driven, immutable infrastructure
- Built-in Kubernetes with minimal attack surface
- Designed specifically for container workloads
Harvester - Hyperconverged infrastructure platform
- Built on Rancher and K3s
- Combines compute, storage, and networking
- VM and container workloads on unified platform
Evaluation Criteria
Each option is evaluated based on:
- Kubernetes Installation Methods - Available tooling and installation approaches
- Cluster Initialization Process - Steps required to bootstrap a cluster
- Maintenance Requirements - OS updates, Kubernetes upgrades, security patches
- Resource Overhead - Memory, CPU, and storage footprint
- Learning Curve - Ease of adoption and operational complexity
- Community Support - Documentation quality and ecosystem maturity
- Security Posture - Attack surface and security-first design
Related ADRs
- ADR-0004: Server Operating System Selection - Final decision based on this analysis
1.1.1 - Ubuntu Analysis
Overview
Ubuntu Server is a popular general-purpose Linux distribution developed by Canonical. It provides Long Term Support (LTS) releases with 5 years of standard support and optional Extended Security Maintenance (ESM).
Key Facts:
- Latest LTS: Ubuntu 24.04 LTS (Noble Numbat)
- Support Period: 5 years standard, 10 years with Ubuntu Pro (free for personal use)
- Kernel: Linux 6.8+ (LTS), regular HWE updates
- Package Manager: APT/DPKG, Snap
- Init System: systemd
Kubernetes Installation Methods
Ubuntu supports multiple Kubernetes installation approaches:
1. kubeadm (Official Kubernetes Tool)
Installation:
# Install container runtime (containerd)
sudo apt-get update
sudo apt-get install -y containerd
# Configure containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
# Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
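Before initializing the cluster, kubeadm also expects swap to be disabled and the bridge-networking kernel settings to be in place (these steps appear in the sequence diagram below). A typical prerequisite sketch:
# Disable swap (kubelet refuses to run with swap enabled by default)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Load kernel modules and sysctls required for pod networking
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system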
Cluster Initialization:
# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# Configure kubectl for admin
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install CNI (e.g., Calico, Flannel)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
# Join worker nodes
kubeadm token create --print-join-command
Pros:
- Official Kubernetes tooling, well-documented
- Full control over cluster configuration
- Supports latest Kubernetes versions
- Large community and extensive resources
Cons:
- More manual steps than turnkey solutions
- Requires understanding of Kubernetes architecture
- Manual upgrade process for each component
- More complex troubleshooting
2. k3s (Lightweight Kubernetes)
Installation:
# Single-command install on control plane
curl -sfL https://get.k3s.io | sh -
# Get node token for workers
sudo cat /var/lib/rancher/k3s/server/node-token
# Install on worker nodes
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
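k3s writes an admin kubeconfig to /etc/rancher/k3s/k3s.yaml and bundles its own kubectl; a quick post-install sanity check:
# Confirm the node registered and core pods are running
sudo k3s kubectl get nodes
sudo k3s kubectl get pods -A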
Pros:
- Extremely simple installation (single command)
- Lightweight (< 512MB RAM)
- Built-in container runtime (containerd)
- Automatic updates via Rancher System Upgrade Controller
- Great for edge and homelab use cases
Cons:
- Less customizable than kubeadm
- Some features removed (e.g., in-tree storage, cloud providers)
- Slightly different from upstream Kubernetes
3. MicroK8s (Canonical’s Distribution)
Installation:
# Install via snap
sudo snap install microk8s --classic
# Join cluster
sudo microk8s add-node
# Run output command on worker nodes
# Enable addons
microk8s enable dns storage ingress
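MicroK8s also bundles its own kubectl; a quick check that the snap-installed cluster and its addons are healthy:
# Wait for the cluster to report ready, then inspect it
microk8s status --wait-ready
microk8s kubectl get nodes
microk8s kubectl get pods -A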
Pros:
- Zero-ops, single package install
- Snap-based automatic updates
- Addons for common services (DNS, storage, ingress)
- Canonical support available
Cons:
- Requires snap (not universally liked)
- Less ecosystem compatibility than vanilla Kubernetes
- Ubuntu-specific (less portable)
Cluster Initialization Sequence
kubeadm Approach
sequenceDiagram
participant Admin
participant Server as Ubuntu Server
participant K8s as Kubernetes Components
Admin->>Server: Install Ubuntu 24.04 LTS
Server->>Server: Configure network (static IP)
Admin->>Server: Update system (apt update && upgrade)
Admin->>Server: Install containerd
Server->>Server: Configure containerd (CRI)
Admin->>Server: Install kubeadm/kubelet/kubectl
Server->>Server: Disable swap, configure kernel modules
Admin->>K8s: kubeadm init --pod-network-cidr=10.244.0.0/16
K8s->>Server: Generate certificates
K8s->>Server: Start etcd
K8s->>Server: Start API server
K8s->>Server: Start controller-manager
K8s->>Server: Start scheduler
K8s-->>Admin: Control plane ready
Admin->>K8s: kubectl apply -f calico.yaml
K8s->>Server: Deploy CNI pods
Admin->>K8s: kubeadm join (on workers)
K8s->>Server: Add worker nodes
K8s-->>Admin: Cluster ready
k3s Approach
sequenceDiagram
participant Admin
participant Server as Ubuntu Server
participant K3s as k3s Components
Admin->>Server: Install Ubuntu 24.04 LTS
Server->>Server: Configure network (static IP)
Admin->>Server: Update system
Admin->>Server: curl -sfL https://get.k3s.io | sh -
Server->>K3s: Download k3s binary
K3s->>Server: Configure containerd
K3s->>Server: Start k3s service
K3s->>Server: Initialize etcd (embedded)
K3s->>Server: Start API server
K3s->>Server: Start controller-manager
K3s->>Server: Start scheduler
K3s->>Server: Deploy built-in CNI (Flannel)
K3s-->>Admin: Control plane ready
Admin->>Server: Retrieve node token
Admin->>Server: Install k3s agent on workers
K3s->>Server: Join workers to cluster
K3s-->>Admin: Cluster ready (5-10 minutes total)
Maintenance Requirements
OS Updates
Security Patches:
# Automatic security updates (recommended)
sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# Manual updates
sudo apt-get update
sudo apt-get upgrade
Frequency:
- Security patches: Weekly to monthly
- Kernel updates: Monthly (may require reboot)
- Major version upgrades: Every 2 years (LTS to LTS)
Kubernetes Upgrades
kubeadm Upgrade:
# Upgrade control plane
sudo apt-get update
sudo apt-get install -y kubeadm=1.32.0-*
sudo kubeadm upgrade apply v1.32.0
sudo apt-get install -y kubelet=1.32.0-* kubectl=1.32.0-*
sudo systemctl restart kubelet
# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo apt-get install -y kubeadm=1.32.0-* kubelet=1.32.0-* kubectl=1.32.0-*
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>
k3s Upgrade:
# Manual upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -
# Automatic upgrade via system-upgrade-controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
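The controller acts on Plan custom resources; a minimal sketch for upgrading server nodes, assuming the controller's default system-upgrade namespace and service account:
cat <<'EOF' | kubectl apply -f -
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  channel: https://update.k3s.io/v1-release/channels/stable
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: Exists}
  upgrade:
    image: rancher/k3s-upgrade
EOF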
Upgrade Frequency: Every 3-6 months (Kubernetes minor versions)
Resource Overhead
Minimal Installation (Ubuntu Server + k3s):
- RAM: ~512MB (OS) + 512MB (k3s) = 1GB total
- CPU: 1 core minimum, 2 cores recommended
- Disk: 10GB (OS) + 10GB (container images) = 20GB
- Network: 1 Gbps recommended
Full Installation (Ubuntu Server + kubeadm):
- RAM: ~512MB (OS) + 1-2GB (Kubernetes components) = 2GB+ total
- CPU: 2 cores minimum
- Disk: 15GB (OS) + 20GB (container images/etcd) = 35GB
- Network: 1 Gbps recommended
Security Posture
Strengths:
- Regular security updates via Ubuntu Security Team
- AppArmor enabled by default
- SELinux support available
- Kernel hardening features (ASLR, stack protection)
- Ubuntu Pro ESM for extended CVE coverage (free for personal use)
Attack Surface:
- Full general-purpose OS (larger attack surface than minimal OS)
- Many installed packages by default (can be minimized)
- Requires manual hardening for production use
Hardening Steps:
# Disable unnecessary services
sudo systemctl disable snapd.service
sudo systemctl disable bluetooth.service
# Configure firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 6443/tcp # Kubernetes API
sudo ufw allow 10250/tcp # Kubelet
sudo ufw enable
# CIS Kubernetes Benchmark compliance
# Use tools like kube-bench for validation
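One way to run that validation is kube-bench as a one-off Kubernetes job; a sketch, assuming the node can pull the upstream job manifest and image:
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench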
Learning Curve
Ease of Adoption: ⭐⭐⭐⭐⭐ (Excellent)
- Most familiar Linux distribution for many users
- Extensive documentation and tutorials
- Large community support (forums, Stack Overflow)
- Straightforward package management
- Similar to Debian-based systems
Required Knowledge:
- Basic Linux system administration (apt, systemd, networking)
- Kubernetes concepts (pods, services, deployments)
- Container runtime basics (containerd, Docker)
- Text editor (vim, nano) for configuration
Community Support
Ecosystem Maturity: ⭐⭐⭐⭐⭐ (Excellent)
- Documentation: Comprehensive official docs, community guides
- Community: Massive user base, active forums
- Commercial Support: Available from Canonical (Ubuntu Pro)
- Third-Party Tools: Excellent compatibility with all Kubernetes tools
- Tutorials: Abundant resources for Kubernetes on Ubuntu
Pros and Cons Summary
Pros
- Good, because most familiar and well-documented Linux distribution
- Good, because 5-year LTS support (10 years with Ubuntu Pro)
- Good, because multiple Kubernetes installation options (kubeadm, k3s, MicroK8s)
- Good, because k3s provides extremely simple setup (single command)
- Good, because extensive package ecosystem (60,000+ packages)
- Good, because strong community support and resources
- Good, because automatic security updates available
- Good, because low learning curve for most administrators
- Good, because compatible with all Kubernetes tooling and addons
- Good, because Ubuntu Pro free for personal use (extended security)
Cons
- Bad, because general-purpose OS has larger attack surface than minimal OS
- Bad, because more resource overhead than purpose-built Kubernetes OS (1-2GB RAM)
- Bad, because requires manual OS updates and reboots
- Bad, because kubeadm setup is complex with many manual steps
- Bad, because snap packages controversial (for MicroK8s)
- Bad, because Kubernetes upgrades require manual intervention (unless using k3s auto-upgrade)
- Bad, because managing OS + Kubernetes lifecycle separately increases complexity
- Neutral, because many preinstalled packages (can be removed, but require effort)
Recommendations
Best for:
- Users familiar with Ubuntu/Debian ecosystem
- Homelabs requiring general-purpose server functionality (not just Kubernetes)
- Teams wanting multiple Kubernetes installation options
- Users prioritizing community support and documentation
Best Installation Method:
- Homelab/Learning: k3s (simplest, auto-updates, lightweight)
- Production-like: kubeadm (full control, upstream Kubernetes)
- Ubuntu-specific: MicroK8s (Canonical support, snap-based)
Avoid if:
- Seeking minimal attack surface (consider Talos Linux)
- Want infrastructure-as-code for OS layer (consider Talos Linux)
- Prefer hyperconverged platform (consider Harvester)
1.1.2 - Fedora Analysis
Overview
Fedora Server is a cutting-edge Linux distribution sponsored by Red Hat, serving as the upstream for Red Hat Enterprise Linux (RHEL). It emphasizes innovation with the latest software packages and kernel versions.
Key Facts:
- Latest Version: Fedora 41 (October 2024)
- Support Period: ~13 months per release (shorter than Ubuntu LTS)
- Kernel: Linux 6.11+ (latest stable)
- Package Manager: DNF/RPM, Flatpak
- Init System: systemd
Kubernetes Installation Methods
Fedora supports standard Kubernetes installation approaches:
1. kubeadm (Official Kubernetes Tool)
Installation:
# Install container runtime (CRI-O preferred on Fedora)
sudo dnf install -y cri-o
sudo systemctl enable --now crio
# Add Kubernetes repository
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
EOF
# Install kubeadm, kubelet, kubectl
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet
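As on Ubuntu, kubeadm expects swap to be off and the bridge-networking sysctls set; on Fedora the zram swap device enabled by default also has to go. A prerequisite sketch:
# Disable swap, including Fedora's default zram device
sudo swapoff -a
sudo dnf remove -y zram-generator-defaults
# Kernel module and sysctls for pod networking
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system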
Cluster Initialization:
# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock
# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
# Join workers
kubeadm token create --print-join-command
Pros:
- CRI-O is native to Fedora ecosystem (same as RHEL/OpenShift)
- Latest Kubernetes versions available quickly
- Familiar to RHEL/CentOS users
- Fully upstream Kubernetes
Cons:
- Manual setup process (same as Ubuntu/kubeadm)
- Requires Kubernetes knowledge
- More complex than turnkey solutions
2. k3s (Lightweight Kubernetes)
Installation:
# Same single-command install
curl -sfL https://get.k3s.io | sh -
# Retrieve token
sudo cat /var/lib/rancher/k3s/server/node-token
# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
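Fedora ships with firewalld enabled, which interferes with k3s networking (see the "Disable firewalld" step in the sequence diagram below). Either disable it, as the k3s docs recommend, or open the required port and trust the default pod/service CIDRs; a sketch:
# Option A: disable firewalld (k3s upstream recommendation)
sudo systemctl disable --now firewalld
# Option B: keep firewalld and allow k3s traffic (default CIDRs)
sudo firewall-cmd --permanent --add-port=6443/tcp
sudo firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
sudo firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16
sudo firewall-cmd --reload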
Pros:
- Simple installation (identical to Ubuntu)
- Lightweight and fast
- Well-tested on Fedora/RHEL family
Cons:
- Less customizable
- Not using native CRI-O by default (uses embedded containerd)
3. OKD (OpenShift Kubernetes Distribution)
Installation (Single-Node):
# Download and install OKD
wget https://github.com/okd-project/okd/releases/download/4.15.0-0.okd-2024-01-27-070424/openshift-install-linux-4.15.0-0.okd-2024-01-27-070424.tar.gz
tar -xvf openshift-install-linux-*.tar.gz
sudo mv openshift-install /usr/local/bin/
# Create install config
./openshift-install create install-config --dir=cluster
# Install cluster
./openshift-install create cluster --dir=cluster
Pros:
- Enterprise features (operators, web console, image registry)
- Built-in CI/CD and developer tools
- Based on Fedora CoreOS (immutable, auto-updating)
Cons:
- Very heavy resource requirements (16GB+ RAM)
- Complex installation and management
- Overkill for simple homelab use
Cluster Initialization Sequence
kubeadm with CRI-O
sequenceDiagram
participant Admin
participant Server as Fedora Server
participant K8s as Kubernetes Components
Admin->>Server: Install Fedora 41
Server->>Server: Configure network (static IP)
Admin->>Server: Update system (dnf update)
Admin->>Server: Install CRI-O
Server->>Server: Configure CRI-O runtime
Server->>Server: Enable crio.service
Admin->>Server: Install kubeadm/kubelet/kubectl
Server->>Server: Disable swap, load kernel modules
Server->>Server: Configure SELinux (permissive for Kubernetes)
Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
K8s->>Server: Generate certificates
K8s->>Server: Start etcd
K8s->>Server: Start API server
K8s->>Server: Start controller-manager
K8s->>Server: Start scheduler
K8s-->>Admin: Control plane ready
Admin->>K8s: kubectl apply CNI
K8s->>Server: Deploy CNI pods
Admin->>K8s: kubeadm join (workers)
K8s->>Server: Add worker nodes
K8s-->>Admin: Cluster ready
k3s Approach
sequenceDiagram
participant Admin
participant Server as Fedora Server
participant K3s as k3s Components
Admin->>Server: Install Fedora 41
Server->>Server: Configure network
Admin->>Server: Update system (dnf update)
Admin->>Server: Disable firewalld (or configure)
Admin->>Server: curl -sfL https://get.k3s.io | sh -
Server->>K3s: Download k3s binary
K3s->>Server: Configure containerd
K3s->>Server: Start k3s service
K3s->>Server: Initialize embedded etcd
K3s->>Server: Start API server
K3s->>Server: Deploy built-in CNI
K3s-->>Admin: Control plane ready
Admin->>Server: Retrieve node token
Admin->>Server: Install k3s agent on workers
K3s->>Server: Join workers
K3s-->>Admin: Cluster ready (5-10 minutes)
Maintenance Requirements
OS Updates
Security and System Updates:
# Automatic updates (dnf-automatic)
sudo dnf install -y dnf-automatic
sudo systemctl enable --now dnf-automatic.timer
# Manual updates
sudo dnf update -y
sudo reboot # if kernel updated
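Note that dnf-automatic only downloads updates with its default configuration; to have it install them as well, enable apply_updates (a sketch):
# Apply (not just download) updates automatically
sudo sed -i 's/^apply_updates = no/apply_updates = yes/' /etc/dnf/automatic.conf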
Frequency:
- Security patches: Weekly to monthly
- Kernel updates: Monthly (frequent updates)
- Major version upgrades: Every ~13 months (Fedora releases)
Version Upgrade:
# Upgrade to next Fedora release
sudo dnf upgrade --refresh
sudo dnf install dnf-plugin-system-upgrade
sudo dnf system-upgrade download --releasever=42
sudo dnf system-upgrade reboot
Kubernetes Upgrades
kubeadm Upgrade:
# Upgrade control plane
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
sudo systemctl restart kubelet
# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo dnf update -y kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>
k3s Upgrade: Same as Ubuntu (curl script or system-upgrade-controller)
Upgrade Frequency: Kubernetes every 3-6 months, Fedora OS every ~13 months
Resource Overhead
Minimal Installation (Fedora Server + k3s):
- RAM: ~600MB (OS) + 512MB (k3s) = 1.2GB total
- CPU: 1 core minimum, 2 cores recommended
- Disk: 12GB (OS) + 10GB (containers) = 22GB
- Network: 1 Gbps recommended
Full Installation (Fedora Server + kubeadm + CRI-O):
- RAM: ~700MB (OS) + 1.5GB (Kubernetes) = 2.2GB total
- CPU: 2 cores minimum
- Disk: 15GB (OS) + 20GB (containers) = 35GB
- Network: 1 Gbps recommended
Note: Slightly higher overhead than Ubuntu due to SELinux and newer components.
Security Posture
Strengths:
- SELinux enabled by default (stronger than AppArmor)
- Latest security patches and kernel (bleeding edge)
- CRI-O container runtime (security-focused, used by OpenShift)
- Shorter support window = less legacy CVEs
- Active security team and rapid response
Attack Surface:
- General-purpose OS (larger surface than minimal OS)
- More installed packages than minimal server
- SELinux can be complex to configure for Kubernetes
Hardening Steps:
# Configure firewall (firewalld default on Fedora)
sudo firewall-cmd --permanent --add-port=6443/tcp # API server
sudo firewall-cmd --permanent --add-port=10250/tcp # Kubelet
sudo firewall-cmd --reload
# SELinux configuration for Kubernetes
sudo setenforce 0 # Permissive (Kubernetes not fully SELinux-ready)
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
# Disable unnecessary services
sudo systemctl disable bluetooth.service
Learning Curve
Ease of Adoption: ⭐⭐⭐⭐ (Good)
- Familiar for RHEL/CentOS/Alma/Rocky users
- DNF package manager (similar to APT)
- Excellent documentation
- SELinux learning curve can be steep
Required Knowledge:
- RPM-based system administration (dnf, systemd)
- SELinux basics (or willingness to use permissive mode)
- Kubernetes concepts
- Firewalld configuration
Differences from Ubuntu:
- DNF vs APT package manager
- SELinux vs AppArmor
- Firewalld vs UFW
- Faster release cycle (more frequent upgrades)
Community Support
Ecosystem Maturity: ⭐⭐⭐⭐ (Good)
- Documentation: Excellent official docs, Red Hat resources
- Community: Large user base, active forums
- Commercial Support: RHEL support available (paid)
- Third-Party Tools: Good compatibility with Kubernetes tools
- Tutorials: Abundant resources, especially for RHEL ecosystem
Pros and Cons Summary
Pros
- Good, because latest kernel and software packages (bleeding edge)
- Good, because SELinux enabled by default (stronger MAC than AppArmor)
- Good, because native CRI-O support (same as RHEL/OpenShift)
- Good, because upstream for RHEL (enterprise compatibility)
- Good, because multiple Kubernetes installation options
- Good, because k3s simplifies setup dramatically
- Good, because strong security focus and rapid CVE response
- Good, because familiar to RHEL/CentOS ecosystem
- Good, because automatic updates available (dnf-automatic)
- Neutral, because shorter support cycle (13 months) ensures latest features
Cons
- Bad, because short support cycle requires frequent OS upgrades (every ~13 months)
- Bad, because bleeding-edge packages can introduce instability
- Bad, because SELinux configuration for Kubernetes is complex (often set to permissive)
- Bad, because smaller community than Ubuntu (though still large)
- Bad, because general-purpose OS has larger attack surface than minimal OS
- Bad, because more resource overhead than purpose-built Kubernetes OS
- Bad, because OS upgrade every 13 months adds maintenance burden
- Bad, because less beginner-friendly than Ubuntu
- Bad, because managing OS + Kubernetes lifecycle separately
- Neutral, because rapid release cycle can be pro or con depending on preference
Recommendations
Best for:
- Users familiar with RHEL/CentOS/Rocky/Alma ecosystem
- Teams wanting latest kernel and software features
- Environments requiring SELinux (compliance, enterprise standards)
- Learning OpenShift/OKD ecosystem (Fedora CoreOS foundation)
- Users comfortable with frequent OS upgrades
Best Installation Method:
- Homelab/Learning: k3s (simplest, lightweight)
- Enterprise-like: kubeadm + CRI-O (OpenShift compatibility)
- Advanced: OKD (if resources available, 16GB+ RAM)
Avoid if:
- Prefer long-term stability (choose Ubuntu LTS)
- Want minimal maintenance (frequent Fedora upgrades required)
- Seeking minimal attack surface (consider Talos Linux)
- Uncomfortable with SELinux complexity
- Want infrastructure-as-code for OS (consider Talos Linux)
Comparison with Ubuntu
| Aspect | Fedora | Ubuntu LTS |
|---|---|---|
| Support Period | 13 months | 5 years (10 with Pro) |
| Kernel | Latest (6.11+) | LTS (6.8+) |
| Security | SELinux | AppArmor |
| Package Manager | DNF/RPM | APT/DEB |
| Release Cycle | 6 months | 2 years (LTS) |
| Upgrade Frequency | Every 13 months | Every 2-5 years |
| Community Size | Large | Very Large |
| Enterprise Upstream | RHEL | N/A |
| Stability | Bleeding edge | Stable/Conservative |
| Learning Curve | Moderate | Easy |
Verdict: Fedora is excellent for those wanting latest features and comfortable with frequent upgrades. Ubuntu LTS is better for long-term stability and minimal maintenance.
1.1.3 - Talos Linux Analysis
Overview
Talos Linux is a modern operating system designed specifically for running Kubernetes. It is API-driven, immutable, and minimal, with no SSH access, shell, or package manager. All configuration is done via a declarative API.
Key Facts:
- Latest Version: Talos 1.9 (supports Kubernetes 1.31)
- Support: Community-driven, commercial support available from Sidero Labs
- Kernel: Linux 6.6+ LTS
- Architecture: Immutable, API-driven, no shell access
- Management: talosctl CLI + Kubernetes API
Kubernetes Installation Methods
Talos Linux has built-in Kubernetes - there is only one installation method.
Built-in Kubernetes (Only Option)
Installation Process:
- Boot Talos ISO/PXE (maintenance mode)
- Apply machine configuration via talosctl
- Bootstrap Kubernetes via talosctl bootstrap
Machine Configuration (YAML):
# controlplane.yaml
version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.1.10/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
Cluster Initialization:
# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443
# Apply config to control plane node (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml
# Wait for install to complete, then bootstrap
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10
# Retrieve kubeconfig
talosctl kubeconfig --nodes 192.168.1.10 --endpoints 192.168.1.10
# Apply config to worker nodes
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
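Because there is no shell, post-install verification also goes through the Talos API; a quick check once the nodes have applied their configs:
# Wait for Talos and Kubernetes components to report healthy
talosctl health --nodes 192.168.1.10 --endpoints 192.168.1.10
# Then confirm from the Kubernetes side
kubectl get nodes -o wide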
Pros:
- Kubernetes built-in, no separate installation
- Declarative configuration (GitOps-friendly)
- Extremely minimal attack surface (no shell, no SSH)
- Immutable infrastructure (config changes require reboot)
- Automatic updates via Talos controller
- Designed from ground up for Kubernetes
Cons:
- Steep learning curve (completely different paradigm)
- No SSH/shell access (all via API)
- Troubleshooting requires different mindset
- Limited to Kubernetes workloads only (not general-purpose)
- Smaller community than traditional distros
Cluster Initialization Sequence
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Talos as Talos Linux
participant K8s as Kubernetes Components
Admin->>Server: Boot Talos ISO (PXE or USB)
Server->>Talos: Start in maintenance mode
Talos-->>Admin: API endpoint ready (no shell)
Admin->>Admin: Generate configs (talosctl gen config)
Admin->>Talos: talosctl apply-config (controlplane.yaml)
Talos->>Server: Partition disk
Talos->>Server: Install Talos to /dev/sda
Talos->>Server: Write machine config
Server->>Server: Reboot from disk
Talos->>Talos: Load machine config
Talos->>K8s: Start kubelet
Talos->>K8s: Start etcd
Talos->>K8s: Start API server
Admin->>Talos: talosctl bootstrap
Talos->>K8s: Initialize cluster
K8s->>Talos: Start controller-manager
K8s->>Talos: Start scheduler
K8s-->>Admin: Control plane ready
Admin->>K8s: Apply CNI (via talosctl or kubectl)
K8s->>Talos: Deploy CNI pods
Admin->>Talos: Apply worker configs
Talos->>K8s: Join workers to cluster
K8s-->>Admin: Cluster ready (10-15 minutes)
Maintenance Requirements
OS Updates
Declarative Upgrades:
# Upgrade Talos version (rolling upgrade)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0
# Kubernetes version upgrade (also declarative)
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0
Automatic Updates (via Talos System Extensions):
# machine config with auto-update extension
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/system-upgrade-controller
Frequency:
- Talos releases: Every 2-3 months
- Kubernetes upgrades: Follow upstream cadence (quarterly)
- Security patches: Built into Talos releases
- No traditional OS patching (immutable system)
Configuration Changes
All changes via machine config:
# Edit machine config YAML
vim controlplane.yaml
# Apply updated config (triggers reboot if needed)
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml
No manual package installs - everything declarative.
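For small changes, the live machine config can also be patched through the API instead of re-applying the whole file; a sketch using a JSON patch (the hostname value is illustrative):
talosctl --nodes 192.168.1.10 patch machineconfig \
  --patch '[{"op": "replace", "path": "/machine/network/hostname", "value": "control-plane-1"}]'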
Resource Overhead
Minimal Footprint (Talos Linux + Kubernetes):
- RAM: ~256MB (OS) + 512MB (Kubernetes) = 768MB total
- CPU: 1 core minimum, 2 cores recommended
- Disk: ~500MB (OS) + 10GB (container images/etcd) = 10-15GB total
- Network: 1 Gbps recommended
Comparison:
- Ubuntu + k3s: ~1GB RAM
- Talos: ~768MB RAM (lighter)
- Ubuntu + kubeadm: ~2GB RAM
- Talos: ~768MB RAM (much lighter)
Minimal install size: ~500MB (vs 10GB+ for Ubuntu/Fedora)
Security Posture
Strengths: ⭐⭐⭐⭐⭐ (Excellent)
- No SSH access - attack surface eliminated
- No shell - cannot install malware
- No package manager - no additional software installation
- Immutable filesystem - rootfs read-only
- Minimal components: Only Kubernetes and essential services
- API-only access - mTLS-authenticated talosctl
- KSPP compliance: Kernel Self-Protection Project standards
- Signed images: Cryptographically signed Talos images
- Secure Boot support: UEFI Secure Boot compatible
Attack Surface:
- Smallest possible: Only Kubernetes API, kubelet, and Talos API
- ~30 running processes (vs 100+ on Ubuntu/Fedora)
- ~200MB filesystem (vs 5-10GB on Ubuntu/Fedora)
No hardening needed - secure by default.
Security Features:
# Built-in security (example config)
machine:
  sysctls:
    kernel.kptr_restrict: "2"
    kernel.yama.ptrace_scope: "1"
  kernel:
    modules:
      - name: br_netfilter
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader
Learning Curve
Ease of Adoption: ⭐⭐ (Challenging)
- Paradigm shift: No shell/SSH, API-only management
- Requires understanding of declarative infrastructure
- Talosctl CLI has learning curve
- Excellent documentation helps
- Different troubleshooting approach (logs via API)
Required Knowledge:
- Kubernetes fundamentals (critical)
- YAML configuration syntax
- Networking basics (especially CNI)
- GitOps concepts helpful
- Comfort with “infrastructure as code”
Debugging without shell:
# View logs via API
talosctl logs --nodes 192.168.1.10 kubelet
# Get system metrics and logs (interactive dashboard)
talosctl dashboard --nodes 192.168.1.10
# Service status
talosctl service --nodes 192.168.1.10
Community Support
Ecosystem Maturity: ⭐⭐⭐ (Growing)
- Documentation: Excellent official docs
- Community: Smaller but very active (Slack, GitHub Discussions)
- Commercial Support: Available from Sidero Labs
- Third-Party Tools: Growing ecosystem (Cluster API, GitOps tools)
- Tutorials: Increasing number of community guides
Community Size: Smaller than Ubuntu/Fedora, but dedicated and helpful.
Pros and Cons Summary
Pros
- Good, because Kubernetes is built-in (no separate installation)
- Good, because minimal attack surface (no SSH, shell, or package manager)
- Good, because immutable infrastructure (config drift impossible)
- Good, because API-driven management (GitOps-friendly)
- Good, because extremely low resource overhead (~768MB RAM)
- Good, because automatic security patches via Talos upgrades
- Good, because declarative configuration (version-controlled)
- Good, because secure by default (no hardening required)
- Good, because smallest disk footprint (~500MB OS)
- Good, because designed specifically for Kubernetes (opinionated and optimized)
- Good, because UEFI Secure Boot support
- Good, because upgrades are simple and declarative (talosctl upgrade)
Cons
- Bad, because steep learning curve (no shell/SSH paradigm shift)
- Bad, because limited to Kubernetes workloads only (not general-purpose)
- Bad, because troubleshooting without shell requires different approach
- Bad, because smaller community than Ubuntu/Fedora
- Bad, because relatively new (less mature than traditional distros)
- Bad, because no escape hatch for manual intervention
- Bad, because requires comfort with declarative infrastructure
- Bad, because debugging is harder for beginners
- Neutral, because opinionated design (pro for K8s-only, con for general use)
Recommendations
Best for:
- Kubernetes-dedicated infrastructure (no general-purpose workloads)
- Security-focused environments (minimal attack surface)
- GitOps workflows (declarative configuration)
- Immutable infrastructure advocates
- Teams comfortable with API-driven management
- Production Kubernetes clusters (once team is trained)
Best Installation Method:
- Only option: Built-in Kubernetes via talosctl
Avoid if:
- Need general-purpose server functionality (SSH, cron jobs, etc.)
- Team unfamiliar with Kubernetes (too steep a learning curve)
- Require shell access for troubleshooting comfort
- Want traditional package management (apt, dnf)
- Prefer familiar Linux administration tools
Comparison with Ubuntu and Fedora
| Aspect | Talos Linux | Ubuntu + k3s | Fedora + kubeadm |
|---|---|---|---|
| K8s Installation | Built-in | Single command | Manual (kubeadm) |
| Attack Surface | Minimal (~30 processes) | Medium (~100) | Medium (~100) |
| Resource Overhead | 768MB RAM | 1GB RAM | 2.2GB RAM |
| Disk Footprint | 500MB | 10GB | 15GB |
| Security Model | Immutable, no shell | AppArmor, shell | SELinux, shell |
| Management | API-only (talosctl) | SSH + kubectl | SSH + kubectl |
| Learning Curve | Steep | Easy | Moderate |
| Community Size | Small (growing) | Very Large | Large |
| Support Period | Rolling releases | 5-10 years | 13 months |
| Use Case | Kubernetes only | General-purpose | General-purpose |
| Upgrades | Declarative, simple | Manual OS + K8s | Manual OS + K8s |
| Configuration | Declarative YAML | Imperative + YAML | Imperative + YAML |
| Troubleshooting | API logs/metrics | SSH + logs | SSH + logs |
| GitOps-Friendly | Excellent | Good | Good |
| Best for | K8s-dedicated infra | Homelabs, learning | RHEL ecosystem |
Verdict: Talos is the most secure and efficient option for Kubernetes-only infrastructure, but requires team buy-in to API-driven, immutable paradigm. Ubuntu/Fedora better for general-purpose servers or teams wanting shell access.
Advanced Features
Talos System Extensions
Extend Talos functionality with extensions:
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20240312
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4
Cluster API Integration
Talos works natively with Cluster API:
# Install Cluster API + Talos provider
clusterctl init --infrastructure talos
# Create cluster from template
clusterctl generate cluster homelab --infrastructure talos > cluster.yaml
kubectl apply -f cluster.yaml
Image Factory
Custom Talos images with extensions:
# Build custom image
curl -X POST https://factory.talos.dev/image \
-d '{"talos_version":"v1.9.0","extensions":["siderolabs/intel-ucode"]}'
Disaster Recovery
Talos supports etcd backup/restore:
# Backup etcd
talosctl etcd snapshot ./etcd-snapshot.db --nodes 192.168.1.10
# Restore from snapshot
talosctl bootstrap --recover-from ./etcd-snapshot.db
Production Readiness
Production Use: ✅ Yes (many companies run Talos in production)
High Availability:
- 3+ control plane nodes recommended
- External etcd supported
- Load balancer for API server
Monitoring:
- Prometheus metrics built-in
- Talos dashboard for health
- Standard Kubernetes observability tools
Example Production Clusters:
- Sidero Metal (bare metal provisioning)
- Various cloud providers (AWS, GCP, Azure)
- Edge deployments (minimal footprint)
1.1.4 - Harvester Analysis
Overview
Harvester is a Hyperconverged Infrastructure (HCI) platform built on Kubernetes, designed to provide VM and container management on a unified platform. It combines compute, storage, and networking with built-in K3s for orchestration.
Key Facts:
- Latest Version: Harvester 1.4 (based on K3s 1.30+)
- Foundation: Built on RancherOS 2.0, K3s, and KubeVirt
- Support: Supported by SUSE (acquired Rancher)
- Architecture: HCI platform with VM + container workloads
- Management: Web UI + kubectl + Rancher integration
Kubernetes Installation Methods
Harvester includes K3s as its foundation - Kubernetes is built-in.
Built-in K3s (Only Option)
Installation Process:
- Boot Harvester ISO (interactive installer or PXE)
- Complete installation wizard (web UI or console)
- Create cluster (automatic K3s deployment)
- Access via web UI or kubectl
Interactive Installation:
# Boot from Harvester ISO
1. Choose "Create a new Harvester cluster"
2. Configure:
- Cluster token
- Node role (management/worker/witness)
- Network interface (management network)
- VIP (Virtual IP for cluster access)
- Storage disk (Longhorn persistent storage)
3. Install completes (15-20 minutes)
4. Access web UI at https://<VIP>
Configuration (cloud-init for automated install):
# config.yaml
token: my-cluster-token
os:
  hostname: harvester-node-1
  modules:
    - kvm
  kernel_parameters:
    - intel_iommu=on
install:
  mode: create
  device: /dev/sda
  iso_url: https://releases.rancher.com/harvester/v1.4.0/harvester-v1.4.0-amd64.iso
  vip: 192.168.1.100
  vip_mode: static
  networks:
    harvester-mgmt:
      interfaces:
        - name: eth0
      default_route: true
      ip: 192.168.1.10
      subnet_mask: 255.255.255.0
      gateway: 192.168.1.1
Pros:
- Complete HCI solution (VMs + containers)
- Web UI for management (no CLI required)
- Built-in storage (Longhorn CSI)
- Built-in networking (multus, SR-IOV)
- VM live migration
- Rancher integration for multi-cluster management
- K3s built-in (no separate Kubernetes install)
Cons:
- Heavy resource requirements (8GB+ RAM per node)
- Complex architecture (steep learning curve)
- Larger attack surface than minimal OS
- Overkill for container-only workloads
- Requires 3+ nodes for production HA
Cluster Initialization Sequence
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Harvester as Harvester HCI
participant K3s as K3s / KubeVirt
participant Storage as Longhorn Storage
Admin->>Server: Boot Harvester ISO
Server->>Harvester: Start installation wizard
Harvester-->>Admin: Interactive console/web UI
Admin->>Harvester: Configure cluster (token, VIP, storage)
Harvester->>Server: Partition disks (OS + Longhorn storage)
Harvester->>Server: Install RancherOS 2.0 base
Harvester->>Server: Install K3s components
Server->>Server: Reboot
Harvester->>K3s: Start K3s server
K3s->>Server: Initialize control plane
K3s->>Server: Deploy Harvester operators
K3s->>Storage: Deploy Longhorn for persistent storage
K3s->>Server: Deploy KubeVirt for VM management
K3s->>Server: Deploy multus CNI (multi-network)
Harvester-->>Admin: Web UI ready at https://<VIP>
Admin->>Harvester: Add additional nodes (join cluster)
Harvester->>K3s: Join nodes to cluster
K3s->>Storage: Replicate storage across nodes
Harvester-->>Admin: Cluster ready (20-30 minutes)
Admin->>Harvester: Create VMs or deploy containers
Maintenance Requirements
OS Updates
Harvester Upgrades (includes OS + K3s):
# Via Web UI:
# Settings → Upgrade → Select version → Start upgrade
# Via kubectl (after downloading upgrade image):
kubectl apply -f https://releases.rancher.com/harvester/v1.4.0/version.yaml
# Monitor upgrade progress
kubectl get upgrades -n harvester-system
Frequency:
- Harvester releases: Every 2-3 months (minor versions)
- Security patches: Included in Harvester releases
- K3s upgrades: Bundled with Harvester upgrades
- No separate OS patching (managed by Harvester)
Kubernetes Upgrades
K3s is upgraded with Harvester - no separate upgrade process.
Version Compatibility:
- Harvester 1.4.x → K3s 1.30+
- Harvester 1.3.x → K3s 1.28+
- Harvester 1.2.x → K3s 1.26+
Upgrade Process:
- Web UI or kubectl to trigger upgrade
- Rolling upgrade of nodes (one at a time)
- VM live migration during node upgrades
- Automatic rollback on failure
Resource Overhead
Single Node (Harvester HCI):
- RAM: 8GB minimum (16GB recommended for VMs)
- CPU: 4 cores minimum (8 cores recommended)
- Disk: 250GB minimum (SSD recommended)
- 100GB for OS/Harvester components
- 150GB+ for Longhorn storage (VM disks)
- Network: 1 Gbps minimum (10 Gbps for production)
Three-Node Cluster (Production HA):
- RAM: 32GB per node (64GB for VM-heavy workloads)
- CPU: 8 cores per node minimum
- Disk: 500GB+ per node (NVMe SSD recommended)
- Network: 10 Gbps recommended (separate storage network ideal)
Comparison:
- Ubuntu + k3s: 1GB RAM
- Talos: 768MB RAM
- Harvester: 8GB+ RAM (much heavier)
Note: Harvester is designed for multi-node HCI, not single-node homelabs.
Security Posture
Strengths:
- SELinux-based (RancherOS 2.0 foundation)
- Immutable OS layer (similar to Talos)
- RBAC built-in (Kubernetes + Rancher)
- Network segmentation (multus CNI)
- VM isolation (KubeVirt)
- Signed images and secure boot support
Attack Surface:
- Larger than Talos/k3s: Includes web UI, VM management, storage layer
- KubeVirt adds additional components
- Web UI is additional attack vector
- More processes than minimal OS (~50+ services)
Security Features:
# VM network isolation example
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: production-vlan
spec:
  vlanID: 100
  uplink:
    linkAttributes: 1500
Hardening:
- Firewall rules (web UI or kubectl)
- RBAC policies (restrict VM/namespace access)
- Network policies (isolate workloads)
- Rancher authentication integration (LDAP, SAML)
Learning Curve
Ease of Adoption: ⭐⭐⭐ (Moderate)
- Web UI simplifies management (no CLI required for basic tasks)
- Requires understanding of VMs + containers
- Kubernetes knowledge helpful but not required initially
- Longhorn storage concepts (replicas, snapshots)
- KubeVirt for VM management (learning curve)
Required Knowledge:
- Basic Kubernetes concepts (pods, services)
- VM management (KubeVirt/libvirt)
- Storage concepts (Longhorn, CSI)
- Networking (VLANs, SR-IOV optional)
- Web UI navigation
Debugging:
# Access via kubectl (kubeconfig from web UI)
kubectl get nodes
# View Harvester logs
kubectl logs -n harvester-system <pod-name>
# VM console access (via web UI or virtctl)
virtctl console <vm-name>
# Storage debugging
kubectl get volumes -A
Community Support
Ecosystem Maturity: ⭐⭐⭐⭐ (Good)
- Documentation: Excellent official docs
- Community: Active Slack, GitHub Discussions, forums
- Commercial Support: Available from SUSE/Rancher
- Third-Party Tools: Rancher ecosystem integration
- Tutorials: Growing number of guides and videos
Pros and Cons Summary
Pros
- Good, because unified platform for VMs + containers (no separate hypervisor)
- Good, because built-in K3s (Kubernetes included)
- Good, because web UI simplifies management (no CLI required)
- Good, because built-in persistent storage (Longhorn CSI)
- Good, because VM live migration (no downtime during maintenance)
- Good, because multi-network support (multus CNI, SR-IOV)
- Good, because Rancher integration (multi-cluster management)
- Good, because automatic upgrades (OS + K3s + components)
- Good, because commercial support available (SUSE)
- Good, because designed for bare-metal HCI (no cloud dependencies)
- Neutral, because immutable OS layer (similar to Talos benefits)
Cons
- Bad, because very heavy resource requirements (8GB+ RAM minimum)
- Bad, because complex architecture (KubeVirt, Longhorn, multus, etc.)
- Bad, because overkill for container-only workloads (use k3s/Talos instead)
- Bad, because larger attack surface than minimal OS (web UI, VM layer)
- Bad, because requires 3+ nodes for production HA (not single-node friendly)
- Bad, because steep learning curve for full feature set (VMs + storage + networking)
- Bad, because relatively new platform (less mature than Ubuntu/Fedora)
- Bad, because limited to Rancher ecosystem (vendor lock-in)
- Bad, because slower to adopt latest Kubernetes versions (depends on K3s bundle)
- Neutral, because opinionated HCI design (pro for VM use cases, con for simplicity)
Recommendations
Best for:
- Hybrid workloads (VMs + containers on same platform)
- Homelab users wanting to consolidate VM hypervisor + Kubernetes
- Teams familiar with Rancher ecosystem
- Multi-node clusters (3+ nodes)
- Environments requiring VM live migration
- Users wanting web UI for infrastructure management
- Replacing VMware/Proxmox + Kubernetes with unified platform
Best Installation Method:
- Only option: Interactive ISO install or PXE with cloud-init
Avoid if:
- Running container-only workloads (use k3s or Talos instead)
- Limited resources (< 8GB RAM per node)
- Single-node homelab (Harvester designed for multi-node)
- Want minimal attack surface (use Talos)
- Prefer traditional Linux shell access (use Ubuntu/Fedora)
- Need latest Kubernetes versions immediately (Harvester lags upstream)
Comparison with Other Options
| Aspect | Harvester | Talos Linux | Ubuntu + k3s | Fedora + kubeadm |
|---|---|---|---|---|
| Primary Use Case | VMs + Containers | Containers only | General-purpose | General-purpose |
| Resource Overhead | 8GB+ RAM | 768MB RAM | 1GB RAM | 2.2GB RAM |
| Kubernetes | Built-in K3s | Built-in | Install k3s | Install kubeadm |
| Management | Web UI + kubectl | API-only (talosctl) | SSH + kubectl | SSH + kubectl |
| Storage | Built-in Longhorn | External CSI | External CSI | External CSI |
| VM Support | Native (KubeVirt) | No | Via KubeVirt | Via KubeVirt |
| Learning Curve | Moderate | Steep | Easy | Moderate |
| Attack Surface | Large | Minimal | Medium | Medium |
| Multi-Node | Designed for | Supports | Supports | Supports |
| Single-Node | Not ideal | Excellent | Excellent | Good |
| Best for | VM + K8s hybrid | K8s-only | Homelab/learning | RHEL ecosystem |
Verdict: Harvester is excellent for VM + container hybrid workloads with 3+ nodes, but overkill for container-only infrastructure. Use Talos or k3s for Kubernetes-only clusters, Ubuntu/Fedora for general-purpose servers.
Advanced Features
VM Management (KubeVirt)
Create VMs via YAML:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ubuntu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
        resources:
          requests:
            memory: 4Gi
            cpu: 2
      volumes:
        - name: root
          containerDisk:
            image: docker.io/harvester/ubuntu:22.04
Live Migration
Move VMs between nodes:
# Via web UI: VM → Actions → Migrate
# Via CLI: virtctl triggers a live migration without restarting the VM
virtctl migrate ubuntu-vm
Backup and Restore
Harvester supports VM backups:
# Configure S3 backup target (web UI)
# Create VM snapshot
# Restore from snapshot or backup
Rancher Integration
Manage multiple clusters:
# Import Harvester cluster into Rancher
# Deploy workloads across clusters
# Central authentication and RBAC
Use Case Examples
Use Case 1: Replace VMware + Kubernetes
Scenario: Currently running VMware ESXi for VMs + separate Kubernetes cluster
Harvester Solution:
- Consolidate to 3-node Harvester cluster
- Migrate VMs to KubeVirt
- Deploy containers on same cluster
- Save VMware licensing costs
Benefits:
- Single platform for VMs + containers
- Unified management (web UI + kubectl)
- Built-in HA and live migration
Use Case 2: Homelab with Mixed Workloads
Scenario: Need Windows VMs + Linux containers + storage server
Harvester Solution:
- Windows VMs via KubeVirt (GPU passthrough supported)
- Linux containers via K3s workloads
- Longhorn for persistent storage (NFS export supported)
Benefits:
- No need for separate Proxmox/ESXi
- Kubernetes-native management
- Learn enterprise HCI platform
Use Case 3: Edge Computing
Scenario: Deploy compute at remote sites (3-5 nodes each)
Harvester Solution:
- Harvester cluster at each edge location
- Rancher for central management
- VM + container workloads
Benefits:
- Autonomous operation (no cloud dependency)
- Rancher multi-cluster management
- Built-in storage and networking
Production Readiness
Production Use: ✅ Yes (used in enterprise environments)
High Availability:
- 3+ nodes required for HA
- Witness node for even-node clusters
- VM live migration during maintenance
- Longhorn 3-replica storage
Monitoring:
- Built-in Prometheus + Grafana
- Rancher monitoring integration
- Alerting and notifications
Disaster Recovery:
- VM backups to S3
- Cluster backups (etcd + config)
- Restore to new cluster
Enterprise Features:
- Rancher authentication (LDAP, SAML, OAuth)
- Multi-tenancy (namespaces, RBAC)
- Audit logging
- Network policies
1.2 - Amazon Web Services Analysis
This section contains detailed analysis of Amazon Web Services (AWS) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.
Overview
Amazon Web Services is Amazon’s comprehensive cloud computing platform, offering compute, storage, networking, and managed services. This analysis focuses on AWS’s capabilities to support the network boot architecture decided in ADR-0002.
Key Services Evaluated
- EC2: Virtual machine instances for hosting boot server
- VPN / VPC: Network connectivity and VPN capabilities
- Elastic Load Balancing: Application and Network Load Balancers
- NAT Gateway: Network address translation for outbound connectivity
- VPC: Virtual Private Cloud networking and routing
Documentation Sections
- Network Boot Support - Analysis of TFTP, HTTP, and HTTPS routing capabilities
- WireGuard Support - Evaluation of WireGuard VPN integration options
1.2.1 - AWS Network Boot Protocol Support
Network Boot Protocol Support on Amazon Web Services
This document analyzes AWS’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.
TFTP (Trivial File Transfer Protocol) Support
Native Support
Status: ❌ Not natively supported by Elastic Load Balancing
AWS’s Elastic Load Balancing services do not support TFTP protocol natively:
- Application Load Balancer (ALB): HTTP/HTTPS only (Layer 7)
- Network Load Balancer (NLB): TCP/UDP support, but not TFTP-aware
- Classic Load Balancer: Deprecated, similar limitations
TFTP operates on UDP port 69 with unique protocol semantics (variable block sizes, retransmissions, port negotiation) that standard load balancers cannot parse.
Implementation Options
Option 1: Direct EC2 Instance Access (Recommended for VPN Scenario)
Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from an EC2 instance:
- Approach: Run TFTP server (e.g., tftpd-hpa or dnsmasq) on an EC2 instance
- Access: Home lab connects via VPN tunnel to instance’s private IP
- Security Group: Allow UDP/69 from VPN subnet/security group
- Pros:
- Simple implementation
- No load balancer needed (single boot server sufficient for home lab)
- TFTP traffic encrypted through VPN tunnel
- Direct instance-to-client communication
- Cons:
- Single point of failure (no HA)
- Manual failover if instance fails
Option 2: Network Load Balancer (NLB) UDP Passthrough
While NLB doesn’t understand TFTP protocol, it can forward UDP traffic:
- Approach: Configure NLB to forward UDP/69 to target group
- Limitations:
- No TFTP-specific health checks
- Health checks would use TCP or different protocol
- Adds cost and complexity without significant benefit for single server
- Use Case: Only relevant for multi-AZ HA deployment (overkill for home lab)
TFTP Security Considerations
- Encryption: TFTP itself is unencrypted, but VPN tunnel provides encryption
- Security Groups: Restrict UDP/69 to VPN security group or CIDR only
- File Access Control: Configure TFTP server with restricted file access
- Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
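A minimal sketch of such a locked-down TFTP service using dnsmasq on the EC2 instance (assuming an Ubuntu/Debian AMI; the file root is illustrative):
sudo apt-get install -y dnsmasq
cat <<'EOF' | sudo tee /etc/dnsmasq.d/tftp.conf
# TFTP only: disable the DNS function, serve files from /srv/tftp
port=0
enable-tftp
tftp-root=/srv/tftp
tftp-secure
EOF
sudo systemctl restart dnsmasq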
HTTP Support
Native Support
Status: ✅ Fully supported
AWS provides comprehensive HTTP support through multiple services:
Elastic Load Balancing - Application Load Balancer
- Protocol Support: HTTP/1.1 and HTTP/2 (HTTP/3 requires CloudFront)
- Port: Any port (typically 80 for HTTP)
- Routing: Path-based, host-based, query string, header-based routing
- Health Checks: HTTP health checks with configurable paths and response codes
- SSL Offloading: Terminate SSL at ALB and use HTTP to backend
- Backend: EC2 instances, ECS, EKS, Lambda
EC2 Direct Access
For VPN scenario, HTTP can be served directly from EC2 instance:
- Approach: Run HTTP server (nginx, Apache, custom service) on EC2
- Access: Home lab accesses via VPN tunnel to private IP
- Security Group: Allow TCP/80 from VPN security group
- Pros: Simpler than ALB for single boot server
HTTP Boot Flow for Network Boot
- PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
- iPXE → HTTP: iPXE chainloads kernel/initrd via HTTP
- Kernel/Initrd: Large boot files served efficiently over HTTP
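The second step is typically driven by a small iPXE script published on the boot server’s HTTP root; a sketch, where the VPN-side address and file names are assumptions:
cat <<'EOF' | sudo tee /var/www/html/boot.ipxe
#!ipxe
# Illustrative addresses; replace with the boot server's VPN IP and file paths
kernel http://10.8.0.1/vmlinuz console=tty0
initrd http://10.8.0.1/initrd.img
boot
EOF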
Performance Considerations
- Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
- Compression: gzip compression for text-based configs
- CloudFront: Optional CDN for caching boot files (probably overkill for VPN scenario)
- TCP Optimization: AWS network optimized for low-latency TCP
HTTPS Support
Native Support
Status: ✅ Fully supported with advanced features
AWS provides enterprise-grade HTTPS support:
Elastic Load Balancing - Application Load Balancer
- Protocol Support: HTTPS/1.1 and HTTP/2 over TLS
- SSL/TLS Termination: Terminate SSL at ALB
- Certificate Management:
- AWS Certificate Manager (ACM) - free SSL certificates with automatic renewal
- Import custom certificates
- Integration with private CA via ACM Private CA
- TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable via security policy)
- Cipher Suites: Predefined security policies (modern, compatible, legacy)
- SNI Support: Multiple certificates on single load balancer
AWS Certificate Manager (ACM)
- Free Certificates: No cost for public SSL certificates used with AWS services
- Automatic Renewal: ACM automatically renews certificates before expiration
- Private CA: ACM Private CA for internal PKI (additional cost)
- Integration: Native integration with ALB, CloudFront, API Gateway
HTTPS for Network Boot
Use Case
Modern UEFI firmware and iPXE support HTTPS boot:
- iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
- UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot
- Security: Boot file integrity verified via HTTPS chain of trust
Implementation on AWS
Certificate Provisioning:
- Use ACM certificate for public domain (free, auto-renewed)
- Use self-signed certificate for VPN-only access (add to iPXE trust store)
- Use ACM Private CA for internal PKI ($400/month - expensive for home lab)
ALB Configuration:
- HTTPS listener on port 443
- Target group pointing to EC2 boot server
- Security policy with TLS 1.2+ minimum
Alternative: Direct EC2 HTTPS:
- Run nginx/Apache with TLS on EC2 instance
- Access via VPN tunnel to private IP with HTTPS
- Simpler setup for VPN-only scenario
- Use Let’s Encrypt or self-signed certificate
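For the VPN-only case, a self-signed certificate generated on the instance is usually sufficient (the hostname is an assumption; the resulting certificate then has to be added to the iPXE/UEFI trust store):
# Self-signed certificate for the boot server's HTTPS endpoint
sudo openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
  -keyout /etc/ssl/private/boot.key \
  -out /etc/ssl/certs/boot.crt \
  -subj "/CN=boot.lab.internal"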
Mutual TLS (mTLS) Support
AWS ALB supports mutual TLS authentication (introduced in late 2023):
- Client Certificates: Require client certificates for authentication
- Trust Store: Upload trusted CA certificates to ALB
- Use Case: Ensure only authorized home lab servers can access boot files
- Integration: Combine with VPN for defense-in-depth
- Passthrough Mode: ALB can pass client cert to backend for validation
Routing and Load Balancing Capabilities
VPC Routing
- Route Tables: Define routes to direct traffic through VPN gateway
- Route Propagation: BGP route propagation for VPN connections
- Transit Gateway: Advanced multi-VPC/VPN routing (overkill for home lab)
Security Groups
- Stateful Firewall: Automatic return traffic handling
- Ingress/Egress Rules: Fine-grained control by protocol, port, source/destination
- Security Group Chaining: Reference security groups in rules (elegant for VPN setup)
- VPN Subnet Restriction: Allow traffic only from VPN-connected subnet
Network ACLs (Optional)
- Stateless Firewall: Subnet-level access control
- Defense in Depth: Additional layer beyond security groups
- Use Case: Probably unnecessary for simple VPN boot server
Cost Implications
Data Transfer Costs
- VPN Traffic: Data transfer through VPN gateway charged at standard rates
- Intra-Region: Free for traffic within same region/VPC
- Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
- Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.14/month (US East egress)
Load Balancing Costs
- Application Load Balancer: $0.0225/hour + $0.008 per LCU-hour ($16-20/month minimum)
- Network Load Balancer: $0.0225/hour + $0.006 per NLCU-hour ($16-18/month minimum)
- For VPN Scenario: Load balancer unnecessary (single EC2 instance sufficient)
Compute Costs
- t3.micro Instance: ~$7.50/month (on-demand pricing, US East)
- t4g.micro Instance: ~$6.00/month (ARM-based, cheaper, sufficient for boot server)
- Reserved Instances: Up to 72% savings with 1-year or 3-year commitment
- Savings Plans: Flexible discounts for consistent compute usage
ACM Certificate Costs
- Public Certificates: Free when used with AWS services
- Private CA: $400/month (too expensive for home lab)
Comparison with Requirements
| Requirement | AWS Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via EC2, not ELB | Direct EC2 access via VPN |
| HTTP | ✅ Full support | EC2 or ALB |
| HTTPS | ✅ Full support | EC2 or ALB with ACM |
| VPN Integration | ✅ Native VPN | Site-to-Site VPN or self-managed |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ ACM (free) | Automatic renewal |
| Cost Efficiency | ✅ Low-cost instances | t4g.micro sufficient |
Recommendations
For VPN-Based Architecture (per ADR-0002)
EC2 Instance: Deploy single t4g.micro or t3.micro instance with:
- TFTP server (tftpd-hpa or dnsmasq)
- HTTP server (nginx or simple Python HTTP server)
- Optional HTTPS with Let’s Encrypt or self-signed certificate
VPN Connection: Connect home lab to AWS via:
- Site-to-Site VPN (IPsec) - managed service, higher cost (~$36/month)
- Self-managed WireGuard on EC2 - lower cost, more control
Security Groups: Restrict access (see the example after this list) to:
- UDP/69 (TFTP) from VPN security group only
- TCP/80 (HTTP) from VPN security group only
- TCP/443 (HTTPS) from VPN security group only
No Load Balancer: For home lab scale, direct EC2 access is sufficient
Health Monitoring: Use CloudWatch for instance and service health
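A sketch of the security-group restrictions in item 3, assuming the WireGuard gateway and boot server each have their own security group (all IDs are placeholders):
# Allow TFTP/HTTP/HTTPS to the boot server only from the VPN gateway's security group
for rule in udp:69 tcp:80 tcp:443; do
  proto=${rule%%:*}; port=${rule##*:}
  aws ec2 authorize-security-group-ingress \
    --group-id sg-0bootserver0000000 \
    --ip-permissions "IpProtocol=${proto},FromPort=${port},ToPort=${port},UserIdGroupPairs=[{GroupId=sg-0wireguard00000000}]"
done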
If HA Required (Future Enhancement)
- Deploy multi-AZ EC2 instances with Network Load Balancer
- Use S3 as backend for boot files with EC2 serving as cache
- Implement auto-recovery with Auto Scaling Group (min=max=1)
References
1.2.2 - AWS WireGuard VPN Support
WireGuard VPN Support on Amazon Web Services
This document analyzes options for deploying WireGuard VPN on AWS to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.
WireGuard Overview
WireGuard is a modern VPN protocol that provides:
- Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
- Performance: High throughput with low overhead
- Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
- Configuration: Simple key-based configuration
- Kernel Integration: Mainline Linux kernel support since 5.6
AWS Native VPN Support
Site-to-Site VPN (IPsec)
Status: ❌ WireGuard not natively supported
AWS’s managed Site-to-Site VPN supports:
- IPsec VPN: IKEv1, IKEv2 with pre-shared keys
- Redundancy: Two VPN tunnels per connection for high availability
- BGP Support: Dynamic routing via BGP
- Transit Gateway: Scalable multi-VPC VPN hub
Limitation: Site-to-Site VPN does not support WireGuard protocol natively.
Cost: Site-to-Site VPN
- VPN Connection: ~$0.05/hour = ~$36/month
- Data Transfer: Standard data transfer out rates (~$0.09/GB for first 10TB)
- Total Estimate: ~$36-50/month for managed IPsec VPN
Self-Managed WireGuard on EC2
Implementation Approach
Since AWS doesn’t offer managed WireGuard, deploy WireGuard on an EC2 instance:
Status: ✅ Fully supported via EC2
Architecture
graph LR
A[Home Lab] -->|WireGuard Tunnel| B[AWS EC2 Instance]
B -->|VPC Network| C[Boot Server EC2]
B -->|IP Forwarding| C
subgraph "Home Network"
A
D[UDM Pro]
D -.WireGuard Client.- A
end
subgraph "AWS VPC"
B[WireGuard Gateway EC2]
C[Boot Server EC2]
end
EC2 Configuration
WireGuard Gateway Instance:
- Instance Type: t4g.micro or t3.micro ($6-7.50/month)
- OS: Ubuntu 22.04 LTS or Amazon Linux 2023 (native WireGuard support)
- Source/Dest Check: Disable to allow IP forwarding
- Elastic IP: Allocate Elastic IP for stable WireGuard endpoint
- Security Group: Allow UDP port 51820 from home lab public IP
Boot Server Instance:
- Network: Same VPC as WireGuard gateway
- Private IP Only: No Elastic IP (accessed via VPN)
- Route Traffic: Through WireGuard gateway instance
Installation Steps
# On EC2 Instance (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools
# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key
# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf
Example /etc/wireguard/wg0.conf on AWS EC2:
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24
Corresponding config on UDM Pro:
[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>
[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <AWS_ELASTIC_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.0.0.0/16
PersistentKeepalive = 25
Enable and Start WireGuard
# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0
# Verify status
sudo wg show
AWS VPC Configuration
Security Groups
Create security group for WireGuard gateway:
aws ec2 create-security-group \
--group-name wireguard-gateway-sg \
--description "WireGuard VPN gateway" \
--vpc-id vpc-xxxxxx
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxx \
--protocol udp \
--port 51820 \
--cidr <HOME_LAB_PUBLIC_IP>/32
Allow SSH for management (optional, restrict to trusted IP):
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxx \
--protocol tcp \
--port 22 \
--cidr <TRUSTED_IP>/32
Disable Source/Destination Check
Required for IP forwarding to work:
aws ec2 modify-instance-attribute \
--instance-id i-xxxxxx \
--no-source-dest-check
Elastic IP Allocation
Allocate and associate Elastic IP for stable endpoint:
aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
--instance-id i-xxxxxx \
--allocation-id eipalloc-xxxxxx
Cost: Elastic IP is free when associated with running instance, but charged ~$3.60/month if unattached.
Route Table Configuration
Add route to direct home lab subnet traffic through WireGuard gateway:
aws ec2 create-route \
--route-table-id rtb-xxxxxx \
--destination-cidr-block 192.168.1.0/24 \
--instance-id i-xxxxxx
This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway instance.
UDM Pro WireGuard Integration
Native Support
Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)
The UniFi Dream Machine Pro includes native WireGuard VPN support:
- GUI Configuration: Web UI for WireGuard VPN setup
- Site-to-Site: Support for site-to-site VPN tunnels
- Performance: Hardware acceleration for encryption (if available)
- Routing: Automatic route injection for remote subnets
Configuration Steps on UDM Pro
Network Settings → VPN:
- Create new VPN connection
- Select “WireGuard”
- Generate key pair or import existing
Peer Configuration:
- Peer Public Key: AWS EC2 WireGuard instance’s public key
- Endpoint: AWS Elastic IP address
- Port: 51820
- Allowed IPs: AWS VPC CIDR (e.g., 10.0.0.0/16)
- Persistent Keepalive: 25 seconds
Route Injection:
- UDM Pro automatically adds routes to AWS subnets
- Home lab servers can reach AWS boot server via VPN
Firewall Rules:
- Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN
Alternative: Manual WireGuard on UDM Pro
If native support is insufficient, use wireguard-go via udm-utilities:
- Repository: boostchicken/udm-utilities
- Script: on_boot.d script to start WireGuard on boot
- Persistence: Survives firmware updates with on-boot script
Performance Considerations
Throughput
WireGuard on EC2 performance varies by instance type:
- t4g.micro (2 vCPU, ARM): ~100-300 Mbps
- t3.micro (2 vCPU, x86): ~100-300 Mbps
- t3.small (2 vCPU): ~500-800 Mbps
- t3.medium (2 vCPU): ~1+ Gbps
For network boot (typical boot = 50-200MB), even t4g.micro is sufficient:
- Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
- Recommendation: t4g.micro adequate and most cost-effective
Latency
- VPN Overhead: WireGuard adds minimal latency (~1-5ms)
- AWS Network: Low-latency network infrastructure
- Total Latency: Primarily dependent on home ISP and AWS region proximity
CPU Usage
- Encryption: ChaCha20 is CPU-efficient
- Kernel Module: Minimal CPU overhead in kernel space
- t4g.micro: Sufficient CPU for home lab VPN throughput
- ARM Advantage: t4g instances use Graviton processors (better price/performance)
Security Considerations
Key Management
- Private Keys: Store securely, never commit to version control
- Key Rotation: Rotate keys periodically (e.g., annually)
- Secrets Manager: Store WireGuard private keys in AWS Secrets Manager
- Retrieve at instance startup via user data script
- Avoid storing in AMIs or instance metadata
- IAM Role: Grant EC2 instance IAM role to read secret
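A sketch of storing the key, using the same secret name that the user data script later in this section retrieves:
# Store the WireGuard server private key in Secrets Manager (run from a trusted admin machine)
aws secretsmanager create-secret \
  --name wireguard-server-key \
  --secret-string file:///etc/wireguard/server_private.key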
Firewall Hardening
- Security Group Restriction: Limit WireGuard port to home lab public IP only
- Least Privilege: Boot server security group allows only VPN security group
- No Public Access: Boot server has no Elastic IP or public route
Monitoring and Alerts
- CloudWatch Logs: Stream WireGuard logs to CloudWatch
- CloudWatch Alarms: Alert on VPN tunnel down (no recent handshakes)
- VPC Flow Logs: Monitor VPN traffic patterns
DDoS Protection
- UDP Amplification: WireGuard resistant to DDoS amplification attacks
- AWS Shield: Basic DDoS protection included free on all AWS resources
- Shield Advanced: Optional ($3,000/month - overkill for VPN endpoint)
High Availability Options
Multi-AZ Failover
Deploy WireGuard gateways in multiple Availability Zones:
- Primary: us-east-1a WireGuard instance
- Secondary: us-east-1b WireGuard instance
- Failover: UDM Pro switches endpoints if primary fails
- Cost: Doubles instance costs (~$12-15/month for 2 instances)
Auto Scaling Group (Single Instance)
Use Auto Scaling Group with min=max=1 for auto-recovery:
- Health Checks: EC2 status checks
- Auto-Recovery: ASG replaces failed instance automatically
- Elastic IP: Reassociate Elastic IP to new instance via Lambda/script
- Limitation: Brief downtime during recovery (~2-5 minutes)
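A hedged sketch of the Elastic IP reassociation step, run from the replacement instance's user data (the allocation ID is a placeholder and the instance role needs ec2:AssociateAddress):
# Re-attach the Elastic IP to whichever instance the ASG just launched
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 associate-address \
  --instance-id "$INSTANCE_ID" \
  --allocation-id eipalloc-xxxxxx \
  --allow-reassociation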
Health Monitoring
Monitor WireGuard tunnel health with CloudWatch custom metrics:
#!/bin/bash
# On EC2 instance, run periodically via cron
# Seconds since the most recent WireGuard handshake on wg0
HANDSHAKE=$(wg show wg0 latest-handshakes | awk '{print $2}')
NOW=$(date +%s)
AGE=$((NOW - HANDSHAKE))
aws cloudwatch put-metric-data \
--namespace WireGuard \
--metric-name TunnelAge \
--value $AGE \
--unit Seconds
Alert if handshake age exceeds threshold (e.g., 180 seconds).
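A sketch of such an alarm on the custom metric above (the SNS topic ARN is a placeholder):
# Alarm when no handshake has been seen for more than 3 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name wireguard-tunnel-stale \
  --namespace WireGuard \
  --metric-name TunnelAge \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 180 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:homelab-alerts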
User Data Script for Auto-Configuration
EC2 user data script to configure WireGuard on launch:
#!/bin/bash
# Install WireGuard
apt update && apt install -y wireguard wireguard-tools
# Retrieve private key from Secrets Manager
aws secretsmanager get-secret-value \
--secret-id wireguard-server-key \
--query SecretString \
--output text > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key
# Configure interface (full config omitted for brevity)
# ...
# Enable and start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
Requires IAM instance role with secretsmanager:GetSecretValue permission.
Cost Analysis
Self-Managed WireGuard on EC2
| Component | Cost (US East) |
|---|---|
| t4g.micro instance (730 hrs/month) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Data transfer out (1GB/month) | ~$0.09 |
| Monthly Total | ~$6.09 |
| Annual Total | ~$73 |
With Reserved Instance (1-year, no upfront):
| Component | Cost |
|---|---|
| t4g.micro RI (1-year) | ~$3.50/month |
| Elastic IP | $0.00 |
| Data transfer | ~$0.09 |
| Monthly Total | ~$3.59 |
| Annual Total | ~$43 |
Site-to-Site VPN (IPsec - if WireGuard not used)
| Component | Cost |
|---|---|
| VPN Connection (2 tunnels) | ~$36 |
| Data transfer (1GB/month) | ~$0.09 |
| Monthly Total | ~$36 |
| Annual Total | ~$432 |
Cost Savings: Self-managed WireGuard saves ~$360/year vs Site-to-Site VPN (or ~$390/year with Reserved Instance).
Comparison with Requirements
| Requirement | AWS Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via EC2 | Self-managed on instance |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Very low cost | t4g.micro ~$6/month (on-demand) |
| Performance | ✅ Sufficient | 100+ Mbps on t4g.micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-AZ or ASG |
Recommendations
For Home Lab VPN (per ADR-0002)
Self-Managed WireGuard: Deploy on EC2 t4g.micro instance
- Cost: ~$6/month on-demand, ~$3.50/month with Reserved Instance
- Performance: Sufficient for network boot traffic
- Simplicity: Easy to configure and maintain
Single AZ Deployment: Unless HA required, single instance adequate
- Region Selection: Choose region closest to home lab for lowest latency
- AZ: Single AZ sufficient (boot server not mission-critical)
UDM Pro Native WireGuard: Use built-in WireGuard client
- Configuration: Add AWS instance as WireGuard peer in UDM Pro UI
- Route Injection: UDM Pro automatically routes AWS subnets
Security Best Practices:
- Store WireGuard private key in Secrets Manager
- Restrict security group to home lab public IP only
- Use user data script to retrieve key and configure on boot
- Enable CloudWatch logging for VPN events
- Assign IAM instance role with minimal permissions
Monitoring: Set up CloudWatch alarms for:
- Instance status check failures
- High CPU usage
- VPN tunnel age (custom metric)
Cost Optimization
- Reserved Instance: Commit to 1-year Reserved Instance for ~40% savings
- Spot Instance: Consider Spot for even lower cost (~70% savings), but adds complexity (handle interruptions)
- ARM Architecture: Use t4g (Graviton) for 20% better price/performance vs t3
Future Enhancements
- HA Setup: Deploy secondary WireGuard instance in different AZ
- Automated Failover: Lambda function to reassociate Elastic IP on failure
- IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
- Mesh VPN: Expand to mesh topology if multiple sites added
References
1.3 - Google Cloud Platform Analysis
This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.
Overview
Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.
Key Services Evaluated
- Compute Engine: Virtual machine instances for hosting boot server
- Cloud VPN / VPC: Network connectivity and VPN capabilities
- Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
- Cloud NAT: Network address translation for outbound connectivity
- VPC Network: Software-defined networking and routing
Documentation Sections
- Network Boot Support - Analysis of TFTP, HTTP, and HTTPS routing capabilities
- WireGuard Support - Evaluation of WireGuard VPN integration options
1.3.1 - Cloud Storage FUSE (gcsfuse)
Overview
Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.
- Project: GoogleCloudPlatform/gcsfuse
- License: Apache 2.0
- Status: Generally Available (GA)
- Latest Version: v2.x (as of 2024)
How gcsfuse Works
gcsfuse translates filesystem operations into GCS API calls:
- Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
- Directory Structure: Interprets / in object names as directory separators
- File Operations: Translates read(), write(), open(), etc. into GCS API requests
- Metadata: Maintains file attributes (size, modification time) via GCS metadata
- Caching: Optional stat, type, list, and file caching to reduce API calls
Example:
- GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
- Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img
Relevance to Network Boot Infrastructure
In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.
Potential Use Cases
- Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
- Configuration Sync: Access boot profiles and machine mappings from GCS as local files
- Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
- Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
Architecture Pattern
┌─────────────────────────┐
│ Boot Server Process │
│ (Cloud Run/Compute) │
└───────────┬─────────────┘
│ filesystem operations
│ (read, open, stat)
▼
┌─────────────────────────┐
│ gcsfuse mount point │
│ /var/lib/boot-assets │
└───────────┬─────────────┘
│ FUSE layer
│ (translates to GCS API)
▼
┌─────────────────────────┐
│ Cloud Storage Bucket │
│ gs://boot-assets/ │
└─────────────────────────┘
Performance Characteristics
Latency
- Much higher latency than local filesystem: Every operation requires GCS API call(s)
- No default caching: Without caching enabled, every read re-fetches from GCS
- Network round-trip: Minimum ~10-50ms latency per operation (depending on region)
Throughput
Single Large File:
- Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
- Write: Comparable to gsutil cp for large files
- With parallel downloads: Up to 9x faster for single-threaded reads of large files
Small Files:
- Poor performance for random I/O on small files
- Bulk operations on many small files create significant bottlenecks
- ls on directories with thousands of objects can take minutes
Concurrent Access:
- Performance degrades significantly with parallel readers (8 instances: ~30 hours vs 16 minutes with local data)
- Not recommended for high-concurrency scenarios (web servers, NAS)
Performance Improvements (Recent Features)
Streaming Writes (default): Upload data directly to GCS as written
- Up to 40% faster for large sequential writes
- Reduces local disk usage (no staging file)
Parallel Downloads: Download large files using multiple workers
- Up to 9x faster model load times
- Best for single-threaded reads of large files
File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)
- Up to 2.3x faster training time (AI/ML workloads)
- Up to 3.4x higher throughput
- Requires explicit cache directory configuration
Metadata Cache: Cache stat, type, and list operations
- Stat and type caches enabled by default
- Configurable TTL (default: 60s, set -1 for unlimited)
Caching Configuration
gcsfuse provides four types of caching:
1. Stat Cache
Caches file attributes (size, modification time, existence).
# Enable with unlimited size and TTL
gcsfuse \
--stat-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).
2. Type Cache
Caches file vs directory type information.
gcsfuse \
--type-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Speeds up directory traversal and ls operations.
3. List Cache
Caches directory listing results.
gcsfuse \
--max-conns-per-host=100 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Improves performance for applications that repeatedly list directory contents.
4. File Cache
Caches actual file contents locally.
gcsfuse \
--file-cache-max-size-mb=-1 \
--cache-dir=/mnt/local-ssd \
--file-cache-cache-file-for-range-read=true \
--file-cache-enable-parallel-downloads=true \
bucket-name /mount/point
Use case: Essential for AI/ML training, repeated reads of large files.
Recommended cache storage:
- Local SSD: Fastest, but ephemeral (data lost on restart)
- Persistent Disk: Persistent but slower than Local SSD
- tmpfs (RAM disk): Fastest but limited by memory
Production Configuration Example
# config.yaml for gcsfuse
metadata-cache:
ttl-secs: -1 # Never expire (use only if bucket is read-only or single-writer)
stat-cache-max-size-mb: -1
type-cache-max-size-mb: -1
file-cache:
max-size-mb: -1 # Unlimited (limited by disk space)
cache-file-for-range-read: true
enable-parallel-downloads: true
parallel-downloads-per-file: 16
download-chunk-size-mb: 50
write:
create-empty-file: false # Streaming writes (default)
logging:
severity: info
format: json
gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets
Limitations and Considerations
Filesystem Semantics
gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:
- No atomic rename: Rename operations are copy-then-delete (not atomic)
- No hard links: GCS doesn’t support hard links
- No file locking: flock() is a no-op
- Limited permissions: GCS has simpler ACLs than POSIX permissions
- No sparse files: Writes always materialize full file content
Performance Anti-Patterns
❌ Avoid:
- Serving web content or acting as NAS (concurrent connections)
- Random I/O on many small files (image datasets, text corpora)
- Reading during ML training loops (download first, then train)
- High-concurrency workloads (multiple parallel readers/writers)
✅ Good for:
- Sequential reads of large files (models, checkpoints, kernels)
- Infrequent writes of entire files
- Read-mostly workloads with caching enabled
- Single-writer scenarios
Consistency Trade-offs
With caching enabled:
- Stale reads possible if cache TTL > 0 and external modifications occur
- Safe only for:
- Read-only buckets
- Single-writer, single-mount scenarios
- Workloads tolerant of eventual consistency
Without caching:
- Strong consistency (every read fetches latest from GCS)
- Much slower performance
Resource Requirements
- Disk space: File cache and streaming writes require local storage
- File cache: Size of cached files (can be large for ML datasets)
- Streaming writes: Temporary staging (proportional to concurrent writes)
- Memory: Metadata caches consume RAM
- File handles: Can exceed system limits with high concurrency
- Network bandwidth: All data transfers via GCS API
Installation
On Compute Engine (Container-Optimized OS)
# Install gcsfuse (Container-Optimized OS doesn't include package managers)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb
On Debian/Ubuntu
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
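To remount automatically at boot, gcsfuse installs a mount helper that can be referenced from /etc/fstab; this is a sketch with the bucket name and mount point as placeholders (check the gcsfuse documentation for the option names supported by your version):
# Persist the mount across reboots via /etc/fstab
sudo mkdir -p /mnt/boot-assets
echo "boot-assets /mnt/boot-assets gcsfuse rw,_netdev,allow_other,implicit_dirs" | sudo tee -a /etc/fstab
sudo mount /mnt/boot-assets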
In Docker/Cloud Run
FROM ubuntu:22.04
# Install gcsfuse
RUN apt-get update && apt-get install -y \
curl \
gnupg \
lsb-release \
&& export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
&& echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
&& apt-get update \
&& apt-get install -y gcsfuse \
&& rm -rf /var/lib/apt/lists/*
# Create mount point
RUN mkdir -p /mnt/boot-assets
# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
/usr/local/bin/boot-server
Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.
Network Boot Infrastructure Evaluation
Applicability to ADR-0005
Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:
❌ Cloud Run Incompatibility
- gcsfuse requires FUSE kernel module and privileged containers
- Cloud Run does not support FUSE or privileged mode
- ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
- Impact: Blocks Cloud Run deployment, forcing Compute Engine VM
❌ Boot Latency Requirements
- Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
- gcsfuse adds 10-50ms+ latency per operation (network round-trips)
- Kernel/initrd downloads are latency-sensitive (network boot timeout)
- Impact: May exceed boot timeout thresholds
❌ No Caching for Read-Write Workloads
- Boot server needs to write new assets and read existing ones
- File cache with unlimited TTL requires read-only or single-writer assumption
- Multiple boot server instances (autoscaling) violate single-writer constraint
- Impact: Either accept stale reads or disable caching (slow)
❌ Small File Performance
- Machine mapping configs, boot scripts, profiles are small files (KB range)
- gcsfuse performs poorly on small, random I/O
- ls operations on directories with many profiles can be slow
- Impact: Slow boot configuration lookups
✅ Alternative: Direct Cloud Storage SDK
Using cloud.google.com/go/storage SDK directly offers:
- Lower latency: Direct API calls without FUSE overhead
- Cloud Run compatible: No kernel module or privileged mode required
- Better control: Explicit caching, parallel downloads, streaming
- Simpler deployment: No mount management, no FUSE dependencies
- Cost: Similar API call costs to gcsfuse
Recommended approach (from ADR-0005):
// Custom boot server using Cloud Storage SDK (sketch, inside an HTTP handler)
client, err := storage.NewClient(ctx) // client is normally created once at startup
if err != nil {
	log.Fatal(err)
}
// Stream kernel to boot client
obj := client.Bucket("boot-assets").Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
	http.Error(w, "asset not found", http.StatusNotFound)
	return
}
defer reader.Close()
io.Copy(w, reader) // Stream object contents to the HTTP response
When gcsfuse MIGHT Be Useful
Despite the above limitations, gcsfuse could be considered for:
Matchbox on Compute Engine:
- Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
- Compute Engine VM supports FUSE
- Read-heavy workload (boot assets rarely change)
- Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
Development/Testing:
- Quick prototyping without writing Cloud Storage integration
- Local development with production bucket access
- Not recommended for production deployment
Low-Throughput Scenarios:
- Home lab scale (< 10 boots/hour)
- File cache enabled with Local SSD
- Single Compute Engine VM (not autoscaled)
Configuration for Matchbox + gcsfuse:
#!/bin/bash
# Mount boot assets for Matchbox
BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"
mkdir -p "$MOUNT_POINT" "$CACHE_DIR"
gcsfuse \
--stat-cache-max-size-mb=-1 \
--type-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
--file-cache-max-size-mb=-1 \
--cache-dir="$CACHE_DIR" \
--file-cache-cache-file-for-range-read=true \
--file-cache-enable-parallel-downloads=true \
--implicit-dirs \
--foreground \
"$BUCKET" "$MOUNT_POINT"
Monitoring and Troubleshooting
Metrics
gcsfuse exposes Prometheus metrics:
gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point
Key metrics:
- gcs_read_count: Number of GCS read operations
- gcs_write_count: Number of GCS write operations
- gcs_read_bytes: Bytes read from GCS
- gcs_write_bytes: Bytes written to GCS
- fs_ops_count: Filesystem operations by type (open, read, write, etc.)
- fs_ops_error_count: Filesystem operation errors
Logging
# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point
Common Issues
Issue: ls on large directories takes minutes
Solution:
- Enable list caching with --metadata-cache-ttl-secs=-1
- Reduce directory depth (flatten object hierarchy)
- Consider prefix-based filtering instead of full listings
Issue: Stale reads after external bucket modifications
Solution:
- Reduce --metadata-cache-ttl-secs (default 60s)
- Disable caching entirely for strong consistency
- Use versioned object names (immutable assets)
Issue: Transport endpoint is not connected errors
Solution:
- Unmount cleanly before remounting: fusermount -u /mnt/point
- Check GCS bucket permissions (IAM roles)
- Verify network connectivity to storage.googleapis.com
Issue: High memory usage
Solution:
- Limit metadata cache sizes: --stat-cache-max-size-mb=1024
- Disable file cache if not needed
- Monitor with --prometheus metrics
Comparison to Alternatives
gcsfuse vs Direct Cloud Storage SDK
| Aspect | gcsfuse | Cloud Storage SDK |
|---|---|---|
| Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API) |
| Cloud Run | ❌ Not supported | ✅ Fully supported |
| Development Effort | Low (standard filesystem code) | Medium (SDK integration) |
| Performance | Slower (filesystem abstraction) | Faster (optimized for use case) |
| Caching | Built-in (stat, type, list, file) | Manual (application-level) |
| Streaming | Automatic | Explicit (io.Copy) |
| Dependencies | FUSE kernel module, privileged mode | None (pure Go library) |
Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.
gcsfuse vs rsync/gsutil Sync
Periodic sync pattern:
# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets
| Aspect | gcsfuse | rsync/gsutil sync |
|---|---|---|
| Consistency | Eventual (with caching) | Strong (within sync interval) |
| Disk Usage | Minimal (file cache optional) | Full copy of assets |
| Latency | GCS API per request | Local disk (fast) |
| Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes) |
| Deployment | Requires FUSE | Simple cron job |
Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.
Conclusion
Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:
- Cloud Run incompatibility (requires FUSE kernel module)
- Added latency (FUSE overhead + network round-trips)
- Poor performance for small files and concurrent access
- Caching trade-offs (consistency vs performance)
Recommended alternatives:
- Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
- Matchbox on Compute Engine: rsync/gsutil sync to local disk
- Cloud Run Deployment: Direct SDK (no gcsfuse possible)
gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.
References
1.3.2 - GCP Network Boot Protocol Support
Network Boot Protocol Support on Google Cloud Platform
This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.
TFTP (Trivial File Transfer Protocol) Support
Native Support
Status: ❌ Not natively supported by Cloud Load Balancing
GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.
Implementation Options
Option 1: Direct VM Access (Recommended for VPN Scenario)
Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing:
- Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
- Access: Home lab connects via VPN tunnel to the VM’s private IP
- Routing: VPC firewall rules allow UDP/69 from VPN subnet
- Pros:
- Simple implementation
- No need for load balancing (single boot server sufficient)
- TFTP traffic encrypted through VPN tunnel
- Direct VM-to-client communication
- Cons:
- Single point of failure (no load balancing/HA)
- Manual failover required if VM fails
Option 2: Network Load Balancer (NLB) Passthrough
While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:
- Approach: Configure Network Load Balancer for UDP/69 passthrough
- Limitations:
- No protocol-aware health checks for TFTP
- Health checks would use TCP or HTTP on alternate port
- Adds complexity without significant benefit for single boot server
- Use Case: Only relevant for multi-region HA deployment (overkill for home lab)
TFTP Security Considerations
- Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
- Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
- File Access Control: Configure TFTP server with restricted file access
- Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
HTTP Support
Native Support
Status: ✅ Fully supported
GCP provides comprehensive HTTP support through multiple services:
Cloud Load Balancing - Application Load Balancer
- Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
- Port: Any port (typically 80 for HTTP)
- Routing: URL-based routing, host-based routing, path-based routing
- Health Checks: HTTP health checks with configurable paths
- SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
- Backend: Compute Engine VMs, instance groups, Cloud Run, GKE
Compute Engine Direct Access
For VPN scenario, HTTP can be served directly from VM:
- Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
- Access: Home lab accesses via VPN tunnel to private IP
- Firewall: VPC firewall rules allow TCP/80 from VPN subnet
- Pros: Simpler than load balancer for single boot server
HTTP Boot Flow for Network Boot
- PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
- iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
- Kernel/Initrd: Large boot files served efficiently over HTTP
Performance Considerations
- Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
- Compression: gzip compression for text-based boot configs
- Caching: Cloud CDN can cache boot files for faster delivery
- TCP Optimization: GCP’s network optimized for low-latency TCP
HTTPS Support
Native Support
Status: ✅ Fully supported with advanced features
GCP provides enterprise-grade HTTPS support:
Cloud Load Balancing - Application Load Balancer
- Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
- SSL/TLS Termination: Terminate SSL at load balancer
- Certificate Management:
- Google-managed SSL certificates (automatic renewal)
- Self-managed certificates (bring your own)
- Certificate Map for multiple domains
- TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
- Cipher Suites: Modern, compatible, or custom cipher suites
- mTLS Support: Mutual TLS authentication (client certificates)
Certificate Manager
- Managed Certificates: Automatic provisioning and renewal via Let’s Encrypt integration
- Private CA: Integration with Google Cloud Certificate Authority Service
- Certificate Maps: Route different domains to different backends based on SNI
- Certificate Monitoring: Automatic alerts before expiration
HTTPS for Network Boot
Use Case
Modern UEFI firmware and iPXE support HTTPS boot:
- iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
- UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (UEFI 2.5+)
- Security: Boot file integrity verified via HTTPS chain of trust
Implementation on GCP
Certificate Provisioning:
- Use Google-managed certificate for public domain (if boot server has public DNS)
- Use self-signed certificate for VPN-only access (add to iPXE trust store)
- Use private CA for internal PKI
Load Balancer Configuration:
- HTTPS frontend (port 443)
- Backend service to Compute Engine VM running boot server
- SSL policy with TLS 1.2+ minimum
Alternative: Direct VM HTTPS:
- Run nginx/Apache with TLS on Compute Engine VM
- Access via VPN tunnel to private IP with HTTPS
- Simpler setup for VPN-only scenario
mTLS Support for Enhanced Security
GCP’s Application Load Balancer supports mutual TLS authentication:
- Client Certificates: Require client certificates for additional authentication
- Certificate Validation: Validate client certificates against trusted CA
- Use Case: Ensure only authorized home lab servers can access boot files
- Integration: Combine with VPN for defense-in-depth
Routing and Load Balancing Capabilities
VPC Routing
- Custom Routes: Define routes to direct traffic through VPN gateway
- Route Priority: Configure route priorities for failover scenarios
- BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)
Firewall Rules
- Ingress/Egress Rules: Fine-grained control over traffic
- Source/Destination Filters: IP ranges, tags, service accounts
- Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
- VPN Subnet Restriction: Limit access to VPN-connected home lab subnet
Cloud Armor (Optional)
For additional security if boot server has public access:
- DDoS Protection: Layer 3/4 DDoS mitigation
- WAF Rules: Application-level filtering
- IP Allowlisting: Restrict to known public IPs
- Rate Limiting: Prevent abuse
Cost Implications
Network Egress Costs
- VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
- Intra-Region: Free for traffic within same region
- Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
- Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)
Load Balancing Costs
- Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
- Network Load Balancer: ~$0.025/hour + data processing charges
- For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)
Compute Costs
- e2-micro Instance: ~$6-7/month (suitable for boot server)
- f1-micro Instance: ~$4-5/month (even smaller, might suffice)
- Reserved/Committed Use: Discounts for long-term commitment
Comparison with Requirements
| Requirement | GCP Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN |
| HTTP | ✅ Full support | VM or ALB |
| HTTPS | ✅ Full support | VM or ALB with Certificate Manager |
| VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ Managed certs | Certificate Manager |
| Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient |
Recommendations
For VPN-Based Architecture (per ADR-0002)
Compute Engine VM: Deploy single e2-micro VM with:
- TFTP server (tftpd-hpa or dnsmasq)
- HTTP server (nginx or simple Python HTTP server)
- Optional HTTPS with self-signed certificate
VPN Tunnel: Connect home lab to GCP via:
- Cloud VPN (IPsec) - easier setup, higher cost
- Self-managed WireGuard on Compute Engine - lower cost, more control
VPC Firewall: Restrict access (see the example after this list) to:
- UDP/69 (TFTP) from VPN subnet only
- TCP/80 (HTTP) from VPN subnet only
- TCP/443 (HTTPS) from VPN subnet only
No Load Balancer: For home lab scale, direct VM access is sufficient
Health Monitoring: Use Cloud Monitoring for VM and service health
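A sketch of the firewall restriction in item 3, using the WireGuard tunnel and home LAN subnets referenced elsewhere in this document (the network, ranges, and tag are placeholders):
# Allow TFTP/HTTP/HTTPS to tagged boot-server VMs only from VPN-connected subnets
gcloud compute firewall-rules create allow-boot-from-vpn \
  --direction=INGRESS \
  --network=default \
  --action=ALLOW \
  --rules=udp:69,tcp:80,tcp:443 \
  --source-ranges=10.200.0.0/24,192.168.1.0/24 \
  --target-tags=boot-server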
If HA Required (Future Enhancement)
- Deploy multi-zone VMs with Network Load Balancer
- Use Cloud Storage as backend for boot files with VM serving as cache
- Implement failover automation with Cloud Functions
References
1.3.3 - GCP WireGuard VPN Support
WireGuard VPN Support on Google Cloud Platform
This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.
WireGuard Overview
WireGuard is a modern VPN protocol that provides:
- Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
- Performance: High throughput with low overhead
- Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
- Configuration: Simple key-based configuration
- Kernel Integration: Mainline Linux kernel support since 5.6
GCP Native VPN Support
Cloud VPN (IPsec)
Status: ❌ WireGuard not natively supported
GCP’s managed Cloud VPN service supports:
- IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
- HA VPN: Highly available VPN with 99.99% SLA
- Classic VPN: Single-tunnel VPN (deprecated)
Limitation: Cloud VPN does not support WireGuard protocol natively.
Cost: Cloud VPN
- HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
- Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
- Total Estimate: ~$75-100/month for managed VPN
Self-Managed WireGuard on Compute Engine
Implementation Approach
Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:
Status: ✅ Fully supported via Compute Engine
Architecture
graph LR
A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
B -->|Private VPC Network| C[Boot Server VM]
B -->|IP Forwarding| C
subgraph "Home Network"
A
D[UDM Pro]
D -.WireGuard Client.- A
end
subgraph "GCP VPC"
B[WireGuard Gateway VM]
C[Boot Server VM]
end
VM Configuration
WireGuard Gateway VM:
- Instance Type: e2-micro or f1-micro ($4-7/month)
- OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
- IP Forwarding: Enable IP forwarding to route traffic to other VMs
- External IP: Static external IP for stable WireGuard endpoint
- Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
Boot Server VM:
- Network: Same VPC as WireGuard gateway
- Private IP Only: No external IP (accessed via VPN)
- Route Traffic: Through WireGuard gateway VM
Installation Steps
# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools
# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key
# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf
Example /etc/wireguard/wg0.conf on GCP VM:
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE
[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24
Corresponding config on UDM Pro:
[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>
[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25
Enable and Start WireGuard
# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0
# Verify status
sudo wg show
GCP VPC Configuration
Firewall Rules
Create VPC firewall rule to allow WireGuard:
gcloud compute firewall-rules create allow-wireguard \
--direction=INGRESS \
--priority=1000 \
--network=default \
--action=ALLOW \
--rules=udp:51820 \
--source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
--target-tags=wireguard-gateway
Tag the WireGuard VM:
gcloud compute instances add-tags wireguard-gateway-vm \
--tags=wireguard-gateway \
--zone=us-central1-a
Static External IP
Reserve static IP for stable WireGuard endpoint:
gcloud compute addresses create wireguard-gateway-ip \
--region=us-central1
gcloud compute instances delete-access-config wireguard-gateway-vm \
--access-config-name="external-nat" \
--zone=us-central1-a
gcloud compute instances add-access-config wireguard-gateway-vm \
--access-config-name="external-nat" \
--address=wireguard-gateway-ip \
--zone=us-central1-a
Cost: A static external IP adds roughly $3-4/month while attached to a running VM (see the cost analysis below); unused reserved addresses are billed at a higher rate.
Route Configuration
For traffic from boot server to reach home lab via WireGuard VM:
gcloud compute routes create route-to-homelab \
--network=default \
--priority=100 \
--destination-range=192.168.1.0/24 \
--next-hop-instance=wireguard-gateway-vm \
--next-hop-instance-zone=us-central1-a
This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.
UDM Pro WireGuard Integration
Native Support
Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)
The UniFi Dream Machine Pro includes native WireGuard VPN support:
- GUI Configuration: Web UI for WireGuard VPN setup
- Site-to-Site: Support for site-to-site VPN tunnels
- Performance: Hardware acceleration for encryption (if available)
- Routing: Automatic route injection for remote subnets
Configuration Steps on UDM Pro
Network Settings → VPN:
- Create new VPN connection
- Select “WireGuard”
- Generate key pair or import existing
Peer Configuration:
- Peer Public Key: GCP WireGuard VM’s public key
- Endpoint: GCP VM’s static external IP
- Port: 51820
- Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
- Persistent Keepalive: 25 seconds
Route Injection:
- UDM Pro automatically adds routes to GCP subnets
- Home lab servers can reach GCP boot server via VPN
Firewall Rules:
- Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN
Alternative: Manual WireGuard on UDM Pro
If native support is insufficient, use wireguard-go via udm-utilities:
- Repository: boostchicken/udm-utilities
- Script: on_boot.d script to start WireGuard
- Persistence: Survives firmware updates with on-boot script
Performance Considerations
Throughput
WireGuard on Compute Engine performance:
- e2-micro (2 vCPU, shared core): ~100-300 Mbps
- e2-small (2 vCPU): ~500-800 Mbps
- e2-medium (2 vCPU): ~1+ Gbps
For network boot (typical boot = 50-200MB), even e2-micro is sufficient:
- Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
- Recommendation: e2-micro adequate for home lab scale
Latency
- VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
- GCP Network: Low-latency network to most regions
- Total Latency: Primarily dependent on home ISP and GCP region proximity
CPU Usage
- Encryption: ChaCha20 is CPU-efficient
- Kernel Module: Minimal CPU overhead in kernel space
- e2-micro: Sufficient CPU for home lab VPN throughput
Security Considerations
Key Management
- Private Keys: Store securely, never commit to version control
- Key Rotation: Rotate keys periodically (e.g., annually)
- Secret Manager: Store WireGuard private keys in GCP Secret Manager
- Retrieve at VM startup via startup script
- Avoid storing in VM metadata or disk images
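A sketch of storing the key, using the same secret name that the startup script later in this section retrieves:
# Store the WireGuard server private key in Secret Manager (run from a trusted admin machine)
gcloud secrets create wireguard-server-key \
  --data-file=/etc/wireguard/server_private.key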
Firewall Hardening
- Source IP Restriction: Limit WireGuard port to home lab public IP only
- Least Privilege: Boot server firewall allows only VPN subnet
- No Public Access: Boot server has no external IP
Monitoring and Alerts
- Cloud Logging: Log WireGuard connection events
- Cloud Monitoring: Alert on VPN tunnel down
- Metrics: Monitor handshake failures, data transfer
DDoS Protection
- UDP Amplification: WireGuard resistant to DDoS amplification
- Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)
High Availability Options
Multi-Region Failover
Deploy WireGuard gateways in multiple regions:
- Primary: us-central1 WireGuard VM
- Secondary: us-east1 WireGuard VM
- Failover: UDM Pro switches endpoints if primary fails
- Cost: Doubles VM costs (~$8-14/month for 2 VMs)
Health Checks
Monitor WireGuard tunnel health:
# On UDM Pro (via SSH)
wg show wg0 latest-handshakes
# If handshake timestamp old (>3 minutes), tunnel may be down
Automate failover with script on UDM Pro or external monitoring.
Startup Scripts for Auto-Healing
GCP VM startup script to ensure WireGuard starts on boot:
#!/bin/bash
# /etc/startup-script.sh
# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key
# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
Attach as metadata:
gcloud compute instances add-metadata wireguard-gateway-vm \
--metadata-from-file startup-script=/path/to/startup-script.sh \
--zone=us-central1-a
Cost Analysis
Self-Managed WireGuard on Compute Engine
| Component | Cost |
|---|---|
| e2-micro VM (730 hrs/month) | ~$6.50 |
| Static External IP | ~$3.50 |
| Egress (1GB/month boot traffic) | ~$0.12 |
| Monthly Total | ~$10.12 |
| Annual Total | ~$121 |
Cloud VPN (IPsec - if WireGuard not used)
| Component | Cost |
|---|---|
| HA VPN Gateway (2 tunnels) | ~$73 |
| Egress (1GB/month) | ~$0.12 |
| Monthly Total | ~$73 |
| Annual Total | ~$876 |
Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.
Comparison with Requirements
| Requirement | GCP Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Low cost | e2-micro ~$10/month |
| Performance | ✅ Sufficient | 100+ Mbps on e2-micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-region VMs |
Recommendations
For Home Lab VPN (per ADR-0002)
Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM
- Cost: ~$10/month (vs ~$73/month for Cloud VPN)
- Performance: Sufficient for network boot traffic
- Simplicity: Easy to configure and maintain
Single Region Deployment: Unless HA required, single VM adequate
- Region Selection: Choose region closest to home lab for lowest latency
- Zone: Single zone sufficient (boot server not mission-critical)
UDM Pro Native WireGuard: Use built-in WireGuard client
- Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
- Route Injection: UDM Pro automatically routes GCP subnets
Security Best Practices:
- Store WireGuard private key in Secret Manager
- Restrict WireGuard port to home public IP only
- Use startup script to configure VM on boot
- Enable Cloud Logging for VPN events
Monitoring: Set up Cloud Monitoring alerts for:
- VM down
- High CPU usage (indicates traffic spike or issue)
- Firewall rule blocks (indicates misconfiguration)
Future Enhancements
- HA Setup: Deploy secondary WireGuard VM in different region
- Automated Failover: Script on UDM Pro to switch endpoints
- IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
- Mesh VPN: Expand to mesh topology if multiple sites added
References
1.4 - HP ProLiant DL360 Gen9 Analysis
This section contains detailed analysis of the HP ProLiant DL360 Gen9 server platform, including hardware specifications, network boot capabilities, and configuration guidance for home lab deployments.
Overview
The HP ProLiant DL360 Gen9 is a 1U rack-mountable server released by HPE as part of their Generation 9 (Gen9) product line, introduced in 2014. It’s a popular choice for home labs due to its balance of performance, density, and relative power efficiency compared to earlier generations.
Key Features
- Form Factor: 1U rack-mountable
- Processor Support: Dual Intel Xeon E5-2600 v3/v4 processors (Haswell/Broadwell)
- Memory: Up to 768GB DDR4 RAM (24 DIMM slots)
- Storage: Flexible SFF/LFF drive configurations
- Network: Integrated quad-port 1GbE or 10GbE FlexibleLOM options
- Management: iLO 4 (Integrated Lights-Out) with remote KVM and virtual media
- Boot Options: UEFI and Legacy BIOS support with extensive network boot capabilities
Documentation Sections
- Network Boot Capabilities - Detailed analysis of PXE, iPXE, and UEFI HTTP boot support
- Hardware Specifications - Complete hardware configuration details
- Configuration Guide - Setup and optimization recommendations
1.4.1 - Configuration Guide
Initial Setup
Hardware Assembly
Install Processors:
- Use thermal paste (HPE thermal grease recommended)
- Align CPU carefully with socket (LGA 2011-3)
- Secure heatsink with proper torque (hand-tighten screws in cross pattern)
- Install both CPUs for dual-socket configuration
Install Memory:
- Populate channels evenly (see Memory Configuration below)
- Seat DIMMs firmly until retention clips engage
- Verify all DIMMs recognized in POST
Install Storage:
- Insert drives into hot-swap caddies
- Label drives clearly for identification
- Configure RAID controller (see Storage Configuration below)
Install Network Cards:
- FlexibleLOM: Slide into dedicated slot until seated
- PCIe cards: Ensure low-profile brackets, secure with screw
- Note MAC addresses for DHCP reservations
Connect Power:
- Install PSUs (both for redundancy)
- Connect power cords
- Verify PSU LEDs indicate proper operation
Initial Power-On:
- Press power button
- Monitor POST on screen or via iLO remote console
- Address any POST errors before proceeding
iLO 4 Initial Configuration
Physical iLO Connection
- Connect Ethernet cable to dedicated iLO port (not FlexibleLOM)
- Default iLO IP: Obtains via DHCP, or use temporary address via RBSU
- Check DHCP server logs for iLO MAC and assigned IP
First Login
- Access iLO web interface: https://<ilo-ip>
- Default credentials:
- Username: Administrator
- Password: On label on server pull-out tab (or rear label)
- Immediately change default password (Administration > Access Settings)
Essential iLO Settings
Network Configuration (Administration > Network):
- Set static IP or DHCP reservation
- Configure DNS servers
- Set hostname (e.g., ilo-dl360-01)
- Enable SNTP time sync
Security (Administration > Security):
- Enforce HTTPS only (disable HTTP)
- Configure SSH key authentication if using CLI
- Set strong password policy
- Enable iLO Security features
Access (Administration > Access Settings):
- Configure iLO username/password for automation
- Create additional user accounts (separation of duties)
- Set session timeout (default: 30 minutes)
Date and Time (Administration > Date and Time):
- Set NTP servers for accurate timestamps
- Configure timezone
Licenses (Administration > Licensing):
- Install iLO Advanced license key (required for full virtual media)
- License can be purchased or acquired from secondary market
iLO Firmware Update
Before production use, update iLO to latest version:
- Download latest iLO 4 firmware from HPE Support Portal
- Administration > Firmware > Update Firmware
- Upload .bin file, apply update
System ROM (BIOS/UEFI) Configuration
Accessing RBSU
- Local: Press F9 during POST
- Remote: iLO Remote Console > Power > Momentary Press > Press F9 when prompted
Boot Mode Selection
System Configuration > BIOS/Platform Configuration (RBSU) > Boot Mode:
UEFI Mode (recommended for modern OS):
- Supports GPT partitions (>2TB disks)
- Required for Secure Boot
- Better UEFI HTTP boot support
- IPv6 PXE boot support
Legacy BIOS Mode:
- For older OS or compatibility
- MBR partition tables only
- Traditional PXE boot
Recommendation: Use UEFI Mode unless legacy compatibility required
Boot Order Configuration
System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > UEFI Boot Order:
Recommended order for network boot deployment:
- Network Boot: FlexibleLOM or PCIe NIC
- Internal Storage: RAID controller or disk
- Virtual Media: iLO virtual CD/DVD (for installation media)
- USB: For rescue/recovery
Enable Network Boot:
- System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > Network Boot
- Set to “Enabled”
Performance and Power Settings
System Configuration > BIOS/Platform Configuration (RBSU) > Power Management:
Power Regulator Mode:
- HP Dynamic Power Savings: Balanced power/performance (recommended for home lab)
- HP Static High Performance: Maximum performance, higher power draw
- HP Static Low Power: Minimize power, reduced performance
- OS Control: Let OS manage (e.g., Linux cpufreq)
Collaborative Power Control: Disabled (for standalone servers)
Minimum Processor Idle Power Core C-State: C6 (lower idle power)
Energy/Performance Bias: Balanced Performance (or Maximum Performance for compute workloads)
Recommendation: Start with “Dynamic Power Savings” and adjust based on workload
Memory Configuration
Optimal Population (dual-CPU configuration):
For maximum performance, populate all channels before adding second DIMM per channel:
64GB (8x 8GB):
- CPU1: Slots 1, 4, 7, 10 and CPU2: Slots 1, 4, 7, 10
- Result: 4 channels per CPU, 1 DIMM per channel
128GB (8x 16GB):
- Same as above with 16GB DIMMs
192GB (12x 16GB):
- CPU1: Slots 1, 4, 7, 10, 2, 5 and CPU2: Slots 1, 4, 7, 10, 2, 5
- Result: 4 channels per CPU, some with 2 DIMMs per channel
768GB (24x 32GB):
- All slots populated
Check Configuration: RBSU > System Information > Memory Information
Processor Options
System Configuration > BIOS/Platform Configuration (RBSU) > Processor Options:
Intel Hyperthreading: Enabled (recommended for most workloads)
- Doubles logical cores (e.g., 12-core CPU shows as 24 cores)
- Benefits most virtualization and multi-threaded workloads
- Disable only for specific security compliance (e.g., some cloud providers)
Intel Virtualization Technology (VT-x): Enabled (required for hypervisors)
Intel VT-d (IOMMU): Enabled (required for PCI passthrough, SR-IOV)
Turbo Boost: Enabled (allows CPU to exceed base clock)
Cores Enabled: All (or reduce to lower power/heat if needed)
Integrated Devices
System Configuration > BIOS/Platform Configuration (RBSU) > System Options > Integrated Devices:
- Embedded SATA Controller: Enabled (if using SATA drives)
- Embedded RAID Controller: Enabled (for Smart Array controllers)
- SR-IOV: Enabled (if using virtual network interfaces with VMs)
Network Controller Options
For each NIC (FlexibleLOM, PCIe):
System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > [Adapter]:
- Network Boot: Enabled (for network boot on that NIC)
- PXE/iSCSI: Select PXE for standard network boot
- Link Speed: Auto-Negotiation (recommended) or force 1G/10G
- IPv4: Enabled (for IPv4 PXE boot)
- IPv6: Enabled (if using IPv6 PXE boot)
Boot Order: Configure which NIC boots first if multiple are enabled
Secure Boot Configuration
System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > Secure Boot:
- Secure Boot: Disabled (for unsigned boot loaders, custom kernels)
- Secure Boot: Enabled (for signed boot loaders, Windows, some Linux distros)
Note: If using PXE with unsigned images (e.g., custom iPXE), Secure Boot must be disabled
Firmware Updates
Update System ROM to latest version:
Via iLO:
- iLO web > Administration > Firmware > Update Firmware
- Upload the System ROM .fwpkg or .bin file
- Server reboots automatically to apply
Via Service Pack for ProLiant (SPP):
- Download SPP ISO from HPE Support Portal
- Mount via iLO Virtual Media
- Boot server from SPP ISO
- Smart Update Manager (SUM) runs in Linux environment
- Select components to update (System ROM, iLO, controller firmware, NIC firmware)
- Apply updates, reboot
Recommendation: Use SPP for comprehensive updates on initial setup, then iLO for individual component updates
Storage Configuration
Smart Array Controller Setup
Access Smart Array Configuration
- During POST: Press F5 when “Smart Array Configuration Utility” message appears
- Via RBSU: System Configuration > BIOS/Platform Configuration (RBSU) > System Options > ROM-Based Setup Utility > Smart Array Configuration
Create RAID Arrays
Delete Existing Arrays (if reconfiguring):
- Select controller > Configuration > Delete Array
- Confirm deletion (data loss warning)
Create New Array:
- Select controller > Configuration > Create Array
- Select physical drives to include
- Choose RAID level:
- RAID 0: Striping, no redundancy (maximum performance, maximum capacity)
- RAID 1: Mirroring (redundancy, half capacity, good for boot drives)
- RAID 5: Striping + parity (redundancy, n-1 capacity, balanced)
- RAID 6: Striping + double parity (dual-drive failure tolerance, n-2 capacity)
- RAID 10: Mirror + stripe (high performance + redundancy, half capacity)
- Configure spare drives (hot spares for automatic rebuild)
- Create logical drive
- Set bootable flag if boot drive
Recommended Configurations:
- Boot/OS: 2x SSD in RAID 1 (redundancy, fast boot)
- Data (performance): 4-6x SSD in RAID 10 (fast, redundant)
- Data (capacity): 4-8x HDD in RAID 6 (capacity, dual-drive tolerance)
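If an OS or live environment is already running, the same arrays can be created non-interactively with HPE's ssacli utility instead of the F5 configuration screen. A minimal sketch, assuming the controller is in slot 0 and the drive bay IDs shown are placeholders:
# List the controller, arrays, and physical drives
ssacli ctrl all show config
# RAID 1 boot volume from two SSDs (drive IDs are examples)
ssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=1
# RAID 6 data volume from four HDDs (drive IDs are examples)
ssacli ctrl slot=0 create type=ld drives=1I:1:3,1I:1:4,1I:1:5,1I:1:6 raid=6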
Controller Settings
Cache Settings:
- Write Cache: Enabled (requires battery/flash-backed cache)
- Read Cache: Enabled
- No-Battery Write Cache: Disabled (data safety) or Enabled (performance, risk)
Rebuild Priority: Medium or High (faster rebuild, may impact performance)
Surface Scan Delay: 3-7 days (periodic integrity check)
HBA Mode (Non-RAID)
For software RAID (ZFS, mdadm, Ceph):
- Access Smart Array Configuration (F5 during POST)
- Controller > Configuration > Enable HBA Mode
- Confirm (RAID arrays will be deleted)
- Reboot
Note: Not all Smart Array controllers support HBA mode. Check compatibility. Alternative: Use separate LSI HBA in PCIe slot.
Network Configuration for Boot
DHCP Server Setup
For PXE/UEFI network boot, configure DHCP server with appropriate options:
ISC DHCP Example (/etc/dhcp/dhcpd.conf):
# Define subnet
subnet 192.168.10.0 netmask 255.255.255.0 {
range 192.168.10.100 192.168.10.200;
option routers 192.168.10.1;
option domain-name-servers 192.168.10.1;
# PXE boot options
next-server 192.168.10.5; # TFTP server IP
# Differentiate UEFI vs BIOS
if exists user-class and option user-class = "iPXE" {
# iPXE boot script
filename "http://boot.example.com/boot.ipxe";
} elsif option arch = 00:07 or option arch = 00:09 {
# UEFI (x86-64)
filename "bootx64.efi";
} else {
# Legacy BIOS
filename "undionly.kpxe";
}
}
# Static reservation for DL360
host dl360-01 {
hardware ethernet xx:xx:xx:xx:xx:xx; # FlexibleLOM MAC
fixed-address 192.168.10.50;
option host-name "dl360-01";
}
FlexibleLOM Configuration
Configure FlexibleLOM NIC for network boot:
- RBSU > Network Options > FlexibleLOM
- Enable “Network Boot”
- Select PXE or iSCSI
- Configure IPv4/IPv6 as needed
- Set as first boot device in boot order
Multi-NIC Boot Priority
If multiple NICs have network boot enabled:
- RBSU > Network Options > Network Boot Order
- Drag/drop to prioritize NIC boot order
- First NIC in list attempts boot first
Recommendation: Enable network boot on one NIC (typically FlexibleLOM port 1) to avoid confusion
Operating System Installation
Traditional Installation (Virtual Media)
- Download OS ISO (e.g., Ubuntu Server, ESXi, Proxmox)
- Upload ISO to HTTP/HTTPS server or local file
- iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
- Browse to ISO location, click “Insert Media”
- Set boot order to prioritize virtual media
- Reboot server, boot from virtual CD/DVD
- Proceed with OS installation
Network Installation (PXE)
See Network Boot Capabilities for detailed PXE/UEFI boot setup
Quick workflow:
- Configure DHCP server with PXE options
- Setup TFTP server with boot files
- Enable network boot in BIOS
- Reboot, server PXE boots
- Select OS installer from PXE menu
- Automated installation proceeds (Kickstart/Preseed/Ignition)
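As a sketch of the last two steps, a legacy-BIOS PXELINUX menu entry that hands off to an automated Debian/Ubuntu installer could look like the following (paths and URLs are placeholders, not part of this guide's infrastructure):
# pxelinux.cfg/default (excerpt)
DEFAULT ubuntu-auto
LABEL ubuntu-auto
  MENU LABEL Ubuntu Server (automated preseed install)
  KERNEL ubuntu-installer/amd64/linux
  APPEND initrd=ubuntu-installer/amd64/initrd.gz auto=true priority=critical url=http://deploy.example.com/preseed.cfg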
Optimization for Specific Workloads
Virtualization (ESXi, Proxmox, Hyper-V)
BIOS Settings:
- Hyperthreading: Enabled
- VT-x: Enabled
- VT-d: Enabled
- Power Management: Dynamic or OS Control
- Turbo Boost: Enabled
Hardware:
- Maximum memory (384GB+ recommended)
- Fast storage (SSD RAID 10 for VM storage)
- 10GbE networking for VM traffic
Configuration:
- Pass through NICs to VMs (SR-IOV or PCI passthrough)
- Use storage controller in HBA mode for direct disk access to VM storage (ZFS, Ceph)
Kubernetes/Container Platforms
BIOS Settings:
- Hyperthreading: Enabled
- VT-x/VT-d: Enabled (for nested virtualization, kata containers)
- Power Management: Dynamic or High Performance
Hardware:
- 128GB+ RAM for multi-tenant workloads
- Fast local NVMe/SSD for container image cache and ephemeral storage
- 10GbE for pod networking
OS Recommendations:
- Talos Linux: Network-bootable, immutable k8s OS
- Flatcar Container Linux: Auto-updating, minimal OS
- Ubuntu Server: Broad compatibility, snap/docker native
Storage Server (NAS, SAN)
BIOS Settings:
- Disable Hyperthreading (slight performance improvement for ZFS)
- VT-d: Enabled (if passing through HBA to VM)
- Power Management: High Performance
Hardware:
- Maximum drive bays (8-10 SFF)
- HBA mode or separate LSI HBA controller
- 10GbE or bonded 1GbE for network storage traffic
- ECC memory (critical for ZFS)
Software:
- TrueNAS SCALE (Linux-based, k8s apps)
- OpenMediaVault (Debian-based, plugins)
- Ubuntu + ZFS (custom setup)
Compute/HPC Workloads
BIOS Settings:
- Hyperthreading: Depends on workload (test both)
- Turbo Boost: Enabled
- Power Management: Maximum Performance
- C-States: Disabled (reduce latency)
Hardware:
- High core count CPUs (E5-2680 v4, 2690 v4)
- Maximum memory bandwidth (populate all channels)
- Fast local scratch storage (NVMe)
Monitoring and Maintenance
iLO Health Monitoring
Information > System Information:
- CPU temperature and status
- Memory status
- Drive status (via controller)
- Fan speeds
- PSU status
- Overall system health LED status
Alerting (Administration > Alerting):
- Configure email alerts for:
- Fan failures
- Temperature warnings
- Drive failures
- Memory errors
- PSU failures
- Set up SNMP traps for integration with monitoring systems (Nagios, Zabbix, Prometheus)
Integrated Management Log (IML)
Information > Integrated Management Log:
- View hardware events and errors
- Filter by severity (Informational, Caution, Critical)
- Export log for troubleshooting
Regular Checks:
- Review IML weekly for early warning signs
- Address caution-level events before they become critical
Firmware Update Cadence
Recommendation:
- iLO: Update quarterly or when security advisories released
- System ROM: Update annually or for bug fixes
- Storage Controller: Update when issues arise or annually
- NIC Firmware: Update when issues arise
Method: Use SPP for annual comprehensive updates, iLO web interface for individual component updates
Physical Maintenance
Monthly:
- Check fan noise (increased noise may indicate clogged air filters or failing fan)
- Verify PSU and drive LEDs (no amber lights)
- Check iLO for alerts
Quarterly:
- Clean air filters (if accessible, depends on rack airflow)
- Verify backup of iLO configuration
- Test iLO Virtual Media functionality
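One way to capture the iLO configuration backup mentioned above is HPE's hponcfg utility from the running OS, assuming the HPE management tools are installed (a sketch):
# Export the current iLO configuration to XML for safekeeping
hponcfg -w /root/ilo-config-$(date +%F).xml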
Annually:
- Update all firmware via SPP
- Verify RAID battery/flash-backed cache status
- Review and update BIOS settings as workload evolves
Troubleshooting Common Issues
Server Won’t Power On
- Check PSU power cords connected
- Verify PSU LEDs indicate power
- Press iLO power button via web interface
- Check iLO IML for power-related errors
- Reseat PSUs, check for blown fuses
POST Errors
Memory Errors:
- Reseat memory DIMMs
- Test with minimal configuration (1 DIMM per CPU)
- Replace failing DIMMs identified in POST
CPU Errors:
- Verify heatsink properly seated
- Check thermal paste application
- Reseat CPU (careful with pins)
Drive Errors:
- Check drive connection to caddy
- Verify controller recognizes drive
- Replace failing drive
No Network Boot
See Network Boot Troubleshooting for detailed diagnostics
Quick checks:
- Verify NIC link light
- Confirm network boot enabled in BIOS
- Check DHCP server logs for PXE request
- Test TFTP server accessibility
iLO Not Accessible
- Check physical Ethernet connection to iLO port
- Verify switch port active
- Reset iLO: Press and hold iLO NMI button (rear) for 5 seconds
- Factory reset iLO via jumper (see maintenance guide)
- Check iLO firmware version, update if outdated
High Fan Noise
- Check ambient temperature (<25°C recommended)
- Verify airflow not blocked (front/rear clearance)
- Clean dust from intake (compressed air)
- Check iLO temperature sensors for elevated temps
- Lower CPU TDP if temperatures excessive (lower power CPUs)
- Verify all fans operational (replace failed fans)
Security Hardening
iLO Security
- Change Default Credentials: Immediately on first boot
- Disable Unused Services: SSH, IPMI if not needed
- Use HTTPS Only: Disable HTTP (Administration > Network > HTTP Port)
- Network Isolation: Dedicated management VLAN, firewall iLO access
- Update Firmware: Apply security patches promptly
- Account Management: Use separate accounts, least privilege
BIOS/UEFI Security
- BIOS Password: Set administrator password (RBSU > System Options > BIOS Admin Password)
- Secure Boot: Enable if using signed boot loaders
- Boot Order Lock: Prevent unauthorized boot device changes
- TPM: Enable if using BitLocker or LUKS disk encryption
Operating System Security
- Minimal Installation: Install only required packages
- Firewall: Enable host firewall (iptables, firewalld, ufw)
- SSH Hardening: Key-based auth, disable password auth, non-standard port
- Automatic Updates: Enable for security patches
- Monitoring: Deploy intrusion detection (fail2ban, OSSEC)
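A minimal sshd drop-in implementing the SSH hardening points above might look like this (the port number is only an example):
# /etc/ssh/sshd_config.d/10-hardening.conf
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
Port 2222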
Conclusion
Proper configuration of the HP ProLiant DL360 Gen9 ensures optimal performance, reliability, and manageability for home lab and production deployments. The combination of UEFI boot capabilities, iLO remote management, and flexible hardware configuration makes the DL360 Gen9 a versatile platform for virtualization, containerization, storage, and compute workloads.
Key takeaways:
- Update firmware early (iLO, System ROM, controllers)
- Configure iLO for remote management and monitoring
- Choose boot mode (UEFI recommended) and configure network boot appropriately
- Optimize BIOS settings for specific workload (virtualization, storage, compute)
- Implement security hardening (iLO, BIOS, OS)
- Establish monitoring and maintenance schedule
For network boot-specific configuration, refer to the Network Boot Capabilities guide.
1.4.2 - Hardware Specifications
System Overview
The HP ProLiant DL360 Gen9 is a dual-socket 1U rack server designed for data center and enterprise deployments, also popular in home lab environments due to its performance and manageability.
Generation: Gen9 (2014-2017 product cycle)
Form Factor: 1U rack-mountable (19-inch standard rack)
Dimensions: 43.46 x 67.31 x 4.29 cm (17.1 x 26.5 x 1.69 in)
Processor Support
Supported CPU Families
The DL360 Gen9 supports Intel Xeon E5-2600 v3 and v4 series processors:
E5-2600 v3 (Haswell-EP): Released Q3 2014
- Process: 22nm
- Cores: 4-18 per socket
- TDP: 55W-145W
- Max Memory Speed: DDR4-2133
E5-2600 v4 (Broadwell-EP): Released Q1 2016
- Process: 14nm
- Cores: 4-22 per socket
- TDP: 55W-145W
- Max Memory Speed: DDR4-2400
Popular CPU Options
Value: E5-2620 v3/v4 (6 cores, 15MB cache, 85W)
Balanced: E5-2650 v3/v4 (10-12 cores, 25-30MB cache, 105W)
Performance: E5-2680 v3/v4 (12-14 cores, 30-35MB cache, 120W)
High Core Count: E5-2699 v4 (22 cores, 55MB cache, 145W)
Configuration Options
- Single Processor: One CPU socket populated (budget option)
- Dual Processor: Both sockets populated (full performance)
Note: Memory and I/O performance scales with processor count. Single-CPU configuration limits memory channels and PCIe lanes.
Memory Architecture
Memory Specifications
- Type: DDR4 RDIMM or LRDIMM
- Speed: DDR4-2133 (v3) or DDR4-2400 (v4)
- Slots: 24 DIMM slots (12 per processor)
- Maximum Capacity:
- 768GB with 32GB RDIMMs
- 1.5TB with 64GB LRDIMMs (v4 processors)
- Minimum: 8GB (1x 8GB DIMM)
Memory Configuration Rules
- Channels per CPU: 4 channels, 3 DIMMs per channel
- Population: Populate channels evenly for optimal bandwidth
- Mixing: Do not mix RDIMM and LRDIMM types
- Speed: All DIMMs run at speed of slowest DIMM
Recommended Configurations
Basic Home Lab (Single CPU):
- 4x 16GB = 64GB (one DIMM per channel on both memory boards)
Standard (Dual CPU):
- 8x 16GB = 128GB (one DIMM per channel)
- 12x 16GB = 192GB (two DIMMs per channel on primary channels)
High Capacity (Dual CPU):
- 24x 32GB = 768GB (all slots populated, RDIMM)
Performance Priority: Populate all channels before adding second DIMM per channel
Storage Options
Drive Bay Configurations
The DL360 Gen9 offers multiple drive bay configurations:
- 8 SFF (2.5-inch): Most common configuration
- 10 SFF: Extended bay version
- 4 LFF (3.5-inch): Less common in 1U form factor
Drive Types Supported
- SAS: 12Gb/s, 6Gb/s (enterprise-grade)
- SATA: 6Gb/s, 3Gb/s (value option)
- SSD: SAS/SATA SSD, NVMe (with appropriate controller)
Storage Controllers
Smart Array Controllers (HPE proprietary RAID):
- P440ar: Entry-level, 2GB FBWC (Flash-Backed Write Cache), RAID 0/1/5/6/10
- P840ar: High-performance, 4GB FBWC, RAID 0/1/5/6/10/50/60
- P440: PCIe card version, 2GB FBWC
- P840: PCIe card version, 4GB FBWC
HBA Mode (non-RAID pass-through):
- Smart Array controllers in HBA mode for software RAID (ZFS, mdadm)
- Limited support; check firmware version
Alternative Controllers:
- LSI/Broadcom HBA controllers in PCIe slots
- H240ar (12Gb/s HBA mode)
Boot Drive Options
For network-focused deployments:
- Minimal Local Storage: 2x SSD in RAID 1 for hypervisor/OS
- USB/SD Boot: iLO supports USB boot, SD card (internal USB)
- Diskless: Pure network boot (subject of network-boot.md)
Network Connectivity
Integrated FlexibleLOM
The DL360 Gen9 includes a FlexibleLOM slot for swappable network adapters:
Common FlexibleLOM Options:
HPE 366FLR: 4x 1GbE (Broadcom BCM5719)
- Most common, good for general use
- Supports PXE, UEFI network boot, SR-IOV
HPE 560FLR-SFP+: 2x 10GbE SFP+ (Intel X710)
- High performance, fiber or DAC
- Supports PXE, UEFI boot, SR-IOV, RDMA (RoCE)
HPE 361i: 2x 1GbE (Intel I350)
- Entry-level, good driver support
PCIe Expansion Slots
Slot Configuration:
- Slot 1: PCIe 3.0 x16 (low-profile)
- Slot 2: PCIe 3.0 x8 (low-profile)
- Slot 3: PCIe 3.0 x8 (low-profile) - optional, depends on riser
Network Card Options:
- Intel X520/X710 (10GbE)
- Mellanox ConnectX-3/ConnectX-4 (10/25/40GbE, InfiniBand)
- Broadcom NetXtreme (1/10/25GbE)
Note: Ensure cards are low-profile for 1U chassis compatibility
Power Supply
PSU Options
- 500W: Single PSU, non-redundant (not recommended)
- 800W: Common, supports dual CPU + moderate expansion
- 1400W: High-power, dual CPU with high TDP + GPUs
- Redundancy: 1+1 redundant hot-plug recommended
Power Configuration
- Platinum Efficiency: 94%+ at 50% load
- Hot-Plug: Replace without powering down
- Auto-Switching: 100-240V AC, 50/60Hz
Home Lab Power Draw (typical):
- Idle (dual E5-2650 v3, 128GB RAM): 100-130W
- Load: 200-350W depending on CPU and drive configuration
Power Management
- HPE Dynamic Power Capping: Limit max power via iLO
- Collaborative Power: Share power budget across chassis in blade environments
- Energy Efficient Ethernet (EEE): Reduce NIC power during low utilization
Cooling and Acoustics
Fan Configuration
- 6x Hot-Plug Fans: Front-mounted, redundant (N+1)
- Variable Speed: Controlled by System ROM based on thermal sensors
- iLO Management: Monitor fan speed, temperature via iLO
Thermal Management
- Temperature Range: 10-35°C (50-95°F) operating
- Altitude: Up to 3,050m (10,000 ft) at reduced temperature
- Airflow: Front-to-back, ensure clear intake and exhaust
Noise Level
- Idle: ~45 dBA (quiet for 1U server)
- Load: 55-70 dBA depending on thermal demand
- Home Lab Consideration: Audible but acceptable in dedicated space; louder than desktop workstation
Noise Reduction:
- Run lower TDP CPUs (e.g., E5-2620 series)
- Maintain ambient temperature <25°C
- Ensure adequate airflow (not in enclosed cabinet without ventilation)
Management - iLO 4
iLO 4 Features
The Integrated Lights-Out 4 (iLO 4) provides out-of-band management:
- Web Interface: HTTPS management console
- Remote Console: HTML5 or Java-based KVM
- Virtual Media: Mount ISOs/images remotely
- Power Control: Power on/off, reset, cold boot
- Monitoring: Sensors, event logs, hardware health
- Alerting: Email alerts, SNMP traps, syslog
- Scripting: RESTful API (Redfish standard)
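For example, basic system inventory and health can be pulled over the Redfish API with a single authenticated request (hostname and credentials are placeholders):
curl -k -u Administrator:password https://ilo-dl360-01/redfish/v1/Systems/1/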
iLO Licensing
- iLO Standard (included): Basic management, remote console
- iLO Advanced (license required):
- Virtual media
- Remote console performance improvements
- Directory integration (LDAP/AD)
- Graphical remote console
- iLO Advanced Premium (license required):
- Insight Remote Support
- Federation
- Jitter smoothing
Home Lab: iLO Advanced license highly recommended for virtual media and full remote console features
iLO Network Configuration
- Dedicated iLO Port: Separate 1GbE management port (recommended)
- Shared LOM: Share FlexibleLOM port with OS (not recommended for isolation)
Security: Isolate iLO on dedicated management VLAN, disable if not needed
BIOS and Firmware
System ROM (BIOS/UEFI)
- Firmware Type: UEFI 2.31 or later
- Boot Modes: UEFI, Legacy BIOS, or hybrid
- Configuration: RBSU (ROM-Based Setup Utility) accessible via F9
Firmware Update Methods
- Service Pack for ProLiant (SPP): Comprehensive bundle of all firmware
- iLO Online Flash: Update via web interface
- Online ROM Flash: Linux utility for online updates
- USB Flash: Boot from USB with firmware update utility
Recommended Practice: Update to latest SPP for security patches and feature improvements
Secure Boot
- UEFI Secure Boot: Supported, validates boot loader signatures
- TPM: Optional Trusted Platform Module 1.2 or 2.0
- Boot Order Protection: Prevent unauthorized boot device changes
Expansion and Modularity
GPU Support
Limited GPU support due to 1U form factor and power constraints:
- Low-Profile GPUs: Nvidia T4, AMD Instinct MI25 (may require custom cooling)
- Power: Consider 1400W PSU for high-power GPUs
- Not Ideal: For GPU-heavy workloads, consider 2U+ servers (e.g., DL380 Gen9)
USB Ports
- Front: 1x USB 3.0
- Rear: 2x USB 3.0
- Internal: 1x USB 2.0 (for SD/USB boot device)
Serial Port
- Rear serial port for legacy console access
- Useful for network equipment serial console, debug
Home Lab Considerations
Pros for Home Lab
- Density: 1U form factor saves rack space
- iLO Management: Enterprise remote management without KVM
- Network Boot: Excellent PXE/UEFI boot support (see network-boot.md)
- Serviceability: Hot-swap drives, PSU, fans
- Documentation: Extensive HPE documentation and community support
- Parts Availability: Common on secondary market, affordable
Cons for Home Lab
- Noise: Louder than tower servers or workstations
- Power: Higher idle power than consumer hardware (100-130W idle)
- 1U Limitations: Limited GPU, PCIe expansion vs 2U/4U chassis
- Firmware: Requires HPE account for SPP downloads (free but registration required)
Recommended Home Lab Configuration
Budget (~$500-800 used):
- Dual E5-2620 v3 or v4 (6 cores each, 85W TDP)
- 128GB RAM (8x 16GB DDR4)
- 2x SSD (boot), 4-6x HDD/SSD (data)
- HPE 366FLR (4x 1GbE)
- Dual 500W or 800W PSU (redundant)
- iLO Advanced license
Performance (~$1000-1500 used):
- Dual E5-2680 v4 (14 cores each, 120W TDP)
- 256GB RAM (16x 16GB DDR4)
- 2x NVMe SSD (boot/cache), 6-8x SSD (data)
- HPE 560FLR-SFP+ (2x 10GbE) + PCIe 4x1GbE card
- Dual 800W PSU
- iLO Advanced license
Comparison with Other Generations
vs Gen8 (Previous)
Gen9 Advantages:
- DDR4 vs DDR3 (lower power, higher capacity)
- Better UEFI support and HTTP boot
- Newer processor architecture (Haswell/Broadwell vs Sandy Bridge/Ivy Bridge)
- iLO 4 vs iLO 3 (better HTML5 console)
Gen8 Advantages:
- Lower cost on secondary market
- Adequate for light workloads
vs Gen10 (Next)
Gen10 Advantages:
- Newer CPUs (Skylake-SP/Cascade Lake)
- More PCIe lanes
- Better UEFI firmware and security features
- DDR4-2666/2933 support
Gen9 Advantages:
- Lower cost (mature product cycle)
- Excellent value for performance/dollar
- Still well-supported by modern OS and firmware
Technical Resources
- QuickSpecs: HPE ProLiant DL360 Gen9 Server QuickSpecs
- User Guide: HPE ProLiant DL360 Gen9 Server User Guide
- Maintenance and Service Guide: Detailed disassembly and part replacement
- Firmware Downloads: HPE Support Portal (requires free account)
Summary
The HP ProLiant DL360 Gen9 remains an excellent choice for home labs and small deployments in 2024-2025. Its balance of performance (dual Xeon v4, 768GB RAM capacity), manageability (iLO 4), and network boot capabilities make it particularly well-suited for virtualization, container hosting, and infrastructure automation workflows. While not the latest generation, it offers strong value with robust firmware support and wide secondary market availability.
Best For:
- Virtualization hosts (ESXi, Proxmox, Hyper-V)
- Kubernetes/container platforms
- Network boot/diskless deployments
- Storage servers (with appropriate controller)
- General compute workloads
Avoid For:
- GPU-intensive workloads (1U constraints)
- Noise-sensitive environments (unless isolated)
- Extreme low-power requirements (100W+ idle)
1.4.3 - Network Boot Capabilities
Overview
The HP ProLiant DL360 Gen9 provides robust network boot capabilities through multiple protocols and firmware interfaces. This makes it particularly well-suited for diskless deployments, automated provisioning, and infrastructure-as-code workflows.
Supported Network Boot Protocols
PXE (Preboot Execution Environment)
The DL360 Gen9 fully supports PXE boot via both legacy BIOS and UEFI firmware modes:
Legacy BIOS PXE: Traditional PXE implementation using TFTP
- Protocol: PXEv2 (PXE 2.1)
- Network Stack: IPv4 only in legacy mode
- Boot files: pxelinux.0, undionly.kpxe, or a custom NBP
- DHCP options: Standard options 66 (TFTP server) and 67 (boot filename)
UEFI PXE: Modern UEFI network boot implementation
- Protocol: PXEv2 with UEFI extensions
- Network Stack: IPv4 and IPv6 support
- Boot files: bootx64.efi, grubx64.efi, shimx64.efi
- Architecture: x64 (EFI BC)
- DHCP Architecture ID: 0x0007 (EFI BC) or 0x0009 (EFI x86-64)
iPXE Support
The DL360 Gen9 can boot iPXE, enabling advanced features:
- Chainloading: Boot standard PXE, then chainload iPXE for enhanced capabilities
- HTTP/HTTPS Boot: Download kernels and images over HTTP(S) instead of TFTP
- SAN Boot: iSCSI and AoE (ATA over Ethernet) support
- Scripting: Conditional boot logic and dynamic configuration
- Embedded Scripts: iPXE can be compiled with embedded boot scripts
Implementation Methods:
- Chainload from standard PXE: DHCP points to undionly.kpxe or ipxe.efi
- Flash iPXE to the FlexibleLOM option ROM (advanced, requires care)
- Boot iPXE from USB, then continue network boot
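To illustrate the scripting capability, a minimal iPXE script served at the URL referenced by DHCP might look like this (hostnames and paths are placeholders):
#!ipxe
dhcp
kernel http://boot.example.com/images/vmlinuz console=tty0
initrd http://boot.example.com/images/initramfs.img
boot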
UEFI HTTP Boot
Native UEFI HTTP boot is supported on Gen9 servers with recent firmware:
- Protocol: RFC 7230 HTTP/1.1
- Requirements:
- UEFI firmware version 2.40 or later (check via iLO)
- DHCP option 60 (vendor class identifier) = “HTTPClient”
- DHCP option 67 pointing to HTTP(S) URL
- Advantages:
- No TFTP server required
- Faster transfers than TFTP
- Support for HTTPS with certificate validation
- Better suited for large images (kernels, initramfs)
- Limitations:
- UEFI mode only (not available in legacy BIOS)
- Requires DHCP server with HTTP URL support
HTTP(S) Boot Configuration
For UEFI HTTP boot on DL360 Gen9:
# Example ISC DHCP configuration for UEFI HTTP boot
class "httpclients" {
match if substring(option vendor-class-identifier, 0, 10) = "HTTPClient";
}
pool {
allow members of "httpclients";
option vendor-class-identifier "HTTPClient";
# Point to HTTP boot URI
filename "http://boot.example.com/boot/efi/bootx64.efi";
}
Network Interface Options
The DL360 Gen9 supports multiple network adapter configurations for boot:
FlexibleLOM (LOM = LAN on Motherboard)
HPE FlexibleLOM slot supports:
- HPE 366FLR: Quad-port 1GbE (Broadcom BCM5719)
- HPE 560FLR-SFP+: Dual-port 10GbE (Intel X710)
- HPE 361i: Dual-port 1GbE (Intel I350)
All FlexibleLOM adapters support PXE and UEFI network boot. The option ROM can be configured via BIOS/UEFI settings.
PCIe Network Adapters
Standard PCIe network cards with PXE/UEFI boot ROM support:
- Intel X520, X710 series (10GbE)
- Broadcom NetXtreme series
- Mellanox ConnectX-3/4 (with appropriate firmware)
Boot Priority: Configure via System ROM > Network Boot Options to select which NIC boots first.
Firmware Configuration
Accessing Boot Configuration
- RBSU (ROM-Based Setup Utility): Press F9 during POST
- iLO 4 Remote Console: Access via network, then virtual F9
- UEFI System Utilities: Modern interface for UEFI firmware settings
Key Settings
Navigate to: System Configuration > BIOS/Platform Configuration (RBSU) > Network Boot Options
- Network Boot: Enable/Disable
- Boot Mode: UEFI or Legacy BIOS
- IPv4/IPv6: Enable protocol support
- Boot Retry: Number of attempts before falling back to next boot device
- Boot Order: Prioritize network boot in boot sequence
Per-NIC Configuration
In RBSU > Network Options:
- Option ROM: Enable/Disable per adapter
- Link Speed: Force speed/duplex or auto-negotiate
- VLAN: VLAN tagging for boot (if supported by DHCP/PXE environment)
- PXE Menu: Enable interactive PXE menu (Ctrl+S during PXE boot)
iLO 4 Integration
The DL360 Gen9’s iLO 4 provides additional network boot features:
Virtual Media Network Boot
- Mount ISO images remotely via iLO Virtual Media
- Boot from network-attached ISO without physical media
- Useful for OS installation or diagnostics
Workflow:
- Upload ISO to HTTP/HTTPS server or use SMB/NFS share
- iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
- Set boot order to prioritize virtual optical drive
- Reboot server
Scripted Deployment via iLO
iLO 4 RESTful API allows:
- Setting one-time boot to network via API call
- Automating PXE boot for provisioning pipelines
- Integration with tools like Terraform, Ansible
Example using iLO RESTful API:
curl -k -u admin:password -X PATCH \
https://ilo-hostname/redfish/v1/Systems/1/ \
-d '{"Boot":{"BootSourceOverrideTarget":"Pxe","BootSourceOverrideEnabled":"Once"}}'
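The override only takes effect on the next restart, so a provisioning pipeline would typically follow it with a reset through the same API (a sketch using the standard Redfish reset action; verify the path against your iLO firmware's Redfish implementation):
curl -k -u admin:password -X POST \
  -H "Content-Type: application/json" \
  https://ilo-hostname/redfish/v1/Systems/1/Actions/ComputerSystem.Reset/ \
  -d '{"ResetType":"ForceRestart"}'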
Boot Process Flow
Legacy BIOS PXE Boot
- Server powers on, initializes NICs
- NIC sends DHCPDISCOVER with PXE vendor options
- DHCP server responds with IP, TFTP server (option 66), boot file (option 67)
- NIC downloads NBP (Network Bootstrap Program) via TFTP
- NBP executes (e.g., pxelinux.0 loads syslinux menu)
- User selects boot target or automated script continues
- Kernel and initramfs download and boot
UEFI PXE Boot
- UEFI firmware initializes network stack
- UEFI PXE driver sends DHCPv4/v6 DISCOVER
- DHCP responds with boot file (e.g., bootx64.efi)
- UEFI downloads boot file via TFTP
- UEFI loads and executes boot loader (GRUB2, systemd-boot, iPXE)
- Boot loader may download additional files (kernel, initrd, config)
- OS boots
UEFI HTTP Boot
- UEFI firmware with HTTP Boot support enabled
- DHCP request includes “HTTPClient” vendor class
- DHCP responds with HTTP(S) URL in option 67
- UEFI HTTP client downloads boot file over HTTP(S)
- Execution continues as with UEFI PXE
Performance Considerations
TFTP vs HTTP
- TFTP: Slow for large files (typical: 1-5 MB/s)
- Use for small boot loaders only
- Chainload to iPXE or HTTP boot for better performance
- HTTP: 10-100x faster depending on network and server
- Recommended for kernels, initramfs, live OS images
- iPXE or UEFI HTTP boot required
Network Speed Impact
DL360 Gen9 boot performance by NIC speed:
- 1GbE: Adequate for most PXE deployments (100-125 MB/s theoretical max)
- 10GbE: Significant improvement for large image downloads (~1.25 GB/s theoretical max)
- Bonding/Teaming: Not typically used for boot (single NIC boots)
Recommendation: For production diskless nodes or frequent re-provisioning, 10GbE with HTTP boot provides best performance.
Common Use Cases
1. Automated OS Provisioning
Boot into installer via PXE:
- Kickstart (RHEL/CentOS/Rocky)
- Preseed (Debian/Ubuntu)
- Ignition (Fedora CoreOS, Flatcar)
2. Diskless Boot
Boot OS entirely from network/RAM:
- Network root: NFS or iSCSI root filesystem
- Overlay: Persistent storage via network overlay
- Stateless: Boot identical image, no local state
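As a sketch of the NFS-root variant, the PXE profile's kernel arguments might look like this (the server address and export path are placeholders):
# Kernel command line for an NFS-root diskless node
root=/dev/nfs nfsroot=192.168.10.5:/exports/dl360-root ip=dhcp rw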
3. Rescue and Diagnostics
Boot live environments:
- SystemRescue
- Clonezilla
- Memtest86+
- Hardware diagnostics (HPE Service Pack for ProLiant)
4. Kubernetes/Container Hosts
PXE boot immutable OS images:
- Talos Linux: API-driven, diskless k8s nodes
- Flatcar Container Linux: Automated updates
- k3OS: Lightweight k8s OS
Troubleshooting
PXE Boot Fails
Symptoms: “PXE-E51: No DHCP or proxy DHCP offers received” or timeout
Checks:
- Verify NIC link light and switch port status
- Confirm DHCP server is responding (check DHCP logs)
- Ensure DHCP options 66 and 67 are set correctly
- Test TFTP server accessibility (tftp -i <server> GET <file>)
- Check BIOS/UEFI network boot is enabled
- Verify boot order prioritizes network boot
- Disable Secure Boot if using unsigned boot files
UEFI Network Boot Not Available
Symptoms: Network boot option missing in UEFI boot menu
Resolution:
- Enter RBSU (F9), navigate to Network Options
- Ensure at least one NIC has “Option ROM” enabled
- Verify Boot Mode is set to UEFI (not Legacy)
- Update System ROM to latest version if option is missing
- Some FlexibleLOM cards require firmware update for UEFI boot support
HTTP Boot Fails
Symptoms: UEFI HTTP boot option present but fails to download
Checks:
- Verify firmware version supports HTTP boot (>=2.40)
- Ensure DHCP option 67 contains valid HTTP(S) URL
- Test URL accessibility from another client
- Check DNS resolution if using hostname in URL
- For HTTPS: Verify certificate is trusted (or disable cert validation in test)
Slow PXE Boot
Symptoms: Boot process takes minutes instead of seconds
Optimizations:
- Switch from TFTP to HTTP (chainload iPXE or use UEFI HTTP boot)
- Increase TFTP server block size (e.g., tftp-hpa --blocksize 1468)
- Tune DHCP response times (reduce lease query delays)
- Use local network segment for boot server (avoid WAN/VPN)
- Enable NIC interrupt coalescing in BIOS for 10GbE
Security Considerations
Secure Boot
DL360 Gen9 supports UEFI Secure Boot:
- Validates signed boot loaders (shim, GRUB, kernel)
- Prevents unsigned code execution during boot
- Required for some compliance scenarios
Configuration: RBSU > Boot Options > Secure Boot = Enabled
Implications for Network Boot:
- Must use signed boot loaders (e.g., shim.efi signed by Microsoft/vendor)
- Custom kernels require signing or disabling Secure Boot
- iPXE must be signed or chainloaded from signed shim
Network Security
Risks:
- PXE/TFTP is unencrypted and unauthenticated
- Attacker on network can serve malicious boot images
- DHCP spoofing can redirect to malicious boot server
Mitigations:
- Network Segmentation: Isolate PXE boot to management VLAN
- DHCP Snooping: Prevent rogue DHCP servers on switch
- HTTPS Boot: Use UEFI HTTP boot with TLS and certificate validation
- iPXE with HTTPS: Chainload iPXE, then use HTTPS for all downloads
- Signed Images: Use Secure Boot with signed boot chain
- 802.1X: Require network authentication before DHCP (complex for PXE)
iLO Security
- Change default iLO password immediately
- Use TLS for iLO web interface and API
- Restrict iLO network access (firewall, separate VLAN)
- Disable iLO Virtual Media if not needed
- Leave the iLO Security Override switch disabled; it bypasses iLO authentication and is intended only for recovery
Firmware and Driver Resources
Required Firmware Versions
For optimal network boot support:
- System ROM: v2.60 or later (latest recommended)
- iLO 4 Firmware: v2.80 or later
- NIC Firmware: Latest for specific FlexibleLOM/PCIe card
Check current versions: iLO web interface > Information > Firmware Information
Updating Firmware
Methods:
HPE Service Pack for ProLiant (SPP): Comprehensive update bundle
- Boot from SPP ISO (via iLO Virtual Media or USB)
- Runs Smart Update Manager (SUM) in Linux environment
- Updates all firmware, drivers, system ROM automatically
iLO Web Interface: Individual component updates
- System ROM: Administration > Firmware > Update Firmware
- Upload .fwpkg or .bin files from HPE support site
Online Flash Component: Linux Online ROM Flash utility
- Install hp-firmware-* packages
- Run updates while the OS is running (a reboot is required to apply)
Download Source: https://support.hpe.com/connect/s/product?language=en_US&kmpmoid=1010026910 (requires HPE Passport account, free registration)
Best Practices
- Use UEFI Mode: Better security, IPv6 support, larger disk support
- Enable HTTP Boot: Faster and more reliable than TFTP for large files
- Chainload iPXE: Flexibility of iPXE with standard PXE infrastructure
- Update Firmware: Keep System ROM and iLO current for bug fixes and features
- Isolate Boot Network: Use dedicated management VLAN for PXE/provisioning
- Test Failover: Configure multiple DHCP servers and boot mirrors for redundancy
- Document Configuration: Record BIOS settings, DHCP config, and boot infrastructure
- Monitor iLO Logs: Track boot failures and hardware issues via iLO event log
References
- HPE ProLiant DL360 Gen9 Server User Guide
- HPE UEFI System Utilities User Guide
- iLO 4 User Guide (firmware version 2.80)
- Intel PXE Specification v2.1
- UEFI Specification v2.8 (HTTP Boot)
- iPXE Documentation: https://ipxe.org/
Conclusion
The HP ProLiant DL360 Gen9 provides enterprise-grade network boot capabilities suitable for both traditional PXE deployments and modern UEFI HTTP boot scenarios. Its flexible configuration options, mature firmware support, and iLO integration make it an excellent platform for automated provisioning, diskless computing, and infrastructure-as-code workflows in home lab environments.
For home lab use, the recommended configuration is:
- UEFI boot mode with Secure Boot disabled (unless required)
- iPXE chainloading for flexibility and HTTP performance
- iLO 4 configured for remote management and scripted provisioning
- Latest firmware for stability and feature support
1.5 - Matchbox Analysis
Matchbox Network Boot Analysis
This section contains a comprehensive analysis of Matchbox, a network boot service for provisioning bare-metal machines.
Overview
Matchbox is an HTTP and gRPC service developed by Poseidon that automates bare-metal machine provisioning through network booting. It matches machines to configuration profiles based on hardware attributes and serves boot configurations, kernel images, and provisioning configs.
Primary Repository: poseidon/matchbox
Documentation: https://matchbox.psdn.io/
License: Apache 2.0
Key Features
- Network Boot Support: iPXE, PXELINUX, GRUB2 chainloading
- OS Provisioning: Fedora CoreOS, Flatcar Linux, RHEL CoreOS
- Configuration Management: Ignition v3.x configs, Butane transpilation
- Machine Matching: Label-based matching (MAC, UUID, hostname, serial, custom)
- API: Read-only HTTP API + authenticated gRPC API
- Asset Serving: Local caching of OS images for faster deployment
- Templating: Go template support for dynamic configuration
Use Cases
- Bare-metal Kubernetes clusters - Provision CoreOS nodes for k8s
- Lab/development environments - Quick PXE boot for testing
- Datacenter provisioning - Automate OS installation across fleets
- Immutable infrastructure - Declarative machine provisioning via Terraform
Analysis Contents
- Network Boot Architecture - Deep dive into PXE, iPXE, and GRUB support
- Configuration Model - Profiles, Groups, and templating system
- Deployment Patterns - Installation options and operational considerations
Quick Architecture
┌─────────────┐
│ Machine │ PXE Boot
│ (BIOS/UEFI)│───┐
└─────────────┘ │
│
┌─────────────┐ │ DHCP/TFTP
│ dnsmasq │◄──┘ (chainload to iPXE)
│ DHCP+TFTP │
└─────────────┘
│
│ HTTP
▼
┌─────────────────────────┐
│ Matchbox │
│ ┌──────────────────┐ │
│ │ HTTP Endpoints │ │ /boot.ipxe, /ignition
│ └──────────────────┘ │
│ ┌──────────────────┐ │
│ │ gRPC API │ │ Terraform provider
│ └──────────────────┘ │
│ ┌──────────────────┐ │
│ │ Profile/Group │ │ Match machines
│ │ Matcher │ │ to configs
│ └──────────────────┘ │
└─────────────────────────┘
Technology Stack
- Language: Go
- Config Formats: Ignition JSON, Butane YAML
- Boot Protocols: PXE, iPXE, GRUB2
- APIs: HTTP (read-only), gRPC (authenticated)
- Deployment: Binary, container (Podman/Docker), Kubernetes
Integration Points
- Terraform: terraform-provider-matchbox for declarative provisioning
- Ignition/Butane: CoreOS provisioning configs
- dnsmasq: Reference DHCP/TFTP/DNS implementation (quay.io/poseidon/dnsmasq)
- Asset sources: Can serve local or remote (HTTPS) OS images
1.5.1 - Configuration Model
Matchbox Configuration Model
Matchbox uses a flexible configuration model based on Profiles (what to provision) and Groups (which machines get which profile), with support for templating and metadata.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ Matchbox Store │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Profiles │ │ Groups │ │ Assets │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Matcher Engine │ │
│ │ (Label-based group selection) │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Template Renderer │ │
│ │ (Go templates + metadata) │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
Rendered Config (iPXE, Ignition, etc.)
Data Directory Structure
Matchbox uses a FileStore (default) that reads from -data-path (default: /var/lib/matchbox):
/var/lib/matchbox/
├── groups/ # Machine group definitions (JSON)
│ ├── default.json
│ ├── node1.json
│ └── us-west.json
├── profiles/ # Profile definitions (JSON)
│ ├── worker.json
│ ├── controller.json
│ └── etcd.json
├── ignition/ # Ignition configs (.ign) or Butane (.yaml)
│ ├── worker.ign
│ ├── controller.ign
│ └── butane-example.yaml
├── cloud/ # Cloud-Config templates (DEPRECATED)
│ └── legacy.yaml.tmpl
├── generic/ # Arbitrary config templates
│ ├── setup.cfg
│ └── metadata.yaml.tmpl
└── assets/ # Static files (kernel, initrd)
├── fedora-coreos/
└── flatcar/
Version control: Poseidon recommends keeping /var/lib/matchbox under git for auditability and rollback.
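For example:
cd /var/lib/matchbox
git init
git add groups profiles ignition
git commit -m "Initial matchbox configuration"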
Profiles
Profiles define what to provision: network boot settings (kernel, initrd, args) and config references (Ignition, Cloud-Config, generic).
Profile Schema
{
"id": "worker",
"name": "Fedora CoreOS Worker Node",
"boot": {
"kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
"initrd": [
"--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"
],
"args": [
"initrd=main",
"coreos.live.rootfs_url=http://matchbox.example.com:8080/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img",
"coreos.inst.install_dev=/dev/sda",
"coreos.inst.ignition_url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
]
},
"ignition_id": "worker.ign",
"cloud_id": "",
"generic_id": ""
}
Profile Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✅ | Unique profile identifier (referenced by groups) |
| name | string | ❌ | Human-readable description |
| boot | object | ❌ | Network boot configuration |
| boot.kernel | string | ❌ | Kernel URL (HTTP/HTTPS or /assets path) |
| boot.initrd | array | ❌ | Initrd URLs (can specify --name for multi-initrd) |
| boot.args | array | ❌ | Kernel command-line arguments |
| ignition_id | string | ❌ | Ignition/Butane config filename in ignition/ |
| cloud_id | string | ❌ | Cloud-Config filename in cloud/ (deprecated) |
| generic_id | string | ❌ | Generic config filename in generic/ |
Boot Configuration Patterns
Pattern 1: Live PXE (RAM-based, ephemeral)
Boot and run OS entirely from RAM, no disk install:
{
"boot": {
"kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
"initrd": [
"--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
],
"args": [
"initrd=main",
"coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
"ignition.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
]
}
}
Use case: Diskless workers, testing, ephemeral compute
Pattern 2: Disk Install (persistent)
PXE boot live image, install to disk, reboot to disk:
{
"boot": {
"kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
"initrd": [
"--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
],
"args": [
"initrd=main",
"coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
"coreos.inst.install_dev=/dev/sda",
"coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
]
}
}
Key difference: coreos.inst.install_dev triggers disk install before reboot
Pattern 3: Multi-initrd (layered)
Multiple initrds can be loaded (e.g., base + drivers):
{
"initrd": [
"--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img",
"--name drivers /assets/drivers/custom-drivers.img"
],
"args": [
"initrd=main,drivers",
"..."
]
}
Config References
Ignition Configs
Direct Ignition (.ign files):
{
"ignition_id": "worker.ign"
}
File: /var/lib/matchbox/ignition/worker.ign
{
"ignition": { "version": "3.3.0" },
"systemd": {
"units": [{
"name": "example.service",
"enabled": true,
"contents": "[Service]\nType=oneshot\nExecStart=/usr/bin/echo Hello\n\n[Install]\nWantedBy=multi-user.target"
}]
}
}
Butane Configs (transpiled to Ignition):
{
"ignition_id": "worker.yaml"
}
File: /var/lib/matchbox/ignition/worker.yaml
variant: fcos
version: 1.5.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...
systemd:
  units:
    - name: etcd.service
      enabled: true
Matchbox automatically:
- Detects Butane format (file doesn’t end in .ign or .ignition)
- Transpiles Butane → Ignition using the embedded library
- Renders templates with group metadata
- Serves as Ignition v3.3.0
Generic Configs
For non-Ignition configs (scripts, YAML, arbitrary data):
{
"generic_id": "setup-script.sh.tmpl"
}
File: /var/lib/matchbox/generic/setup-script.sh.tmpl
#!/bin/bash
# Rendered with group metadata
NODE_NAME={{.node_name}}
CLUSTER_ID={{.cluster_id}}
echo "Provisioning ${NODE_NAME} in cluster ${CLUSTER_ID}"
Access via: GET /generic?uuid=...&mac=...
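For example, the rendered script can be fetched directly to verify templating (the MAC address is an example):
curl 'http://matchbox.example.com:8080/generic?mac=52:54:00:89:d8:10'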
Groups
Groups match machines to profiles using selectors (label matching) and provide metadata for template rendering.
Group Schema
{
"id": "node1-worker",
"name": "Worker Node 1",
"profile": "worker",
"selector": {
"mac": "52:54:00:89:d8:10",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
},
"metadata": {
"node_name": "worker-01",
"cluster_id": "prod-cluster",
"etcd_endpoints": "https://10.0.1.10:2379,https://10.0.1.11:2379",
"ssh_authorized_keys": [
"ssh-ed25519 AAAA...",
"ssh-rsa AAAA..."
]
}
}
Group Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✅ | Unique group identifier |
| name | string | ❌ | Human-readable description |
| profile | string | ✅ | Profile ID to apply |
| selector | object | ❌ | Label match criteria (omit for default group) |
| metadata | object | ❌ | Key-value data for template rendering |
Selector Matching
Reserved selectors (automatically populated from machine attributes):
| Selector | Source | Example | Normalized |
|---|---|---|---|
| uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase |
| mac | Primary NIC MAC | 52:54:00:89:d8:10 | Colon-separated |
| hostname | Network hostname | node1.example.com | As reported |
| serial | Hardware serial | VMware-42 1a... | As reported |
Custom selectors (passed as query params):
{
"selector": {
"region": "us-west",
"environment": "production",
"rack": "A23"
}
}
Matching request: /ipxe?mac=52:54:00:89:d8:10&region=us-west&environment=production&rack=A23
Matching logic:
- All selector key-value pairs must match request labels (AND logic)
- Most specific group wins (most selector matches)
- If multiple groups have same specificity, first match wins (undefined order)
- Groups with no selectors = default group (matches all)
Default Groups
Group with empty selector matches all machines:
{
"id": "default-worker",
"name": "Default Worker",
"profile": "worker",
"metadata": {
"environment": "dev"
}
}
⚠️ Warning: Avoid multiple default groups (non-deterministic matching)
Example: Region-based Matching
Group 1: US-West Workers
{
"id": "us-west-workers",
"profile": "worker",
"selector": {
"region": "us-west"
},
"metadata": {
"etcd_endpoints": "https://etcd-usw.example.com:2379"
}
}
Group 2: EU Workers
{
"id": "eu-workers",
"profile": "worker",
"selector": {
"region": "eu"
},
"metadata": {
"etcd_endpoints": "https://etcd-eu.example.com:2379"
}
}
Group 3: Specific Machine Override
{
"id": "node-special",
"profile": "controller",
"selector": {
"mac": "52:54:00:89:d8:10",
"region": "us-west"
},
"metadata": {
"role": "controller"
}
}
Matching precedence:
- Machine with mac=52:54:00:89:d8:10&region=us-west → node-special (2 selectors)
- Machine with region=us-west → us-west-workers (1 selector)
- Machine with region=eu → eu-workers (1 selector)
Templating System
Matchbox uses Go’s text/template for rendering configs with group metadata.
Template Context
Available variables in Ignition/Butane/Cloud-Config/generic templates:
// Group metadata (all keys from group.metadata)
{{.node_name}}
{{.cluster_id}}
{{.etcd_endpoints}}
// Group selectors (normalized)
{{.mac}} // e.g., "52:54:00:89:d8:10"
{{.uuid}} // e.g., "550e8400-..."
{{.region}} // Custom selector
// Request query params (raw)
{{.request.query.mac}} // As passed in URL
{{.request.query.foo}} // Custom query param
{{.request.raw_query}} // Full query string
// Special functions
{{if index . "ssh_authorized_keys"}} // Check if key exists
{{range $element := .ssh_authorized_keys}} // Iterate arrays
Example: Templated Butane Config
Group metadata:
{
"metadata": {
"node_name": "worker-01",
"ssh_authorized_keys": [
"ssh-ed25519 AAA...",
"ssh-rsa BBB..."
],
"ntp_servers": ["time1.google.com", "time2.google.com"]
}
}
Butane template: /var/lib/matchbox/ignition/worker.yaml
variant: fcos
version: 1.5.0
storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: {{.node_name}}
    - path: /etc/systemd/timesyncd.conf
      mode: 0644
      contents:
        inline: |
          [Time]
          {{range $server := .ntp_servers}}
          NTP={{$server}}
          {{end}}
{{if index . "ssh_authorized_keys"}}
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        {{range $key := .ssh_authorized_keys}}
        - {{$key}}
        {{end}}
{{end}}
Rendered Ignition (simplified):
{
"ignition": {"version": "3.3.0"},
"storage": {
"files": [
{
"path": "/etc/hostname",
"contents": {"source": "data:,worker-01"},
"mode": 420
},
{
"path": "/etc/systemd/timesyncd.conf",
"contents": {"source": "data:,%5BTime%5D%0ANTP%3Dtime1.google.com%0ANTP%3Dtime2.google.com"},
"mode": 420
}
]
},
"passwd": {
"users": [{
"name": "core",
"sshAuthorizedKeys": ["ssh-ed25519 AAA...", "ssh-rsa BBB..."]
}]
}
}
Template Best Practices
- Prefer external rendering: Use the Terraform ct_config provider for complex templates
- Validate Butane: Use strict = true in Terraform or fcct --strict
- Escape carefully: Go templates use {{ }}, Butane uses YAML - mind the interaction
- Test rendering: Request /ignition?mac=... directly to inspect the output
- Version control: Keep templates + groups in git for auditability
Reserved Metadata Keys
Warning: .request is reserved for query param access. Group metadata with "request": {...} will be overwritten.
Reserved keys:
- request.query.* - Query parameters
- request.raw_query - Raw query string
API Integration
HTTP Endpoints (Read-only)
| Endpoint | Purpose | Template Context |
|---|---|---|
| /ipxe | iPXE boot script | Profile boot section |
| /grub | GRUB config | Profile boot section |
| /ignition | Ignition config | Group metadata + selectors + query |
| /cloud | Cloud-Config (deprecated) | Group metadata + selectors + query |
| /generic | Generic config | Group metadata + selectors + query |
| /metadata | Key-value env format | Group metadata + selectors + query |
Example metadata endpoint response:
GET /metadata?mac=52:54:00:89:d8:10&foo=bar
NODE_NAME=worker-01
CLUSTER_ID=prod
MAC=52:54:00:89:d8:10
REQUEST_QUERY_MAC=52:54:00:89:d8:10
REQUEST_QUERY_FOO=bar
REQUEST_RAW_QUERY=mac=52:54:00:89:d8:10&foo=bar
gRPC API (Authenticated, mutable)
Used by terraform-provider-matchbox for declarative infrastructure:
Terraform example:
provider "matchbox" {
endpoint = "matchbox.example.com:8081"
client_cert = file("~/.matchbox/client.crt")
client_key = file("~/.matchbox/client.key")
ca = file("~/.matchbox/ca.crt")
}
resource "matchbox_profile" "worker" {
name = "worker"
kernel = "/assets/fedora-coreos/.../kernel"
initrd = ["--name main /assets/fedora-coreos/.../initramfs.img"]
args = [
"initrd=main",
"coreos.inst.install_dev=/dev/sda",
"coreos.inst.ignition_url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}"
]
raw_ignition = data.ct_config.worker.rendered
}
resource "matchbox_group" "node1" {
name = "node1"
profile = matchbox_profile.worker.name
selector = {
mac = "52:54:00:89:d8:10"
}
metadata = {
node_name = "worker-01"
}
}
Operations:
- CreateProfile, GetProfile, UpdateProfile, DeleteProfile
- CreateGroup, GetGroup, UpdateGroup, DeleteGroup
TLS client authentication required (see deployment docs)
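For reference, a Matchbox instance is typically started with the gRPC TLS material passed via flags along these lines (flag names as documented upstream; verify against your Matchbox version):
matchbox -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081 \
  -data-path=/var/lib/matchbox \
  -assets-path=/var/lib/matchbox/assets \
  -ca-file=/etc/matchbox/ca.crt \
  -cert-file=/etc/matchbox/server.crt \
  -key-file=/etc/matchbox/server.key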
Configuration Workflow
Recommended: Terraform + External Configs
┌─────────────────────────────────────────────────────────────┐
│ 1. Write Butane configs (YAML) │
│ - worker.yaml, controller.yaml │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Terraform ct_config transpiles Butane → Ignition │
│ data "ct_config" "worker" { │
│ content = file("worker.yaml") │
│ strict = true │
│ } │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Terraform creates profiles + groups in Matchbox │
│ matchbox_profile.worker → gRPC CreateProfile() │
│ matchbox_group.node1 → gRPC CreateGroup() │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Machine PXE boots, queries Matchbox │
│ GET /ipxe?mac=... → matches group → returns profile │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Ignition fetches rendered config │
│ GET /ignition?mac=... → Matchbox returns Ignition │
└─────────────────────────────────────────────────────────────┘
Benefits:
- Rich Terraform templating (loops, conditionals, external data sources)
- Butane validation before deployment
- Declarative infrastructure (can terraform plan before apply)
- Version control workflow (git + CI/CD)
Alternative: Manual FileStore
┌─────────────────────────────────────────────────────────────┐
│ 1. Create profile JSON manually │
│ /var/lib/matchbox/profiles/worker.json │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Create group JSON manually │
│ /var/lib/matchbox/groups/node1.json │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Write Ignition/Butane config │
│ /var/lib/matchbox/ignition/worker.ign │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Restart matchbox (to reload FileStore) │
│ systemctl restart matchbox │
└─────────────────────────────────────────────────────────────┘
Drawbacks:
- Manual file management
- No validation before deployment
- Requires matchbox restart to pick up changes
- Error-prone for large fleets
Storage Backends
FileStore (Default)
Config: -data-path=/var/lib/matchbox
Pros:
- Simple file-based storage
- Easy to version control (git)
- Human-readable JSON
Cons:
- Requires file system access
- Manual reload for gRPC-created resources
Custom Store (Extensible)
Matchbox’s Store interface allows custom backends:
type Store interface {
ProfileGet(id string) (*Profile, error)
GroupGet(id string) (*Group, error)
IgnitionGet(name string) (string, error)
// ... other methods
}
Potential custom stores:
- etcd backend (for HA Matchbox)
- Database backend (PostgreSQL, MySQL)
- S3/object storage backend
Note: Not officially provided by Matchbox project; requires custom implementation
Security Considerations
gRPC API authentication: Requires TLS client certificates
- ca.crt - CA that signed client certs
- server.crt / server.key - Server TLS identity
- client.crt / client.key - Client credentials (Terraform)
HTTP endpoints are read-only: No auth, machines fetch configs
- Do NOT put secrets in Ignition configs
- Use external secret stores (Vault, GCP Secret Manager)
- Reference secrets via Ignition files.source with auth headers
Network segmentation: Matchbox on provisioning VLAN, isolate from production
Config validation: Validate Ignition/Butane before deployment to avoid boot failures
Audit logging: Version control groups/profiles; log gRPC API changes
Operational Tips
Test groups with curl:
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
List profiles:
ls -la /var/lib/matchbox/profiles/
Validate Butane:
podman run -i --rm quay.io/coreos/fcct:release --strict < worker.yaml
Check group matching:
# Default group (no selectors)
curl http://matchbox.example.com:8080/ignition
# Specific machine
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&uuid=550e8400-e29b-41d4-a716-446655440000'
Backup configs:
tar -czf matchbox-backup-$(date +%F).tar.gz /var/lib/matchbox/{groups,profiles,ignition}
Summary
Matchbox’s configuration model provides:
- Separation of concerns: Profiles (what) vs Groups (who/where)
- Flexible matching: Label-based, multi-attribute, custom selectors
- Template support: Go templates for dynamic configs (but prefer external rendering)
- API-driven: Terraform integration for GitOps workflows
- Storage options: FileStore (simple) or custom backends (extensible)
- OS-agnostic: Works with any Ignition-based distro (FCOS, Flatcar, RHCOS)
Best practice: Use Terraform + external Butane configs for production; manual FileStore for labs/development.
1.5.2 - Deployment Patterns
Matchbox Deployment Patterns
Analysis of deployment architectures, installation methods, and operational considerations for running Matchbox in production.
Deployment Architectures
Single-Host Deployment
┌─────────────────────────────────────────────────────┐
│ Provisioning Host │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Matchbox │ │ dnsmasq │ │
│ │ :8080 HTTP │ │ DHCP/TFTP │ │
│ │ :8081 gRPC │ │ :67,:69 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ └──────────┬───────────┘ │
│ │ │
│ /var/lib/matchbox/ │
│ ├── groups/ │
│ ├── profiles/ │
│ ├── ignition/ │
│ └── assets/ │
└─────────────────────────────────────────────────────┘
│
│ Network
▼
┌──────────────┐
│ PXE Clients │
└──────────────┘
Use case: Lab, development, small deployments (<50 machines)
Pros:
- Simple setup
- Single service to manage
- Minimal resource requirements
Cons:
- Single point of failure
- No scalability
- Downtime during updates
HA Deployment (Multiple Matchbox Instances)
┌─────────────────────────────────────────────────────┐
│ Load Balancer (Ingress/HAProxy) │
│ :8080 HTTP :8081 gRPC │
└─────────────────────────────────────────────────────┘
│ │
├─────────────┬────────────────┤
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│Matchbox 1│ │Matchbox 2│ │Matchbox N│
│ (Pod/VM) │ │ (Pod/VM) │ │ (Pod/VM) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────┴────────────────┘
│
▼
┌────────────────────────┐
│ Shared Storage │
│ /var/lib/matchbox │
│ (NFS, PV, ConfigMap) │
└────────────────────────┘
Use case: Production, datacenter-scale (100+ machines)
Pros:
- High availability (no single point of failure)
- Rolling updates (zero downtime)
- Load distribution
Cons:
- Complex storage (shared volume or etcd backend)
- More infrastructure required
Storage options:
- Kubernetes PersistentVolume (RWX mode)
- NFS share mounted on multiple hosts
- Custom etcd-backed Store (requires custom implementation)
- Git-sync sidecar (read-only, periodic pull)
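As one concrete shape of the shared-storage options above, an NFS-backed data directory can be mounted on every Matchbox host; the server name and export path here are assumptions:
# Mount the shared Matchbox data directory on each instance (illustrative server/export)
sudo mkdir -p /var/lib/matchbox
sudo mount -t nfs nfs.example.com:/exports/matchbox /var/lib/matchbox
# Persist the mount across reboots
echo 'nfs.example.com:/exports/matchbox /var/lib/matchbox nfs defaults,_netdev 0 0' | sudo tee -a /etc/fstab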
Kubernetes Deployment
┌─────────────────────────────────────────────────────┐
│ Ingress Controller │
│ matchbox.example.com → Service matchbox:8080 │
│ matchbox-rpc.example.com → Service matchbox:8081 │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Service: matchbox (ClusterIP) │
│ ports: 8080/TCP, 8081/TCP │
└─────────────────────────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Pod: matchbox │ │ Pod: matchbox │
│ replicas: 2+ │ │ replicas: 2+ │
└─────────────────┘ └─────────────────┘
│ │
└───────────┬───────────┘
▼
┌─────────────────────────────────────────────────────┐
│ PersistentVolumeClaim: matchbox-data │
│ /var/lib/matchbox (RWX mode) │
└─────────────────────────────────────────────────────┘
Manifest structure:
contrib/k8s/
├── matchbox-deployment.yaml # Deployment + replicas
├── matchbox-service.yaml # Service (8080, 8081)
├── matchbox-ingress.yaml # Ingress (HTTP + gRPC TLS)
└── matchbox-pvc.yaml # PersistentVolumeClaim
Key configurations:
Secret for gRPC TLS:
kubectl create secret generic matchbox-rpc \
  --from-file=ca.crt \
  --from-file=server.crt \
  --from-file=server.key
Ingress for gRPC (TLS passthrough):
metadata:
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
Volume mount:
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: matchbox-data
volumeMounts:
  - name: data
    mountPath: /var/lib/matchbox
Use case: Cloud-native deployments, Kubernetes-based infrastructure
Pros:
- Native Kubernetes primitives (Deployments, Services, Ingress)
- Rolling updates via Deployment strategy
- Easy scaling (kubectl scale)
- Health checks + auto-restart
Cons:
- Requires RWX PersistentVolume or shared storage
- Ingress TLS configuration complexity (gRPC passthrough)
- Cluster dependency (can’t provision cluster bootstrap nodes)
⚠️ Bootstrap problem: Kubernetes-hosted Matchbox can’t PXE boot its own cluster nodes (chicken-and-egg). Use external Matchbox for initial cluster bootstrap, then migrate.
Installation Methods
1. Binary Installation (systemd)
Recommended for: Bare-metal hosts, VMs, traditional Linux servers
Steps:
Download and verify:
wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz
wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz.asc
gpg --verify matchbox-v0.10.0-linux-amd64.tar.gz.asc
Extract and install:
tar xzf matchbox-v0.10.0-linux-amd64.tar.gz
sudo cp matchbox-v0.10.0-linux-amd64/matchbox /usr/local/bin/
Create user and directories:
sudo useradd -U matchbox
sudo mkdir -p /var/lib/matchbox/{assets,groups,profiles,ignition}
sudo chown -R matchbox:matchbox /var/lib/matchbox
Install systemd unit:
sudo cp contrib/systemd/matchbox.service /etc/systemd/system/
Configure via systemd dropin:
sudo systemctl edit matchbox
[Service]
Environment="MATCHBOX_ADDRESS=0.0.0.0:8080"
Environment="MATCHBOX_RPC_ADDRESS=0.0.0.0:8081"
Environment="MATCHBOX_LOG_LEVEL=debug"
Start service:
sudo systemctl daemon-reload
sudo systemctl start matchbox
sudo systemctl enable matchbox
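A quick smoke test once the unit is running (the expected one-line HTTP response is the same check used under Monitoring below):
systemctl status matchbox --no-pager
curl http://127.0.0.1:8080        # Should return: matchbox
journalctl -u matchbox -n 20 --no-pager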
Pros:
- Direct control over service
- Easy log access (journalctl -u matchbox)
- Native OS integration
Cons:
- Manual updates required
- OS dependency (package compatibility)
2. Container Deployment (Docker/Podman)
Recommended for: Docker hosts, quick testing, immutable infrastructure
Docker:
mkdir -p /var/lib/matchbox/assets
docker run -d --name matchbox \
--net=host \
-v /var/lib/matchbox:/var/lib/matchbox:Z \
-v /etc/matchbox:/etc/matchbox:Z,ro \
quay.io/poseidon/matchbox:v0.10.0 \
-address=0.0.0.0:8080 \
-rpc-address=0.0.0.0:8081 \
-log-level=debug
Podman:
podman run -d --name matchbox \
--net=host \
-v /var/lib/matchbox:/var/lib/matchbox:Z \
-v /etc/matchbox:/etc/matchbox:Z,ro \
quay.io/poseidon/matchbox:v0.10.0 \
-address=0.0.0.0:8080 \
-rpc-address=0.0.0.0:8081 \
-log-level=debug
Volume mounts:
- /var/lib/matchbox - Data directory (groups, profiles, configs, assets)
- /etc/matchbox - TLS certificates (ca.crt, server.crt, server.key)
Network mode:
- --net=host - Required for DHCP/TFTP interaction on the same host
- Bridge mode is possible if Matchbox runs on a separate host from dnsmasq
Pros:
- Immutable deployments
- Easy updates (pull new image)
- Portable across hosts
Cons:
- Volume management complexity
- SELinux considerations (:Z flag)
3. Kubernetes Deployment
Recommended for: Kubernetes environments, cloud platforms
Quick start:
# Create TLS secret for gRPC
kubectl create secret generic matchbox-rpc \
--from-file=ca.crt=~/.matchbox/ca.crt \
--from-file=server.crt=~/.matchbox/server.crt \
--from-file=server.key=~/.matchbox/server.key
# Deploy manifests
kubectl apply -R -f contrib/k8s/
# Check status
kubectl get pods -l app=matchbox
kubectl get svc matchbox
kubectl get ingress matchbox matchbox-rpc
Persistence options:
Option 1: emptyDir (ephemeral, dev only):
volumes:
- name: data
emptyDir: {}
Option 2: PersistentVolumeClaim (production):
volumes:
- name: data
persistentVolumeClaim:
claimName: matchbox-data
Option 3: ConfigMap (static configs):
volumes:
- name: groups
configMap:
name: matchbox-groups
- name: profiles
configMap:
name: matchbox-profiles
Option 4: Git-sync sidecar (GitOps):
initContainers:
- name: git-sync
image: k8s.gcr.io/git-sync:v3.6.3
env:
- name: GIT_SYNC_REPO
value: https://github.com/example/matchbox-configs
- name: GIT_SYNC_DEST
value: /var/lib/matchbox
volumeMounts:
- name: data
mountPath: /var/lib/matchbox
Pros:
- Native k8s features (scaling, health checks, rolling updates)
- Ingress integration
- GitOps workflows
Cons:
- Complexity (Ingress, PVC, TLS)
- Can’t bootstrap own cluster
Network Boot Environment Setup
Matchbox requires separate DHCP/TFTP/DNS services. Options:
Option 1: dnsmasq Container (Quickest)
Use case: Lab, testing, environments without existing DHCP
Full DHCP + TFTP + DNS:
docker run -d --name dnsmasq \
--cap-add=NET_ADMIN \
--net=host \
quay.io/poseidon/dnsmasq:latest \
-d -q \
--dhcp-range=192.168.1.3,192.168.1.254,30m \
--enable-tftp \
--tftp-root=/var/lib/tftpboot \
--dhcp-match=set:bios,option:client-arch,0 \
--dhcp-boot=tag:bios,undionly.kpxe \
--dhcp-match=set:efi64,option:client-arch,9 \
--dhcp-boot=tag:efi64,ipxe.efi \
--dhcp-userclass=set:ipxe,iPXE \
--dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
--address=/matchbox.example.com/192.168.1.2 \
--log-queries \
--log-dhcp
Proxy DHCP (alongside existing DHCP):
docker run -d --name dnsmasq \
--cap-add=NET_ADMIN \
--net=host \
quay.io/poseidon/dnsmasq:latest \
-d -q \
--dhcp-range=192.168.1.1,proxy,255.255.255.0 \
--enable-tftp \
--tftp-root=/var/lib/tftpboot \
--dhcp-userclass=set:ipxe,iPXE \
--pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
--pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
--log-queries \
--log-dhcp
Included files: undionly.kpxe, ipxe.efi, grub.efi (bundled in image)
Option 2: Existing DHCP/TFTP Infrastructure
Use case: Enterprise environments with network admin policies
Required DHCP options (ISC DHCP example):
subnet 192.168.1.0 netmask 255.255.255.0 {
range 192.168.1.10 192.168.1.250;
# BIOS clients
if option architecture-type = 00:00 {
filename "undionly.kpxe";
}
# UEFI clients
elsif option architecture-type = 00:09 {
filename "ipxe.efi";
}
# iPXE clients
elsif exists user-class and option user-class = "iPXE" {
filename "http://matchbox.example.com:8080/boot.ipxe";
}
next-server 192.168.1.100; # TFTP server IP
}
TFTP files (place in tftp root):
- Download from http://boot.ipxe.org/undionly.kpxe
- Download from http://boot.ipxe.org/ipxe.efi
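A small sketch of staging those files, assuming the dnsmasq TFTP root used elsewhere in this document (/var/lib/tftpboot):
sudo mkdir -p /var/lib/tftpboot
sudo wget -P /var/lib/tftpboot http://boot.ipxe.org/undionly.kpxe
sudo wget -P /var/lib/tftpboot http://boot.ipxe.org/ipxe.efi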
Option 3: iPXE-only (No PXE Chainload)
Use case: Modern hardware with native iPXE firmware
DHCP config (simpler):
filename "http://matchbox.example.com:8080/boot.ipxe";
No TFTP server needed (iPXE fetches directly via HTTP)
Limitation: Doesn’t support legacy BIOS with basic PXE ROM
TLS Certificate Setup
gRPC API requires TLS client certificates for authentication.
Option 1: Provided cert-gen Script
cd scripts/tls
export SAN=DNS.1:matchbox.example.com,IP.1:192.168.1.100
./cert-gen
Generates:
- ca.crt - Self-signed CA
- server.crt, server.key - Server credentials
- client.crt, client.key - Client credentials (for Terraform)
Install server certs:
sudo mkdir -p /etc/matchbox
sudo cp ca.crt server.crt server.key /etc/matchbox/
sudo chown -R matchbox:matchbox /etc/matchbox
Save client certs for Terraform:
mkdir -p ~/.matchbox
cp client.crt client.key ca.crt ~/.matchbox/
Option 2: Corporate PKI
Preferred for production: Use organization’s certificate authority
Requirements:
- Server cert with SAN: DNS:matchbox.example.com
- Client cert issued by the same CA
- CA cert for validation
Matchbox flags:
-ca-file=/etc/matchbox/ca.crt
-cert-file=/etc/matchbox/server.crt
-key-file=/etc/matchbox/server.key
Terraform provider config:
provider "matchbox" {
endpoint = "matchbox.example.com:8081"
client_cert = file("/path/to/client.crt")
client_key = file("/path/to/client.key")
ca = file("/path/to/ca.crt")
}
Option 3: Let’s Encrypt (HTTP API only)
Note: gRPC requires client cert auth (incompatible with Let’s Encrypt)
Use case: TLS for HTTP endpoints only (read-only API)
Matchbox flags:
-web-ssl=true
-web-cert-file=/etc/letsencrypt/live/matchbox.example.com/fullchain.pem
-web-key-file=/etc/letsencrypt/live/matchbox.example.com/privkey.pem
Limitation: Still need self-signed certs for gRPC API
Configuration Flags
Core Flags
| Flag | Default | Description |
|---|---|---|
-address | 127.0.0.1:8080 | HTTP API listen address |
-rpc-address | `` | gRPC API listen address (empty = disabled) |
-data-path | /var/lib/matchbox | Data directory (FileStore) |
-assets-path | /var/lib/matchbox/assets | Static assets directory |
-log-level | info | Logging level (debug, info, warn, error) |
TLS Flags (gRPC)
| Flag | Default | Description |
|---|---|---|
-ca-file | /etc/matchbox/ca.crt | CA certificate for client verification |
-cert-file | /etc/matchbox/server.crt | Server TLS certificate |
-key-file | /etc/matchbox/server.key | Server TLS private key |
TLS Flags (HTTP, optional)
| Flag | Default | Description |
|---|---|---|
-web-ssl | false | Enable TLS for HTTP API |
-web-cert-file | `` | HTTP server TLS certificate |
-web-key-file | `` | HTTP server TLS private key |
Environment Variables
All flags can be set via environment variables with MATCHBOX_ prefix:
export MATCHBOX_ADDRESS=0.0.0.0:8080
export MATCHBOX_RPC_ADDRESS=0.0.0.0:8081
export MATCHBOX_LOG_LEVEL=debug
export MATCHBOX_DATA_PATH=/custom/path
Operational Considerations
Firewall Configuration
Matchbox host:
firewall-cmd --permanent --add-port=8080/tcp # HTTP API
firewall-cmd --permanent --add-port=8081/tcp # gRPC API
firewall-cmd --reload
dnsmasq host (if separate):
firewall-cmd --permanent --add-service=dhcp
firewall-cmd --permanent --add-service=tftp
firewall-cmd --permanent --add-service=dns # optional
firewall-cmd --reload
Monitoring
Health check endpoints:
# HTTP API
curl http://matchbox.example.com:8080
# Should return: matchbox
# gRPC API
openssl s_client -connect matchbox.example.com:8081 \
-CAfile ~/.matchbox/ca.crt \
-cert ~/.matchbox/client.crt \
-key ~/.matchbox/client.key
Prometheus metrics: Not built-in; consider adding reverse proxy (e.g., nginx) with metrics exporter
Logs (systemd):
journalctl -u matchbox -f
Logs (container):
docker logs -f matchbox
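Because metrics are not built in, one lightweight option is a cron-driven probe that exports an availability gauge via node_exporter's textfile collector; a minimal sketch, assuming that collector is enabled and the directory path matches your setup:
#!/usr/bin/env bash
# Hypothetical health probe: writes a matchbox_up gauge for node_exporter to pick up.
# The textfile directory below is an assumption; match it to --collector.textfile.directory.
if curl -fsS http://matchbox.example.com:8080 | grep -q matchbox; then up=1; else up=0; fi
echo "matchbox_up $up" > /var/lib/node_exporter/textfile_collector/matchbox.prom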
Backup Strategy
What to backup:
- /var/lib/matchbox/{groups,profiles,ignition} - Configs
- /etc/matchbox/*.{crt,key} - TLS certificates
- Terraform state (if using the Terraform provider)
Backup command:
tar -czf matchbox-backup-$(date +%F).tar.gz \
/var/lib/matchbox/{groups,profiles,ignition} \
/etc/matchbox
Restore:
tar -xzf matchbox-backup-YYYY-MM-DD.tar.gz -C /
sudo chown -R matchbox:matchbox /var/lib/matchbox
sudo systemctl restart matchbox
GitOps approach: Store configs in git repository for versioning and auditability
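A minimal sketch of that GitOps-style versioning on the Matchbox host, assuming a git identity is already configured:
cd /var/lib/matchbox
sudo git init
sudo git add groups profiles ignition
sudo git commit -m "snapshot matchbox configs $(date +%F)"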
Updates
Binary deployment:
# Download new version
wget https://github.com/poseidon/matchbox/releases/download/vX.Y.Z/matchbox-vX.Y.Z-linux-amd64.tar.gz
tar xzf matchbox-vX.Y.Z-linux-amd64.tar.gz
# Replace binary
sudo systemctl stop matchbox
sudo cp matchbox-vX.Y.Z-linux-amd64/matchbox /usr/local/bin/
sudo systemctl start matchbox
Container deployment:
docker pull quay.io/poseidon/matchbox:vX.Y.Z
docker stop matchbox
docker rm matchbox
docker run -d --name matchbox ... quay.io/poseidon/matchbox:vX.Y.Z ...
Kubernetes deployment:
kubectl set image deployment/matchbox matchbox=quay.io/poseidon/matchbox:vX.Y.Z
kubectl rollout status deployment/matchbox
Scaling Considerations
Vertical scaling (single instance):
- CPU: Minimal (config rendering is lightweight)
- Memory: ~50MB base + asset cache
- Disk: Depends on cached assets (100MB - 10GB+)
Horizontal scaling (multiple instances):
- Stateless HTTP API (load balance round-robin)
- Shared storage required (RWX PV, NFS, or custom backend)
- gRPC API can be load-balanced with gRPC-aware LB
Asset serving optimization:
- Use CDN or cache proxy for remote assets
- Local asset caching for <100 machines
- Dedicated HTTP server (nginx) for large deployments (1000+ machines)
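A hedged example of pre-caching one Fedora CoreOS release into the assets directory (version and file names are illustrative; the initramfs URL is assumed to follow the same build layout as the kernel URL shown later in this document):
sudo mkdir -p /var/lib/matchbox/assets/fedora-coreos/36.20220906.3.2
cd /var/lib/matchbox/assets/fedora-coreos/36.20220906.3.2
# Kernel and initramfs for the release; URLs assumed to follow the official build layout
sudo wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-kernel-x86_64
sudo wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img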
Security Best Practices
Don’t store secrets in Ignition configs
- Use Ignition files.source with auth headers to fetch from Vault
- Or provision a minimal config and fetch secrets post-boot
Network segmentation
- Provision VLAN isolated from production
- Firewall rules: only allow provisioning traffic
gRPC API access control
- Client cert authentication (mandatory)
- Restrict cert issuance to authorized personnel/systems
- Rotate certs periodically
Audit logging
- Version control groups/profiles (git)
- Log gRPC API changes (Terraform state tracking)
- Monitor HTTP endpoint access
Validate configs before deployment
- fcct --strict for Butane configs
- terraform plan before apply
- Test in dev environment first
Troubleshooting
Common Issues
1. Machines not PXE booting:
# Check DHCP responses
tcpdump -i eth0 port 67 and port 68
# Verify TFTP files
ls -la /var/lib/tftpboot/
curl tftp://192.168.1.100/undionly.kpxe
# Check Matchbox accessibility
curl http://matchbox.example.com:8080/boot.ipxe
2. 404 Not Found on /ignition:
# Test group matching
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
# Check group exists
ls -la /var/lib/matchbox/groups/
# Check profile referenced by group exists
ls -la /var/lib/matchbox/profiles/
# Verify ignition_id file exists
ls -la /var/lib/matchbox/ignition/
3. gRPC connection refused (Terraform):
# Test TLS connection
openssl s_client -connect matchbox.example.com:8081 \
-CAfile ~/.matchbox/ca.crt \
-cert ~/.matchbox/client.crt \
-key ~/.matchbox/client.key
# Check Matchbox gRPC is listening
sudo ss -tlnp | grep 8081
# Verify firewall
sudo firewall-cmd --list-ports
4. Ignition config validation errors:
# Validate Butane locally
podman run -i --rm quay.io/coreos/fcct:release --strict < config.yaml
# Fetch rendered Ignition
curl 'http://matchbox.example.com:8080/ignition?mac=...' | jq .
# Validate Ignition spec
curl 'http://matchbox.example.com:8080/ignition?mac=...' | \
podman run -i --rm quay.io/coreos/ignition-validate:latest
Summary
Matchbox deployment considerations:
- Architecture: Single-host (dev/lab) vs HA (production) vs Kubernetes
- Installation: Binary (systemd), container (Docker/Podman), or Kubernetes manifests
- Network boot: dnsmasq container (quick), existing infrastructure (enterprise), or iPXE-only (modern)
- TLS: Self-signed (dev), corporate PKI (production), Let’s Encrypt (HTTP only)
- Scaling: Vertical (simple) vs horizontal (requires shared storage)
- Security: Client cert auth, network segmentation, no secrets in configs
- Operations: Backup configs, GitOps workflow, monitoring/logging
Recommendation for production:
- HA deployment (2+ instances) with load balancer
- Shared storage (NFS or RWX PV on Kubernetes)
- Corporate PKI for TLS certificates
- GitOps workflow (Terraform + git-controlled configs)
- Network segmentation (dedicated provisioning VLAN)
- Prometheus/Grafana monitoring
1.5.3 - Network Boot Support
Network Boot Support in Matchbox
Matchbox provides comprehensive network boot support for bare-metal provisioning, supporting multiple boot firmware types and protocols.
Overview
Matchbox serves as an HTTP entrypoint for network-booted machines but does not implement DHCP, TFTP, or DNS services itself. Instead, it integrates with existing network infrastructure (or companion services like dnsmasq) to provide a complete PXE boot solution.
Boot Protocol Support
1. PXE (Preboot Execution Environment)
Legacy BIOS support via chainloading to iPXE:
Machine BIOS → DHCP (gets TFTP server) → TFTP (gets undionly.kpxe)
→ iPXE firmware → HTTP (Matchbox /boot.ipxe)
Key characteristics:
- Requires a TFTP server to serve undionly.kpxe (the iPXE bootloader)
- Chainloads from the legacy PXE ROM to modern iPXE
- Supports older hardware with basic PXE firmware
- TFTP only used for initial iPXE bootstrap; subsequent downloads via HTTP
2. iPXE (Enhanced PXE)
Primary boot method supported by Matchbox:
iPXE Client → DHCP (gets boot script URL) → HTTP (Matchbox endpoints)
→ Kernel/initrd download → Boot with Ignition config
Endpoints served by Matchbox:
| Endpoint | Purpose |
|---|---|
/boot.ipxe | Static script that gathers machine attributes (UUID, MAC, hostname, serial) |
/ipxe?<labels> | Rendered iPXE script with kernel, initrd, and boot args for matched machine |
/assets/ | Optional local caching of kernel/initrd images |
Example iPXE flow:
- Machine boots with iPXE firmware
- DHCP response points to http://matchbox.example.com:8080/boot.ipxe
- iPXE fetches /boot.ipxe:
#!ipxe
chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&domain=${domain}&hostname=${hostname}&serial=${serial}
- iPXE requests /ipxe?uuid=...&mac=... with machine attributes
- Matchbox matches the machine to a group/profile and renders an iPXE script:
#!ipxe
kernel /assets/coreos/VERSION/coreos_production_pxe.vmlinuz \
  coreos.config.url=http://matchbox.foo:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp} \
  coreos.first_boot=1
initrd /assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz
boot
Advantages:
- HTTP downloads (faster than TFTP)
- Scriptable boot logic
- Can fetch configs from HTTP endpoints
- Supports HTTPS (if compiled with TLS support)
3. GRUB2
UEFI firmware support:
UEFI Firmware → DHCP (gets GRUB bootloader) → TFTP (grub.efi)
→ GRUB → HTTP (Matchbox /grub endpoint)
Matchbox endpoint: /grub?<labels>
Example GRUB config rendered by Matchbox:
default=0
timeout=1
menuentry "CoreOS" {
echo "Loading kernel"
linuxefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe.vmlinuz" \
"coreos.config.url=http://matchbox.foo:8080/ignition" "coreos.first_boot"
echo "Loading initrd"
initrdefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz"
}
Use case:
- UEFI systems that prefer GRUB over iPXE
- Environments with existing GRUB network boot infrastructure
4. PXELINUX (Legacy, via TFTP)
While not a primary Matchbox target, PXELINUX clients can be configured to chainload iPXE:
# /var/lib/tftpboot/pxelinux.cfg/default
timeout 10
default iPXE
LABEL iPXE
KERNEL ipxe.lkrn
APPEND dhcp && chain http://matchbox.example.com:8080/boot.ipxe
DHCP Configuration Patterns
Matchbox supports two DHCP deployment models:
Pattern 1: PXE-Enabled DHCP
Full DHCP server provides IP allocation + PXE boot options.
Example dnsmasq configuration:
dhcp-range=192.168.1.1,192.168.1.254,30m
enable-tftp
tftp-root=/var/lib/tftpboot
# Legacy BIOS → chainload to iPXE
dhcp-match=set:bios,option:client-arch,0
dhcp-boot=tag:bios,undionly.kpxe
# UEFI → iPXE
dhcp-match=set:efi32,option:client-arch,6
dhcp-boot=tag:efi32,ipxe.efi
dhcp-match=set:efi64,option:client-arch,9
dhcp-boot=tag:efi64,ipxe.efi
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe
# DNS for Matchbox
address=/matchbox.example.com/192.168.1.100
Client architecture detection:
- Option 93 (client-arch): Identifies BIOS (0), UEFI32 (6), UEFI64 (9)
- User class: Detects iPXE clients to skip TFTP chainloading
Pattern 2: Proxy DHCP
Runs alongside existing DHCP server; provides only boot options (no IP allocation).
Example dnsmasq proxy-DHCP:
dhcp-range=192.168.1.1,proxy,255.255.255.0
enable-tftp
tftp-root=/var/lib/tftpboot
# Chainload legacy PXE to iPXE
pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe
Benefits:
- Non-invasive: doesn’t replace existing DHCP
- PXE clients receive merged responses from both DHCP servers
- Ideal for environments where main DHCP cannot be modified
Network Boot Flow (Complete)
Scenario: BIOS machine with legacy PXE firmware
┌──────────────────────────────────────────────────────────────────┐
│ 1. Machine powers on, BIOS set to network boot │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. NIC PXE firmware broadcasts DHCPDISCOVER (PXEClient) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. DHCP/proxyDHCP responds with: │
│ - IP address (if full DHCP) │
│ - Next-server: TFTP server IP │
│ - Filename: undionly.kpxe (based on arch=0) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. PXE firmware downloads undionly.kpxe via TFTP │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. Execute iPXE (undionly.kpxe) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. iPXE requests DHCP again, identifies as iPXE (user-class) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. DHCP responds with boot URL (not TFTP): │
│ http://matchbox.example.com:8080/boot.ipxe │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 8. iPXE fetches /boot.ipxe via HTTP: │
│ #!ipxe │
│ chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&... │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 9. iPXE chains to /ipxe?uuid=XXX&mac=YYY (introspected labels) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 10. Matchbox matches machine to group/profile │
│ - Finds most specific group matching labels │
│ - Retrieves profile (kernel, initrd, args, configs) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 11. Matchbox renders iPXE script with: │
│ - kernel URL (local asset or remote HTTPS) │
│ - initrd URL │
│ - kernel args (including ignition.config.url) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 12. iPXE downloads kernel + initrd (HTTP/HTTPS) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 13. iPXE boots kernel with specified args │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 14. Fedora CoreOS/Flatcar boots, Ignition runs │
│ - Fetches /ignition?uuid=XXX&mac=YYY from Matchbox │
│ - Matchbox renders Ignition config with group metadata │
│ - Ignition partitions disk, writes files, creates users │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 15. System reboots (if disk install), boots from disk │
└──────────────────────────────────────────────────────────────────┘
Asset Serving
Matchbox can serve static assets (kernel, initrd images) from a local directory to reduce bandwidth and increase speed:
Asset directory structure:
/var/lib/matchbox/assets/
├── fedora-coreos/
│ └── 36.20220906.3.2/
│ ├── fedora-coreos-36.20220906.3.2-live-kernel-x86_64
│ ├── fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img
│ └── fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img
└── flatcar/
└── 3227.2.0/
├── flatcar_production_pxe.vmlinuz
├── flatcar_production_pxe_image.cpio.gz
└── version.txt
HTTP endpoint: http://matchbox.example.com:8080/assets/
Scripts provided:
- scripts/get-fedora-coreos - Download/verify Fedora CoreOS images
- scripts/get-flatcar - Download/verify Flatcar Linux images
Profile reference:
{
"boot": {
"kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
"initrd": ["--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"]
}
}
Alternative: Profiles can reference remote HTTPS URLs (requires iPXE compiled with TLS support):
{
"kernel": "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-kernel-x86_64"
}
OS Support
Fedora CoreOS
Boot types:
- Live PXE (RAM-only, ephemeral)
- Install to disk (persistent, recommended)
Required kernel args:
- coreos.inst.install_dev=/dev/sda - Target disk for install
- coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Provisioning config
- coreos.live.rootfs_url=... - Root filesystem image
Ignition fetch: During first boot, ignition.service fetches config from Matchbox
Flatcar Linux
Boot types:
- Live PXE (RAM-only)
- Install to disk
Required kernel args:
- flatcar.first_boot=yes - Marks first boot
- flatcar.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Ignition config URL
- flatcar.autologin - Auto-login to console (optional, dev/debug)
Ignition support: Flatcar uses Ignition v3.x for provisioning
RHEL CoreOS
Supported as it uses Ignition like Fedora CoreOS. Requires Red Hat-specific image sources.
Machine Matching & Labels
Matchbox matches machines to profiles using labels extracted during boot:
Reserved Label Selectors
| Label | Source | Example | Normalized |
|---|---|---|---|
uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase |
mac | NIC MAC address | 52:54:00:89:d8:10 | Normalized to colons |
hostname | Network boot program | node1.example.com | As-is |
serial | Hardware serial | VMware-42 1a... | As-is |
Custom Labels
Groups can match on arbitrary labels passed as query params:
/ipxe?mac=52:54:00:89:d8:10&region=us-west&env=prod
Matching precedence: Most specific group wins (most selector matches)
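These custom labels can be exercised the same way as the reserved ones, e.g. with curl against the /ipxe endpoint:
curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10&region=us-west&env=prod'
# The group with the most matching selectors wins and returns its profile's iPXE script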
Firmware Compatibility
| Firmware Type | Client Arch | Boot File | Protocol | Matchbox Support |
|---|---|---|---|---|
| BIOS (legacy PXE) | 0 | undionly.kpxe → iPXE | TFTP → HTTP | ✅ Via chainload |
| UEFI 32-bit | 6 | ipxe.efi | TFTP → HTTP | ✅ |
| UEFI (BIOS compat) | 7 | ipxe.efi | TFTP → HTTP | ✅ |
| UEFI 64-bit | 9 | ipxe.efi | TFTP → HTTP | ✅ |
| Native iPXE | - | N/A | HTTP | ✅ Direct |
| GRUB (UEFI) | - | grub.efi | TFTP → HTTP | ✅ /grub endpoint |
Network Requirements
Firewall rules on Matchbox host:
# HTTP API (read-only)
firewall-cmd --add-port=8080/tcp --permanent
# gRPC API (authenticated, Terraform)
firewall-cmd --add-port=8081/tcp --permanent
DNS requirement:
- matchbox.example.com must resolve to the Matchbox server IP
- Can be configured in dnsmasq, corporate DNS, or /etc/hosts on the DHCP server
DHCP/TFTP host (if using dnsmasq):
firewall-cmd --add-service=dhcp --permanent
firewall-cmd --add-service=tftp --permanent
firewall-cmd --add-service=dns --permanent # optional
Troubleshooting Tips
Verify Matchbox endpoints:
curl http://matchbox.example.com:8080            # Should return: matchbox
curl http://matchbox.example.com:8080/boot.ipxe  # Should return iPXE script
Test machine matching:
curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10'
# Should return rendered iPXE script with kernel/initrd
Check TFTP files:
ls -la /var/lib/tftpboot/
# Should contain: undionly.kpxe, ipxe.efi, grub.efi
Verify DHCP responses:
tcpdump -i eth0 -n port 67 and port 68
# Watch for DHCP offers with PXE options
iPXE console debugging:
- Press Ctrl+B during iPXE boot to enter the console
- Commands: dhcp, ifstat, show net0/ip, chain http://...
Limitations
- HTTPS support: iPXE must be compiled with crypto support (larger binary, ~80KB vs ~45KB)
- TFTP dependency: Legacy PXE requires TFTP for initial chainload (can’t skip)
- No DHCP/TFTP built-in: Must use external services or dnsmasq container
- Boot firmware variations: Some vendor PXE implementations have quirks
- SecureBoot: iPXE and GRUB must be signed (or SecureBoot disabled)
Reference Implementation: dnsmasq Container
Matchbox project provides quay.io/poseidon/dnsmasq with:
- Pre-configured DHCP/TFTP/DNS service
- Bundled ipxe.efi, undionly.kpxe, grub.efi
Quick start (full DHCP):
docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
-d -q \
--dhcp-range=192.168.1.3,192.168.1.254 \
--enable-tftp --tftp-root=/var/lib/tftpboot \
--dhcp-match=set:bios,option:client-arch,0 \
--dhcp-boot=tag:bios,undionly.kpxe \
--dhcp-match=set:efi64,option:client-arch,9 \
--dhcp-boot=tag:efi64,ipxe.efi \
--dhcp-userclass=set:ipxe,iPXE \
--dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
--address=/matchbox.example.com/192.168.1.2 \
--log-queries --log-dhcp
Quick start (proxy-DHCP):
docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
-d -q \
--dhcp-range=192.168.1.1,proxy,255.255.255.0 \
--enable-tftp --tftp-root=/var/lib/tftpboot \
--dhcp-userclass=set:ipxe,iPXE \
--pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
--pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
--log-queries --log-dhcp
Summary
Matchbox provides robust network boot support through:
- Protocol flexibility: iPXE (primary), GRUB2, legacy PXE (via chainload)
- Firmware compatibility: BIOS and UEFI
- Modern approach: HTTP-based with optional local asset caching
- Clean separation: Matchbox handles config rendering; external services handle DHCP/TFTP
- Production-ready: Used by Typhoon Kubernetes distributions for bare-metal provisioning
1.5.4 - Use Case Evaluation
Matchbox Use Case Evaluation
Analysis of Matchbox’s suitability for various use cases, strengths, limitations, and comparison with alternative provisioning solutions.
Use Case Fit Analysis
✅ Ideal Use Cases
1. Bare-Metal Kubernetes Clusters
Scenario: Provisioning 10-1000 physical servers for Kubernetes nodes
Why Matchbox Excels:
- Ignition-native (perfect for Fedora CoreOS/Flatcar)
- Declarative machine provisioning via Terraform
- Label-based matching (region, role, hardware type)
- Integration with Typhoon Kubernetes distribution
- Minimal OS surface (immutable, container-optimized)
Example workflow:
resource "matchbox_profile" "k8s_controller" {
name = "k8s-controller"
kernel = "/assets/fedora-coreos/.../kernel"
raw_ignition = data.ct_config.controller.rendered
}
resource "matchbox_group" "controllers" {
profile = matchbox_profile.k8s_controller.name
selector = {
role = "controller"
}
}
Alternatives considered:
- Cloud-init + netboot.xyz: Less declarative, no native Ignition support
- Foreman: Heavier, more complex for container-centric workloads
- Metal³: Kubernetes-native but requires existing cluster
Verdict: ⭐⭐⭐⭐⭐ Matchbox is purpose-built for this
2. Lab/Development Environments
Scenario: Rapid PXE boot testing with QEMU/KVM VMs or homelab servers
Why Matchbox Excels:
- Quick setup (binary + dnsmasq container)
- No DHCP infrastructure required (proxy-DHCP mode)
- Localhost deployment (no external dependencies)
- Fast iteration (change configs, re-PXE)
- Included examples and scripts
Example setup:
# Start Matchbox locally
docker run -d --net=host -v /var/lib/matchbox:/var/lib/matchbox \
quay.io/poseidon/matchbox:latest -address=0.0.0.0:8080
# Start dnsmasq on same host
docker run -d --net=host --cap-add=NET_ADMIN \
quay.io/poseidon/dnsmasq ...
Alternatives considered:
- netboot.xyz: Great for manual OS selection, no automation
- Plain PXE/TFTP servers: Simpler, but less flexible matching logic
- Manual iPXE scripts: No dynamic matching, manual maintenance
Verdict: ⭐⭐⭐⭐⭐ Minimal setup, maximum flexibility
3. Edge/Remote Site Provisioning
Scenario: Provision machines at 10+ remote datacenters or edge locations
Why Matchbox Excels:
- Lightweight (single binary, ~20MB)
- Declarative region-based matching
- Centralized config management (Terraform)
- Can run on minimal hardware (ARM support)
- HTTP-based (works over WAN with reverse proxy)
Architecture:
Central Matchbox (via Terraform)
↓ gRPC API
Regional Matchbox Instances (read-only cache)
↓ HTTP
Edge Machines (PXE boot)
Label-based routing:
{
"selector": {
"region": "us-west",
"site": "pdx-1"
},
"metadata": {
"ntp_servers": ["10.100.1.1", "10.100.1.2"]
}
}
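A hedged spot check that a machine at that site would resolve to this group (the endpoint and labels mirror the selector above; the MAC is illustrative):
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&region=us-west&site=pdx-1'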
Alternatives considered:
- Foreman: Requires more resources per site
- Ansible + netboot: No declarative PXE boot, post-install only
- Cloud-init datasources: Requires cloud metadata service per site
Verdict: ⭐⭐⭐⭐☆ Good fit, but consider caching strategy for WAN
⚠️ Moderate Fit Use Cases
4. Multi-Tenant Bare-Metal Cloud
Scenario: Provide bare-metal-as-a-service to multiple customers
Matchbox challenges:
- No built-in multi-tenancy (single namespace)
- No RBAC (gRPC API is all-or-nothing with client certs)
- No customer self-service portal
Workarounds:
- Deploy separate Matchbox per tenant (isolation via separate instances)
- Proxy gRPC API with custom RBAC layer
- Use group selectors with customer IDs
Better alternatives:
- Metal³ (Kubernetes-native, better multi-tenancy)
- OpenStack Ironic (purpose-built for bare-metal cloud)
- MAAS (Ubuntu-specific, has RBAC)
Verdict: ⭐⭐☆☆☆ Possible but architecturally challenging
5. Heterogeneous OS Provisioning
Scenario: Need to provision Fedora CoreOS, Ubuntu, RHEL, Windows
Matchbox challenges:
- Designed for Ignition-based OSes (FCOS, Flatcar, RHCOS)
- No native support for Kickstart (RHEL/CentOS)
- No support for Preseed (Ubuntu/Debian)
- No Windows unattend.xml support
What works:
- Fedora CoreOS ✅
- Flatcar Linux ✅
- RHEL CoreOS ✅
- Container Linux (deprecated but supported) ✅
What requires workarounds:
- RHEL/CentOS: Possible via generic configs + Kickstart URLs, but not native
- Ubuntu: Can PXE boot and point to autoinstall ISO, but loses Matchbox templating benefits
- Debian: Similar to Ubuntu
- Windows: Not supported (different PXE boot mechanisms)
Better alternatives for heterogeneous environments:
- Foreman (supports Kickstart, Preseed, unattend.xml)
- MAAS (Ubuntu-centric but extensible)
- Cobbler (older but supports many OS types)
Verdict: ⭐⭐☆☆☆ Stick to Ignition-based OSes or use different tool
❌ Poor Fit Use Cases
6. Windows PXE Boot
Why Matchbox doesn’t fit:
- No WinPE support
- No unattend.xml rendering
- Different PXE boot chain (WDS/SCCM model)
Recommendation: Use Microsoft WDS or SCCM
Verdict: ⭐☆☆☆☆ Not designed for this
7. BIOS/Firmware Updates
Why Matchbox doesn’t fit:
- Focused on OS provisioning, not firmware
- No vendor-specific tooling (Dell iDRAC, HP iLO integration)
Recommendation: Use vendor tools or Ansible with ipmi/redfish modules
Verdict: ⭐☆☆☆☆ Out of scope
Strengths
1. Ignition-First Design
- Native support for modern immutable OSes
- Declarative, atomic provisioning (no config drift)
- First-boot partition/filesystem setup
2. Label-Based Matching
- Flexible machine classification (MAC, UUID, region, role, custom)
- Most-specific-match algorithm (override defaults per machine)
- Query params for dynamic attributes
3. Terraform Integration
- Declarative infrastructure as code
- Plan before apply (preview changes)
- State tracking for auditability
- Rich templating (ct_config provider for Butane)
4. Minimal Dependencies
- Single static binary (~20MB)
- No database required (FileStore default)
- No built-in DHCP/TFTP (separation of concerns)
- Container-ready (OCI image available)
5. HTTP-Centric
- Faster downloads than TFTP (iPXE via HTTP)
- Proxy/CDN friendly for asset distribution
- Standard web tooling (curl, load balancers, Ingress)
6. Production-Ready
- Used by Typhoon Kubernetes (battle-tested)
- Clear upgrade path (SemVer releases)
- OpenPGP signature support for config integrity
Limitations
1. No Multi-Tenancy
- Single namespace (all groups/profiles global)
- No RBAC on gRPC API (client cert = full access)
- Requires separate instances per tenant
2. Ignition-Only Focus
- Cloud-Config deprecated (legacy support only)
- No native Kickstart/Preseed/unattend.xml
- Limits OS choice to CoreOS family
3. Storage Constraints
- FileStore doesn’t scale to 10,000+ profiles
- No built-in HA storage (requires NFS or custom backend)
- Kubernetes deployment needs RWX PersistentVolume
4. No Machine Discovery
- Doesn’t detect new machines (passive service)
- No inventory management (use external CMDB)
- No hardware introspection (use Ironic for that)
5. Limited Observability
- No built-in metrics (Prometheus integration requires reverse proxy)
- Logs are minimal (request logging only)
- No audit trail for gRPC API changes (use Terraform state)
6. TFTP Still Required
- Legacy BIOS PXE needs TFTP for chainloading to iPXE
- Can’t fully eliminate TFTP unless all machines have native iPXE
Comparison with Alternatives
vs. Foreman
| Feature | Matchbox | Foreman |
|---|---|---|
| OS Support | Ignition-based | Kickstart, Preseed, AutoYaST, etc. |
| Complexity | Low (single binary) | High (Rails app, DB, Puppet/Ansible) |
| Config Model | Declarative (Ignition) | Imperative (post-install scripts) |
| API | HTTP + gRPC | REST API |
| UI | None (API-only) | Full web UI |
| Terraform | Native provider | Community modules |
| Use Case | Container-centric infra | Traditional Linux servers |
When to choose Matchbox: CoreOS-based Kubernetes clusters, minimal infrastructure
When to choose Foreman: Heterogeneous OS, need web UI, traditional config mgmt
vs. Metal³
| Feature | Matchbox | Metal³ |
|---|---|---|
| Platform | Standalone | Kubernetes-native (operator) |
| Bootstrap | Can bootstrap k8s cluster | Needs existing k8s cluster |
| Machine Lifecycle | Provision only | Provision + decommission + reprovision |
| Hardware Introspection | No (labels passed manually) | Yes (via Ironic) |
| Multi-tenancy | No | Yes (via k8s namespaces) |
| Complexity | Low | High (requires Ironic, DHCP, etc.) |
When to choose Matchbox: Greenfield bare-metal, no existing k8s
When to choose Metal³: Existing k8s, need hardware mgmt lifecycle
vs. Cobbler
| Feature | Matchbox | Cobbler |
|---|---|---|
| Age | Modern (2016+) | Legacy (2008+) |
| Config Format | Ignition (declarative) | Kickstart/Preseed (imperative) |
| Templating | Go templates (minimal) | Cheetah templates (extensive) |
| Python | Go (static binary) | Python (requires interpreter) |
| DHCP Management | External | Can manage DHCP |
| Maintenance | Active (Poseidon) | Low activity |
When to choose Matchbox: Modern immutable OSes, container workloads
When to choose Cobbler: Legacy infra, need DHCP management, heterogeneous OS
vs. MAAS (Ubuntu)
| Feature | Matchbox | MAAS |
|---|---|---|
| OS Support | CoreOS family | Ubuntu (primary), others (limited) |
| IPAM | No (external DHCP) | Built-in IPAM |
| Power Mgmt | No (manual or scripts) | Built-in (IPMI, AMT, etc.) |
| UI | No | Full web UI |
| Declarative | Yes (Terraform) | Limited (CLI mostly) |
| Cloud Integration | No | Yes (libvirt, LXD, VM hosts) |
When to choose Matchbox: Non-Ubuntu, Kubernetes, minimal dependencies
When to choose MAAS: Ubuntu-centric, need power mgmt, cloud integration
vs. netboot.xyz
| Feature | Matchbox | netboot.xyz |
|---|---|---|
| Purpose | Automated provisioning | Manual OS selection menu |
| Automation | Full (API-driven) | None (interactive menu) |
| Customization | Per-machine configs | Global menu |
| Ignition | Native support | No |
| Complexity | Medium | Very low |
When to choose Matchbox: Automated fleet provisioning
When to choose netboot.xyz: Ad-hoc OS installation, homelab
Decision Matrix
Use this table to evaluate Matchbox for your use case:
| Requirement | Weight | Matchbox Score | Notes |
|---|---|---|---|
| Ignition/CoreOS support | High | ⭐⭐⭐⭐⭐ | Native, first-class |
| Heterogeneous OS | High | ⭐⭐☆☆☆ | Limited to Ignition OSes |
| Declarative provisioning | Medium | ⭐⭐⭐⭐⭐ | Terraform native |
| Multi-tenancy | Medium | ⭐☆☆☆☆ | Requires separate instances |
| Web UI | Medium | ☆☆☆☆☆ | No UI (API-only) |
| Ease of deployment | Medium | ⭐⭐⭐⭐☆ | Binary or container, minimal deps |
| Scalability | Medium | ⭐⭐⭐☆☆ | FileStore limits, need shared storage for HA |
| Hardware mgmt | Low | ☆☆☆☆☆ | No power mgmt, no introspection |
| Cost | Low | ⭐⭐⭐⭐⭐ | Open source, Apache 2.0 |
Scoring:
- ⭐⭐⭐⭐⭐ Excellent
- ⭐⭐⭐⭐☆ Good
- ⭐⭐⭐☆☆ Adequate
- ⭐⭐☆☆☆ Limited
- ⭐☆☆☆☆ Poor
- ☆☆☆☆☆ Not supported
Recommendations
Choose Matchbox if:
- ✅ Provisioning Fedora CoreOS, Flatcar, or RHEL CoreOS
- ✅ Building bare-metal Kubernetes clusters
- ✅ Prefer declarative infrastructure (Terraform)
- ✅ Want minimal dependencies (single binary)
- ✅ Need flexible label-based machine matching
- ✅ Have homogeneous OS requirements (all Ignition-based)
Avoid Matchbox if:
- ❌ Need multi-OS support (Windows, traditional Linux)
- ❌ Require web UI for operations teams
- ❌ Need built-in hardware management (power, BIOS config)
- ❌ Have strict multi-tenancy requirements
- ❌ Need automated hardware discovery/introspection
Hybrid Approaches
Pattern 1: Matchbox + Ansible
- Matchbox: Initial OS provisioning
- Ansible: Post-boot configuration, app deployment
- Works well for stateful services on bare-metal
Pattern 2: Matchbox + Metal³
- Matchbox: Bootstrap initial k8s cluster
- Metal³: Ongoing cluster node lifecycle management
- Gradual migration from Matchbox to Metal³
Pattern 3: Matchbox + Terraform + External Secrets
- Matchbox: Base OS + minimal config
- Ignition: Fetch secrets from Vault/GCP Secret Manager
- Terraform: Orchestrate end-to-end provisioning
Conclusion
Matchbox is a purpose-built, minimalist network boot service optimized for modern immutable operating systems (Ignition-based). It excels in container-centric bare-metal environments, particularly for Kubernetes clusters built with Fedora CoreOS or Flatcar Linux.
Best fit: Organizations adopting immutable infrastructure patterns, container orchestration, and declarative provisioning workflows.
Not ideal for: Heterogeneous OS environments, multi-tenant bare-metal clouds, or teams requiring extensive web UI and built-in hardware management.
For home labs and development, Matchbox offers an excellent balance of simplicity and power. For production Kubernetes deployments, it’s a proven, battle-tested solution (via Typhoon). For complex enterprise provisioning with mixed OS requirements, consider Foreman or MAAS instead.
1.6 - Ubiquiti Dream Machine Pro Analysis
Overview
The Ubiquiti Dream Machine Pro (UDM Pro) is an all-in-one network gateway, router, and switch designed for enterprise and advanced home lab environments. This analysis focuses on its capabilities relevant to infrastructure automation and network boot scenarios.
Key Specifications
Hardware
- Processor: Quad-core ARM Cortex-A57 @ 1.7 GHz
- RAM: 4GB DDR4
- Storage: 128GB eMMC (for UniFi OS, applications, and logs)
- Network Interfaces:
- 1x WAN port (1 Gbps RJ45) plus 1x SFP+ WAN port (10 Gbps)
- 8x LAN ports (1 Gbps RJ45, configurable)
- 1x SFP+ LAN port (10 Gbps)
- Additional Features:
- 3.5" SATA HDD bay (for UniFi Protect surveillance)
- IDS/IPS engine
- Deep packet inspection
- Built-in UniFi Network Controller
Software
- OS: UniFi OS (Linux-based)
- Controller: Built-in UniFi Network Controller
- Services: DHCP, DNS, routing, firewall, VPN (site-to-site and remote access)
Network Boot (PXE) Support
Native DHCP PXE Capabilities
The UDM Pro provides basic PXE boot support through its DHCP server:
Supported:
- DHCP Option 66 (next-server / TFTP server address)
- DHCP Option 67 (filename / boot file name)
- Basic single-architecture PXE booting
Configuration via UniFi Controller:
- Navigate to Settings → Networks → Select your network
- Scroll to DHCP section
- Enable DHCP
- Under Advanced DHCP Options:
- TFTP Server: IP address of your TFTP/PXE server (e.g., 192.168.42.16)
- Boot Filename: Name of the bootloader file (e.g., pxelinux.0 for BIOS or bootx64.efi for UEFI)
Limitations:
- No multi-architecture support: Cannot differentiate boot files based on client architecture (BIOS vs. UEFI, x86_64 vs. ARM64)
- No conditional DHCP options: Cannot vary filename or next-server based on client characteristics
- Fixed boot parameters: One boot configuration for all PXE clients
- Single bootloader only: Must choose either BIOS or UEFI bootloader, not both
Use Cases:
- ✅ Homogeneous environments (all BIOS or all UEFI)
- ✅ Single OS deployment scenarios
- ✅ Simple provisioning workflows
- ❌ Mixed BIOS/UEFI environments (requires external DHCP server with conditional logic)
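One way to confirm the UDM Pro is actually handing out Options 66/67 is to watch a DHCP exchange from a Linux host on the provisioning VLAN (the interface name is an assumption):
sudo tcpdump -i eth0 -vvv -n port 67 or port 68
# In the DHCP offers, look for the TFTP server name (Option 66) and boot file name (Option 67)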
Network Segmentation & VLANs
The UDM Pro excels at network segmentation, critical for infrastructure isolation:
- VLAN Support: Native 802.1Q tagging
- Firewall Rules: Inter-VLAN routing with granular firewall policies
- Network Isolation: Can create fully isolated networks or controlled inter-network traffic
- Use Cases for Infrastructure:
- Management VLAN (for PXE/provisioning)
- Production VLAN (workloads)
- IoT/OT VLAN (isolated devices)
- DMZ (exposed services)
VPN Capabilities
Site-to-Site VPN
- Protocols: IPsec, WireGuard (experimental)
- Use Case: Connect home lab to cloud infrastructure (GCP, AWS, Azure)
- Performance: Hardware-accelerated encryption on UDM Pro
Remote Access VPN
- Protocols: L2TP, OpenVPN
- Use Case: Remote administration of home lab infrastructure
- Integration: Can work with Cloudflare Access for additional security layer
IDS/IPS Engine
- Technology: Suricata-based
- Capabilities:
- Intrusion detection
- Intrusion prevention (can drop malicious traffic)
- Threat signatures updated via UniFi
- Performance Impact: Can affect throughput on high-bandwidth connections
- Recommendation: Enable for security-sensitive infrastructure segments
DNS & DHCP Services
DNS
- Local DNS: Can act as caching DNS resolver
- Custom DNS Records: Limited to UniFi controller hostname
- Recommendation: Use external DNS (Pi-hole, Bind9) for advanced features like split-horizon DNS
DHCP
- Static Leases: Supports MAC-based static IP assignments
- DHCP Options: Can configure common options (NTP, DNS, domain name)
- Reservations: Per-client reservations via GUI
- PXE Options: Basic Option 66/67 support (as noted above)
Integration with Infrastructure-as-Code
UniFi Network API
- REST API: Available for configuration automation
- Python Libraries: pyunifi and others for programmatic access
- Use Cases:
- Terraform provider for network state management
- Ansible modules for configuration automation
- CI/CD integration for network-as-code
Terraform Provider
- Provider: paultyng/unifi
- Capabilities: Manage networks, firewall rules, port forwarding, DHCP settings
- Limitations: Not all UI features exposed via API
Configuration Persistence
- Backup/Restore: JSON-based configuration export
- Version Control: Can track config changes in Git
- Recovery: Auto-backup to cloud (optional)
Performance Characteristics
Throughput
- Routing/NAT: ~3.5 Gbps (without IDS/IPS)
- IDS/IPS Enabled: ~850 Mbps - 1 Gbps
- VPN (IPsec): ~1 Gbps
- Inter-VLAN Routing: Wire speed (8 Gbps backplane)
Scalability
- Concurrent Devices: 500+ clients tested
- VLANs: Up to 32 networks/VLANs
- Firewall Rules: Thousands (performance depends on complexity)
- DHCP Leases: Supports large pools efficiently
Comparison to Alternatives
| Feature | UDM Pro | pfSense | OPNsense | MikroTik |
|---|---|---|---|---|
| Basic PXE | ✅ | ✅ | ✅ | ✅ |
| Conditional DHCP | ❌ | ✅ | ✅ | ✅ |
| All-in-one | ✅ | ❌ | ❌ | Varies |
| GUI Ease-of-use | ✅✅ | ⚠️ | ⚠️ | ❌ |
| API/Automation | ⚠️ | ✅ | ✅ | ✅✅ |
| IDS/IPS Built-in | ✅ | ⚠️ (addon) | ⚠️ (addon) | ❌ |
| Hardware | Fixed | Flexible | Flexible | Flexible |
| Price | $$$ | $ (+ hardware) | $ (+ hardware) | $ - $$$ |
Recommendations for Home Lab Use
Ideal Use Cases
✅ Use the UDM Pro when:
- You want an all-in-one solution with minimal configuration
- You need integrated UniFi controller and network management
- Your home lab has mixed UniFi hardware (switches, APs)
- You want a polished GUI and mobile app management
- Network segmentation and VLANs are critical
Consider Alternatives When
⚠️ Look elsewhere if:
- You need conditional DHCP options or multi-architecture PXE boot
- You require advanced routing protocols (BGP, OSPF beyond basics)
- You need granular firewall control and scripting (pfSense/OPNsense better)
- Budget is tight and you already have x86 hardware (pfSense on old PC)
- You need extremely low latency (sub-1ms) routing
Recommended Configuration for Infrastructure Lab
Network Segmentation:
- VLAN 10: Management (PXE, Ansible, provisioning tools)
- VLAN 20: Kubernetes cluster
- VLAN 30: Storage network (NFS, iSCSI)
- VLAN 40: Public-facing services (behind Cloudflare)
DHCP Strategy:
- Use UDM Pro native DHCP with basic PXE options for single-arch PXE needs
- Static reservations for infrastructure components
- Consider external DHCP server if conditional options are required
Firewall Rules:
- Default deny between VLANs
- Allow management VLAN → all (with source IP restrictions)
- Allow cluster VLAN → storage VLAN (on specific ports)
- NAT only on VLAN 40 (public services)
VPN Configuration:
- Site-to-Site to GCP via WireGuard (lower overhead than IPsec)
- Remote access VPN on separate VLAN with restrictive firewall
Integration:
- Terraform for network state management
- Ansible for DHCP/DNS servers in management VLAN
- Cloudflare Access for secure public service exposure
Conclusion
The UDM Pro is a capable all-in-one network device ideal for home labs that prioritize ease-of-use and integration with the UniFi ecosystem. It provides basic PXE boot support suitable for single-architecture environments, though conditional DHCP options require external DHCP servers for complex scenarios.
For infrastructure automation projects, the UDM Pro serves well as a reliable network foundation that handles VLANs, routing, and basic services, allowing you to focus on higher-level infrastructure concerns like container orchestration and cloud integration.
1.6.1 - UDM Pro VLAN Configuration & Capabilities
Overview
The Ubiquiti Dream Machine Pro (UDM Pro) provides robust VLAN support through native 802.1Q tagging, enabling network segmentation for security, performance, and organizational purposes. This document covers VLAN configuration capabilities, port assignments, and VPN integration.
VLAN Fundamentals on UDM Pro
Supported Standards
- 802.1Q VLAN Tagging: Full support for standard VLAN tagging
- VLAN Range: IDs 1-4094 (standard IEEE 802.1Q range)
- Maximum VLANs: Up to 32 networks/VLANs per device
- Native VLAN: Configurable per port (default: VLAN 1)
VLAN Types
Corporate Network
- Default network type for general-purpose VLANs
- Provides DHCP, inter-VLAN routing, and firewall capabilities
- Can enable/disable guest policies, IGMP snooping, and multicast DNS
Guest Network
- Isolated network with internet-only access
- Automatic firewall rules preventing access to other VLANs
- Captive portal support for guest authentication
IoT Network
- Optimized for IoT devices with device isolation
- Prevents lateral movement between IoT devices
- Allows communication with controller/gateway only
Port-Based VLAN Assignment
Per-Port VLAN Configuration
The UDM Pro’s 8x 1 Gbps LAN ports and SFP/SFP+ ports support flexible VLAN assignment:
Configuration Options per Port:
- Native VLAN/Untagged VLAN: The default VLAN for untagged traffic on the port
- Tagged VLANs: Multiple VLANs that can pass through the port with 802.1Q tags
- Port Profile: Pre-configured VLAN assignments that can be applied to ports
Port Profile Types
All: Port accepts all VLANs (trunk mode)
- Passes all configured VLANs with tags
- Used for connecting managed switches or access points
- Native VLAN for untagged traffic
Specific VLANs: Port limited to selected VLANs
- Choose which VLANs are allowed (tagged)
- Set native/untagged VLAN
- Used for controlled trunk links
Single VLAN: Access port mode
- Port carries only one VLAN (untagged)
- All traffic on this port belongs to specified VLAN
- Used for end devices (PCs, servers, printers)
Configuration Steps
Via UniFi Controller GUI:
Create Port Profile:
- Navigate to Settings → Profiles → Port Manager
- Click Create New Port Profile
- Select profile type (All, LAN, or Custom)
- Configure VLAN settings:
- Native VLAN/Network: Untagged VLAN
- Tagged VLANs: Select allowed VLANs (for trunk mode)
- Enable/disable settings: PoE, Storm Control, Port Isolation
Assign Profile to Ports:
- Navigate to UniFi Devices → Select UDM Pro
- Go to Ports tab
- For each LAN port (1-8) or SFP port:
- Click port to edit
- Select Port Profile from dropdown
- Apply changes
Quick Port Assignment (Alternative):
- Settings → Networks → Select VLAN
- Under Port Manager, assign specific ports to this network
- Ports become access ports for this VLAN
Example Port Layout
UDM Pro Port Assignment Example:
Port 1: Native VLAN 10 (Management) - Access Mode
└── Use: Ansible control server
Port 2: Native VLAN 20 (Kubernetes) - Access Mode
└── Use: K8s master node
Port 3: Native VLAN 30 (Storage) - Access Mode
└── Use: NAS/SAN device
Port 4: Native VLAN 1, Tagged: 10,20,30,40 - Trunk Mode
└── Use: Managed switch uplink
Port 5-7: Native VLAN 40 (DMZ) - Access Mode
└── Use: Public-facing servers
Port 8: Native VLAN 1 (Default/Untagged) - Access Mode
└── Use: Management laptop (temporary)
SFP+: Native VLAN 1, Tagged: All - Trunk Mode
└── Use: 10G uplink to core switch
VLAN Features and Capabilities
Inter-VLAN Routing
Enabled by Default:
- Hardware-accelerated routing between VLANs
- Wire-speed performance (8 Gbps backplane)
- Routing decisions made at Layer 3
Firewall Control:
- Default behavior: Allow all inter-VLAN traffic
- Recommended: Create explicit allow/deny rules per VLAN pair
- Granular control: Protocol, port, source/destination filtering
Example Firewall Rules:
Rule 1: Allow Management (VLAN 10) → All VLANs
Source: 192.168.10.0/24
Destination: Any
Action: Accept
Rule 2: Allow K8s (VLAN 20) → Storage (VLAN 30) - NFS only
Source: 192.168.20.0/24
Destination: 192.168.30.0/24
Ports: 2049 (NFS), 111 (Portmapper)
Action: Accept
Rule 3: Block IoT (VLAN 50) → All Private Networks
Source: 192.168.50.0/24
Destination: 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12
Action: Drop
Rule 4 (Implicit): Default Deny Between VLANs
Source: Any
Destination: Any
Action: Drop
DHCP per VLAN
Each VLAN can have its own DHCP server:
- Independent IP ranges per VLAN
- Separate DHCP options (DNS, gateway, NTP, domain)
- Static DHCP reservations per VLAN
- PXE boot options (Option 66/67) per network
Configuration:
- Settings → Networks → Select VLAN
- DHCP section:
- Enable DHCP server
- Define IP range (e.g., 192.168.10.100-192.168.10.254)
- Set lease time
- Configure gateway (usually UDM Pro’s IP on this VLAN)
- Add custom DHCP options
Example DHCP Configuration:
VLAN 10 (Management):
Subnet: 192.168.10.0/24
Gateway: 192.168.10.1 (UDM Pro)
DHCP Range: 192.168.10.100-192.168.10.200
DNS: 192.168.10.10 (local DNS server)
TFTP Server (Option 66): 192.168.10.16
Boot Filename (Option 67): pxelinux.0
VLAN 20 (Kubernetes):
Subnet: 192.168.20.0/24
Gateway: 192.168.20.1 (UDM Pro)
DHCP Range: 192.168.20.50-192.168.20.99
DNS: 8.8.8.8, 8.8.4.4
Domain Name: k8s.lab.local
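If conditional or multi-architecture options are ever needed beyond what the UDM Pro GUI exposes, the same per-VLAN values can be expressed on an external DHCP server. A minimal dnsmasq-style sketch for VLAN 10, reusing the example values above (dnsmasq is only an illustration here, not a UDM Pro feature):
# /etc/dnsmasq.d/vlan10.conf on an external DHCP server (illustrative values)
cat <<'EOF' | sudo tee /etc/dnsmasq.d/vlan10.conf
dhcp-range=set:vlan10,192.168.10.100,192.168.10.200,12h
dhcp-option=tag:vlan10,option:router,192.168.10.1
dhcp-option=tag:vlan10,option:dns-server,192.168.10.10
# Option 66/67 equivalent: boot filename plus TFTP server address
dhcp-boot=tag:vlan10,pxelinux.0,,192.168.10.16
EOF
sudo systemctl restart dnsmasq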
VLAN Isolation
Guest Portal Isolation:
- Guest networks auto-configured with isolation rules
- Prevents access to RFC1918 private networks
- Internet-only access by default
Manual Isolation (Firewall Rules):
- Create LAN In rules to block inter-VLAN traffic
- Use groups for easier management of multiple VLANs
- Apply port isolation for additional security
Device Isolation (IoT Networks):
- Prevents devices on same VLAN from communicating
- Only controller/gateway access allowed
- Use for untrusted IoT devices (cameras, smart home)
VPN and VLAN Integration
Site-to-Site VPN VLAN Assignment
✅ VLANs CAN be assigned to site-to-site VPN connections:
WireGuard VPN:
- Configure remote subnet to map to specific local VLAN
- Example: GCP subnet 10.128.0.0/20 → routed through VLAN 10
- Routing table automatically updated
- Firewall rules apply to VPN traffic
IPsec Site-to-Site:
- Specify local networks (can select specific VLANs)
- Remote networks configured in tunnel settings
- Multiple VLANs can traverse single VPN tunnel
- Perfect Forward Secrecy supported
Configuration Steps:
- Settings → VPN → Site-to-Site VPN
- Create New VPN tunnel (WireGuard or IPsec)
- Under Local Networks, select VLANs to include:
- Option 1: Select “All” networks
- Option 2: Choose specific VLANs (e.g., VLAN 10, 20 only)
- Configure Remote Networks (cloud provider subnets)
- Set encryption parameters and pre-shared keys
- Create Firewall Rules for VPN traffic:
- Allow specific VLAN → VPN tunnel
- Control which VLANs can reach remote networks
Example Site-to-Site Config:
Home Lab → GCP WireGuard VPN
Local Networks:
- VLAN 10 (Management): 192.168.10.0/24
- VLAN 20 (Kubernetes): 192.168.20.0/24
Remote Networks:
- GCP VPC: 10.128.0.0/20
Firewall Rules:
- Allow VLAN 10 → GCP VPC (all protocols)
- Allow VLAN 20 → GCP VPC (HTTPS, kubectl API only)
- Block all other VLANs from VPN tunnel
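On the WireGuard peer at the GCP end, the VLAN selection above ultimately shows up as the peer's allowed IPs. A minimal sketch with the wg tool (the interface name and key are placeholders):
# Only VLAN 10 and VLAN 20 are routable back through the tunnel
sudo wg set wg0 peer <UDM-PRO-PUBLIC-KEY> \
  allowed-ips 192.168.10.0/24,192.168.20.0/24
# The protocol restrictions (HTTPS/kubectl only for VLAN 20) remain in the UDM Pro firewall rules above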
Remote Access VPN VLAN Assignment
✅ VLANs CAN be assigned to remote access VPN clients:
L2TP/IPsec Remote Access:
- VPN clients land on a specific VLAN
- Default: All clients in same VPN subnet
- Firewall rules control VLAN access from VPN
OpenVPN Remote Access (via UniFi Network Application addon):
- Not natively built into UDM Pro
- Requires UniFi Network Application 6.0+
- Can route VPN clients to specific VLAN
Teleport VPN (UniFi’s solution):
- Built-in remote access VPN
- Clients route through UDM Pro
- Can access specific VLANs based on firewall rules
- Layer 3 routing to VLANs
Configuration:
- Settings → VPN → Remote Access
- Enable L2TP or configure Teleport
- Set VPN Network (e.g., 192.168.100.0/24)
- Advanced:
- Enable access to specific VLANs
- By default, VPN network is treated as separate VLAN
- Firewall Rules to allow VPN → VLANs:
- Source: VPN network (192.168.100.0/24)
- Destination: VLAN 10, VLAN 20 (or specific resources)
- Action: Accept
Example Remote Access Config:
Remote VPN Users → Home Lab Access
VPN Network: 192.168.100.0/24
VPN Gateway: 192.168.100.1 (UDM Pro)
Firewall Rules:
Rule 1: Allow VPN → Management VLAN (admin users)
Source: 192.168.100.0/24
Dest: 192.168.10.0/24
Ports: SSH (22), HTTPS (443)
Rule 2: Allow VPN → Kubernetes VLAN (developers)
Source: 192.168.100.0/24
Dest: 192.168.20.0/24
Ports: kubectl (6443), app ports (8080-8090)
Rule 3: Block VPN → Storage VLAN (security)
Source: 192.168.100.0/24
Dest: 192.168.30.0/24
Action: Drop
VPN VLAN Routing Limitations
Current Limitations:
- Cannot assign individual VPN clients to different VLANs dynamically
- No VLAN assignment based on user identity (all clients in same VPN network)
- RADIUS integration does not support per-user VLAN assignment for VPN
- For per-user VLAN control, use firewall rules based on source IP
Workarounds:
- Use firewall rules with VPN client IP ranges for granular access
- Deploy separate VPN tunnels for different access levels
- Use RADIUS for authentication + firewall rules for authorization
VLAN Best Practices for Home Lab
Network Segmentation Strategy
Recommended VLAN Layout:
VLAN 1: Default/Management (UDM Pro access)
VLAN 10: Infrastructure Management (Ansible, PXE, monitoring)
VLAN 20: Kubernetes Cluster (control plane + workers)
VLAN 30: Storage Network (NFS, iSCSI, object storage)
VLAN 40: DMZ/Public Services (exposed to internet via Cloudflare)
VLAN 50: IoT Devices (isolated smart home devices)
VLAN 60: Guest Network (visitor WiFi, untrusted devices)
VLAN 100: VPN Remote Access (remote admin/dev access)
Firewall Policy Design
Default Deny Approach:
- Create explicit allow rules for necessary traffic
- Set implicit deny for all inter-VLAN traffic
- Log dropped packets for troubleshooting
Rule Order (top to bottom):
- Management VLAN → All (with source IP restrictions)
- Kubernetes → Storage (specific ports)
- DMZ → Internet (outbound only)
- VPN → Specific VLANs (based on role)
- All → Internet (NAT)
- Block RFC1918 from DMZ
- Drop all (implicit)
Performance Optimization
VLAN Routing Performance:
- Inter-VLAN routing is hardware-accelerated
- No performance penalty for multiple VLANs
- Use VLAN tagging on trunk ports to reduce switch load
Multicast and Broadcast Control:
- Enable IGMP snooping per VLAN for multicast efficiency
- Disable multicast DNS (mDNS) between VLANs if not needed
- Use multicast routing for cross-VLAN multicast (advanced)
Advanced VLAN Features
VLAN-Specific Services
DNS per VLAN:
- Configure different DNS servers per VLAN via DHCP
- Example: Management VLAN uses local DNS, DMZ uses public DNS
NTP per VLAN:
- DHCP Option 42 for NTP server
- Different time sources per network segment
Domain Name per VLAN:
- DHCP Option 15 for domain name
- Useful for split-horizon DNS setups
VLAN Tagging on WiFi
UniFi WiFi Integration:
- Each WiFi SSID can map to a specific VLAN
- Multiple SSIDs on same AP → different VLANs
- Seamless VLAN tagging for wireless clients
Configuration:
- Create WiFi network in UniFi Controller
- Assign VLAN ID to SSID
- Client traffic automatically tagged
VLAN Monitoring and Troubleshooting
Traffic Statistics:
- Per-VLAN bandwidth usage visible in UniFi Controller
- Deep Packet Inspection (DPI) provides application-level stats
- Export data for analysis in external tools
Debugging Tools:
- Port mirroring for packet capture
- Flow logs for traffic analysis
- Firewall logs show inter-VLAN blocks
Common Issues:
- VLAN not working: Check port profile assignment and native VLAN config
- No inter-VLAN routing: Verify firewall rules aren’t blocking traffic
- DHCP not working on VLAN: Ensure DHCP server enabled on that network
- VPN can’t reach VLAN: Check VPN local networks include the VLAN
Summary
VLAN Port Assignment: ✅ YES
The UDM Pro fully supports port-based VLAN assignment:
- Individual ports can be assigned to specific VLANs (access mode)
- Ports can carry multiple tagged VLANs (trunk mode)
- Native/untagged VLAN configurable per port
- Port profiles simplify configuration across multiple devices
VPN VLAN Assignment: ✅ YES
VLANs can be assigned to VPN connections:
- Site-to-Site VPN: Select which VLANs traverse the tunnel
- Remote Access VPN: VPN clients route to specific VLANs via firewall rules
- Routing Control: Full control over which VLANs are accessible via VPN
- Limitations: No per-user VLAN assignment; use firewall rules for granular access
Key Capabilities
- Up to 32 VLANs supported
- Hardware-accelerated inter-VLAN routing
- Per-VLAN DHCP, DNS, and firewall policies
- Full integration with UniFi WiFi for SSID-to-VLAN mapping
- Flexible port profiles for easy configuration
- VPN integration for both site-to-site and remote access scenarios
2 - Architecture Decision Records
Architecture Decision Records (ADRs)
This section contains architectural decision records that document the key design choices made. Each ADR follows the MADR 4.0.0 format and includes:
- Context and problem statement
- Decision drivers and constraints
- Considered options with pros and cons
- Decision outcome and rationale
- Consequences (positive and negative)
- Confirmation methods
ADR Categories
ADRs are classified into three categories:
- Strategic - High-level architectural decisions affecting the entire system (frameworks, authentication strategies, cross-cutting patterns). Use for foundational technology choices.
- User Journey - Decisions solving specific user journey problems. More tactical than strategic, but still architectural. Use when evaluating approaches to implement user-facing features.
- API Design - API endpoint implementation decisions (pagination, filtering, bulk operations). Use for significant API design trade-offs that warrant documentation.
Status Values
Each ADR has a status that reflects its current state:
- proposed - Decision is under consideration
- accepted - Decision has been approved and should be implemented
- rejected - Decision was considered but not approved
- deprecated - Decision is no longer relevant or has been superseded
- superseded by ADR-XXXX - Decision has been replaced by a newer ADR
These records provide historical context for architectural decisions and help ensure consistency across the platform.
2.1 - [0001] Use MADR for Architecture Decision Records
Context and Problem Statement
As the project grows, architectural decisions are made that have long-term impacts on the system’s design, maintainability, and scalability. Without a structured way to document these decisions, we risk losing the context and rationale behind important choices, making it difficult for current and future team members to understand why certain approaches were taken.
How should we document architectural decisions in a way that is accessible, maintainable, and provides sufficient context for future reference?
Decision Drivers
- Need for clear documentation of architectural decisions and their rationale
- Easy accessibility and searchability of past decisions
- Low barrier to entry for creating and maintaining decision records
- Integration with existing documentation workflow
- Version control friendly format
- Industry-standard approach that team members may already be familiar with
Considered Options
- MADR (Markdown Architectural Decision Records)
- ADR using custom format
- Wiki-based documentation
- No formal ADR process
Decision Outcome
Chosen option: “MADR (Markdown Architectural Decision Records)”, because it provides a well-established, standardized format that is lightweight, version-controlled, and integrates seamlessly with our existing documentation structure. MADR 4.0.0 offers a clear template that captures all necessary information while remaining flexible enough for different types of decisions.
Consequences
- Good, because MADR is a widely adopted standard with clear documentation and examples
- Good, because markdown files are easy to create, edit, and review through pull requests
- Good, because ADRs will be version-controlled alongside code, maintaining historical context
- Good, because the format is flexible enough to accommodate strategic, user-journey, and API design decisions
- Good, because team members can easily search and reference past decisions
- Neutral, because requires discipline to maintain and update ADR status as decisions evolve
- Bad, because team members need to learn and follow the MADR format conventions
Confirmation
Compliance will be confirmed through:
- Code reviews ensuring new architectural decisions are documented as ADRs
- ADRs are stored in docs/content/r&d/adrs/ following the naming convention NNNN-title-with-dashes.md
- Regular reviews during architecture discussions to reference and update existing ADRs
Pros and Cons of the Options
MADR (Markdown Architectural Decision Records)
MADR 4.0.0 is a standardized format for documenting architectural decisions using markdown.
- Good, because it’s a well-established standard with extensive documentation
- Good, because markdown is simple, portable, and version-control friendly
- Good, because it provides a clear structure while remaining flexible
- Good, because it integrates with static site generators and documentation tools
- Good, because it’s lightweight and doesn’t require special tools
- Neutral, because it requires some initial learning of the format
- Neutral, because maintaining consistency requires discipline
ADR using custom format
Create our own custom format for architectural decision records.
- Good, because we can tailor it exactly to our needs
- Bad, because it requires defining and maintaining our own standard
- Bad, because new team members won’t be familiar with the format
- Bad, because we lose the benefits of community knowledge and tooling
- Bad, because it may evolve inconsistently over time
Wiki-based documentation
Use a wiki system (like Confluence, Notion, or GitHub Wiki) to document decisions.
- Good, because wikis provide easy editing and hyperlinking
- Good, because some team members may be familiar with wiki tools
- Neutral, because it may or may not integrate with version control
- Bad, because content may not be version-controlled alongside code
- Bad, because it creates a separate system to maintain
- Bad, because it’s harder to review changes through standard PR process
- Bad, because portability and long-term accessibility may be concerns
No formal ADR process
Continue without a structured approach to documenting architectural decisions.
- Good, because it requires no additional overhead
- Bad, because context and rationale for decisions are lost over time
- Bad, because new team members struggle to understand why decisions were made
- Bad, because it leads to repeated discussions of previously settled questions
- Bad, because it makes it difficult to track when decisions should be revisited
More Information
- MADR 4.0.0 specification: https://adr.github.io/madr/
- ADRs will be categorized as: strategic, user-journey, or api-design
- ADR status values: proposed | accepted | rejected | deprecated | superseded by ADR-XXXX
- All ADRs are stored in the docs/content/r&d/adrs/ directory
2.2 - [0002] Network Boot Architecture for Home Lab
Context and Problem Statement
When setting up a home lab infrastructure, servers need to be provisioned and booted over the network using PXE (Preboot Execution Environment). This requires a TFTP/HTTP server to serve boot files to requesting machines. The question is: where should this boot server be hosted to balance security, reliability, cost, and operational complexity?
Decision Drivers
- Security: Minimize attack surface and ensure only authorized servers receive boot files
- Reliability: Boot process should be resilient and not dependent on external network connectivity
- Cost: Minimize ongoing infrastructure costs
- Complexity: Keep the operational burden manageable
- Trust Model: Clear verification of requesting server identity
Considered Options
- Option 1: TFTP/HTTP server locally on home lab network
- Option 2: TFTP/HTTP server on public cloud (without VPN)
- Option 3: TFTP/HTTP server on public cloud (with VPN)
Decision Outcome
Chosen option: “Option 3: TFTP/HTTP server on public cloud (with VPN)”, because:
- No local machine management: Unlike Option 1, this avoids the need to maintain dedicated local hardware for the boot server, reducing operational overhead
- Secure protocol support: The VPN tunnel encrypts all traffic, allowing unsecured protocols like TFTP to be used without risk of data exposure over public internet routes (unlike Option 2)
- Cost-effective VPN: The UDM Pro natively supports WireGuard, enabling a self-managed VPN solution that avoids expensive managed VPN services (~$180-300/year vs ~$540-900/year)
Consequences
- Good, because all traffic is encrypted through WireGuard VPN tunnel
- Good, because boot server is not exposed to public internet (no public attack surface)
- Good, because trust model is simple - subnet validation similar to local option
- Good, because centralized cloud management reduces local maintenance burden
- Good, because boot server remains available even if home lab storage fails
- Good, because UDM Pro’s native WireGuard support keeps costs at ~$180-300/year
- Bad, because boot process depends on both internet connectivity and VPN availability
- Bad, because VPN adds latency to boot file transfers
- Bad, because VPN gateway becomes an additional failure point
- Bad, because higher ongoing cost compared to local-only option (~$180-300/year vs ~$10/year)
Confirmation
The implementation will be confirmed by:
- Successfully network booting a test server using the chosen architecture
- Validating the trust model prevents unauthorized boot requests
- Measuring actual costs against estimates
Pros and Cons of the Options
Option 1: TFTP/HTTP server locally on home lab network
Run the boot server on local infrastructure (e.g., Raspberry Pi, dedicated VM, or container) within the home lab network.
Boot Flow Sequence
sequenceDiagram
participant Server as Home Lab Server
participant DHCP as Local DHCP Server
participant Boot as Local TFTP/HTTP Server
Server->>DHCP: PXE Boot Request (DHCP Discover)
DHCP->>Server: DHCP Offer with Boot Server IP
Server->>Boot: TFTP Request for Boot File
Boot->>Boot: Verify MAC/IP against allowlist
Boot->>Server: Send iPXE/Boot Loader
Server->>Boot: HTTP Request for Kernel/Initrd
Boot->>Server: Send Boot Files
Server->>Server: Boot into OS
Trust Model
- MAC Address Allowlist: Maintain a list of known server MAC addresses
- Network Isolation: Boot server only accessible from home lab VLAN
- No external exposure: Traffic never leaves local network
- Physical security: Relies on physical access control to home lab
Cost Estimate
- Hardware: ~$50-100 one-time (Raspberry Pi or repurposed hardware)
- Power: ~$5-10/year (low power consumption)
- Total: ~$55-110 initial + ~$10/year ongoing
Pros and Cons
- Good, because no dependency on internet connectivity for booting
- Good, because lowest latency for boot file transfers
- Good, because all data stays within local network (maximum privacy)
- Good, because lowest ongoing cost
- Good, because simple trust model based on network isolation
- Neutral, because requires dedicated local hardware or resources
- Bad, because single point of failure if boot server goes down
- Bad, because requires local maintenance and updates
Option 2: TFTP/HTTP server on public cloud (without VPN)
Host the boot server on a cloud provider (AWS, GCP, Azure) and expose it directly to the internet.
Boot Flow Sequence
sequenceDiagram
participant Server as Home Lab Server
participant DHCP as Local DHCP Server
participant Router as Home Router/NAT
participant Internet as Internet
participant Boot as Cloud TFTP/HTTP Server
Server->>DHCP: PXE Boot Request (DHCP Discover)
DHCP->>Server: DHCP Offer with Cloud Boot Server IP
Server->>Router: TFTP Request
Router->>Internet: NAT Translation
Internet->>Boot: TFTP Request from Home IP
Boot->>Boot: Verify source IP + token/certificate
Boot->>Internet: Send iPXE/Boot Loader
Internet->>Router: Response
Router->>Server: Boot Loader
Server->>Router: HTTP Request for Kernel/Initrd
Router->>Internet: NAT Translation
Internet->>Boot: HTTP Request with auth headers
Boot->>Boot: Validate request authenticity
Boot->>Internet: Send Boot Files
Internet->>Router: Response
Router->>Server: Boot Files
Server->>Server: Boot into OS
Trust Model
- Source IP Validation: Restrict to home lab’s public IP (dynamic IP is problematic)
- Certificate/Token Authentication: Embed certificates in initial bootloader
- TLS for HTTP: All HTTP traffic encrypted
- Challenge-Response: Boot server can challenge requesting server
- Risk: TFTP typically unencrypted, vulnerable to interception
Cost Estimate
- Cloud VM (t3.micro or equivalent): ~$10-15/month
- Data Transfer: ~$1-5/month (boot files are typically small)
- Static IP: ~$3-5/month
- Total: ~$170-300/year
Pros and Cons
- Good, because boot server remains available even if home lab has issues
- Good, because centralized management in cloud console
- Good, because easy to scale or replicate
- Neutral, because requires internet connectivity for every boot
- Bad, because significantly higher ongoing cost
- Bad, because TFTP protocol is inherently insecure over public internet
- Bad, because complex trust model required (IP validation, certificates)
- Bad, because boot process depends on internet availability
- Bad, because higher latency for boot file transfers
- Bad, because public exposure increases attack surface
Option 3: TFTP/HTTP server on public cloud (with VPN)
Host the boot server in the cloud but connect the home lab to the cloud via a site-to-site VPN tunnel.
Boot Flow Sequence
sequenceDiagram
participant Server as Home Lab Server
participant DHCP as Local DHCP Server
participant VPN as VPN Gateway (Home)
participant CloudVPN as VPN Gateway (Cloud)
participant Boot as Cloud TFTP/HTTP Server
Note over VPN,CloudVPN: Site-to-Site VPN Tunnel Established
Server->>DHCP: PXE Boot Request (DHCP Discover)
DHCP->>Server: DHCP Offer with Boot Server Private IP
Server->>VPN: TFTP Request to Private IP
VPN->>CloudVPN: Encrypted VPN Tunnel
CloudVPN->>Boot: TFTP Request (appears local)
Boot->>Boot: Verify source IP from home lab subnet
Boot->>CloudVPN: Send iPXE/Boot Loader
CloudVPN->>VPN: Encrypted Response
VPN->>Server: Boot Loader
Server->>VPN: HTTP Request for Kernel/Initrd
VPN->>CloudVPN: Encrypted VPN Tunnel
CloudVPN->>Boot: HTTP Request
Boot->>Boot: Validate subnet membership
Boot->>CloudVPN: Send Boot Files
CloudVPN->>VPN: Encrypted Response
VPN->>Server: Boot Files
Server->>Server: Boot into OS
Trust Model
- VPN Tunnel Encryption: All traffic encrypted end-to-end
- Private IP Addressing: Boot server only accessible via VPN
- Subnet Validation: Verify requests come from trusted home lab subnet
- VPN Authentication: Strong auth at tunnel level (certificates, pre-shared keys)
- No public exposure: Boot server has no public IP
Cost Estimate
- Cloud VM (t3.micro or equivalent): ~$10-15/month
- Data Transfer (VPN): ~$5-10/month
- VPN Gateway Service (if using managed): ~$30-50/month OR
- Self-managed VPN (WireGuard/OpenVPN): ~$0 additional
- Total (self-managed VPN): ~$180-300/year
- Total (managed VPN): ~$540-900/year
Pros and Cons
- Good, because all traffic encrypted through VPN tunnel
- Good, because boot server not exposed to public internet
- Good, because trust model similar to local option (subnet validation)
- Good, because centralized cloud management benefits
- Good, because boot server available if home lab storage fails
- Neutral, because moderate complexity (VPN setup and maintenance)
- Bad, because higher cost than local option
- Bad, because boot process still depends on internet + VPN availability
- Bad, because VPN adds latency to boot process
- Bad, because VPN gateway becomes additional failure point
- Bad, because most expensive option if using managed VPN service
More Information
Related Resources
Key Questions for Decision
- How critical is boot availability during internet outages?
- Is the home lab public IP static or dynamic?
- What is the acceptable boot time latency?
- How many servers need to be supported?
- Is there existing VPN infrastructure?
Related Issues
- Issue #595 - story(docs): create adr for network boot architecture
2.3 - [0003] Cloud Provider Selection for Network Boot Infrastructure
Context and Problem Statement
ADR-0002 established that network boot infrastructure will be hosted on a cloud provider and accessed via VPN (specifically WireGuard from the UDM Pro). The decision to use cloud hosting provides resilience against local hardware failures while maintaining security through encrypted VPN tunnels.
The question now is: Which cloud provider should host the network boot infrastructure?
This decision will affect:
- Cost: Ongoing monthly/annual infrastructure costs
- Protocol Support: Ability to serve TFTP, HTTP, and HTTPS boot files
- VPN Integration: Ease of WireGuard deployment and management
- Operational Complexity: Management overhead and maintenance burden
- Performance: Boot file transfer latency and throughput
- Vendor Lock-in: Future flexibility to migrate or multi-cloud
Decision Drivers
- Cost Efficiency: Minimize ongoing infrastructure costs for home lab scale
- Protocol Support: Must support TFTP (UDP/69), HTTP (TCP/80), and HTTPS (TCP/443) for network boot workflows
- WireGuard Compatibility: Must support self-managed WireGuard VPN with reasonable effort
- UDM Pro Integration: Should work seamlessly with UniFi Dream Machine Pro’s native WireGuard client
- Simplicity: Minimize operational complexity for a single-person home lab
- Existing Expertise: Leverage existing team knowledge and infrastructure
- Performance: Sufficient throughput and low latency for boot file transfers (50-200MB per boot)
Considered Options
- Option 1: Google Cloud Platform (GCP)
- Option 2: Amazon Web Services (AWS)
Decision Outcome
Chosen option: “Option 1: Google Cloud Platform (GCP)”, because:
- Existing Infrastructure: The home lab already uses GCP extensively (Cloud Run services, load balancers, mTLS infrastructure per existing codebase), reducing operational overhead and leveraging existing expertise
- Comparable Costs: Both providers offer similar costs for the required infrastructure (~$6-12/month for compute + VPN), with GCP’s e2-micro being sufficient
- Equivalent Protocol Support: Both support TFTP/HTTP/HTTPS via direct VM access (load balancers unnecessary for single boot server), meeting all protocol requirements
- WireGuard Compatibility: Both require self-managed WireGuard deployment (neither has native WireGuard support), with nearly identical implementation complexity
- Unified Management: Consolidating all cloud infrastructure on GCP simplifies monitoring, billing, IAM, and operational workflows
While AWS would be a viable alternative (especially with t4g.micro ARM instances offering slightly better price/performance), the existing GCP investment makes it the pragmatic choice to avoid multi-cloud complexity.
Consequences
- Good, because consolidates all cloud infrastructure on a single provider (reduced operational complexity)
- Good, because leverages existing GCP expertise and IAM configurations
- Good, because unified Cloud Monitoring/Logging across all services
- Good, because single cloud bill simplifies cost tracking
- Good, because existing Terraform modules and patterns can be reused
- Good, because GCP’s e2-micro instances (~$6.50/month) are cost-effective for the workload
- Good, because self-managed WireGuard provides flexibility and low cost (~$10/month total)
- Neutral, because both providers have comparable protocol support (TFTP/HTTP/HTTPS via VM)
- Neutral, because both require self-managed WireGuard (no native support)
- Bad, because creates vendor lock-in to GCP (migration would require relearning and reconfiguration)
- Bad, because foregoes AWS’s slightly cheaper t4g.micro ARM instances (~$6/month vs GCP’s ~$6.50/month)
- Bad, because multi-cloud strategy could provide redundancy (accepted trade-off for simplicity)
Confirmation
The implementation will be confirmed by:
- Successfully deploying WireGuard VPN gateway on GCP Compute Engine
- Establishing site-to-site VPN tunnel between UDM Pro and GCP
- Network booting a test server via VPN using TFTP and HTTP protocols (see the smoke-test sketch below)
- Measuring actual costs against estimates (~$10-15/month)
- Validating boot performance (transfer time < 30 seconds for typical boot)
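Once the tunnel is up, the boot-path and performance checks above can be spot-checked from a home lab host by timing a transfer of a representative boot file (the boot server's private IP and the file names are placeholders):
# Rough smoke test of HTTP and TFTP over the VPN
time curl -fsSo /dev/null http://10.128.0.5/images/vmlinuz
tftp 10.128.0.5 -c get ipxe.efi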
Pros and Cons of the Options
Option 1: Google Cloud Platform (GCP)
Host network boot infrastructure on Google Cloud Platform.
Architecture Overview
graph TB
subgraph "Home Lab Network"
A[Home Lab Servers]
B[UDM Pro - WireGuard Client]
end
subgraph "GCP VPC"
C[WireGuard Gateway VM<br/>e2-micro]
D[Boot Server VM<br/>e2-micro]
C -->|VPC Routing| D
end
A -->|PXE Boot Request| B
B -->|WireGuard Tunnel| C
C -->|TFTP/HTTP/HTTPS| D
D -->|Boot Files| C
C -->|Encrypted Response| B
B -->|Boot Files| A
Implementation Details
Compute:
- WireGuard Gateway: e2-micro VM (~$6.50/month) running Ubuntu 22.04
- Self-managed WireGuard server
- IP forwarding enabled
- Static external IP (~$3.50/month if VM ever stops)
- Boot Server: e2-micro VM (same or consolidated with gateway)
- TFTP server (tftpd-hpa)
- HTTP server (nginx or simple Python server)
- Optional HTTPS with self-signed cert or Let's Encrypt
Networking:
- VPC: Default VPC or custom VPC with private subnets
- Firewall Rules (see the gcloud sketch after this list):
- Allow UDP/51820 from home lab public IP (WireGuard)
- Allow UDP/69, TCP/80, TCP/443 from VPN subnet (boot protocols)
- Routes: Custom route to direct home lab subnet through WireGuard gateway
- Cloud VPN: Not used (self-managed WireGuard instead to save ~$65/month)
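The two firewall rules above map to two gcloud commands. A minimal sketch (rule names, the default network, and the home lab public IP 203.0.113.10 are placeholders; the 10.99.0.0/24 tunnel subnet is an assumption):
# Allow WireGuard handshakes only from the home lab's public IP
gcloud compute firewall-rules create allow-wireguard-from-homelab \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=udp:51820 --source-ranges=203.0.113.10/32
# Allow boot protocols only from the VPN tunnel subnet
gcloud compute firewall-rules create allow-boot-from-vpn \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=udp:69,tcp:80,tcp:443 --source-ranges=10.99.0.0/24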
WireGuard Setup:
- Install WireGuard on Compute Engine VM
- Configure the wg0 interface with PostUp/PostDown iptables rules (see the sketch below)
- Store private key in Secret Manager
- UDM Pro connects as WireGuard peer
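A minimal sketch of the gateway-side configuration, assuming a 10.99.0.0/24 tunnel subnet, an ens4 VPC interface, and placeholder keys (none of these values come from the analysis above):
# /etc/wireguard/wg0.conf on the Compute Engine VM (placeholders throughout)
sudo tee /etc/wireguard/wg0.conf >/dev/null <<'EOF'
[Interface]
Address = 10.99.0.1/24
ListenPort = 51820
# In practice the private key is pulled from Secret Manager at deploy time
PrivateKey = <gateway-private-key>
# Forward and NAT traffic between the tunnel and the VPC NIC
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE
[Peer]
# UDM Pro peer: home lab subnets are allowed back through the tunnel
PublicKey = <udm-pro-public-key>
AllowedIPs = 10.99.0.2/32, 192.168.10.0/24, 192.168.20.0/24
EOF
sudo sysctl -w net.ipv4.ip_forward=1
sudo systemctl enable --now wg-quick@wg0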
Cost Breakdown (US regions):
| Component | Monthly Cost |
|---|---|
| e2-micro VM (WireGuard + Boot) | ~$6.50 |
| Static External IP (if attached) | ~$3.50 |
| Egress (10 boots × 150MB) | ~$0.18 |
| Total | ~$10.18 |
| Annual | ~$122 |
Pros and Cons
- Good, because existing home lab infrastructure already uses GCP extensively
- Good, because consolidates all cloud resources on single provider (unified billing, IAM, monitoring)
- Good, because leverages existing GCP expertise and Terraform modules
- Good, because Cloud Monitoring/Logging already configured for other services
- Good, because Secret Manager integration for WireGuard key storage
- Good, because e2-micro instance size is sufficient for network boot workload
- Good, because low cost (~$10/month for self-managed WireGuard)
- Good, because VPC networking is familiar and well-documented
- Neutral, because requires self-managed WireGuard (no native support, same as AWS)
- Neutral, because TFTP/HTTP/HTTPS served directly from VM (no special GCP features needed)
- Bad, because slightly more expensive than AWS t4g.micro (~$6.50/month vs ~$6/month)
- Bad, because creates vendor lock-in to GCP ecosystem
- Bad, because Cloud VPN (managed IPsec) is expensive (~$73/month), so must use self-managed WireGuard
Option 2: Amazon Web Services (AWS)
Host network boot infrastructure on Amazon Web Services.
Architecture Overview
graph TB
subgraph "Home Lab Network"
A[Home Lab Servers]
B[UDM Pro - WireGuard Client]
end
subgraph "AWS VPC"
C[WireGuard Gateway EC2<br/>t4g.micro]
D[Boot Server EC2<br/>t4g.micro]
C -->|VPC Routing| D
end
A -->|PXE Boot Request| B
B -->|WireGuard Tunnel| C
C -->|TFTP/HTTP/HTTPS| D
D -->|Boot Files| C
C -->|Encrypted Response| B
B -->|Boot Files| A
Implementation Details
Compute:
- WireGuard Gateway: t4g.micro EC2 (~$6/month, ARM-based Graviton)
- Self-managed WireGuard server
- Source/Dest check disabled for IP forwarding
- Elastic IP (free when attached to running instance)
- Boot Server: t4g.micro EC2 (same or consolidated with gateway)
- TFTP server (tftpd-hpa)
- HTTP server (nginx)
- Optional HTTPS with Let's Encrypt or self-signed cert
Networking:
- VPC: Default VPC or custom VPC with private subnets
- Security Groups:
- WireGuard SG: Allow UDP/51820 from home lab public IP
- Boot Server SG: Allow UDP/69, TCP/80, TCP/443 from WireGuard SG
- Route Table: Add route for home lab subnet via WireGuard instance
- Site-to-Site VPN: Not used (self-managed WireGuard saves ~$30/month)
WireGuard Setup:
- Install WireGuard on Ubuntu 22.04 or Amazon Linux 2023 EC2
- Configure the wg0 interface with iptables MASQUERADE (see the CLI sketch below)
- Store private key in Secrets Manager
- UDM Pro connects as WireGuard peer
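A minimal sketch of the security group, source/destination check, and route table pieces above using the AWS CLI (all resource IDs and the home lab public IP are placeholders):
# Allow WireGuard handshakes only from the home lab's public IP
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp --port 51820 --cidr 203.0.113.10/32
# Required so the instance can forward traffic not addressed to itself
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 --no-source-dest-check
# Return route for a home lab subnet via the WireGuard instance
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 192.168.10.0/24 \
  --instance-id i-0123456789abcdef0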
Cost Breakdown (US East):
| Component | Monthly Cost |
|---|---|
| t4g.micro EC2 (WireGuard + Boot) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Egress (10 boots × 150MB) | ~$0.09 |
| Total (On-Demand) | ~$6.09 |
| Total (1-yr Reserved) | ~$3.59 |
| Annual (On-Demand) | ~$73 |
| Annual (Reserved) | ~$43 |
Pros and Cons
- Good, because t4g.micro ARM instances offer best price/performance (~$6/month on-demand)
- Good, because Reserved Instances provide significant savings (~40% with 1-year commitment)
- Good, because Elastic IP is free when attached to running instance
- Good, because AWS has extensive documentation and community support
- Good, because potential for future multi-cloud strategy
- Good, because ACM provides free SSL certificates (if public domain used)
- Good, because Secrets Manager for WireGuard key storage
- Good, because low cost (~$6/month on-demand, ~$3.50/month with RI)
- Neutral, because requires self-managed WireGuard (no native support, same as GCP)
- Neutral, because TFTP/HTTP/HTTPS served directly from EC2 (no special AWS features)
- Bad, because introduces multi-cloud complexity (separate billing, IAM, monitoring)
- Bad, because no existing AWS infrastructure in home lab (new learning curve)
- Bad, because requires separate monitoring/logging setup (CloudWatch vs Cloud Monitoring)
- Bad, because separate Terraform state and modules needed
- Bad, because Site-to-Site VPN is expensive (~$36/month), so must use self-managed WireGuard
More Information
Detailed Analysis
For in-depth analysis of each provider’s capabilities:
Key Findings Summary
Both providers offer:
- ✅ TFTP Support: Via direct VM/EC2 access (load balancers don’t support TFTP)
- ✅ HTTP/HTTPS Support: Full support via direct VM/EC2 or load balancers
- ✅ WireGuard Compatibility: Self-managed deployment on VM/EC2 (neither has native support)
- ✅ UDM Pro Integration: Native WireGuard client works with both
- ✅ Low Cost: $6-12/month for compute + VPN infrastructure
- ✅ Sufficient Performance: 100+ Mbps throughput on smallest instances
Key differences:
- GCP: Slightly higher cost (~$10/month), but consolidates with existing infrastructure
- AWS: Slightly lower cost (~$6/month on-demand, ~$3.50/month Reserved), but introduces multi-cloud complexity
Cost Comparison Table
| Component | GCP (e2-micro) | AWS (t4g.micro On-Demand) | AWS (t4g.micro 1-yr RI) |
|---|---|---|---|
| Compute | $6.50/month | $6.00/month | $3.50/month |
| Static IP | $3.50/month | $0.00 (Elastic IP free when attached) | $0.00 |
| Egress (1.5GB) | $0.18/month | $0.09/month | $0.09/month |
| Monthly | $10.18 | $6.09 | $3.59 |
| Annual | $122 | $73 | $43 |
Savings Analysis: AWS is ~$49-79/year cheaper, but introduces operational complexity.
Protocol Support Comparison
| Protocol | GCP Support | AWS Support | Implementation |
|---|---|---|---|
| TFTP (UDP/69) | ⚠️ Via VM | ⚠️ Via EC2 | Direct VM/EC2 access (no LB support) |
| HTTP (TCP/80) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer |
| HTTPS (TCP/443) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer + cert |
| WireGuard | ⚠️ Self-managed | ⚠️ Self-managed | Install on VM/EC2 |
WireGuard Deployment Comparison
| Aspect | GCP | AWS |
|---|---|---|
| Native Support | ❌ No (IPsec Cloud VPN only) | ❌ No (IPsec Site-to-Site VPN only) |
| Self-Managed | ✅ Compute Engine | ✅ EC2 |
| Setup Complexity | Similar (install, configure, firewall) | Similar (install, configure, SG) |
| IP Forwarding | Enable on VM | Disable Source/Dest check |
| Firewall | VPC Firewall rules | Security Groups |
| Key Storage | Secret Manager | Secrets Manager |
| Cost | ~$10/month total | ~$6/month total |
Trade-offs Analysis
Choosing GCP:
- Wins: Operational simplicity, unified infrastructure, existing expertise
- Loses: ~$50-80/year higher cost, vendor lock-in
Choosing AWS:
- Wins: Lower cost, Reserved Instance savings, multi-cloud optionality
- Loses: Multi-cloud complexity, separate monitoring/billing, new tooling
For a home lab prioritizing simplicity over cost optimization, GCP’s consolidation benefits outweigh the modest cost difference.
Related ADRs
- ADR-0002: Network Boot Architecture - Established requirement for cloud-hosted boot server with VPN
- ADR-0001: Use MADR for Architecture Decision Records - MADR format used for this ADR
Future Considerations
- Cost Reevaluation: If annual costs become significant, reconsider AWS Reserved Instances
- Multi-Cloud: If multi-cloud strategy emerges, migrate boot server to AWS
- Managed WireGuard: If GCP or AWS adds native WireGuard support, reevaluate managed option
- High Availability: If HA required, evaluate multi-region deployment costs on both providers
Related Issues
- Issue #597 - story(docs): create adr for cloud provider selection
2.4 - [0004] Server Operating System Selection
Context and Problem Statement
The homelab infrastructure requires a server operating system to run Kubernetes clusters for container workloads. The choice of operating system significantly impacts ease of cluster initialization, ongoing maintenance burden, security posture, and operational complexity.
The question is: Which operating system should be used for homelab Kubernetes servers?
This decision will affect:
- Cluster Initialization: Complexity and time required to bootstrap Kubernetes
- Maintenance Burden: Frequency and complexity of OS updates, Kubernetes upgrades, and patching
- Security Posture: Attack surface, built-in security features, and hardening requirements
- Resource Efficiency: RAM, CPU, and disk overhead
- Operational Complexity: Day-to-day management, troubleshooting, and debugging
- Learning Curve: Time required for team to become proficient
Decision Drivers
- Ease of Kubernetes Setup: Minimize steps and complexity for cluster initialization
- Maintenance Simplicity: Reduce ongoing operational burden for updates and upgrades
- Security-First Design: Minimal attack surface and strong security defaults
- Resource Efficiency: Low RAM/CPU/disk overhead for cost-effective homelab
- Learning Curve: Reasonable adoption time for single-person homelab
- Community Support: Strong documentation and active community
- Immutability: Prefer declarative, version-controlled configuration (GitOps-friendly)
- Purpose-Built: OS optimized specifically for Kubernetes vs general-purpose
Considered Options
- Option 1: Ubuntu Server with k3s
- Option 2: Fedora Server with kubeadm
- Option 3: Talos Linux (purpose-built Kubernetes OS)
- Option 4: Harvester HCI (hyperconverged platform)
Decision Outcome
Chosen option: “Option 3: Talos Linux”, because:
- Minimal Attack Surface: No SSH, shell, or package manager eliminates entire classes of vulnerabilities, providing the strongest security posture
- Built-in Kubernetes: No separate installation or configuration complexity - Kubernetes is included and optimized
- Declarative Configuration: API-driven, immutable infrastructure aligns with GitOps principles and prevents configuration drift
- Lowest Resource Overhead: ~768MB RAM vs 1-2GB+ for traditional distros, maximizing homelab hardware efficiency
- Simplified Maintenance: Declarative upgrades (talosctl upgrade) for both OS and Kubernetes reduce operational burden
- Security by Default: Immutable filesystem, no shell, KSPP compliance - secure without manual hardening
While the learning curve is steeper than traditional Linux distributions, the benefits of purpose-built Kubernetes infrastructure, minimal maintenance, and superior security outweigh the initial learning investment for a dedicated Kubernetes homelab.
Consequences
- Good, because minimal attack surface (no SSH/shell) provides strongest security posture
- Good, because declarative configuration enables GitOps workflows and prevents drift
- Good, because lowest resource overhead (~768MB RAM) maximizes homelab efficiency
- Good, because built-in Kubernetes eliminates installation complexity
- Good, because immutable infrastructure prevents configuration drift
- Good, because simplified upgrades (single command for OS + K8s) reduce maintenance burden
- Good, because smallest disk footprint (~500MB) vs 10GB+ for traditional distros
- Good, because secure by default (no manual hardening required)
- Good, because purpose-built design optimized specifically for Kubernetes
- Good, because API-driven management (talosctl) enables automation
- Neutral, because steeper learning curve (paradigm shift from shell-based management)
- Neutral, because smaller community than Ubuntu/Fedora (but active and helpful)
- Bad, because limited to Kubernetes workloads only (not general-purpose)
- Bad, because no shell access requires different troubleshooting approach
- Bad, because newer platform (less mature than Ubuntu/Fedora)
- Bad, because no escape hatch for manual intervention when needed
Confirmation
The implementation will be confirmed by:
- Successfully bootstrapping a Talos cluster using talosctl
- Deploying test workloads and validating functionality
- Performing declarative OS and Kubernetes upgrades
- Measuring actual resource usage (RAM < 1GB per node)
- Validating security posture (no SSH/shell, immutable filesystem)
- Testing GitOps workflow (machine configs in version control)
Pros and Cons of the Options
Option 1: Ubuntu Server with k3s
Host Kubernetes using Ubuntu Server 24.04 LTS with k3s lightweight Kubernetes distribution.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Ubuntu Server
participant K3s as k3s Components
Admin->>Server: Install Ubuntu 24.04 LTS
Server->>Server: Configure network (static IP)
Admin->>Server: Update system
Admin->>Server: curl -sfL https://get.k3s.io | sh -
Server->>K3s: Download k3s binary
K3s->>Server: Configure containerd
K3s->>Server: Start k3s service
K3s->>Server: Initialize etcd (embedded)
K3s->>Server: Start API server
K3s->>Server: Deploy built-in CNI (Flannel)
K3s-->>Admin: Control plane ready
Admin->>Server: Retrieve node token
Admin->>Server: Install k3s agent on workers
K3s->>Server: Join workers to cluster
K3s-->>Admin: Cluster ready (5-10 minutes)
Implementation Details
Installation:
# Single-command k3s install
curl -sfL https://get.k3s.io | sh -
# Get token for workers
sudo cat /var/lib/rancher/k3s/server/node-token
# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
Resource Requirements:
- RAM: 1GB total (512MB OS + 512MB k3s)
- CPU: 1-2 cores
- Disk: 20GB (10GB OS + 10GB containers)
Maintenance:
# OS updates
sudo apt update && sudo apt upgrade
# k3s upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -
# Or automatic via system-upgrade-controller
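For the automated path noted above, upgrades are declared as Plan resources once Rancher's system-upgrade-controller is installed. A minimal single-server sketch (namespace, node selector, and version follow the commonly documented defaults and are assumptions here):
kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: In
        values: ["true"]
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.32.0+k3s1
EOF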
Pros and Cons
- Good, because most familiar Linux distribution (easy adoption)
- Good, because 5-year LTS support (10 years with Ubuntu Pro)
- Good, because k3s provides single-command setup
- Good, because extensive documentation and community support
- Good, because compatible with all Kubernetes tooling
- Good, because automatic security updates available
- Good, because general-purpose (can run non-K8s workloads)
- Good, because low learning curve
- Neutral, because moderate resource overhead (1GB RAM)
- Bad, because general-purpose OS has larger attack surface
- Bad, because requires manual OS updates and reboots
- Bad, because managing OS + Kubernetes lifecycle separately
- Bad, because imperative configuration (not GitOps-native)
- Bad, because mutable filesystem (configuration drift possible)
Option 2: Fedora Server with kubeadm
Host Kubernetes using Fedora Server with kubeadm (official Kubernetes tool) and CRI-O container runtime.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Fedora Server
participant K8s as Kubernetes Components
Admin->>Server: Install Fedora 41
Server->>Server: Configure network
Admin->>Server: Update system (dnf update)
Admin->>Server: Install CRI-O
Server->>Server: Configure CRI-O runtime
Admin->>Server: Install kubeadm/kubelet/kubectl
Server->>Server: Disable swap, load kernel modules
Server->>Server: Configure SELinux
Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
K8s->>Server: Generate certificates
K8s->>Server: Start etcd
K8s->>Server: Start API server
K8s-->>Admin: Control plane ready
Admin->>K8s: kubectl apply CNI
K8s->>Server: Deploy CNI pods
Admin->>K8s: kubeadm join (workers)
K8s-->>Admin: Cluster ready (15-20 minutes)
Implementation Details
Installation:
# Install CRI-O
sudo dnf install -y cri-o
sudo systemctl enable --now crio
# Install kubeadm components
sudo dnf install -y kubelet kubeadm kubectl
# Initialize cluster
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock
# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
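The sequence above also includes swap, kernel module, and SELinux preparation that this snippet glosses over. A sketch of those standard kubeadm prerequisites (values are the upstream defaults):
# Disable swap (required by kubelet's default configuration)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Kernel modules and sysctls needed for pod networking
printf 'overlay\nbr_netfilter\n' | sudo tee /etc/modules-load.d/k8s.conf
sudo modprobe overlay && sudo modprobe br_netfilter
cat <<'EOF' | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
# Upstream docs set SELinux to permissive; keeping it enforcing needs additional policy work
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config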
Resource Requirements:
- RAM: 2.2GB total (700MB OS + 1.5GB Kubernetes)
- CPU: 2+ cores
- Disk: 35GB (15GB OS + 20GB containers)
Maintenance:
# OS updates (every 13 months major upgrade)
sudo dnf update -y
# Kubernetes upgrade
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
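In practice the upgrade above is wrapped in the standard per-node drain/restart/uncordon cycle (node name is a placeholder):
kubectl drain <node-name> --ignore-daemonsets
# ...run the dnf and kubeadm upgrade steps above on that node...
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon <node-name>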
Pros and Cons
- Good, because SELinux enabled by default (stronger than AppArmor)
- Good, because latest kernel and packages (bleeding edge)
- Good, because native CRI-O support (OpenShift compatibility)
- Good, because upstream for RHEL (enterprise patterns)
- Good, because kubeadm provides full control over cluster
- Neutral, because faster release cycle (latest features, but more upgrades)
- Bad, because short support cycle (13 months per release)
- Bad, because bleeding-edge can introduce instability
- Bad, because complex kubeadm setup (many manual steps)
- Bad, because higher resource overhead (2.2GB RAM)
- Bad, because SELinux configuration for Kubernetes is complex
- Bad, because frequent OS upgrades required (every 13 months)
- Bad, because managing OS + Kubernetes separately
- Bad, because imperative configuration (not GitOps-native)
Option 3: Talos Linux (purpose-built Kubernetes OS)
Use Talos Linux, an immutable, API-driven operating system designed specifically for Kubernetes with built-in cluster management.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Talos as Talos Linux
participant K8s as Kubernetes Components
Admin->>Server: Boot Talos ISO (PXE or USB)
Server->>Talos: Start in maintenance mode
Talos-->>Admin: API endpoint ready
Admin->>Admin: Generate configs (talosctl gen config)
Admin->>Talos: talosctl apply-config (controlplane.yaml)
Talos->>Server: Install Talos to disk
Server->>Server: Reboot from disk
Talos->>K8s: Start kubelet
Talos->>K8s: Start etcd
Talos->>K8s: Start API server
Admin->>Talos: talosctl bootstrap
Talos->>K8s: Initialize cluster
K8s->>Talos: Start controller-manager
K8s-->>Admin: Control plane ready
Admin->>K8s: Apply CNI
Admin->>Talos: Apply worker configs
Talos->>K8s: Join workers
K8s-->>Admin: Cluster ready (10-15 minutes)
Implementation Details
Installation:
# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443
# Apply config to control plane (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml
# Bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10
# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.10
# Add workers
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
Machine Configuration (declarative YAML):
version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.10/24
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
Resource Requirements:
- RAM: 768MB total (256MB OS + 512MB Kubernetes)
- CPU: 1-2 cores
- Disk: 10-15GB (500MB OS + 10GB containers)
Maintenance:
# Upgrade Talos (OS + Kubernetes)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0
# Upgrade Kubernetes version
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0
# Apply config changes
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml
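Because there is no shell, day-two troubleshooting also goes through the API. A few read-only talosctl checks that stand in for SSH-based debugging (node IP follows the example above):
talosctl --nodes 192.168.1.10 health
talosctl --nodes 192.168.1.10 services
talosctl --nodes 192.168.1.10 dmesg
talosctl --nodes 192.168.1.10 logs kubelet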
Pros and Cons
- Good, because Kubernetes built-in (no separate installation)
- Good, because minimal attack surface (no SSH, shell, package manager)
- Good, because immutable infrastructure (config drift impossible)
- Good, because API-driven management (GitOps-friendly)
- Good, because lowest resource overhead (~768MB RAM)
- Good, because declarative configuration (YAML in version control)
- Good, because secure by default (no manual hardening)
- Good, because smallest disk footprint (~500MB OS)
- Good, because designed specifically for Kubernetes
- Good, because simple declarative upgrades (OS + K8s)
- Good, because UEFI Secure Boot support
- Neutral, because smaller community (but active and helpful)
- Bad, because steep learning curve (paradigm shift)
- Bad, because limited to Kubernetes workloads only
- Bad, because troubleshooting without shell requires different approach
- Bad, because relatively new (less mature than Ubuntu/Fedora)
- Bad, because no escape hatch for manual intervention
Option 4: Harvester HCI (hyperconverged platform)
Use Harvester, a hyperconverged infrastructure platform built on K3s and KubeVirt for unified VM + container management.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Harvester as Harvester HCI
participant K3s as K3s / KubeVirt
participant Storage as Longhorn Storage
Admin->>Server: Boot Harvester ISO
Server->>Harvester: Installation wizard
Admin->>Harvester: Configure cluster (VIP, storage)
Harvester->>Server: Install RancherOS 2.0
Harvester->>Server: Install K3s
Server->>Server: Reboot
Harvester->>K3s: Start K3s server
K3s->>Storage: Deploy Longhorn
K3s->>Server: Deploy KubeVirt
K3s->>Server: Deploy multus CNI
Harvester-->>Admin: Web UI ready
Admin->>Harvester: Add nodes
Harvester->>K3s: Join cluster
K3s-->>Admin: Cluster ready (20-30 minutes)
Implementation Details
Installation: Interactive ISO wizard or cloud-init config
Resource Requirements:
- RAM: 8GB minimum per node (16GB+ recommended)
- CPU: 4+ cores per node
- Disk: 250GB+ per node (100GB OS + 150GB storage)
- Nodes: 3+ for production HA
Features:
- Web UI management
- Built-in storage (Longhorn)
- VM support (KubeVirt)
- Live migration
- Rancher integration
Pros and Cons
- Good, because unified VM + container platform
- Good, because built-in K3s (Kubernetes included)
- Good, because web UI simplifies management
- Good, because built-in persistent storage (Longhorn)
- Good, because VM live migration
- Good, because Rancher integration
- Neutral, because immutable OS layer
- Bad, because very heavy resource requirements (8GB+ RAM)
- Bad, because complex architecture (KubeVirt, Longhorn, multus)
- Bad, because overkill for container-only workloads
- Bad, because larger attack surface (web UI, VM layer)
- Bad, because requires 3+ nodes for HA (not single-node friendly)
- Bad, because steep learning curve for full feature set
More Information
Detailed Analysis
For in-depth analysis of each operating system:
- Ubuntu Server analysis:
  - Installation methods (kubeadm, k3s, MicroK8s)
  - Cluster initialization sequences
  - Maintenance requirements and upgrade procedures
  - Resource overhead and security posture
- Fedora Server analysis:
  - kubeadm with CRI-O installation
  - SELinux configuration for Kubernetes
  - Rapid release cycle implications
  - RHEL ecosystem compatibility
- Talos Linux analysis:
  - API-driven, immutable architecture
  - Declarative configuration model
  - Security-first design principles
  - Production readiness and advanced features
- Harvester HCI analysis:
  - Hyperconverged infrastructure capabilities
  - VM + container unified platform
  - KubeVirt and Longhorn integration
  - Multi-node cluster requirements
Key Findings Summary
Resource efficiency comparison:
- ✅ Talos: 768MB RAM, 500MB disk (most efficient)
- ✅ Ubuntu + k3s: 1GB RAM, 20GB disk (efficient)
- ⚠️ Fedora + kubeadm: 2.2GB RAM, 35GB disk (moderate)
- ❌ Harvester: 8GB+ RAM, 250GB+ disk (heavy)
Security posture comparison:
- ✅ Talos: Minimal attack surface (no SSH/shell, immutable)
- ✅ Fedora: SELinux by default (strong MAC)
- ⚠️ Ubuntu: AppArmor (moderate security)
- ⚠️ Harvester: Larger attack surface (web UI, VM layer)
Operational complexity comparison:
- ✅ Ubuntu + k3s: Single command install, familiar management
- ✅ Talos: Declarative, automated (after learning curve)
- ⚠️ Fedora + kubeadm: Manual kubeadm steps, frequent OS upgrades
- ❌ Harvester: Complex HCI architecture, heavy requirements
Decision Matrix
| Criterion | Ubuntu + k3s | Fedora + kubeadm | Talos Linux | Harvester |
|---|---|---|---|---|
| Setup Simplicity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Maintenance Burden | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Security Posture | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Resource Efficiency | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐ |
| Learning Curve | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Community Support | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Immutability | ⭐ | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| GitOps-Friendly | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Purpose-Built | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Overall Score | 29/45 | 24/45 | 38/45 | 28/45 |
Talos Linux scores highest for Kubernetes-dedicated homelab infrastructure prioritizing security, efficiency, and GitOps workflows.
Trade-offs Analysis
Choosing Talos Linux:
- Wins: Best security, lowest overhead, declarative configuration, minimal maintenance
- Loses: Steeper learning curve, no shell access, smaller community
Choosing Ubuntu + k3s:
- Wins: Easiest adoption, largest community, general-purpose flexibility
- Loses: Higher attack surface, manual OS management, imperative config
Choosing Fedora + kubeadm:
- Wins: Latest features, SELinux, enterprise compatibility
- Loses: Frequent OS upgrades, complex setup, higher overhead
Choosing Harvester:
- Wins: VM + container unified platform, web UI
- Loses: Heavy resources, complex architecture, overkill for K8s-only
For a Kubernetes-dedicated homelab prioritizing security and efficiency, Talos Linux’s benefits outweigh the learning curve investment.
Related ADRs
- ADR-0001: Use MADR for Architecture Decision Records - MADR format used for this ADR
- ADR-0002: Network Boot Architecture - Server provisioning architecture
- ADR-0003: Cloud Provider Selection - Cloud infrastructure decisions
Future Considerations
- Team Growth: If team grows beyond single person, reassess Ubuntu for familiarity
- VM Requirements: If VM workloads emerge, consider Harvester or KubeVirt on Talos
- Enterprise Patterns: If RHEL compatibility needed, reconsider Fedora/CentOS Stream
- Maintenance Burden: If Talos learning curve proves too steep, fallback to k3s
- Talos Maturity: Monitor Talos ecosystem growth and production adoption
Related Issues
- Issue #598 - story(docs): create adr for server operating system
2.5 - [0005] Network Boot Infrastructure Implementation on Google Cloud
Context and Problem Statement
ADR-0002 established that network boot infrastructure will be hosted on a cloud provider accessed via WireGuard VPN. ADR-0003 selected Google Cloud Platform as the hosting provider to consolidate infrastructure and leverage existing expertise.
The remaining question is: How should the network boot server itself be implemented?
This decision affects:
- Development Effort: Time required to build, test, and maintain the solution
- Feature Completeness: Capabilities for boot image management, machine mapping, and provisioning workflows
- Operational Complexity: Deployment, monitoring, and troubleshooting burden
- Security: Boot image integrity, access control, and audit capabilities
- Scalability: Ability to grow from single home lab to multiple environments
The boot server must handle:
- HTTP/HTTPS requests for UEFI boot scripts, kernels, initrd images, and cloud-init configurations
- Machine-to-image mapping to serve appropriate boot files based on MAC address, hardware profile, or tags
- Boot image lifecycle management including upload, versioning, and rollback capabilities
Hardware-Specific Context
The target bare metal servers (HP DL360 Gen 9) have the following network boot capabilities:
- UEFI HTTP Boot: Supported in iLO 4 firmware v2.40+ (released 2016)
- TLS Support: Server-side TLS only (no client certificate authentication)
- Boot Process: Firmware handles initial HTTP requests directly (no PXE/TFTP chain loading required)
- Configuration: Boot URL configured via iLO RBSU or UEFI System Utilities
Security Implications: Since the servers cannot present client certificates for mTLS authentication with Cloudflare, the WireGuard VPN serves as the secure transport layer for boot traffic. The HTTP boot server is only accessible through the VPN tunnel.
Reference: HP DL360 Gen 9 Network Boot Analysis
Decision Drivers
- Time to Production: Minimize time to get a working network boot infrastructure
- Feature Requirements: Must support machine-specific boot configurations, image versioning, and cloud-init integration
- Maintenance Burden: Prefer solutions that minimize ongoing maintenance and updates
- GCP Integration: Should leverage GCP services (Cloud Storage, Secret Manager, IAM)
- Security: Boot images must be served securely with access control and integrity verification
- Observability: Comprehensive logging and monitoring for troubleshooting boot failures
- Cost: Minimize infrastructure costs while meeting functional requirements
- Future Flexibility: Ability to extend or customize as needs evolve
Considered Options
- Option 1: Custom server implementation (Go-based)
- Option 2: Matchbox-based solution
Decision Outcome
Chosen option: “Option 1: Custom implementation”, because:
- UEFI HTTP Boot Simplification: Elimination of TFTP/PXE dramatically reduces implementation complexity
- Cloud Run Deployment: HTTP-only boot enables serverless deployment (~$5/month vs $8-17/month)
- Development Time Manageable: UEFI HTTP boot reduces custom development to 2-3 weeks
- Full Control: Custom implementation maintains flexibility for future home lab requirements
- GCP Native Integration: Direct Cloud Storage, Firestore, Secret Manager, and IAM integration
- Existing Framework: Leverages z5labs/humus patterns already in use across services
- HTTP REST API: Native HTTP REST admin API via z5labs/humus framework provides better integration with existing tooling
Consequences
- Good, because UEFI HTTP boot eliminates TFTP complexity entirely
- Good, because Cloud Run deployment reduces operational overhead and cost
- Good, because leverages existing z5labs/humus framework and Go expertise
- Good, because GCP native integration (Cloud Storage, Firestore, Secret Manager, IAM)
- Good, because full control over implementation enables future customization
- Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
- Good, because OpenTelemetry observability built-in from existing patterns
- Neutral, because requires 2-3 weeks development time vs 1 week for Matchbox setup
- Neutral, because ongoing maintenance responsibility (no upstream project support)
- Bad, because custom implementation may miss edge cases that Matchbox handles
- Bad, because reinvents machine matching and boot configuration patterns
- Bad, because Cloud Run cold start latency needs monitoring (mitigated with min instances = 1)
Confirmation
The implementation success will be validated by:
- Successfully deploying custom boot server on GCP Cloud Run
- Successfully network booting HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
- Confirming iLO 4 firmware v2.40+ compatibility with HTTP boot workflow
- Validating boot image upload and versioning workflows via HTTP REST API
- Measuring Cloud Run cold start latency for boot requests (target: < 100ms)
- Measuring boot file request latency for kernel/initrd downloads (target: < 100ms)
- Confirming Cloud Storage integration for boot asset storage
- Testing machine-to-image mapping based on MAC address using Firestore
- Validating WireGuard VPN security for boot traffic (compensating for lack of client cert support)
- Verifying OpenTelemetry observability integration with Cloud Monitoring
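One way to sanity-check the latency targets above is a small Go probe run from the home lab side of the WireGuard tunnel; the URL and MAC below are placeholders rather than the real deployment values.
// latencyprobe.go - minimal sketch for spot-checking boot endpoint latency
// over the WireGuard tunnel. The URL and MAC below are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	url := "http://boot.internal/boot.ipxe?mac=aa:bb:cc:dd:ee:ff" // placeholder address

	const samples = 10
	var total time.Duration
	for i := 0; i < samples; i++ {
		start := time.Now()
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
		elapsed := time.Since(start)
		total += elapsed
		fmt.Printf("sample %d: %v (status %d)\n", i+1, elapsed, resp.StatusCode)
	}
	fmt.Printf("average over %d samples: %v (target < 100ms)\n", samples, total/samples)
}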
Pros and Cons of the Options
Option 1: Custom Server Implementation (Go-based)
Build a custom network boot server in Go, leveraging the existing z5labs/humus framework for HTTP services.
Architecture Overview
architecture-beta
group gcp(cloud)[GCP VPC]
service wg_nlb(internet)[Network LB] in gcp
service wireguard(server)[WireGuard Gateway] in gcp
service https_lb(internet)[HTTPS LB] in gcp
service compute(server)[Compute Engine] in gcp
service storage(database)[Cloud Storage] in gcp
service firestore(database)[Firestore] in gcp
service secrets(disk)[Secret Manager] in gcp
service monitoring(internet)[Cloud Monitoring] in gcp
group homelab(cloud)[Home Lab]
service udm(server)[UDM Pro] in homelab
service servers(server)[Bare Metal Servers] in homelab
servers:L -- R:udm
udm:R -- L:wg_nlb
wg_nlb:R -- L:wireguard
wireguard:R -- L:https_lb
https_lb:R -- L:compute
compute:B --> T:storage
compute:B --> T:firestore
compute:R --> L:secrets
compute:T --> B:monitoring
Components:
- Boot Server: Go service deployed to Cloud Run (or Compute Engine VM as fallback)
- HTTP/HTTPS server (using z5labs/humus framework with OpenAPI)
- UEFI HTTP boot endpoint serving boot scripts and assets
- HTTP REST admin API for boot configuration management
- Cloud Storage: Buckets for boot images, boot scripts, kernels, initrd files
- Firestore/Datastore: Machine-to-image mapping database (MAC → boot profile)
- Secret Manager: WireGuard keys, TLS certificates (optional for HTTPS boot)
- Cloud Monitoring: Metrics for boot requests, success/failure rates, latency
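As a rough sketch of the machine-to-image mapping component above, the Firestore documents could be modeled in Go along these lines; the collection name and field names are illustrative assumptions, not a finalized schema.
// machines.go - sketch of a MAC -> boot profile lookup backed by Firestore.
// Collection and field names ("machines", field tags) are assumptions.
package boot

import (
	"context"

	"cloud.google.com/go/firestore"
)

// Machine maps a server's MAC address to a boot profile.
type Machine struct {
	MAC      string `firestore:"mac"`
	Profile  string `firestore:"profile"`
	Hostname string `firestore:"hostname"`
}

// LookupMachine fetches the mapping document keyed by MAC address.
func LookupMachine(ctx context.Context, client *firestore.Client, mac string) (*Machine, error) {
	snap, err := client.Collection("machines").Doc(mac).Get(ctx)
	if err != nil {
		return nil, err
	}
	var m Machine
	if err := snap.DataTo(&m); err != nil {
		return nil, err
	}
	return &m, nil
}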
Boot Image Lifecycle
sequenceDiagram
participant Admin
participant API as Boot Server API
participant Storage as Cloud Storage
participant DB as Firestore
participant Monitor as Cloud Monitoring
Note over Admin,Monitor: Upload Boot Image
Admin->>API: POST /api/v1/images (kernel, initrd, metadata)
API->>API: Validate image integrity (checksum)
API->>Storage: Upload kernel to gs://boot-images/kernels/
API->>Storage: Upload initrd to gs://boot-images/initrd/
API->>DB: Store metadata (version, checksum, tags)
API->>Monitor: Log upload event
API->>Admin: 201 Created (image ID)
Note over Admin,Monitor: Map Machine to Image
Admin->>API: POST /api/v1/machines (MAC, image_id, profile)
API->>DB: Store machine mapping
API->>Admin: 201 Created
Note over Admin,Monitor: UEFI HTTP Boot Request
participant Server as Home Lab Server
Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
Server->>API: HTTP GET /boot?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
API->>DB: Query machine mapping by MAC
API->>API: Generate iPXE script (kernel, initrd URLs)
API->>Monitor: Log boot script request
API->>Server: Send iPXE script
Server->>API: HTTP GET /kernels/ubuntu-22.04.img
API->>Storage: Fetch kernel from Cloud Storage
API->>Monitor: Log kernel download (size, duration)
API->>Server: Stream kernel file
Server->>API: HTTP GET /initrd/ubuntu-22.04.img
API->>Storage: Fetch initrd from Cloud Storage
API->>Monitor: Log initrd download
API->>Server: Stream initrd file
Server->>Server: Boot into OS
Note over Admin,Monitor: Rollback Image Version
Admin->>API: POST /api/v1/machines/{mac}/rollback
API->>DB: Update machine mapping to previous image_id
API->>Monitor: Log rollback event
API->>Admin: 200 OK
Implementation Details
Development Stack:
- Language: Go 1.24 (leverage existing Go expertise)
- HTTP Framework: z5labs/humus (consistent with existing services)
- UEFI Boot: Standard HTTP handlers (no special libraries needed)
- Storage Client: cloud.google.com/go/storage
- Database: Firestore for machine mappings (or simple JSON config in Cloud Storage)
- Observability: OpenTelemetry (metrics, traces, logs to Cloud Monitoring/Trace)
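To make the boot flow concrete, here is a minimal sketch of the boot-script endpoint using only the standard library (the z5labs/humus wiring and most error handling are trimmed); the route shape, template, and lookup signature are assumptions for illustration.
// bootscript.go - sketch of the /boot endpoint that renders an iPXE script
// for a machine identified by MAC address. Routes and types are assumptions.
package boot

import (
	"net/http"
	"text/template"
)

// ScriptParams holds the values substituted into the iPXE template.
type ScriptParams struct {
	KernelURL string
	InitrdURL string
	Cmdline   string
}

// iPXE script returned to the firmware; kernel/initrd URLs point back at this server.
var ipxeTmpl = template.Must(template.New("boot").Parse(`#!ipxe
kernel {{.KernelURL}} {{.Cmdline}}
initrd {{.InitrdURL}}
boot
`))

// BootHandler looks up the machine by the ?mac= query parameter and renders its boot script.
func BootHandler(lookup func(mac string) (*ScriptParams, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		mac := r.URL.Query().Get("mac")
		if mac == "" {
			http.Error(w, "missing mac parameter", http.StatusBadRequest)
			return
		}
		params, err := lookup(mac)
		if err != nil {
			http.Error(w, "unknown machine", http.StatusNotFound)
			return
		}
		w.Header().Set("Content-Type", "text/plain")
		_ = ipxeTmpl.Execute(w, params)
	}
}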
Deployment:
- Cloud Run (preferred - HTTP-only boot enables serverless deployment):
- Min instances: 1 (ensures fast boot response, avoids cold start delays)
- Max instances: 2 (home lab scale)
- Memory: 512MB
- CPU: 1 vCPU
- Health checks: /health/startup, /health/liveness
- Concurrency: 10 requests per instance
- Alternative - Compute Engine VM (if Cloud Run latency unacceptable):
- e2-micro instance ($6.50/month)
- Container-Optimized OS with Docker
- systemd service for boot server
- Health checks: /health/startup, /health/liveness
- Networking:
- VPC firewall: Allow TCP/80, TCP/443 from WireGuard subnet (no UDP/69 needed)
- Static internal IP for boot server (Compute Engine) or HTTPS Load Balancer (Cloud Run)
- Cloud NAT for outbound connectivity (Cloud Storage access)
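A minimal sketch of the /health/startup and /health/liveness probes referenced in the deployment options above; in the real service these would be registered through the HTTP framework rather than a bare mux.
// health.go - sketch of the health endpoints used by Cloud Run / Compute Engine probes.
package boot

import "net/http"

// RegisterHealth wires the startup and liveness probes onto a mux.
func RegisterHealth(mux *http.ServeMux) {
	ok := func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}
	mux.HandleFunc("/health/startup", ok)  // reports ready once dependencies are reachable
	mux.HandleFunc("/health/liveness", ok) // reports OK while the process is serving
}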
Configuration Management:
- Machine mappings stored in Firestore or Cloud Storage JSON files
- Boot profiles defined in YAML (similar to Matchbox groups):
profiles:
  - name: ubuntu-22.04-server
    kernel: gs://boot-images/kernels/ubuntu-22.04.img
    initrd: gs://boot-images/initrd/ubuntu-22.04.img
    cmdline: "console=tty0 console=ttyS0"
    cloud_init: gs://boot-images/cloud-init/ubuntu-base.yaml
machines:
  - mac: "aa:bb:cc:dd:ee:ff"
    profile: ubuntu-22.04-server
    hostname: node-01
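For illustration, the configuration above could be loaded with Go types along these lines; field names mirror the YAML example and gopkg.in/yaml.v3 is an assumed dependency.
// config.go - sketch of parsing the boot profile / machine mapping YAML above.
package boot

import (
	"os"

	"gopkg.in/yaml.v3"
)

// ProfileConfig mirrors one entry under "profiles" in the YAML example.
type ProfileConfig struct {
	Name      string `yaml:"name"`
	Kernel    string `yaml:"kernel"`
	Initrd    string `yaml:"initrd"`
	Cmdline   string `yaml:"cmdline"`
	CloudInit string `yaml:"cloud_init"`
}

// MachineConfig mirrors one entry under "machines".
type MachineConfig struct {
	MAC      string `yaml:"mac"`
	Profile  string `yaml:"profile"`
	Hostname string `yaml:"hostname"`
}

// BootConfig is the top-level document.
type BootConfig struct {
	Profiles []ProfileConfig `yaml:"profiles"`
	Machines []MachineConfig `yaml:"machines"`
}

// LoadBootConfig reads and parses the YAML file at path.
func LoadBootConfig(path string) (*BootConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg BootConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}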
Cost Breakdown:
Option A: Cloud Run Deployment (Preferred):
| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 512MB, always-on) | $3.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$5.18 |
Option B: Compute Engine Deployment (If Cloud Run latency unacceptable):
| Component | Monthly Cost |
|---|---|
| e2-micro VM (boot server) | $6.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |
Pros and Cons
- Good, because UEFI HTTP boot eliminates TFTP complexity entirely
- Good, because Cloud Run deployment option reduces operational overhead and infrastructure cost
- Good, because full control over boot server implementation and features
- Good, because leverages existing Go expertise and z5labs/humus framework patterns
- Good, because seamless GCP integration (Cloud Storage, Firestore, Secret Manager, IAM)
- Good, because minimal dependencies (no external projects to track)
- Good, because customizable to specific home lab requirements
- Good, because OpenTelemetry observability built-in from existing patterns
- Good, because can optimize for home lab scale (< 20 machines)
- Good, because lightweight implementation (no unnecessary features)
- Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
- Good, because standard HTTP serving is well-understood (lower risk than TFTP)
- Neutral, because development effort required (2-3 weeks for MVP, reduced from 3-4 weeks)
- Neutral, because requires ongoing maintenance and security updates
- Neutral, because Cloud Run cold start latency needs validation (POC required)
- Bad, because reinvents machine matching and boot configuration patterns
- Bad, because testing network boot scenarios still requires hardware
- Bad, because potential for bugs in custom implementation
- Bad, because no community support or established best practices
- Bad, because development time still longer than Matchbox (2-3 weeks vs 1 week)
Option 2: Matchbox-Based Solution
Deploy Matchbox, an open-source network boot server developed by CoreOS (now part of Red Hat), to handle UEFI HTTP boot workflows.
Architecture Overview
architecture-beta
group gcp(cloud)[GCP VPC]
service wg_nlb(internet)[Network LB] in gcp
service wireguard(server)[WireGuard Gateway] in gcp
service https_lb(internet)[HTTPS LB] in gcp
service compute(server)[Compute Engine] in gcp
service storage(database)[Cloud Storage] in gcp
service secrets(disk)[Secret Manager] in gcp
service monitoring(internet)[Cloud Monitoring] in gcp
group homelab(cloud)[Home Lab]
service udm(server)[UDM Pro] in homelab
service servers(server)[Bare Metal Servers] in homelab
servers:L -- R:udm
udm:R -- L:wg_nlb
wg_nlb:R -- L:wireguard
wireguard:R -- L:https_lb
https_lb:R -- L:compute
compute:B --> T:storage
compute:R --> L:secrets
compute:T --> B:monitoring
Components:
- Matchbox Server: Container deployed to Cloud Run or Compute Engine VM
- HTTP/gRPC APIs for boot workflows and configuration
- UEFI HTTP boot support (TFTP disabled)
- Machine grouping and profile templating
- Ignition, Cloud-Init, and generic boot support
- Cloud Storage: Backend for boot assets (mounted via gcsfuse or synced periodically)
- Local Storage (Compute Engine only): /var/lib/matchbox for assets and configuration (synced from Cloud Storage)
- Secret Manager: WireGuard keys, Matchbox TLS certificates
- Cloud Monitoring: Logs from Matchbox container, custom metrics via log parsing
Boot Image Lifecycle
sequenceDiagram
participant Admin
participant CLI as matchbox CLI / API
participant Matchbox as Matchbox Server
participant Storage as Cloud Storage
participant Monitor as Cloud Monitoring
Note over Admin,Monitor: Upload Boot Image
Admin->>CLI: Upload kernel/initrd via gRPC API
CLI->>Matchbox: gRPC CreateAsset(kernel, initrd)
Matchbox->>Matchbox: Validate asset integrity
Matchbox->>Matchbox: Store to /var/lib/matchbox/assets/
Matchbox->>Storage: Sync to gs://boot-assets/ (via sidecar script)
Matchbox->>Monitor: Log asset upload event
Matchbox->>CLI: Asset ID, checksum
Note over Admin,Monitor: Create Boot Profile
Admin->>CLI: Create profile YAML (kernel, initrd, cmdline)
CLI->>Matchbox: gRPC CreateProfile(profile.yaml)
Matchbox->>Matchbox: Store to /var/lib/matchbox/profiles/
Matchbox->>Storage: Sync profiles to gs://boot-config/
Matchbox->>CLI: Profile ID
Note over Admin,Monitor: Create Machine Group
Admin->>CLI: Create group YAML (MAC selector, profile mapping)
CLI->>Matchbox: gRPC CreateGroup(group.yaml)
Matchbox->>Matchbox: Store to /var/lib/matchbox/groups/
Matchbox->>Storage: Sync groups to gs://boot-config/
Matchbox->>CLI: Group ID
Note over Admin,Monitor: UEFI HTTP Boot Request
participant Server as Home Lab Server
Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
Server->>Matchbox: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
Matchbox->>Matchbox: Match MAC to group
Matchbox->>Matchbox: Render iPXE template with profile
Matchbox->>Monitor: Log boot request (MAC, group, profile)
Matchbox->>Server: Send iPXE script
Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-kernel.img
Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
Matchbox->>Monitor: Log asset download (size, duration)
Matchbox->>Server: Stream kernel file
Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-initrd.img
Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
Matchbox->>Monitor: Log asset download
Matchbox->>Server: Stream initrd file
Server->>Server: Boot into OS
Note over Admin,Monitor: Rollback Machine Group
Admin->>CLI: Update group YAML (change profile reference)
CLI->>Matchbox: gRPC UpdateGroup(group.yaml)
Matchbox->>Matchbox: Update /var/lib/matchbox/groups/
Matchbox->>Storage: Sync updated group config
Matchbox->>Monitor: Log group update
Matchbox->>CLI: Success
Implementation Details
Matchbox Deployment:
- Container: quay.io/poseidon/matchbox:latest (official image)
- Deployment Options:
- Cloud Run (preferred - HTTP-only boot enables serverless deployment):
- Min instances: 1 (ensures fast boot response)
- Memory: 1GB RAM (Matchbox recommendation)
- CPU: 1 vCPU
- Storage: Cloud Storage for assets/profiles/groups (via HTTP API)
- Compute Engine VM (if persistent local storage preferred):
- e2-small instance ($14/month, 2GB RAM recommended for Matchbox)
- /var/lib/matchbox: Persistent disk (10GB SSD, $1.70/month)
- Cloud Storage sync: Periodic backup of assets/profiles/groups to gs://matchbox-config/
gcsfuseto mount Cloud Storage directly (adds latency but simplifies backups)
- Cloud Run (preferred - HTTP-only boot enables serverless deployment):
Configuration Structure:
/var/lib/matchbox/
├── assets/ # Boot images (kernels, initrds, ISOs)
│ ├── ubuntu-22.04-kernel.img
│ ├── ubuntu-22.04-initrd.img
│ └── flatcar-stable.img.gz
├── profiles/ # Boot profiles (YAML)
│ ├── ubuntu-server.yaml
│ └── flatcar-container.yaml
└── groups/ # Machine groups (YAML)
├── default.yaml
├── node-01.yaml
└── storage-nodes.yaml
Example Profile (profiles/ubuntu-server.yaml):
id: ubuntu-22.04-server
name: Ubuntu 22.04 LTS Server
boot:
  kernel: /assets/ubuntu-22.04-kernel.img
  initrd:
    - /assets/ubuntu-22.04-initrd.img
  args:
    - console=tty0
    - console=ttyS0
    - ip=dhcp
ignition_id: ubuntu-base.yaml
Example Group (groups/node-01.yaml):
id: node-01
name: Node 01 - Ubuntu Server
profile: ubuntu-22.04-server
selector:
  mac: "aa:bb:cc:dd:ee:ff"
metadata:
  hostname: node-01.homelab.local
  ssh_authorized_keys:
    - "ssh-ed25519 AAAA..."
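For intuition, the group-to-profile selection shown above reduces to matching a machine's attributes against each group's selector, with an empty selector acting as the default; the sketch below is a simplified illustration of that idea, not Matchbox's actual code.
// matching.go - simplified illustration of selector-based group matching
// (MAC -> group -> profile). This is not Matchbox's implementation.
package example

// Group pairs a selector with the profile it assigns.
type Group struct {
	ID       string
	Profile  string
	Selector map[string]string // e.g. {"mac": "aa:bb:cc:dd:ee:ff"}
}

// MatchGroup returns the first group whose selector is satisfied by the
// machine's labels, falling back to a group with an empty selector (default).
func MatchGroup(groups []Group, labels map[string]string) (Group, bool) {
	var fallback *Group
	for i, g := range groups {
		if len(g.Selector) == 0 {
			fallback = &groups[i]
			continue
		}
		matched := true
		for k, v := range g.Selector {
			if labels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return g, true
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return Group{}, false
}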
GCP Integration:
- Cloud Storage Sync: Cron job or sidecar container to sync /var/lib/matchbox to Cloud Storage:
# Sync every 5 minutes
*/5 * * * * gsutil -m rsync -r /var/lib/matchbox gs://matchbox-config/
- Secret Manager: Store Matchbox TLS certificates for gRPC API authentication
- Cloud Monitoring: Ship Matchbox logs to Cloud Logging, parse for metrics:
- Boot request count by MAC/group
- Asset download success/failure rates
- TFTP vs HTTP request distribution
Networking:
- VPC firewall: Allow TCP/8080 (HTTP), TCP/8081 (gRPC) from WireGuard subnet (no UDP/69 needed)
- Optional: Internal load balancer if high availability required (adds ~$18/month)
- Note: Cloud Run deployment includes integrated HTTPS load balancing
Cost Breakdown:
Option A: Cloud Run Deployment (Preferred):
| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 1GB RAM, always-on) | $7.00 |
| Cloud Storage (50GB boot images) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |
Option B: Compute Engine Deployment (If persistent local storage preferred):
| Component | Monthly Cost |
|---|---|
| e2-small VM (Matchbox server) | $14.00 |
| Persistent SSD (10GB) | $1.70 |
| Cloud Storage (50GB backups) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$16.88 |
Pros and Cons
- Good, because HTTP-only boot enables Cloud Run deployment (reduces cost significantly)
- Good, because UEFI HTTP boot eliminates TFTP complexity and potential failure points
- Good, because production-ready boot server with extensive real-world usage
- Good, because feature-complete with machine grouping, templating, and multi-OS support
- Good, because gRPC API for programmatic boot configuration management
- Good, because supports Ignition (Flatcar, CoreOS), Cloud-Init, and generic boot workflows
- Good, because well-documented with established best practices
- Good, because active community and upstream maintenance (Red Hat/CoreOS)
- Good, because reduces development time to days (deploy + configure vs weeks of coding)
- Good, because avoids reinventing network boot patterns (machine matching, boot configuration)
- Good, because proven security model (TLS for gRPC, asset integrity checks)
- Neutral, because requires learning Matchbox configuration patterns (YAML profiles/groups)
- Neutral, because containerized deployment (Docker on Compute Engine or Cloud Run)
- Neutral, because Cloud Run deployment option competitive with custom implementation cost
- Bad, because introduces external dependency (Matchbox project maintenance)
- Bad, because some features unnecessary for home lab scale (large-scale provisioning, etcd backend)
- Bad, because less control over implementation details (limited customization)
- Bad, because Cloud Storage integration requires custom sync scripts (Matchbox doesn’t natively support GCS backend)
- Bad, because dependency on upstream for security patches and bug fixes
UEFI HTTP Boot Architecture
This section documents the UEFI HTTP boot capability that fundamentally changes the network boot infrastructure design.
Boot Process Overview
Traditional PXE Boot (NOT USED - shown for comparison):
sequenceDiagram
participant Server as Bare Metal Server
participant DHCP as DHCP Server
participant TFTP as TFTP Server
participant HTTP as HTTP Server
Note over Server,HTTP: Traditional PXE Boot Chain (NOT USED)
Server->>DHCP: DHCP Discover
DHCP->>Server: DHCP Offer (TFTP server, boot filename)
Server->>TFTP: TFTP GET /pxelinux.0
TFTP->>Server: Send PXE bootloader
Server->>TFTP: TFTP GET /ipxe.efi
TFTP->>Server: Send iPXE binary
Server->>HTTP: HTTP GET /boot.ipxe
HTTP->>Server: Send boot script
Server->>HTTP: HTTP GET /kernel, /initrd
HTTP->>Server: Stream boot files
UEFI HTTP Boot (ACTUAL IMPLEMENTATION):
sequenceDiagram
participant Server as HP DL360 Gen 9<br/>(iLO 4 v2.40+)
participant DHCP as DHCP Server<br/>(UDM Pro)
participant VPN as WireGuard VPN
participant HTTP as HTTP Boot Server<br/>(GCP Cloud Run)
Note over Server,HTTP: UEFI HTTP Boot (ACTUAL IMPLEMENTATION)
Server->>DHCP: DHCP Discover
DHCP->>Server: DHCP Offer (boot URL: http://boot.internal/boot.ipxe?mac=...)
Note right of Server: Firmware initiates HTTP request directly<br/>(no TFTP/PXE chain loading)
Server->>VPN: WireGuard tunnel established
Server->>HTTP: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff
HTTP->>Server: Send boot script with kernel/initrd URLs
Server->>HTTP: HTTP GET /assets/talos-kernel.img
HTTP->>Server: Stream kernel (via WireGuard)
Server->>HTTP: HTTP GET /assets/talos-initrd.img
HTTP->>Server: Stream initrd (via WireGuard)
Server->>Server: Boot into OS
Key Differences
| Aspect | Traditional PXE | UEFI HTTP Boot |
|---|---|---|
| Initial Protocol | TFTP (UDP/69) | HTTP (TCP/80) or HTTPS (TCP/443) |
| Boot Loader | Requires TFTP transfer of iPXE binary | Firmware has HTTP client built-in |
| Chain Loading | PXE → TFTP → iPXE → HTTP | Direct HTTP boot (no chain) |
| Firewall Rules | UDP/69, TCP/80, TCP/443 | TCP/80, TCP/443 only |
| Cloud Run Support | ❌ (UDP not supported) | ✅ (HTTP-only) |
| Transfer Speed | ~1-5 Mbps (TFTP) | 10-100 Mbps (HTTP) |
| Complexity | High (multiple protocols) | Low (HTTP-only) |
Security Architecture
Challenge: HP DL360 Gen 9 UEFI HTTP boot does not support client-side TLS certificates (mTLS).
Solution: WireGuard VPN provides transport-layer security:
flowchart LR
subgraph homelab[Home Lab]
server[HP DL360 Gen 9<br/>UEFI HTTP Boot<br/>iLO 4 v2.40+]
udm[UDM Pro<br/>WireGuard Client]
end
subgraph gcp[Google Cloud Platform]
wg_gw[WireGuard Gateway<br/>Compute Engine]
cr[Boot Server<br/>Cloud Run]
end
server -->|HTTP| udm
udm -->|Encrypted WireGuard Tunnel| wg_gw
wg_gw -->|HTTP| cr
style server fill:#f9f,stroke:#333
style udm fill:#bbf,stroke:#333
style wg_gw fill:#bfb,stroke:#333
style cr fill:#fbb,stroke:#333
Why WireGuard instead of Cloudflare mTLS?
- Cloudflare mTLS Limitation: Requires client certificates at TLS layer
- UEFI Firmware Limitation: Cannot present client certificates during TLS handshake
- WireGuard Solution: Provides mutual authentication at network layer (pre-shared keys)
- Security Equivalent: WireGuard offers same security properties as mTLS:
- Mutual authentication (both endpoints authenticated)
- Confidentiality (all traffic encrypted)
- Integrity (authenticated encryption via ChaCha20-Poly1305)
- No Internet exposure (boot server only accessible via VPN)
Firmware Configuration
HP iLO 4 UEFI HTTP Boot Setup:
Access Configuration:
- iLO web interface → Remote Console → Power On → Press F9 (RBSU)
- Or: Direct RBSU access during POST (Press F9)
Enable UEFI HTTP Boot:
- Navigate: System Configuration → BIOS/Platform Configuration (RBSU) → Network Options
- Set Network Boot to Enabled
- Set Boot Mode to UEFI (not Legacy BIOS)
- Enable UEFI HTTP Boot Support
Configure NIC:
- Navigate: RBSU → Network Options → [FlexibleLOM/PCIe NIC]
- Set Option ROM to Enabled (required for UEFI boot option to appear)
- Set Network Boot to Enabled
- Configure IPv4/IPv6 settings (DHCP or static)
Set Boot Order:
- Navigate: RBSU → Boot Options → UEFI Boot Order
- Move network device to top priority
Configure Boot URL (via DHCP or static):
- DHCP option 67: http://10.x.x.x/boot.ipxe?mac=${net0/mac}
- Or: Static configuration in UEFI System Utilities
Required Firmware Versions:
- iLO 4: v2.40 or later (for UEFI HTTP boot support)
- System ROM: P89 v2.60 or later (recommended)
Verification:
# Check iLO firmware version via REST API
curl -k -u admin:password https://ilo-address/redfish/v1/Managers/1/ | jq '.FirmwareVersion'
# Expected output: "2.40" or higher
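The same check can be scripted, for example in Go against the Redfish endpoint used above; the host, credentials, and skipped TLS verification are placeholders appropriate only for a lab.
// ilocheck.go - sketch of reading the iLO firmware version via Redfish,
// mirroring the curl command above. Host and credentials are placeholders.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// iLO typically presents a self-signed certificate, hence InsecureSkipVerify (lab use only).
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, err := http.NewRequest("GET", "https://ilo-address/redfish/v1/Managers/1/", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("admin", "password")

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var manager struct {
		FirmwareVersion string `json:"FirmwareVersion"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&manager); err != nil {
		log.Fatal(err)
	}
	fmt.Println("iLO firmware:", manager.FirmwareVersion) // expect 2.40 or higher
}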
Architectural Implications
TFTP Elimination Impact:
- Deployment: Cloud Run becomes viable (no UDP/TFTP requirement)
- Cost: Reduced infrastructure costs (~$5-8/month vs $8-17/month)
- Complexity: Simplified networking (TCP-only firewall rules)
- Development: Reduced effort (no TFTP library, testing, edge cases)
- Scalability: Cloud Run autoscaling vs fixed VM capacity
- Maintenance: Serverless reduces operational overhead
Decision Impact:
The removal of TFTP complexity fundamentally shifts the cost/benefit analysis:
- Custom Implementation: More attractive (Cloud Run, reduced development time)
- Matchbox: Still valid but cost/complexity advantage reduced
- TCO Gap: Narrowed from ~$8,000-12,000 to ~$4,000-8,000 (Year 1)
- Development Gap: Reduced from 2-3 weeks to 1-2 weeks
Detailed Comparison
Feature Comparison
| Feature | Custom Implementation | Matchbox |
|---|---|---|
| UEFI HTTP Boot | ✅ Native (standard HTTP) | ✅ Built-in |
| HTTP/HTTPS Boot | ✅ Via z5labs/humus | ✅ Built-in |
| Cloud Run Deployment | ✅ Preferred option | ✅ Enabled by HTTP-only |
| Boot Scripting | ✅ Custom templates | ✅ Go templates |
| Machine-to-Image Mapping | ✅ Firestore/JSON | ✅ YAML groups with selectors |
| Boot Profile Management | ✅ Custom API | ✅ gRPC API + YAML |
| Cloud-Init Support | ⚠️ Requires implementation | ✅ Native support |
| Ignition Support | ❌ Not planned | ✅ Native support (Flatcar, CoreOS) |
| Asset Versioning | ⚠️ Requires implementation | ⚠️ Manual (via Cloud Storage versioning) |
| Rollback Capability | ⚠️ Requires implementation | ✅ Update group to previous profile |
| OpenTelemetry Observability | ✅ Built-in | ⚠️ Logs only (requires parsing) |
| GCP Cloud Storage Integration | ✅ Native SDK | ⚠️ Requires sync scripts |
| HTTP REST Admin API | ✅ Native (z5labs/humus) | ⚠️ gRPC only |
| Multi-Environment Support | ⚠️ Requires implementation | ✅ Groups + metadata |
Development Effort Comparison
| Task | Custom Implementation | Matchbox |
|---|---|---|
| Initial Setup | 1-2 days (project scaffolding) | 4-8 hours (deployment + config) |
| UEFI HTTP Boot | 1-2 days (standard HTTP endpoints) | ✅ Included |
| HTTP Boot API | 2-3 days (z5labs/humus endpoints) | ✅ Included |
| Machine Matching Logic | 2-3 days (database queries, selectors) | ✅ Included |
| Boot Script Templates | 2-3 days (boot script templating) | ✅ Included |
| Cloud-Init Support | 3-5 days (parsing, injection) | ✅ Included |
| Asset Management | 2-3 days (upload, storage) | ✅ Included |
| HTTP REST Admin API | 2-3 days (OpenAPI endpoints) | ✅ Included (gRPC) |
| Cloud Run Deployment | 1 day (Cloud Run config) | 1 day (Cloud Run config) |
| Testing | 3-5 days (unit, integration, E2E - simplified) | 2-3 days (integration only) |
| Documentation | 2-3 days | 1 day (reference existing docs) |
| Total Effort | 2-3 weeks | 1 week |
Operational Complexity
| Aspect | Custom Implementation | Matchbox |
|---|---|---|
| Deployment | Docker container on Compute Engine | Docker container on Compute Engine |
| Configuration Updates | API calls or Terraform updates | YAML file updates + API/filesystem sync |
| Monitoring | OpenTelemetry metrics to Cloud Monitoring | Log parsing + custom metrics |
| Troubleshooting | Full access to code, custom logging | Matchbox logs + gRPC API inspection |
| Security Patches | Manual code updates | Upstream container image updates |
| Dependency Updates | Manual Go module updates | Upstream Matchbox updates |
| Backup/Restore | Cloud Storage + Firestore backups | Sync /var/lib/matchbox to Cloud Storage |
Cost Comparison Summary
Comparing Cloud Run Deployments (Preferred for both options):
| Item | Custom (Cloud Run) | Matchbox (Cloud Run) | Difference |
|---|---|---|---|
| Compute | Cloud Run ($3.50/month) | Cloud Run ($7/month) | +$3.50/month |
| Storage | Cloud Storage ($1/month) | Cloud Storage ($1/month) | $0 |
| Development | 2-3 weeks @ $100/hour = $8,000-12,000 | 1 week @ $100/hour = $4,000 | -$4,000-8,000 |
| Annual Infrastructure | ~$54 | ~$96 | +$42/year |
| TCO (Year 1) | ~$8,054-12,054 | ~$4,096 | -$3,958-7,958 |
| TCO (Year 3) | ~$8,162-12,162 | ~$4,288 | -$3,874-7,874 |
Key Insights:
- UEFI HTTP boot enables Cloud Run deployment for both options, dramatically reducing infrastructure costs
- Custom implementation TCO gap narrowed from $7,895-11,895 to $3,958-7,958 (Year 1)
- Both options now cost ~$5-8/month for infrastructure (vs $8-17/month with TFTP)
- Development time difference reduced from 2-3 weeks to 1-2 weeks
- Decision is much closer than originally assessed
Risk Analysis
| Risk | Custom Implementation | Matchbox | Mitigation |
|---|---|---|---|
| Security Vulnerabilities | Medium (standard HTTP code, well-understood) | Medium (upstream dependency) | Both: Monitor for security updates, automated deployments |
| Boot Failures | Medium (HTTP-only reduces complexity) | Low (battle-tested) | Custom: Comprehensive E2E testing with real hardware |
| Cloud Run Cold Starts | Medium (needs validation) | Medium (needs validation) | Both: Min instances = 1 (always-on) |
| Maintenance Burden | Medium (ongoing code maintenance) | Low (upstream handles updates) | Both: Automated deployment pipelines |
| GCP Integration Issues | Low (native SDK) | Medium (sync scripts) | Matchbox: Robust sync with error handling |
| Scalability Limits | Low (Cloud Run autoscaling) | Low (handles thousands of nodes) | Both: Monitor boot request latency |
| Dependency Abandonment | N/A (no external deps) | Low (Red Hat backing) | Matchbox: Can fork if necessary |
Implementation Plan
Phase 1: Core Boot Server (Week 1)
Project Setup (1-2 days)
- Create Go project with z5labs/humus framework
- Configure Cloud Storage and Firestore clients
- Implement basic health check endpoints
- Create Go project with
UEFI HTTP Boot Endpoints (2-3 days)
- HTTP endpoint serving boot scripts (iPXE format)
- Kernel and initrd streaming from Cloud Storage
- MAC-based machine matching using Firestore
- Boot script templating with machine-specific parameters
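A sketch of the kernel/initrd streaming step listed above, copying a Cloud Storage object straight to the booting machine; the bucket layout and route shape are assumptions.
// assets.go - sketch of streaming a kernel/initrd object from Cloud Storage
// to the booting machine. Bucket and path layout are assumptions.
package boot

import (
	"io"
	"net/http"
	"path"

	"cloud.google.com/go/storage"
)

// AssetHandler streams objects from the given bucket, e.g. GET /assets/ubuntu-22.04-kernel.img.
func AssetHandler(client *storage.Client, bucket string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		object := path.Base(r.URL.Path) // flat object layout assumed for this sketch
		rc, err := client.Bucket(bucket).Object(object).NewReader(r.Context())
		if err != nil {
			http.Error(w, "asset not found", http.StatusNotFound)
			return
		}
		defer rc.Close()
		w.Header().Set("Content-Type", "application/octet-stream")
		if _, err := io.Copy(w, rc); err != nil {
			// Client disconnects mid-download happen during boot retries; log in practice.
			return
		}
	}
}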
Testing & Deployment (2-3 days)
- Deploy to Cloud Run with min instances = 1
- Configure WireGuard VPN connectivity
- Test UEFI HTTP boot from HP DL360 Gen 9 (iLO 4 v2.40+)
- Validate boot latency and Cloud Run cold start metrics
Phase 2: Admin API & Management (Week 2)
HTTP REST Admin API (2-3 days)
- Boot image upload endpoints (kernel, initrd, metadata)
- Machine-to-image mapping management
- Boot profile CRUD operations
- Asset versioning and integrity validation
Cloud-Init Integration (2-3 days)
- Cloud-init configuration templating
- Metadata injection for machine-specific settings
- Integration with boot workflow
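A sketch of the cloud-init templating and metadata injection steps above; the template content and field names are illustrative only.
// cloudinit.go - sketch of rendering a cloud-init user-data document with
// machine-specific metadata. The template content is illustrative only.
package boot

import (
	"bytes"
	"text/template"
)

// CloudInitData carries the per-machine values injected into the template.
type CloudInitData struct {
	Hostname          string
	SSHAuthorizedKeys []string
}

var userDataTmpl = template.Must(template.New("user-data").Parse(`#cloud-config
hostname: {{.Hostname}}
ssh_authorized_keys:
{{- range .SSHAuthorizedKeys}}
  - {{.}}
{{- end}}
`))

// RenderUserData produces the cloud-init payload served alongside the boot script.
func RenderUserData(d CloudInitData) ([]byte, error) {
	var buf bytes.Buffer
	if err := userDataTmpl.Execute(&buf, d); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}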
Observability & Documentation (2-3 days)
- OpenTelemetry metrics integration
- Cloud Monitoring dashboards
- API documentation
- Operational runbooks
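A sketch of the OpenTelemetry metrics integration step above using the otel-go metric API; instrument and attribute names are assumptions, and exporter setup toward Cloud Monitoring is omitted.
// metrics.go - sketch of an OpenTelemetry counter for boot requests.
// Instrument and attribute names are assumptions; exporter setup is omitted.
package boot

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("bootserver")

// bootRequests counts boot script requests, labeled by the requesting MAC.
var bootRequests, _ = meter.Int64Counter(
	"boot.requests",
	metric.WithDescription("Number of boot script requests served"),
)

// RecordBootRequest increments the counter for one boot attempt.
func RecordBootRequest(ctx context.Context, mac string) {
	bootRequests.Add(ctx, 1, metric.WithAttributes(attribute.String("mac", mac)))
}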
Success Criteria
- ✅ Successfully boot HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
- ✅ Boot latency < 100ms for HTTP requests (kernel/initrd downloads)
- ✅ Cloud Run cold start latency < 100ms (with min instances = 1)
- ✅ Machine-to-image mapping works correctly based on MAC address
- ✅ Cloud Storage integration functional (upload, retrieve boot assets)
- ✅ HTTP REST API fully functional for boot configuration management
- ✅ Firestore stores machine mappings and boot profiles correctly
- ✅ OpenTelemetry metrics available in Cloud Monitoring
- ✅ Configuration update workflow clear and documented
- ✅ Firmware compatibility confirmed (no TFTP fallback needed)
More Information
Related Resources
- Matchbox Documentation
- Matchbox GitHub Repository
- iPXE Boot Process
- PXE Boot Specification
- Flatcar Linux Provisioning with Matchbox
- CoreOS Ignition Specification
- Cloud-Init Documentation
Related ADRs
- ADR-0002: Network Boot Architecture - Established cloud-hosted boot server with VPN
- ADR-0003: Cloud Provider Selection - Selected GCP as hosting provider
- ADR-0001: Use MADR for Architecture Decision Records - MADR format
Future Considerations
- High Availability: If boot server uptime becomes critical, evaluate multi-region deployment or failover strategies
- Multi-Cloud: If multi-cloud strategy emerges, custom implementation provides better portability
- Enterprise Features: If advanced provisioning workflows required (bare metal Kubernetes, Ignition support, etc.), evaluate adding features to custom implementation
- Asset Versioning: Implement comprehensive boot image versioning and rollback capabilities beyond basic Cloud Storage versioning
- Multi-Environment Support: Add support for multiple environments (dev, staging, prod) with environment-specific boot profiles
Related Issues
- Issue #601 - story(docs): create adr for network boot infrastructure on google cloud
- Issue #595 - story(docs): create adr for network boot architecture
- Issue #597 - story(docs): create adr for cloud provider selection