Technology Analysis
In-depth analysis of technologies and tools evaluated for home lab infrastructure
This section contains detailed research and analysis of various technologies evaluated for potential use in the home lab infrastructure.
Network Boot & Provisioning
- Matchbox - Network boot service for bare-metal provisioning
  - Comprehensive analysis of PXE/iPXE/GRUB support
  - Configuration model (profiles, groups, templating)
  - Deployment patterns and operational considerations
  - Use case evaluation and comparison with alternatives
Cloud Providers
- Google Cloud Platform - GCP capabilities for network boot infrastructure
  - Network boot protocol support (TFTP, HTTP, HTTPS)
  - WireGuard VPN deployment and integration
  - Cost analysis and performance considerations
- Amazon Web Services - AWS capabilities for network boot infrastructure
  - Network boot protocol support (TFTP, HTTP, HTTPS)
  - WireGuard VPN deployment and integration
  - Cost analysis and performance considerations
Operating Systems
- Server Operating Systems - OS evaluation for Kubernetes homelab infrastructure
  - Ubuntu Server analysis (kubeadm, k3s, MicroK8s)
  - Fedora Server analysis (kubeadm with CRI-O)
  - Talos Linux analysis (purpose-built Kubernetes OS)
  - Harvester HCI analysis (hyperconverged platform)
  - Comparison of setup complexity, maintenance, security, and resource overhead
Hardware
Future Analysis Topics
Planned technology evaluations:
- Storage Solutions: Ceph, GlusterFS, ZFS over iSCSI
- Container Orchestration: Kubernetes distributions (k3s, Talos, etc.)
- Observability: Prometheus, Grafana, Loki, Tempo stack
- Service Mesh: Istio, Linkerd, Cilium comparison
- CI/CD: GitLab Runner, Tekton, Argo Workflows
- Secret Management: Vault, External Secrets Operator
- Load Balancing: MetalLB, kube-vip, Cilium LB-IPAM
1 - Server Operating System Analysis
Evaluation of operating systems for homelab Kubernetes infrastructure
This section provides detailed analysis of operating systems evaluated for the homelab server infrastructure, with a focus on Kubernetes cluster setup and maintenance.
Overview
The selection of a server operating system is critical for homelab infrastructure. The primary evaluation criterion is ease of Kubernetes cluster initialization and ongoing maintenance burden.
Evaluated Options
Ubuntu - Traditional general-purpose Linux distribution
- Kubernetes via kubeadm, k3s, or MicroK8s
- Strong community support and extensive documentation
- Familiar package management and system administration
Fedora - Cutting-edge Linux distribution
- Latest kernel and system components
- Kubernetes via kubeadm or k3s
- Shorter support lifecycle with more frequent upgrades
Talos Linux - Purpose-built Kubernetes OS
- API-driven, immutable infrastructure
- Built-in Kubernetes with minimal attack surface
- Designed specifically for container workloads
Harvester - Hyperconverged infrastructure platform
- Built on Rancher and K3s
- Combines compute, storage, and networking
- VM and container workloads on unified platform
Evaluation Criteria
Each option is evaluated based on:
- Kubernetes Installation Methods - Available tooling and installation approaches
- Cluster Initialization Process - Steps required to bootstrap a cluster
- Maintenance Requirements - OS updates, Kubernetes upgrades, security patches
- Resource Overhead - Memory, CPU, and storage footprint
- Learning Curve - Ease of adoption and operational complexity
- Community Support - Documentation quality and ecosystem maturity
- Security Posture - Attack surface and security-first design
1.1 - Ubuntu Analysis
Analysis of Ubuntu for Kubernetes homelab infrastructure
Overview
Ubuntu Server is a popular general-purpose Linux distribution developed by Canonical. It provides Long Term Support (LTS) releases with 5 years of standard support and optional Extended Security Maintenance (ESM).
Key Facts:
- Latest LTS: Ubuntu 24.04 LTS (Noble Numbat)
- Support Period: 5 years standard, 10 years with Ubuntu Pro (free for personal use)
- Kernel: Linux 6.8+ (LTS), regular HWE updates
- Package Manager: APT/DPKG, Snap
- Init System: systemd
Kubernetes Installation Methods
Ubuntu supports multiple Kubernetes installation approaches:
1. kubeadm (Official Kubernetes Tooling)
Installation:
# Install container runtime (containerd)
sudo apt-get update
sudo apt-get install -y containerd
# Configure containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
# Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
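kubeadm also expects swap to be disabled and the bridge/IP-forwarding sysctls to be set before cluster initialization (the sequence diagram below lists this step). A typical prerequisite sketch, assuming the default containerd config generated above:
# Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Kernel modules and sysctls for pod networking
echo -e "overlay\nbr_netfilter" | sudo tee /etc/modules-load.d/k8s.conf
sudo modprobe overlay && sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
# Switch containerd to the systemd cgroup driver (kubeadm's kubelet default)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd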
Cluster Initialization:
# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# Configure kubectl for admin
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install CNI (e.g., Calico, Flannel)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
# Join worker nodes
kubeadm token create --print-join-command
Pros:
- Official Kubernetes tooling, well-documented
- Full control over cluster configuration
- Supports latest Kubernetes versions
- Large community and extensive resources
Cons:
- More manual steps than turnkey solutions
- Requires understanding of Kubernetes architecture
- Manual upgrade process for each component
- More complex troubleshooting
2. k3s (Lightweight Kubernetes)
Installation:
# Single-command install on control plane
curl -sfL https://get.k3s.io | sh -
# Get node token for workers
sudo cat /var/lib/rancher/k3s/server/node-token
# Install on worker nodes
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
Pros:
- Extremely simple installation (single command)
- Lightweight (< 512MB RAM)
- Built-in container runtime (containerd)
- Automatic updates via Rancher System Upgrade Controller
- Great for edge and homelab use cases
Cons:
- Less customizable than kubeadm
- Some features removed (e.g., in-tree storage, cloud providers)
- Slightly different from upstream Kubernetes
3. MicroK8s (Canonical’s Distribution)
Installation:
# Install via snap
sudo snap install microk8s --classic
# Join cluster
sudo microk8s add-node
# Run output command on worker nodes
# Enable addons
microk8s enable dns storage ingress
Pros:
- Zero-ops, single package install
- Snap-based automatic updates
- Addons for common services (DNS, storage, ingress)
- Canonical support available
Cons:
- Requires snap (not universally liked)
- Less ecosystem compatibility than vanilla Kubernetes
- Ubuntu-specific (less portable)
Cluster Initialization Sequence
kubeadm Approach
sequenceDiagram
participant Admin
participant Server as Ubuntu Server
participant K8s as Kubernetes Components
Admin->>Server: Install Ubuntu 24.04 LTS
Server->>Server: Configure network (static IP)
Admin->>Server: Update system (apt update && upgrade)
Admin->>Server: Install containerd
Server->>Server: Configure containerd (CRI)
Admin->>Server: Install kubeadm/kubelet/kubectl
Server->>Server: Disable swap, configure kernel modules
Admin->>K8s: kubeadm init --pod-network-cidr=10.244.0.0/16
K8s->>Server: Generate certificates
K8s->>Server: Start etcd
K8s->>Server: Start API server
K8s->>Server: Start controller-manager
K8s->>Server: Start scheduler
K8s-->>Admin: Control plane ready
Admin->>K8s: kubectl apply -f calico.yaml
K8s->>Server: Deploy CNI pods
Admin->>K8s: kubeadm join (on workers)
K8s->>Server: Add worker nodes
K8s-->>Admin: Cluster ready
k3s Approach
sequenceDiagram
participant Admin
participant Server as Ubuntu Server
participant K3s as k3s Components
Admin->>Server: Install Ubuntu 24.04 LTS
Server->>Server: Configure network (static IP)
Admin->>Server: Update system
Admin->>Server: curl -sfL https://get.k3s.io | sh -
Server->>K3s: Download k3s binary
K3s->>Server: Configure containerd
K3s->>Server: Start k3s service
K3s->>Server: Initialize etcd (embedded)
K3s->>Server: Start API server
K3s->>Server: Start controller-manager
K3s->>Server: Start scheduler
K3s->>Server: Deploy built-in CNI (Flannel)
K3s-->>Admin: Control plane ready
Admin->>Server: Retrieve node token
Admin->>Server: Install k3s agent on workers
K3s->>Server: Join workers to cluster
K3s-->>Admin: Cluster ready (5-10 minutes total)
Maintenance Requirements
OS Updates
Security Patches:
# Automatic security updates (recommended)
sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# Manual updates
sudo apt-get update
sudo apt-get upgrade
Frequency:
- Security patches: Weekly to monthly
- Kernel updates: Monthly (may require reboot)
- Major version upgrades: Every 2 years (LTS to LTS)
Kubernetes Upgrades
kubeadm Upgrade:
# Upgrade control plane
sudo apt-get update
sudo apt-get install -y kubeadm=1.32.0-*
sudo kubeadm upgrade apply v1.32.0
sudo apt-get install -y kubelet=1.32.0-* kubectl=1.32.0-*
sudo systemctl restart kubelet
# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo apt-get install -y kubeadm=1.32.0-* kubelet=1.32.0-* kubectl=1.32.0-*
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>
k3s Upgrade:
# Manual upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -
# Automatic upgrade via system-upgrade-controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
Upgrade Frequency: Every 3-6 months (Kubernetes minor versions)
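The system-upgrade-controller applied above acts on Plan resources; a minimal Plan for server nodes, sketched with an example version pin and the controller's default system-upgrade namespace and service account:
kubectl apply -f - <<EOF
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.32.0+k3s1
EOF
A second Plan with a different node selector covers the agent nodes.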
Resource Overhead
Minimal Installation (Ubuntu Server + k3s):
- RAM: ~512MB (OS) + 512MB (k3s) = 1GB total
- CPU: 1 core minimum, 2 cores recommended
- Disk: 10GB (OS) + 10GB (container images) = 20GB
- Network: 1 Gbps recommended
Full Installation (Ubuntu Server + kubeadm):
- RAM: ~512MB (OS) + 1-2GB (Kubernetes components) = 2GB+ total
- CPU: 2 cores minimum
- Disk: 15GB (OS) + 20GB (container images/etcd) = 35GB
- Network: 1 Gbps recommended
Security Posture
Strengths:
- Regular security updates via Ubuntu Security Team
- AppArmor enabled by default
- SELinux support available
- Kernel hardening features (ASLR, stack protection)
- Ubuntu Pro ESM for extended CVE coverage (free for personal use)
Attack Surface:
- Full general-purpose OS (larger attack surface than minimal OS)
- Many installed packages by default (can be minimized)
- Requires manual hardening for production use
Hardening Steps:
# Disable unnecessary services
sudo systemctl disable snapd.service
sudo systemctl disable bluetooth.service
# Configure firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 6443/tcp # Kubernetes API
sudo ufw allow 10250/tcp # Kubelet
sudo ufw enable
# CIS Kubernetes Benchmark compliance
# Use tools like kube-bench for validation
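For the CIS check noted in the comment above, kube-bench can be run as a one-shot Job; a sketch assuming the job manifest published in the kube-bench repository:
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=300s
kubectl logs job/kube-bench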
Learning Curve
Ease of Adoption: ⭐⭐⭐⭐⭐ (Excellent)
- Most familiar Linux distribution for many users
- Extensive documentation and tutorials
- Large community support (forums, Stack Overflow)
- Straightforward package management
- Similar to Debian-based systems
Required Knowledge:
- Basic Linux system administration (apt, systemd, networking)
- Kubernetes concepts (pods, services, deployments)
- Container runtime basics (containerd, Docker)
- Text editor (vim, nano) for configuration
Ecosystem Maturity: ⭐⭐⭐⭐⭐ (Excellent)
- Documentation: Comprehensive official docs, community guides
- Community: Massive user base, active forums
- Commercial Support: Available from Canonical (Ubuntu Pro)
- Third-Party Tools: Excellent compatibility with all Kubernetes tools
- Tutorials: Abundant resources for Kubernetes on Ubuntu
Pros and Cons Summary
Pros
- Good, because most familiar and well-documented Linux distribution
- Good, because 5-year LTS support (10 years with Ubuntu Pro)
- Good, because multiple Kubernetes installation options (kubeadm, k3s, MicroK8s)
- Good, because k3s provides extremely simple setup (single command)
- Good, because extensive package ecosystem (60,000+ packages)
- Good, because strong community support and resources
- Good, because automatic security updates available
- Good, because low learning curve for most administrators
- Good, because compatible with all Kubernetes tooling and addons
- Good, because Ubuntu Pro free for personal use (extended security)
Cons
- Bad, because general-purpose OS has larger attack surface than minimal OS
- Bad, because more resource overhead than purpose-built Kubernetes OS (1-2GB RAM)
- Bad, because requires manual OS updates and reboots
- Bad, because kubeadm setup is complex with many manual steps
- Bad, because snap packages controversial (for MicroK8s)
- Bad, because Kubernetes upgrades require manual intervention (unless using k3s auto-upgrade)
- Bad, because managing OS + Kubernetes lifecycle separately increases complexity
- Neutral, because many preinstalled packages (can be removed, but require effort)
Recommendations
Best for:
- Users familiar with Ubuntu/Debian ecosystem
- Homelabs requiring general-purpose server functionality (not just Kubernetes)
- Teams wanting multiple Kubernetes installation options
- Users prioritizing community support and documentation
Best Installation Method:
- Homelab/Learning: k3s (simplest, auto-updates, lightweight)
- Production-like: kubeadm (full control, upstream Kubernetes)
- Ubuntu-specific: MicroK8s (Canonical support, snap-based)
Avoid if:
- Seeking minimal attack surface (consider Talos Linux)
- Want infrastructure-as-code for OS layer (consider Talos Linux)
- Prefer hyperconverged platform (consider Harvester)
1.2 - Fedora Analysis
Analysis of Fedora Server for Kubernetes homelab infrastructure
Overview
Fedora Server is a cutting-edge Linux distribution sponsored by Red Hat, serving as the upstream for Red Hat Enterprise Linux (RHEL). It emphasizes innovation with the latest software packages and kernel versions.
Key Facts:
- Latest Version: Fedora 41 (October 2024)
- Support Period: ~13 months per release (shorter than Ubuntu LTS)
- Kernel: Linux 6.11+ (latest stable)
- Package Manager: DNF/RPM, Flatpak
- Init System: systemd
Kubernetes Installation Methods
Fedora supports standard Kubernetes installation approaches:
1. kubeadm with CRI-O
Installation:
# Install container runtime (CRI-O preferred on Fedora)
sudo dnf install -y cri-o
sudo systemctl enable --now crio
# Add Kubernetes repository
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
EOF
# Install kubeadm, kubelet, kubectl
sudo dnf install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet
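As with Ubuntu, kubeadm requires swap to be off and the bridge/IP-forwarding sysctls to be set; Fedora additionally ships zram swap by default, and SELinux handling is covered under Security Posture below. A prerequisite sketch:
# Disable swap, including Fedora's default zram device
sudo swapoff -a
sudo dnf remove -y zram-generator-defaults
# Kernel modules and sysctls for pod networking
echo -e "overlay\nbr_netfilter" | sudo tee /etc/modules-load.d/k8s.conf
sudo modprobe overlay && sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system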
Cluster Initialization:
# Initialize control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock
# Configure kubectl
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
# Join workers
kubeadm token create --print-join-command
Pros:
- CRI-O is native to Fedora ecosystem (same as RHEL/OpenShift)
- Latest Kubernetes versions available quickly
- Familiar to RHEL/CentOS users
- Fully upstream Kubernetes
Cons:
- Manual setup process (same as Ubuntu/kubeadm)
- Requires Kubernetes knowledge
- More complex than turnkey solutions
2. k3s (Lightweight Kubernetes)
Installation:
# Same single-command install
curl -sfL https://get.k3s.io | sh -
# Retrieve token
sudo cat /var/lib/rancher/k3s/server/node-token
# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
Pros:
- Simple installation (identical to Ubuntu)
- Lightweight and fast
- Well-tested on Fedora/RHEL family
Cons:
- Less customizable
- Not using native CRI-O by default (uses embedded containerd)
3. OKD (OpenShift Kubernetes Distribution)
Installation (Single-Node):
# Download and install OKD
wget https://github.com/okd-project/okd/releases/download/4.15.0-0.okd-2024-01-27-070424/openshift-install-linux-4.15.0-0.okd-2024-01-27-070424.tar.gz
tar -xvf openshift-install-linux-*.tar.gz
sudo mv openshift-install /usr/local/bin/
# Create install config
./openshift-install create install-config --dir=cluster
# Install cluster
./openshift-install create cluster --dir=cluster
Pros:
- Enterprise features (operators, web console, image registry)
- Built-in CI/CD and developer tools
- Based on Fedora CoreOS (immutable, auto-updating)
Cons:
- Very heavy resource requirements (16GB+ RAM)
- Complex installation and management
- Overkill for simple homelab use
Cluster Initialization Sequence
kubeadm with CRI-O
sequenceDiagram
participant Admin
participant Server as Fedora Server
participant K8s as Kubernetes Components
Admin->>Server: Install Fedora 41
Server->>Server: Configure network (static IP)
Admin->>Server: Update system (dnf update)
Admin->>Server: Install CRI-O
Server->>Server: Configure CRI-O runtime
Server->>Server: Enable crio.service
Admin->>Server: Install kubeadm/kubelet/kubectl
Server->>Server: Disable swap, load kernel modules
Server->>Server: Configure SELinux (permissive for Kubernetes)
Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
K8s->>Server: Generate certificates
K8s->>Server: Start etcd
K8s->>Server: Start API server
K8s->>Server: Start controller-manager
K8s->>Server: Start scheduler
K8s-->>Admin: Control plane ready
Admin->>K8s: kubectl apply CNI
K8s->>Server: Deploy CNI pods
Admin->>K8s: kubeadm join (workers)
K8s->>Server: Add worker nodes
K8s-->>Admin: Cluster ready
k3s Approach
sequenceDiagram
participant Admin
participant Server as Fedora Server
participant K3s as k3s Components
Admin->>Server: Install Fedora 41
Server->>Server: Configure network
Admin->>Server: Update system (dnf update)
Admin->>Server: Disable firewalld (or configure)
Admin->>Server: curl -sfL https://get.k3s.io | sh -
Server->>K3s: Download k3s binary
K3s->>Server: Configure containerd
K3s->>Server: Start k3s service
K3s->>Server: Initialize embedded etcd
K3s->>Server: Start API server
K3s->>Server: Deploy built-in CNI
K3s-->>Admin: Control plane ready
Admin->>Server: Retrieve node token
Admin->>Server: Install k3s agent on workers
K3s->>Server: Join workers
K3s-->>Admin: Cluster ready (5-10 minutes)
Maintenance Requirements
OS Updates
Security and System Updates:
# Automatic updates (dnf-automatic)
sudo dnf install -y dnf-automatic
sudo systemctl enable --now dnf-automatic.timer
# Manual updates
sudo dnf update -y
sudo reboot # if kernel updated
Frequency:
- Security patches: Weekly to monthly
- Kernel updates: Monthly (frequent updates)
- Major version upgrades: Every ~13 months (Fedora releases)
Version Upgrade:
# Upgrade to next Fedora release
sudo dnf upgrade --refresh
sudo dnf install dnf-plugin-system-upgrade
sudo dnf system-upgrade download --releasever=42
sudo dnf system-upgrade reboot
Kubernetes Upgrades
kubeadm Upgrade:
# Upgrade control plane
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
sudo systemctl restart kubelet
# Upgrade workers
kubectl drain <node> --ignore-daemonsets
sudo dnf update -y kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl restart kubelet
kubectl uncordon <node>
k3s Upgrade: Same as Ubuntu (curl script or system-upgrade-controller)
Upgrade Frequency: Kubernetes every 3-6 months, Fedora OS every ~13 months
Resource Overhead
Minimal Installation (Fedora Server + k3s):
- RAM: ~600MB (OS) + 512MB (k3s) = 1.2GB total
- CPU: 1 core minimum, 2 cores recommended
- Disk: 12GB (OS) + 10GB (containers) = 22GB
- Network: 1 Gbps recommended
Full Installation (Fedora Server + kubeadm + CRI-O):
- RAM: ~700MB (OS) + 1.5GB (Kubernetes) = 2.2GB total
- CPU: 2 cores minimum
- Disk: 15GB (OS) + 20GB (containers) = 35GB
- Network: 1 Gbps recommended
Note: Slightly higher overhead than Ubuntu due to SELinux and newer components.
Security Posture
Strengths:
- SELinux enabled by default (stronger than AppArmor)
- Latest security patches and kernel (bleeding edge)
- CRI-O container runtime (security-focused, used by OpenShift)
- Shorter support window = less legacy CVEs
- Active security team and rapid response
Attack Surface:
- General-purpose OS (larger surface than minimal OS)
- More installed packages than minimal server
- SELinux can be complex to configure for Kubernetes
Hardening Steps:
# Configure firewall (firewalld default on Fedora)
sudo firewall-cmd --permanent --add-port=6443/tcp # API server
sudo firewall-cmd --permanent --add-port=10250/tcp # Kubelet
sudo firewall-cmd --reload
# SELinux configuration for Kubernetes
sudo setenforce 0 # Permissive (Kubernetes not fully SELinux-ready)
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
# Disable unnecessary services
sudo systemctl disable bluetooth.service
Learning Curve
Ease of Adoption: ⭐⭐⭐⭐ (Good)
- Familiar for RHEL/CentOS/Alma/Rocky users
- DNF package manager (similar to APT)
- Excellent documentation
- SELinux learning curve can be steep
Required Knowledge:
- RPM-based system administration (dnf, systemd)
- SELinux basics (or willingness to use permissive mode)
- Kubernetes concepts
- Firewalld configuration
Differences from Ubuntu:
- DNF vs APT package manager
- SELinux vs AppArmor
- Firewalld vs UFW
- Faster release cycle (more frequent upgrades)
Ecosystem Maturity: ⭐⭐⭐⭐ (Good)
- Documentation: Excellent official docs, Red Hat resources
- Community: Large user base, active forums
- Commercial Support: RHEL support available (paid)
- Third-Party Tools: Good compatibility with Kubernetes tools
- Tutorials: Abundant resources, especially for RHEL ecosystem
Pros and Cons Summary
Pros
- Good, because latest kernel and software packages (bleeding edge)
- Good, because SELinux enabled by default (stronger MAC than AppArmor)
- Good, because native CRI-O support (same as RHEL/OpenShift)
- Good, because upstream for RHEL (enterprise compatibility)
- Good, because multiple Kubernetes installation options
- Good, because k3s simplifies setup dramatically
- Good, because strong security focus and rapid CVE response
- Good, because familiar to RHEL/CentOS ecosystem
- Good, because automatic updates available (dnf-automatic)
- Neutral, because shorter support cycle (13 months) ensures latest features
Cons
- Bad, because short support cycle requires frequent OS upgrades (every ~13 months)
- Bad, because bleeding-edge packages can introduce instability
- Bad, because SELinux configuration for Kubernetes is complex (often set to permissive)
- Bad, because smaller community than Ubuntu (though still large)
- Bad, because general-purpose OS has larger attack surface than minimal OS
- Bad, because more resource overhead than purpose-built Kubernetes OS
- Bad, because OS upgrade every 13 months adds maintenance burden
- Bad, because less beginner-friendly than Ubuntu
- Bad, because managing OS + Kubernetes lifecycle separately
- Neutral, because rapid release cycle can be pro or con depending on preference
Recommendations
Best for:
- Users familiar with RHEL/CentOS/Rocky/Alma ecosystem
- Teams wanting latest kernel and software features
- Environments requiring SELinux (compliance, enterprise standards)
- Learning OpenShift/OKD ecosystem (Fedora CoreOS foundation)
- Users comfortable with frequent OS upgrades
Best Installation Method:
- Homelab/Learning: k3s (simplest, lightweight)
- Enterprise-like: kubeadm + CRI-O (OpenShift compatibility)
- Advanced: OKD (if resources available, 16GB+ RAM)
Avoid if:
- Prefer long-term stability (choose Ubuntu LTS)
- Want minimal maintenance (frequent Fedora upgrades required)
- Seeking minimal attack surface (consider Talos Linux)
- Uncomfortable with SELinux complexity
- Want infrastructure-as-code for OS (consider Talos Linux)
Comparison with Ubuntu
| Aspect | Fedora | Ubuntu LTS |
|---|---|---|
| Support Period | 13 months | 5 years (10 with Pro) |
| Kernel | Latest (6.11+) | LTS (6.8+) |
| Security | SELinux | AppArmor |
| Package Manager | DNF/RPM | APT/DEB |
| Release Cycle | 6 months | 2 years (LTS) |
| Upgrade Frequency | Every 13 months | Every 2-5 years |
| Community Size | Large | Very Large |
| Enterprise Upstream | RHEL | N/A |
| Stability | Bleeding edge | Stable/Conservative |
| Learning Curve | Moderate | Easy |
Verdict: Fedora is excellent for those wanting latest features and comfortable with frequent upgrades. Ubuntu LTS is better for long-term stability and minimal maintenance.
1.3 - Talos Linux Analysis
Analysis of Talos Linux for Kubernetes homelab infrastructure
Overview
Talos Linux is a modern operating system designed specifically for running Kubernetes. It is API-driven, immutable, and minimal, with no SSH access, shell, or package manager. All configuration is done via a declarative API.
Key Facts:
- Latest Version: Talos 1.9 (supports Kubernetes 1.31)
- Support: Community-driven, commercial support available from Sidero Labs
- Kernel: Linux 6.6+ LTS
- Architecture: Immutable, API-driven, no shell access
- Management: talosctl CLI + Kubernetes API
Kubernetes Installation Methods
Talos Linux has built-in Kubernetes - there is only one installation method.
Built-in Kubernetes (Only Option)
Installation Process:
- Boot Talos ISO/PXE (maintenance mode)
- Apply machine configuration via talosctl
- Bootstrap Kubernetes via talosctl bootstrap
Machine Configuration (YAML):
# controlplane.yaml
version: v1alpha1
machine:
  type: controlplane
  install:
    disk: /dev/sda
  network:
    hostname: control-plane-1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.1.10/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
cluster:
  clusterName: homelab
  controlPlane:
    endpoint: https://192.168.1.10:6443
  network:
    cni:
      name: custom
      urls:
        - https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
Cluster Initialization:
# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443
# Apply config to control plane node (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml
# Wait for install to complete, then bootstrap
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10
# Retrieve kubeconfig
talosctl kubeconfig --nodes 192.168.1.10 --endpoints 192.168.1.10
# Apply config to worker nodes
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
Pros:
- Kubernetes built-in, no separate installation
- Declarative configuration (GitOps-friendly)
- Extremely minimal attack surface (no shell, no SSH)
- Immutable infrastructure (config changes require reboot)
- Automatic updates via Talos controller
- Designed from ground up for Kubernetes
Cons:
- Steep learning curve (completely different paradigm)
- No SSH/shell access (all via API)
- Troubleshooting requires different mindset
- Limited to Kubernetes workloads only (not general-purpose)
- Smaller community than traditional distros
Cluster Initialization Sequence
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Talos as Talos Linux
participant K8s as Kubernetes Components
Admin->>Server: Boot Talos ISO (PXE or USB)
Server->>Talos: Start in maintenance mode
Talos-->>Admin: API endpoint ready (no shell)
Admin->>Admin: Generate configs (talosctl gen config)
Admin->>Talos: talosctl apply-config (controlplane.yaml)
Talos->>Server: Partition disk
Talos->>Server: Install Talos to /dev/sda
Talos->>Server: Write machine config
Server->>Server: Reboot from disk
Talos->>Talos: Load machine config
Talos->>K8s: Start kubelet
Talos->>K8s: Start etcd
Talos->>K8s: Start API server
Admin->>Talos: talosctl bootstrap
Talos->>K8s: Initialize cluster
K8s->>Talos: Start controller-manager
K8s->>Talos: Start scheduler
K8s-->>Admin: Control plane ready
Admin->>K8s: Apply CNI (via talosctl or kubectl)
K8s->>Talos: Deploy CNI pods
Admin->>Talos: Apply worker configs
Talos->>K8s: Join workers to cluster
K8s-->>Admin: Cluster ready (10-15 minutes)
Maintenance Requirements
OS Updates
Declarative Upgrades:
# Upgrade Talos version (rolling upgrade)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0
# Kubernetes version upgrade (also declarative)
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0
Automatic Updates (via Talos System Extensions):
# machine config with auto-update extension
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/system-upgrade-controller
Frequency:
- Talos releases: Every 2-3 months
- Kubernetes upgrades: Follow upstream cadence (quarterly)
- Security patches: Built into Talos releases
- No traditional OS patching (immutable system)
Configuration Changes
All changes via machine config:
# Edit machine config YAML
vim controlplane.yaml
# Apply updated config (triggers reboot if needed)
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml
No manual package installs - everything declarative.
Resource Overhead
Minimal Footprint (Talos Linux + Kubernetes):
- RAM: ~256MB (OS) + 512MB (Kubernetes) = 768MB total
- CPU: 1 core minimum, 2 cores recommended
- Disk: ~500MB (OS) + 10GB (container images/etcd) = 10-15GB total
- Network: 1 Gbps recommended
Comparison:
- Ubuntu + k3s: ~1GB RAM
- Talos: ~768MB RAM (lighter)
- Ubuntu + kubeadm: ~2GB RAM
- Talos: ~768MB RAM (much lighter)
Minimal install size: ~500MB (vs 10GB+ for Ubuntu/Fedora)
Security Posture
Strengths: ⭐⭐⭐⭐⭐ (Excellent)
- No SSH access - attack surface eliminated
- No shell - cannot install malware
- No package manager - no additional software installation
- Immutable filesystem - rootfs read-only
- Minimal components: Only Kubernetes and essential services
- API-only access - mTLS-authenticated talosctl
- KSPP compliance: Kernel Self-Protection Project standards
- Signed images: Cryptographically signed Talos images
- Secure Boot support: UEFI Secure Boot compatible
Attack Surface:
- Smallest possible: Only Kubernetes API, kubelet, and Talos API
- ~30 running processes (vs 100+ on Ubuntu/Fedora)
- ~200MB filesystem (vs 5-10GB on Ubuntu/Fedora)
No hardening needed - secure by default.
Security Features:
# Built-in security (example config)
machine:
  sysctls:
    kernel.kptr_restrict: "2"
    kernel.yama.ptrace_scope: "1"
  kernel:
    modules:
      - name: br_netfilter
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader
Learning Curve
Ease of Adoption: ⭐⭐ (Challenging)
- Paradigm shift: No shell/SSH, API-only management
- Requires understanding of declarative infrastructure
- Talosctl CLI has learning curve
- Excellent documentation helps
- Different troubleshooting approach (logs via API)
Required Knowledge:
- Kubernetes fundamentals (critical)
- YAML configuration syntax
- Networking basics (especially CNI)
- GitOps concepts helpful
- Comfort with “infrastructure as code”
Debugging without shell:
# View logs via API
talosctl logs --nodes 192.168.1.10 kubelet
# Interactive dashboard (system metrics, logs, service health)
talosctl dashboard --nodes 192.168.1.10
# Service status
talosctl service --nodes 192.168.1.10
Ecosystem Maturity: ⭐⭐⭐ (Growing)
- Documentation: Excellent official docs
- Community: Smaller but very active (Slack, GitHub Discussions)
- Commercial Support: Available from Sidero Labs
- Third-Party Tools: Growing ecosystem (Cluster API, GitOps tools)
- Tutorials: Increasing number of community guides
Community Size: Smaller than Ubuntu/Fedora, but dedicated and helpful.
Pros and Cons Summary
Pros
- Good, because Kubernetes is built-in (no separate installation)
- Good, because minimal attack surface (no SSH, shell, or package manager)
- Good, because immutable infrastructure (config drift impossible)
- Good, because API-driven management (GitOps-friendly)
- Good, because extremely low resource overhead (~768MB RAM)
- Good, because automatic security patches via Talos upgrades
- Good, because declarative configuration (version-controlled)
- Good, because secure by default (no hardening required)
- Good, because smallest disk footprint (~500MB OS)
- Good, because designed specifically for Kubernetes (opinionated and optimized)
- Good, because UEFI Secure Boot support
- Good, because upgrades are simple and declarative (talosctl upgrade)
Cons
- Bad, because steep learning curve (no shell/SSH paradigm shift)
- Bad, because limited to Kubernetes workloads only (not general-purpose)
- Bad, because troubleshooting without shell requires different approach
- Bad, because smaller community than Ubuntu/Fedora
- Bad, because relatively new (less mature than traditional distros)
- Bad, because no escape hatch for manual intervention
- Bad, because requires comfort with declarative infrastructure
- Bad, because debugging is harder for beginners
- Neutral, because opinionated design (pro for K8s-only, con for general use)
Recommendations
Best for:
- Kubernetes-dedicated infrastructure (no general-purpose workloads)
- Security-focused environments (minimal attack surface)
- GitOps workflows (declarative configuration)
- Immutable infrastructure advocates
- Teams comfortable with API-driven management
- Production Kubernetes clusters (once team is trained)
Best Installation Method:
- Only option: Built-in Kubernetes via talosctl
Avoid if:
- Need general-purpose server functionality (SSH, cron jobs, etc.)
- Team unfamiliar with Kubernetes (too steep a learning curve)
- Require shell access for troubleshooting comfort
- Want traditional package management (apt, dnf)
- Prefer familiar Linux administration tools
Comparison with Ubuntu and Fedora
| Aspect | Talos Linux | Ubuntu + k3s | Fedora + kubeadm |
|---|---|---|---|
| K8s Installation | Built-in | Single command | Manual (kubeadm) |
| Attack Surface | Minimal (~30 processes) | Medium (~100) | Medium (~100) |
| Resource Overhead | 768MB RAM | 1GB RAM | 2.2GB RAM |
| Disk Footprint | 500MB | 10GB | 15GB |
| Security Model | Immutable, no shell | AppArmor, shell | SELinux, shell |
| Management | API-only (talosctl) | SSH + kubectl | SSH + kubectl |
| Learning Curve | Steep | Easy | Moderate |
| Community Size | Small (growing) | Very Large | Large |
| Support Period | Rolling releases | 5-10 years | 13 months |
| Use Case | Kubernetes only | General-purpose | General-purpose |
| Upgrades | Declarative, simple | Manual OS + K8s | Manual OS + K8s |
| Configuration | Declarative YAML | Imperative + YAML | Imperative + YAML |
| Troubleshooting | API logs/metrics | SSH + logs | SSH + logs |
| GitOps-Friendly | Excellent | Good | Good |
| Best for | K8s-dedicated infra | Homelabs, learning | RHEL ecosystem |
Verdict: Talos is the most secure and efficient option for Kubernetes-only infrastructure, but requires team buy-in to API-driven, immutable paradigm. Ubuntu/Fedora better for general-purpose servers or teams wanting shell access.
Advanced Features
Talos System Extensions
Extend Talos functionality with extensions:
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20240312
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4
Cluster API Integration
Talos works natively with Cluster API:
# Install Cluster API + Talos provider
clusterctl init --infrastructure talos
# Create cluster from template
clusterctl generate cluster homelab --infrastructure talos > cluster.yaml
kubectl apply -f cluster.yaml
Image Factory
Custom Talos images with extensions:
# Build custom image
curl -X POST https://factory.talos.dev/image \
-d '{"talos_version":"v1.9.0","extensions":["siderolabs/intel-ucode"]}'
Disaster Recovery
Talos supports etcd backup/restore:
# Backup etcd
talosctl etcd snapshot ./etcd-snapshot.db --nodes 192.168.1.10
# Restore from snapshot
talosctl bootstrap --recover-from ./etcd-snapshot.db
Production Readiness
Production Use: ✅ Yes (many companies run Talos in production)
High Availability:
- 3+ control plane nodes recommended
- External etcd supported
- Load balancer for API server
Monitoring:
- Prometheus metrics built-in
- Talos dashboard for health
- Standard Kubernetes observability tools
Example Production Clusters:
- Sidero Metal (bare metal provisioning)
- Various cloud providers (AWS, GCP, Azure)
- Edge deployments (minimal footprint)
1.4 - Harvester Analysis
Analysis of Harvester HCI for Kubernetes homelab infrastructure
Overview
Harvester is a Hyperconverged Infrastructure (HCI) platform built on Kubernetes, designed to provide VM and container management on a unified platform. It combines compute, storage, and networking with built-in K3s for orchestration.
Key Facts:
- Latest Version: Harvester 1.4 (based on K3s 1.30+)
- Foundation: Built on RancherOS 2.0, K3s, and KubeVirt
- Support: Supported by SUSE (acquired Rancher)
- Architecture: HCI platform with VM + container workloads
- Management: Web UI + kubectl + Rancher integration
Kubernetes Installation Methods
Harvester includes K3s as its foundation - Kubernetes is built-in.
Built-in K3s (Only Option)
Installation Process:
- Boot Harvester ISO (interactive installer or PXE)
- Complete installation wizard (web UI or console)
- Create cluster (automatic K3s deployment)
- Access via web UI or kubectl
Interactive Installation:
# Boot from Harvester ISO
1. Choose "Create a new Harvester cluster"
2. Configure:
- Cluster token
- Node role (management/worker/witness)
- Network interface (management network)
- VIP (Virtual IP for cluster access)
- Storage disk (Longhorn persistent storage)
3. Install completes (15-20 minutes)
4. Access web UI at https://<VIP>
Configuration (cloud-init for automated install):
# config.yaml
token: my-cluster-token
os:
  hostname: harvester-node-1
  modules:
    - kvm
  kernel_parameters:
    - intel_iommu=on
install:
  mode: create
  device: /dev/sda
  iso_url: https://releases.rancher.com/harvester/v1.4.0/harvester-v1.4.0-amd64.iso
  vip: 192.168.1.100
  vip_mode: static
  networks:
    harvester-mgmt:
      interfaces:
        - name: eth0
      default_route: true
      ip: 192.168.1.10
      subnet_mask: 255.255.255.0
      gateway: 192.168.1.1
Pros:
- Complete HCI solution (VMs + containers)
- Web UI for management (no CLI required)
- Built-in storage (Longhorn CSI)
- Built-in networking (multus, SR-IOV)
- VM live migration
- Rancher integration for multi-cluster management
- K3s built-in (no separate Kubernetes install)
Cons:
- Heavy resource requirements (8GB+ RAM per node)
- Complex architecture (steep learning curve)
- Larger attack surface than minimal OS
- Overkill for container-only workloads
- Requires 3+ nodes for production HA
Cluster Initialization Sequence
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Harvester as Harvester HCI
participant K3s as K3s / KubeVirt
participant Storage as Longhorn Storage
Admin->>Server: Boot Harvester ISO
Server->>Harvester: Start installation wizard
Harvester-->>Admin: Interactive console/web UI
Admin->>Harvester: Configure cluster (token, VIP, storage)
Harvester->>Server: Partition disks (OS + Longhorn storage)
Harvester->>Server: Install RancherOS 2.0 base
Harvester->>Server: Install K3s components
Server->>Server: Reboot
Harvester->>K3s: Start K3s server
K3s->>Server: Initialize control plane
K3s->>Server: Deploy Harvester operators
K3s->>Storage: Deploy Longhorn for persistent storage
K3s->>Server: Deploy KubeVirt for VM management
K3s->>Server: Deploy multus CNI (multi-network)
Harvester-->>Admin: Web UI ready at https://<VIP>
Admin->>Harvester: Add additional nodes (join cluster)
Harvester->>K3s: Join nodes to cluster
K3s->>Storage: Replicate storage across nodes
Harvester-->>Admin: Cluster ready (20-30 minutes)
Admin->>Harvester: Create VMs or deploy containers
Maintenance Requirements
OS Updates
Harvester Upgrades (includes OS + K3s):
# Via Web UI:
# Settings → Upgrade → Select version → Start upgrade
# Via kubectl (after downloading upgrade image):
kubectl apply -f https://releases.rancher.com/harvester/v1.4.0/version.yaml
# Monitor upgrade progress
kubectl get upgrades -n harvester-system
Frequency:
- Harvester releases: Every 2-3 months (minor versions)
- Security patches: Included in Harvester releases
- K3s upgrades: Bundled with Harvester upgrades
- No separate OS patching (managed by Harvester)
Kubernetes Upgrades
K3s is upgraded with Harvester - no separate upgrade process.
Version Compatibility:
- Harvester 1.4.x → K3s 1.30+
- Harvester 1.3.x → K3s 1.28+
- Harvester 1.2.x → K3s 1.26+
Upgrade Process:
- Web UI or kubectl to trigger upgrade
- Rolling upgrade of nodes (one at a time)
- VM live migration during node upgrades
- Automatic rollback on failure
Resource Overhead
Single Node (Harvester HCI):
- RAM: 8GB minimum (16GB recommended for VMs)
- CPU: 4 cores minimum (8 cores recommended)
- Disk: 250GB minimum (SSD recommended)
- 100GB for OS/Harvester components
- 150GB+ for Longhorn storage (VM disks)
- Network: 1 Gbps minimum (10 Gbps for production)
Three-Node Cluster (Production HA):
- RAM: 32GB per node (64GB for VM-heavy workloads)
- CPU: 8 cores per node minimum
- Disk: 500GB+ per node (NVMe SSD recommended)
- Network: 10 Gbps recommended (separate storage network ideal)
Comparison:
- Ubuntu + k3s: 1GB RAM
- Talos: 768MB RAM
- Harvester: 8GB+ RAM (much heavier)
Note: Harvester is designed for multi-node HCI, not single-node homelabs.
Security Posture
Strengths:
- SELinux-based (RancherOS 2.0 foundation)
- Immutable OS layer (similar to Talos)
- RBAC built-in (Kubernetes + Rancher)
- Network segmentation (multus CNI)
- VM isolation (KubeVirt)
- Signed images and secure boot support
Attack Surface:
- Larger than Talos/k3s: Includes web UI, VM management, storage layer
- KubeVirt adds additional components
- Web UI is additional attack vector
- More processes than minimal OS (~50+ services)
Security Features:
# VM network isolation example
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: production-vlan
spec:
  vlanID: 100
  uplink:
    linkAttributes: 1500
Hardening:
- Firewall rules (web UI or kubectl)
- RBAC policies (restrict VM/namespace access)
- Network policies (isolate workloads)
- Rancher authentication integration (LDAP, SAML)
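For the network-policy item above, isolation is expressed with standard Kubernetes NetworkPolicy objects; a minimal default-deny-ingress sketch for a hypothetical vm-workloads namespace:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: vm-workloads
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF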
Learning Curve
Ease of Adoption: ⭐⭐⭐ (Moderate)
- Web UI simplifies management (no CLI required for basic tasks)
- Requires understanding of VMs + containers
- Kubernetes knowledge helpful but not required initially
- Longhorn storage concepts (replicas, snapshots)
- KubeVirt for VM management (learning curve)
Required Knowledge:
- Basic Kubernetes concepts (pods, services)
- VM management (KubeVirt/libvirt)
- Storage concepts (Longhorn, CSI)
- Networking (VLANs, SR-IOV optional)
- Web UI navigation
Debugging:
# Access via kubectl (kubeconfig from web UI)
kubectl get nodes
# View Harvester logs
kubectl logs -n harvester-system <pod-name>
# VM console access (via web UI or virtctl)
virtctl console <vm-name>
# Storage debugging
kubectl get volumes -A
Ecosystem Maturity: ⭐⭐⭐⭐ (Good)
- Documentation: Excellent official docs
- Community: Active Slack, GitHub Discussions, forums
- Commercial Support: Available from SUSE/Rancher
- Third-Party Tools: Rancher ecosystem integration
- Tutorials: Growing number of guides and videos
Pros and Cons Summary
Pros
- Good, because unified platform for VMs + containers (no separate hypervisor)
- Good, because built-in K3s (Kubernetes included)
- Good, because web UI simplifies management (no CLI required)
- Good, because built-in persistent storage (Longhorn CSI)
- Good, because VM live migration (no downtime during maintenance)
- Good, because multi-network support (multus CNI, SR-IOV)
- Good, because Rancher integration (multi-cluster management)
- Good, because automatic upgrades (OS + K3s + components)
- Good, because commercial support available (SUSE)
- Good, because designed for bare-metal HCI (no cloud dependencies)
- Neutral, because immutable OS layer (similar to Talos benefits)
Cons
- Bad, because very heavy resource requirements (8GB+ RAM minimum)
- Bad, because complex architecture (KubeVirt, Longhorn, multus, etc.)
- Bad, because overkill for container-only workloads (use k3s/Talos instead)
- Bad, because larger attack surface than minimal OS (web UI, VM layer)
- Bad, because requires 3+ nodes for production HA (not single-node friendly)
- Bad, because steep learning curve for full feature set (VMs + storage + networking)
- Bad, because relatively new platform (less mature than Ubuntu/Fedora)
- Bad, because limited to Rancher ecosystem (vendor lock-in)
- Bad, because slower to adopt latest Kubernetes versions (depends on K3s bundle)
- Neutral, because opinionated HCI design (pro for VM use cases, con for simplicity)
Recommendations
Best for:
- Hybrid workloads (VMs + containers on same platform)
- Homelab users wanting to consolidate VM hypervisor + Kubernetes
- Teams familiar with Rancher ecosystem
- Multi-node clusters (3+ nodes)
- Environments requiring VM live migration
- Users wanting web UI for infrastructure management
- Replacing VMware/Proxmox + Kubernetes with unified platform
Best Installation Method:
- Only option: Interactive ISO install or PXE with cloud-init
Avoid if:
- Running container-only workloads (use k3s or Talos instead)
- Limited resources (< 8GB RAM per node)
- Single-node homelab (Harvester designed for multi-node)
- Want minimal attack surface (use Talos)
- Prefer traditional Linux shell access (use Ubuntu/Fedora)
- Need latest Kubernetes versions immediately (Harvester lags upstream)
Comparison with Other Options
| Aspect | Harvester | Talos Linux | Ubuntu + k3s | Fedora + kubeadm |
|---|---|---|---|---|
| Primary Use Case | VMs + Containers | Containers only | General-purpose | General-purpose |
| Resource Overhead | 8GB+ RAM | 768MB RAM | 1GB RAM | 2.2GB RAM |
| Kubernetes | Built-in K3s | Built-in | Install k3s | Install kubeadm |
| Management | Web UI + kubectl | API-only (talosctl) | SSH + kubectl | SSH + kubectl |
| Storage | Built-in Longhorn | External CSI | External CSI | External CSI |
| VM Support | Native (KubeVirt) | No | Via KubeVirt | Via KubeVirt |
| Learning Curve | Moderate | Steep | Easy | Moderate |
| Attack Surface | Large | Minimal | Medium | Medium |
| Multi-Node | Designed for | Supports | Supports | Supports |
| Single-Node | Not ideal | Excellent | Excellent | Good |
| Best for | VM + K8s hybrid | K8s-only | Homelab/learning | RHEL ecosystem |
Verdict: Harvester is excellent for VM + container hybrid workloads with 3+ nodes, but overkill for container-only infrastructure. Use Talos or k3s for Kubernetes-only clusters, Ubuntu/Fedora for general-purpose servers.
Advanced Features
VM Management (KubeVirt)
Create VMs via YAML:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ubuntu-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
        resources:
          requests:
            memory: 4Gi
            cpu: 2
      volumes:
        - name: root
          containerDisk:
            image: docker.io/harvester/ubuntu:22.04
Live Migration
Move VMs between nodes:
# Via web UI: VM → Actions → Migrate
# Via CLI: virtctl triggers a live migration of the running VM to another node
virtctl migrate ubuntu-vm
Backup and Restore
Harvester supports VM backups:
# Configure S3 backup target (web UI)
# Create VM snapshot
# Restore from snapshot or backup
Rancher Integration
Manage multiple clusters:
# Import Harvester cluster into Rancher
# Deploy workloads across clusters
# Central authentication and RBAC
Use Case Examples
Use Case 1: Replace VMware + Kubernetes
Scenario: Currently running VMware ESXi for VMs + separate Kubernetes cluster
Harvester Solution:
- Consolidate to 3-node Harvester cluster
- Migrate VMs to KubeVirt
- Deploy containers on same cluster
- Save VMware licensing costs
Benefits:
- Single platform for VMs + containers
- Unified management (web UI + kubectl)
- Built-in HA and live migration
Use Case 2: Homelab with Mixed Workloads
Scenario: Need Windows VMs + Linux containers + storage server
Harvester Solution:
- Windows VMs via KubeVirt (GPU passthrough supported)
- Linux containers via K3s workloads
- Longhorn for persistent storage (NFS export supported)
Benefits:
- No need for separate Proxmox/ESXi
- Kubernetes-native management
- Learn enterprise HCI platform
Use Case 3: Edge Computing
Scenario: Deploy compute at remote sites (3-5 nodes each)
Harvester Solution:
- Harvester cluster at each edge location
- Rancher for central management
- VM + container workloads
Benefits:
- Autonomous operation (no cloud dependency)
- Rancher multi-cluster management
- Built-in storage and networking
Production Readiness
Production Use: ✅ Yes (used in enterprise environments)
High Availability:
- 3+ nodes required for HA
- Witness node for even-node clusters
- VM live migration during maintenance
- Longhorn 3-replica storage
Monitoring:
- Built-in Prometheus + Grafana
- Rancher monitoring integration
- Alerting and notifications
Disaster Recovery:
- VM backups to S3
- Cluster backups (etcd + config)
- Restore to new cluster
Enterprise Features:
- Rancher authentication (LDAP, SAML, OAuth)
- Multi-tenancy (namespaces, RBAC)
- Audit logging
- Network policies
2 - Amazon Web Services Analysis
Technical analysis of Amazon Web Services capabilities for hosting network boot infrastructure
This section contains detailed analysis of Amazon Web Services (AWS) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.
Overview
Amazon Web Services is Amazon’s comprehensive cloud computing platform, offering compute, storage, networking, and managed services. This analysis focuses on AWS’s capabilities to support the network boot architecture decided in ADR-0002.
Key Services Evaluated
- EC2: Virtual machine instances for hosting boot server
- VPN: Site-to-Site VPN (IPsec) and self-managed VPN options
- Elastic Load Balancing: Application and Network Load Balancers
- NAT Gateway: Network address translation for outbound connectivity
- VPC: Virtual Private Cloud networking and routing
Documentation Sections
2.1 - AWS Network Boot Protocol Support
Analysis of Amazon Web Services support for TFTP, HTTP, and HTTPS routing for network boot infrastructure
Network Boot Protocol Support on Amazon Web Services
This document analyzes AWS’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.
TFTP (Trivial File Transfer Protocol) Support
Native Support
Status: ❌ Not natively supported by Elastic Load Balancing
AWS’s Elastic Load Balancing services do not support TFTP protocol natively:
- Application Load Balancer (ALB): HTTP/HTTPS only (Layer 7)
- Network Load Balancer (NLB): TCP/UDP support, but not TFTP-aware
- Classic Load Balancer: Deprecated, similar limitations
TFTP operates on UDP port 69 with unique protocol semantics (variable block sizes, retransmissions, port negotiation) that standard load balancers cannot parse.
Implementation Options
Option 1: Direct EC2 Instance Access (Recommended for VPN Scenario)
Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from an EC2 instance:
- Approach: Run TFTP server (e.g., tftpd-hpa or dnsmasq) on an EC2 instance
- Access: Home lab connects via VPN tunnel to instance’s private IP
- Security Group: Allow UDP/69 from the VPN subnet/security group (a CLI sketch follows the pros and cons below)
- Pros:
- Simple implementation
- No load balancer needed (single boot server sufficient for home lab)
- TFTP traffic encrypted through VPN tunnel
- Direct instance-to-client communication
- Cons:
- Single point of failure (no HA)
- Manual failover if instance fails
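A sketch of the security-group restriction from Option 1, using the AWS CLI (the group ID and VPN client CIDR are placeholders):
# Allow boot protocols only from the VPN client subnet
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 69 --cidr 10.8.0.0/24
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 10.8.0.0/24
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 10.8.0.0/24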
Option 2: Network Load Balancer (NLB) UDP Passthrough
While NLB doesn’t understand TFTP protocol, it can forward UDP traffic:
- Approach: Configure NLB to forward UDP/69 to target group
- Limitations:
- No TFTP-specific health checks
- Health checks would use TCP or different protocol
- Adds cost and complexity without significant benefit for single server
- Use Case: Only relevant for multi-AZ HA deployment (overkill for home lab)
TFTP Security Considerations
- Encryption: TFTP itself is unencrypted, but VPN tunnel provides encryption
- Security Groups: Restrict UDP/69 to VPN security group or CIDR only
- File Access Control: Configure TFTP server with restricted file access
- Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
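As an illustration of the restricted, read-only setup above, tftpd-hpa on Ubuntu/Debian is configured via /etc/default/tftpd-hpa; a sketch with example listen address and directory:
# /etc/default/tftpd-hpa - chrooted to the boot directory, uploads disabled (no --create)
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS="10.8.0.1:69"
TFTP_OPTIONS="--secure"
Restart the service (sudo systemctl restart tftpd-hpa) after editing.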
HTTP Support
Native Support
Status: ✅ Fully supported
AWS provides comprehensive HTTP support through multiple services:
Elastic Load Balancing - Application Load Balancer
- Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (preview)
- Port: Any port (typically 80 for HTTP)
- Routing: Path-based, host-based, query string, header-based routing
- Health Checks: HTTP health checks with configurable paths and response codes
- SSL Offloading: Terminate SSL at ALB and use HTTP to backend
- Backend: EC2 instances, ECS, EKS, Lambda
EC2 Direct Access
For VPN scenario, HTTP can be served directly from EC2 instance:
- Approach: Run HTTP server (nginx, Apache, custom service) on EC2
- Access: Home lab accesses via VPN tunnel to private IP
- Security Group: Allow TCP/80 from VPN security group
- Pros: Simpler than ALB for single boot server
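For the direct-EC2 approach above, even the stock Python HTTP server is enough to serve kernels and initrds over the tunnel; a sketch with example address and directory (nginx or Apache is preferable as a long-running service):
# Serve boot assets on the VPN-facing interface only (Python 3.7+ for --directory)
sudo python3 -m http.server 80 --bind 10.8.0.1 --directory /srv/netboot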
HTTP Boot Flow for Network Boot
- PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
- iPXE → HTTP: iPXE chainloads kernel/initrd via HTTP
- Kernel/Initrd: Large boot files served efficiently over HTTP
- Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
- Compression: gzip compression for text-based configs
- CloudFront: Optional CDN for caching boot files (probably overkill for VPN scenario)
- TCP Optimization: AWS network optimized for low-latency TCP
HTTPS Support
Native Support
Status: ✅ Fully supported with advanced features
AWS provides enterprise-grade HTTPS support:
Elastic Load Balancing - Application Load Balancer
- Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 (preview)
- SSL/TLS Termination: Terminate SSL at ALB
- Certificate Management:
- AWS Certificate Manager (ACM) - free SSL certificates with automatic renewal
- Import custom certificates
- Integration with private CA via ACM Private CA
- TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable via security policy)
- Cipher Suites: Predefined security policies (modern, compatible, legacy)
- SNI Support: Multiple certificates on single load balancer
AWS Certificate Manager (ACM)
- Free Certificates: No cost for public SSL certificates used with AWS services
- Automatic Renewal: ACM automatically renews certificates before expiration
- Private CA: ACM Private CA for internal PKI (additional cost)
- Integration: Native integration with ALB, CloudFront, API Gateway
HTTPS for Network Boot
Use Case
Modern UEFI firmware and iPXE support HTTPS boot:
- iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
- UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot
- Security: Boot file integrity verified via HTTPS chain of trust
Implementation on AWS
Certificate Provisioning:
- Use ACM certificate for public domain (free, auto-renewed)
- Use self-signed certificate for VPN-only access (add to iPXE trust store)
- Use ACM Private CA for internal PKI ($400/month - expensive for home lab)
ALB Configuration:
- HTTPS listener on port 443
- Target group pointing to EC2 boot server
- Security policy with TLS 1.2+ minimum
Alternative: Direct EC2 HTTPS:
- Run nginx/Apache with TLS on EC2 instance
- Access via VPN tunnel to private IP with HTTPS
- Simpler setup for VPN-only scenario
- Use Let’s Encrypt or self-signed certificate
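A sketch for the self-signed option above (hostname and paths are examples; the certificate must be added to the iPXE/UEFI trust store as noted earlier):
sudo openssl req -x509 -newkey rsa:4096 -nodes -days 825 \
  -keyout /etc/ssl/private/boot-server.key \
  -out /etc/ssl/certs/boot-server.crt \
  -subj "/CN=boot.homelab.internal"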
Mutual TLS (mTLS) Support
AWS ALB supports mutual TLS authentication (as of 2022):
- Client Certificates: Require client certificates for authentication
- Trust Store: Upload trusted CA certificates to ALB
- Use Case: Ensure only authorized home lab servers can access boot files
- Integration: Combine with VPN for defense-in-depth
- Passthrough Mode: ALB can pass client cert to backend for validation
Routing and Load Balancing Capabilities
VPC Routing
- Route Tables: Define routes to direct traffic through VPN gateway
- Route Propagation: BGP route propagation for VPN connections
- Transit Gateway: Advanced multi-VPC/VPN routing (overkill for home lab)
Security Groups
- Stateful Firewall: Automatic return traffic handling
- Ingress/Egress Rules: Fine-grained control by protocol, port, source/destination
- Security Group Chaining: Reference security groups in rules (elegant for VPN setup)
- VPN Subnet Restriction: Allow traffic only from VPN-connected subnet
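A sketch of security group chaining for the boot server (group IDs are placeholders): allow HTTP only from members of the VPN gateway's security group rather than from an IP range:
aws ec2 authorize-security-group-ingress \
  --group-id sg-0bootserver111 \
  --protocol tcp \
  --port 80 \
  --source-group sg-0vpngateway222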
Network ACLs (Optional)
- Stateless Firewall: Subnet-level access control
- Defense in Depth: Additional layer beyond security groups
- Use Case: Probably unnecessary for simple VPN boot server
Cost Implications
Data Transfer Costs
- VPN Traffic: Data transfer through VPN gateway charged at standard rates
- Intra-Region: Free for traffic within same region/VPC
- Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
- Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.14/month (US East egress)
Load Balancing Costs
- Application Load Balancer: $0.0225/hour + $0.008 per LCU-hour ($16-20/month minimum)
- Network Load Balancer: $0.0225/hour + $0.006 per NLCU-hour ($16-18/month minimum)
- For VPN Scenario: Load balancer unnecessary (single EC2 instance sufficient)
Compute Costs
- t3.micro Instance: ~$7.50/month (on-demand pricing, US East)
- t4g.micro Instance: ~$6.00/month (ARM-based, cheaper, sufficient for boot server)
- Reserved Instances: Up to 72% savings with 1-year or 3-year commitment
- Savings Plans: Flexible discounts for consistent compute usage
ACM Certificate Costs
- Public Certificates: Free when used with AWS services
- Private CA: $400/month (too expensive for home lab)
Comparison with Requirements
| Requirement | AWS Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via EC2, not ELB | Direct EC2 access via VPN |
| HTTP | ✅ Full support | EC2 or ALB |
| HTTPS | ✅ Full support | EC2 or ALB with ACM |
| VPN Integration | ✅ Native VPN | Site-to-Site VPN or self-managed |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ ACM (free) | Automatic renewal |
| Cost Efficiency | ✅ Low-cost instances | t4g.micro sufficient |
Recommendations
For VPN-Based Architecture (per ADR-0002)
EC2 Instance: Deploy single t4g.micro or t3.micro instance with:
- TFTP server (tftpd-hpa or dnsmasq)
- HTTP server (nginx or simple Python HTTP server)
- Optional HTTPS with Let’s Encrypt or self-signed certificate
VPN Connection: Connect home lab to AWS via:
- Site-to-Site VPN (IPsec) - managed service, higher cost (~$36/month)
- Self-managed WireGuard on EC2 - lower cost, more control
Security Groups: Restrict access to:
- UDP/69 (TFTP) from VPN security group only
- TCP/80 (HTTP) from VPN security group only
- TCP/443 (HTTPS) from VPN security group only
No Load Balancer: For home lab scale, direct EC2 access is sufficient
Health Monitoring: Use CloudWatch for instance and service health
If HA Required (Future Enhancement)
- Deploy multi-AZ EC2 instances with Network Load Balancer
- Use S3 as backend for boot files with EC2 serving as cache
- Implement auto-recovery with Auto Scaling Group (min=max=1)
References
2.2 - AWS WireGuard VPN Support
Analysis of WireGuard VPN deployment options on Amazon Web Services for secure site-to-site connectivity
WireGuard VPN Support on Amazon Web Services
This document analyzes options for deploying WireGuard VPN on AWS to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.
WireGuard Overview
WireGuard is a modern VPN protocol that provides:
- Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
- Performance: High throughput with low overhead
- Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
- Configuration: Simple key-based configuration
- Kernel Integration: Mainline Linux kernel support since 5.6
AWS Native VPN Support
Site-to-Site VPN (IPsec)
Status: ❌ WireGuard not natively supported
AWS’s managed Site-to-Site VPN supports:
- IPsec VPN: IKEv1, IKEv2 with pre-shared keys
- Redundancy: Two VPN tunnels per connection for high availability
- BGP Support: Dynamic routing via BGP
- Transit Gateway: Scalable multi-VPC VPN hub
Limitation: Site-to-Site VPN does not support WireGuard protocol natively.
Cost: Site-to-Site VPN
- VPN Connection: ~$0.05/hour = ~$36/month
- Data Transfer: Standard data transfer out rates (~$0.09/GB for first 10TB)
- Total Estimate: ~$36-50/month for managed IPsec VPN
Self-Managed WireGuard on EC2
Implementation Approach
Since AWS doesn’t offer managed WireGuard, deploy WireGuard on an EC2 instance:
Status: ✅ Fully supported via EC2
Architecture
graph LR
A[Home Lab] -->|WireGuard Tunnel| B[AWS EC2 Instance]
B -->|VPC Network| C[Boot Server EC2]
B -->|IP Forwarding| C
subgraph "Home Network"
A
D[UDM Pro]
D -.WireGuard Client.- A
end
subgraph "AWS VPC"
B[WireGuard Gateway EC2]
C[Boot Server EC2]
end
EC2 Configuration
WireGuard Gateway Instance:
- Instance Type: t4g.micro or t3.micro ($6-7.50/month)
- OS: Ubuntu 22.04 LTS or Amazon Linux 2023 (native WireGuard support)
- Source/Dest Check: Disable to allow IP forwarding
- Elastic IP: Allocate Elastic IP for stable WireGuard endpoint
- Security Group: Allow UDP port 51820 from home lab public IP
Boot Server Instance:
- Network: Same VPC as WireGuard gateway
- Private IP Only: No Elastic IP (accessed via VPN)
- Route Traffic: Through WireGuard gateway instance
Installation Steps
# On EC2 Instance (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools
# Generate server keys
wg genkey | sudo tee /etc/wireguard/server_private.key | wg pubkey | sudo tee /etc/wireguard/server_public.key > /dev/null
sudo chmod 600 /etc/wireguard/server_private.key
# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf
Example /etc/wireguard/wg0.conf on AWS EC2:
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24
Corresponding config on UDM Pro:
[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>
[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <AWS_ELASTIC_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.0.0.0/16
PersistentKeepalive = 25
Enable and Start WireGuard
# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0
# Verify status
sudo wg show
AWS VPC Configuration
Security Groups
Create security group for WireGuard gateway:
aws ec2 create-security-group \
--group-name wireguard-gateway-sg \
--description "WireGuard VPN gateway" \
--vpc-id vpc-xxxxxx
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxx \
--protocol udp \
--port 51820 \
--cidr <HOME_LAB_PUBLIC_IP>/32
Allow SSH for management (optional, restrict to trusted IP):
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxx \
--protocol tcp \
--port 22 \
--cidr <TRUSTED_IP>/32
Disable Source/Destination Check
Required for IP forwarding to work:
aws ec2 modify-instance-attribute \
--instance-id i-xxxxxx \
--no-source-dest-check
Elastic IP Allocation
Allocate and associate Elastic IP for stable endpoint:
aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
--instance-id i-xxxxxx \
--allocation-id eipalloc-xxxxxx
Cost: Elastic IP is free when associated with running instance, but charged ~$3.60/month if unattached.
Route Table Configuration
Add route to direct home lab subnet traffic through WireGuard gateway:
aws ec2 create-route \
--route-table-id rtb-xxxxxx \
--destination-cidr-block 192.168.1.0/24 \
--instance-id i-xxxxxx
This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway instance.
UDM Pro WireGuard Integration
Native Support
Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)
The UniFi Dream Machine Pro includes native WireGuard VPN support:
- GUI Configuration: Web UI for WireGuard VPN setup
- Site-to-Site: Support for site-to-site VPN tunnels
- Performance: Hardware acceleration for encryption (if available)
- Routing: Automatic route injection for remote subnets
Configuration Steps on UDM Pro
Network Settings → VPN:
- Create new VPN connection
- Select “WireGuard”
- Generate key pair or import existing
Peer Configuration:
- Peer Public Key: AWS EC2 WireGuard instance’s public key
- Endpoint: AWS Elastic IP address
- Port: 51820
- Allowed IPs: AWS VPC CIDR (e.g., 10.0.0.0/16)
- Persistent Keepalive: 25 seconds
Route Injection:
- UDM Pro automatically adds routes to AWS subnets
- Home lab servers can reach AWS boot server via VPN
Firewall Rules:
- Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN
Alternative: Manual WireGuard on UDM Pro
If native support is insufficient, use wireguard-go via udm-utilities:
- Repository: boostchicken/udm-utilities
- Script: on_boot.d script to start WireGuard on boot
- Persistence: Survives firmware updates with on-boot script
Throughput
WireGuard on EC2 performance varies by instance type:
- t4g.micro (2 vCPU, ARM): ~100-300 Mbps
- t3.micro (2 vCPU, x86): ~100-300 Mbps
- t3.small (2 vCPU): ~500-800 Mbps
- t3.medium (2 vCPU): ~1+ Gbps
For network boot (typical boot = 50-200MB), even t4g.micro is sufficient:
- Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
- Recommendation: t4g.micro adequate and most cost-effective
Latency
- VPN Overhead: WireGuard adds minimal latency (~1-5ms)
- AWS Network: Low-latency network infrastructure
- Total Latency: Primarily dependent on home ISP and AWS region proximity
CPU Usage
- Encryption: ChaCha20 is CPU-efficient
- Kernel Module: Minimal CPU overhead in kernel space
- t4g.micro: Sufficient CPU for home lab VPN throughput
- ARM Advantage: t4g instances use Graviton processors (better price/performance)
Security Considerations
Key Management
- Private Keys: Store securely, never commit to version control
- Key Rotation: Rotate keys periodically (e.g., annually)
- Secrets Manager: Store WireGuard private keys in AWS Secrets Manager
- Retrieve at instance startup via user data script
- Avoid storing in AMIs or instance metadata
- IAM Role: Grant EC2 instance IAM role to read secret
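A minimal sketch of seeding the secret (assumes wireguard-tools and the AWS CLI are available wherever the key is generated; the secret name matches the user data script shown later in this document):
# Generate the server private key and store it in Secrets Manager
PRIVATE_KEY=$(wg genkey)
aws secretsmanager create-secret \
  --name wireguard-server-key \
  --secret-string "$PRIVATE_KEY"
unset PRIVATE_KEY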
Firewall Hardening
- Security Group Restriction: Limit WireGuard port to home lab public IP only
- Least Privilege: Boot server security group allows only VPN security group
- No Public Access: Boot server has no Elastic IP or public route
Monitoring and Alerts
- CloudWatch Logs: Stream WireGuard logs to CloudWatch
- CloudWatch Alarms: Alert on VPN tunnel down (no recent handshakes)
- VPC Flow Logs: Monitor VPN traffic patterns
DDoS Protection
- UDP Amplification: WireGuard resistant to DDoS amplification attacks
- AWS Shield: Basic DDoS protection included free on all AWS resources
- Shield Advanced: Optional ($3,000/month - overkill for VPN endpoint)
High Availability Options
Multi-AZ Failover
Deploy WireGuard gateways in multiple Availability Zones:
- Primary: us-east-1a WireGuard instance
- Secondary: us-east-1b WireGuard instance
- Failover: UDM Pro switches endpoints if primary fails
- Cost: Doubles instance costs (~$12-15/month for 2 instances)
Auto Scaling Group (Single Instance)
Use Auto Scaling Group with min=max=1 for auto-recovery:
- Health Checks: EC2 status checks
- Auto-Recovery: ASG replaces failed instance automatically
- Elastic IP: Reassociate Elastic IP to new instance via Lambda/script
- Limitation: Brief downtime during recovery (~2-5 minutes)
Health Monitoring
Monitor WireGuard tunnel health with CloudWatch custom metrics:
#!/bin/bash
# On EC2 instance, run periodically via cron
HANDSHAKE=$(wg show wg0 latest-handshakes | awk '{print $2}')
NOW=$(date +%s)
AGE=$((NOW - HANDSHAKE))
aws cloudwatch put-metric-data \
--namespace WireGuard \
--metric-name TunnelAge \
--value $AGE \
--unit Seconds
Alert if handshake age exceeds threshold (e.g., 180 seconds).
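A sketch of the corresponding alarm (the SNS topic ARN is a placeholder):
aws cloudwatch put-metric-alarm \
  --alarm-name wireguard-tunnel-stale \
  --namespace WireGuard \
  --metric-name TunnelAge \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 180 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:homelab-alerts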
User Data Script for Auto-Configuration
EC2 user data script to configure WireGuard on launch:
#!/bin/bash
# Install WireGuard
apt update && apt install -y wireguard wireguard-tools
# Retrieve private key from Secrets Manager
aws secretsmanager get-secret-value \
--secret-id wireguard-server-key \
--query SecretString \
--output text > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key
# Configure interface (full config omitted for brevity)
# ...
# Enable and start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
Requires IAM instance role with secretsmanager:GetSecretValue permission.
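As a sketch, that permission can be attached as an inline policy on the instance role (role name, account ID, and secret ARN are placeholders):
aws iam put-role-policy \
  --role-name wireguard-gateway-role \
  --policy-name read-wireguard-key \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:wireguard-server-key-*"
    }]
  }'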
Cost Analysis
Self-Managed WireGuard on EC2
| Component | Cost (US East) |
|---|---|
| t4g.micro instance (730 hrs/month) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Data transfer out (1GB/month) | ~$0.09 |
| Monthly Total | ~$6.09 |
| Annual Total | ~$73 |
With Reserved Instance (1-year, no upfront):
| Component | Cost |
|---|---|
| t4g.micro RI (1-year) | ~$3.50/month |
| Elastic IP | $0.00 |
| Data transfer | ~$0.09 |
| Monthly Total | ~$3.59 |
| Annual Total | ~$43 |
Site-to-Site VPN (IPsec - if WireGuard not used)
| Component | Cost |
|---|---|
| VPN Connection (2 tunnels) | ~$36 |
| Data transfer (1GB/month) | ~$0.09 |
| Monthly Total | ~$36 |
| Annual Total | ~$432 |
Cost Savings: Self-managed WireGuard saves ~$360/year vs Site-to-Site VPN (or ~$390/year with Reserved Instance).
Comparison with Requirements
| Requirement | AWS Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via EC2 | Self-managed on instance |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Very low cost | t4g.micro ~$6/month (on-demand) |
| Performance | ✅ Sufficient | 100+ Mbps on t4g.micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-AZ or ASG |
Recommendations
For Home Lab VPN (per ADR-0002)
Self-Managed WireGuard: Deploy on EC2 t4g.micro instance
- Cost: ~$6/month on-demand, ~$3.50/month with Reserved Instance
- Performance: Sufficient for network boot traffic
- Simplicity: Easy to configure and maintain
Single AZ Deployment: Unless HA required, single instance adequate
- Region Selection: Choose region closest to home lab for lowest latency
- AZ: Single AZ sufficient (boot server not mission-critical)
UDM Pro Native WireGuard: Use built-in WireGuard client
- Configuration: Add AWS instance as WireGuard peer in UDM Pro UI
- Route Injection: UDM Pro automatically routes AWS subnets
Security Best Practices:
- Store WireGuard private key in Secrets Manager
- Restrict security group to home lab public IP only
- Use user data script to retrieve key and configure on boot
- Enable CloudWatch logging for VPN events
- Assign IAM instance role with minimal permissions
Monitoring: Set up CloudWatch alarms for:
- Instance status check failures
- High CPU usage
- VPN tunnel age (custom metric)
Cost Optimization
- Reserved Instance: Commit to 1-year Reserved Instance for ~40% savings
- Spot Instance: Consider Spot for even lower cost (~70% savings), but adds complexity (handle interruptions)
- ARM Architecture: Use t4g (Graviton) for 20% better price/performance vs t3
Future Enhancements
- HA Setup: Deploy secondary WireGuard instance in different AZ
- Automated Failover: Lambda function to reassociate Elastic IP on failure
- IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
- Mesh VPN: Expand to mesh topology if multiple sites added
References
3 - Google Cloud Platform Analysis
Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure
This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.
Overview
Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.
Key Services Evaluated
- Compute Engine: Virtual machine instances for hosting boot server
- Cloud VPN / VPC: Network connectivity and VPN capabilities
- Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
- Cloud NAT: Network address translation for outbound connectivity
- VPC Network: Software-defined networking and routing
Documentation Sections
3.1 - Cloud Storage FUSE (gcsfuse)
Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure
Overview
Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.
Project: GoogleCloudPlatform/gcsfuse
License: Apache 2.0
Status: Generally Available (GA)
Latest Version: v2.x (as of 2024)
How gcsfuse Works
gcsfuse translates filesystem operations into GCS API calls:
- Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
- Directory Structure: Interprets / in object names as directory separators
- File Operations: Translates read(), write(), open(), etc. into GCS API requests
- Metadata: Maintains file attributes (size, modification time) via GCS metadata
- Caching: Optional stat, type, list, and file caching to reduce API calls
Example:
- GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
- Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img
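A minimal mount/unmount round trip looks like this (bucket name and paths follow the example above; --implicit-dirs makes directory prefixes visible even when no placeholder objects exist):
gcsfuse --implicit-dirs boot-assets /mnt/boot-assets
ls /mnt/boot-assets/kernels/
fusermount -u /mnt/boot-assets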
Relevance to Network Boot Infrastructure
In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.
Potential Use Cases
- Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
- Configuration Sync: Access boot profiles and machine mappings from GCS as local files
- Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
- Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
Architecture Pattern
┌─────────────────────────┐
│ Boot Server Process │
│ (Cloud Run/Compute) │
└───────────┬─────────────┘
│ filesystem operations
│ (read, open, stat)
▼
┌─────────────────────────┐
│ gcsfuse mount point │
│ /var/lib/boot-assets │
└───────────┬─────────────┘
│ FUSE layer
│ (translates to GCS API)
▼
┌─────────────────────────┐
│ Cloud Storage Bucket │
│ gs://boot-assets/ │
└─────────────────────────┘
Latency
- Much higher latency than local filesystem: Every operation requires GCS API call(s)
- No default caching: Without caching enabled, every read re-fetches from GCS
- Network round-trip: Minimum ~10-50ms latency per operation (depending on region)
Throughput
Single Large File:
- Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
- Write: Comparable to gsutil cp for large files
- With parallel downloads: Up to 9x faster for single-threaded reads of large files
Small Files:
- Poor performance for random I/O on small files
- Bulk operations on many small files create significant bottlenecks
- ls on directories with thousands of objects can take minutes
Concurrent Access:
- Performance degrades significantly with parallel readers (8 instances: ~30 hours vs 16 minutes with local data)
- Not recommended for high-concurrency scenarios (web servers, NAS)
Streaming Writes (default): Upload data directly to GCS as written
- Up to 40% faster for large sequential writes
- Reduces local disk usage (no staging file)
Parallel Downloads: Download large files using multiple workers
- Up to 9x faster model load times
- Best for single-threaded reads of large files
File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)
- Up to 2.3x faster training time (AI/ML workloads)
- Up to 3.4x higher throughput
- Requires explicit cache directory configuration
Metadata Cache: Cache stat, type, and list operations
- Stat and type caches enabled by default
- Configurable TTL (default: 60s, set -1 for unlimited)
Caching Configuration
gcsfuse provides four types of caching:
1. Stat Cache
Caches file attributes (size, modification time, existence).
# Enable with unlimited size and TTL
gcsfuse \
--stat-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).
2. Type Cache
Caches file vs directory type information.
gcsfuse \
--type-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Speeds up directory traversal and ls operations.
3. List Cache
Caches directory listing results.
gcsfuse \
--max-conns-per-host=100 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Improves performance for applications that repeatedly list directory contents.
4. File Cache
Caches actual file contents locally.
gcsfuse \
--file-cache-max-size-mb=-1 \
--cache-dir=/mnt/local-ssd \
--file-cache-cache-file-for-range-read=true \
--file-cache-enable-parallel-downloads=true \
bucket-name /mount/point
Use case: Essential for AI/ML training, repeated reads of large files.
Recommended cache storage:
- Local SSD: Fastest, but ephemeral (data lost on restart)
- Persistent Disk: Persistent but slower than Local SSD
- tmpfs (RAM disk): Fastest but limited by memory
Production Configuration Example
# config.yaml for gcsfuse
metadata-cache:
ttl-secs: -1 # Never expire (use only if bucket is read-only or single-writer)
stat-cache-max-size-mb: -1
type-cache-max-size-mb: -1
file-cache:
max-size-mb: -1 # Unlimited (limited by disk space)
cache-file-for-range-read: true
enable-parallel-downloads: true
parallel-downloads-per-file: 16
download-chunk-size-mb: 50
write:
create-empty-file: false # Streaming writes (default)
logging:
severity: info
format: json
gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets
Limitations and Considerations
Filesystem Semantics
gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:
- No atomic rename: Rename operations are copy-then-delete (not atomic)
- No hard links: GCS doesn’t support hard links
- No file locking: flock() is a no-op
- Limited permissions: GCS has simpler ACLs than POSIX permissions
- No sparse files: Writes always materialize full file content
❌ Avoid:
- Serving web content or acting as NAS (concurrent connections)
- Random I/O on many small files (image datasets, text corpora)
- Reading during ML training loops (download first, then train)
- High-concurrency workloads (multiple parallel readers/writers)
✅ Good for:
- Sequential reads of large files (models, checkpoints, kernels)
- Infrequent writes of entire files
- Read-mostly workloads with caching enabled
- Single-writer scenarios
Consistency Trade-offs
With caching enabled:
- Stale reads possible if cache TTL > 0 and external modifications occur
- Safe only for:
- Read-only buckets
- Single-writer, single-mount scenarios
- Workloads tolerant of eventual consistency
Without caching:
- Strong consistency (every read fetches latest from GCS)
- Much slower performance
Resource Requirements
- Disk space: File cache and streaming writes require local storage
- File cache: Size of cached files (can be large for ML datasets)
- Streaming writes: Temporary staging (proportional to concurrent writes)
- Memory: Metadata caches consume RAM
- File handles: Can exceed system limits with high concurrency
- Network bandwidth: All data transfers via GCS API
Installation
On Compute Engine (manual .deb install)
# Download and install the gcsfuse .deb directly (no apt repository configuration needed)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb
On Debian/Ubuntu
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
In Docker/Cloud Run
FROM ubuntu:22.04
# Install gcsfuse
RUN apt-get update && apt-get install -y \
curl \
gnupg \
lsb-release \
&& export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
&& echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
&& apt-get update \
&& apt-get install -y gcsfuse \
&& rm -rf /var/lib/apt/lists/*
# Create mount point
RUN mkdir -p /mnt/boot-assets
# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
/usr/local/bin/boot-server
Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.
Network Boot Infrastructure Evaluation
Applicability to ADR-0005
Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:
❌ Cloud Run Incompatibility
- gcsfuse requires FUSE kernel module and privileged containers
- Cloud Run does not support FUSE or privileged mode
- ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
- Impact: Blocks Cloud Run deployment, forcing Compute Engine VM
❌ Boot Latency Requirements
- Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
- gcsfuse adds 10-50ms+ latency per operation (network round-trips)
- Kernel/initrd downloads are latency-sensitive (network boot timeout)
- Impact: May exceed boot timeout thresholds
❌ No Caching for Read-Write Workloads
- Boot server needs to write new assets and read existing ones
- File cache with unlimited TTL requires read-only or single-writer assumption
- Multiple boot server instances (autoscaling) violate single-writer constraint
- Impact: Either accept stale reads or disable caching (slow)
❌ Small-File Performance
- Machine mapping configs, boot scripts, profiles are small files (KB range)
- gcsfuse performs poorly on small, random I/O
- ls operations on directories with many profiles can be slow
- Impact: Slow boot configuration lookups
✅ Alternative: Direct Cloud Storage SDK
Using cloud.google.com/go/storage SDK directly offers:
- Lower latency: Direct API calls without FUSE overhead
- Cloud Run compatible: No kernel module or privileged mode required
- Better control: Explicit caching, parallel downloads, streaming
- Simpler deployment: No mount management, no FUSE dependencies
- Cost: Similar API call costs to gcsfuse
Recommended approach (from ADR-0005):
// Custom boot server handler using the Cloud Storage SDK (cloud.google.com/go/storage)
client, err := storage.NewClient(ctx)
if err != nil {
	http.Error(w, "storage unavailable", http.StatusInternalServerError)
	return
}
bucket := client.Bucket("boot-assets")
// Stream kernel to boot client
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
	http.Error(w, "boot asset not found", http.StatusNotFound)
	return
}
defer reader.Close()
io.Copy(w, reader) // Stream to HTTP response
When gcsfuse MIGHT Be Useful
Despite the above limitations, gcsfuse could be considered for:
Matchbox on Compute Engine:
- Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
- Compute Engine VM supports FUSE
- Read-heavy workload (boot assets rarely change)
- Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
Development/Testing:
- Quick prototyping without writing Cloud Storage integration
- Local development with production bucket access
- Not recommended for production deployment
Low-Throughput Scenarios:
- Home lab scale (< 10 boots/hour)
- File cache enabled with Local SSD
- Single Compute Engine VM (not autoscaled)
Configuration for Matchbox + gcsfuse:
#!/bin/bash
# Mount boot assets for Matchbox
BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"
mkdir -p "$MOUNT_POINT" "$CACHE_DIR"
gcsfuse \
--stat-cache-max-size-mb=-1 \
--type-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
--file-cache-max-size-mb=-1 \
--cache-dir="$CACHE_DIR" \
--file-cache-cache-file-for-range-read=true \
--file-cache-enable-parallel-downloads=true \
--implicit-dirs \
--foreground \
"$BUCKET" "$MOUNT_POINT"
Monitoring and Troubleshooting
Metrics
gcsfuse exposes Prometheus metrics:
gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point
Key metrics:
- gcs_read_count: Number of GCS read operations
- gcs_write_count: Number of GCS write operations
- gcs_read_bytes: Bytes read from GCS
- gcs_write_bytes: Bytes written to GCS
- fs_ops_count: Filesystem operations by type (open, read, write, etc.)
- fs_ops_error_count: Filesystem operation errors
Logging
# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point
Common Issues
Issue: ls on large directories takes minutes
Solution:
- Enable list caching with --metadata-cache-ttl-secs=-1
- Reduce directory depth (flatten object hierarchy)
- Consider prefix-based filtering instead of full listings
Issue: Stale reads after external bucket modifications
Solution:
- Reduce --metadata-cache-ttl-secs (default 60s)
- Disable caching entirely for strong consistency
- Use versioned object names (immutable assets)
Issue: Transport endpoint is not connected errors
Solution:
- Unmount cleanly before remounting: fusermount -u /mnt/point
- Check GCS bucket permissions (IAM roles)
- Verify network connectivity to storage.googleapis.com
Issue: High memory usage
Solution:
- Limit metadata cache sizes: --stat-cache-max-size-mb=1024
- Disable file cache if not needed
- Monitor with --prometheus metrics
Comparison to Alternatives
gcsfuse vs Direct Cloud Storage SDK
| Aspect | gcsfuse | Cloud Storage SDK |
|---|---|---|
| Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API) |
| Cloud Run | ❌ Not supported | ✅ Fully supported |
| Development Effort | Low (standard filesystem code) | Medium (SDK integration) |
| Performance | Slower (filesystem abstraction) | Faster (optimized for use case) |
| Caching | Built-in (stat, type, list, file) | Manual (application-level) |
| Streaming | Automatic | Explicit (io.Copy) |
| Dependencies | FUSE kernel module, privileged mode | None (pure Go library) |
Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.
gcsfuse vs rsync/gsutil Sync
Periodic sync pattern:
# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets
| Aspect | gcsfuse | rsync/gsutil sync |
|---|---|---|
| Consistency | Eventual (with caching) | Strong (within sync interval) |
| Disk Usage | Minimal (file cache optional) | Full copy of assets |
| Latency | GCS API per request | Local disk (fast) |
| Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes) |
| Deployment | Requires FUSE | Simple cron job |
Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.
Conclusion
Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:
- Cloud Run incompatibility (requires FUSE kernel module)
- Added latency (FUSE overhead + network round-trips)
- Poor performance for small files and concurrent access
- Caching trade-offs (consistency vs performance)
Recommended alternatives:
- Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
- Matchbox on Compute Engine: rsync/gsutil sync to local disk
- Cloud Run Deployment: Direct SDK (no gcsfuse possible)
gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.
References
3.2 - GCP Network Boot Protocol Support
Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure
This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.
TFTP (Trivial File Transfer Protocol) Support
Native Support
Status: ❌ Not natively supported by Cloud Load Balancing
GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.
Implementation Options
Option 1: Direct VM Access (Recommended for VPN Scenario)
Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing:
- Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
- Access: Home lab connects via VPN tunnel to the VM’s private IP
- Routing: VPC firewall rules allow UDP/69 from VPN subnet
- Pros:
- Simple implementation
- No need for load balancing (single boot server sufficient)
- TFTP traffic encrypted through VPN tunnel
- Direct VM-to-client communication
- Cons:
- Single point of failure (no load balancing/HA)
- Manual failover required if VM fails
Option 2: Network Load Balancer (NLB) Passthrough
While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:
- Approach: Configure Network Load Balancer for UDP/69 passthrough
- Limitations:
- No protocol-aware health checks for TFTP
- Health checks would use TCP or HTTP on alternate port
- Adds complexity without significant benefit for single boot server
- Use Case: Only relevant for multi-region HA deployment (overkill for home lab)
TFTP Security Considerations
- Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
- Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
- File Access Control: Configure TFTP server with restricted file access
- Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
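A sketch of a hardened, TFTP-only dnsmasq configuration on the VM (paths are examples; DNS is disabled and DHCP stays on the home lab side):
sudo tee /etc/dnsmasq.d/tftp.conf > /dev/null <<'EOF'
# TFTP only: disable the DNS function (port=0); no DHCP configured here
port=0
enable-tftp
tftp-root=/var/lib/tftpboot
# Only serve files owned by the dnsmasq user (uploads are not supported)
tftp-secure
EOF
sudo systemctl restart dnsmasq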
HTTP Support
Native Support
Status: ✅ Fully supported
GCP provides comprehensive HTTP support through multiple services:
Cloud Load Balancing - Application Load Balancer
- Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
- Port: Any port (typically 80 for HTTP)
- Routing: URL-based routing, host-based routing, path-based routing
- Health Checks: HTTP health checks with configurable paths
- SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
- Backend: Compute Engine VMs, instance groups, Cloud Run, GKE
Compute Engine Direct Access
For VPN scenario, HTTP can be served directly from VM:
- Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
- Access: Home lab accesses via VPN tunnel to private IP
- Firewall: VPC firewall rules allow TCP/80 from VPN subnet
- Pros: Simpler than load balancer for single boot server
HTTP Boot Flow for Network Boot
- PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
- iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
- Kernel/Initrd: Large boot files served efficiently over HTTP
- Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
- Compression: gzip compression for text-based boot configs
- Caching: Cloud CDN can cache boot files for faster delivery
- TCP Optimization: GCP’s network optimized for low-latency TCP
HTTPS Support
Native Support
Status: ✅ Fully supported with advanced features
GCP provides enterprise-grade HTTPS support:
Cloud Load Balancing - Application Load Balancer
- Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
- SSL/TLS Termination: Terminate SSL at load balancer
- Certificate Management:
- Google-managed SSL certificates (automatic renewal)
- Self-managed certificates (bring your own)
- Certificate Map for multiple domains
- TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
- Cipher Suites: Modern, compatible, or custom cipher suites
- mTLS Support: Mutual TLS authentication (client certificates)
Certificate Manager
- Managed Certificates: Automatic provisioning and renewal via Let’s Encrypt integration
- Private CA: Integration with Google Cloud Certificate Authority Service
- Certificate Maps: Route different domains to different backends based on SNI
- Certificate Monitoring: Automatic alerts before expiration
HTTPS for Network Boot
Use Case
Modern UEFI firmware and iPXE support HTTPS boot:
- iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
- UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (UEFI 2.5+)
- Security: Boot file integrity verified via HTTPS chain of trust
Implementation on GCP
Certificate Provisioning:
- Use Google-managed certificate for public domain (if boot server has public DNS)
- Use self-signed certificate for VPN-only access (add to iPXE trust store)
- Use private CA for internal PKI
Load Balancer Configuration:
- HTTPS frontend (port 443)
- Backend service to Compute Engine VM running boot server
- SSL policy with TLS 1.2+ minimum
Alternative: Direct VM HTTPS:
- Run nginx/Apache with TLS on Compute Engine VM
- Access via VPN tunnel to private IP with HTTPS
- Simpler setup for VPN-only scenario
mTLS Support for Enhanced Security
GCP’s Application Load Balancer supports mutual TLS authentication:
- Client Certificates: Require client certificates for additional authentication
- Certificate Validation: Validate client certificates against trusted CA
- Use Case: Ensure only authorized home lab servers can access boot files
- Integration: Combine with VPN for defense-in-depth
Routing and Load Balancing Capabilities
VPC Routing
- Custom Routes: Define routes to direct traffic through VPN gateway
- Route Priority: Configure route priorities for failover scenarios
- BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)
Firewall Rules
- Ingress/Egress Rules: Fine-grained control over traffic
- Source/Destination Filters: IP ranges, tags, service accounts
- Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
- VPN Subnet Restriction: Limit access to VPN-connected home lab subnet
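For example, a single firewall rule can restrict all three boot protocols to the VPN-connected home lab subnet (network name, target tag, and CIDR are placeholders):
gcloud compute firewall-rules create allow-boot-from-homelab \
  --direction=INGRESS \
  --network=default \
  --action=ALLOW \
  --rules=udp:69,tcp:80,tcp:443 \
  --source-ranges=192.168.1.0/24 \
  --target-tags=boot-server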
Cloud Armor (Optional)
For additional security if boot server has public access:
- DDoS Protection: Layer 3/4 DDoS mitigation
- WAF Rules: Application-level filtering
- IP Allowlisting: Restrict to known public IPs
- Rate Limiting: Prevent abuse
Cost Implications
Network Egress Costs
- VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
- Intra-Region: Free for traffic within same region
- Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
- Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)
Load Balancing Costs
- Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
- Network Load Balancer: ~$0.025/hour + data processing charges
- For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)
Compute Costs
- e2-micro Instance: ~$6-7/month (suitable for boot server)
- f1-micro Instance: ~$4-5/month (even smaller, might suffice)
- Reserved/Committed Use: Discounts for long-term commitment
Comparison with Requirements
| Requirement | GCP Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN |
| HTTP | ✅ Full support | VM or ALB |
| HTTPS | ✅ Full support | VM or ALB with Certificate Manager |
| VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ Managed certs | Certificate Manager |
| Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient |
Recommendations
For VPN-Based Architecture (per ADR-0002)
Compute Engine VM: Deploy single e2-micro VM with:
- TFTP server (tftpd-hpa or dnsmasq)
- HTTP server (nginx or simple Python HTTP server)
- Optional HTTPS with self-signed certificate
VPN Tunnel: Connect home lab to GCP via:
- Cloud VPN (IPsec) - easier setup, higher cost
- Self-managed WireGuard on Compute Engine - lower cost, more control
VPC Firewall: Restrict access to:
- UDP/69 (TFTP) from VPN subnet only
- TCP/80 (HTTP) from VPN subnet only
- TCP/443 (HTTPS) from VPN subnet only
No Load Balancer: For home lab scale, direct VM access is sufficient
Health Monitoring: Use Cloud Monitoring for VM and service health
If HA Required (Future Enhancement)
- Deploy multi-zone VMs with Network Load Balancer
- Use Cloud Storage as backend for boot files with VM serving as cache
- Implement failover automation with Cloud Functions
References
3.3 - GCP WireGuard VPN Support
Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity
This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.
WireGuard Overview
WireGuard is a modern VPN protocol that provides:
- Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
- Performance: High throughput with low overhead
- Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
- Configuration: Simple key-based configuration
- Kernel Integration: Mainline Linux kernel support since 5.6
GCP Native VPN Support
Cloud VPN (IPsec)
Status: ❌ WireGuard not natively supported
GCP’s managed Cloud VPN service supports:
- IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
- HA VPN: Highly available VPN with 99.99% SLA
- Classic VPN: Single-tunnel VPN (deprecated)
Limitation: Cloud VPN does not support WireGuard protocol natively.
Cost: Cloud VPN
- HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
- Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
- Total Estimate: ~$75-100/month for managed VPN
Self-Managed WireGuard on Compute Engine
Implementation Approach
Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:
Status: ✅ Fully supported via Compute Engine
Architecture
graph LR
A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
B -->|Private VPC Network| C[Boot Server VM]
B -->|IP Forwarding| C
subgraph "Home Network"
A
D[UDM Pro]
D -.WireGuard Client.- A
end
subgraph "GCP VPC"
B[WireGuard Gateway VM]
C[Boot Server VM]
end
VM Configuration
WireGuard Gateway VM:
- Instance Type: e2-micro or f1-micro ($4-7/month)
- OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
- IP Forwarding: Enable IP forwarding to route traffic to other VMs
- External IP: Static external IP for stable WireGuard endpoint
- Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
Boot Server VM:
- Network: Same VPC as WireGuard gateway
- Private IP Only: No external IP (accessed via VPN)
- Route Traffic: Through WireGuard gateway VM
Installation Steps
# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools
# Generate server keys
wg genkey | sudo tee /etc/wireguard/server_private.key | wg pubkey | sudo tee /etc/wireguard/server_public.key > /dev/null
sudo chmod 600 /etc/wireguard/server_private.key
# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf
Example /etc/wireguard/wg0.conf on GCP VM:
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE
[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24
Corresponding config on UDM Pro:
[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>
[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25
Enable and Start WireGuard
# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0
# Verify status
sudo wg show
GCP VPC Configuration
Firewall Rules
Create VPC firewall rule to allow WireGuard:
gcloud compute firewall-rules create allow-wireguard \
--direction=INGRESS \
--priority=1000 \
--network=default \
--action=ALLOW \
--rules=udp:51820 \
--source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
--target-tags=wireguard-gateway
Tag the WireGuard VM:
gcloud compute instances add-tags wireguard-gateway-vm \
--tags=wireguard-gateway \
--zone=us-central1-a
Static External IP
Reserve static IP for stable WireGuard endpoint:
gcloud compute addresses create wireguard-gateway-ip \
--region=us-central1
gcloud compute instances delete-access-config wireguard-gateway-vm \
--access-config-name="external-nat" \
--zone=us-central1-a
gcloud compute instances add-access-config wireguard-gateway-vm \
--access-config-name="external-nat" \
--address=wireguard-gateway-ip \
--zone=us-central1-a
Cost: A reserved static external IP costs roughly $3-4/month while attached to a running VM, and is billed at a higher rate if reserved but unused.
Route Configuration
For traffic from boot server to reach home lab via WireGuard VM:
gcloud compute routes create route-to-homelab \
--network=default \
--priority=100 \
--destination-range=192.168.1.0/24 \
--next-hop-instance=wireguard-gateway-vm \
--next-hop-instance-zone=us-central1-a
This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.
UDM Pro WireGuard Integration
Native Support
Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)
The UniFi Dream Machine Pro includes native WireGuard VPN support:
- GUI Configuration: Web UI for WireGuard VPN setup
- Site-to-Site: Support for site-to-site VPN tunnels
- Performance: Hardware acceleration for encryption (if available)
- Routing: Automatic route injection for remote subnets
Configuration Steps on UDM Pro
Network Settings → VPN:
- Create new VPN connection
- Select “WireGuard”
- Generate key pair or import existing
Peer Configuration:
- Peer Public Key: GCP WireGuard VM’s public key
- Endpoint: GCP VM’s static external IP
- Port: 51820
- Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
- Persistent Keepalive: 25 seconds
Route Injection:
- UDM Pro automatically adds routes to GCP subnets
- Home lab servers can reach GCP boot server via VPN
Firewall Rules:
- Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN
Alternative: Manual WireGuard on UDM Pro
If native support is insufficient, use wireguard-go via udm-utilities:
- Repository: boostchicken/udm-utilities
- Script: on_boot.d script to start WireGuard
- Persistence: Survives firmware updates with on-boot script
Throughput
WireGuard on Compute Engine performance:
- e2-micro (2 vCPU, shared core): ~100-300 Mbps
- e2-small (2 vCPU): ~500-800 Mbps
- e2-medium (2 vCPU): ~1+ Gbps
For network boot (typical boot = 50-200MB), even e2-micro is sufficient:
- Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
- Recommendation: e2-micro adequate for home lab scale
Latency
- VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
- GCP Network: Low-latency network to most regions
- Total Latency: Primarily dependent on home ISP and GCP region proximity
CPU Usage
- Encryption: ChaCha20 is CPU-efficient
- Kernel Module: Minimal CPU overhead in kernel space
- e2-micro: Sufficient CPU for home lab VPN throughput
Security Considerations
Key Management
- Private Keys: Store securely, never commit to version control
- Key Rotation: Rotate keys periodically (e.g., annually)
- Secret Manager: Store WireGuard private keys in GCP Secret Manager
- Retrieve at VM startup via startup script
- Avoid storing in VM metadata or disk images
Firewall Hardening
- Source IP Restriction: Limit WireGuard port to home lab public IP only
- Least Privilege: Boot server firewall allows only VPN subnet
- No Public Access: Boot server has no external IP
Monitoring and Alerts
- Cloud Logging: Log WireGuard connection events
- Cloud Monitoring: Alert on VPN tunnel down
- Metrics: Monitor handshake failures, data transfer
DDoS Protection
- UDP Amplification: WireGuard resistant to DDoS amplification
- Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)
High Availability Options
Multi-Region Failover
Deploy WireGuard gateways in multiple regions:
- Primary: us-central1 WireGuard VM
- Secondary: us-east1 WireGuard VM
- Failover: UDM Pro switches endpoints if primary fails
- Cost: Doubles VM costs (~$8-14/month for 2 VMs)
Health Checks
Monitor WireGuard tunnel health:
# On UDM Pro (via SSH)
wg show wg0 latest-handshakes
# If handshake timestamp old (>3 minutes), tunnel may be down
Automate failover with script on UDM Pro or external monitoring.
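A rough sketch of such a watchdog (peer public key and secondary endpoint are placeholders; whether it runs on the UDM Pro or an external host is left open):
#!/bin/bash
# Switch to the secondary WireGuard endpoint if the handshake goes stale
PEER="<SERVER_PUBLIC_KEY>"
SECONDARY="<SECONDARY_VM_IP>:51820"
LAST=$(wg show wg0 latest-handshakes | awk '{print $2}')
AGE=$(( $(date +%s) - LAST ))
if [ "$AGE" -gt 180 ]; then
  wg set wg0 peer "$PEER" endpoint "$SECONDARY"
fi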
Startup Scripts for Auto-Healing
GCP VM startup script to ensure WireGuard starts on boot:
#!/bin/bash
# /etc/startup-script.sh
# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key
# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
Attach as metadata:
gcloud compute instances add-metadata wireguard-gateway-vm \
--metadata-from-file startup-script=/path/to/startup-script.sh \
--zone=us-central1-a
Cost Analysis
Self-Managed WireGuard on Compute Engine
| Component | Cost |
|---|---|
| e2-micro VM (730 hrs/month) | ~$6.50 |
| Static External IP | ~$3.50 |
| Egress (1GB/month boot traffic) | ~$0.12 |
| Monthly Total | ~$10.12 |
| Annual Total | ~$121 |
Cloud VPN (IPsec - if WireGuard not used)
| Component | Cost |
|---|---|
| HA VPN Gateway (2 tunnels) | ~$73 |
| Egress (1GB/month) | ~$0.12 |
| Monthly Total | ~$73 |
| Annual Total | ~$876 |
Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.
Comparison with Requirements
| Requirement | GCP Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Low cost | e2-micro ~$10/month |
| Performance | ✅ Sufficient | 100+ Mbps on e2-micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-region VMs |
Recommendations
For Home Lab VPN (per ADR-0002)
Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM
- Cost: ~$10/month (vs ~$73/month for Cloud VPN)
- Performance: Sufficient for network boot traffic
- Simplicity: Easy to configure and maintain
Single Region Deployment: Unless HA required, single VM adequate
- Region Selection: Choose region closest to home lab for lowest latency
- Zone: Single zone sufficient (boot server not mission-critical)
UDM Pro Native WireGuard: Use built-in WireGuard client
- Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
- Route Injection: UDM Pro automatically routes GCP subnets
Security Best Practices:
- Store WireGuard private key in Secret Manager
- Restrict WireGuard port to home public IP only
- Use startup script to configure VM on boot
- Enable Cloud Logging for VPN events
Monitoring: Set up Cloud Monitoring alerts for:
- VM down
- High CPU usage (indicates traffic spike or issue)
- Firewall rule blocks (indicates misconfiguration)
Future Enhancements
- HA Setup: Deploy secondary WireGuard VM in different region
- Automated Failover: Script on UDM Pro to switch endpoints
- IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
- Mesh VPN: Expand to mesh topology if multiple sites added
References
4 - HP ProLiant DL360 Gen9 Analysis
Technical analysis of HP ProLiant DL360 Gen9 server capabilities with focus on network boot support
This section contains detailed analysis of the HP ProLiant DL360 Gen9 server platform, including hardware specifications, network boot capabilities, and configuration guidance for home lab deployments.
Overview
The HP ProLiant DL360 Gen9 is a 1U rack-mountable server released by HPE as part of their Generation 9 (Gen9) product line, introduced in 2014. It’s a popular choice for home labs due to its balance of performance, density, and relative power efficiency compared to earlier generations.
Key Features
- Form Factor: 1U rack-mountable
- Processor Support: Dual Intel Xeon E5-2600 v3/v4 processors (Haswell/Broadwell)
- Memory: Up to 768GB DDR4 RAM (24 DIMM slots)
- Storage: Flexible SFF/LFF drive configurations
- Network: Integrated quad-port 1GbE or 10GbE FlexibleLOM options
- Management: iLO 4 (Integrated Lights-Out) with remote KVM and virtual media
- Boot Options: UEFI and Legacy BIOS support with extensive network boot capabilities
Documentation Sections
4.1 - Configuration Guide
Setup, optimization, and configuration recommendations for HP ProLiant DL360 Gen9 in home lab environments
Initial Setup
Hardware Assembly
Install Processors:
- Use thermal paste (HPE thermal grease recommended)
- Align CPU carefully with socket (LGA 2011-3)
- Secure heatsink with proper torque (hand-tighten screws in cross pattern)
- Install both CPUs for dual-socket configuration
Install Memory:
- Populate channels evenly (see Memory Configuration below)
- Seat DIMMs firmly until retention clips engage
- Verify all DIMMs recognized in POST
Install Storage:
- Insert drives into hot-swap caddies
- Label drives clearly for identification
- Configure RAID controller (see Storage Configuration below)
Install Network Cards:
- FlexibleLOM: Slide into dedicated slot until seated
- PCIe cards: Ensure low-profile brackets, secure with screw
- Note MAC addresses for DHCP reservations
Connect Power:
- Install PSUs (both for redundancy)
- Connect power cords
- Verify PSU LEDs indicate proper operation
Initial Power-On:
- Press power button
- Monitor POST on screen or via iLO remote console
- Address any POST errors before proceeding
iLO 4 Initial Configuration
Physical iLO Connection
- Connect Ethernet cable to dedicated iLO port (not FlexibleLOM)
- Default iLO IP: Obtains via DHCP, or use temporary address via RBSU
- Check DHCP server logs for iLO MAC and assigned IP
First Login
- Access iLO web interface: https://<ilo-ip>
- Default credentials:
  - Username: Administrator
  - Password: On label on server pull-out tab (or rear label)
- Immediately change default password (Administration > Access Settings)
Essential iLO Settings
Network Configuration (Administration > Network):
- Set static IP or DHCP reservation
- Configure DNS servers
- Set hostname (e.g., ilo-dl360-01)
- Enable SNTP time sync
Security (Administration > Security):
- Enforce HTTPS only (disable HTTP)
- Configure SSH key authentication if using CLI
- Set strong password policy
- Enable iLO Security features
Access (Administration > Access Settings):
- Configure iLO username/password for automation
- Create additional user accounts (separation of duties)
- Set session timeout (default: 30 minutes)
Date and Time (Administration > Date and Time):
- Set NTP servers for accurate timestamps
- Configure timezone
Licenses (Administration > Licensing):
- Install iLO Advanced license key (required for full virtual media)
- License can be purchased or acquired from secondary market
iLO Firmware Update
Before production use, update iLO to latest version:
- Download latest iLO 4 firmware from HPE Support Portal
- Administration > Firmware > Update Firmware
- Upload .bin file, apply update
- iLO will reboot automatically (system stays running)
System ROM (BIOS/UEFI) Configuration
Accessing RBSU
- Local: Press F9 during POST
- Remote: iLO Remote Console > Power > Momentary Press > Press F9 when prompted
Boot Mode Selection
System Configuration > BIOS/Platform Configuration (RBSU) > Boot Mode:
Recommendation: Use UEFI Mode unless legacy compatibility required
Boot Order Configuration
System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > UEFI Boot Order:
Recommended order for network boot deployment:
- Network Boot: FlexibleLOM or PCIe NIC
- Internal Storage: RAID controller or disk
- Virtual Media: iLO virtual CD/DVD (for installation media)
- USB: For rescue/recovery
Enable Network Boot:
- System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > Network Boot
- Set to “Enabled”
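For one-off installs it can be convenient to request a single network boot without changing the permanent boot order. Assuming IPMI over LAN is enabled in iLO 4, a sketch with ipmitool (address and credentials are placeholders):
# Request PXE on next boot only, then power-cycle the server
ipmitool -I lanplus -H <ilo-ip> -U Administrator -P '<password>' \
  chassis bootdev pxe options=efiboot
ipmitool -I lanplus -H <ilo-ip> -U Administrator -P '<password>' \
  chassis power cycle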
System Configuration > BIOS/Platform Configuration (RBSU) > Power Management:
Power Regulator Mode:
- HP Dynamic Power Savings: Balanced power/performance (recommended for home lab)
- HP Static High Performance: Maximum performance, higher power draw
- HP Static Low Power: Minimize power, reduced performance
- OS Control: Let OS manage (e.g., Linux cpufreq)
Collaborative Power Control: Disabled (for standalone servers)
Minimum Processor Idle Power Core C-State: C6 (lower idle power)
Energy/Performance Bias: Balanced Performance (or Maximum Performance for compute workloads)
Recommendation: Start with “Dynamic Power Savings” and adjust based on workload
Memory Configuration
Optimal Population (dual-CPU configuration):
For maximum performance, populate all channels before adding second DIMM per channel:
64GB (8x 8GB):
- CPU1: Slots 1, 4, 7, 10 and CPU2: Slots 1, 4, 7, 10
- Result: 4 channels per CPU, 1 DIMM per channel
128GB (8x 16GB):
- Same as above with 16GB DIMMs
192GB (12x 16GB):
- CPU1: Slots 1, 4, 7, 10, 2, 5 and CPU2: Slots 1, 4, 7, 10, 2, 5
- Result: 4 channels per CPU, some with 2 DIMMs per channel
768GB (24x 32GB):
- CPU1 and CPU2: all 12 slots populated on each
- Result: 4 channels per CPU, 3 DIMMs per channel
Check Configuration: RBSU > System Information > Memory Information
Processor Options
System Configuration > BIOS/Platform Configuration (RBSU) > Processor Options:
Intel Hyperthreading: Enabled (recommended for most workloads)
- Doubles logical cores (e.g., 12-core CPU shows as 24 cores)
- Benefits most virtualization and multi-threaded workloads
- Disable only for specific security compliance (e.g., some cloud providers)
Intel Virtualization Technology (VT-x): Enabled (required for hypervisors)
Intel VT-d (IOMMU): Enabled (required for PCI passthrough, SR-IOV)
Turbo Boost: Enabled (allows CPU to exceed base clock)
Cores Enabled: All (or reduce to lower power/heat if needed)
Integrated Devices
System Configuration > BIOS/Platform Configuration (RBSU) > System Options > Integrated Devices:
- Embedded SATA Controller: Enabled (if using SATA drives)
- Embedded RAID Controller: Enabled (for Smart Array controllers)
- SR-IOV: Enabled (if using virtual network interfaces with VMs)
Network Controller Options
For each NIC (FlexibleLOM, PCIe):
System Configuration > BIOS/Platform Configuration (RBSU) > Network Options > [Adapter]:
- Network Boot: Enabled (for network boot on that NIC)
- PXE/iSCSI: Select PXE for standard network boot
- Link Speed: Auto-Negotiation (recommended) or force 1G/10G
- IPv4: Enabled (for IPv4 PXE boot)
- IPv6: Enabled (if using IPv6 PXE boot)
Boot Order: Configure which NIC boots first if multiple are enabled
Secure Boot Configuration
System Configuration > BIOS/Platform Configuration (RBSU) > Boot Options > Secure Boot:
- Secure Boot: Disabled (for unsigned boot loaders, custom kernels)
- Secure Boot: Enabled (for signed boot loaders, Windows, some Linux distros)
Note: If using PXE with unsigned images (e.g., custom iPXE), Secure Boot must be disabled
Firmware Updates
Update System ROM to latest version:
Via iLO:
- iLO web > Administration > Firmware > Update Firmware
- Upload the System ROM .fwpkg or .bin file
- Server reboots automatically to apply
Via Service Pack for ProLiant (SPP):
- Download SPP ISO from HPE Support Portal
- Mount via iLO Virtual Media
- Boot server from SPP ISO
- Smart Update Manager (SUM) runs in Linux environment
- Select components to update (System ROM, iLO, controller firmware, NIC firmware)
- Apply updates, reboot
Recommendation: Use SPP for comprehensive updates on initial setup, then iLO for individual component updates
Storage Configuration
Smart Array Controller Setup
Access Smart Array Configuration
- During POST: Press F5 when “Smart Array Configuration Utility” message appears
- Via RBSU: System Configuration > BIOS/Platform Configuration (RBSU) > System Options > ROM-Based Setup Utility > Smart Array Configuration
Create RAID Arrays
Delete Existing Arrays (if reconfiguring):
- Select controller > Configuration > Delete Array
- Confirm deletion (data loss warning)
Create New Array:
- Select controller > Configuration > Create Array
- Select physical drives to include
- Choose RAID level:
- RAID 0: Striping, no redundancy (maximum performance, maximum capacity)
- RAID 1: Mirroring (redundancy, half capacity, good for boot drives)
- RAID 5: Striping + parity (redundancy, n-1 capacity, balanced)
- RAID 6: Striping + double parity (dual-drive failure tolerance, n-2 capacity)
- RAID 10: Mirror + stripe (high performance + redundancy, half capacity)
- Configure spare drives (hot spares for automatic rebuild)
- Create logical drive
- Set bootable flag if boot drive
Recommended Configurations:
- Boot/OS: 2x SSD in RAID 1 (redundancy, fast boot)
- Data (performance): 4-6x SSD in RAID 10 (fast, redundant)
- Data (capacity): 4-8x HDD in RAID 6 (capacity, dual-drive tolerance)
Controller Settings
Cache Settings:
- Write Cache: Enabled (requires battery/flash-backed cache)
- Read Cache: Enabled
- No-Battery Write Cache: Disabled (data safety) or Enabled (performance, risk)
Rebuild Priority: Medium or High (faster rebuild, may impact performance)
Surface Scan Delay: 3-7 days (periodic integrity check)
HBA Mode (Non-RAID)
For software RAID (ZFS, mdadm, Ceph):
- Access Smart Array Configuration (F5 during POST)
- Controller > Configuration > Enable HBA Mode
- Confirm (RAID arrays will be deleted)
- Reboot
Note: Not all Smart Array controllers support HBA mode. Check compatibility. Alternative: Use separate LSI HBA in PCIe slot.
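If an OS is already running on the host, the controller mode can usually also be inspected and switched with HPE's Smart Storage Administrator CLI. A hedged sketch, assuming the ssacli package is installed and the controller sits in slot 0 (older systems ship the equivalent hpssacli tool):

```bash
# Show the current controller, array, and drive configuration
ssacli ctrl all show config

# Switch the slot 0 controller to HBA (pass-through) mode on controllers/firmware that support it.
# Existing logical drives are destroyed, and a reboot is required before disks appear directly to the OS.
ssacli ctrl slot=0 modify hbamode=on
```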
Network Configuration for Boot
DHCP Server Setup
For PXE/UEFI network boot, configure DHCP server with appropriate options:
ISC DHCP Example (/etc/dhcp/dhcpd.conf):
# Define subnet
subnet 192.168.10.0 netmask 255.255.255.0 {
range 192.168.10.100 192.168.10.200;
option routers 192.168.10.1;
option domain-name-servers 192.168.10.1;
# PXE boot options
next-server 192.168.10.5; # TFTP server IP
# Differentiate UEFI vs BIOS
if exists user-class and option user-class = "iPXE" {
# iPXE boot script
filename "http://boot.example.com/boot.ipxe";
} elsif option arch = 00:07 or option arch = 00:09 {
# UEFI (x86-64)
filename "bootx64.efi";
} else {
# Legacy BIOS
filename "undionly.kpxe";
}
}
# Static reservation for DL360
host dl360-01 {
hardware ethernet xx:xx:xx:xx:xx:xx; # FlexibleLOM MAC
fixed-address 192.168.10.50;
option host-name "dl360-01";
}
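Before rebooting the server to test PXE, it is worth confirming that the configuration parses and that boot requests actually reach the DHCP/TFTP host. A short sketch; the isc-dhcp-server unit name and eth0 interface are assumptions for a typical Linux provisioning host:

```bash
# Validate the dhcpd configuration before reloading the service
dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Restart the DHCP service and follow its log for DHCPDISCOVER/OFFER traffic from the DL360
systemctl restart isc-dhcp-server
journalctl -u isc-dhcp-server -f

# Alternatively, watch DHCP and TFTP packets directly on the boot network interface
tcpdump -ni eth0 'port 67 or port 68 or port 69'
```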
FlexibleLOM Configuration
Configure FlexibleLOM NIC for network boot:
- RBSU > Network Options > FlexibleLOM
- Enable “Network Boot”
- Select PXE or iSCSI
- Configure IPv4/IPv6 as needed
- Set as first boot device in boot order
Multi-NIC Boot Priority
If multiple NICs have network boot enabled:
- RBSU > Network Options > Network Boot Order
- Drag/drop to prioritize NIC boot order
- First NIC in list attempts boot first
Recommendation: Enable network boot on one NIC (typically FlexibleLOM port 1) to avoid confusion
Operating System Installation
Virtual Media Installation (iLO)
- Download OS ISO (e.g., Ubuntu Server, ESXi, Proxmox)
- Upload ISO to HTTP/HTTPS server or local file
- iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
- Browse to ISO location, click “Insert Media”
- Set boot order to prioritize virtual media
- Reboot server, boot from virtual CD/DVD
- Proceed with OS installation
Network Installation (PXE)
See Network Boot Capabilities for detailed PXE/UEFI boot setup
Quick workflow:
- Configure DHCP server with PXE options
- Setup TFTP server with boot files
- Enable network boot in BIOS
- Reboot, server PXE boots
- Select OS installer from PXE menu
- Automated installation proceeds (Kickstart/Preseed/Ignition)
Optimization for Specific Workloads
Virtualization (ESXi, Proxmox, Hyper-V)
BIOS Settings:
- Hyperthreading: Enabled
- VT-x: Enabled
- VT-d: Enabled
- Power Management: Dynamic or OS Control
- Turbo Boost: Enabled
Hardware:
- Maximum memory (384GB+ recommended)
- Fast storage (SSD RAID 10 for VM storage)
- 10GbE networking for VM traffic
Configuration:
- Pass through NICs to VMs (SR-IOV or PCI passthrough)
- Use storage controller in HBA mode for direct disk access to VM storage (ZFS, Ceph)
Kubernetes/Container Hosts
BIOS Settings:
- Hyperthreading: Enabled
- VT-x/VT-d: Enabled (for nested virtualization, kata containers)
- Power Management: Dynamic or High Performance
Hardware:
- 128GB+ RAM for multi-tenant workloads
- Fast local NVMe/SSD for container image cache and ephemeral storage
- 10GbE for pod networking
OS Recommendations:
- Talos Linux: Network-bootable, immutable k8s OS
- Flatcar Container Linux: Auto-updating, minimal OS
- Ubuntu Server: Broad compatibility, snap/docker native
Storage Server (NAS, SAN)
BIOS Settings:
- Disable Hyperthreading (slight performance improvement for ZFS)
- VT-d: Enabled (if passing through HBA to VM)
- Power Management: High Performance
Hardware:
- Maximum drive bays (8-10 SFF)
- HBA mode or separate LSI HBA controller
- 10GbE or bonded 1GbE for network storage traffic
- ECC memory (critical for ZFS)
Software:
- TrueNAS SCALE (Linux-based, k8s apps)
- OpenMediaVault (Debian-based, plugins)
- Ubuntu + ZFS (custom setup)
Compute/HPC Workloads
BIOS Settings:
- Hyperthreading: Depends on workload (test both)
- Turbo Boost: Enabled
- Power Management: Maximum Performance
- C-States: Disabled (reduce latency)
Hardware:
- High core count CPUs (E5-2680 v4, 2690 v4)
- Maximum memory bandwidth (populate all channels)
- Fast local scratch storage (NVMe)
Monitoring and Maintenance
iLO Health Monitoring
Information > System Information:
- CPU temperature and status
- Memory status
- Drive status (via controller)
- Fan speeds
- PSU status
- Overall system health LED status
Alerting (Administration > Alerting):
- Configure email alerts for:
- Fan failures
- Temperature warnings
- Drive failures
- Memory errors
- PSU failures
- Set up SNMP traps for integration with monitoring systems (Nagios, Zabbix, Prometheus)
Integrated Management Log (IML)
Information > Integrated Management Log:
- View hardware events and errors
- Filter by severity (Informational, Caution, Critical)
- Export log for troubleshooting
Regular Checks:
- Review IML weekly for early warning signs
- Address caution-level events before they become critical
Firmware Update Cadence
Recommendation:
- iLO: Update quarterly or when security advisories released
- System ROM: Update annually or for bug fixes
- Storage Controller: Update when issues arise or annually
- NIC Firmware: Update when issues arise
Method: Use SPP for annual comprehensive updates, iLO web interface for individual component updates
Physical Maintenance
Monthly:
- Check fan noise (increased noise may indicate clogged air filters or failing fan)
- Verify PSU and drive LEDs (no amber lights)
- Check iLO for alerts
Quarterly:
- Clean air filters (if accessible, depends on rack airflow)
- Verify backup of iLO configuration
- Test iLO Virtual Media functionality
Annually:
- Update all firmware via SPP
- Verify RAID battery/flash-backed cache status
- Review and update BIOS settings as workload evolves
Troubleshooting Common Issues
Server Won’t Power On
- Check PSU power cords connected
- Verify PSU LEDs indicate power
- Press iLO power button via web interface
- Check iLO IML for power-related errors
- Reseat PSUs, check for blown fuses
POST Errors
Memory Errors:
- Reseat memory DIMMs
- Test with minimal configuration (1 DIMM per CPU)
- Replace failing DIMMs identified in POST
CPU Errors:
- Verify heatsink properly seated
- Check thermal paste application
- Reseat CPU (careful with pins)
Drive Errors:
- Check drive connection to caddy
- Verify controller recognizes drive
- Replace failing drive
No Network Boot
See Network Boot Troubleshooting for detailed diagnostics
Quick checks:
- Verify NIC link light
- Confirm network boot enabled in BIOS
- Check DHCP server logs for PXE request
- Test TFTP server accessibility
iLO Not Accessible
- Check physical Ethernet connection to iLO port
- Verify switch port active
- Reset iLO: Press and hold iLO NMI button (rear) for 5 seconds
- Factory reset iLO via jumper (see maintenance guide)
- Check iLO firmware version, update if outdated
High Fan Noise
- Check ambient temperature (<25°C recommended)
- Verify airflow not blocked (front/rear clearance)
- Clean dust from intake (compressed air)
- Check iLO temperature sensors for elevated temps
- Lower CPU TDP if temperatures excessive (lower power CPUs)
- Verify all fans operational (replace failed fans)
Security Hardening
iLO Security
- Change Default Credentials: Immediately on first boot
- Disable Unused Services: SSH, IPMI if not needed
- Use HTTPS Only: Disable HTTP (Administration > Network > HTTP Port)
- Network Isolation: Dedicated management VLAN, firewall iLO access
- Update Firmware: Apply security patches promptly
- Account Management: Use separate accounts, least privilege
BIOS/UEFI Security
- BIOS Password: Set administrator password (RBSU > System Options > BIOS Admin Password)
- Secure Boot: Enable if using signed boot loaders
- Boot Order Lock: Prevent unauthorized boot device changes
- TPM: Enable if using BitLocker or LUKS disk encryption
Operating System Security
- Minimal Installation: Install only required packages
- Firewall: Enable host firewall (iptables, firewalld, ufw)
- SSH Hardening: Key-based auth, disable password auth, non-standard port (see the sketch after this list)
- Automatic Updates: Enable for security patches
- Monitoring: Deploy intrusion detection (fail2ban, OSSEC)
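As a concrete example of the SSH hardening item above, a minimal drop-in sketch (assumes an OpenSSH build that includes /etc/ssh/sshd_config.d/*.conf, as recent Ubuntu and Fedora releases do; the service unit may be named ssh or sshd depending on the distribution):

```bash
# Disable password logins and restrict root access via a drop-in file
cat > /etc/ssh/sshd_config.d/90-hardening.conf <<'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
EOF

# Validate the resulting configuration, then reload the daemon (unit name may be "ssh" on Debian/Ubuntu)
sshd -t && systemctl reload sshd
```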
Conclusion
Proper configuration of the HP ProLiant DL360 Gen9 ensures optimal performance, reliability, and manageability for home lab and production deployments. The combination of UEFI boot capabilities, iLO remote management, and flexible hardware configuration makes the DL360 Gen9 a versatile platform for virtualization, containerization, storage, and compute workloads.
Key takeaways:
- Update firmware early (iLO, System ROM, controllers)
- Configure iLO for remote management and monitoring
- Choose boot mode (UEFI recommended) and configure network boot appropriately
- Optimize BIOS settings for specific workload (virtualization, storage, compute)
- Implement security hardening (iLO, BIOS, OS)
- Establish monitoring and maintenance schedule
For network boot-specific configuration, refer to the Network Boot Capabilities guide.
4.2 - Hardware Specifications
Detailed hardware specifications and configuration options for HP ProLiant DL360 Gen9
System Overview
The HP ProLiant DL360 Gen9 is a dual-socket 1U rack server designed for data center and enterprise deployments, also popular in home lab environments due to its performance and manageability.
Generation: Gen9 (2014-2017 product cycle)
Form Factor: 1U rack-mountable (19-inch standard rack)
Dimensions: 43.46 x 67.31 x 4.29 cm (17.1 x 26.5 x 1.69 in)
Processor Support
Supported CPU Families
The DL360 Gen9 supports Intel Xeon E5-2600 v3 and v4 series processors:
Popular CPU Options
Value: E5-2620 v3/v4 (6-8 cores, 15-20MB cache, 85W)
Balanced: E5-2650 v3/v4 (10-12 cores, 25-30MB cache, 105W)
Performance: E5-2680 v3/v4 (12-14 cores, 30-35MB cache, 120W)
High Core Count: E5-2699 v4 (22 cores, 55MB cache, 145W)
Configuration Options
- Single Processor: One CPU socket populated (budget option)
- Dual Processor: Both sockets populated (full performance)
Note: Memory and I/O performance scales with processor count. Single-CPU configuration limits memory channels and PCIe lanes.
Memory Architecture
Memory Specifications
- Type: DDR4 RDIMM or LRDIMM
- Speed: DDR4-2133 (v3) or DDR4-2400 (v4)
- Slots: 24 DIMM slots (12 per processor)
- Maximum Capacity:
- 768GB with 32GB RDIMMs
- 1.5TB with 64GB LRDIMMs (v4 processors)
- Minimum: 8GB (1x 8GB DIMM)
Memory Configuration Rules
- Channels per CPU: 4 channels, 3 DIMMs per channel
- Population: Populate channels evenly for optimal bandwidth
- Mixing: Do not mix RDIMM and LRDIMM types
- Speed: All DIMMs run at speed of slowest DIMM
Recommended Configurations
Basic Home Lab (Single CPU):
- 4x 16GB = 64GB (one DIMM per channel on the populated CPU)
Standard (Dual CPU):
- 8x 16GB = 128GB (one DIMM per channel)
- 12x 16GB = 192GB (two DIMMs per channel on primary channels)
High Capacity (Dual CPU):
- 24x 32GB = 768GB (all slots populated, RDIMM)
Performance Priority: Populate all channels before adding second DIMM per channel
Storage Options
Drive Bay Configurations
The DL360 Gen9 offers multiple drive bay configurations:
- 8 SFF (2.5-inch): Most common configuration
- 10 SFF: Extended bay version
- 4 LFF (3.5-inch): Less common in 1U form factor
Drive Types Supported
- SAS: 12Gb/s, 6Gb/s (enterprise-grade)
- SATA: 6Gb/s, 3Gb/s (value option)
- SSD: SAS/SATA SSD, NVMe (with appropriate controller)
Storage Controllers
Smart Array Controllers (HPE proprietary RAID):
- P440ar: Entry-level, 2GB FBWC (Flash-Backed Write Cache), RAID 0/1/5/6/10
- P840ar: High-performance, 4GB FBWC, RAID 0/1/5/6/10/50/60
- P440: PCIe card version, 2GB FBWC
- P840: PCIe card version, 4GB FBWC
HBA Mode (non-RAID pass-through):
- Smart Array controllers in HBA mode for software RAID (ZFS, mdadm)
- Limited support; check firmware version
Alternative Controllers:
- LSI/Broadcom HBA controllers in PCIe slots
- H240ar (12Gb/s HBA mode)
Boot Drive Options
For network-focused deployments:
- Minimal Local Storage: 2x SSD in RAID 1 for hypervisor/OS
- USB/SD Boot: iLO supports USB boot, SD card (internal USB)
- Diskless: Pure network boot (subject of network-boot.md)
Network Connectivity
Integrated FlexibleLOM
The DL360 Gen9 includes a FlexibleLOM slot for swappable network adapters:
Common FlexibleLOM Options:
HPE 366FLR: 4x 1GbE (Broadcom BCM5719)
- Most common, good for general use
- Supports PXE, UEFI network boot, SR-IOV
HPE 560FLR-SFP+: 2x 10GbE SFP+ (Intel X710)
- High performance, fiber or DAC
- Supports PXE, UEFI boot, SR-IOV, RDMA (RoCE)
HPE 361i: 2x 1GbE (Intel I350)
- Entry-level, good driver support
PCIe Expansion Slots
Slot Configuration:
- Slot 1: PCIe 3.0 x16 (low-profile)
- Slot 2: PCIe 3.0 x8 (low-profile)
- Slot 3: PCIe 3.0 x8 (low-profile) - optional, depends on riser
Network Card Options:
- Intel X520/X710 (10GbE)
- Mellanox ConnectX-3/ConnectX-4 (10/25/40GbE, InfiniBand)
- Broadcom NetXtreme (1/10/25GbE)
Note: Ensure cards are low-profile for 1U chassis compatibility
Power Supply
PSU Options
- 500W: Single PSU, non-redundant (not recommended)
- 800W: Common, supports dual CPU + moderate expansion
- 1400W: High-power, dual CPU with high TDP + GPUs
- Redundancy: 1+1 redundant hot-plug recommended
Power Configuration
- Platinum Efficiency: 94%+ at 50% load
- Hot-Plug: Replace without powering down
- Auto-Switching: 100-240V AC, 50/60Hz
Home Lab Power Draw (typical):
- Idle (dual E5-2650 v3, 128GB RAM): 100-130W
- Load: 200-350W depending on CPU and drive configuration
Power Management
- HPE Dynamic Power Capping: Limit max power via iLO
- Collaborative Power: Share power budget across chassis in blade environments
- Energy Efficient Ethernet (EEE): Reduce NIC power during low utilization
Cooling and Acoustics
Fan Configuration
- 6x Hot-Plug Fans: Front-mounted, redundant (N+1)
- Variable Speed: Controlled by System ROM based on thermal sensors
- iLO Management: Monitor fan speed, temperature via iLO
Thermal Management
- Temperature Range: 10-35°C (50-95°F) operating
- Altitude: Up to 3,050m (10,000 ft) at reduced temperature
- Airflow: Front-to-back, ensure clear intake and exhaust
Noise Level
- Idle: ~45 dBA (quiet for 1U server)
- Load: 55-70 dBA depending on thermal demand
- Home Lab Consideration: Audible but acceptable in dedicated space; louder than desktop workstation
Noise Reduction:
- Run lower TDP CPUs (e.g., E5-2620 series)
- Maintain ambient temperature <25°C
- Ensure adequate airflow (not in enclosed cabinet without ventilation)
Management - iLO 4
iLO 4 Features
The Integrated Lights-Out 4 (iLO 4) provides out-of-band management:
- Web Interface: HTTPS management console
- Remote Console: HTML5 or Java-based KVM
- Virtual Media: Mount ISOs/images remotely
- Power Control: Power on/off, reset, cold boot
- Monitoring: Sensors, event logs, hardware health
- Alerting: Email alerts, SNMP traps, syslog
- Scripting: RESTful API (Redfish standard)
iLO Licensing
- iLO Standard (included): Basic management, remote console
- iLO Advanced (license required):
- Virtual media
- Remote console performance improvements
- Directory integration (LDAP/AD)
- Graphical remote console
- iLO Advanced Premium (license required):
- Insight Remote Support
- Federation
- Jitter smoothing
Home Lab: iLO Advanced license highly recommended for virtual media and full remote console features
iLO Network Configuration
- Dedicated iLO Port: Separate 1GbE management port (recommended)
- Shared LOM: Share FlexibleLOM port with OS (not recommended for isolation)
Security: Isolate iLO on dedicated management VLAN, disable if not needed
BIOS and Firmware
System ROM (BIOS/UEFI)
- Firmware Type: UEFI 2.31 or later
- Boot Modes: UEFI, Legacy BIOS, or hybrid
- Configuration: RBSU (ROM-Based Setup Utility) accessible via F9
Firmware Update Methods
- Service Pack for ProLiant (SPP): Comprehensive bundle of all firmware
- iLO Online Flash: Update via web interface
- Online ROM Flash: Linux utility for online updates
- USB Flash: Boot from USB with firmware update utility
Recommended Practice: Update to latest SPP for security patches and feature improvements
Secure Boot
- UEFI Secure Boot: Supported, validates boot loader signatures
- TPM: Optional Trusted Platform Module 1.2 or 2.0
- Boot Order Protection: Prevent unauthorized boot device changes
Expansion and Modularity
GPU Support
Limited GPU support due to 1U form factor and power constraints:
- Low-Profile GPUs: Nvidia T4, AMD Instinct MI25 (may require custom cooling)
- Power: Consider 1400W PSU for high-power GPUs
- Not Ideal: For GPU-heavy workloads, consider 2U+ servers (e.g., DL380 Gen9)
USB Ports
- Front: 1x USB 3.0
- Rear: 2x USB 3.0
- Internal: 1x USB 2.0 (for SD/USB boot device)
Serial Port
- Rear serial port for legacy console access
- Useful for network equipment serial console, debug
Home Lab Considerations
Pros for Home Lab
- Density: 1U form factor saves rack space
- iLO Management: Enterprise remote management without KVM
- Network Boot: Excellent PXE/UEFI boot support (see network-boot.md)
- Serviceability: Hot-swap drives, PSU, fans
- Documentation: Extensive HPE documentation and community support
- Parts Availability: Common on secondary market, affordable
Cons for Home Lab
- Noise: Louder than tower servers or workstations
- Power: Higher idle power than consumer hardware (100-130W idle)
- 1U Limitations: Limited GPU, PCIe expansion vs 2U/4U chassis
- Firmware: Requires HPE account for SPP downloads (free but registration required)
Recommended Home Lab Configuration
Budget (~$500-800 used):
- Dual E5-2620 v3 or v4 (6 cores each, 85W TDP)
- 128GB RAM (8x 16GB DDR4)
- 2x SSD (boot), 4-6x HDD/SSD (data)
- HPE 366FLR (4x 1GbE)
- Dual 500W or 800W PSU (redundant)
- iLO Advanced license
Performance (~$1000-1500 used):
- Dual E5-2680 v4 (14 cores each, 120W TDP)
- 256GB RAM (16x 16GB DDR4)
- 2x NVMe SSD (boot/cache), 6-8x SSD (data)
- HPE 560FLR-SFP+ (2x 10GbE) + PCIe 4x1GbE card
- Dual 800W PSU
- iLO Advanced license
Comparison with Other Generations
vs Gen8 (Previous)
Gen9 Advantages:
- DDR4 vs DDR3 (lower power, higher capacity)
- Better UEFI support and HTTP boot
- Newer processor architecture (Haswell/Broadwell vs Sandy Bridge/Ivy Bridge)
- iLO 4 vs iLO 3 (better HTML5 console)
Gen8 Advantages:
- Lower cost on secondary market
- Adequate for light workloads
vs Gen10 (Next)
Gen10 Advantages:
- Newer CPUs (Skylake-SP/Cascade Lake)
- More PCIe lanes
- Better UEFI firmware and security features
- DDR4-2666/2933 support
Gen9 Advantages:
- Lower cost (mature product cycle)
- Excellent value for performance/dollar
- Still well-supported by modern OS and firmware
Technical Resources
- QuickSpecs: HPE ProLiant DL360 Gen9 Server QuickSpecs
- User Guide: HPE ProLiant DL360 Gen9 Server User Guide
- Maintenance and Service Guide: Detailed disassembly and part replacement
- Firmware Downloads: HPE Support Portal (requires free account)
Summary
The HP ProLiant DL360 Gen9 remains an excellent choice for home labs and small deployments in 2024-2025. Its balance of performance (dual Xeon v4, 768GB RAM capacity), manageability (iLO 4), and network boot capabilities make it particularly well-suited for virtualization, container hosting, and infrastructure automation workflows. While not the latest generation, it offers strong value with robust firmware support and wide secondary market availability.
Best For:
- Virtualization hosts (ESXi, Proxmox, Hyper-V)
- Kubernetes/container platforms
- Network boot/diskless deployments
- Storage servers (with appropriate controller)
- General compute workloads
Avoid For:
- GPU-intensive workloads (1U constraints)
- Noise-sensitive environments (unless isolated)
- Extreme low-power requirements (100W+ idle)
4.3 - Network Boot Capabilities
Comprehensive analysis of network boot support on HP ProLiant DL360 Gen9
Overview
The HP ProLiant DL360 Gen9 provides robust network boot capabilities through multiple protocols and firmware interfaces. This makes it particularly well-suited for diskless deployments, automated provisioning, and infrastructure-as-code workflows.
Supported Network Boot Protocols
PXE (Preboot Execution Environment)
The DL360 Gen9 fully supports PXE boot via both legacy BIOS and UEFI firmware modes:
iPXE Support
The DL360 Gen9 can boot iPXE, enabling advanced features:
- Chainloading: Boot standard PXE, then chainload iPXE for enhanced capabilities
- HTTP/HTTPS Boot: Download kernels and images over HTTP(S) instead of TFTP
- SAN Boot: iSCSI and AoE (ATA over Ethernet) support
- Scripting: Conditional boot logic and dynamic configuration
- Embedded Scripts: iPXE can be compiled with embedded boot scripts
Implementation Methods:
- Chainload from standard PXE: DHCP points to undionly.kpxe (BIOS) or ipxe.efi (UEFI), which then fetches an iPXE script (see the sketch below)
- Flash iPXE to the FlexibleLOM option ROM (advanced, requires care)
- Boot iPXE from USB, then continue network boot
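Once chainloaded, iPXE fetches a boot script over HTTP and follows it. A minimal sketch of publishing such a script; the web root, hostname, and image names are illustrative assumptions:

```bash
# Publish a simple iPXE script at the URL the DHCP server hands to chainloaded clients
cat > /var/www/html/boot.ipxe <<'EOF'
#!ipxe
kernel http://boot.example.com/images/vmlinuz initrd=initrd.img console=tty0
initrd http://boot.example.com/images/initrd.img
boot
EOF
```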
UEFI HTTP Boot
Native UEFI HTTP boot is supported on Gen9 servers with recent firmware:
- Protocol: RFC 7230 HTTP/1.1
- Requirements:
- UEFI firmware version 2.40 or later (check via iLO)
- DHCP option 60 (vendor class identifier) = “HTTPClient”
- DHCP option 67 pointing to HTTP(S) URL
- Advantages:
- No TFTP server required
- Faster transfers than TFTP
- Support for HTTPS with certificate validation
- Better suited for large images (kernels, initramfs)
- Limitations:
- UEFI mode only (not available in legacy BIOS)
- Requires DHCP server with HTTP URL support
HTTP(S) Boot Configuration
For UEFI HTTP boot on DL360 Gen9:
# Example ISC DHCP configuration for UEFI HTTP boot
class "httpclients" {
match if substring(option vendor-class-identifier, 0, 10) = "HTTPClient";
}
pool {
allow members of "httpclients";
option vendor-class-identifier "HTTPClient";
# Point to HTTP boot URI
filename "http://boot.example.com/boot/efi/bootx64.efi";
}
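A quick way to confirm the advertised boot URI is serviceable before testing on the server itself is to request it from another host on the provisioning network:

```bash
# Verify the UEFI HTTP boot file is reachable and reports a sensible size
curl -I http://boot.example.com/boot/efi/bootx64.efi
```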
Network Interface Options
The DL360 Gen9 supports multiple network adapter configurations for boot:
FlexibleLOM (LOM = LAN on Motherboard)
HPE FlexibleLOM slot supports:
- HPE 366FLR: Quad-port 1GbE (Broadcom BCM5719)
- HPE 560FLR-SFP+: Dual-port 10GbE (Intel X710)
- HPE 361i: Dual-port 1GbE (Intel I350)
All FlexibleLOM adapters support PXE and UEFI network boot. The option ROM can be configured via BIOS/UEFI settings.
PCIe Network Adapters
Standard PCIe network cards with PXE/UEFI boot ROM support:
- Intel X520, X710 series (10GbE)
- Broadcom NetXtreme series
- Mellanox ConnectX-3/4 (with appropriate firmware)
Boot Priority: Configure via System ROM > Network Boot Options to select which NIC boots first.
Firmware Configuration
Accessing Boot Configuration
- RBSU (ROM-Based Setup Utility): Press F9 during POST
- iLO 4 Remote Console: Access via network, then virtual F9
- UEFI System Utilities: Modern interface for UEFI firmware settings
Key Settings
Navigate to: System Configuration > BIOS/Platform Configuration (RBSU) > Network Boot Options
- Network Boot: Enable/Disable
- Boot Mode: UEFI or Legacy BIOS
- IPv4/IPv6: Enable protocol support
- Boot Retry: Number of attempts before falling back to next boot device
- Boot Order: Prioritize network boot in boot sequence
Per-NIC Configuration
In RBSU > Network Options:
- Option ROM: Enable/Disable per adapter
- Link Speed: Force speed/duplex or auto-negotiate
- VLAN: VLAN tagging for boot (if supported by DHCP/PXE environment)
- PXE Menu: Enable interactive PXE menu (Ctrl+S during PXE boot)
iLO 4 Integration
The DL360 Gen9’s iLO 4 provides additional network boot features:
- Mount ISO images remotely via iLO Virtual Media
- Boot from network-attached ISO without physical media
- Useful for OS installation or diagnostics
Workflow:
- Upload ISO to HTTP/HTTPS server or use SMB/NFS share
- iLO Remote Console > Virtual Devices > Image File CD-ROM/DVD
- Set boot order to prioritize virtual optical drive
- Reboot server
Scripted Deployment via iLO
iLO 4 RESTful API allows:
- Setting one-time boot to network via API call
- Automating PXE boot for provisioning pipelines
- Integration with tools like Terraform, Ansible
Example using iLO RESTful API:
curl -k -u admin:password -X PATCH \
https://ilo-hostname/redfish/v1/Systems/1/ \
-d '{"Boot":{"BootSourceOverrideTarget":"Pxe","BootSourceOverrideEnabled":"Once"}}'
Boot Process Flow
Legacy BIOS PXE Boot
- Server powers on, initializes NICs
- NIC sends DHCPDISCOVER with PXE vendor options
- DHCP server responds with IP, TFTP server (option 66), boot file (option 67)
- NIC downloads NBP (Network Bootstrap Program) via TFTP
- NBP executes (e.g., pxelinux.0 loads syslinux menu)
- User selects boot target or automated script continues
- Kernel and initramfs download and boot
UEFI PXE Boot
- UEFI firmware initializes network stack
- UEFI PXE driver sends DHCPv4/v6 DISCOVER
- DHCP responds with boot file (e.g., bootx64.efi)
- UEFI downloads boot file via TFTP
- UEFI loads and executes boot loader (GRUB2, systemd-boot, iPXE)
- Boot loader may download additional files (kernel, initrd, config)
- OS boots
UEFI HTTP Boot
- UEFI firmware with HTTP Boot support enabled
- DHCP request includes “HTTPClient” vendor class
- DHCP responds with HTTP(S) URL in option 67
- UEFI HTTP client downloads boot file over HTTP(S)
- Execution continues as with UEFI PXE
TFTP vs HTTP
- TFTP: Slow for large files (typical: 1-5 MB/s)
- Use for small boot loaders only
- Chainload to iPXE or HTTP boot for better performance
- HTTP: 10-100x faster depending on network and server
- Recommended for kernels, initramfs, live OS images
- iPXE or UEFI HTTP boot required
Network Speed Impact
DL360 Gen9 boot performance by NIC speed:
- 1GbE: Adequate for most PXE deployments (100-125 MB/s theoretical max)
- 10GbE: Significant improvement for large image downloads (~1.25 GB/s theoretical max)
- Bonding/Teaming: Not typically used for boot (single NIC boots)
Recommendation: For production diskless nodes or frequent re-provisioning, 10GbE with HTTP boot provides best performance.
Common Use Cases
1. Automated OS Provisioning
Boot into installer via PXE:
- Kickstart (RHEL/CentOS/Rocky)
- Preseed (Debian/Ubuntu)
- Ignition (Fedora CoreOS, Flatcar)
2. Diskless Boot
Boot OS entirely from network/RAM:
- Network root: NFS or iSCSI root filesystem
- Overlay: Persistent storage via network overlay
- Stateless: Boot identical image, no local state
3. Rescue and Diagnostics
Boot live environments:
- SystemRescue
- Clonezilla
- Memtest86+
- Hardware diagnostics (HPE Service Pack for ProLiant)
4. Kubernetes/Container Hosts
PXE boot immutable OS images:
- Talos Linux: API-driven, diskless k8s nodes
- Flatcar Container Linux: Automated updates
- k3OS: Lightweight k8s OS
Troubleshooting
PXE Boot Fails
Symptoms: “PXE-E51: No DHCP or proxy DHCP offers received” or timeout
Checks:
- Verify NIC link light and switch port status
- Confirm DHCP server is responding (check DHCP logs)
- Ensure DHCP options 66 and 67 are set correctly
- Test TFTP server accessibility (e.g., tftp -i <server> GET <file>; see the command sketch after this list)
- Check BIOS/UEFI network boot is enabled
- Verify boot order prioritizes network boot
- Disable Secure Boot if using unsigned boot files
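The server-side half of these checks can be run from the provisioning host. A short sketch; the isc-dhcp-server unit name, TFTP server address, and boot file name are assumptions for a typical isc-dhcp + tftp-hpa setup:

```bash
# Watch the DHCP log for PXE DISCOVER/OFFER exchanges from the booting machine
journalctl -u isc-dhcp-server -f | grep -iE 'discover|offer'

# Confirm the TFTP server actually hands out the boot file (tftp-hpa client syntax)
tftp 192.168.10.5 -c get undionly.kpxe && ls -lh undionly.kpxe
```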
UEFI Network Boot Not Available
Symptoms: Network boot option missing in UEFI boot menu
Resolution:
- Enter RBSU (F9), navigate to Network Options
- Ensure at least one NIC has “Option ROM” enabled
- Verify Boot Mode is set to UEFI (not Legacy)
- Update System ROM to latest version if option is missing
- Some FlexibleLOM cards require firmware update for UEFI boot support
HTTP Boot Fails
Symptoms: UEFI HTTP boot option present but fails to download
Checks:
- Verify firmware version supports HTTP boot (>=2.40)
- Ensure DHCP option 67 contains valid HTTP(S) URL
- Test URL accessibility from another client
- Check DNS resolution if using hostname in URL
- For HTTPS: Verify certificate is trusted (or disable cert validation in test)
Slow PXE Boot
Symptoms: Boot process takes minutes instead of seconds
Optimizations:
- Switch from TFTP to HTTP (chainload iPXE or use UEFI HTTP boot)
- Increase TFTP server block size (e.g., tftp-hpa --blocksize 1468)
- Tune DHCP response times (reduce lease query delays)
- Use local network segment for boot server (avoid WAN/VPN)
- Enable NIC interrupt coalescing in BIOS for 10GbE
Security Considerations
Secure Boot
DL360 Gen9 supports UEFI Secure Boot:
- Validates signed boot loaders (shim, GRUB, kernel)
- Prevents unsigned code execution during boot
- Required for some compliance scenarios
Configuration: RBSU > Boot Options > Secure Boot = Enabled
Implications for Network Boot:
- Must use signed boot loaders (e.g., shim.efi signed by Microsoft/vendor)
- Custom kernels require signing or disabling Secure Boot
- iPXE must be signed or chainloaded from signed shim
Network Security
Risks:
- PXE/TFTP is unencrypted and unauthenticated
- Attacker on network can serve malicious boot images
- DHCP spoofing can redirect to malicious boot server
Mitigations:
- Network Segmentation: Isolate PXE boot to management VLAN
- DHCP Snooping: Prevent rogue DHCP servers on switch
- HTTPS Boot: Use UEFI HTTP boot with TLS and certificate validation
- iPXE with HTTPS: Chainload iPXE, then use HTTPS for all downloads
- Signed Images: Use Secure Boot with signed boot chain
- 802.1X: Require network authentication before DHCP (complex for PXE)
iLO Security
- Change default iLO password immediately
- Use TLS for iLO web interface and API
- Restrict iLO network access (firewall, separate VLAN)
- Disable iLO Virtual Media if not needed
- Leave the iLO Security Override switch disabled; it bypasses iLO authentication and is intended only for password recovery
Firmware and Driver Resources
Required Firmware Versions
For optimal network boot support:
- System ROM: v2.60 or later (latest recommended)
- iLO 4 Firmware: v2.80 or later
- NIC Firmware: Latest for specific FlexibleLOM/PCIe card
Check current versions: iLO web interface > Information > Firmware Information
Updating Firmware
Methods:
HPE Service Pack for ProLiant (SPP): Comprehensive update bundle
- Boot from SPP ISO (via iLO Virtual Media or USB)
- Runs Smart Update Manager (SUM) in Linux environment
- Updates all firmware, drivers, system ROM automatically
iLO Web Interface: Individual component updates
- System ROM: Administration > Firmware > Update Firmware
- Upload .fwpkg or .bin files from HPE support site
Online Flash Component: Linux Online ROM Flash utility
- Install the hp-firmware-* packages
- Run updates while the OS is running (reboot required to apply)
Download Source: https://support.hpe.com/connect/s/product?language=en_US&kmpmoid=1010026910 (requires HPE Passport account, free registration)
Best Practices
- Use UEFI Mode: Better security, IPv6 support, larger disk support
- Enable HTTP Boot: Faster and more reliable than TFTP for large files
- Chainload iPXE: Flexibility of iPXE with standard PXE infrastructure
- Update Firmware: Keep System ROM and iLO current for bug fixes and features
- Isolate Boot Network: Use dedicated management VLAN for PXE/provisioning
- Test Failover: Configure multiple DHCP servers and boot mirrors for redundancy
- Document Configuration: Record BIOS settings, DHCP config, and boot infrastructure
- Monitor iLO Logs: Track boot failures and hardware issues via iLO event log
References
- HPE ProLiant DL360 Gen9 Server User Guide
- HPE UEFI System Utilities User Guide
- iLO 4 User Guide (firmware version 2.80)
- Intel PXE Specification v2.1
- UEFI Specification v2.8 (HTTP Boot)
- iPXE Documentation: https://ipxe.org/
Conclusion
The HP ProLiant DL360 Gen9 provides enterprise-grade network boot capabilities suitable for both traditional PXE deployments and modern UEFI HTTP boot scenarios. Its flexible configuration options, mature firmware support, and iLO integration make it an excellent platform for automated provisioning, diskless computing, and infrastructure-as-code workflows in home lab environments.
For home lab use, the recommended configuration is:
- UEFI boot mode with Secure Boot disabled (unless required)
- iPXE chainloading for flexibility and HTTP performance
- iLO 4 configured for remote management and scripted provisioning
- Latest firmware for stability and feature support
5 - Matchbox Analysis
Analysis of Matchbox network boot service capabilities and architecture
Matchbox Network Boot Analysis
This section contains a comprehensive analysis of Matchbox, a network boot service for provisioning bare-metal machines.
Overview
Matchbox is an HTTP and gRPC service developed by Poseidon that automates bare-metal machine provisioning through network booting. It matches machines to configuration profiles based on hardware attributes and serves boot configurations, kernel images, and provisioning configs.
Primary Repository: poseidon/matchbox
Documentation: https://matchbox.psdn.io/
License: Apache 2.0
Key Features
- Network Boot Support: iPXE, PXELINUX, GRUB2 chainloading
- OS Provisioning: Fedora CoreOS, Flatcar Linux, RHEL CoreOS
- Configuration Management: Ignition v3.x configs, Butane transpilation
- Machine Matching: Label-based matching (MAC, UUID, hostname, serial, custom)
- API: Read-only HTTP API + authenticated gRPC API
- Asset Serving: Local caching of OS images for faster deployment
- Templating: Go template support for dynamic configuration
Use Cases
- Bare-metal Kubernetes clusters - Provision CoreOS nodes for k8s
- Lab/development environments - Quick PXE boot for testing
- Datacenter provisioning - Automate OS installation across fleets
- Immutable infrastructure - Declarative machine provisioning via Terraform
Analysis Contents
Quick Architecture
┌─────────────┐
│ Machine │ PXE Boot
│ (BIOS/UEFI)│───┐
└─────────────┘ │
│
┌─────────────┐ │ DHCP/TFTP
│ dnsmasq │◄──┘ (chainload to iPXE)
│ DHCP+TFTP │
└─────────────┘
│
│ HTTP
▼
┌─────────────────────────┐
│ Matchbox │
│ ┌──────────────────┐ │
│ │ HTTP Endpoints │ │ /boot.ipxe, /ignition
│ └──────────────────┘ │
│ ┌──────────────────┐ │
│ │ gRPC API │ │ Terraform provider
│ └──────────────────┘ │
│ ┌──────────────────┐ │
│ │ Profile/Group │ │ Match machines
│ │ Matcher │ │ to configs
│ └──────────────────┘ │
└─────────────────────────┘
Technology Stack
- Language: Go
- Config Formats: Ignition JSON, Butane YAML
- Boot Protocols: PXE, iPXE, GRUB2
- APIs: HTTP (read-only), gRPC (authenticated)
- Deployment: Binary, container (Podman/Docker), Kubernetes
Integration Points
- Terraform:
terraform-provider-matchbox for declarative provisioning - Ignition/Butane: CoreOS provisioning configs
- dnsmasq: Reference DHCP/TFTP/DNS implementation (quay.io/poseidon/dnsmasq)
- Asset sources: Can serve local or remote (HTTPS) OS images
5.1 - Configuration Model
Analysis of Matchbox’s profile, group, and templating system
Matchbox Configuration Model
Matchbox uses a flexible configuration model based on Profiles (what to provision) and Groups (which machines get which profile), with support for templating and metadata.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ Matchbox Store │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Profiles │ │ Groups │ │ Assets │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Matcher Engine │ │
│ │ (Label-based group selection) │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Template Renderer │ │
│ │ (Go templates + metadata) │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
Rendered Config (iPXE, Ignition, etc.)
Data Directory Structure
Matchbox uses a FileStore (default) that reads from -data-path (default: /var/lib/matchbox):
/var/lib/matchbox/
├── groups/ # Machine group definitions (JSON)
│ ├── default.json
│ ├── node1.json
│ └── us-west.json
├── profiles/ # Profile definitions (JSON)
│ ├── worker.json
│ ├── controller.json
│ └── etcd.json
├── ignition/ # Ignition configs (.ign) or Butane (.yaml)
│ ├── worker.ign
│ ├── controller.ign
│ └── butane-example.yaml
├── cloud/ # Cloud-Config templates (DEPRECATED)
│ └── legacy.yaml.tmpl
├── generic/ # Arbitrary config templates
│ ├── setup.cfg
│ └── metadata.yaml.tmpl
└── assets/ # Static files (kernel, initrd)
├── fedora-coreos/
└── flatcar/
Version control: Poseidon recommends keeping /var/lib/matchbox under git for auditability and rollback.
Profiles
Profiles define what to provision: network boot settings (kernel, initrd, args) and config references (Ignition, Cloud-Config, generic).
Profile Schema
{
"id": "worker",
"name": "Fedora CoreOS Worker Node",
"boot": {
"kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
"initrd": [
"--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"
],
"args": [
"initrd=main",
"coreos.live.rootfs_url=http://matchbox.example.com:8080/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img",
"coreos.inst.install_dev=/dev/sda",
"coreos.inst.ignition_url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
]
},
"ignition_id": "worker.ign",
"cloud_id": "",
"generic_id": ""
}
Profile Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✅ | Unique profile identifier (referenced by groups) |
| name | string | ❌ | Human-readable description |
| boot | object | ❌ | Network boot configuration |
| boot.kernel | string | ❌ | Kernel URL (HTTP/HTTPS or /assets path) |
| boot.initrd | array | ❌ | Initrd URLs (can specify --name for multi-initrd) |
| boot.args | array | ❌ | Kernel command-line arguments |
| ignition_id | string | ❌ | Ignition/Butane config filename in ignition/ |
| cloud_id | string | ❌ | Cloud-Config filename in cloud/ (deprecated) |
| generic_id | string | ❌ | Generic config filename in generic/ |
Boot Configuration Patterns
Pattern 1: Live PXE (RAM-based, ephemeral)
Boot and run OS entirely from RAM, no disk install:
{
"boot": {
"kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
"initrd": [
"--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
],
"args": [
"initrd=main",
"coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
"ignition.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
]
}
}
Use case: Diskless workers, testing, ephemeral compute
Pattern 2: Disk Install (persistent)
PXE boot live image, install to disk, reboot to disk:
{
"boot": {
"kernel": "/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-kernel-x86_64",
"initrd": [
"--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img"
],
"args": [
"initrd=main",
"coreos.live.rootfs_url=http://matchbox/assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-rootfs.x86_64.img",
"coreos.inst.install_dev=/dev/sda",
"coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp}"
]
}
}
Key difference: coreos.inst.install_dev triggers disk install before reboot
Pattern 3: Multi-initrd (layered)
Multiple initrds can be loaded (e.g., base + drivers):
{
"initrd": [
"--name main /assets/fedora-coreos/VERSION/fedora-coreos-VERSION-live-initramfs.x86_64.img",
"--name drivers /assets/drivers/custom-drivers.img"
],
"args": [
"initrd=main,drivers",
"..."
]
}
Config References
Ignition Configs
Direct Ignition (.ign files):
{
"ignition_id": "worker.ign"
}
File: /var/lib/matchbox/ignition/worker.ign
{
"ignition": { "version": "3.3.0" },
"systemd": {
"units": [{
"name": "example.service",
"enabled": true,
"contents": "[Service]\nType=oneshot\nExecStart=/usr/bin/echo Hello\n\n[Install]\nWantedBy=multi-user.target"
}]
}
}
Butane Configs (transpiled to Ignition):
{
"ignition_id": "worker.yaml"
}
File: /var/lib/matchbox/ignition/worker.yaml
variant: fcos
version: 1.5.0
passwd:
users:
- name: core
ssh_authorized_keys:
- ssh-ed25519 AAAA...
systemd:
units:
- name: etcd.service
enabled: true
Matchbox automatically:
- Detects Butane format (file doesn't end in .ign or .ignition)
- Transpiles Butane → Ignition using the embedded library
- Renders templates with group metadata
- Serves as Ignition v3.3.0
Generic Configs
For non-Ignition configs (scripts, YAML, arbitrary data):
{
"generic_id": "setup-script.sh.tmpl"
}
File: /var/lib/matchbox/generic/setup-script.sh.tmpl
#!/bin/bash
# Rendered with group metadata
NODE_NAME={{.node_name}}
CLUSTER_ID={{.cluster_id}}
echo "Provisioning ${NODE_NAME} in cluster ${CLUSTER_ID}"
Access via: GET /generic?uuid=...&mac=...
Groups
Groups match machines to profiles using selectors (label matching) and provide metadata for template rendering.
Group Schema
{
"id": "node1-worker",
"name": "Worker Node 1",
"profile": "worker",
"selector": {
"mac": "52:54:00:89:d8:10",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
},
"metadata": {
"node_name": "worker-01",
"cluster_id": "prod-cluster",
"etcd_endpoints": "https://10.0.1.10:2379,https://10.0.1.11:2379",
"ssh_authorized_keys": [
"ssh-ed25519 AAAA...",
"ssh-rsa AAAA..."
]
}
}
Group Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✅ | Unique group identifier |
| name | string | ❌ | Human-readable description |
| profile | string | ✅ | Profile ID to apply |
| selector | object | ❌ | Label match criteria (omit for default group) |
| metadata | object | ❌ | Key-value data for template rendering |
Selector Matching
Reserved selectors (automatically populated from machine attributes):
| Selector | Source | Example | Normalized |
|---|---|---|---|
| uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase |
| mac | Primary NIC MAC | 52:54:00:89:d8:10 | Colon-separated |
| hostname | Network hostname | node1.example.com | As reported |
| serial | Hardware serial | VMware-42 1a... | As reported |
Custom selectors (passed as query params):
{
"selector": {
"region": "us-west",
"environment": "production",
"rack": "A23"
}
}
Matching request: /ipxe?mac=52:54:00:89:d8:10&region=us-west&environment=production&rack=A23
Matching logic:
- All selector key-value pairs must match request labels (AND logic)
- Most specific group wins (most selector matches)
- If multiple groups have same specificity, first match wins (undefined order)
- Groups with no selectors = default group (matches all)
Default Groups
Group with empty selector matches all machines:
{
"id": "default-worker",
"name": "Default Worker",
"profile": "worker",
"metadata": {
"environment": "dev"
}
}
⚠️ Warning: Avoid multiple default groups (non-deterministic matching)
Example: Region-based Matching
Group 1: US-West Workers
{
"id": "us-west-workers",
"profile": "worker",
"selector": {
"region": "us-west"
},
"metadata": {
"etcd_endpoints": "https://etcd-usw.example.com:2379"
}
}
Group 2: EU Workers
{
"id": "eu-workers",
"profile": "worker",
"selector": {
"region": "eu"
},
"metadata": {
"etcd_endpoints": "https://etcd-eu.example.com:2379"
}
}
Group 3: Specific Machine Override
{
"id": "node-special",
"profile": "controller",
"selector": {
"mac": "52:54:00:89:d8:10",
"region": "us-west"
},
"metadata": {
"role": "controller"
}
}
Matching precedence (demonstrated with curl below):
- Machine with mac=52:54:00:89:d8:10&region=us-west → node-special (2 selectors)
- Machine with region=us-west → us-west-workers (1 selector)
- Machine with region=eu → eu-workers (1 selector)
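The same precedence can be exercised against a running Matchbox instance by varying the query labels; the endpoint hostname below is an assumption:

```bash
# Two matching selectors: served the node-special (controller) profile's config
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&region=us-west'

# Only the region label: falls back to the us-west-workers group
curl 'http://matchbox.example.com:8080/ignition?region=us-west'
```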
Templating System
Matchbox uses Go’s text/template for rendering configs with group metadata.
Template Context
Available variables in Ignition/Butane/Cloud-Config/generic templates:
// Group metadata (all keys from group.metadata)
{{.node_name}}
{{.cluster_id}}
{{.etcd_endpoints}}
// Group selectors (normalized)
{{.mac}} // e.g., "52:54:00:89:d8:10"
{{.uuid}} // e.g., "550e8400-..."
{{.region}} // Custom selector
// Request query params (raw)
{{.request.query.mac}} // As passed in URL
{{.request.query.foo}} // Custom query param
{{.request.raw_query}} // Full query string
// Special functions
{{if index . "ssh_authorized_keys"}} // Check if key exists
{{range $element := .ssh_authorized_keys}} // Iterate arrays
Example: Templated Butane Config
Group metadata:
{
"metadata": {
"node_name": "worker-01",
"ssh_authorized_keys": [
"ssh-ed25519 AAA...",
"ssh-rsa BBB..."
],
"ntp_servers": ["time1.google.com", "time2.google.com"]
}
}
Butane template: /var/lib/matchbox/ignition/worker.yaml
variant: fcos
version: 1.5.0
storage:
files:
- path: /etc/hostname
mode: 0644
contents:
inline: {{.node_name}}
- path: /etc/systemd/timesyncd.conf
mode: 0644
contents:
inline: |
[Time]
{{range $server := .ntp_servers}}
NTP={{$server}}
{{end}}
{{if index . "ssh_authorized_keys"}}
passwd:
users:
- name: core
ssh_authorized_keys:
{{range $key := .ssh_authorized_keys}}
- {{$key}}
{{end}}
{{end}}
Rendered Ignition (simplified):
{
"ignition": {"version": "3.3.0"},
"storage": {
"files": [
{
"path": "/etc/hostname",
"contents": {"source": "data:,worker-01"},
"mode": 420
},
{
"path": "/etc/systemd/timesyncd.conf",
"contents": {"source": "data:,%5BTime%5D%0ANTP%3Dtime1.google.com%0ANTP%3Dtime2.google.com"},
"mode": 420
}
]
},
"passwd": {
"users": [{
"name": "core",
"sshAuthorizedKeys": ["ssh-ed25519 AAA...", "ssh-rsa BBB..."]
}]
}
}
Template Best Practices
- Prefer external rendering: Use Terraform + the ct_config provider for complex templates
- Validate Butane: Use strict: true in Terraform or fcct --strict
- Escape carefully: Go templates use {{}}, Butane uses YAML - mind the interaction
- Test rendering: Request /ignition?mac=... directly to inspect output
- Version control: Keep templates + groups in git for auditability
Warning: .request is reserved for query param access. Group metadata with "request": {...} will be overwritten.
Reserved keys:
- request.query.* - Query parameters
- request.raw_query - Raw query string
API Integration
HTTP Endpoints (Read-only)
| Endpoint | Purpose | Template Context |
|---|---|---|
| /ipxe | iPXE boot script | Profile boot section |
| /grub | GRUB config | Profile boot section |
| /ignition | Ignition config | Group metadata + selectors + query |
| /cloud | Cloud-Config (deprecated) | Group metadata + selectors + query |
| /generic | Generic config | Group metadata + selectors + query |
| /metadata | Key-value env format | Group metadata + selectors + query |
Example metadata endpoint response:
GET /metadata?mac=52:54:00:89:d8:10&foo=bar
NODE_NAME=worker-01
CLUSTER_ID=prod
MAC=52:54:00:89:d8:10
REQUEST_QUERY_MAC=52:54:00:89:d8:10
REQUEST_QUERY_FOO=bar
REQUEST_RAW_QUERY=mac=52:54:00:89:d8:10&foo=bar
gRPC API (Authenticated, mutable)
Used by terraform-provider-matchbox for declarative infrastructure:
Terraform example:
provider "matchbox" {
endpoint = "matchbox.example.com:8081"
client_cert = file("~/.matchbox/client.crt")
client_key = file("~/.matchbox/client.key")
ca = file("~/.matchbox/ca.crt")
}
resource "matchbox_profile" "worker" {
name = "worker"
kernel = "/assets/fedora-coreos/.../kernel"
initrd = ["--name main /assets/fedora-coreos/.../initramfs.img"]
args = [
"initrd=main",
"coreos.inst.install_dev=/dev/sda",
"coreos.inst.ignition_url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}"
]
raw_ignition = data.ct_config.worker.rendered
}
resource "matchbox_group" "node1" {
name = "node1"
profile = matchbox_profile.worker.name
selector = {
mac = "52:54:00:89:d8:10"
}
metadata = {
node_name = "worker-01"
}
}
Operations:
- CreateProfile, GetProfile, UpdateProfile, DeleteProfile
- CreateGroup, GetGroup, UpdateGroup, DeleteGroup
TLS client authentication required (see deployment docs)
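These credentials can be produced with a small self-signed PKI; the sketch below uses plain openssl with assumed file names and SAN (the upstream Matchbox repository also provides cert-generation helper scripts):

```bash
# Self-signed CA
openssl genrsa -out ca.key 4096
openssl req -x509 -new -key ca.key -sha256 -days 3650 -subj "/CN=matchbox-ca" -out ca.crt

# Server certificate for the gRPC endpoint (SAN must match the endpoint name Terraform dials)
openssl genrsa -out server.key 2048
openssl req -new -key server.key -subj "/CN=matchbox.example.com" -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -sha256 -days 825 \
  -extfile <(printf 'subjectAltName=DNS:matchbox.example.com') -out server.crt

# Client certificate used by terraform-provider-matchbox
openssl genrsa -out client.key 2048
openssl req -new -key client.key -subj "/CN=terraform" -out client.csr
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -sha256 -days 825 -out client.crt
```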
Configuration Workflow
┌─────────────────────────────────────────────────────────────┐
│ 1. Write Butane configs (YAML) │
│ - worker.yaml, controller.yaml │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Terraform ct_config transpiles Butane → Ignition │
│ data "ct_config" "worker" { │
│ content = file("worker.yaml") │
│ strict = true │
│ } │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Terraform creates profiles + groups in Matchbox │
│ matchbox_profile.worker → gRPC CreateProfile() │
│ matchbox_group.node1 → gRPC CreateGroup() │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Machine PXE boots, queries Matchbox │
│ GET /ipxe?mac=... → matches group → returns profile │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Ignition fetches rendered config │
│ GET /ignition?mac=... → Matchbox returns Ignition │
└─────────────────────────────────────────────────────────────┘
Benefits:
- Rich Terraform templating (loops, conditionals, external data sources)
- Butane validation before deployment
- Declarative infrastructure (can
terraform plan before apply) - Version control workflow (git + CI/CD)
Alternative: Manual FileStore
┌─────────────────────────────────────────────────────────────┐
│ 1. Create profile JSON manually │
│ /var/lib/matchbox/profiles/worker.json │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Create group JSON manually │
│ /var/lib/matchbox/groups/node1.json │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Write Ignition/Butane config │
│ /var/lib/matchbox/ignition/worker.ign │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Restart matchbox (to reload FileStore) │
│ systemctl restart matchbox │
└─────────────────────────────────────────────────────────────┘
Drawbacks:
- Manual file management
- No validation before deployment
- Requires matchbox restart to pick up changes
- Error-prone for large fleets
Storage Backends
FileStore (Default)
Config: -data-path=/var/lib/matchbox
Pros:
- Simple file-based storage
- Easy to version control (git)
- Human-readable JSON
Cons:
- Requires file system access
- Manual reload for gRPC-created resources
Custom Store (Extensible)
Matchbox’s Store interface allows custom backends:
type Store interface {
ProfileGet(id string) (*Profile, error)
GroupGet(id string) (*Group, error)
IgnitionGet(name string) (string, error)
// ... other methods
}
Potential custom stores:
- etcd backend (for HA Matchbox)
- Database backend (PostgreSQL, MySQL)
- S3/object storage backend
Note: Not officially provided by Matchbox project; requires custom implementation
Security Considerations
gRPC API authentication: Requires TLS client certificates
- ca.crt - CA that signed client certs
- server.crt / server.key - Server TLS identity
- client.crt / client.key - Client credentials (Terraform)
HTTP endpoints are read-only: No auth, machines fetch configs
- Do NOT put secrets in Ignition configs
- Use external secret stores (Vault, GCP Secret Manager)
- Reference secrets via Ignition files.source with auth headers
Network segmentation: Matchbox on provisioning VLAN, isolate from production
Config validation: Validate Ignition/Butane before deployment to avoid boot failures
Audit logging: Version control groups/profiles; log gRPC API changes
Operational Tips
Test groups with curl:
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
List profiles:
ls -la /var/lib/matchbox/profiles/
Validate Butane:
podman run -i --rm quay.io/coreos/fcct:release --strict < worker.yaml
Check group matching:
# Default group (no selectors)
curl http://matchbox.example.com:8080/ignition
# Specific machine
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10&uuid=550e8400-e29b-41d4-a716-446655440000'
Backup configs:
tar -czf matchbox-backup-$(date +%F).tar.gz /var/lib/matchbox/{groups,profiles,ignition}
Summary
Matchbox’s configuration model provides:
- Separation of concerns: Profiles (what) vs Groups (who/where)
- Flexible matching: Label-based, multi-attribute, custom selectors
- Template support: Go templates for dynamic configs (but prefer external rendering)
- API-driven: Terraform integration for GitOps workflows
- Storage options: FileStore (simple) or custom backends (extensible)
- OS-agnostic: Works with any Ignition-based distro (FCOS, Flatcar, RHCOS)
Best practice: Use Terraform + external Butane configs for production; manual FileStore for labs/development.
5.2 - Deployment Patterns
Matchbox deployment options and operational considerations
Matchbox Deployment Patterns
Analysis of deployment architectures, installation methods, and operational considerations for running Matchbox in production.
Deployment Architectures
Single-Host Deployment
┌─────────────────────────────────────────────────────┐
│ Provisioning Host │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Matchbox │ │ dnsmasq │ │
│ │ :8080 HTTP │ │ DHCP/TFTP │ │
│ │ :8081 gRPC │ │ :67,:69 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ └──────────┬───────────┘ │
│ │ │
│ /var/lib/matchbox/ │
│ ├── groups/ │
│ ├── profiles/ │
│ ├── ignition/ │
│ └── assets/ │
└─────────────────────────────────────────────────────┘
│
│ Network
▼
┌──────────────┐
│ PXE Clients │
└──────────────┘
Use case: Lab, development, small deployments (<50 machines)
Pros:
- Simple setup
- Single service to manage
- Minimal resource requirements
Cons:
- Single point of failure
- No scalability
- Downtime during updates
HA Deployment (Multiple Matchbox Instances)
┌─────────────────────────────────────────────────────┐
│ Load Balancer (Ingress/HAProxy) │
│ :8080 HTTP :8081 gRPC │
└─────────────────────────────────────────────────────┘
│ │
├─────────────┬────────────────┤
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│Matchbox 1│ │Matchbox 2│ │Matchbox N│
│ (Pod/VM) │ │ (Pod/VM) │ │ (Pod/VM) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────┴────────────────┘
│
▼
┌────────────────────────┐
│ Shared Storage │
│ /var/lib/matchbox │
│ (NFS, PV, ConfigMap) │
└────────────────────────┘
Use case: Production, datacenter-scale (100+ machines)
Pros:
- High availability (no single point of failure)
- Rolling updates (zero downtime)
- Load distribution
Cons:
- Complex storage (shared volume or etcd backend)
- More infrastructure required
Storage options:
- Kubernetes PersistentVolume (RWX mode)
- NFS share mounted on multiple hosts
- Custom etcd-backed Store (requires custom implementation)
- Git-sync sidecar (read-only, periodic pull)
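As a hedged sketch of the NFS option (assuming an existing export such as nfs.example.com:/exports/matchbox, a placeholder), each Matchbox host would mount the shared data directory:
# Mount the shared FileStore on every Matchbox host (example export path)
sudo mkdir -p /var/lib/matchbox
echo 'nfs.example.com:/exports/matchbox /var/lib/matchbox nfs4 rw,hard,_netdev 0 0' | sudo tee -a /etc/fstab
sudo mount /var/lib/matchbox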
Kubernetes Deployment
┌─────────────────────────────────────────────────────┐
│ Ingress Controller │
│ matchbox.example.com → Service matchbox:8080 │
│ matchbox-rpc.example.com → Service matchbox:8081 │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Service: matchbox (ClusterIP) │
│ ports: 8080/TCP, 8081/TCP │
└─────────────────────────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Pod: matchbox │ │ Pod: matchbox │
│ replicas: 2+ │ │ replicas: 2+ │
└─────────────────┘ └─────────────────┘
│ │
└───────────┬───────────┘
▼
┌─────────────────────────────────────────────────────┐
│ PersistentVolumeClaim: matchbox-data │
│ /var/lib/matchbox (RWX mode) │
└─────────────────────────────────────────────────────┘
Manifest structure:
contrib/k8s/
├── matchbox-deployment.yaml # Deployment + replicas
├── matchbox-service.yaml # Service (8080, 8081)
├── matchbox-ingress.yaml # Ingress (HTTP + gRPC TLS)
└── matchbox-pvc.yaml # PersistentVolumeClaim
Key configurations:
Secret for gRPC TLS:
kubectl create secret generic matchbox-rpc \
--from-file=ca.crt \
--from-file=server.crt \
--from-file=server.key
Ingress for gRPC (TLS passthrough):
metadata:
annotations:
nginx.ingress.kubernetes.io/ssl-passthrough: "true"
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
Volume mount:
volumes:
- name: data
persistentVolumeClaim:
claimName: matchbox-data
volumeMounts:
- name: data
mountPath: /var/lib/matchbox
Use case: Cloud-native deployments, Kubernetes-based infrastructure
Pros:
- Native Kubernetes primitives (Deployments, Services, Ingress)
- Rolling updates via Deployment strategy
- Easy scaling (kubectl scale)
- Health checks + auto-restart
Cons:
- Requires RWX PersistentVolume or shared storage
- Ingress TLS configuration complexity (gRPC passthrough)
- Cluster dependency (can’t provision cluster bootstrap nodes)
⚠️ Bootstrap problem: Kubernetes-hosted Matchbox can’t PXE boot its own cluster nodes (chicken-and-egg). Use external Matchbox for initial cluster bootstrap, then migrate.
Installation Methods
1. Binary Installation (systemd)
Recommended for: Bare-metal hosts, VMs, traditional Linux servers
Steps:
Download and verify:
wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz
wget https://github.com/poseidon/matchbox/releases/download/v0.10.0/matchbox-v0.10.0-linux-amd64.tar.gz.asc
gpg --verify matchbox-v0.10.0-linux-amd64.tar.gz.asc
Extract and install:
tar xzf matchbox-v0.10.0-linux-amd64.tar.gz
sudo cp matchbox-v0.10.0-linux-amd64/matchbox /usr/local/bin/
Create user and directories:
sudo useradd -U matchbox
sudo mkdir -p /var/lib/matchbox/{assets,groups,profiles,ignition}
sudo chown -R matchbox:matchbox /var/lib/matchbox
Install systemd unit:
sudo cp contrib/systemd/matchbox.service /etc/systemd/system/
Configure via systemd dropin:
sudo systemctl edit matchbox
[Service]
Environment="MATCHBOX_ADDRESS=0.0.0.0:8080"
Environment="MATCHBOX_RPC_ADDRESS=0.0.0.0:8081"
Environment="MATCHBOX_LOG_LEVEL=debug"
Start service:
sudo systemctl daemon-reload
sudo systemctl start matchbox
sudo systemctl enable matchbox
Pros:
- Direct control over service
- Easy log access (journalctl -u matchbox)
- Native OS integration
Cons:
- Manual updates required
- OS dependency (package compatibility)
2. Container Deployment (Docker/Podman)
Recommended for: Docker hosts, quick testing, immutable infrastructure
Docker:
mkdir -p /var/lib/matchbox/assets
docker run -d --name matchbox \
--net=host \
-v /var/lib/matchbox:/var/lib/matchbox:Z \
-v /etc/matchbox:/etc/matchbox:Z,ro \
quay.io/poseidon/matchbox:v0.10.0 \
-address=0.0.0.0:8080 \
-rpc-address=0.0.0.0:8081 \
-log-level=debug
Podman:
podman run -d --name matchbox \
--net=host \
-v /var/lib/matchbox:/var/lib/matchbox:Z \
-v /etc/matchbox:/etc/matchbox:Z,ro \
quay.io/poseidon/matchbox:v0.10.0 \
-address=0.0.0.0:8080 \
-rpc-address=0.0.0.0:8081 \
-log-level=debug
Volume mounts:
- /var/lib/matchbox - Data directory (groups, profiles, configs, assets)
- /etc/matchbox - TLS certificates (ca.crt, server.crt, server.key)
Network mode:
- --net=host - Required for DHCP/TFTP interaction on the same host
- Bridge mode possible if Matchbox runs on a separate host from dnsmasq
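For the bridge-mode case, a hedged sketch with published ports instead of host networking (dnsmasq runs elsewhere):
docker run -d --name matchbox \
  -p 8080:8080 -p 8081:8081 \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  -v /etc/matchbox:/etc/matchbox:Z,ro \
  quay.io/poseidon/matchbox:v0.10.0 \
  -address=0.0.0.0:8080 \
  -rpc-address=0.0.0.0:8081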
Pros:
- Immutable deployments
- Easy updates (pull new image)
- Portable across hosts
Cons:
- Volume management complexity
- SELinux considerations (:Z flag)
3. Kubernetes Deployment
Recommended for: Kubernetes environments, cloud platforms
Quick start:
# Create TLS secret for gRPC
kubectl create secret generic matchbox-rpc \
--from-file=ca.crt=~/.matchbox/ca.crt \
--from-file=server.crt=~/.matchbox/server.crt \
--from-file=server.key=~/.matchbox/server.key
# Deploy manifests
kubectl apply -R -f contrib/k8s/
# Check status
kubectl get pods -l app=matchbox
kubectl get svc matchbox
kubectl get ingress matchbox matchbox-rpc
Persistence options:
Option 1: emptyDir (ephemeral, dev only):
volumes:
- name: data
emptyDir: {}
Option 2: PersistentVolumeClaim (production):
volumes:
- name: data
persistentVolumeClaim:
claimName: matchbox-data
Option 3: ConfigMap (static configs):
volumes:
- name: groups
configMap:
name: matchbox-groups
- name: profiles
configMap:
name: matchbox-profiles
Option 4: Git-sync sidecar (GitOps):
initContainers:
- name: git-sync
image: k8s.gcr.io/git-sync:v3.6.3
env:
- name: GIT_SYNC_REPO
value: https://github.com/example/matchbox-configs
- name: GIT_SYNC_DEST
value: /var/lib/matchbox
volumeMounts:
- name: data
mountPath: /var/lib/matchbox
Pros:
- Native k8s features (scaling, health checks, rolling updates)
- Ingress integration
- GitOps workflows
Cons:
- Complexity (Ingress, PVC, TLS)
- Can’t bootstrap own cluster
Network Boot Environment Setup
Matchbox requires separate DHCP/TFTP/DNS services. Options:
Option 1: dnsmasq Container (Quickest)
Use case: Lab, testing, environments without existing DHCP
Full DHCP + TFTP + DNS:
docker run -d --name dnsmasq \
--cap-add=NET_ADMIN \
--net=host \
quay.io/poseidon/dnsmasq:latest \
-d -q \
--dhcp-range=192.168.1.3,192.168.1.254,30m \
--enable-tftp \
--tftp-root=/var/lib/tftpboot \
--dhcp-match=set:bios,option:client-arch,0 \
--dhcp-boot=tag:bios,undionly.kpxe \
--dhcp-match=set:efi64,option:client-arch,9 \
--dhcp-boot=tag:efi64,ipxe.efi \
--dhcp-userclass=set:ipxe,iPXE \
--dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
--address=/matchbox.example.com/192.168.1.2 \
--log-queries \
--log-dhcp
Proxy DHCP (alongside existing DHCP):
docker run -d --name dnsmasq \
--cap-add=NET_ADMIN \
--net=host \
quay.io/poseidon/dnsmasq:latest \
-d -q \
--dhcp-range=192.168.1.1,proxy,255.255.255.0 \
--enable-tftp \
--tftp-root=/var/lib/tftpboot \
--dhcp-userclass=set:ipxe,iPXE \
--pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
--pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
--log-queries \
--log-dhcp
Included files: undionly.kpxe, ipxe.efi, grub.efi (bundled in image)
Option 2: Existing DHCP/TFTP Infrastructure
Use case: Enterprise environments with network admin policies
Required DHCP options (ISC DHCP example):
subnet 192.168.1.0 netmask 255.255.255.0 {
range 192.168.1.10 192.168.1.250;
# BIOS clients
if option architecture-type = 00:00 {
filename "undionly.kpxe";
}
# UEFI clients
elsif option architecture-type = 00:09 {
filename "ipxe.efi";
}
# iPXE clients
elsif exists user-class and option user-class = "iPXE" {
filename "http://matchbox.example.com:8080/boot.ipxe";
}
next-server 192.168.1.100; # TFTP server IP
}
TFTP files (place in the TFTP root): undionly.kpxe, ipxe.efi, grub.efi
Option 3: iPXE-only (No PXE Chainload)
Use case: Modern hardware with native iPXE firmware
DHCP config (simpler):
filename "http://matchbox.example.com:8080/boot.ipxe";
No TFTP server needed (iPXE fetches directly via HTTP)
Limitation: Doesn’t support legacy BIOS with basic PXE ROM
TLS Certificate Setup
gRPC API requires TLS client certificates for authentication.
Option 1: Provided cert-gen Script
cd scripts/tls
export SAN=DNS.1:matchbox.example.com,IP.1:192.168.1.100
./cert-gen
Generates:
- ca.crt - Self-signed CA
- server.crt, server.key - Server credentials
- client.crt, client.key - Client credentials (for Terraform)
Install server certs:
sudo mkdir -p /etc/matchbox
sudo cp ca.crt server.crt server.key /etc/matchbox/
sudo chown -R matchbox:matchbox /etc/matchbox
Save client certs for Terraform:
mkdir -p ~/.matchbox
cp client.crt client.key ca.crt ~/.matchbox/
Option 2: Corporate PKI
Preferred for production: Use organization’s certificate authority
Requirements:
- Server cert with SAN: DNS:matchbox.example.com
- Client cert issued by the same CA
- CA cert for validation
Matchbox flags:
-ca-file=/etc/matchbox/ca.crt
-cert-file=/etc/matchbox/server.crt
-key-file=/etc/matchbox/server.key
Terraform provider config:
provider "matchbox" {
endpoint = "matchbox.example.com:8081"
client_cert = file("/path/to/client.crt")
client_key = file("/path/to/client.key")
ca = file("/path/to/ca.crt")
}
Option 3: Let’s Encrypt (HTTP API only)
Note: gRPC requires client cert auth (incompatible with Let’s Encrypt)
Use case: TLS for HTTP endpoints only (read-only API)
Matchbox flags:
-web-ssl=true
-web-cert-file=/etc/letsencrypt/live/matchbox.example.com/fullchain.pem
-web-key-file=/etc/letsencrypt/live/matchbox.example.com/privkey.pem
Limitation: Still need self-signed certs for gRPC API
Configuration Flags
Core Flags
| Flag | Default | Description |
|---|---|---|
| -address | 127.0.0.1:8080 | HTTP API listen address |
| -rpc-address | `` | gRPC API listen address (empty = disabled) |
| -data-path | /var/lib/matchbox | Data directory (FileStore) |
| -assets-path | /var/lib/matchbox/assets | Static assets directory |
| -log-level | info | Logging level (debug, info, warn, error) |
TLS Flags (gRPC)
| Flag | Default | Description |
|---|---|---|
| -ca-file | /etc/matchbox/ca.crt | CA certificate for client verification |
| -cert-file | /etc/matchbox/server.crt | Server TLS certificate |
| -key-file | /etc/matchbox/server.key | Server TLS private key |
TLS Flags (HTTP, optional)
| Flag | Default | Description |
|---|---|---|
| -web-ssl | false | Enable TLS for HTTP API |
| -web-cert-file | `` | HTTP server TLS certificate |
| -web-key-file | `` | HTTP server TLS private key |
Environment Variables
All flags can be set via environment variables with MATCHBOX_ prefix:
export MATCHBOX_ADDRESS=0.0.0.0:8080
export MATCHBOX_RPC_ADDRESS=0.0.0.0:8081
export MATCHBOX_LOG_LEVEL=debug
export MATCHBOX_DATA_PATH=/custom/path
Operational Considerations
Firewall Configuration
Matchbox host:
firewall-cmd --permanent --add-port=8080/tcp # HTTP API
firewall-cmd --permanent --add-port=8081/tcp # gRPC API
firewall-cmd --reload
dnsmasq host (if separate):
firewall-cmd --permanent --add-service=dhcp
firewall-cmd --permanent --add-service=tftp
firewall-cmd --permanent --add-service=dns # optional
firewall-cmd --reload
Monitoring
Health check endpoints:
# HTTP API
curl http://matchbox.example.com:8080
# Should return: matchbox
# gRPC API
openssl s_client -connect matchbox.example.com:8081 \
-CAfile ~/.matchbox/ca.crt \
-cert ~/.matchbox/client.crt \
-key ~/.matchbox/client.key
Prometheus metrics: Not built-in; consider adding reverse proxy (e.g., nginx) with metrics exporter
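One lightweight workaround is a periodic availability probe exported through node_exporter's textfile collector; a sketch, assuming the collector is enabled with --collector.textfile.directory=/var/lib/node_exporter on the Matchbox host:
#!/usr/bin/env bash
# Cron-driven probe; writes a gauge that node_exporter exposes to Prometheus
if curl -fsS --max-time 5 http://matchbox.example.com:8080 | grep -q matchbox; then
  echo 'matchbox_http_up 1' > /var/lib/node_exporter/matchbox.prom
else
  echo 'matchbox_http_up 0' > /var/lib/node_exporter/matchbox.prom
fi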
Logs (systemd):
journalctl -u matchbox -f
Logs (container):
docker logs -f matchbox
Backup Strategy
What to backup:
- /var/lib/matchbox/{groups,profiles,ignition} - Configs
- /etc/matchbox/*.{crt,key} - TLS certificates
- Terraform state (if using the Terraform provider)
Backup command:
tar -czf matchbox-backup-$(date +%F).tar.gz \
/var/lib/matchbox/{groups,profiles,ignition} \
/etc/matchbox
Restore:
tar -xzf matchbox-backup-YYYY-MM-DD.tar.gz -C /
sudo chown -R matchbox:matchbox /var/lib/matchbox
sudo systemctl restart matchbox
GitOps approach: Store configs in git repository for versioning and auditability
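A minimal sketch of that approach, assuming the FileStore layout above:
# Track provisioning configs in git for versioning and review
cd /var/lib/matchbox
git init
git add groups profiles ignition
git commit -m "matchbox: initial groups/profiles/ignition"
# Later: review what changed before (or after) it reaches machines
git status
git diff HEAD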
Updates
Binary deployment:
# Download new version
wget https://github.com/poseidon/matchbox/releases/download/vX.Y.Z/matchbox-vX.Y.Z-linux-amd64.tar.gz
tar xzf matchbox-vX.Y.Z-linux-amd64.tar.gz
# Replace binary
sudo systemctl stop matchbox
sudo cp matchbox-vX.Y.Z-linux-amd64/matchbox /usr/local/bin/
sudo systemctl start matchbox
Container deployment:
docker pull quay.io/poseidon/matchbox:vX.Y.Z
docker stop matchbox
docker rm matchbox
docker run -d --name matchbox ... quay.io/poseidon/matchbox:vX.Y.Z ...
Kubernetes deployment:
kubectl set image deployment/matchbox matchbox=quay.io/poseidon/matchbox:vX.Y.Z
kubectl rollout status deployment/matchbox
Scaling Considerations
Vertical scaling (single instance):
- CPU: Minimal (config rendering is lightweight)
- Memory: ~50MB base + asset cache
- Disk: Depends on cached assets (100MB - 10GB+)
Horizontal scaling (multiple instances):
- Stateless HTTP API (load balance round-robin)
- Shared storage required (RWX PV, NFS, or custom backend)
- gRPC API can be load-balanced with gRPC-aware LB
Asset serving optimization:
- Use CDN or cache proxy for remote assets
- Local asset caching for <100 machines
- Dedicated HTTP server (nginx) for large deployments (1000+ machines)
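A hedged sketch of the dedicated-server option, reusing the existing assets directory (port 8082 is an arbitrary example):
docker run -d --name matchbox-assets \
  -p 8082:80 \
  -v /var/lib/matchbox/assets:/usr/share/nginx/html:ro \
  nginx:stable
# Profiles would then point kernel/initrd URLs at http://<asset-host>:8082/...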
Security Best Practices
Don’t store secrets in Ignition configs
- Use Ignition files.source with auth headers to fetch from Vault
- Or provision a minimal config and fetch secrets post-boot
Network segmentation
- Provision VLAN isolated from production
- Firewall rules: only allow provisioning traffic
gRPC API access control
- Client cert authentication (mandatory)
- Restrict cert issuance to authorized personnel/systems
- Rotate certs periodically
Audit logging
- Version control groups/profiles (git)
- Log gRPC API changes (Terraform state tracking)
- Monitor HTTP endpoint access
Validate configs before deployment
- fcct --strict for Butane configs
- Terraform plan before apply
- Test in dev environment first
Troubleshooting
Common Issues
1. Machines not PXE booting:
# Check DHCP responses
tcpdump -i eth0 port 67 and port 68
# Verify TFTP files
ls -la /var/lib/tftpboot/
curl tftp://192.168.1.100/undionly.kpxe
# Check Matchbox accessibility
curl http://matchbox.example.com:8080/boot.ipxe
2. 404 Not Found on /ignition:
# Test group matching
curl 'http://matchbox.example.com:8080/ignition?mac=52:54:00:89:d8:10'
# Check group exists
ls -la /var/lib/matchbox/groups/
# Check profile referenced by group exists
ls -la /var/lib/matchbox/profiles/
# Verify ignition_id file exists
ls -la /var/lib/matchbox/ignition/
3. gRPC connection refused (Terraform):
# Test TLS connection
openssl s_client -connect matchbox.example.com:8081 \
-CAfile ~/.matchbox/ca.crt \
-cert ~/.matchbox/client.crt \
-key ~/.matchbox/client.key
# Check Matchbox gRPC is listening
sudo ss -tlnp | grep 8081
# Verify firewall
sudo firewall-cmd --list-ports
4. Ignition config validation errors:
# Validate Butane locally
podman run -i --rm quay.io/coreos/fcct:release --strict < config.yaml
# Fetch rendered Ignition
curl 'http://matchbox.example.com:8080/ignition?mac=...' | jq .
# Validate Ignition spec
curl 'http://matchbox.example.com:8080/ignition?mac=...' | \
podman run -i --rm quay.io/coreos/ignition-validate:latest
Summary
Matchbox deployment considerations:
- Architecture: Single-host (dev/lab) vs HA (production) vs Kubernetes
- Installation: Binary (systemd), container (Docker/Podman), or Kubernetes manifests
- Network boot: dnsmasq container (quick), existing infrastructure (enterprise), or iPXE-only (modern)
- TLS: Self-signed (dev), corporate PKI (production), Let’s Encrypt (HTTP only)
- Scaling: Vertical (simple) vs horizontal (requires shared storage)
- Security: Client cert auth, network segmentation, no secrets in configs
- Operations: Backup configs, GitOps workflow, monitoring/logging
Recommendation for production:
- HA deployment (2+ instances) with load balancer
- Shared storage (NFS or RWX PV on Kubernetes)
- Corporate PKI for TLS certificates
- GitOps workflow (Terraform + git-controlled configs)
- Network segmentation (dedicated provisioning VLAN)
- Prometheus/Grafana monitoring
5.3 - Network Boot Support
Detailed analysis of Matchbox’s network boot capabilities
Network Boot Support in Matchbox
Matchbox provides comprehensive network boot support for bare-metal provisioning, supporting multiple boot firmware types and protocols.
Overview
Matchbox serves as an HTTP entrypoint for network-booted machines but does not implement DHCP, TFTP, or DNS services itself. Instead, it integrates with existing network infrastructure (or companion services like dnsmasq) to provide a complete PXE boot solution.
Boot Protocol Support
1. PXE (Preboot Execution Environment)
Legacy BIOS support via chainloading to iPXE:
Machine BIOS → DHCP (gets TFTP server) → TFTP (gets undionly.kpxe)
→ iPXE firmware → HTTP (Matchbox /boot.ipxe)
Key characteristics:
- Requires a TFTP server to serve undionly.kpxe (the iPXE bootloader)
- Chainloads from the legacy PXE ROM to modern iPXE
- Supports older hardware with basic PXE firmware
- TFTP only used for initial iPXE bootstrap; subsequent downloads via HTTP
2. iPXE (Enhanced PXE)
Primary boot method supported by Matchbox:
iPXE Client → DHCP (gets boot script URL) → HTTP (Matchbox endpoints)
→ Kernel/initrd download → Boot with Ignition config
Endpoints served by Matchbox:
| Endpoint | Purpose |
|---|---|
| /boot.ipxe | Static script that gathers machine attributes (UUID, MAC, hostname, serial) |
| /ipxe?<labels> | Rendered iPXE script with kernel, initrd, and boot args for the matched machine |
| /assets/ | Optional local caching of kernel/initrd images |
Example iPXE flow:
- Machine boots with iPXE firmware
- DHCP response points to http://matchbox.example.com:8080/boot.ipxe
- iPXE fetches /boot.ipxe:
#!ipxe
chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&domain=${domain}&hostname=${hostname}&serial=${serial}
- iPXE requests /ipxe?uuid=...&mac=... with its machine attributes
- Matchbox matches the machine to a group/profile and renders an iPXE script:
#!ipxe
kernel /assets/coreos/VERSION/coreos_production_pxe.vmlinuz \
coreos.config.url=http://matchbox.foo:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp} \
coreos.first_boot=1
initrd /assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz
boot
Advantages:
- HTTP downloads (faster than TFTP)
- Scriptable boot logic
- Can fetch configs from HTTP endpoints
- Supports HTTPS (if compiled with TLS support)
3. GRUB2
UEFI firmware support:
UEFI Firmware → DHCP (gets GRUB bootloader) → TFTP (grub.efi)
→ GRUB → HTTP (Matchbox /grub endpoint)
Matchbox endpoint: /grub?<labels>
Example GRUB config rendered by Matchbox:
default=0
timeout=1
menuentry "CoreOS" {
echo "Loading kernel"
linuxefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe.vmlinuz" \
"coreos.config.url=http://matchbox.foo:8080/ignition" "coreos.first_boot"
echo "Loading initrd"
initrdefi "(http;matchbox.foo:8080)/assets/coreos/VERSION/coreos_production_pxe_image.cpio.gz"
}
Use case:
- UEFI systems that prefer GRUB over iPXE
- Environments with existing GRUB network boot infrastructure
4. PXELINUX (Legacy, via TFTP)
While not a primary Matchbox target, PXELINUX clients can be configured to chainload iPXE:
# /var/lib/tftpboot/pxelinux.cfg/default
timeout 10
default iPXE
LABEL iPXE
KERNEL ipxe.lkrn
APPEND dhcp && chain http://matchbox.example.com:8080/boot.ipxe
DHCP Configuration Patterns
Matchbox supports two DHCP deployment models:
Pattern 1: PXE-Enabled DHCP
Full DHCP server provides IP allocation + PXE boot options.
Example dnsmasq configuration:
dhcp-range=192.168.1.1,192.168.1.254,30m
enable-tftp
tftp-root=/var/lib/tftpboot
# Legacy BIOS → chainload to iPXE
dhcp-match=set:bios,option:client-arch,0
dhcp-boot=tag:bios,undionly.kpxe
# UEFI → iPXE
dhcp-match=set:efi32,option:client-arch,6
dhcp-boot=tag:efi32,ipxe.efi
dhcp-match=set:efi64,option:client-arch,9
dhcp-boot=tag:efi64,ipxe.efi
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe
# DNS for Matchbox
address=/matchbox.example.com/192.168.1.100
Client architecture detection:
- Option 93 (client-arch): Identifies BIOS (0), UEFI32 (6), UEFI64 (9)
- User class: Detects iPXE clients to skip TFTP chainloading
Pattern 2: Proxy DHCP
Runs alongside existing DHCP server; provides only boot options (no IP allocation).
Example dnsmasq proxy-DHCP:
dhcp-range=192.168.1.1,proxy,255.255.255.0
enable-tftp
tftp-root=/var/lib/tftpboot
# Chainload legacy PXE to iPXE
pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe
# iPXE clients → Matchbox
dhcp-userclass=set:ipxe,iPXE
pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe
Benefits:
- Non-invasive: doesn’t replace existing DHCP
- PXE clients receive merged responses from both DHCP servers
- Ideal for environments where main DHCP cannot be modified
Network Boot Flow (Complete)
Scenario: BIOS machine with legacy PXE firmware
┌──────────────────────────────────────────────────────────────────┐
│ 1. Machine powers on, BIOS set to network boot │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. NIC PXE firmware broadcasts DHCPDISCOVER (PXEClient) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. DHCP/proxyDHCP responds with: │
│ - IP address (if full DHCP) │
│ - Next-server: TFTP server IP │
│ - Filename: undionly.kpxe (based on arch=0) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. PXE firmware downloads undionly.kpxe via TFTP │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. Execute iPXE (undionly.kpxe) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. iPXE requests DHCP again, identifies as iPXE (user-class) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. DHCP responds with boot URL (not TFTP): │
│ http://matchbox.example.com:8080/boot.ipxe │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 8. iPXE fetches /boot.ipxe via HTTP: │
│ #!ipxe │
│ chain ipxe?uuid=${uuid}&mac=${mac:hexhyp}&... │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 9. iPXE chains to /ipxe?uuid=XXX&mac=YYY (introspected labels) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 10. Matchbox matches machine to group/profile │
│ - Finds most specific group matching labels │
│ - Retrieves profile (kernel, initrd, args, configs) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 11. Matchbox renders iPXE script with: │
│ - kernel URL (local asset or remote HTTPS) │
│ - initrd URL │
│ - kernel args (including ignition.config.url) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 12. iPXE downloads kernel + initrd (HTTP/HTTPS) │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 13. iPXE boots kernel with specified args │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 14. Fedora CoreOS/Flatcar boots, Ignition runs │
│ - Fetches /ignition?uuid=XXX&mac=YYY from Matchbox │
│ - Matchbox renders Ignition config with group metadata │
│ - Ignition partitions disk, writes files, creates users │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ 15. System reboots (if disk install), boots from disk │
└──────────────────────────────────────────────────────────────────┘
Asset Serving
Matchbox can serve static assets (kernel, initrd images) from a local directory to reduce bandwidth and increase speed:
Asset directory structure:
/var/lib/matchbox/assets/
├── fedora-coreos/
│ └── 36.20220906.3.2/
│ ├── fedora-coreos-36.20220906.3.2-live-kernel-x86_64
│ ├── fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img
│ └── fedora-coreos-36.20220906.3.2-live-rootfs.x86_64.img
└── flatcar/
└── 3227.2.0/
├── flatcar_production_pxe.vmlinuz
├── flatcar_production_pxe_image.cpio.gz
└── version.txt
HTTP endpoint: http://matchbox.example.com:8080/assets/
Scripts provided:
- scripts/get-fedora-coreos - Download/verify Fedora CoreOS images
- scripts/get-flatcar - Download/verify Flatcar Linux images
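If the helper scripts are unavailable, the cache can also be populated by hand; a hedged sketch that mirrors the layout above (URL shown for illustration, matching the remote example later in this section):
VER=36.20220906.3.2
DEST=/var/lib/matchbox/assets/fedora-coreos/${VER}
mkdir -p "${DEST}"
curl -L -o "${DEST}/fedora-coreos-${VER}-live-kernel-x86_64" \
  "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/${VER}/x86_64/fedora-coreos-${VER}-live-kernel-x86_64"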
Profile reference:
{
"boot": {
"kernel": "/assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-kernel-x86_64",
"initrd": ["--name main /assets/fedora-coreos/36.20220906.3.2/fedora-coreos-36.20220906.3.2-live-initramfs.x86_64.img"]
}
}
Alternative: Profiles can reference remote HTTPS URLs (requires iPXE compiled with TLS support):
{
"kernel": "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/36.20220906.3.2/x86_64/fedora-coreos-36.20220906.3.2-live-kernel-x86_64"
}
OS Support
Fedora CoreOS
Boot types:
- Live PXE (RAM-only, ephemeral)
- Install to disk (persistent, recommended)
Required kernel args:
- coreos.inst.install_dev=/dev/sda - Target disk for install
- coreos.inst.ignition_url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Provisioning config
- coreos.live.rootfs_url=... - Root filesystem image
Ignition fetch: During first boot, ignition.service fetches config from Matchbox
Flatcar Linux
Boot types:
- Live PXE (RAM-only)
- Install to disk
Required kernel args:
- flatcar.first_boot=yes - Marks first boot
- flatcar.config.url=http://matchbox/ignition?uuid=${uuid}&mac=${mac:hexhyp} - Ignition config URL
- flatcar.autologin - Auto-login to console (optional, dev/debug)
Ignition support: Flatcar uses Ignition v3.x for provisioning
RHEL CoreOS
Supported as it uses Ignition like Fedora CoreOS. Requires Red Hat-specific image sources.
Machine Matching & Labels
Matchbox matches machines to profiles using labels extracted during boot:
Reserved Label Selectors
| Label | Source | Example | Normalized |
|---|---|---|---|
| uuid | SMBIOS UUID | 550e8400-e29b-41d4-a716-446655440000 | Lowercase |
| mac | NIC MAC address | 52:54:00:89:d8:10 | Normalized to colons |
| hostname | Network boot program | node1.example.com | As-is |
| serial | Hardware serial | VMware-42 1a... | As-is |
Custom Labels
Groups can match on arbitrary labels passed as query params:
/ipxe?mac=52:54:00:89:d8:10&region=us-west&env=prod
Matching precedence: Most specific group wins (most selector matches)
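A small illustration of that precedence, assuming the FileStore layout (group IDs, profile names, and MACs are examples):
# Default group: no selectors, matches any machine
cat > /var/lib/matchbox/groups/default.json <<'EOF'
{ "id": "default", "name": "Default", "profile": "worker" }
EOF
# Specific group: extra selector, wins for this MAC
cat > /var/lib/matchbox/groups/node1.json <<'EOF'
{ "id": "node1", "name": "Node 1", "profile": "controller",
  "selector": { "mac": "52:54:00:89:d8:10" } }
EOF
# This MAC gets the controller profile; any other machine falls back to worker
curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10'
curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:aa:bb:cc'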
Firmware Compatibility
| Firmware Type | Client Arch | Boot File | Protocol | Matchbox Support |
|---|---|---|---|---|
| BIOS (legacy PXE) | 0 | undionly.kpxe → iPXE | TFTP → HTTP | ✅ Via chainload |
| UEFI 32-bit | 6 | ipxe.efi | TFTP → HTTP | ✅ |
| UEFI (BIOS compat) | 7 | ipxe.efi | TFTP → HTTP | ✅ |
| UEFI 64-bit | 9 | ipxe.efi | TFTP → HTTP | ✅ |
| Native iPXE | - | N/A | HTTP | ✅ Direct |
| GRUB (UEFI) | - | grub.efi | TFTP → HTTP | ✅ /grub endpoint |
Network Requirements
Firewall rules on Matchbox host:
# HTTP API (read-only)
firewall-cmd --add-port=8080/tcp --permanent
# gRPC API (authenticated, Terraform)
firewall-cmd --add-port=8081/tcp --permanent
DNS requirement:
- matchbox.example.com must resolve to the Matchbox server IP
- Can be configured in dnsmasq, corporate DNS, or /etc/hosts on the DHCP server
DHCP/TFTP host (if using dnsmasq):
firewall-cmd --add-service=dhcp --permanent
firewall-cmd --add-service=tftp --permanent
firewall-cmd --add-service=dns --permanent # optional
Troubleshooting Tips
Verify Matchbox endpoints:
curl http://matchbox.example.com:8080
# Should return: matchbox
curl http://matchbox.example.com:8080/boot.ipxe
# Should return iPXE script
Test machine matching:
curl 'http://matchbox.example.com:8080/ipxe?mac=52:54:00:89:d8:10'
# Should return rendered iPXE script with kernel/initrd
Check TFTP files:
ls -la /var/lib/tftpboot/
# Should contain: undionly.kpxe, ipxe.efi, grub.efi
Verify DHCP responses:
tcpdump -i eth0 -n port 67 and port 68
# Watch for DHCP offers with PXE options
iPXE console debugging:
- Press Ctrl+B during iPXE boot to enter console
- Commands: dhcp, ifstat, show net0/ip, chain http://...
Limitations
- HTTPS support: iPXE must be compiled with crypto support (larger binary, ~80KB vs ~45KB); see the build sketch after this list
- TFTP dependency: Legacy PXE requires TFTP for initial chainload (can’t skip)
- No DHCP/TFTP built-in: Must use external services or dnsmasq container
- Boot firmware variations: Some vendor PXE implementations have quirks
- SecureBoot: iPXE and GRUB must be signed (or SecureBoot disabled)
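Regarding the HTTPS limitation, a hedged sketch of building iPXE with TLS enabled from the upstream source tree (the exact config edit varies by release):
git clone https://github.com/ipxe/ipxe.git
cd ipxe/src
# Enable DOWNLOAD_PROTO_HTTPS in config/general.h (disabled by default), then build
make bin/undionly.kpxe bin-x86_64-efi/ipxe.efi
# Copy the resulting binaries into the TFTP root in place of the bundled ones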
Reference Implementation: dnsmasq Container
Matchbox project provides quay.io/poseidon/dnsmasq with:
- Pre-configured DHCP/TFTP/DNS service
- Bundled ipxe.efi, undionly.kpxe, grub.efi
- Example configs for PXE-DHCP and proxy-DHCP modes
Quick start (full DHCP):
docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
-d -q \
--dhcp-range=192.168.1.3,192.168.1.254 \
--enable-tftp --tftp-root=/var/lib/tftpboot \
--dhcp-match=set:bios,option:client-arch,0 \
--dhcp-boot=tag:bios,undionly.kpxe \
--dhcp-match=set:efi64,option:client-arch,9 \
--dhcp-boot=tag:efi64,ipxe.efi \
--dhcp-userclass=set:ipxe,iPXE \
--dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
--address=/matchbox.example.com/192.168.1.2 \
--log-queries --log-dhcp
Quick start (proxy-DHCP):
docker run --rm --cap-add=NET_ADMIN --net=host quay.io/poseidon/dnsmasq \
-d -q \
--dhcp-range=192.168.1.1,proxy,255.255.255.0 \
--enable-tftp --tftp-root=/var/lib/tftpboot \
--dhcp-userclass=set:ipxe,iPXE \
--pxe-service=tag:#ipxe,x86PC,"PXE chainload to iPXE",undionly.kpxe \
--pxe-service=tag:ipxe,x86PC,"iPXE",http://matchbox.example.com:8080/boot.ipxe \
--log-queries --log-dhcp
Summary
Matchbox provides robust network boot support through:
- Protocol flexibility: iPXE (primary), GRUB2, legacy PXE (via chainload)
- Firmware compatibility: BIOS and UEFI
- Modern approach: HTTP-based with optional local asset caching
- Clean separation: Matchbox handles config rendering; external services handle DHCP/TFTP
- Production-ready: Used by Typhoon Kubernetes distributions for bare-metal provisioning
5.4 - Use Case Evaluation
Evaluation of Matchbox for specific use cases and comparison with alternatives
Matchbox Use Case Evaluation
Analysis of Matchbox’s suitability for various use cases, strengths, limitations, and comparison with alternative provisioning solutions.
Use Case Fit Analysis
✅ Ideal Use Cases
1. Bare-Metal Kubernetes Clusters
Scenario: Provisioning 10-1000 physical servers for Kubernetes nodes
Why Matchbox Excels:
- Ignition-native (perfect for Fedora CoreOS/Flatcar)
- Declarative machine provisioning via Terraform
- Label-based matching (region, role, hardware type)
- Integration with Typhoon Kubernetes distribution
- Minimal OS surface (immutable, container-optimized)
Example workflow:
resource "matchbox_profile" "k8s_controller" {
name = "k8s-controller"
kernel = "/assets/fedora-coreos/.../kernel"
raw_ignition = data.ct_config.controller.rendered
}
resource "matchbox_group" "controllers" {
profile = matchbox_profile.k8s_controller.name
selector = {
role = "controller"
}
}
Alternatives considered:
- Cloud-init + netboot.xyz: Less declarative, no native Ignition support
- Foreman: Heavier, more complex for container-centric workloads
- Metal³: Kubernetes-native but requires existing cluster
Verdict: ⭐⭐⭐⭐⭐ Matchbox is purpose-built for this
2. Lab/Development Environments
Scenario: Rapid PXE boot testing with QEMU/KVM VMs or homelab servers
Why Matchbox Excels:
- Quick setup (binary + dnsmasq container)
- No DHCP infrastructure required (proxy-DHCP mode)
- Localhost deployment (no external dependencies)
- Fast iteration (change configs, re-PXE)
- Included examples and scripts
Example setup:
# Start Matchbox locally
docker run -d --net=host -v /var/lib/matchbox:/var/lib/matchbox \
quay.io/poseidon/matchbox:latest -address=0.0.0.0:8080
# Start dnsmasq on same host
docker run -d --net=host --cap-add=NET_ADMIN \
quay.io/poseidon/dnsmasq ...
Alternatives considered:
- netboot.xyz: Great for manual OS selection, no automation
- PiXE server: Simpler but less flexible matching logic
- Manual iPXE scripts: No dynamic matching, manual maintenance
Verdict: ⭐⭐⭐⭐⭐ Minimal setup, maximum flexibility
3. Edge/Remote Site Provisioning
Scenario: Provision machines at 10+ remote datacenters or edge locations
Why Matchbox Excels:
- Lightweight (single binary, ~20MB)
- Declarative region-based matching
- Centralized config management (Terraform)
- Can run on minimal hardware (ARM support)
- HTTP-based (works over WAN with reverse proxy)
Architecture:
Central Matchbox (via Terraform)
↓ gRPC API
Regional Matchbox Instances (read-only cache)
↓ HTTP
Edge Machines (PXE boot)
Label-based routing:
{
"selector": {
"region": "us-west",
"site": "pdx-1"
},
"metadata": {
"ntp_servers": ["10.100.1.1", "10.100.1.2"]
}
}
Alternatives considered:
- Foreman: Requires more resources per site
- Ansible + netboot: No declarative PXE boot, post-install only
- Cloud-init datasources: Requires cloud metadata service per site
Verdict: ⭐⭐⭐⭐☆ Good fit, but consider caching strategy for WAN
⚠️ Moderate Fit Use Cases
4. Multi-Tenant Bare-Metal Cloud
Scenario: Provide bare-metal-as-a-service to multiple customers
Matchbox challenges:
- No built-in multi-tenancy (single namespace)
- No RBAC (gRPC API is all-or-nothing with client certs)
- No customer self-service portal
Workarounds:
- Deploy separate Matchbox per tenant (isolation via separate instances)
- Proxy gRPC API with custom RBAC layer
- Use group selectors with customer IDs
Better alternatives:
- Metal³ (Kubernetes-native, better multi-tenancy)
- OpenStack Ironic (purpose-built for bare-metal cloud)
- MAAS (Ubuntu-specific, has RBAC)
Verdict: ⭐⭐☆☆☆ Possible but architecturally challenging
5. Heterogeneous OS Provisioning
Scenario: Need to provision Fedora CoreOS, Ubuntu, RHEL, Windows
Matchbox challenges:
- Designed for Ignition-based OSes (FCOS, Flatcar, RHCOS)
- No native support for Kickstart (RHEL/CentOS)
- No support for Preseed (Ubuntu/Debian)
- No Windows unattend.xml support
What works:
- Fedora CoreOS ✅
- Flatcar Linux ✅
- RHEL CoreOS ✅
- Container Linux (deprecated but supported) ✅
What requires workarounds:
- RHEL/CentOS: Possible via generic configs + Kickstart URLs, but not native
- Ubuntu: Can PXE boot and point to autoinstall ISO, but loses Matchbox templating benefits
- Debian: Similar to Ubuntu
- Windows: Not supported (different PXE boot mechanisms)
Better alternatives for heterogeneous environments:
- Foreman (supports Kickstart, Preseed, unattend.xml)
- MAAS (Ubuntu-centric but extensible)
- Cobbler (older but supports many OS types)
Verdict: ⭐⭐☆☆☆ Stick to Ignition-based OSes or use different tool
❌ Poor Fit Use Cases
6. Windows PXE Boot
Why Matchbox doesn’t fit:
- No WinPE support
- No unattend.xml rendering
- Different PXE boot chain (WDS/SCCM model)
Recommendation: Use Microsoft WDS or SCCM
Verdict: ⭐☆☆☆☆ Not designed for this
7. BIOS/Firmware Updates
Why Matchbox doesn’t fit:
- Focused on OS provisioning, not firmware
- No vendor-specific tooling (Dell iDRAC, HP iLO integration)
Recommendation: Use vendor tools or Ansible with ipmi/redfish modules
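For completeness, a hedged example of out-of-band power control with ipmitool (the BMC address and credentials are placeholders):
ipmitool -I lanplus -H 192.168.10.50 -U admin -P 'changeme' chassis power status
ipmitool -I lanplus -H 192.168.10.50 -U admin -P 'changeme' chassis bootdev pxe
ipmitool -I lanplus -H 192.168.10.50 -U admin -P 'changeme' chassis power cycle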
Verdict: ⭐☆☆☆☆ Out of scope
Strengths
1. Ignition-First Design
- Native support for modern immutable OSes
- Declarative, atomic provisioning (no config drift)
- First-boot partition/filesystem setup
2. Label-Based Matching
- Flexible machine classification (MAC, UUID, region, role, custom)
- Most-specific-match algorithm (override defaults per machine)
- Query params for dynamic attributes
3. Terraform Integration
- Declarative infrastructure as code
- Plan before apply (preview changes)
- State tracking for auditability
- Rich templating (ct_config provider for Butane)
4. Minimal Dependencies
- Single static binary (~20MB)
- No database required (FileStore default)
- No built-in DHCP/TFTP (separation of concerns)
- Container-ready (OCI image available)
5. HTTP-Centric
- Faster downloads than TFTP (iPXE via HTTP)
- Proxy/CDN friendly for asset distribution
- Standard web tooling (curl, load balancers, Ingress)
6. Production-Ready
- Used by Typhoon Kubernetes (battle-tested)
- Clear upgrade path (SemVer releases)
- OpenPGP signature support for config integrity
Limitations
1. No Multi-Tenancy
- Single namespace (all groups/profiles global)
- No RBAC on gRPC API (client cert = full access)
- Requires separate instances per tenant
2. Ignition-Only Focus
- Cloud-Config deprecated (legacy support only)
- No native Kickstart/Preseed/unattend.xml
- Limits OS choice to CoreOS family
3. Storage Constraints
- FileStore doesn’t scale to 10,000+ profiles
- No built-in HA storage (requires NFS or custom backend)
- Kubernetes deployment needs RWX PersistentVolume
4. No Machine Discovery
- Doesn’t detect new machines (passive service)
- No inventory management (use external CMDB)
- No hardware introspection (use Ironic for that)
5. Limited Observability
- No built-in metrics (Prometheus integration requires reverse proxy)
- Logs are minimal (request logging only)
- No audit trail for gRPC API changes (use Terraform state)
6. TFTP Still Required
- Legacy BIOS PXE needs TFTP for chainloading to iPXE
- Can’t fully eliminate TFTP unless all machines have native iPXE
Comparison with Alternatives
vs. Foreman
| Feature | Matchbox | Foreman |
|---|---|---|
| OS Support | Ignition-based | Kickstart, Preseed, AutoYaST, etc. |
| Complexity | Low (single binary) | High (Rails app, DB, Puppet/Ansible) |
| Config Model | Declarative (Ignition) | Imperative (post-install scripts) |
| API | HTTP + gRPC | REST API |
| UI | None (API-only) | Full web UI |
| Terraform | Native provider | Community modules |
| Use Case | Container-centric infra | Traditional Linux servers |
When to choose Matchbox: CoreOS-based Kubernetes clusters, minimal infrastructure
When to choose Foreman: Heterogeneous OS, need web UI, traditional config mgmt
vs. Metal³
| Feature | Matchbox | Metal³ |
|---|---|---|
| Platform | Standalone | Kubernetes-native (operator) |
| Bootstrap | Can bootstrap k8s cluster | Needs existing k8s cluster |
| Machine Lifecycle | Provision only | Provision + decommission + reprovision |
| Hardware Introspection | No (labels passed manually) | Yes (via Ironic) |
| Multi-tenancy | No | Yes (via k8s namespaces) |
| Complexity | Low | High (requires Ironic, DHCP, etc.) |
When to choose Matchbox: Greenfield bare-metal, no existing k8s
When to choose Metal³: Existing k8s, need hardware mgmt lifecycle
vs. Cobbler
| Feature | Matchbox | Cobbler |
|---|---|---|
| Age | Modern (2016+) | Legacy (2008+) |
| Config Format | Ignition (declarative) | Kickstart/Preseed (imperative) |
| Templating | Go templates (minimal) | Cheetah templates (extensive) |
| Language | Go (static binary) | Python (requires interpreter) |
| DHCP Management | External | Can manage DHCP |
| Maintenance | Active (Poseidon) | Low activity |
When to choose Matchbox: Modern immutable OSes, container workloads
When to choose Cobbler: Legacy infra, need DHCP management, heterogeneous OS
vs. MAAS (Ubuntu)
| Feature | Matchbox | MAAS |
|---|---|---|
| OS Support | CoreOS family | Ubuntu (primary), others (limited) |
| IPAM | No (external DHCP) | Built-in IPAM |
| Power Mgmt | No (manual or scripts) | Built-in (IPMI, AMT, etc.) |
| UI | No | Full web UI |
| Declarative | Yes (Terraform) | Limited (CLI mostly) |
| Cloud Integration | No | Yes (libvirt, LXD, VM hosts) |
When to choose Matchbox: Non-Ubuntu, Kubernetes, minimal dependencies
When to choose MAAS: Ubuntu-centric, need power mgmt, cloud integration
vs. netboot.xyz
| Feature | Matchbox | netboot.xyz |
|---|---|---|
| Purpose | Automated provisioning | Manual OS selection menu |
| Automation | Full (API-driven) | None (interactive menu) |
| Customization | Per-machine configs | Global menu |
| Ignition | Native support | No |
| Complexity | Medium | Very low |
When to choose Matchbox: Automated fleet provisioning
When to choose netboot.xyz: Ad-hoc OS installation, homelab
Decision Matrix
Use this table to evaluate Matchbox for your use case:
| Requirement | Weight | Matchbox Score | Notes |
|---|---|---|---|
| Ignition/CoreOS support | High | ⭐⭐⭐⭐⭐ | Native, first-class |
| Heterogeneous OS | High | ⭐⭐☆☆☆ | Limited to Ignition OSes |
| Declarative provisioning | Medium | ⭐⭐⭐⭐⭐ | Terraform native |
| Multi-tenancy | Medium | ⭐☆☆☆☆ | Requires separate instances |
| Web UI | Medium | ☆☆☆☆☆ | No UI (API-only) |
| Ease of deployment | Medium | ⭐⭐⭐⭐☆ | Binary or container, minimal deps |
| Scalability | Medium | ⭐⭐⭐☆☆ | FileStore limits, need shared storage for HA |
| Hardware mgmt | Low | ☆☆☆☆☆ | No power mgmt, no introspection |
| Cost | Low | ⭐⭐⭐⭐⭐ | Open source, Apache 2.0 |
Scoring:
- ⭐⭐⭐⭐⭐ Excellent
- ⭐⭐⭐⭐☆ Good
- ⭐⭐⭐☆☆ Adequate
- ⭐⭐☆☆☆ Limited
- ⭐☆☆☆☆ Poor
- ☆☆☆☆☆ Not supported
Recommendations
Choose Matchbox if:
- ✅ Provisioning Fedora CoreOS, Flatcar, or RHEL CoreOS
- ✅ Building bare-metal Kubernetes clusters
- ✅ Prefer declarative infrastructure (Terraform)
- ✅ Want minimal dependencies (single binary)
- ✅ Need flexible label-based machine matching
- ✅ Have homogeneous OS requirements (all Ignition-based)
Avoid Matchbox if:
- ❌ Need multi-OS support (Windows, traditional Linux)
- ❌ Require web UI for operations teams
- ❌ Need built-in hardware management (power, BIOS config)
- ❌ Have strict multi-tenancy requirements
- ❌ Need automated hardware discovery/introspection
Hybrid Approaches
Pattern 1: Matchbox + Ansible
- Matchbox: Initial OS provisioning
- Ansible: Post-boot configuration, app deployment
- Works well for stateful services on bare-metal
Pattern 2: Matchbox + Metal³
- Matchbox: Bootstrap initial k8s cluster
- Metal³: Ongoing cluster node lifecycle management
- Gradual migration from Matchbox to Metal³
Pattern 3: Matchbox + Terraform + External Secrets
- Matchbox: Base OS + minimal config
- Ignition: Fetch secrets from Vault/GCP Secret Manager
- Terraform: Orchestrate end-to-end provisioning
Conclusion
Matchbox is a purpose-built, minimalist network boot service optimized for modern immutable operating systems (Ignition-based). It excels in container-centric bare-metal environments, particularly for Kubernetes clusters built with Fedora CoreOS or Flatcar Linux.
Best fit: Organizations adopting immutable infrastructure patterns, container orchestration, and declarative provisioning workflows.
Not ideal for: Heterogeneous OS environments, multi-tenant bare-metal clouds, or teams requiring extensive web UI and built-in hardware management.
For home labs and development, Matchbox offers an excellent balance of simplicity and power. For production Kubernetes deployments, it’s a proven, battle-tested solution (via Typhoon). For complex enterprise provisioning with mixed OS requirements, consider Foreman or MAAS instead.
6 - Ubiquiti Dream Machine Pro Analysis
Comprehensive analysis of the Ubiquiti Dream Machine Pro capabilities, focusing on network boot (PXE) support and infrastructure integration.
Overview
The Ubiquiti Dream Machine Pro (UDM Pro) is an all-in-one network gateway, router, and switch designed for enterprise and advanced home lab environments. This analysis focuses on its capabilities relevant to infrastructure automation and network boot scenarios.
Key Specifications
Hardware
- Processor: Quad-core ARM Cortex-A57 @ 1.7 GHz
- RAM: 4GB DDR4
- Storage: 128GB eMMC (for UniFi OS, applications, and logs)
- Network Interfaces:
- 1x WAN port (RJ45, SFP, or SFP+)
- 8x LAN ports (1 Gbps RJ45, configurable)
- 1x SFP+ port (10 Gbps)
- 1x SFP port (1 Gbps)
- Additional Features:
- 3.5" SATA HDD bay (for UniFi Protect surveillance)
- IDS/IPS engine
- Deep packet inspection
- Built-in UniFi Network Controller
Software
- OS: UniFi OS (Linux-based)
- Controller: Built-in UniFi Network Controller
- Services: DHCP, DNS, routing, firewall, VPN (site-to-site and remote access)
Network Boot (PXE) Support
Native DHCP PXE Capabilities
The UDM Pro provides basic PXE boot support through its DHCP server:
Supported:
- DHCP Option 66 (next-server / TFTP server address)
- DHCP Option 67 (filename / boot file name)
- Basic single-architecture PXE booting
Configuration via UniFi Controller:
- Navigate to Settings → Networks → Select your network
- Scroll to DHCP section
- Enable DHCP
- Under Advanced DHCP Options:
- TFTP Server: IP address of your TFTP/PXE server (e.g., 192.168.42.16)
- Boot Filename: Name of the bootloader file (e.g., pxelinux.0 for BIOS or bootx64.efi for UEFI)
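To confirm what the UDM Pro actually hands out, one option (a sketch, run from a client on the same VLAN) is nmap's broadcast DHCP discovery script:
sudo nmap --script broadcast-dhcp-discover -e eth0
# Inspect the offer for the TFTP server and boot filename (options 66/67)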
Limitations:
- No multi-architecture support: Cannot differentiate boot files based on client architecture (BIOS vs. UEFI, x86_64 vs. ARM64)
- No conditional DHCP options: Cannot vary filename or next-server based on client characteristics
- Fixed boot parameters: One boot configuration for all PXE clients
- Single bootloader only: Must choose either BIOS or UEFI bootloader, not both
Use Cases:
- ✅ Homogeneous environments (all BIOS or all UEFI)
- ✅ Single OS deployment scenarios
- ✅ Simple provisioning workflows
- ❌ Mixed BIOS/UEFI environments (requires external DHCP server with conditional logic)
Network Segmentation & VLANs
The UDM Pro excels at network segmentation, critical for infrastructure isolation:
- VLAN Support: Native 802.1Q tagging
- Firewall Rules: Inter-VLAN routing with granular firewall policies
- Network Isolation: Can create fully isolated networks or controlled inter-network traffic
- Use Cases for Infrastructure:
- Management VLAN (for PXE/provisioning)
- Production VLAN (workloads)
- IoT/OT VLAN (isolated devices)
- DMZ (exposed services)
VPN Capabilities
Site-to-Site VPN
- Protocols: IPsec, WireGuard (experimental)
- Use Case: Connect home lab to cloud infrastructure (GCP, AWS, Azure)
- Performance: Hardware-accelerated encryption on UDM Pro
Remote Access VPN
- Protocols: L2TP, OpenVPN
- Use Case: Remote administration of home lab infrastructure
- Integration: Can work with Cloudflare Access for additional security layer
IDS/IPS Engine
- Technology: Suricata-based
- Capabilities:
- Intrusion detection
- Intrusion prevention (can drop malicious traffic)
- Threat signatures updated via UniFi
- Performance Impact: Can affect throughput on high-bandwidth connections
- Recommendation: Enable for security-sensitive infrastructure segments
DNS & DHCP Services
DNS
- Local DNS: Can act as caching DNS resolver
- Custom DNS Records: Limited to UniFi controller hostname
- Recommendation: Use external DNS (Pi-hole, Bind9) for advanced features like split-horizon DNS
DHCP
- Static Leases: Supports MAC-based static IP assignments
- DHCP Options: Can configure common options (NTP, DNS, domain name)
- Reservations: Per-client reservations via GUI
- PXE Options: Basic Option 66/67 support (as noted above)
Integration with Infrastructure-as-Code
UniFi Network API
- REST API: Available for configuration automation
- Python Libraries: pyunifi and others for programmatic access
- Use Cases:
- Terraform provider for network state management
- Ansible modules for configuration automation
- CI/CD integration for network-as-code
- Provider: paultyng/unifi
- Capabilities: Manage networks, firewall rules, port forwarding, DHCP settings
- Limitations: Not all UI features exposed via API
Configuration Persistence
- Backup/Restore: JSON-based configuration export
- Version Control: Can track config changes in Git
- Recovery: Auto-backup to cloud (optional)
Throughput
- Routing/NAT: ~3.5 Gbps (without IDS/IPS)
- IDS/IPS Enabled: ~850 Mbps - 1 Gbps
- VPN (IPsec): ~1 Gbps
- Inter-VLAN Routing: Wire speed (8 Gbps backplane)
Scalability
- Concurrent Devices: 500+ clients tested
- VLANs: Up to 32 networks/VLANs
- Firewall Rules: Thousands (performance depends on complexity)
- DHCP Leases: Supports large pools efficiently
Comparison to Alternatives
| Feature | UDM Pro | pfSense | OPNsense | MikroTik |
|---|---|---|---|---|
| Basic PXE | ✅ | ✅ | ✅ | ✅ |
| Conditional DHCP | ❌ | ✅ | ✅ | ✅ |
| All-in-one | ✅ | ❌ | ❌ | Varies |
| GUI Ease-of-use | ✅✅ | ⚠️ | ⚠️ | ❌ |
| API/Automation | ⚠️ | ✅ | ✅ | ✅✅ |
| IDS/IPS Built-in | ✅ | ⚠️ (addon) | ⚠️ (addon) | ❌ |
| Hardware | Fixed | Flexible | Flexible | Flexible |
| Price | $$$ | $ (+ hardware) | $ (+ hardware) | $ - $$$ |
Recommendations for Home Lab Use
Ideal Use Cases
✅ Use the UDM Pro when:
- You want an all-in-one solution with minimal configuration
- You need integrated UniFi controller and network management
- Your home lab has mixed UniFi hardware (switches, APs)
- You want a polished GUI and mobile app management
- Network segmentation and VLANs are critical
Consider Alternatives When
⚠️ Look elsewhere if:
- You need conditional DHCP options or multi-architecture PXE boot
- You require advanced routing protocols (BGP, OSPF beyond basics)
- You need granular firewall control and scripting (pfSense/OPNsense better)
- Budget is tight and you already have x86 hardware (pfSense on old PC)
- You need extremely low latency (sub-1ms) routing
Recommended Configuration for Infrastructure Lab
Network Segmentation:
- VLAN 10: Management (PXE, Ansible, provisioning tools)
- VLAN 20: Kubernetes cluster
- VLAN 30: Storage network (NFS, iSCSI)
- VLAN 40: Public-facing services (behind Cloudflare)
DHCP Strategy:
- Use UDM Pro native DHCP with basic PXE options for single-arch PXE needs
- Static reservations for infrastructure components
- Consider external DHCP server if conditional options are required
Firewall Rules:
- Default deny between VLANs
- Allow management VLAN → all (with source IP restrictions)
- Allow cluster VLAN → storage VLAN (on specific ports)
- NAT only on VLAN 40 (public services)
VPN Configuration:
- Site-to-Site to GCP via WireGuard (lower overhead than IPsec)
- Remote access VPN on separate VLAN with restrictive firewall
Integration:
- Terraform for network state management
- Ansible for DHCP/DNS servers in management VLAN
- Cloudflare Access for secure public service exposure
Conclusion
The UDM Pro is a capable all-in-one network device ideal for home labs that prioritize ease-of-use and integration with the UniFi ecosystem. It provides basic PXE boot support suitable for single-architecture environments, though conditional DHCP options require external DHCP servers for complex scenarios.
For infrastructure automation projects, the UDM Pro serves well as a reliable network foundation that handles VLANs, routing, and basic services, allowing you to focus on higher-level infrastructure concerns like container orchestration and cloud integration.
6.1 - UDM Pro VLAN Configuration & Capabilities
Detailed analysis of VLAN support on the Ubiquiti Dream Machine Pro, including port-based VLAN assignment and VPN integration.
Overview
The Ubiquiti Dream Machine Pro (UDM Pro) provides robust VLAN support through native 802.1Q tagging, enabling network segmentation for security, performance, and organizational purposes. This document covers VLAN configuration capabilities, port assignments, and VPN integration.
VLAN Fundamentals on UDM Pro
Supported Standards
- 802.1Q VLAN Tagging: Full support for standard VLAN tagging
- VLAN Range: IDs 1-4094 (standard IEEE 802.1Q range)
- Maximum VLANs: Up to 32 networks/VLANs per device
- Native VLAN: Configurable per port (default: VLAN 1)
VLAN Types
Corporate Network
- Default network type for general-purpose VLANs
- Provides DHCP, inter-VLAN routing, and firewall capabilities
- Can enable/disable guest policies, IGMP snooping, and multicast DNS
Guest Network
- Isolated network with internet-only access
- Automatic firewall rules preventing access to other VLANs
- Captive portal support for guest authentication
IoT Network
- Optimized for IoT devices with device isolation
- Prevents lateral movement between IoT devices
- Allows communication with controller/gateway only
Port-Based VLAN Assignment
Per-Port VLAN Configuration
The UDM Pro’s 8x 1 Gbps LAN ports and SFP/SFP+ ports support flexible VLAN assignment:
Configuration Options per Port:
- Native VLAN/Untagged VLAN: The default VLAN for untagged traffic on the port
- Tagged VLANs: Multiple VLANs that can pass through the port with 802.1Q tags
- Port Profile: Pre-configured VLAN assignments that can be applied to ports
Port Profile Types
All: Port accepts all VLANs (trunk mode)
- Passes all configured VLANs with tags
- Used for connecting managed switches or access points
- Native VLAN for untagged traffic
Specific VLANs: Port limited to selected VLANs
- Choose which VLANs are allowed (tagged)
- Set native/untagged VLAN
- Used for controlled trunk links
Single VLAN: Access port mode
- Port carries only one VLAN (untagged)
- All traffic on this port belongs to specified VLAN
- Used for end devices (PCs, servers, printers)
Configuration Steps
Via UniFi Controller GUI:
Create Port Profile:
- Navigate to Settings → Profiles → Port Manager
- Click Create New Port Profile
- Select profile type (All, LAN, or Custom)
- Configure VLAN settings:
- Native VLAN/Network: Untagged VLAN
- Tagged VLANs: Select allowed VLANs (for trunk mode)
- Enable/disable settings: PoE, Storm Control, Port Isolation
Assign Profile to Ports:
- Navigate to UniFi Devices → Select UDM Pro
- Go to Ports tab
- For each LAN port (1-8) or SFP port:
- Click port to edit
- Select Port Profile from dropdown
- Apply changes
Quick Port Assignment (Alternative):
- Settings → Networks → Select VLAN
- Under Port Manager, assign specific ports to this network
- Ports become access ports for this VLAN
Example Port Layout
UDM Pro Port Assignment Example:
Port 1: Native VLAN 10 (Management) - Access Mode
└── Use: Ansible control server
Port 2: Native VLAN 20 (Kubernetes) - Access Mode
└── Use: K8s master node
Port 3: Native VLAN 30 (Storage) - Access Mode
└── Use: NAS/SAN device
Port 4: Native VLAN 1, Tagged: 10,20,30,40 - Trunk Mode
└── Use: Managed switch uplink
Port 5-7: Native VLAN 40 (DMZ) - Access Mode
└── Use: Public-facing servers
Port 8: Native VLAN 1 (Default/Untagged) - Access Mode
└── Use: Management laptop (temporary)
SFP+: Native VLAN 1, Tagged: All - Trunk Mode
└── Use: 10G uplink to core switch
VLAN Features and Capabilities
Inter-VLAN Routing
Enabled by Default:
- Hardware-accelerated routing between VLANs
- Wire-speed performance (8 Gbps backplane)
- Routing decisions made at Layer 3
Firewall Control:
- Default behavior: Allow all inter-VLAN traffic
- Recommended: Create explicit allow/deny rules per VLAN pair
- Granular control: Protocol, port, source/destination filtering
Example Firewall Rules:
Rule 1: Allow Management (VLAN 10) → All VLANs
Source: 192.168.10.0/24
Destination: Any
Action: Accept
Rule 2: Allow K8s (VLAN 20) → Storage (VLAN 30) - NFS only
Source: 192.168.20.0/24
Destination: 192.168.30.0/24
Ports: 2049 (NFS), 111 (Portmapper)
Action: Accept
Rule 3: Block IoT (VLAN 50) → All Private Networks
Source: 192.168.50.0/24
Destination: 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12
Action: Drop
Rule 4: Catch-All Deny Between VLANs (must be created explicitly; the UDM Pro allows inter-VLAN traffic by default)
Source: Any
Destination: Any
Action: Drop
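A default-deny policy only works if rules are evaluated top-down and the first match wins. The sketch below replays a simplified version of the example rules above (ports omitted, one destination network per rule) against a few sample flows; it is purely illustrative and not how the UDM Pro implements its firewall internally.

```python
import ipaddress

# Ordered rule set mirroring the example above; first match wins.
RULES = [
    ("Allow Mgmt -> all",        "192.168.10.0/24", "0.0.0.0/0",       "accept"),
    ("Allow K8s -> Storage NFS", "192.168.20.0/24", "192.168.30.0/24", "accept"),
    ("Block IoT -> RFC1918",     "192.168.50.0/24", "192.168.0.0/16",  "drop"),
    ("Catch-all deny",           "0.0.0.0/0",       "0.0.0.0/0",       "drop"),
]

def evaluate(src: str, dst: str) -> str:
    """Walk the ordered rule list and return the first matching action."""
    src_ip, dst_ip = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for name, src_net, dst_net, action in RULES:
        if src_ip in ipaddress.ip_network(src_net) and dst_ip in ipaddress.ip_network(dst_net):
            return f"{action} ({name})"
    return "drop (implicit)"

print(evaluate("192.168.10.5", "192.168.30.20"))  # accept - management reaches everything
print(evaluate("192.168.20.7", "192.168.30.20"))  # accept - K8s to storage
print(evaluate("192.168.40.9", "192.168.30.20"))  # drop   - DMZ hits the catch-all deny
```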
DHCP per VLAN
Each VLAN can have its own DHCP server:
- Independent IP ranges per VLAN
- Separate DHCP options (DNS, gateway, NTP, domain)
- Static DHCP reservations per VLAN
- PXE boot options (Option 66/67) per network
Configuration:
- Settings → Networks → Select VLAN
- DHCP section:
- Enable DHCP server
- Define IP range (e.g., 192.168.10.100-192.168.10.254)
- Set lease time
- Configure gateway (usually UDM Pro’s IP on this VLAN)
- Add custom DHCP options
Example DHCP Configuration:
VLAN 10 (Management):
Subnet: 192.168.10.0/24
Gateway: 192.168.10.1 (UDM Pro)
DHCP Range: 192.168.10.100-192.168.10.200
DNS: 192.168.10.10 (local DNS server)
TFTP Server (Option 66): 192.168.10.16
Boot Filename (Option 67): pxelinux.0
VLAN 20 (Kubernetes):
Subnet: 192.168.20.0/24
Gateway: 192.168.20.1 (UDM Pro)
DHCP Range: 192.168.20.50-192.168.20.99
DNS: 8.8.8.8, 8.8.4.4
Domain Name: k8s.lab.local
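Before committing per-VLAN DHCP settings, it is worth checking that each gateway and lease range actually fall inside the subnet and do not collide. A stdlib-only sketch using the addresses from the example above (nothing here talks to the UDM Pro):

```python
import ipaddress

# Per-VLAN DHCP plan taken from the example above (values are illustrative).
DHCP_PLAN = {
    "VLAN 10": {"subnet": "192.168.10.0/24", "gateway": "192.168.10.1",
                "range": ("192.168.10.100", "192.168.10.200")},
    "VLAN 20": {"subnet": "192.168.20.0/24", "gateway": "192.168.20.1",
                "range": ("192.168.20.50", "192.168.20.99")},
}

for name, cfg in DHCP_PLAN.items():
    net = ipaddress.ip_network(cfg["subnet"])
    gw = ipaddress.ip_address(cfg["gateway"])
    start, end = (ipaddress.ip_address(a) for a in cfg["range"])
    assert gw in net, f"{name}: gateway outside subnet"
    assert start in net and end in net, f"{name}: DHCP range outside subnet"
    assert start < end, f"{name}: DHCP range start after end"
    # The gateway must not sit inside the dynamic pool, or it may be leased out.
    assert not (start <= gw <= end), f"{name}: gateway inside DHCP pool"
    print(f"{name}: OK ({int(end) - int(start) + 1} dynamic leases)")
```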
VLAN Isolation
Guest Portal Isolation:
- Guest networks auto-configured with isolation rules
- Prevents access to RFC1918 private networks
- Internet-only access by default
Manual Isolation (Firewall Rules):
- Create LAN In rules to block inter-VLAN traffic
- Use groups for easier management of multiple VLANs
- Apply port isolation for additional security
Device Isolation (IoT Networks):
- Prevents devices on same VLAN from communicating
- Only controller/gateway access allowed
- Use for untrusted IoT devices (cameras, smart home)
VPN and VLAN Integration
Site-to-Site VPN VLAN Assignment
✅ VLANs CAN be assigned to site-to-site VPN connections:
WireGuard VPN:
- Configure remote subnet to map to specific local VLAN
- Example: GCP subnet 10.128.0.0/20 → routed through VLAN 10
- Routing table automatically updated
- Firewall rules apply to VPN traffic
IPsec Site-to-Site:
- Specify local networks (can select specific VLANs)
- Remote networks configured in tunnel settings
- Multiple VLANs can traverse single VPN tunnel
- Perfect Forward Secrecy supported
Configuration Steps:
- Settings → VPN → Site-to-Site VPN
- Create New VPN tunnel (WireGuard or IPsec)
- Under Local Networks, select VLANs to include:
- Option 1: Select “All” networks
- Option 2: Choose specific VLANs (e.g., VLAN 10, 20 only)
- Configure Remote Networks (cloud provider subnets)
- Set encryption parameters and pre-shared keys
- Create Firewall Rules for VPN traffic:
- Allow specific VLAN → VPN tunnel
- Control which VLANs can reach remote networks
Example Site-to-Site Config:
Home Lab → GCP WireGuard VPN
Local Networks:
- VLAN 10 (Management): 192.168.10.0/24
- VLAN 20 (Kubernetes): 192.168.20.0/24
Remote Networks:
- GCP VPC: 10.128.0.0/20
Firewall Rules:
- Allow VLAN 10 → GCP VPC (all protocols)
- Allow VLAN 20 → GCP VPC (HTTPS, kubectl API only)
- Block all other VLANs from VPN tunnel
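One practical pitfall with site-to-site tunnels is address overlap: a local VLAN that collides with the remote VPC range cannot be routed cleanly through the tunnel. A quick check with the standard ipaddress module, using the subnets from the example config (adjust to your own plan):

```python
import ipaddress

# Subnets from the example site-to-site config above.
local_vlans = {
    "VLAN 10 (Management)": "192.168.10.0/24",
    "VLAN 20 (Kubernetes)": "192.168.20.0/24",
}
remote_networks = {
    "GCP VPC": "10.128.0.0/20",
}

for vlan_name, vlan_cidr in local_vlans.items():
    vlan_net = ipaddress.ip_network(vlan_cidr)
    for remote_name, remote_cidr in remote_networks.items():
        # overlaps() flags any shared address space between the two networks.
        if vlan_net.overlaps(ipaddress.ip_network(remote_cidr)):
            print(f"CONFLICT: {vlan_name} overlaps {remote_name}")
        else:
            print(f"OK: {vlan_name} and {remote_name} do not overlap")
```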
Remote Access VPN VLAN Assignment
✅ VLANs CAN be assigned to remote access VPN clients:
L2TP/IPsec Remote Access:
- VPN clients land on a specific VLAN
- Default: All clients in same VPN subnet
- Firewall rules control VLAN access from VPN
OpenVPN Remote Access (via UniFi Network Application addon):
- Not natively built into UDM Pro
- Requires UniFi Network Application 6.0+
- Can route VPN clients to specific VLAN
Teleport VPN (UniFi’s solution):
- Built-in remote access VPN
- Clients route through UDM Pro
- Can access specific VLANs based on firewall rules
- Layer 3 routing to VLANs
Configuration:
- Settings → VPN → Remote Access
- Enable L2TP or configure Teleport
- Set VPN Network (e.g., 192.168.100.0/24)
- Advanced:
- Enable access to specific VLANs
- By default, the VPN network is treated as its own separate VLAN
- Firewall Rules to allow VPN → VLANs:
- Source: VPN network (192.168.100.0/24)
- Destination: VLAN 10, VLAN 20 (or specific resources)
- Action: Accept
Example Remote Access Config:
Remote VPN Users → Home Lab Access
VPN Network: 192.168.100.0/24
VPN Gateway: 192.168.100.1 (UDM Pro)
Firewall Rules:
Rule 1: Allow VPN → Management VLAN (admin users)
Source: 192.168.100.0/24
Dest: 192.168.10.0/24
Ports: SSH (22), HTTPS (443)
Action: Accept
Rule 2: Allow VPN → Kubernetes VLAN (developers)
Source: 192.168.100.0/24
Dest: 192.168.20.0/24
Ports: Kubernetes API (6443), app ports (8080-8090)
Action: Accept
Rule 3: Block VPN → Storage VLAN (security)
Source: 192.168.100.0/24
Dest: 192.168.30.0/24
Action: Drop
VPN VLAN Routing Limitations
Current Limitations:
- Cannot assign individual VPN clients to different VLANs dynamically
- No VLAN assignment based on user identity (all clients in same VPN network)
- RADIUS integration does not support per-user VLAN assignment for VPN
- For per-user VLAN control, use firewall rules based on source IP
Workarounds:
- Use firewall rules with VPN client IP ranges for granular access
- Deploy separate VPN tunnels for different access levels
- Use RADIUS for authentication + firewall rules for authorization
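The first workaround can be expressed as a small table that generates the firewall rules per role. The sketch below only prints rules to recreate by hand in the controller; the role names, per-role client ranges (which assume clients are pinned to known addresses), and port lists are all assumptions.

```python
# Role -> (VPN client range, allowed destination, allowed TCP ports). Values are assumptions.
ROLE_ACCESS = {
    "admins":     ("192.168.100.0/27",  "192.168.10.0/24", [22, 443]),
    "developers": ("192.168.100.32/27", "192.168.20.0/24", [6443, 8080]),
}

def render_rules(role_access: dict) -> list:
    """Render human-readable 'LAN In' style rules to recreate in the UniFi GUI."""
    rules = []
    for role, (src, dst, ports) in role_access.items():
        port_list = ", ".join(str(p) for p in ports)
        rules.append(f"Allow {role}: {src} -> {dst} (TCP {port_list})")
    # Catch-all so VPN clients only reach what a role rule explicitly grants.
    rules.append("Deny: 192.168.100.0/24 -> any RFC1918 destination")
    return rules

for rule in render_rules(ROLE_ACCESS):
    print(rule)
```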
VLAN Best Practices for Home Lab
Network Segmentation Strategy
Recommended VLAN Layout:
VLAN 1: Default/Management (UDM Pro access)
VLAN 10: Infrastructure Management (Ansible, PXE, monitoring)
VLAN 20: Kubernetes Cluster (control plane + workers)
VLAN 30: Storage Network (NFS, iSCSI, object storage)
VLAN 40: DMZ/Public Services (exposed to internet via Cloudflare)
VLAN 50: IoT Devices (isolated smart home devices)
VLAN 60: Guest Network (visitor WiFi, untrusted devices)
VLAN 100: VPN Remote Access (remote admin/dev access)
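The layout above follows a simple convention: VLAN N maps to 192.168.N.0/24 with the UDM Pro at .1. Capturing that convention as data keeps addressing consistent as networks are added. The VLAN names and IDs below come from the list above; the derivation rule itself is an assumption of this plan, not a UniFi requirement.

```python
import ipaddress

# VLAN ID -> purpose, following the recommended layout above.
VLAN_PLAN = {
    1:   "Default/Management (UDM Pro access)",
    10:  "Infrastructure Management",
    20:  "Kubernetes Cluster",
    30:  "Storage Network",
    40:  "DMZ/Public Services",
    50:  "IoT Devices",
    60:  "Guest Network",
    100: "VPN Remote Access",
}

def addressing(vlan_id: int) -> dict:
    """Derive subnet and gateway from the 'VLAN N -> 192.168.N.0/24' convention."""
    subnet = ipaddress.ip_network(f"192.168.{vlan_id}.0/24")
    return {"subnet": str(subnet), "gateway": str(subnet.network_address + 1)}

for vlan_id, purpose in VLAN_PLAN.items():
    info = addressing(vlan_id)
    print(f"VLAN {vlan_id:>3}: {info['subnet']:<18} gw {info['gateway']:<15} {purpose}")
```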
Firewall Policy Design
Default Deny Approach:
- Create explicit allow rules for necessary traffic
- Set implicit deny for all inter-VLAN traffic
- Log dropped packets for troubleshooting
Rule Order (top to bottom):
- Management VLAN → All (with source IP restrictions)
- Kubernetes → Storage (specific ports)
- DMZ → Internet (outbound only)
- VPN → Specific VLANs (based on role)
- All → Internet (NAT)
- Block RFC1918 from DMZ
- Drop all (implicit)
VLAN Routing Performance:
- Inter-VLAN routing is hardware-accelerated
- No meaningful performance penalty for adding more VLANs
- Carry only the VLANs each trunk port actually needs to limit unnecessary broadcast and multicast traffic
Multicast and Broadcast Control:
- Enable IGMP snooping per VLAN for multicast efficiency
- Disable multicast DNS (mDNS) between VLANs if not needed
- Use multicast routing for cross-VLAN multicast (advanced)
Advanced VLAN Features
VLAN-Specific Services
DNS per VLAN:
- Configure different DNS servers per VLAN via DHCP
- Example: Management VLAN uses local DNS, DMZ uses public DNS
NTP per VLAN:
- DHCP Option 42 for NTP server
- Different time sources per network segment
Domain Name per VLAN:
- DHCP Option 15 for domain name
- Useful for split-horizon DNS setups
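On the wire, DHCP options are simple type-length-value fields (RFC 2132), which is why the controller can hand out arbitrary ones per VLAN. A minimal encoder for the options mentioned in this section; the option numbers are standard, while the values are placeholders.

```python
import socket

def encode_option(code: int, value: bytes) -> bytes:
    """Encode a single DHCP option as code, length, value (RFC 2132 TLV format)."""
    return bytes([code, len(value)]) + value

# Options discussed in this section, with placeholder values.
options = b"".join([
    encode_option(6,  socket.inet_aton("192.168.10.10")),  # DNS server
    encode_option(15, b"k8s.lab.local"),                    # Domain name
    encode_option(42, socket.inet_aton("192.168.10.1")),    # NTP server
    encode_option(66, b"192.168.10.16"),                    # TFTP server name (PXE)
    encode_option(67, b"pxelinux.0"),                       # Boot filename (PXE)
]) + b"\xff"                                                # End option marker

print(options.hex())
```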
VLAN Tagging on WiFi
UniFi WiFi Integration:
- Each WiFi SSID can map to a specific VLAN
- Multiple SSIDs on same AP → different VLANs
- Seamless VLAN tagging for wireless clients
Configuration:
- Create WiFi network in UniFi Controller
- Assign VLAN ID to SSID
- Client traffic automatically tagged
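Conceptually the SSID-to-VLAN mapping is just a lookup table the access points apply when tagging client traffic, as in this trivial sketch (SSID names and VLAN IDs are assumptions):

```python
# SSID -> VLAN ID the AP tags client traffic with (names and IDs are assumptions).
SSID_VLAN_MAP = {
    "homelab-mgmt":  10,
    "homelab-iot":   50,
    "homelab-guest": 60,
}

def vlan_for_ssid(ssid: str) -> int:
    # Fall back to the default VLAN if an SSID has no explicit mapping.
    return SSID_VLAN_MAP.get(ssid, 1)

print(vlan_for_ssid("homelab-iot"))   # 50
print(vlan_for_ssid("unknown-ssid"))  # 1
```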
VLAN Monitoring and Troubleshooting
Traffic Statistics:
- Per-VLAN bandwidth usage visible in UniFi Controller
- Deep Packet Inspection (DPI) provides application-level stats
- Export data for analysis in external tools
Debugging Tools:
- Port mirroring for packet capture
- Flow logs for traffic analysis
- Firewall logs show inter-VLAN blocks
Common Issues:
- VLAN not working: Check port profile assignment and native VLAN config
- No inter-VLAN routing: Verify firewall rules aren’t blocking traffic
- DHCP not working on VLAN: Ensure DHCP server enabled on that network
- VPN can’t reach VLAN: Check VPN local networks include the VLAN
Summary
VLAN Port Assignment: ✅ YES
The UDM Pro fully supports port-based VLAN assignment:
- Individual ports can be assigned to specific VLANs (access mode)
- Ports can carry multiple tagged VLANs (trunk mode)
- Native/untagged VLAN configurable per port
- Port profiles simplify configuration across multiple devices
VPN VLAN Assignment: ✅ YES
VLANs can be assigned to VPN connections:
- Site-to-Site VPN: Select which VLANs traverse the tunnel
- Remote Access VPN: VPN clients route to specific VLANs via firewall rules
- Routing Control: Full control over which VLANs are accessible via VPN
- Limitations: No per-user VLAN assignment; use firewall rules for granular access
Key Capabilities
- Up to 32 VLANs supported
- Hardware-accelerated inter-VLAN routing
- Per-VLAN DHCP, DNS, and firewall policies
- Full integration with UniFi WiFi for SSID-to-VLAN mapping
- Flexible port profiles for easy configuration
- VPN integration for both site-to-site and remote access scenarios