Architecture Decision Records
Documentation of architectural decisions made using the MADR 4.0.0 standard
Architecture Decision Records (ADRs)
This section contains architectural decision records that document the project's key design choices. Each ADR follows the MADR 4.0.0 format and includes:
- Context and problem statement
- Decision drivers and constraints
- Considered options with pros and cons
- Decision outcome and rationale
- Consequences (positive and negative)
- Confirmation methods
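In file form, a new record starts from a skeleton mirroring these sections; the following is an abridged example (title, status, and body text are placeholders):

```markdown
# Use X for Y

- Status: proposed

## Context and Problem Statement
...

## Decision Drivers
...

## Considered Options
...

## Decision Outcome
Chosen option: "...", because ...

### Consequences
...

### Confirmation
...
```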
ADR Categories
ADRs are classified into three categories:
- Strategic - High-level architectural decisions affecting the entire system (frameworks, authentication strategies, cross-cutting patterns). Use for foundational technology choices.
- User Journey - Decisions solving specific user journey problems. More tactical than strategic, but still architectural. Use when evaluating approaches to implement user-facing features.
- API Design - API endpoint implementation decisions (pagination, filtering, bulk operations). Use for significant API design trade-offs that warrant documentation.
Status Values
Each ADR has a status that reflects its current state:
- proposed - Decision is under consideration
- accepted - Decision has been approved and should be implemented
- rejected - Decision was considered but not approved
- deprecated - Decision is no longer relevant or has been superseded
- superseded by ADR-XXXX - Decision has been replaced by a newer ADR
These records provide historical context for architectural decisions and help ensure consistency across the platform.
1 - [0001] Use MADR for Architecture Decision Records
Adopt Markdown Architectural Decision Records (MADR) as the standard format for documenting architectural decisions in the project.
Context and Problem Statement
As the project grows, architectural decisions are made that have long-term impacts on the system’s design, maintainability, and scalability. Without a structured way to document these decisions, we risk losing the context and rationale behind important choices, making it difficult for current and future team members to understand why certain approaches were taken.
How should we document architectural decisions in a way that is accessible, maintainable, and provides sufficient context for future reference?
Decision Drivers
- Need for clear documentation of architectural decisions and their rationale
- Easy accessibility and searchability of past decisions
- Low barrier to entry for creating and maintaining decision records
- Integration with existing documentation workflow
- Version control friendly format
- Industry-standard approach that team members may already be familiar with
Considered Options
- MADR (Markdown Architectural Decision Records)
- ADR using custom format
- Wiki-based documentation
- No formal ADR process
Decision Outcome
Chosen option: “MADR (Markdown Architectural Decision Records)”, because it provides a well-established, standardized format that is lightweight, version-controlled, and integrates seamlessly with our existing documentation structure. MADR 4.0.0 offers a clear template that captures all necessary information while remaining flexible enough for different types of decisions.
Consequences
- Good, because MADR is a widely adopted standard with clear documentation and examples
- Good, because markdown files are easy to create, edit, and review through pull requests
- Good, because ADRs will be version-controlled alongside code, maintaining historical context
- Good, because the format is flexible enough to accommodate strategic, user-journey, and API design decisions
- Good, because team members can easily search and reference past decisions
- Neutral, because requires discipline to maintain and update ADR status as decisions evolve
- Bad, because team members need to learn and follow the MADR format conventions
Confirmation
Compliance will be confirmed through:
- Code reviews ensuring new architectural decisions are documented as ADRs
- ADRs are stored in docs/content/r&d/adrs/ following the naming convention NNNN-title-with-dashes.md
- Regular reviews during architecture discussions to reference and update existing ADRs
Pros and Cons of the Options
MADR (Markdown Architectural Decision Records)
MADR 4.0.0 is a standardized format for documenting architectural decisions using markdown.
- Good, because it’s a well-established standard with extensive documentation
- Good, because markdown is simple, portable, and version-control friendly
- Good, because it provides a clear structure while remaining flexible
- Good, because it integrates with static site generators and documentation tools
- Good, because it’s lightweight and doesn’t require special tools
- Neutral, because it requires some initial learning of the format
- Neutral, because maintaining consistency requires discipline
ADR using custom format
Create our own custom format for architectural decision records.
- Good, because we can tailor it exactly to our needs
- Bad, because it requires defining and maintaining our own standard
- Bad, because new team members won’t be familiar with the format
- Bad, because we lose the benefits of community knowledge and tooling
- Bad, because it may evolve inconsistently over time
Wiki-based documentation
Use a wiki system (like Confluence, Notion, or GitHub Wiki) to document decisions.
- Good, because wikis provide easy editing and hyperlinking
- Good, because some team members may be familiar with wiki tools
- Neutral, because it may or may not integrate with version control
- Bad, because content may not be version-controlled alongside code
- Bad, because it creates a separate system to maintain
- Bad, because it’s harder to review changes through standard PR process
- Bad, because portability and long-term accessibility may be concerns
No formal ADR process
Continue without a structured approach to documenting architectural decisions.
- Good, because it requires no additional overhead
- Bad, because context and rationale for decisions are lost over time
- Bad, because new team members struggle to understand why decisions were made
- Bad, because it leads to repeated discussions of previously settled questions
- Bad, because it makes it difficult to track when decisions should be revisited
- MADR 4.0.0 specification: https://adr.github.io/madr/
- ADRs will be categorized as: strategic, user-journey, or api-design
- ADR status values: proposed | accepted | rejected | deprecated | superseded by ADR-XXXX
- All ADRs are stored in the docs/content/r&d/adrs/ directory
2 - [0002] Network Boot Architecture for Home Lab
Evaluate options for network booting servers in a home lab environment, considering local vs cloud-hosted boot servers.
Context and Problem Statement
When setting up a home lab infrastructure, servers need to be provisioned and booted over the network using PXE (Preboot Execution Environment). This requires a TFTP/HTTP server to serve boot files to requesting machines. The question is: where should this boot server be hosted to balance security, reliability, cost, and operational complexity?
Decision Drivers
- Security: Minimize attack surface and ensure only authorized servers receive boot files
- Reliability: Boot process should be resilient and not dependent on external network connectivity
- Cost: Minimize ongoing infrastructure costs
- Complexity: Keep the operational burden manageable
- Trust Model: Clear verification of requesting server identity
Considered Options
- Option 1: TFTP/HTTP server locally on home lab network
- Option 2: TFTP/HTTP server on public cloud (without VPN)
- Option 3: TFTP/HTTP server on public cloud (with VPN)
Decision Outcome
Chosen option: “Option 3: TFTP/HTTP server on public cloud (with VPN)”, because:
- No local machine management: Unlike Option 1, this avoids the need to maintain dedicated local hardware for the boot server, reducing operational overhead
- Secure protocol support: The VPN tunnel encrypts all traffic, allowing unsecured protocols like TFTP to be used without risk of data exposure over public internet routes (unlike Option 2)
- Cost-effective VPN: The UDM Pro natively supports WireGuard, enabling a self-managed VPN solution that avoids expensive managed VPN services (~$180-300/year vs ~$540-900/year)
Consequences
- Good, because all traffic is encrypted through WireGuard VPN tunnel
- Good, because boot server is not exposed to public internet (no public attack surface)
- Good, because trust model is simple - subnet validation similar to local option
- Good, because centralized cloud management reduces local maintenance burden
- Good, because boot server remains available even if home lab storage fails
- Good, because UDM Pro’s native WireGuard support keeps costs at ~$180-300/year
- Bad, because boot process depends on both internet connectivity and VPN availability
- Bad, because VPN adds latency to boot file transfers
- Bad, because VPN gateway becomes an additional failure point
- Bad, because higher ongoing cost compared to local-only option (~$180-300/year vs ~$10/year)
Confirmation
The implementation will be confirmed by:
- Successfully network booting a test server using the chosen architecture
- Validating the trust model prevents unauthorized boot requests
- Measuring actual costs against estimates
Pros and Cons of the Options
Option 1: TFTP/HTTP server locally on home lab network
Run the boot server on local infrastructure (e.g., Raspberry Pi, dedicated VM, or container) within the home lab network.
Boot Flow Sequence
sequenceDiagram
participant Server as Home Lab Server
participant DHCP as Local DHCP Server
participant Boot as Local TFTP/HTTP Server
Server->>DHCP: PXE Boot Request (DHCP Discover)
DHCP->>Server: DHCP Offer with Boot Server IP
Server->>Boot: TFTP Request for Boot File
Boot->>Boot: Verify MAC/IP against allowlist
Boot->>Server: Send iPXE/Boot Loader
Server->>Boot: HTTP Request for Kernel/Initrd
Boot->>Server: Send Boot Files
Server->>Server: Boot into OS
Trust Model
- MAC Address Allowlist: Maintain a list of known server MAC addresses
- Network Isolation: Boot server only accessible from home lab VLAN
- No external exposure: Traffic never leaves local network
- Physical security: Relies on physical access control to home lab
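As an illustration of this trust model, if the local boot server were built on dnsmasq the MAC allowlist and PXE handoff could be configured roughly as follows (addresses, paths, and the choice of dnsmasq itself are assumptions, not part of this decision):

```bash
# /etc/dnsmasq.d/pxe.conf - only allowlisted MACs are served (illustrative values)
cat <<'EOF' | sudo tee /etc/dnsmasq.d/pxe.conf
dhcp-host=aa:bb:cc:dd:ee:ff,set:known   # tag each known home lab server
dhcp-ignore=tag:!known                  # ignore DHCP requests from unknown MACs
enable-tftp
tftp-root=/srv/tftp
dhcp-boot=undionly.kpxe,bootserver,192.168.10.5
EOF
sudo systemctl restart dnsmasq
```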
Cost Estimate
- Hardware: ~$50-100 one-time (Raspberry Pi or repurposed hardware)
- Power: ~$5-10/year (low power consumption)
- Total: ~$55-110 initial + ~$10/year ongoing
Pros and Cons
- Good, because no dependency on internet connectivity for booting
- Good, because lowest latency for boot file transfers
- Good, because all data stays within local network (maximum privacy)
- Good, because lowest ongoing cost
- Good, because simple trust model based on network isolation
- Neutral, because requires dedicated local hardware or resources
- Bad, because single point of failure if boot server goes down
- Bad, because requires local maintenance and updates
Option 2: TFTP/HTTP server on public cloud (without VPN)
Host the boot server on a cloud provider (AWS, GCP, Azure) and expose it directly to the internet.
Boot Flow Sequence
sequenceDiagram
participant Server as Home Lab Server
participant DHCP as Local DHCP Server
participant Router as Home Router/NAT
participant Internet as Internet
participant Boot as Cloud TFTP/HTTP Server
Server->>DHCP: PXE Boot Request (DHCP Discover)
DHCP->>Server: DHCP Offer with Cloud Boot Server IP
Server->>Router: TFTP Request
Router->>Internet: NAT Translation
Internet->>Boot: TFTP Request from Home IP
Boot->>Boot: Verify source IP + token/certificate
Boot->>Internet: Send iPXE/Boot Loader
Internet->>Router: Response
Router->>Server: Boot Loader
Server->>Router: HTTP Request for Kernel/Initrd
Router->>Internet: NAT Translation
Internet->>Boot: HTTP Request with auth headers
Boot->>Boot: Validate request authenticity
Boot->>Internet: Send Boot Files
Internet->>Router: Response
Router->>Server: Boot Files
Server->>Server: Boot into OS
Trust Model
- Source IP Validation: Restrict to home lab’s public IP (dynamic IP is problematic)
- Certificate/Token Authentication: Embed certificates in initial bootloader
- TLS for HTTP: All HTTP traffic encrypted
- Challenge-Response: Boot server can challenge requesting server
- Risk: TFTP typically unencrypted, vulnerable to interception
Cost Estimate
- Cloud VM (t3.micro or equivalent): ~$10-15/month
- Data Transfer: ~$1-5/month (boot files are typically small)
- Static IP: ~$3-5/month
- Total: ~$170-300/year
Pros and Cons
- Good, because boot server remains available even if home lab has issues
- Good, because centralized management in cloud console
- Good, because easy to scale or replicate
- Neutral, because requires internet connectivity for every boot
- Bad, because significantly higher ongoing cost
- Bad, because TFTP protocol is inherently insecure over public internet
- Bad, because complex trust model required (IP validation, certificates)
- Bad, because boot process depends on internet availability
- Bad, because higher latency for boot file transfers
- Bad, because public exposure increases attack surface
Option 3: TFTP/HTTP server on public cloud (with VPN)
Host the boot server in the cloud but connect the home lab to the cloud via a site-to-site VPN tunnel.
Boot Flow Sequence
sequenceDiagram
participant Server as Home Lab Server
participant DHCP as Local DHCP Server
participant VPN as VPN Gateway (Home)
participant CloudVPN as VPN Gateway (Cloud)
participant Boot as Cloud TFTP/HTTP Server
Note over VPN,CloudVPN: Site-to-Site VPN Tunnel Established
Server->>DHCP: PXE Boot Request (DHCP Discover)
DHCP->>Server: DHCP Offer with Boot Server Private IP
Server->>VPN: TFTP Request to Private IP
VPN->>CloudVPN: Encrypted VPN Tunnel
CloudVPN->>Boot: TFTP Request (appears local)
Boot->>Boot: Verify source IP from home lab subnet
Boot->>CloudVPN: Send iPXE/Boot Loader
CloudVPN->>VPN: Encrypted Response
VPN->>Server: Boot Loader
Server->>VPN: HTTP Request for Kernel/Initrd
VPN->>CloudVPN: Encrypted VPN Tunnel
CloudVPN->>Boot: HTTP Request
Boot->>Boot: Validate subnet membership
Boot->>CloudVPN: Send Boot Files
CloudVPN->>VPN: Encrypted Response
VPN->>Server: Boot Files
Server->>Server: Boot into OS
Trust Model
- VPN Tunnel Encryption: All traffic encrypted end-to-end
- Private IP Addressing: Boot server only accessible via VPN
- Subnet Validation: Verify requests come from trusted home lab subnet
- VPN Authentication: Strong auth at tunnel level (certificates, pre-shared keys)
- No public exposure: Boot server has no public IP
Cost Estimate
- Cloud VM (t3.micro or equivalent): ~$10-15/month
- Data Transfer (VPN): ~$5-10/month
- VPN Gateway Service (if using managed): ~$30-50/month OR
- Self-managed VPN (WireGuard/OpenVPN): ~$0 additional
- Total (self-managed VPN): ~$180-300/year
- Total (managed VPN): ~$540-900/year
Pros and Cons
- Good, because all traffic encrypted through VPN tunnel
- Good, because boot server not exposed to public internet
- Good, because trust model similar to local option (subnet validation)
- Good, because centralized cloud management benefits
- Good, because boot server available if home lab storage fails
- Neutral, because moderate complexity (VPN setup and maintenance)
- Bad, because higher cost than local option
- Bad, because boot process still depends on internet + VPN availability
- Bad, because VPN adds latency to boot process
- Bad, because VPN gateway becomes additional failure point
- Bad, because most expensive option if using managed VPN service
Key Questions for Decision
- How critical is boot availability during internet outages?
- Is the home lab public IP static or dynamic?
- What is the acceptable boot time latency?
- How many servers need to be supported?
- Is there existing VPN infrastructure?
- Issue #595 - story(docs): create adr for network boot architecture
3 - [0003] Cloud Provider Selection for Network Boot Infrastructure
Evaluate Google Cloud Platform vs Amazon Web Services for hosting network boot server infrastructure as required by ADR-0002.
Context and Problem Statement
ADR-0002 established that network boot infrastructure will be hosted on a cloud provider and accessed via VPN (specifically WireGuard from the UDM Pro). The decision to use cloud hosting provides resilience against local hardware failures while maintaining security through encrypted VPN tunnels.
The question now is: Which cloud provider should host the network boot infrastructure?
This decision will affect:
- Cost: Ongoing monthly/annual infrastructure costs
- Protocol Support: Ability to serve TFTP, HTTP, and HTTPS boot files
- VPN Integration: Ease of WireGuard deployment and management
- Operational Complexity: Management overhead and maintenance burden
- Performance: Boot file transfer latency and throughput
- Vendor Lock-in: Future flexibility to migrate or multi-cloud
Decision Drivers
- Cost Efficiency: Minimize ongoing infrastructure costs for home lab scale
- Protocol Support: Must support TFTP (UDP/69), HTTP (TCP/80), and HTTPS (TCP/443) for network boot workflows
- WireGuard Compatibility: Must support self-managed WireGuard VPN with reasonable effort
- UDM Pro Integration: Should work seamlessly with UniFi Dream Machine Pro’s native WireGuard client
- Simplicity: Minimize operational complexity for a single-person home lab
- Existing Expertise: Leverage existing team knowledge and infrastructure
- Performance: Sufficient throughput and low latency for boot file transfers (50-200MB per boot)
Considered Options
- Option 1: Google Cloud Platform (GCP)
- Option 2: Amazon Web Services (AWS)
Decision Outcome
Chosen option: “Option 1: Google Cloud Platform (GCP)”, because:
- Existing Infrastructure: The home lab already uses GCP extensively (Cloud Run services, load balancers, mTLS infrastructure per existing codebase), reducing operational overhead and leveraging existing expertise
- Comparable Costs: Both providers offer similar costs for the required infrastructure (~$6-12/month for compute + VPN), with GCP’s e2-micro being sufficient
- Equivalent Protocol Support: Both support TFTP/HTTP/HTTPS via direct VM access (load balancers unnecessary for single boot server), meeting all protocol requirements
- WireGuard Compatibility: Both require self-managed WireGuard deployment (neither has native WireGuard support), with nearly identical implementation complexity
- Unified Management: Consolidating all cloud infrastructure on GCP simplifies monitoring, billing, IAM, and operational workflows
While AWS would be a viable alternative (especially with t4g.micro ARM instances offering slightly better price/performance), the existing GCP investment makes it the pragmatic choice to avoid multi-cloud complexity.
Consequences
- Good, because consolidates all cloud infrastructure on a single provider (reduced operational complexity)
- Good, because leverages existing GCP expertise and IAM configurations
- Good, because unified Cloud Monitoring/Logging across all services
- Good, because single cloud bill simplifies cost tracking
- Good, because existing Terraform modules and patterns can be reused
- Good, because GCP’s e2-micro instances (~$6.50/month) are cost-effective for the workload
- Good, because self-managed WireGuard provides flexibility and low cost (~$10/month total)
- Neutral, because both providers have comparable protocol support (TFTP/HTTP/HTTPS via VM)
- Neutral, because both require self-managed WireGuard (no native support)
- Bad, because creates vendor lock-in to GCP (migration would require relearning and reconfiguration)
- Bad, because foregoes AWS’s slightly cheaper t4g.micro ARM instances (~$6/month vs GCP’s ~$6.50/month)
- Bad, because multi-cloud strategy could provide redundancy (accepted trade-off for simplicity)
Confirmation
The implementation will be confirmed by:
- Successfully deploying WireGuard VPN gateway on GCP Compute Engine
- Establishing site-to-site VPN tunnel between UDM Pro and GCP
- Network booting a test server via VPN using TFTP and HTTP protocols
- Measuring actual costs against estimates (~$10-15/month)
- Validating boot performance (transfer time < 30 seconds for typical boot)
Pros and Cons of the Options
Option 1: Google Cloud Platform (GCP)
Host network boot infrastructure on Google Cloud Platform.
Architecture Overview
graph TB
subgraph "Home Lab Network"
A[Home Lab Servers]
B[UDM Pro - WireGuard Client]
end
subgraph "GCP VPC"
C[WireGuard Gateway VM<br/>e2-micro]
D[Boot Server VM<br/>e2-micro]
C -->|VPC Routing| D
end
A -->|PXE Boot Request| B
B -->|WireGuard Tunnel| C
C -->|TFTP/HTTP/HTTPS| D
D -->|Boot Files| C
C -->|Encrypted Response| B
B -->|Boot Files| A
Implementation Details
Compute:
- WireGuard Gateway: e2-micro VM (~$6.50/month) running Ubuntu 22.04
- Self-managed WireGuard server
- IP forwarding enabled
- Static external IP (~$3.50/month if VM ever stops)
- Boot Server: e2-micro VM (same or consolidated with gateway)
- TFTP server (tftpd-hpa)
- HTTP server (nginx or simple Python server)
- Optional HTTPS with self-signed cert or Let’s Encrypt
Networking:
- VPC: Default VPC or custom VPC with private subnets
- Firewall Rules:
- Allow UDP/51820 from home lab public IP (WireGuard)
- Allow UDP/69, TCP/80, TCP/443 from VPN subnet (boot protocols)
- Routes: Custom route to direct home lab subnet through WireGuard gateway
- Cloud VPN: Not used (self-managed WireGuard instead to save ~$65/month)
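A sketch of those two firewall rules via gcloud (network name, public IP, and VPN subnet are example values):

```bash
# Allow WireGuard handshakes only from the home lab's public IP
gcloud compute firewall-rules create allow-wireguard \
  --network=default --allow=udp:51820 --source-ranges=203.0.113.10/32

# Allow boot protocols only from the VPN subnet
gcloud compute firewall-rules create allow-boot-protocols \
  --network=default --allow=udp:69,tcp:80,tcp:443 --source-ranges=10.10.0.0/24
```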
WireGuard Setup:
- Install WireGuard on Compute Engine VM
- Configure wg0 interface with PostUp/PostDown iptables rules
- Store private key in Secret Manager
- UDM Pro connects as WireGuard peer
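A minimal sketch of that setup, assuming an illustrative VPN subnet (10.10.0.0/24), home lab subnet (192.168.10.0/24), and a Secret Manager secret named wg-private-key:

```bash
sudo apt install -y wireguard

# Render wg0.conf, pulling the private key from Secret Manager at render time
sudo tee /etc/wireguard/wg0.conf <<EOF
[Interface]
Address = 10.10.0.1/24
ListenPort = 51820
PrivateKey = $(gcloud secrets versions access latest --secret=wg-private-key)
# ens4 is the typical NIC name on GCP images; adjust as needed
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# UDM Pro
PublicKey = <UDM Pro public key>
AllowedIPs = 192.168.10.0/24
EOF

sudo systemctl enable --now wg-quick@wg0
```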
Cost Breakdown (US regions):
| Component | Monthly Cost |
|---|---|
| e2-micro VM (WireGuard + Boot) | ~$6.50 |
| Static External IP (if attached) | ~$3.50 |
| Egress (10 boots × 150MB) | ~$0.18 |
| Total | ~$10.18 |
| Annual | ~$122 |
Pros and Cons
- Good, because existing home lab infrastructure already uses GCP extensively
- Good, because consolidates all cloud resources on single provider (unified billing, IAM, monitoring)
- Good, because leverages existing GCP expertise and Terraform modules
- Good, because Cloud Monitoring/Logging already configured for other services
- Good, because Secret Manager integration for WireGuard key storage
- Good, because e2-micro instance size is sufficient for network boot workload
- Good, because low cost (~$10/month for self-managed WireGuard)
- Good, because VPC networking is familiar and well-documented
- Neutral, because requires self-managed WireGuard (no native support, same as AWS)
- Neutral, because TFTP/HTTP/HTTPS served directly from VM (no special GCP features needed)
- Bad, because slightly more expensive than AWS t4g.micro (~$6.50/month vs ~$6/month)
- Bad, because creates vendor lock-in to GCP ecosystem
- Bad, because Cloud VPN (managed IPsec) is expensive (~$73/month), so must use self-managed WireGuard
Option 2: Amazon Web Services (AWS)
Host network boot infrastructure on Amazon Web Services.
Architecture Overview
graph TB
subgraph "Home Lab Network"
A[Home Lab Servers]
B[UDM Pro - WireGuard Client]
end
subgraph "AWS VPC"
C[WireGuard Gateway EC2<br/>t4g.micro]
D[Boot Server EC2<br/>t4g.micro]
C -->|VPC Routing| D
end
A -->|PXE Boot Request| B
B -->|WireGuard Tunnel| C
C -->|TFTP/HTTP/HTTPS| D
D -->|Boot Files| C
C -->|Encrypted Response| B
B -->|Boot Files| A
Implementation Details
Compute:
- WireGuard Gateway: t4g.micro EC2 (~$6/month, ARM-based Graviton)
- Self-managed WireGuard server
- Source/Dest check disabled for IP forwarding
- Elastic IP (free when attached to running instance)
- Boot Server: t4g.micro EC2 (same or consolidated with gateway)
- TFTP server (tftpd-hpa)
- HTTP server (nginx)
- Optional HTTPS with Let’s Encrypt or self-signed cert
Networking:
- VPC: Default VPC or custom VPC with private subnets
- Security Groups:
- WireGuard SG: Allow UDP/51820 from home lab public IP
- Boot Server SG: Allow UDP/69, TCP/80, TCP/443 from WireGuard SG
- Route Table: Add route for home lab subnet via WireGuard instance
- Site-to-Site VPN: Not used (self-managed WireGuard saves ~$30/month)
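A sketch of the corresponding AWS steps (security group, instance IDs, and CIDRs are placeholders):

```bash
# WireGuard SG: allow handshakes only from the home lab's public IP
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp --port 51820 --cidr 203.0.113.10/32

# Disable the source/dest check so the instance can forward VPN traffic
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 --no-source-dest-check
```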
WireGuard Setup:
- Install WireGuard on Ubuntu 22.04 or Amazon Linux 2023 EC2
- Configure wg0 with iptables MASQUERADE
- Store private key in Secrets Manager
- UDM Pro connects as WireGuard peer
Cost Breakdown (US East):
| Component | Monthly Cost |
|---|---|
| t4g.micro EC2 (WireGuard + Boot) | ~$6.00 |
| Elastic IP (attached) | $0.00 |
| Egress (10 boots × 150MB) | ~$0.09 |
| Total (On-Demand) | ~$6.09 |
| Total (1-yr Reserved) | ~$3.59 |
| Annual (On-Demand) | ~$73 |
| Annual (Reserved) | ~$43 |
Pros and Cons
- Good, because t4g.micro ARM instances offer best price/performance (~$6/month on-demand)
- Good, because Reserved Instances provide significant savings (~40% with 1-year commitment)
- Good, because Elastic IP is free when attached to running instance
- Good, because AWS has extensive documentation and community support
- Good, because potential for future multi-cloud strategy
- Good, because ACM provides free SSL certificates (if public domain used)
- Good, because Secrets Manager for WireGuard key storage
- Good, because low cost (~$6/month on-demand, ~$3.50/month with RI)
- Neutral, because requires self-managed WireGuard (no native support, same as GCP)
- Neutral, because TFTP/HTTP/HTTPS served directly from EC2 (no special AWS features)
- Bad, because introduces multi-cloud complexity (separate billing, IAM, monitoring)
- Bad, because no existing AWS infrastructure in home lab (new learning curve)
- Bad, because requires separate monitoring/logging setup (CloudWatch vs Cloud Monitoring)
- Bad, because separate Terraform state and modules needed
- Bad, because Site-to-Site VPN is expensive (~$36/month), so must use self-managed WireGuard
Detailed Analysis
In-depth analyses of each provider's capabilities are documented separately; the key findings are summarized below.
Key Findings Summary
Both providers offer:
- ✅ TFTP Support: Via direct VM/EC2 access (load balancers don’t support TFTP)
- ✅ HTTP/HTTPS Support: Full support via direct VM/EC2 or load balancers
- ✅ WireGuard Compatibility: Self-managed deployment on VM/EC2 (neither has native support)
- ✅ UDM Pro Integration: Native WireGuard client works with both
- ✅ Low Cost: $6-12/month for compute + VPN infrastructure
- ✅ Sufficient Performance: 100+ Mbps throughput on smallest instances
Key differences:
- GCP: Slightly higher cost (~$10/month), but consolidates with existing infrastructure
- AWS: Slightly lower cost (~$6/month on-demand, ~$3.50/month Reserved), but introduces multi-cloud complexity
Cost Comparison Table
| Component | GCP (e2-micro) | AWS (t4g.micro On-Demand) | AWS (t4g.micro 1-yr RI) |
|---|---|---|---|
| Compute | $6.50/month | $6.00/month | $3.50/month |
| Static IP | $3.50/month | $0.00 (Elastic IP free when attached) | $0.00 |
| Egress (1.5GB) | $0.18/month | $0.09/month | $0.09/month |
| Monthly | $10.18 | $6.09 | $3.59 |
| Annual | $122 | $73 | $43 |
Savings Analysis: AWS is ~$49-79/year cheaper, but introduces operational complexity.
Protocol Support Comparison
| Protocol | GCP Support | AWS Support | Implementation |
|---|---|---|---|
| TFTP (UDP/69) | ⚠️ Via VM | ⚠️ Via EC2 | Direct VM/EC2 access (no LB support) |
| HTTP (TCP/80) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer |
| HTTPS (TCP/443) | ✅ Full | ✅ Full | Direct VM/EC2 or Load Balancer + cert |
| WireGuard | ⚠️ Self-managed | ⚠️ Self-managed | Install on VM/EC2 |
WireGuard Deployment Comparison
| Aspect | GCP | AWS |
|---|---|---|
| Native Support | ❌ No (IPsec Cloud VPN only) | ❌ No (IPsec Site-to-Site VPN only) |
| Self-Managed | ✅ Compute Engine | ✅ EC2 |
| Setup Complexity | Similar (install, configure, firewall) | Similar (install, configure, SG) |
| IP Forwarding | Enable on VM | Disable Source/Dest check |
| Firewall | VPC Firewall rules | Security Groups |
| Key Storage | Secret Manager | Secrets Manager |
| Cost | ~$10/month total | ~$6/month total |
Trade-offs Analysis
Choosing GCP:
- Wins: Operational simplicity, unified infrastructure, existing expertise
- Loses: ~$50-80/year higher cost, vendor lock-in
Choosing AWS:
- Wins: Lower cost, Reserved Instance savings, multi-cloud optionality
- Loses: Multi-cloud complexity, separate monitoring/billing, new tooling
For a home lab prioritizing simplicity over cost optimization, GCP’s consolidation benefits outweigh the modest cost difference.
Future Considerations
- Cost Reevaluation: If annual costs become significant, reconsider AWS Reserved Instances
- Multi-Cloud: If multi-cloud strategy emerges, migrate boot server to AWS
- Managed WireGuard: If GCP or AWS adds native WireGuard support, reevaluate managed option
- High Availability: If HA required, evaluate multi-region deployment costs on both providers
- Issue #597 - story(docs): create adr for cloud provider selection
4 - [0004] Server Operating System Selection
Evaluate operating systems for homelab server infrastructure with focus on Kubernetes cluster setup and maintenance.
Context and Problem Statement
The homelab infrastructure requires a server operating system to run Kubernetes clusters for container workloads. The choice of operating system significantly impacts ease of cluster initialization, ongoing maintenance burden, security posture, and operational complexity.
The question is: Which operating system should be used for homelab Kubernetes servers?
This decision will affect:
- Cluster Initialization: Complexity and time required to bootstrap Kubernetes
- Maintenance Burden: Frequency and complexity of OS updates, Kubernetes upgrades, and patching
- Security Posture: Attack surface, built-in security features, and hardening requirements
- Resource Efficiency: RAM, CPU, and disk overhead
- Operational Complexity: Day-to-day management, troubleshooting, and debugging
- Learning Curve: Time required for team to become proficient
Decision Drivers
- Ease of Kubernetes Setup: Minimize steps and complexity for cluster initialization
- Maintenance Simplicity: Reduce ongoing operational burden for updates and upgrades
- Security-First Design: Minimal attack surface and strong security defaults
- Resource Efficiency: Low RAM/CPU/disk overhead for cost-effective homelab
- Learning Curve: Reasonable adoption time for single-person homelab
- Community Support: Strong documentation and active community
- Immutability: Prefer declarative, version-controlled configuration (GitOps-friendly)
- Purpose-Built: OS optimized specifically for Kubernetes vs general-purpose
Considered Options
- Option 1: Ubuntu Server with k3s
- Option 2: Fedora Server with kubeadm
- Option 3: Talos Linux (purpose-built Kubernetes OS)
- Option 4: Harvester HCI (hyperconverged platform)
Decision Outcome
Chosen option: “Option 3: Talos Linux”, because:
- Minimal Attack Surface: No SSH, shell, or package manager eliminates entire classes of vulnerabilities, providing the strongest security posture
- Built-in Kubernetes: No separate installation or configuration complexity - Kubernetes is included and optimized
- Declarative Configuration: API-driven, immutable infrastructure aligns with GitOps principles and prevents configuration drift
- Lowest Resource Overhead: ~768MB RAM vs 1-2GB+ for traditional distros, maximizing homelab hardware efficiency
- Simplified Maintenance: Declarative upgrades (talosctl upgrade) for both OS and Kubernetes reduce operational burden
- Security by Default: Immutable filesystem, no shell, KSPP compliance - secure without manual hardening
While the learning curve is steeper than traditional Linux distributions, the benefits of purpose-built Kubernetes infrastructure, minimal maintenance, and superior security outweigh the initial learning investment for a dedicated Kubernetes homelab.
Consequences
- Good, because minimal attack surface (no SSH/shell) provides strongest security posture
- Good, because declarative configuration enables GitOps workflows and prevents drift
- Good, because lowest resource overhead (~768MB RAM) maximizes homelab efficiency
- Good, because built-in Kubernetes eliminates installation complexity
- Good, because immutable infrastructure prevents configuration drift
- Good, because simplified upgrades (single command for OS + K8s) reduce maintenance burden
- Good, because smallest disk footprint (~500MB) vs 10GB+ for traditional distros
- Good, because secure by default (no manual hardening required)
- Good, because purpose-built design optimized specifically for Kubernetes
- Good, because API-driven management (talosctl) enables automation
- Neutral, because steeper learning curve (paradigm shift from shell-based management)
- Neutral, because smaller community than Ubuntu/Fedora (but active and helpful)
- Bad, because limited to Kubernetes workloads only (not general-purpose)
- Bad, because no shell access requires different troubleshooting approach
- Bad, because newer platform (less mature than Ubuntu/Fedora)
- Bad, because no escape hatch for manual intervention when needed
Confirmation
The implementation will be confirmed by:
- Successfully bootstrapping a Talos cluster using talosctl
- Deploying test workloads and validating functionality
- Performing declarative OS and Kubernetes upgrades
- Measuring actual resource usage (RAM < 1GB per node)
- Validating security posture (no SSH/shell, immutable filesystem)
- Testing GitOps workflow (machine configs in version control)
Pros and Cons of the Options
Option 1: Ubuntu Server with k3s
Host Kubernetes using Ubuntu Server 24.04 LTS with k3s lightweight Kubernetes distribution.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Ubuntu Server
participant K3s as k3s Components
Admin->>Server: Install Ubuntu 24.04 LTS
Server->>Server: Configure network (static IP)
Admin->>Server: Update system
Admin->>Server: curl -sfL https://get.k3s.io | sh -
Server->>K3s: Download k3s binary
K3s->>Server: Configure containerd
K3s->>Server: Start k3s service
K3s->>Server: Initialize etcd (embedded)
K3s->>Server: Start API server
K3s->>Server: Deploy built-in CNI (Flannel)
K3s-->>Admin: Control plane ready
Admin->>Server: Retrieve node token
Admin->>Server: Install k3s agent on workers
K3s->>Server: Join workers to cluster
K3s-->>Admin: Cluster ready (5-10 minutes)
Implementation Details
Installation:
# Single-command k3s install
curl -sfL https://get.k3s.io | sh -
# Get token for workers
sudo cat /var/lib/rancher/k3s/server/node-token
# Install on workers
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 K3S_TOKEN=<token> sh -
Resource Requirements:
- RAM: 1GB total (512MB OS + 512MB k3s)
- CPU: 1-2 cores
- Disk: 20GB (10GB OS + 10GB containers)
Maintenance:
# OS updates
sudo apt update && sudo apt upgrade
# k3s upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.0+k3s1 sh -
# Or automatic via system-upgrade-controller
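The system-upgrade-controller route noted in the comment above works by applying an upgrade Plan resource to the cluster; a sketch (plan name, node selector, and version are illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true                       # cordon each node before upgrading it
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: Exists}
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.32.0+k3s1
EOF
```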
Pros and Cons
- Good, because most familiar Linux distribution (easy adoption)
- Good, because 5-year LTS support (10 years with Ubuntu Pro)
- Good, because k3s provides single-command setup
- Good, because extensive documentation and community support
- Good, because compatible with all Kubernetes tooling
- Good, because automatic security updates available
- Good, because general-purpose (can run non-K8s workloads)
- Good, because low learning curve
- Neutral, because moderate resource overhead (1GB RAM)
- Bad, because general-purpose OS has larger attack surface
- Bad, because requires manual OS updates and reboots
- Bad, because managing OS + Kubernetes lifecycle separately
- Bad, because imperative configuration (not GitOps-native)
- Bad, because mutable filesystem (configuration drift possible)
Option 2: Fedora Server with kubeadm
Host Kubernetes using Fedora Server with kubeadm (official Kubernetes tool) and CRI-O container runtime.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Fedora Server
participant K8s as Kubernetes Components
Admin->>Server: Install Fedora 41
Server->>Server: Configure network
Admin->>Server: Update system (dnf update)
Admin->>Server: Install CRI-O
Server->>Server: Configure CRI-O runtime
Admin->>Server: Install kubeadm/kubelet/kubectl
Server->>Server: Disable swap, load kernel modules
Server->>Server: Configure SELinux
Admin->>K8s: kubeadm init --cri-socket=unix:///var/run/crio/crio.sock
K8s->>Server: Generate certificates
K8s->>Server: Start etcd
K8s->>Server: Start API server
K8s-->>Admin: Control plane ready
Admin->>K8s: kubectl apply CNI
K8s->>Server: Deploy CNI pods
Admin->>K8s: kubeadm join (workers)
K8s-->>Admin: Cluster ready (15-20 minutes)
Implementation Details
Installation:
# Install CRI-O
sudo dnf install -y cri-o
sudo systemctl enable --now crio
# Install kubeadm components
sudo dnf install -y kubelet kubeadm kubectl
# Initialize cluster
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/crio/crio.sock
# Install CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
Resource Requirements:
- RAM: 2.2GB total (700MB OS + 1.5GB Kubernetes)
- CPU: 2+ cores
- Disk: 35GB (15GB OS + 20GB containers)
Maintenance:
# OS updates (every 13 months major upgrade)
sudo dnf update -y
# Kubernetes upgrade
sudo dnf update -y kubeadm
sudo kubeadm upgrade apply v1.32.0
sudo dnf update -y kubelet kubectl
Pros and Cons
- Good, because SELinux enabled by default (stronger than AppArmor)
- Good, because latest kernel and packages (bleeding edge)
- Good, because native CRI-O support (OpenShift compatibility)
- Good, because upstream for RHEL (enterprise patterns)
- Good, because kubeadm provides full control over cluster
- Neutral, because faster release cycle (latest features, but more upgrades)
- Bad, because short support cycle (13 months per release)
- Bad, because bleeding-edge can introduce instability
- Bad, because complex kubeadm setup (many manual steps)
- Bad, because higher resource overhead (2.2GB RAM)
- Bad, because SELinux configuration for Kubernetes is complex
- Bad, because frequent OS upgrades required (every 13 months)
- Bad, because managing OS + Kubernetes separately
- Bad, because imperative configuration (not GitOps-native)
Option 3: Talos Linux (purpose-built Kubernetes OS)
Use Talos Linux, an immutable, API-driven operating system designed specifically for Kubernetes with built-in cluster management.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Talos as Talos Linux
participant K8s as Kubernetes Components
Admin->>Server: Boot Talos ISO (PXE or USB)
Server->>Talos: Start in maintenance mode
Talos-->>Admin: API endpoint ready
Admin->>Admin: Generate configs (talosctl gen config)
Admin->>Talos: talosctl apply-config (controlplane.yaml)
Talos->>Server: Install Talos to disk
Server->>Server: Reboot from disk
Talos->>K8s: Start kubelet
Talos->>K8s: Start etcd
Talos->>K8s: Start API server
Admin->>Talos: talosctl bootstrap
Talos->>K8s: Initialize cluster
K8s->>Talos: Start controller-manager
K8s-->>Admin: Control plane ready
Admin->>K8s: Apply CNI
Admin->>Talos: Apply worker configs
Talos->>K8s: Join workers
K8s-->>Admin: Cluster ready (10-15 minutes)
Implementation Details
Installation:
# Generate machine configs
talosctl gen config homelab https://192.168.1.10:6443
# Apply config to control plane (booted from ISO)
talosctl apply-config --insecure --nodes 192.168.1.10 --file controlplane.yaml
# Bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.10 --endpoints 192.168.1.10
# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.10
# Add workers
talosctl apply-config --insecure --nodes 192.168.1.11 --file worker.yaml
Machine Configuration (declarative YAML):
version: v1alpha1
machine:
type: controlplane
install:
disk: /dev/sda
network:
hostname: control-plane-1
interfaces:
- interface: eth0
addresses:
- 192.168.1.10/24
cluster:
clusterName: homelab
controlPlane:
endpoint: https://192.168.1.10:6443
network:
cni:
name: custom
urls:
- https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
Resource Requirements:
- RAM: 768MB total (256MB OS + 512MB Kubernetes)
- CPU: 1-2 cores
- Disk: 10-15GB (500MB OS + 10GB containers)
Maintenance:
# Upgrade Talos (OS + Kubernetes)
talosctl upgrade --nodes 192.168.1.10 --image ghcr.io/siderolabs/installer:v1.9.0
# Upgrade Kubernetes version
talosctl upgrade-k8s --nodes 192.168.1.10 --to 1.32.0
# Apply config changes
talosctl apply-config --nodes 192.168.1.10 --file controlplane.yaml
Pros and Cons
- Good, because Kubernetes built-in (no separate installation)
- Good, because minimal attack surface (no SSH, shell, package manager)
- Good, because immutable infrastructure (config drift impossible)
- Good, because API-driven management (GitOps-friendly)
- Good, because lowest resource overhead (~768MB RAM)
- Good, because declarative configuration (YAML in version control)
- Good, because secure by default (no manual hardening)
- Good, because smallest disk footprint (~500MB OS)
- Good, because designed specifically for Kubernetes
- Good, because simple declarative upgrades (OS + K8s)
- Good, because UEFI Secure Boot support
- Neutral, because smaller community (but active and helpful)
- Bad, because steep learning curve (paradigm shift)
- Bad, because limited to Kubernetes workloads only
- Bad, because troubleshooting without shell requires different approach
- Bad, because relatively new (less mature than Ubuntu/Fedora)
- Bad, because no escape hatch for manual intervention
Option 4: Harvester HCI (hyperconverged platform)
Use Harvester, a hyperconverged infrastructure platform built on K3s and KubeVirt for unified VM + container management.
Architecture Overview
sequenceDiagram
participant Admin
participant Server as Bare Metal Server
participant Harvester as Harvester HCI
participant K3s as K3s / KubeVirt
participant Storage as Longhorn Storage
Admin->>Server: Boot Harvester ISO
Server->>Harvester: Installation wizard
Admin->>Harvester: Configure cluster (VIP, storage)
Harvester->>Server: Install RancherOS 2.0
Harvester->>Server: Install K3s
Server->>Server: Reboot
Harvester->>K3s: Start K3s server
K3s->>Storage: Deploy Longhorn
K3s->>Server: Deploy KubeVirt
K3s->>Server: Deploy multus CNI
Harvester-->>Admin: Web UI ready
Admin->>Harvester: Add nodes
Harvester->>K3s: Join cluster
K3s-->>Admin: Cluster ready (20-30 minutes)
Implementation Details
Installation: Interactive ISO wizard or cloud-init config
Resource Requirements:
- RAM: 8GB minimum per node (16GB+ recommended)
- CPU: 4+ cores per node
- Disk: 250GB+ per node (100GB OS + 150GB storage)
- Nodes: 3+ for production HA
Features:
- Web UI management
- Built-in storage (Longhorn)
- VM support (KubeVirt)
- Live migration
- Rancher integration
Pros and Cons
- Good, because unified VM + container platform
- Good, because built-in K3s (Kubernetes included)
- Good, because web UI simplifies management
- Good, because built-in persistent storage (Longhorn)
- Good, because VM live migration
- Good, because Rancher integration
- Neutral, because immutable OS layer
- Bad, because very heavy resource requirements (8GB+ RAM)
- Bad, because complex architecture (KubeVirt, Longhorn, multus)
- Bad, because overkill for container-only workloads
- Bad, because larger attack surface (web UI, VM layer)
- Bad, because requires 3+ nodes for HA (not single-node friendly)
- Bad, because steep learning curve for full feature set
Detailed Analysis
For in-depth analysis of each operating system:
Ubuntu Server Analysis
- Installation methods (kubeadm, k3s, MicroK8s)
- Cluster initialization sequences
- Maintenance requirements and upgrade procedures
- Resource overhead and security posture
Fedora Server Analysis
- kubeadm with CRI-O installation
- SELinux configuration for Kubernetes
- Rapid release cycle implications
- RHEL ecosystem compatibility
Talos Linux Analysis
- API-driven, immutable architecture
- Declarative configuration model
- Security-first design principles
- Production readiness and advanced features
Harvester HCI Analysis
- Hyperconverged infrastructure capabilities
- VM + container unified platform
- KubeVirt and Longhorn integration
- Multi-node cluster requirements
Key Findings Summary
Resource efficiency comparison:
- ✅ Talos: 768MB RAM, 500MB disk (most efficient)
- ✅ Ubuntu + k3s: 1GB RAM, 20GB disk (efficient)
- ⚠️ Fedora + kubeadm: 2.2GB RAM, 35GB disk (moderate)
- ❌ Harvester: 8GB+ RAM, 250GB+ disk (heavy)
Security posture comparison:
- ✅ Talos: Minimal attack surface (no SSH/shell, immutable)
- ✅ Fedora: SELinux by default (strong MAC)
- ⚠️ Ubuntu: AppArmor (moderate security)
- ⚠️ Harvester: Larger attack surface (web UI, VM layer)
Operational complexity comparison:
- ✅ Ubuntu + k3s: Single command install, familiar management
- ✅ Talos: Declarative, automated (after learning curve)
- ⚠️ Fedora + kubeadm: Manual kubeadm steps, frequent OS upgrades
- ❌ Harvester: Complex HCI architecture, heavy requirements
Decision Matrix
| Criterion | Ubuntu + k3s | Fedora + kubeadm | Talos Linux | Harvester |
|---|---|---|---|---|
| Setup Simplicity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Maintenance Burden | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Security Posture | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Resource Efficiency | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐ |
| Learning Curve | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Community Support | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Immutability | ⭐ | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| GitOps-Friendly | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Purpose-Built | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Overall Score | 31/45 | 25/45 | 39/45 | 29/45 |
Talos Linux scores highest for Kubernetes-dedicated homelab infrastructure prioritizing security, efficiency, and GitOps workflows.
Trade-offs Analysis
Choosing Talos Linux:
- Wins: Best security, lowest overhead, declarative configuration, minimal maintenance
- Loses: Steeper learning curve, no shell access, smaller community
Choosing Ubuntu + k3s:
- Wins: Easiest adoption, largest community, general-purpose flexibility
- Loses: Higher attack surface, manual OS management, imperative config
Choosing Fedora + kubeadm:
- Wins: Latest features, SELinux, enterprise compatibility
- Loses: Frequent OS upgrades, complex setup, higher overhead
Choosing Harvester:
- Wins: VM + container unified platform, web UI
- Loses: Heavy resources, complex architecture, overkill for K8s-only
For a Kubernetes-dedicated homelab prioritizing security and efficiency, Talos Linux’s benefits outweigh the learning curve investment.
Future Considerations
- Team Growth: If team grows beyond single person, reassess Ubuntu for familiarity
- VM Requirements: If VM workloads emerge, consider Harvester or KubeVirt on Talos
- Enterprise Patterns: If RHEL compatibility needed, reconsider Fedora/CentOS Stream
- Maintenance Burden: If Talos learning curve proves too steep, fallback to k3s
- Talos Maturity: Monitor Talos ecosystem growth and production adoption
- Issue #598 - story(docs): create adr for server operating system
5 - [0005] Network Boot Infrastructure Implementation on Google Cloud
Evaluate implementation approaches for deploying network boot infrastructure on Google Cloud Platform using UEFI HTTP boot, comparing custom server implementation versus Matchbox-based solution.
Context and Problem Statement
ADR-0002 established that network boot infrastructure will be hosted on a cloud provider accessed via WireGuard VPN. ADR-0003 selected Google Cloud Platform as the hosting provider to consolidate infrastructure and leverage existing expertise.
The remaining question is: How should the network boot server itself be implemented?
This decision affects:
- Development Effort: Time required to build, test, and maintain the solution
- Feature Completeness: Capabilities for boot image management, machine mapping, and provisioning workflows
- Operational Complexity: Deployment, monitoring, and troubleshooting burden
- Security: Boot image integrity, access control, and audit capabilities
- Scalability: Ability to grow from single home lab to multiple environments
The boot server must handle:
- HTTP/HTTPS requests for UEFI boot scripts, kernels, initrd images, and cloud-init configurations
- Machine-to-image mapping to serve appropriate boot files based on MAC address, hardware profile, or tags
- Boot image lifecycle management including upload, versioning, and rollback capabilities
Hardware-Specific Context
The target bare metal servers (HP DL360 Gen 9) have the following network boot capabilities:
- UEFI HTTP Boot: Supported in iLO 4 firmware v2.40+ (released 2016)
- TLS Support: Server-side TLS only (no client certificate authentication)
- Boot Process: Firmware handles initial HTTP requests directly (no PXE/TFTP chain loading required)
- Configuration: Boot URL configured via iLO RBSU or UEFI System Utilities
Security Implications: Since the servers cannot present client certificates for mTLS authentication with Cloudflare, the WireGuard VPN serves as the secure transport layer for boot traffic. The HTTP boot server is only accessible through the VPN tunnel.
Reference: HP DL360 Gen 9 Network Boot Analysis
Decision Drivers
- Time to Production: Minimize time to get a working network boot infrastructure
- Feature Requirements: Must support machine-specific boot configurations, image versioning, and cloud-init integration
- Maintenance Burden: Prefer solutions that minimize ongoing maintenance and updates
- GCP Integration: Should leverage GCP services (Cloud Storage, Secret Manager, IAM)
- Security: Boot images must be served securely with access control and integrity verification
- Observability: Comprehensive logging and monitoring for troubleshooting boot failures
- Cost: Minimize infrastructure costs while meeting functional requirements
- Future Flexibility: Ability to extend or customize as needs evolve
Considered Options
- Option 1: Custom server implementation (Go-based)
- Option 2: Matchbox-based solution
Decision Outcome
Chosen option: “Option 1: Custom implementation”, because:
- UEFI HTTP Boot Simplification: Elimination of TFTP/PXE dramatically reduces implementation complexity
- Cloud Run Deployment: HTTP-only boot enables serverless deployment (~$5/month vs $8-17/month)
- Development Time Manageable: UEFI HTTP boot reduces custom development to 2-3 weeks
- Full Control: Custom implementation maintains flexibility for future home lab requirements
- GCP Native Integration: Direct Cloud Storage, Firestore, Secret Manager, and IAM integration
- Existing Framework: Leverages z5labs/humus patterns already in use across services
- HTTP REST API: Native HTTP REST admin API via the z5labs/humus framework provides better integration with existing tooling
Consequences
- Good, because UEFI HTTP boot eliminates TFTP complexity entirely
- Good, because Cloud Run deployment reduces operational overhead and cost
- Good, because leverages existing z5labs/humus framework and Go expertise
- Good, because GCP native integration (Cloud Storage, Firestore, Secret Manager, IAM)
- Good, because full control over implementation enables future customization
- Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
- Good, because OpenTelemetry observability built-in from existing patterns
- Neutral, because requires 2-3 weeks development time vs 1 week for Matchbox setup
- Neutral, because ongoing maintenance responsibility (no upstream project support)
- Bad, because custom implementation may miss edge cases that Matchbox handles
- Bad, because reinvents machine matching and boot configuration patterns
- Bad, because Cloud Run cold start latency needs monitoring (mitigated with min instances = 1)
Confirmation
The implementation success will be validated by:
- Successfully deploying custom boot server on GCP Cloud Run
- Successfully network booting HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
- Confirming iLO 4 firmware v2.40+ compatibility with HTTP boot workflow
- Validating boot image upload and versioning workflows via HTTP REST API
- Measuring Cloud Run cold start latency for boot requests (target: < 100ms)
- Measuring boot file request latency for kernel/initrd downloads (target: < 100ms)
- Confirming Cloud Storage integration for boot asset storage
- Testing machine-to-image mapping based on MAC address using Firestore
- Validating WireGuard VPN security for boot traffic (compensating for lack of client cert support)
- Verifying OpenTelemetry observability integration with Cloud Monitoring
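As an example, the REST API checks could be exercised with requests like the following (hostname, payload shape, and field names are assumptions based on the endpoints in the boot image lifecycle diagram below):

```bash
# Upload a boot image (multipart fields are illustrative)
curl -X POST https://boot.internal.example/api/v1/images \
  -F kernel=@vmlinuz -F initrd=@initrd.img \
  -F 'metadata={"version":"ubuntu-22.04","tags":["base"]}'

# Map a machine to the uploaded image
curl -X POST https://boot.internal.example/api/v1/machines \
  -H 'Content-Type: application/json' \
  -d '{"mac":"aa:bb:cc:dd:ee:ff","image_id":"<id from upload>","profile":"ubuntu-22.04-server"}'

# Roll a machine back to its previous image
curl -X POST https://boot.internal.example/api/v1/machines/aa:bb:cc:dd:ee:ff/rollback
```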
Pros and Cons of the Options
Option 1: Custom Server Implementation (Go-based)
Build a custom network boot server in Go, leveraging the existing z5labs/humus framework for HTTP services.
Architecture Overview
architecture-beta
group gcp(cloud)[GCP VPC]
service wg_nlb(internet)[Network LB] in gcp
service wireguard(server)[WireGuard Gateway] in gcp
service https_lb(internet)[HTTPS LB] in gcp
service compute(server)[Compute Engine] in gcp
service storage(database)[Cloud Storage] in gcp
service firestore(database)[Firestore] in gcp
service secrets(disk)[Secret Manager] in gcp
service monitoring(internet)[Cloud Monitoring] in gcp
group homelab(cloud)[Home Lab]
service udm(server)[UDM Pro] in homelab
service servers(server)[Bare Metal Servers] in homelab
servers:L -- R:udm
udm:R -- L:wg_nlb
wg_nlb:R -- L:wireguard
wireguard:R -- L:https_lb
https_lb:R -- L:compute
compute:B --> T:storage
compute:B --> T:firestore
compute:R --> L:secrets
compute:T --> B:monitoring
Components:
- Boot Server: Go service deployed to Cloud Run (or Compute Engine VM as fallback)
- HTTP/HTTPS server (using z5labs/humus framework with OpenAPI)
- UEFI HTTP boot endpoint serving boot scripts and assets
- HTTP REST admin API for boot configuration management
- Cloud Storage: Buckets for boot images, boot scripts, kernels, initrd files
- Firestore/Datastore: Machine-to-image mapping database (MAC → boot profile)
- Secret Manager: WireGuard keys, TLS certificates (optional for HTTPS boot)
- Cloud Monitoring: Metrics for boot requests, success/failure rates, latency
Boot Image Lifecycle
sequenceDiagram
participant Admin
participant API as Boot Server API
participant Storage as Cloud Storage
participant DB as Firestore
participant Monitor as Cloud Monitoring
Note over Admin,Monitor: Upload Boot Image
Admin->>API: POST /api/v1/images (kernel, initrd, metadata)
API->>API: Validate image integrity (checksum)
API->>Storage: Upload kernel to gs://boot-images/kernels/
API->>Storage: Upload initrd to gs://boot-images/initrd/
API->>DB: Store metadata (version, checksum, tags)
API->>Monitor: Log upload event
API->>Admin: 201 Created (image ID)
Note over Admin,Monitor: Map Machine to Image
Admin->>API: POST /api/v1/machines (MAC, image_id, profile)
API->>DB: Store machine mapping
API->>Admin: 201 Created
Note over Admin,Monitor: UEFI HTTP Boot Request
participant Server as Home Lab Server
Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
Server->>API: HTTP GET /boot?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
API->>DB: Query machine mapping by MAC
API->>API: Generate iPXE script (kernel, initrd URLs)
API->>Monitor: Log boot script request
API->>Server: Send iPXE script
Server->>API: HTTP GET /kernels/ubuntu-22.04.img
API->>Storage: Fetch kernel from Cloud Storage
API->>Monitor: Log kernel download (size, duration)
API->>Server: Stream kernel file
Server->>API: HTTP GET /initrd/ubuntu-22.04.img
API->>Storage: Fetch initrd from Cloud Storage
API->>Monitor: Log initrd download
API->>Server: Stream initrd file
Server->>Server: Boot into OS
Note over Admin,Monitor: Rollback Image Version
Admin->>API: POST /api/v1/machines/{mac}/rollback
API->>DB: Update machine mapping to previous image_id
API->>Monitor: Log rollback event
API->>Admin: 200 OK
Implementation Details
Development Stack:
- Language: Go 1.24 (leverage existing Go expertise)
- HTTP Framework: z5labs/humus (consistent with existing services)
- UEFI Boot: Standard HTTP handlers (no special libraries needed)
- Storage Client: cloud.google.com/go/storage
- Database: Firestore for machine mappings (or simple JSON config in Cloud Storage)
- Observability: OpenTelemetry (metrics, traces, logs to Cloud Monitoring/Trace)
Deployment:
- Cloud Run (preferred - HTTP-only boot enables serverless deployment):
- Min instances: 1 (ensures fast boot response, avoids cold start delays)
- Max instances: 2 (home lab scale)
- Memory: 512MB
- CPU: 1 vCPU
- Health checks: /health/startup, /health/liveness
- Concurrency: 10 requests per instance
- Alternative - Compute Engine VM (if Cloud Run latency unacceptable):
- e2-micro instance ($6.50/month)
- Container-Optimized OS with Docker
- systemd service for boot server
- Health checks: /health/startup, /health/liveness
- Networking:
- VPC firewall: Allow TCP/80, TCP/443 from WireGuard subnet (no UDP/69 needed)
- Static internal IP for boot server (Compute Engine) or HTTPS Load Balancer (Cloud Run)
- Cloud NAT for outbound connectivity (Cloud Storage access)
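A sketch of the preferred Cloud Run deployment with the settings above (project, region, image, and service names are placeholders):

```bash
gcloud run deploy boot-server \
  --image=us-docker.pkg.dev/my-project/boot/boot-server:latest \
  --region=us-central1 \
  --min-instances=1 --max-instances=2 \
  --memory=512Mi --cpu=1 --concurrency=10 \
  --ingress=internal-and-cloud-load-balancing
```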
Configuration Management:
- Machine mappings stored in Firestore or Cloud Storage JSON files
- Boot profiles defined in YAML (similar to Matchbox groups):
profiles:
- name: ubuntu-22.04-server
kernel: gs://boot-images/kernels/ubuntu-22.04.img
initrd: gs://boot-images/initrd/ubuntu-22.04.img
cmdline: "console=tty0 console=ttyS0"
cloud_init: gs://boot-images/cloud-init/ubuntu-base.yaml
machines:
- mac: "aa:bb:cc:dd:ee:ff"
profile: ubuntu-22.04-server
hostname: node-01
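Where Firestore is used instead of the YAML file, MAC-based matching could look like the following sketch; the collection name, field tags, and project ID are assumptions for illustration:
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/firestore"
)

// MachineMapping mirrors the machines entries above; field names are illustrative.
type MachineMapping struct {
	MAC      string `firestore:"mac"`
	Profile  string `firestore:"profile"`
	Hostname string `firestore:"hostname"`
}

// lookupMachine resolves a boot profile from a MAC address.
func lookupMachine(ctx context.Context, client *firestore.Client, mac string) (*MachineMapping, error) {
	iter := client.Collection("machines").Where("mac", "==", mac).Limit(1).Documents(ctx)
	defer iter.Stop()
	doc, err := iter.Next()
	if err != nil {
		return nil, fmt.Errorf("no mapping for %s: %w", mac, err)
	}
	var m MachineMapping
	if err := doc.DataTo(&m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	ctx := context.Background()
	client, err := firestore.NewClient(ctx, "my-gcp-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	m, err := lookupMachine(ctx, client, "aa:bb:cc:dd:ee:ff")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("boot %s with profile %s\n", m.Hostname, m.Profile)
}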
Cost Breakdown:
Option A: Cloud Run Deployment (Preferred):
| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 512MB, always-on) | $3.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$5.18 |
Option B: Compute Engine Deployment (If Cloud Run latency unacceptable):
| Component | Monthly Cost |
|---|---|
| e2-micro VM (boot server) | $6.50 |
| Cloud Storage (50GB boot images) | $1.00 |
| Firestore (minimal reads/writes) | $0.50 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |
Pros and Cons
- Good, because UEFI HTTP boot eliminates TFTP complexity entirely
- Good, because Cloud Run deployment option reduces operational overhead and infrastructure cost
- Good, because full control over boot server implementation and features
- Good, because leverages existing Go expertise and z5labs/humus framework patterns
- Good, because seamless GCP integration (Cloud Storage, Firestore, Secret Manager, IAM)
- Good, because minimal dependencies (no external projects to track)
- Good, because customizable to specific home lab requirements
- Good, because OpenTelemetry observability built-in from existing patterns
- Good, because can optimize for home lab scale (< 20 machines)
- Good, because lightweight implementation (no unnecessary features)
- Good, because simplified testing (HTTP-only, no TFTP/PXE edge cases)
- Good, because standard HTTP serving is well-understood (lower risk than TFTP)
- Neutral, because development effort required (2-3 weeks for MVP, reduced from 3-4 weeks)
- Neutral, because requires ongoing maintenance and security updates
- Neutral, because Cloud Run cold start latency needs validation (POC required)
- Bad, because reinvents machine matching and boot configuration patterns
- Bad, because testing network boot scenarios still requires hardware
- Bad, because potential for bugs in custom implementation
- Bad, because no community support or established best practices
- Bad, because development time still longer than Matchbox (2-3 weeks vs 1 week)
Option 2: Matchbox-Based Solution
Deploy Matchbox, an open-source network boot server developed by CoreOS (now part of Red Hat), to handle UEFI HTTP boot workflows.
Architecture Overview
architecture-beta
group gcp(cloud)[GCP VPC]
service wg_nlb(internet)[Network LB] in gcp
service wireguard(server)[WireGuard Gateway] in gcp
service https_lb(internet)[HTTPS LB] in gcp
service compute(server)[Compute Engine] in gcp
service storage(database)[Cloud Storage] in gcp
service secrets(disk)[Secret Manager] in gcp
service monitoring(internet)[Cloud Monitoring] in gcp
group homelab(cloud)[Home Lab]
service udm(server)[UDM Pro] in homelab
service servers(server)[Bare Metal Servers] in homelab
servers:L -- R:udm
udm:R -- L:wg_nlb
wg_nlb:R -- L:wireguard
wireguard:R -- L:https_lb
https_lb:R -- L:compute
compute:B --> T:storage
compute:R --> L:secrets
compute:T --> B:monitoring
Components:
- Matchbox Server: Container deployed to Cloud Run or Compute Engine VM
- HTTP/gRPC APIs for boot workflows and configuration
- UEFI HTTP boot support (TFTP disabled)
- Machine grouping and profile templating
- Ignition, Cloud-Init, and generic boot support
- Cloud Storage: Backend for boot assets (mounted via gcsfuse or synced periodically)
- Local Storage (Compute Engine only): /var/lib/matchbox for assets and configuration (synced from Cloud Storage)
- Secret Manager: WireGuard keys, Matchbox TLS certificates
- Cloud Monitoring: Logs from Matchbox container, custom metrics via log parsing
Boot Image Lifecycle
sequenceDiagram
participant Admin
participant CLI as matchbox CLI / API
participant Matchbox as Matchbox Server
participant Storage as Cloud Storage
participant Monitor as Cloud Monitoring
Note over Admin,Monitor: Upload Boot Image
Admin->>CLI: Upload kernel/initrd via gRPC API
CLI->>Matchbox: gRPC CreateAsset(kernel, initrd)
Matchbox->>Matchbox: Validate asset integrity
Matchbox->>Matchbox: Store to /var/lib/matchbox/assets/
Matchbox->>Storage: Sync to gs://boot-assets/ (via sidecar script)
Matchbox->>Monitor: Log asset upload event
Matchbox->>CLI: Asset ID, checksum
Note over Admin,Monitor: Create Boot Profile
Admin->>CLI: Create profile YAML (kernel, initrd, cmdline)
CLI->>Matchbox: gRPC CreateProfile(profile.yaml)
Matchbox->>Matchbox: Store to /var/lib/matchbox/profiles/
Matchbox->>Storage: Sync profiles to gs://boot-config/
Matchbox->>CLI: Profile ID
Note over Admin,Monitor: Create Machine Group
Admin->>CLI: Create group YAML (MAC selector, profile mapping)
CLI->>Matchbox: gRPC CreateGroup(group.yaml)
Matchbox->>Matchbox: Store to /var/lib/matchbox/groups/
Matchbox->>Storage: Sync groups to gs://boot-config/
Matchbox->>CLI: Group ID
Note over Admin,Monitor: UEFI HTTP Boot Request
participant Server as Home Lab Server
Note right of Server: iLO 4 firmware v2.40+ initiates HTTP request directly
Server->>Matchbox: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff (via WireGuard VPN)
Matchbox->>Matchbox: Match MAC to group
Matchbox->>Matchbox: Render iPXE template with profile
Matchbox->>Monitor: Log boot request (MAC, group, profile)
Matchbox->>Server: Send iPXE script
Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-kernel.img
Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
Matchbox->>Monitor: Log asset download (size, duration)
Matchbox->>Server: Stream kernel file
Server->>Matchbox: HTTP GET /assets/ubuntu-22.04-initrd.img
Matchbox->>Matchbox: Serve from /var/lib/matchbox/assets/
Matchbox->>Monitor: Log asset download
Matchbox->>Server: Stream initrd file
Server->>Server: Boot into OS
Note over Admin,Monitor: Rollback Machine Group
Admin->>CLI: Update group YAML (change profile reference)
CLI->>Matchbox: gRPC UpdateGroup(group.yaml)
Matchbox->>Matchbox: Update /var/lib/matchbox/groups/
Matchbox->>Storage: Sync updated group config
Matchbox->>Monitor: Log group update
Matchbox->>CLI: Success
Implementation Details
Matchbox Deployment:
- Container: quay.io/poseidon/matchbox:latest (official image)
- Deployment Options:
- Cloud Run (preferred - HTTP-only boot enables serverless deployment):
- Min instances: 1 (ensures fast boot response)
- Memory: 1GB RAM (Matchbox recommendation)
- CPU: 1 vCPU
- Storage: Cloud Storage for assets/profiles/groups (via HTTP API)
- Compute Engine VM (if persistent local storage preferred):
- e2-small instance ($14/month, 2GB RAM recommended for Matchbox)
- /var/lib/matchbox: Persistent disk (10GB SSD, $1.70/month)
- Cloud Storage sync: Periodic backup of assets/profiles/groups to gs://matchbox-config/
- Option: Use gcsfuse to mount Cloud Storage directly (adds latency but simplifies backups)
Configuration Structure:
/var/lib/matchbox/
├── assets/ # Boot images (kernels, initrds, ISOs)
│ ├── ubuntu-22.04-kernel.img
│ ├── ubuntu-22.04-initrd.img
│ └── flatcar-stable.img.gz
├── profiles/ # Boot profiles (YAML)
│ ├── ubuntu-server.yaml
│ └── flatcar-container.yaml
└── groups/ # Machine groups (YAML)
├── default.yaml
├── node-01.yaml
└── storage-nodes.yaml
Example Profile (profiles/ubuntu-server.yaml):
id: ubuntu-22.04-server
name: Ubuntu 22.04 LTS Server
boot:
kernel: /assets/ubuntu-22.04-kernel.img
initrd:
- /assets/ubuntu-22.04-initrd.img
args:
- console=tty0
- console=ttyS0
- ip=dhcp
ignition_id: ubuntu-base.yaml
Example Group (groups/node-01.yaml):
id: node-01
name: Node 01 - Ubuntu Server
profile: ubuntu-22.04-server
selector:
mac: "aa:bb:cc:dd:ee:ff"
metadata:
hostname: node-01.homelab.local
ssh_authorized_keys:
- "ssh-ed25519 AAAA..."
GCP Integration:
- Cloud Storage Sync: Cron job or sidecar container to sync /var/lib/matchbox to Cloud Storage:
# Sync every 5 minutes
*/5 * * * * gsutil -m rsync -r /var/lib/matchbox gs://matchbox-config/
- Secret Manager: Store Matchbox TLS certificates for gRPC API authentication
- Cloud Monitoring: Ship Matchbox logs to Cloud Logging, parse for metrics:
- Boot request count by MAC/group
- Asset download success/failure rates
- Boot script vs asset request distribution (TFTP is disabled in this deployment)
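As a sketch of how such metrics could be derived, a log-based metric can be created from Cloud Logging; the metric name and filter below are illustrative and depend on Matchbox's actual log format and deployment target:
# Illustrative log-based metric; adjust the filter to Matchbox's log output
gcloud logging metrics create matchbox_boot_requests \
  --description="Matchbox boot script requests" \
  --log-filter='resource.type="cloud_run_revision" AND textPayload:"GET /boot.ipxe"'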
Networking:
- VPC firewall: Allow TCP/8080 (HTTP), TCP/8081 (gRPC) from WireGuard subnet (no UDP/69 needed)
- Optional: Internal load balancer if high availability required (adds ~$18/month)
- Note: Cloud Run deployment includes integrated HTTPS load balancing
Cost Breakdown:
Option A: Cloud Run Deployment (Preferred):
| Component | Monthly Cost |
|---|---|
| Cloud Run (1 min instance, 1GB RAM, always-on) | $7.00 |
| Cloud Storage (50GB boot images) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$8.18 |
Option B: Compute Engine Deployment (If persistent local storage preferred):
| Component | Monthly Cost |
|---|---|
| e2-small VM (Matchbox server) | $14.00 |
| Persistent SSD (10GB) | $1.70 |
| Cloud Storage (50GB backups) | $1.00 |
| Egress (10 boots × 150MB) | $0.18 |
| Total | ~$16.88 |
Pros and Cons
- Good, because HTTP-only boot enables Cloud Run deployment (reduces cost significantly)
- Good, because UEFI HTTP boot eliminates TFTP complexity and potential failure points
- Good, because production-ready boot server with extensive real-world usage
- Good, because feature-complete with machine grouping, templating, and multi-OS support
- Good, because gRPC API for programmatic boot configuration management
- Good, because supports Ignition (Flatcar, CoreOS), Cloud-Init, and generic boot workflows
- Good, because well-documented with established best practices
- Good, because active community and upstream maintenance (Red Hat/CoreOS)
- Good, because reduces development time to days (deploy + configure vs weeks of coding)
- Good, because avoids reinventing network boot patterns (machine matching, boot configuration)
- Good, because proven security model (TLS for gRPC, asset integrity checks)
- Neutral, because requires learning Matchbox configuration patterns (YAML profiles/groups)
- Neutral, because containerized deployment (Docker on Compute Engine or Cloud Run)
- Neutral, because Cloud Run deployment option competitive with custom implementation cost
- Bad, because introduces external dependency (Matchbox project maintenance)
- Bad, because some features unnecessary for home lab scale (large-scale provisioning, etcd backend)
- Bad, because less control over implementation details (limited customization)
- Bad, because Cloud Storage integration requires custom sync scripts (Matchbox doesn’t natively support GCS backend)
- Bad, because dependency on upstream for security patches and bug fixes
UEFI HTTP Boot Architecture
This section documents the UEFI HTTP boot capability that fundamentally changes the network boot infrastructure design.
Boot Process Overview
Traditional PXE Boot (NOT USED - shown for comparison):
sequenceDiagram
participant Server as Bare Metal Server
participant DHCP as DHCP Server
participant TFTP as TFTP Server
participant HTTP as HTTP Server
Note over Server,HTTP: Traditional PXE Boot Chain (NOT USED)
Server->>DHCP: DHCP Discover
DHCP->>Server: DHCP Offer (TFTP server, boot filename)
Server->>TFTP: TFTP GET /pxelinux.0
TFTP->>Server: Send PXE bootloader
Server->>TFTP: TFTP GET /ipxe.efi
TFTP->>Server: Send iPXE binary
Server->>HTTP: HTTP GET /boot.ipxe
HTTP->>Server: Send boot script
Server->>HTTP: HTTP GET /kernel, /initrd
HTTP->>Server: Stream boot files
UEFI HTTP Boot (ACTUAL IMPLEMENTATION):
sequenceDiagram
participant Server as HP DL360 Gen 9<br/>(iLO 4 v2.40+)
participant DHCP as DHCP Server<br/>(UDM Pro)
participant VPN as WireGuard VPN
participant HTTP as HTTP Boot Server<br/>(GCP Cloud Run)
Note over Server,HTTP: UEFI HTTP Boot (ACTUAL IMPLEMENTATION)
Server->>DHCP: DHCP Discover
DHCP->>Server: DHCP Offer (boot URL: http://boot.internal/boot.ipxe?mac=...)
Note right of Server: Firmware initiates HTTP request directly<br/>(no TFTP/PXE chain loading)
Server->>VPN: WireGuard tunnel established
Server->>HTTP: HTTP GET /boot.ipxe?mac=aa:bb:cc:dd:ee:ff
HTTP->>Server: Send boot script with kernel/initrd URLs
Server->>HTTP: HTTP GET /assets/talos-kernel.img
HTTP->>Server: Stream kernel (via WireGuard)
Server->>HTTP: HTTP GET /assets/talos-initrd.img
HTTP->>Server: Stream initrd (via WireGuard)
Server->>Server: Boot into OS
Key Differences
| Aspect | Traditional PXE | UEFI HTTP Boot |
|---|---|---|
| Initial Protocol | TFTP (UDP/69) | HTTP (TCP/80) or HTTPS (TCP/443) |
| Boot Loader | Requires TFTP transfer of iPXE binary | Firmware has HTTP client built-in |
| Chain Loading | PXE → TFTP → iPXE → HTTP | Direct HTTP boot (no chain) |
| Firewall Rules | UDP/69, TCP/80, TCP/443 | TCP/80, TCP/443 only |
| Cloud Run Support | ❌ (UDP not supported) | ✅ (HTTP-only) |
| Transfer Speed | ~1-5 Mbps (TFTP) | 10-100 Mbps (HTTP) |
| Complexity | High (multiple protocols) | Low (HTTP-only) |
Security Architecture
Challenge: HP DL360 Gen 9 UEFI HTTP boot does not support client-side TLS certificates (mTLS).
Solution: WireGuard VPN provides transport-layer security:
flowchart LR
subgraph homelab[Home Lab]
server[HP DL360 Gen 9<br/>UEFI HTTP Boot<br/>iLO 4 v2.40+]
udm[UDM Pro<br/>WireGuard Client]
end
subgraph gcp[Google Cloud Platform]
wg_gw[WireGuard Gateway<br/>Compute Engine]
cr[Boot Server<br/>Cloud Run]
end
server -->|HTTP| udm
udm -->|Encrypted WireGuard Tunnel| wg_gw
wg_gw -->|HTTP| cr
style server fill:#f9f,stroke:#333
style udm fill:#bbf,stroke:#333
style wg_gw fill:#bfb,stroke:#333
style cr fill:#fbb,stroke:#333
Why WireGuard instead of Cloudflare mTLS?
- Cloudflare mTLS Limitation: Requires client certificates at TLS layer
- UEFI Firmware Limitation: Cannot present client certificates during TLS handshake
- WireGuard Solution: Provides mutual authentication at network layer (pre-shared keys)
- Security Equivalent: WireGuard offers the same security properties as mTLS:
- Mutual authentication (both endpoints authenticated)
- Confidentiality (all traffic encrypted)
- Integrity (authenticated encryption via ChaCha20-Poly1305)
- No Internet exposure (boot server only accessible via VPN)
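For illustration, the GCP-side gateway configuration might look like the following wg-quick style sketch; keys, subnets, and the listen port are all placeholders:
[Interface]
# WireGuard gateway on the GCP Compute Engine VM (placeholder key and tunnel subnet)
PrivateKey = <gateway-private-key>
Address = 10.100.0.1/24
ListenPort = 51820

[Peer]
# UDM Pro in the home lab; AllowedIPs covers the tunnel address and home lab LAN (placeholders)
PublicKey = <udm-public-key>
PresharedKey = <optional-preshared-key>
AllowedIPs = 10.100.0.2/32, 192.168.1.0/24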
Firmware Configuration
HP iLO 4 UEFI HTTP Boot Setup:
Access Configuration:
- iLO web interface → Remote Console → Power On → Press F9 (RBSU)
- Or: Direct RBSU access during POST (Press F9)
Enable UEFI HTTP Boot:
- Navigate: System Configuration → BIOS/Platform Configuration (RBSU) → Network Options
- Set Network Boot to Enabled
- Set Boot Mode to UEFI (not Legacy BIOS)
- Enable UEFI HTTP Boot Support
Configure NIC:
- Navigate: RBSU → Network Options → [FlexibleLOM/PCIe NIC]
- Set Option ROM to Enabled (required for UEFI boot option to appear)
- Set Network Boot to Enabled
- Configure IPv4/IPv6 settings (DHCP or static)
Set Boot Order:
- Navigate: RBSU → Boot Options → UEFI Boot Order
- Move network device to top priority
Configure Boot URL (via DHCP or static):
- DHCP option 67: http://10.x.x.x/boot.ipxe?mac=${net0/mac}
- Or: Static configuration in UEFI System Utilities
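On the DHCP side, a dnsmasq-style sketch follows (the UDM Pro's DHCP server is dnsmasq-based, but the exact configuration surface varies by UniFi version, so treat every line as illustrative):
# Tag x86-64 UEFI HTTP boot clients (client architecture 16 per the RFC 4578 registry)
dhcp-match=set:httpboot,option:client-arch,16
# UEFI HTTP boot expects the server to echo vendor class "HTTPClient" (option 60)
dhcp-option-force=tag:httpboot,60,HTTPClient
# Hand matching clients the boot URL (DHCP option 67); URL is a placeholder
dhcp-boot=tag:httpboot,http://10.x.x.x/boot.ipxe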
Required Firmware Versions:
- iLO 4: v2.40 or later (for UEFI HTTP boot support)
- System ROM: P89 v2.60 or later (recommended)
Verification:
# Check iLO firmware version via REST API
curl -k -u admin:password https://ilo-address/redfish/v1/Managers/1/ | jq '.FirmwareVersion'
# Expected output: "2.40" or higher
Architectural Implications
TFTP Elimination Impact:
- Deployment: Cloud Run becomes viable (no UDP/TFTP requirement)
- Cost: Reduced infrastructure costs (~$5-8/month vs $8-17/month)
- Complexity: Simplified networking (TCP-only firewall rules)
- Development: Reduced effort (no TFTP library, testing, edge cases)
- Scalability: Cloud Run autoscaling vs fixed VM capacity
- Maintenance: Serverless reduces operational overhead
Decision Impact:
The removal of TFTP complexity fundamentally shifts the cost/benefit analysis:
- Custom Implementation: More attractive (Cloud Run, reduced development time)
- Matchbox: Still valid but cost/complexity advantage reduced
- TCO Gap: Narrowed from ~$8,000-12,000 to ~$4,000-8,000 (Year 1)
- Development Gap: Reduced from 2-3 weeks to 1-2 weeks
Detailed Comparison
Feature Comparison
| Feature | Custom Implementation | Matchbox |
|---|---|---|
| UEFI HTTP Boot | ✅ Native (standard HTTP) | ✅ Built-in |
| HTTP/HTTPS Boot | ✅ Via z5labs/humus | ✅ Built-in |
| Cloud Run Deployment | ✅ Preferred option | ✅ Enabled by HTTP-only |
| Boot Scripting | ✅ Custom templates | ✅ Go templates |
| Machine-to-Image Mapping | ✅ Firestore/JSON | ✅ YAML groups with selectors |
| Boot Profile Management | ✅ Custom API | ✅ gRPC API + YAML |
| Cloud-Init Support | ⚠️ Requires implementation | ✅ Native support |
| Ignition Support | ❌ Not planned | ✅ Native support (Flatcar, CoreOS) |
| Asset Versioning | ⚠️ Requires implementation | ⚠️ Manual (via Cloud Storage versioning) |
| Rollback Capability | ⚠️ Requires implementation | ✅ Update group to previous profile |
| OpenTelemetry Observability | ✅ Built-in | ⚠️ Logs only (requires parsing) |
| GCP Cloud Storage Integration | ✅ Native SDK | ⚠️ Requires sync scripts |
| HTTP REST Admin API | ✅ Native (z5labs/humus) | ⚠️ gRPC only |
| Multi-Environment Support | ⚠️ Requires implementation | ✅ Groups + metadata |
Development Effort Comparison
| Task | Custom Implementation | Matchbox |
|---|---|---|
| Initial Setup | 1-2 days (project scaffolding) | 4-8 hours (deployment + config) |
| UEFI HTTP Boot | 1-2 days (standard HTTP endpoints) | ✅ Included |
| HTTP Boot API | 2-3 days (z5labs/humus endpoints) | ✅ Included |
| Machine Matching Logic | 2-3 days (database queries, selectors) | ✅ Included |
| Boot Script Templates | 2-3 days (boot script templating) | ✅ Included |
| Cloud-Init Support | 3-5 days (parsing, injection) | ✅ Included |
| Asset Management | 2-3 days (upload, storage) | ✅ Included |
| HTTP REST Admin API | 2-3 days (OpenAPI endpoints) | ✅ Included (gRPC) |
| Cloud Run Deployment | 1 day (Cloud Run config) | 1 day (Cloud Run config) |
| Testing | 3-5 days (unit, integration, E2E - simplified) | 2-3 days (integration only) |
| Documentation | 2-3 days | 1 day (reference existing docs) |
| Total Effort | 2-3 weeks | 1 week |
Operational Complexity
| Aspect | Custom Implementation | Matchbox |
|---|---|---|
| Deployment | Docker container on Compute Engine | Docker container on Compute Engine |
| Configuration Updates | API calls or Terraform updates | YAML file updates + API/filesystem sync |
| Monitoring | OpenTelemetry metrics to Cloud Monitoring | Log parsing + custom metrics |
| Troubleshooting | Full access to code, custom logging | Matchbox logs + gRPC API inspection |
| Security Patches | Manual code updates | Upstream container image updates |
| Dependency Updates | Manual Go module updates | Upstream Matchbox updates |
| Backup/Restore | Cloud Storage + Firestore backups | Sync /var/lib/matchbox to Cloud Storage |
Cost Comparison Summary
Comparing Cloud Run Deployments (Preferred for both options):
| Item | Custom (Cloud Run) | Matchbox (Cloud Run) | Difference |
|---|---|---|---|
| Compute | Cloud Run ($3.50/month) | Cloud Run ($7/month) | +$3.50/month |
| Storage | Cloud Storage ($1/month) | Cloud Storage ($1/month) | $0 |
| Development | 2-3 weeks @ $100/hour = $8,000-12,000 | 1 week @ $100/hour = $4,000 | -$4,000-8,000 |
| Annual Infrastructure | ~$54 | ~$96 | +$42/year |
| TCO (Year 1) | ~$8,054-12,054 | ~$4,096 | -$3,958-7,958 |
| TCO (Year 3) | ~$8,162-12,162 | ~$4,288 | -$3,874-7,874 |
Key Insights:
- UEFI HTTP boot enables Cloud Run deployment for both options, dramatically reducing infrastructure costs
- Custom implementation TCO gap narrowed from $7,895-11,895 to $3,958-7,958 (Year 1)
- Both options now cost ~$5-8/month for infrastructure (vs $8-17/month with TFTP)
- Development time difference reduced from 2-3 weeks to 1-2 weeks
- Decision is much closer than originally assessed
Risk Analysis
| Risk | Custom Implementation | Matchbox | Mitigation |
|---|---|---|---|
| Security Vulnerabilities | Medium (standard HTTP code, well-understood) | Medium (upstream dependency) | Both: Monitor for security updates, automated deployments |
| Boot Failures | Medium (HTTP-only reduces complexity) | Low (battle-tested) | Custom: Comprehensive E2E testing with real hardware |
| Cloud Run Cold Starts | Medium (needs validation) | Medium (needs validation) | Both: Min instances = 1 (always-on) |
| Maintenance Burden | Medium (ongoing code maintenance) | Low (upstream handles updates) | Both: Automated deployment pipelines |
| GCP Integration Issues | Low (native SDK) | Medium (sync scripts) | Matchbox: Robust sync with error handling |
| Scalability Limits | Low (Cloud Run autoscaling) | Low (handles thousands of nodes) | Both: Monitor boot request latency |
| Dependency Abandonment | N/A (no external deps) | Low (Red Hat backing) | Matchbox: Can fork if necessary |
Implementation Plan
Phase 1: Core Boot Server (Week 1)
Project Setup (1-2 days)
- Create Go project with z5labs/humus framework
- Set up OpenAPI specification for HTTP REST admin API
- Configure Cloud Storage and Firestore clients
- Implement basic health check endpoints
UEFI HTTP Boot Endpoints (2-3 days)
- HTTP endpoint serving boot scripts (iPXE format)
- Kernel and initrd streaming from Cloud Storage
- MAC-based machine matching using Firestore
- Boot script templating with machine-specific parameters
Testing & Deployment (2-3 days)
- Deploy to Cloud Run with min instances = 1
- Configure WireGuard VPN connectivity
- Test UEFI HTTP boot from HP DL360 Gen 9 (iLO 4 v2.40+)
- Validate boot latency and Cloud Run cold start metrics
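The Cloud Run deployment described above could be expressed roughly as follows; the service name, image path, and region are placeholders:
# Illustrative deployment matching the sizing above (min 1 / max 2 instances, 512MB, concurrency 10)
gcloud run deploy boot-server \
  --image=gcr.io/my-project/boot-server:latest \
  --region=us-central1 \
  --min-instances=1 \
  --max-instances=2 \
  --memory=512Mi \
  --cpu=1 \
  --concurrency=10 \
  --ingress=internal \
  --allow-unauthenticated
The --ingress=internal plus --allow-unauthenticated pairing reflects that UEFI firmware cannot present IAM credentials, so access control comes from restricting reachability to the VPC/WireGuard path rather than from request authentication.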
Phase 2: Admin API & Management (Week 2)
HTTP REST Admin API (2-3 days)
- Boot image upload endpoints (kernel, initrd, metadata)
- Machine-to-image mapping management
- Boot profile CRUD operations
- Asset versioning and integrity validation
Cloud-Init Integration (2-3 days)
- Cloud-init configuration templating
- Metadata injection for machine-specific settings
- Integration with boot workflow
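A rendered user-data document of the kind this templating produces might look like the sketch below; all values are placeholders injected from the machine's metadata:
#cloud-config
hostname: node-01
fqdn: node-01.homelab.local
ssh_authorized_keys:
  - ssh-ed25519 AAAA...
packages:
  - wireguard-tools
runcmd:
  # Placeholder first-boot step
  - systemctl enable --now ssh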
Observability & Documentation (2-3 days)
- OpenTelemetry metrics integration
- Cloud Monitoring dashboards
- API documentation
- Operational runbooks
Success Criteria
- ✅ Successfully boot HP DL360 Gen 9 via UEFI HTTP boot through WireGuard VPN
- ✅ Boot request latency < 100ms (boot script delivery; time to first byte for kernel/initrd downloads)
- ✅ Cloud Run cold start latency < 100ms (with min instances = 1)
- ✅ Machine-to-image mapping works correctly based on MAC address
- ✅ Cloud Storage integration functional (upload, retrieve boot assets)
- ✅ HTTP REST API fully functional for boot configuration management
- ✅ Firestore stores machine mappings and boot profiles correctly
- ✅ OpenTelemetry metrics available in Cloud Monitoring
- ✅ Configuration update workflow clear and documented
- ✅ Firmware compatibility confirmed (no TFTP fallback needed)
Future Considerations
- High Availability: If boot server uptime becomes critical, evaluate multi-region deployment or failover strategies
- Multi-Cloud: If multi-cloud strategy emerges, custom implementation provides better portability
- Enterprise Features: If advanced provisioning workflows required (bare metal Kubernetes, Ignition support, etc.), evaluate adding features to custom implementation
- Asset Versioning: Implement comprehensive boot image versioning and rollback capabilities beyond basic Cloud Storage versioning
- Multi-Environment Support: Add support for multiple environments (dev, staging, prod) with environment-specific boot profiles
- Issue #601 - story(docs): create adr for network boot infrastructure on google cloud
- Issue #595 - story(docs): create adr for network boot architecture
- Issue #597 - story(docs): create adr for cloud provider selection