Google Cloud Platform Analysis

Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure

This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • Compute Engine: Virtual machine instances for hosting boot server
  • Cloud VPN / VPC: Network connectivity and VPN capabilities
  • Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
  • Cloud NAT: Network address translation for outbound connectivity
  • VPC Network: Software-defined networking and routing

Documentation Sections

1 - Cloud Storage FUSE (gcsfuse)

Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure

Overview

Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.

  • Project: GoogleCloudPlatform/gcsfuse
  • License: Apache 2.0
  • Status: Generally Available (GA)
  • Latest Version: v2.x (as of 2024)

How gcsfuse Works

gcsfuse translates filesystem operations into GCS API calls:

  1. Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
  2. Directory Structure: Interprets / in object names as directory separators
  3. File Operations: Translates read(), write(), open(), etc. into GCS API requests
  4. Metadata: Maintains file attributes (size, modification time) via GCS metadata
  5. Caching: Optional stat, type, list, and file caching to reduce API calls

Example:

  • GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
  • Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img

Relevance to Network Boot Infrastructure

In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.

Potential Use Cases

  1. Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
  2. Configuration Sync: Access boot profiles and machine mappings from GCS as local files
  3. Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
  4. Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
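
For example, the Matchbox mount in use case 3 could be declared in /etc/fstab via the gcsfuse mount helper. A minimal sketch, assuming the helper is installed and using the bucket and mount names from this document's examples:

# Hypothetical /etc/fstab entry (bucket name, mount point, and options are assumptions)
echo "boot-assets /var/lib/matchbox/assets gcsfuse ro,_netdev,allow_other,implicit_dirs" | sudo tee -a /etc/fstab
sudo mkdir -p /var/lib/matchbox/assets
sudo mount /var/lib/matchbox/assets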

Architecture Pattern

┌─────────────────────────┐
│   Boot Server Process   │
│  (Cloud Run/Compute)    │
└───────────┬─────────────┘
            │ filesystem operations
            │ (read, open, stat)
            ▼
┌─────────────────────────┐
│   gcsfuse mount point   │
│   /var/lib/boot-assets  │
└───────────┬─────────────┘
            │ FUSE layer
            │ (translates to GCS API)
            ▼
┌─────────────────────────┐
│  Cloud Storage Bucket   │
│   gs://boot-assets/     │
└─────────────────────────┘

Performance Characteristics

Latency

  • Much higher latency than local filesystem: Every operation requires GCS API call(s)
  • No default caching: Without caching enabled, every read re-fetches from GCS
  • Network round-trip: Minimum ~10-50ms latency per operation (depending on region)

Throughput

Single Large File:

  • Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
  • Write: Comparable to gsutil cp for large files
  • With parallel downloads: Up to 9x faster for single-threaded reads of large files

Small Files:

  • Poor performance for random I/O on small files
  • Bulk operations on many small files create significant bottlenecks
  • ls on directories with thousands of objects can take minutes

Concurrent Access:

  • Performance degrades significantly with parallel readers (8 instances: ~30 hours vs 16 minutes with local data)
  • Not recommended for high-concurrency scenarios (web servers, NAS)

Performance Improvements (Recent Features)

  1. Streaming Writes (default): Upload data directly to GCS as written

    • Up to 40% faster for large sequential writes
    • Reduces local disk usage (no staging file)
  2. Parallel Downloads: Download large files using multiple workers

    • Up to 9x faster model load times
    • Best for single-threaded reads of large files
  3. File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)

    • Up to 2.3x faster training time (AI/ML workloads)
    • Up to 3.4x higher throughput
    • Requires explicit cache directory configuration
  4. Metadata Cache: Cache stat, type, and list operations

    • Stat and type caches enabled by default
    • Configurable TTL (default: 60s, set -1 for unlimited)

Caching Configuration

gcsfuse provides four types of caching:

1. Stat Cache

Caches file attributes (size, modification time, existence).

# Enable with unlimited size and TTL
gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).

2. Type Cache

Caches file vs directory type information.

gcsfuse \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Speeds up directory traversal and ls operations.

3. List Cache

Caches directory listing results.

gcsfuse \
  --kernel-list-cache-ttl-secs=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Improves performance for applications that repeatedly list directory contents.

4. File Cache

Caches actual file contents locally.

gcsfuse \
  --file-cache-max-size-mb=-1 \
  --cache-dir=/mnt/local-ssd \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  bucket-name /mount/point

Use case: Essential for AI/ML training, repeated reads of large files.

Recommended cache storage:

  • Local SSD: Fastest, but ephemeral (data lost on restart)
  • Persistent Disk: Persistent but slower than Local SSD
  • tmpfs (RAM disk): Fastest but limited by memory
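
A Local SSD cache directory might be prepared as follows. This is a minimal sketch; the NVMe device path, filesystem, and mount point are assumptions (verify the device with lsblk):

# Format and mount a Local SSD, then point gcsfuse's file cache at it
sudo mkfs.ext4 -F /dev/disk/by-id/google-local-nvme-ssd-0
sudo mkdir -p /mnt/disks/local-ssd
sudo mount /dev/disk/by-id/google-local-nvme-ssd-0 /mnt/disks/local-ssd
sudo mkdir -p /mnt/disks/local-ssd/gcsfuse-cache
gcsfuse --file-cache-max-size-mb=-1 --cache-dir=/mnt/disks/local-ssd/gcsfuse-cache boot-assets /mnt/boot-assets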

Production Configuration Example

# config.yaml for gcsfuse
metadata-cache:
  ttl-secs: -1  # Never expire (use only if bucket is read-only or single-writer)
  stat-cache-max-size-mb: -1
  type-cache-max-size-mb: -1

file-cache:
  max-size-mb: -1  # Unlimited (limited by disk space)
  cache-file-for-range-read: true
  enable-parallel-downloads: true
  parallel-downloads-per-file: 16
  download-chunk-size-mb: 50

write:
  create-empty-file: false  # Streaming writes (default)

logging:
  severity: info
  format: json
Mount using the config file:

gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets

Limitations and Considerations

Filesystem Semantics

gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:

  • No atomic rename: Rename operations are copy-then-delete (not atomic)
  • No hard links: GCS doesn’t support hard links
  • No file locking: flock() is a no-op
  • Limited permissions: GCS has simpler ACLs than POSIX permissions
  • No sparse files: Writes always materialize full file content

Performance Anti-Patterns

Avoid:

  • Serving web content or acting as NAS (concurrent connections)
  • Random I/O on many small files (image datasets, text corpora)
  • Reading during ML training loops (download first, then train)
  • High-concurrency workloads (multiple parallel readers/writers)

Good for:

  • Sequential reads of large files (models, checkpoints, kernels)
  • Infrequent writes of entire files
  • Read-mostly workloads with caching enabled
  • Single-writer scenarios

Consistency Trade-offs

With caching enabled:

  • Stale reads possible if cache TTL > 0 and external modifications occur
  • Safe only for:
    • Read-only buckets
    • Single-writer, single-mount scenarios
    • Workloads tolerant of eventual consistency

Without caching:

  • Strong consistency (every read fetches latest from GCS)
  • Much slower performance

Resource Requirements

  • Disk space: File cache and streaming writes require local storage
    • File cache: Size of cached files (can be large for ML datasets)
    • Streaming writes: Temporary staging (proportional to concurrent writes)
  • Memory: Metadata caches consume RAM
  • File handles: Can exceed system limits with high concurrency
  • Network bandwidth: All data transfers via GCS API

Installation

On Compute Engine (standalone .deb package)

# Install a specific gcsfuse release directly from GitHub (Debian/Ubuntu images)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb

Note: Container-Optimized OS does not include a package manager; on COS, gcsfuse must run inside a container image that bundles the binary.

On Debian/Ubuntu

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-get update
sudo apt-get install gcsfuse

In Docker/Cloud Run

FROM ubuntu:22.04

# Install gcsfuse
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    lsb-release \
  && export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
  && echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install -y gcsfuse \
  && rm -rf /var/lib/apt/lists/*

# Create mount point
RUN mkdir -p /mnt/boot-assets

# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
    /usr/local/bin/boot-server

Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.

Network Boot Infrastructure Evaluation

Applicability to ADR-0005

Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:

❌ Cloud Run Incompatibility

  • gcsfuse requires FUSE kernel module and privileged containers
  • Cloud Run does not support FUSE or privileged mode
  • ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
  • Impact: Blocks Cloud Run deployment, forcing Compute Engine VM

❌ Boot Latency Requirements

  • Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
  • gcsfuse adds 10-50ms+ latency per operation (network round-trips)
  • Kernel/initrd downloads are latency-sensitive (network boot timeout)
  • Impact: May exceed boot timeout thresholds

❌ No Caching for Read-Write Workloads

  • Boot server needs to write new assets and read existing ones
  • File cache with unlimited TTL requires read-only or single-writer assumption
  • Multiple boot server instances (autoscaling) violate single-writer constraint
  • Impact: Either accept stale reads or disable caching (slow)

❌ Small File Performance

  • Machine mapping configs, boot scripts, profiles are small files (KB range)
  • gcsfuse performs poorly on small, random I/O
  • ls operations on directories with many profiles can be slow
  • Impact: Slow boot configuration lookups

✅ Alternative: Direct Cloud Storage SDK

Using cloud.google.com/go/storage SDK directly offers:

  • Lower latency: Direct API calls without FUSE overhead
  • Cloud Run compatible: No kernel module or privileged mode required
  • Better control: Explicit caching, parallel downloads, streaming
  • Simpler deployment: No mount management, no FUSE dependencies
  • Cost: Similar API call costs to gcsfuse

Recommended approach (from ADR-0005):

// Custom boot server using the Cloud Storage SDK
client, err := storage.NewClient(ctx)
if err != nil {
    return err
}
bucket := client.Bucket("boot-assets")

// Stream kernel to boot client
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
    return err
}
defer reader.Close()
io.Copy(w, reader) // Stream to HTTP response

When gcsfuse MIGHT Be Useful

Despite the above limitations, gcsfuse could be considered for:

  1. Matchbox on Compute Engine:

    • Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
    • Compute Engine VM supports FUSE
    • Read-heavy workload (boot assets rarely change)
    • Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
  2. Development/Testing:

    • Quick prototyping without writing Cloud Storage integration
    • Local development with production bucket access
    • Not recommended for production deployment
  3. Low-Throughput Scenarios:

    • Home lab scale (< 10 boots/hour)
    • File cache enabled with Local SSD
    • Single Compute Engine VM (not autoscaled)

Configuration for Matchbox + gcsfuse:

#!/bin/bash
# Mount boot assets for Matchbox

BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"

mkdir -p "$MOUNT_POINT" "$CACHE_DIR"

gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  --file-cache-max-size-mb=-1 \
  --cache-dir="$CACHE_DIR" \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  --implicit-dirs \
  --foreground \
  "$BUCKET" "$MOUNT_POINT"

Monitoring and Troubleshooting

Metrics

gcsfuse exposes Prometheus metrics:

gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point

Key metrics:

  • gcs_read_count: Number of GCS read operations
  • gcs_write_count: Number of GCS write operations
  • gcs_read_bytes: Bytes read from GCS
  • gcs_write_bytes: Bytes written to GCS
  • fs_ops_count: Filesystem operations by type (open, read, write, etc.)
  • fs_ops_error_count: Filesystem operation errors
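
To spot-check the exporter, a quick query might look like this (a sketch assuming the Prometheus endpoint above is enabled and serves the conventional /metrics path on port 9101):

curl -s http://localhost:9101/metrics | grep -E 'gcs_(read|write)_count|fs_ops_error_count'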

Logging

# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point

Common Issues

Issue: ls on large directories takes minutes

Solution:

  • Enable list caching with --metadata-cache-ttl-secs=-1
  • Reduce directory depth (flatten object hierarchy)
  • Consider prefix-based filtering instead of full listings

Issue: Stale reads after external bucket modifications

Solution:

  • Reduce --metadata-cache-ttl-secs (default 60s)
  • Disable caching entirely for strong consistency
  • Use versioned object names (immutable assets)

Issue: Transport endpoint is not connected errors

Solution:

  • Unmount cleanly before remounting: fusermount -u /mnt/point
  • Check GCS bucket permissions (IAM roles)
  • Verify network connectivity to storage.googleapis.com

Issue: High memory usage

Solution:

  • Limit metadata cache sizes: --stat-cache-max-size-mb=1024
  • Disable file cache if not needed
  • Monitor with --prometheus metrics

Comparison to Alternatives

gcsfuse vs Direct Cloud Storage SDK

Aspect | gcsfuse | Cloud Storage SDK
Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API)
Cloud Run | ❌ Not supported | ✅ Fully supported
Development Effort | Low (standard filesystem code) | Medium (SDK integration)
Performance | Slower (filesystem abstraction) | Faster (optimized for use case)
Caching | Built-in (stat, type, list, file) | Manual (application-level)
Streaming | Automatic | Explicit (io.Copy)
Dependencies | FUSE kernel module, privileged mode | None (pure Go library)

Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.

gcsfuse vs rsync/gsutil Sync

Periodic sync pattern:

# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets

Aspect | gcsfuse | rsync/gsutil sync
Consistency | Eventual (with caching) | Strong (within sync interval)
Disk Usage | Minimal (file cache optional) | Full copy of assets
Latency | GCS API per request | Local disk (fast)
Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes)
Deployment | Requires FUSE | Simple cron job

Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.

Conclusion

Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:

  1. Cloud Run incompatibility (requires FUSE kernel module)
  2. Added latency (FUSE overhead + network round-trips)
  3. Poor performance for small files and concurrent access
  4. Caching trade-offs (consistency vs performance)

Recommended alternatives:

  • Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
  • Matchbox on Compute Engine: rsync/gsutil sync to local disk
  • Cloud Run Deployment: Direct SDK (no gcsfuse possible)

gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.


2 - GCP Network Boot Protocol Support

Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Google Cloud Platform

This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Cloud Load Balancing

GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.

Implementation Options

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing.

Option 1: Direct VM Access (Recommended for VPN Scenario)

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
  • Access: Home lab connects via VPN tunnel to the VM’s private IP
  • Routing: VPC firewall rules allow UDP/69 from VPN subnet
  • Pros:
    • Simple implementation
    • No need for load balancing (single boot server sufficient)
    • TFTP traffic encrypted through VPN tunnel
    • Direct VM-to-client communication
  • Cons:
    • Single point of failure (no load balancing/HA)
    • Manual failover required if VM fails
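
A minimal dnsmasq configuration for this Option 1 approach might look like the following sketch (the TFTP root, listen address, and service restart are assumptions):

# Write a TFTP-only dnsmasq configuration
sudo tee /etc/dnsmasq.d/tftp.conf <<'EOF'
# Disable DNS; act purely as a TFTP server
port=0
enable-tftp
tftp-root=/var/lib/tftpboot
# Serve only files owned by the dnsmasq user (read-only by design)
tftp-secure
# Listen on the VPN-facing address (assumed tunnel IP)
listen-address=10.200.0.1
EOF
sudo systemctl restart dnsmasq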

Option 2: Network Load Balancer (NLB) Passthrough

While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:

  • Approach: Configure Network Load Balancer for UDP/69 passthrough
  • Limitations:
    • No protocol-aware health checks for TFTP
    • Health checks would use TCP or HTTP on alternate port
    • Adds complexity without significant benefit for single boot server
  • Use Case: Only relevant for multi-region HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
  • Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads

HTTP Support

Native Support

Status: ✅ Fully supported

GCP provides comprehensive HTTP support through multiple services:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
  • Port: Any port (typically 80 for HTTP)
  • Routing: URL-based routing, host-based routing, path-based routing
  • Health Checks: HTTP health checks with configurable paths
  • SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
  • Backend: Compute Engine VMs, instance groups, Cloud Run, GKE

Compute Engine Direct Access

For VPN scenario, HTTP can be served directly from VM:

  • Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
  • Access: Home lab accesses via VPN tunnel to private IP
  • Firewall: VPC firewall rules allow TCP/80 from VPN subnet
  • Pros: Simpler than load balancer for single boot server
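
A sketch of the direct-VM HTTP option using nginx (document root, config path, and reload command are assumptions):

# Minimal nginx server block serving boot assets over plain HTTP
sudo tee /etc/nginx/conf.d/boot.conf <<'EOF'
server {
    listen 80;
    root /var/lib/boot-assets;
    autoindex off;
}
EOF
sudo nginx -t && sudo systemctl reload nginx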

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
  3. Kernel/Initrd: Large boot files served efficiently over HTTP

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based boot configs
  • Caching: Cloud CDN can cache boot files for faster delivery
  • TCP Optimization: GCP’s network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

GCP provides enterprise-grade HTTPS support:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
  • SSL/TLS Termination: Terminate SSL at load balancer
  • Certificate Management:
    • Google-managed SSL certificates (automatic renewal)
    • Self-managed certificates (bring your own)
    • Certificate Map for multiple domains
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
  • Cipher Suites: Modern, compatible, or custom cipher suites
  • mTLS Support: Mutual TLS authentication (client certificates)

Certificate Manager

  • Managed Certificates: Automatic provisioning and renewal of Google-managed certificates
  • Private CA: Integration with Google Cloud Certificate Authority Service
  • Certificate Maps: Route different domains to different backends based on SNI
  • Certificate Monitoring: Automatic alerts before expiration

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (introduced in the UEFI 2.5 specification)
  • Security: Boot file integrity verified via HTTPS chain of trust
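
For reference, HTTPS support in iPXE is a build-time option; a sketch of enabling it and embedding a trust anchor follows (the certificate path is an assumption):

git clone https://github.com/ipxe/ipxe.git
cd ipxe/src
# Enable HTTPS downloads in the build configuration
sed -i 's/^#undef[[:space:]]*DOWNLOAD_PROTO_HTTPS/#define DOWNLOAD_PROTO_HTTPS/' config/general.h
# Build a UEFI iPXE binary that trusts the boot server's CA
make bin-x86_64-efi/ipxe.efi TRUST=/path/to/boot-ca.crt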

Implementation on GCP

  1. Certificate Provisioning:

    • Use Google-managed certificate for public domain (if boot server has public DNS)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use private CA for internal PKI
  2. Load Balancer Configuration:

    • HTTPS frontend (port 443)
    • Backend service to Compute Engine VM running boot server
    • SSL policy with TLS 1.2+ minimum
  3. Alternative: Direct VM HTTPS:

    • Run nginx/Apache with TLS on Compute Engine VM
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario
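
A self-signed certificate for the VPN-only, direct-VM option could be generated as follows (a sketch; hostname and file paths are assumptions, and the certificate must be added to the iPXE/UEFI trust store):

sudo openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
    -keyout /etc/ssl/private/boot-server.key \
    -out /etc/ssl/certs/boot-server.crt \
    -subj "/CN=boot.homelab.internal"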

mTLS Support for Enhanced Security

GCP’s Application Load Balancer supports mutual TLS authentication:

  • Client Certificates: Require client certificates for additional authentication
  • Certificate Validation: Validate client certificates against trusted CA
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth

Routing and Load Balancing Capabilities

VPC Routing

  • Custom Routes: Define routes to direct traffic through VPN gateway
  • Route Priority: Configure route priorities for failover scenarios
  • BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)

Firewall Rules

  • Ingress/Egress Rules: Fine-grained control over traffic
  • Source/Destination Filters: IP ranges, tags, service accounts
  • Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
  • VPN Subnet Restriction: Limit access to VPN-connected home lab subnet
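
For example, the boot-protocol rules restricted to the VPN subnet might look like this sketch (network name, subnet range, and target tag are assumptions):

gcloud compute firewall-rules create allow-boot-from-vpn \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=udp:69,tcp:80,tcp:443 \
    --source-ranges=10.200.0.0/24 \
    --target-tags=boot-server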

Cloud Armor (Optional)

For additional security if boot server has public access:

  • DDoS Protection: Layer 3/4 DDoS mitigation
  • WAF Rules: Application-level filtering
  • IP Allowlisting: Restrict to known public IPs
  • Rate Limiting: Prevent abuse

Cost Implications

Network Egress Costs

  • VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
  • Intra-Region: Free for traffic within same region
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)

Load Balancing Costs

  • Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
  • Network Load Balancer: ~$0.025/hour + data processing charges
  • For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)

Compute Costs

  • e2-micro Instance: ~$6-7/month (suitable for boot server)
  • f1-micro Instance: ~$4-5/month (even smaller, might suffice)
  • Reserved/Committed Use: Discounts for long-term commitment

Comparison with Requirements

Requirement | GCP Support | Implementation
TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN
HTTP | ✅ Full support | VM or ALB
HTTPS | ✅ Full support | VM or ALB with Certificate Manager
VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard
Load Balancing | ✅ ALB, NLB | Optional for HA
Certificate Mgmt | ✅ Managed certs | Certificate Manager
Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. Compute Engine VM: Deploy single e2-micro VM with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with self-signed certificate
  2. VPN Tunnel: Connect home lab to GCP via:

    • Cloud VPN (IPsec) - easier setup, higher cost
    • Self-managed WireGuard on Compute Engine - lower cost, more control
  3. VPC Firewall: Restrict access to:

    • UDP/69 (TFTP) from VPN subnet only
    • TCP/80 (HTTP) from VPN subnet only
    • TCP/443 (HTTPS) from VPN subnet only
  4. No Load Balancer: For home lab scale, direct VM access is sufficient

  5. Health Monitoring: Use Cloud Monitoring for VM and service health

If HA Required (Future Enhancement)

  • Deploy multi-zone VMs with Network Load Balancer
  • Use Cloud Storage as backend for boot files with VM serving as cache
  • Implement failover automation with Cloud Functions


3 - GCP WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity

WireGuard VPN Support on Google Cloud Platform

This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

GCP Native VPN Support

Cloud VPN (IPsec)

Status: ❌ WireGuard not natively supported

GCP’s managed Cloud VPN service supports:

  • IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
  • HA VPN: Highly available VPN with 99.99% SLA
  • Classic VPN: Single-tunnel VPN (deprecated)

Limitation: Cloud VPN does not support WireGuard protocol natively.

Cost: Cloud VPN

  • HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
  • Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
  • Total Estimate: ~$75-100/month for managed VPN

Self-Managed WireGuard on Compute Engine

Implementation Approach

Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:

Status: ✅ Fully supported via Compute Engine

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
    B -->|Private VPC Network| C[Boot Server VM]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "GCP VPC"
        B[WireGuard Gateway VM]
        C[Boot Server VM]
    end

VM Configuration

  1. WireGuard Gateway VM:

    • Instance Type: e2-micro or f1-micro ($4-7/month)
    • OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
    • IP Forwarding: Enable IP forwarding to route traffic to other VMs
    • External IP: Static external IP for stable WireGuard endpoint
    • Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
  2. Boot Server VM:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No external IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway VM
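
A sketch of creating the gateway VM with IP forwarding enabled (instance name, zone, machine type, and image are assumptions):

gcloud compute instances create wireguard-gateway-vm \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --can-ip-forward \
    --tags=wireguard-gateway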

Installation Steps

# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on GCP VM:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

GCP VPC Configuration

Firewall Rules

Create VPC firewall rule to allow WireGuard:

gcloud compute firewall-rules create allow-wireguard \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=udp:51820 \
    --source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
    --target-tags=wireguard-gateway

Tag the WireGuard VM:

gcloud compute instances add-tags wireguard-gateway-vm \
    --tags=wireguard-gateway \
    --zone=us-central1-a

Static External IP

Reserve static IP for stable WireGuard endpoint:

gcloud compute addresses create wireguard-gateway-ip \
    --region=us-central1

gcloud compute instances delete-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --zone=us-central1-a

gcloud compute instances add-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --address=wireguard-gateway-ip \
    --zone=us-central1-a

Cost: Static external IP ~$3-4/month (GCP now bills external IPv4 addresses even while attached to a running VM).

Route Configuration

For traffic from boot server to reach home lab via WireGuard VM:

gcloud compute routes create route-to-homelab \
    --network=default \
    --priority=100 \
    --destination-range=192.168.1.0/24 \
    --next-hop-instance=wireguard-gateway-vm \
    --next-hop-instance-zone=us-central1-a

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: GCP WireGuard VM’s public key
    • Endpoint: GCP VM’s static external IP
    • Port: 51820
    • Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to GCP subnets
    • Home lab servers can reach GCP boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on Compute Engine performance:

  • e2-micro (2 vCPU, shared core): ~100-300 Mbps
  • e2-small (2 vCPU): ~500-800 Mbps
  • e2-medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even e2-micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: e2-micro adequate for home lab scale

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
  • GCP Network: Low-latency network to most regions
  • Total Latency: Primarily dependent on home ISP and GCP region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • e2-micro: Sufficient CPU for home lab VPN throughput

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secret Manager: Store WireGuard private keys in GCP Secret Manager
    • Retrieve at VM startup via startup script
    • Avoid storing in VM metadata or disk images
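
A sketch of seeding the secret (the secret name matches the startup script shown later in this document; gcloud must be authenticated with Secret Manager permissions):

# Create the secret and upload the existing private key as the first version
gcloud secrets create wireguard-server-key --replication-policy=automatic
sudo cat /etc/wireguard/server_private.key | gcloud secrets versions add wireguard-server-key --data-file=-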

Firewall Hardening

  • Source IP Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server firewall allows only VPN subnet
  • No Public Access: Boot server has no external IP

Monitoring and Alerts

  • Cloud Logging: Log WireGuard connection events
  • Cloud Monitoring: Alert on VPN tunnel down
  • Metrics: Monitor handshake failures, data transfer

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification
  • Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)

High Availability Options

Multi-Region Failover

Deploy WireGuard gateways in multiple regions:

  • Primary: us-central1 WireGuard VM
  • Secondary: us-east1 WireGuard VM
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles VM costs (~$8-14/month for 2 VMs)

Health Checks

Monitor WireGuard tunnel health:

# On UDM Pro (via SSH)
wg show wg0 latest-handshakes

# If handshake timestamp old (>3 minutes), tunnel may be down

Automate failover with script on UDM Pro or external monitoring.
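
A minimal health-check sketch that restarts the tunnel when the newest handshake is stale (interface name and threshold are assumptions):

#!/bin/sh
# Restart wg0 if the most recent peer handshake is older than 3 minutes
NOW=$(date +%s)
LAST=$(wg show wg0 latest-handshakes | awk '{print $2}' | sort -n | tail -1)
if [ -n "$LAST" ] && [ $((NOW - LAST)) -gt 180 ]; then
    echo "WireGuard handshake stale; restarting wg0"
    wg-quick down wg0 && wg-quick up wg0
fi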

Startup Scripts for Auto-Healing

GCP VM startup script to ensure WireGuard starts on boot:

#!/bin/bash
# /etc/startup-script.sh

# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Attach as metadata:

gcloud compute instances add-metadata wireguard-gateway-vm \
    --metadata-from-file startup-script=/path/to/startup-script.sh \
    --zone=us-central1-a

Cost Analysis

Self-Managed WireGuard on Compute Engine

Component | Cost
e2-micro VM (730 hrs/month) | ~$6.50
Static External IP | ~$3.50
Egress (1GB/month boot traffic) | ~$0.12
Monthly Total | ~$10.12
Annual Total | ~$121

Cloud VPN (IPsec - if WireGuard not used)

Component | Cost
HA VPN Gateway (2 tunnels) | ~$73
Egress (1GB/month) | ~$0.12
Monthly Total | ~$73
Annual Total | ~$876

Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.

Comparison with Requirements

Requirement | GCP Support | Implementation
WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM
Site-to-Site VPN | ✅ Yes | WireGuard tunnel
UDM Pro Integration | ✅ Native support | WireGuard peer config
Cost Efficiency | ✅ Low cost | e2-micro ~$10/month
Performance | ✅ Sufficient | 100+ Mbps on e2-micro
Security | ✅ Modern crypto | ChaCha20, Curve25519
HA (optional) | ⚠️ Manual setup | Multi-region VMs

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM

    • Cost: ~$10/month (vs ~$73/month for Cloud VPN)
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single Region Deployment: Unless HA required, single VM adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • Zone: Single zone sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes GCP subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secret Manager
    • Restrict WireGuard port to home public IP only
    • Use startup script to configure VM on boot
    • Enable Cloud Logging for VPN events
  5. Monitoring: Set up Cloud Monitoring alerts for:

    • VM down
    • High CPU usage (indicates traffic spike or issue)
    • Firewall rule blocks (indicates misconfiguration)

Future Enhancements

  • HA Setup: Deploy secondary WireGuard VM in different region
  • Automated Failover: Script on UDM Pro to switch endpoints
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added
