Google Cloud Platform Analysis

Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure

This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.

Overview

Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.

Key Services Evaluated

  • Compute Engine: Virtual machine instances for hosting boot server
  • Cloud VPN / VPC: Network connectivity and VPN capabilities
  • Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
  • Cloud NAT: Network address translation for outbound connectivity
  • VPC Network: Software-defined networking and routing

Documentation Sections

1 - Cloud Storage FUSE (gcsfuse)

Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure

Overview

Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.

  • Project: GoogleCloudPlatform/gcsfuse
  • License: Apache 2.0
  • Status: Generally Available (GA)
  • Latest Version: v2.x (as of 2024)

How gcsfuse Works

gcsfuse translates filesystem operations into GCS API calls:

  1. Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
  2. Directory Structure: Interprets / in object names as directory separators
  3. File Operations: Translates read(), write(), open(), etc. into GCS API requests
  4. Metadata: Maintains file attributes (size, modification time) via GCS metadata
  5. Caching: Optional stat, type, list, and file caching to reduce API calls

Example:

  • GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
  • Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img

Relevance to Network Boot Infrastructure

In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.

Potential Use Cases

  1. Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
  2. Configuration Sync: Access boot profiles and machine mappings from GCS as local files
  3. Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
  4. Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
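
For example, the Matchbox mount in use case 3 could be declared in /etc/fstab via the gcsfuse mount helper. A minimal sketch, assuming the helper is installed and using the bucket and mount names from this document's examples:

# Hypothetical /etc/fstab entry (bucket name, mount point, and options are assumptions)
echo "boot-assets /var/lib/matchbox/assets gcsfuse ro,_netdev,allow_other,implicit_dirs" | sudo tee -a /etc/fstab
sudo mkdir -p /var/lib/matchbox/assets
sudo mount /var/lib/matchbox/assets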

Architecture Pattern

┌─────────────────────────┐
│   Boot Server Process   │
│  (Cloud Run/Compute)    │
└───────────┬─────────────┘
            │ filesystem operations
            │ (read, open, stat)
            ▼
┌─────────────────────────┐
│   gcsfuse mount point   │
│   /var/lib/boot-assets  │
└───────────┬─────────────┘
            │ FUSE layer
            │ (translates to GCS API)
            ▼
┌─────────────────────────┐
│  Cloud Storage Bucket   │
│   gs://boot-assets/     │
└─────────────────────────┘

Performance Characteristics

Latency

  • Much higher latency than local filesystem: Every operation requires GCS API call(s)
  • No default caching: Without caching enabled, every read re-fetches from GCS
  • Network round-trip: Minimum ~10-50ms latency per operation (depending on region)

Throughput

Single Large File:

  • Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
  • Write: Comparable to gsutil cp for large files
  • With parallel downloads: Up to 9x faster for single-threaded reads of large files

Small Files:

  • Poor performance for random I/O on small files
  • Bulk operations on many small files create significant bottlenecks
  • ls on directories with thousands of objects can take minutes

Concurrent Access:

  • Performance degrades significantly with parallel readers (8 instances: ~30 hours vs 16 minutes with local data)
  • Not recommended for high-concurrency scenarios (web servers, NAS)

Performance Improvements (Recent Features)

  1. Streaming Writes (default): Upload data directly to GCS as written

    • Up to 40% faster for large sequential writes
    • Reduces local disk usage (no staging file)
  2. Parallel Downloads: Download large files using multiple workers

    • Up to 9x faster model load times
    • Best for single-threaded reads of large files
  3. File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)

    • Up to 2.3x faster training time (AI/ML workloads)
    • Up to 3.4x higher throughput
    • Requires explicit cache directory configuration
  4. Metadata Cache: Cache stat, type, and list operations

    • Stat and type caches enabled by default
    • Configurable TTL (default: 60s, set -1 for unlimited)

Caching Configuration

gcsfuse provides four types of caching:

1. Stat Cache

Caches file attributes (size, modification time, existence).

# Enable with unlimited size and TTL
gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).

2. Type Cache

Caches file vs directory type information.

gcsfuse \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Speeds up directory traversal and ls operations.

3. List Cache

Caches directory listing results.

gcsfuse \
  --kernel-list-cache-ttl-secs=-1 \
  --metadata-cache-ttl-secs=-1 \
  bucket-name /mount/point

Use case: Improves performance for applications that repeatedly list directory contents.

4. File Cache

Caches actual file contents locally.

gcsfuse \
  --file-cache-max-size-mb=-1 \
  --cache-dir=/mnt/local-ssd \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  bucket-name /mount/point

Use case: Essential for AI/ML training, repeated reads of large files.

Recommended cache storage:

  • Local SSD: Fastest, but ephemeral (data lost on restart)
  • Persistent Disk: Persistent but slower than Local SSD
  • tmpfs (RAM disk): Fastest but limited by memory
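
A Local SSD cache directory might be prepared as follows. This is a minimal sketch; the NVMe device path, filesystem, and mount point are assumptions (verify the device with lsblk):

# Format and mount a Local SSD, then point gcsfuse's file cache at it
sudo mkfs.ext4 -F /dev/disk/by-id/google-local-nvme-ssd-0
sudo mkdir -p /mnt/disks/local-ssd
sudo mount /dev/disk/by-id/google-local-nvme-ssd-0 /mnt/disks/local-ssd
sudo mkdir -p /mnt/disks/local-ssd/gcsfuse-cache
gcsfuse --file-cache-max-size-mb=-1 --cache-dir=/mnt/disks/local-ssd/gcsfuse-cache boot-assets /mnt/boot-assets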

Production Configuration Example

# config.yaml for gcsfuse
metadata-cache:
  ttl-secs: -1  # Never expire (use only if bucket is read-only or single-writer)
  stat-cache-max-size-mb: -1
  type-cache-max-size-mb: -1

file-cache:
  max-size-mb: -1  # Unlimited (limited by disk space)
  cache-file-for-range-read: true
  enable-parallel-downloads: true
  parallel-downloads-per-file: 16
  download-chunk-size-mb: 50

write:
  create-empty-file: false  # Streaming writes (default)

logging:
  severity: info
  format: json
Mount using the config file:

gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets

Limitations and Considerations

Filesystem Semantics

gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:

  • No atomic rename: Rename operations are copy-then-delete (not atomic)
  • No hard links: GCS doesn’t support hard links
  • No file locking: flock() is a no-op
  • Limited permissions: GCS has simpler ACLs than POSIX permissions
  • No sparse files: Writes always materialize full file content

Performance Anti-Patterns

Avoid:

  • Serving web content or acting as NAS (concurrent connections)
  • Random I/O on many small files (image datasets, text corpora)
  • Reading during ML training loops (download first, then train)
  • High-concurrency workloads (multiple parallel readers/writers)

Good for:

  • Sequential reads of large files (models, checkpoints, kernels)
  • Infrequent writes of entire files
  • Read-mostly workloads with caching enabled
  • Single-writer scenarios

Consistency Trade-offs

With caching enabled:

  • Stale reads possible if cache TTL > 0 and external modifications occur
  • Safe only for:
    • Read-only buckets
    • Single-writer, single-mount scenarios
    • Workloads tolerant of eventual consistency

Without caching:

  • Strong consistency (every read fetches latest from GCS)
  • Much slower performance

Resource Requirements

  • Disk space: File cache and streaming writes require local storage
    • File cache: Size of cached files (can be large for ML datasets)
    • Streaming writes: Temporary staging (proportional to concurrent writes)
  • Memory: Metadata caches consume RAM
  • File handles: Can exceed system limits with high concurrency
  • Network bandwidth: All data transfers via GCS API

Installation

On Compute Engine (standalone .deb package)

# Install a specific gcsfuse release directly from GitHub (Debian/Ubuntu images)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb

Note: Container-Optimized OS does not include a package manager; on COS, gcsfuse must run inside a container image that bundles the binary.

On Debian/Ubuntu

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-get update
sudo apt-get install gcsfuse

In Docker/Cloud Run

FROM ubuntu:22.04

# Install gcsfuse
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    lsb-release \
  && export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
  && echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install -y gcsfuse \
  && rm -rf /var/lib/apt/lists/*

# Create mount point
RUN mkdir -p /mnt/boot-assets

# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
    /usr/local/bin/boot-server

Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.

Network Boot Infrastructure Evaluation

Applicability to ADR-0005

Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:

❌ Cloud Run Incompatibility

  • gcsfuse requires FUSE kernel module and privileged containers
  • Cloud Run does not support FUSE or privileged mode
  • ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
  • Impact: Blocks Cloud Run deployment, forcing Compute Engine VM

❌ Boot Latency Requirements

  • Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
  • gcsfuse adds 10-50ms+ latency per operation (network round-trips)
  • Kernel/initrd downloads are latency-sensitive (network boot timeout)
  • Impact: May exceed boot timeout thresholds

❌ No Caching for Read-Write Workloads

  • Boot server needs to write new assets and read existing ones
  • File cache with unlimited TTL requires read-only or single-writer assumption
  • Multiple boot server instances (autoscaling) violate single-writer constraint
  • Impact: Either accept stale reads or disable caching (slow)

❌ Small File Performance

  • Machine mapping configs, boot scripts, profiles are small files (KB range)
  • gcsfuse performs poorly on small, random I/O
  • ls operations on directories with many profiles can be slow
  • Impact: Slow boot configuration lookups

✅ Alternative: Direct Cloud Storage SDK

Using cloud.google.com/go/storage SDK directly offers:

  • Lower latency: Direct API calls without FUSE overhead
  • Cloud Run compatible: No kernel module or privileged mode required
  • Better control: Explicit caching, parallel downloads, streaming
  • Simpler deployment: No mount management, no FUSE dependencies
  • Cost: Similar API call costs to gcsfuse

Recommended approach (from ADR-0005):

// Custom boot server using the Cloud Storage SDK
client, err := storage.NewClient(ctx)
if err != nil {
    return err
}
bucket := client.Bucket("boot-assets")

// Stream kernel to boot client
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
    return err
}
defer reader.Close()
io.Copy(w, reader) // Stream to HTTP response

When gcsfuse MIGHT Be Useful

Despite the above limitations, gcsfuse could be considered for:

  1. Matchbox on Compute Engine:

    • Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
    • Compute Engine VM supports FUSE
    • Read-heavy workload (boot assets rarely change)
    • Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
  2. Development/Testing:

    • Quick prototyping without writing Cloud Storage integration
    • Local development with production bucket access
    • Not recommended for production deployment
  3. Low-Throughput Scenarios:

    • Home lab scale (< 10 boots/hour)
    • File cache enabled with Local SSD
    • Single Compute Engine VM (not autoscaled)

Configuration for Matchbox + gcsfuse:

#!/bin/bash
# Mount boot assets for Matchbox

BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"

mkdir -p "$MOUNT_POINT" "$CACHE_DIR"

gcsfuse \
  --stat-cache-max-size-mb=-1 \
  --type-cache-max-size-mb=-1 \
  --metadata-cache-ttl-secs=-1 \
  --file-cache-max-size-mb=-1 \
  --cache-dir="$CACHE_DIR" \
  --file-cache-cache-file-for-range-read=true \
  --file-cache-enable-parallel-downloads=true \
  --implicit-dirs \
  --foreground \
  "$BUCKET" "$MOUNT_POINT"

Monitoring and Troubleshooting

Metrics

gcsfuse exposes Prometheus metrics:

gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point

Key metrics:

  • gcs_read_count: Number of GCS read operations
  • gcs_write_count: Number of GCS write operations
  • gcs_read_bytes: Bytes read from GCS
  • gcs_write_bytes: Bytes written to GCS
  • fs_ops_count: Filesystem operations by type (open, read, write, etc.)
  • fs_ops_error_count: Filesystem operation errors
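
To spot-check the exporter, a quick query might look like this (a sketch assuming the Prometheus endpoint above is enabled and serves the conventional /metrics path on port 9101):

curl -s http://localhost:9101/metrics | grep -E 'gcs_(read|write)_count|fs_ops_error_count'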

Logging

# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point

Common Issues

Issue: ls on large directories takes minutes

Solution:

  • Enable list caching with --metadata-cache-ttl-secs=-1
  • Reduce directory depth (flatten object hierarchy)
  • Consider prefix-based filtering instead of full listings

Issue: Stale reads after external bucket modifications

Solution:

  • Reduce --metadata-cache-ttl-secs (default 60s)
  • Disable caching entirely for strong consistency
  • Use versioned object names (immutable assets)

Issue: Transport endpoint is not connected errors

Solution:

  • Unmount cleanly before remounting: fusermount -u /mnt/point
  • Check GCS bucket permissions (IAM roles)
  • Verify network connectivity to storage.googleapis.com

Issue: High memory usage

Solution:

  • Limit metadata cache sizes: --stat-cache-max-size-mb=1024
  • Disable file cache if not needed
  • Monitor with --prometheus metrics

Comparison to Alternatives

gcsfuse vs Direct Cloud Storage SDK

Aspect | gcsfuse | Cloud Storage SDK
Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API)
Cloud Run | ❌ Not supported | ✅ Fully supported
Development Effort | Low (standard filesystem code) | Medium (SDK integration)
Performance | Slower (filesystem abstraction) | Faster (optimized for use case)
Caching | Built-in (stat, type, list, file) | Manual (application-level)
Streaming | Automatic | Explicit (io.Copy)
Dependencies | FUSE kernel module, privileged mode | None (pure Go library)

Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.

gcsfuse vs rsync/gsutil Sync

Periodic sync pattern:

# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets

Aspect | gcsfuse | rsync/gsutil sync
Consistency | Eventual (with caching) | Strong (within sync interval)
Disk Usage | Minimal (file cache optional) | Full copy of assets
Latency | GCS API per request | Local disk (fast)
Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes)
Deployment | Requires FUSE | Simple cron job

Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.

Conclusion

Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:

  1. Cloud Run incompatibility (requires FUSE kernel module)
  2. Added latency (FUSE overhead + network round-trips)
  3. Poor performance for small files and concurrent access
  4. Caching trade-offs (consistency vs performance)

Recommended alternatives:

  • Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
  • Matchbox on Compute Engine: rsync/gsutil sync to local disk
  • Cloud Run Deployment: Direct SDK (no gcsfuse possible)

gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.


2 - GCP Network Boot Protocol Support

Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure

Network Boot Protocol Support on Google Cloud Platform

This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.

TFTP (Trivial File Transfer Protocol) Support

Native Support

Status: ❌ Not natively supported by Cloud Load Balancing

GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.

Implementation Options

Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing.

Option 1: Direct VM Access (Recommended for VPN Scenario)

  • Approach: Run TFTP server (e.g., tftpd-hpa, dnsmasq) on a Compute Engine VM
  • Access: Home lab connects via VPN tunnel to the VM’s private IP
  • Routing: VPC firewall rules allow UDP/69 from VPN subnet
  • Pros:
    • Simple implementation
    • No need for load balancing (single boot server sufficient)
    • TFTP traffic encrypted through VPN tunnel
    • Direct VM-to-client communication
  • Cons:
    • Single point of failure (no load balancing/HA)
    • Manual failover required if VM fails
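
A minimal dnsmasq configuration for this Option 1 approach might look like the following sketch (the TFTP root, listen address, and service restart are assumptions):

# Write a TFTP-only dnsmasq configuration
sudo tee /etc/dnsmasq.d/tftp.conf <<'EOF'
# Disable DNS; act purely as a TFTP server
port=0
enable-tftp
tftp-root=/var/lib/tftpboot
# Serve only files owned by the dnsmasq user (read-only by design)
tftp-secure
# Listen on the VPN-facing address (assumed tunnel IP)
listen-address=10.200.0.1
EOF
sudo systemctl restart dnsmasq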

Option 2: Network Load Balancer (NLB) Passthrough

While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:

  • Approach: Configure Network Load Balancer for UDP/69 passthrough
  • Limitations:
    • No protocol-aware health checks for TFTP
    • Health checks would use TCP or HTTP on alternate port
    • Adds complexity without significant benefit for single boot server
  • Use Case: Only relevant for multi-region HA deployment (overkill for home lab)

TFTP Security Considerations

  • Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
  • Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
  • File Access Control: Configure TFTP server with restricted file access
  • Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads

HTTP Support

Native Support

Status: ✅ Fully supported

GCP provides comprehensive HTTP support through multiple services:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
  • Port: Any port (typically 80 for HTTP)
  • Routing: URL-based routing, host-based routing, path-based routing
  • Health Checks: HTTP health checks with configurable paths
  • SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
  • Backend: Compute Engine VMs, instance groups, Cloud Run, GKE

Compute Engine Direct Access

For VPN scenario, HTTP can be served directly from VM:

  • Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
  • Access: Home lab accesses via VPN tunnel to private IP
  • Firewall: VPC firewall rules allow TCP/80 from VPN subnet
  • Pros: Simpler than load balancer for single boot server
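
A sketch of the direct-VM HTTP option using nginx (document root, config path, and reload command are assumptions):

# Minimal nginx server block serving boot assets over plain HTTP
sudo tee /etc/nginx/conf.d/boot.conf <<'EOF'
server {
    listen 80;
    root /var/lib/boot-assets;
    autoindex off;
}
EOF
sudo nginx -t && sudo systemctl reload nginx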

HTTP Boot Flow for Network Boot

  1. PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
  2. iPXE → HTTP: iPXE chainloads boot files via HTTP from same server
  3. Kernel/Initrd: Large boot files served efficiently over HTTP

Performance Considerations

  • Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
  • Compression: gzip compression for text-based boot configs
  • Caching: Cloud CDN can cache boot files for faster delivery
  • TCP Optimization: GCP’s network optimized for low-latency TCP

HTTPS Support

Native Support

Status: ✅ Fully supported with advanced features

GCP provides enterprise-grade HTTPS support:

Cloud Load Balancing - Application Load Balancer

  • Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
  • SSL/TLS Termination: Terminate SSL at load balancer
  • Certificate Management:
    • Google-managed SSL certificates (automatic renewal)
    • Self-managed certificates (bring your own)
    • Certificate Map for multiple domains
  • TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
  • Cipher Suites: Modern, compatible, or custom cipher suites
  • mTLS Support: Mutual TLS authentication (client certificates)

Certificate Manager

  • Managed Certificates: Automatic provisioning and renewal of Google-managed certificates
  • Private CA: Integration with Google Cloud Certificate Authority Service
  • Certificate Maps: Route different domains to different backends based on SNI
  • Certificate Monitoring: Automatic alerts before expiration

HTTPS for Network Boot

Use Case

Modern UEFI firmware and iPXE support HTTPS boot:

  • iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
  • UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (introduced in the UEFI 2.5 specification)
  • Security: Boot file integrity verified via HTTPS chain of trust
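
For reference, HTTPS support in iPXE is a build-time option; a sketch of enabling it and embedding a trust anchor follows (the certificate path is an assumption):

git clone https://github.com/ipxe/ipxe.git
cd ipxe/src
# Enable HTTPS downloads in the build configuration
sed -i 's/^#undef[[:space:]]*DOWNLOAD_PROTO_HTTPS/#define DOWNLOAD_PROTO_HTTPS/' config/general.h
# Build a UEFI iPXE binary that trusts the boot server's CA
make bin-x86_64-efi/ipxe.efi TRUST=/path/to/boot-ca.crt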

Implementation on GCP

  1. Certificate Provisioning:

    • Use Google-managed certificate for public domain (if boot server has public DNS)
    • Use self-signed certificate for VPN-only access (add to iPXE trust store)
    • Use private CA for internal PKI
  2. Load Balancer Configuration:

    • HTTPS frontend (port 443)
    • Backend service to Compute Engine VM running boot server
    • SSL policy with TLS 1.2+ minimum
  3. Alternative: Direct VM HTTPS:

    • Run nginx/Apache with TLS on Compute Engine VM
    • Access via VPN tunnel to private IP with HTTPS
    • Simpler setup for VPN-only scenario
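
A self-signed certificate for the VPN-only, direct-VM option could be generated as follows (a sketch; hostname and file paths are assumptions, and the certificate must be added to the iPXE/UEFI trust store):

sudo openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
    -keyout /etc/ssl/private/boot-server.key \
    -out /etc/ssl/certs/boot-server.crt \
    -subj "/CN=boot.homelab.internal"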

mTLS Support for Enhanced Security

GCP’s Application Load Balancer supports mutual TLS authentication:

  • Client Certificates: Require client certificates for additional authentication
  • Certificate Validation: Validate client certificates against trusted CA
  • Use Case: Ensure only authorized home lab servers can access boot files
  • Integration: Combine with VPN for defense-in-depth

Routing and Load Balancing Capabilities

VPC Routing

  • Custom Routes: Define routes to direct traffic through VPN gateway
  • Route Priority: Configure route priorities for failover scenarios
  • BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)

Firewall Rules

  • Ingress/Egress Rules: Fine-grained control over traffic
  • Source/Destination Filters: IP ranges, tags, service accounts
  • Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
  • VPN Subnet Restriction: Limit access to VPN-connected home lab subnet
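
For example, the boot-protocol rules restricted to the VPN subnet might look like this sketch (network name, subnet range, and target tag are assumptions):

gcloud compute firewall-rules create allow-boot-from-vpn \
    --direction=INGRESS \
    --network=default \
    --action=ALLOW \
    --rules=udp:69,tcp:80,tcp:443 \
    --source-ranges=10.200.0.0/24 \
    --target-tags=boot-server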

Cloud Armor (Optional)

For additional security if boot server has public access:

  • DDoS Protection: Layer 3/4 DDoS mitigation
  • WAF Rules: Application-level filtering
  • IP Allowlisting: Restrict to known public IPs
  • Rate Limiting: Prevent abuse

Cost Implications

Network Egress Costs

  • VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
  • Intra-Region: Free for traffic within same region
  • Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
  • Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)

Load Balancing Costs

  • Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
  • Network Load Balancer: ~$0.025/hour + data processing charges
  • For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)

Compute Costs

  • e2-micro Instance: ~$6-7/month (suitable for boot server)
  • f1-micro Instance: ~$4-5/month (even smaller, might suffice)
  • Reserved/Committed Use: Discounts for long-term commitment

Comparison with Requirements

Requirement | GCP Support | Implementation
TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN
HTTP | ✅ Full support | VM or ALB
HTTPS | ✅ Full support | VM or ALB with Certificate Manager
VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard
Load Balancing | ✅ ALB, NLB | Optional for HA
Certificate Mgmt | ✅ Managed certs | Certificate Manager
Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient

Recommendations

For VPN-Based Architecture (per ADR-0002)

  1. Compute Engine VM: Deploy single e2-micro VM with:

    • TFTP server (tftpd-hpa or dnsmasq)
    • HTTP server (nginx or simple Python HTTP server)
    • Optional HTTPS with self-signed certificate
  2. VPN Tunnel: Connect home lab to GCP via:

    • Cloud VPN (IPsec) - easier setup, higher cost
    • Self-managed WireGuard on Compute Engine - lower cost, more control
  3. VPC Firewall: Restrict access to:

    • UDP/69 (TFTP) from VPN subnet only
    • TCP/80 (HTTP) from VPN subnet only
    • TCP/443 (HTTPS) from VPN subnet only
  4. No Load Balancer: For home lab scale, direct VM access is sufficient

  5. Health Monitoring: Use Cloud Monitoring for VM and service health

If HA Required (Future Enhancement)

  • Deploy multi-zone VMs with Network Load Balancer
  • Use Cloud Storage as backend for boot files with VM serving as cache
  • Implement failover automation with Cloud Functions


3 - GCP WireGuard VPN Support

Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity

WireGuard VPN Support on Google Cloud Platform

This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.

WireGuard Overview

WireGuard is a modern VPN protocol that provides:

  • Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
  • Performance: High throughput with low overhead
  • Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
  • Configuration: Simple key-based configuration
  • Kernel Integration: Mainline Linux kernel support since 5.6

GCP Native VPN Support

Cloud VPN (IPsec)

Status: ❌ WireGuard not natively supported

GCP’s managed Cloud VPN service supports:

  • IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
  • HA VPN: Highly available VPN with 99.99% SLA
  • Classic VPN: Single-tunnel VPN (deprecated)

Limitation: Cloud VPN does not support WireGuard protocol natively.

Cost: Cloud VPN

  • HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
  • Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
  • Total Estimate: ~$75-100/month for managed VPN

Self-Managed WireGuard on Compute Engine

Implementation Approach

Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:

Status: ✅ Fully supported via Compute Engine

Architecture

graph LR
    A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
    B -->|Private VPC Network| C[Boot Server VM]
    B -->|IP Forwarding| C
    
    subgraph "Home Network"
        A
        D[UDM Pro]
        D -.WireGuard Client.- A
    end
    
    subgraph "GCP VPC"
        B[WireGuard Gateway VM]
        C[Boot Server VM]
    end

VM Configuration

  1. WireGuard Gateway VM:

    • Instance Type: e2-micro or f1-micro ($4-7/month)
    • OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
    • IP Forwarding: Enable IP forwarding to route traffic to other VMs
    • External IP: Static external IP for stable WireGuard endpoint
    • Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
  2. Boot Server VM:

    • Network: Same VPC as WireGuard gateway
    • Private IP Only: No external IP (accessed via VPN)
    • Route Traffic: Through WireGuard gateway VM
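
A sketch of creating the gateway VM with IP forwarding enabled (instance name, zone, machine type, and image are assumptions):

gcloud compute instances create wireguard-gateway-vm \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --can-ip-forward \
    --tags=wireguard-gateway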

Installation Steps

# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools

# Generate server keys
wg genkey | tee /etc/wireguard/server_private.key | wg pubkey > /etc/wireguard/server_public.key
chmod 600 /etc/wireguard/server_private.key

# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf

Example /etc/wireguard/wg0.conf on GCP VM:

[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE

[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24

Corresponding config on UDM Pro:

[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>

[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25

Enable and Start WireGuard

# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0

# Verify status
sudo wg show

GCP VPC Configuration

Firewall Rules

Create VPC firewall rule to allow WireGuard:

gcloud compute firewall-rules create allow-wireguard \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=udp:51820 \
    --source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
    --target-tags=wireguard-gateway

Tag the WireGuard VM:

gcloud compute instances add-tags wireguard-gateway-vm \
    --tags=wireguard-gateway \
    --zone=us-central1-a

Static External IP

Reserve static IP for stable WireGuard endpoint:

gcloud compute addresses create wireguard-gateway-ip \
    --region=us-central1

gcloud compute instances delete-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --zone=us-central1-a

gcloud compute instances add-access-config wireguard-gateway-vm \
    --access-config-name="external-nat" \
    --address=wireguard-gateway-ip \
    --zone=us-central1-a

Cost: Static external IP ~$3-4/month (GCP now bills external IPv4 addresses even while attached to a running VM).

Route Configuration

For traffic from boot server to reach home lab via WireGuard VM:

gcloud compute routes create route-to-homelab \
    --network=default \
    --priority=100 \
    --destination-range=192.168.1.0/24 \
    --next-hop-instance=wireguard-gateway-vm \
    --next-hop-instance-zone=us-central1-a

This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.

UDM Pro WireGuard Integration

Native Support

Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)

The UniFi Dream Machine Pro includes native WireGuard VPN support:

  • GUI Configuration: Web UI for WireGuard VPN setup
  • Site-to-Site: Support for site-to-site VPN tunnels
  • Performance: Hardware acceleration for encryption (if available)
  • Routing: Automatic route injection for remote subnets

Configuration Steps on UDM Pro

  1. Network Settings → VPN:

    • Create new VPN connection
    • Select “WireGuard”
    • Generate key pair or import existing
  2. Peer Configuration:

    • Peer Public Key: GCP WireGuard VM’s public key
    • Endpoint: GCP VM’s static external IP
    • Port: 51820
    • Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
    • Persistent Keepalive: 25 seconds
  3. Route Injection:

    • UDM Pro automatically adds routes to GCP subnets
    • Home lab servers can reach GCP boot server via VPN
  4. Firewall Rules:

    • Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN

Alternative: Manual WireGuard on UDM Pro

If native support is insufficient, use wireguard-go via udm-utilities:

  • Repository: boostchicken/udm-utilities
  • Script: on_boot.d script to start WireGuard
  • Persistence: Survives firmware updates with on-boot script

Performance Considerations

Throughput

WireGuard on Compute Engine performance:

  • e2-micro (2 vCPU, shared core): ~100-300 Mbps
  • e2-small (2 vCPU): ~500-800 Mbps
  • e2-medium (2 vCPU): ~1+ Gbps

For network boot (typical boot = 50-200MB), even e2-micro is sufficient:

  • Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
  • Recommendation: e2-micro adequate for home lab scale

Latency

  • VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
  • GCP Network: Low-latency network to most regions
  • Total Latency: Primarily dependent on home ISP and GCP region proximity

CPU Usage

  • Encryption: ChaCha20 is CPU-efficient
  • Kernel Module: Minimal CPU overhead in kernel space
  • e2-micro: Sufficient CPU for home lab VPN throughput

Security Considerations

Key Management

  • Private Keys: Store securely, never commit to version control
  • Key Rotation: Rotate keys periodically (e.g., annually)
  • Secret Manager: Store WireGuard private keys in GCP Secret Manager
    • Retrieve at VM startup via startup script
    • Avoid storing in VM metadata or disk images
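
A sketch of seeding the secret (the secret name matches the startup script shown later in this document; gcloud must be authenticated with Secret Manager permissions):

# Create the secret and upload the existing private key as the first version
gcloud secrets create wireguard-server-key --replication-policy=automatic
sudo cat /etc/wireguard/server_private.key | gcloud secrets versions add wireguard-server-key --data-file=-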

Firewall Hardening

  • Source IP Restriction: Limit WireGuard port to home lab public IP only
  • Least Privilege: Boot server firewall allows only VPN subnet
  • No Public Access: Boot server has no external IP

Monitoring and Alerts

  • Cloud Logging: Log WireGuard connection events
  • Cloud Monitoring: Alert on VPN tunnel down
  • Metrics: Monitor handshake failures, data transfer

DDoS Protection

  • UDP Amplification: WireGuard resistant to DDoS amplification
  • Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)

High Availability Options

Multi-Region Failover

Deploy WireGuard gateways in multiple regions:

  • Primary: us-central1 WireGuard VM
  • Secondary: us-east1 WireGuard VM
  • Failover: UDM Pro switches endpoints if primary fails
  • Cost: Doubles VM costs (~$8-14/month for 2 VMs)

Health Checks

Monitor WireGuard tunnel health:

# On UDM Pro (via SSH)
wg show wg0 latest-handshakes

# If handshake timestamp old (>3 minutes), tunnel may be down

Automate failover with script on UDM Pro or external monitoring.
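
A minimal health-check sketch that restarts the tunnel when the newest handshake is stale (interface name and threshold are assumptions):

#!/bin/sh
# Restart wg0 if the most recent peer handshake is older than 3 minutes
NOW=$(date +%s)
LAST=$(wg show wg0 latest-handshakes | awk '{print $2}' | sort -n | tail -1)
if [ -n "$LAST" ] && [ $((NOW - LAST)) -gt 180 ]; then
    echo "WireGuard handshake stale; restarting wg0"
    wg-quick down wg0 && wg-quick up wg0
fi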

Startup Scripts for Auto-Healing

GCP VM startup script to ensure WireGuard starts on boot:

#!/bin/bash
# /etc/startup-script.sh

# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key

# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0

Attach as metadata:

gcloud compute instances add-metadata wireguard-gateway-vm \
    --metadata-from-file startup-script=/path/to/startup-script.sh \
    --zone=us-central1-a

Cost Analysis

Self-Managed WireGuard on Compute Engine

Component | Cost
e2-micro VM (730 hrs/month) | ~$6.50
Static External IP | ~$3.50
Egress (1GB/month boot traffic) | ~$0.12
Monthly Total | ~$10.12
Annual Total | ~$121

Cloud VPN (IPsec - if WireGuard not used)

Component | Cost
HA VPN Gateway (2 tunnels) | ~$73
Egress (1GB/month) | ~$0.12
Monthly Total | ~$73
Annual Total | ~$876

Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.

Comparison with Requirements

Requirement | GCP Support | Implementation
WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM
Site-to-Site VPN | ✅ Yes | WireGuard tunnel
UDM Pro Integration | ✅ Native support | WireGuard peer config
Cost Efficiency | ✅ Low cost | e2-micro ~$10/month
Performance | ✅ Sufficient | 100+ Mbps on e2-micro
Security | ✅ Modern crypto | ChaCha20, Curve25519
HA (optional) | ⚠️ Manual setup | Multi-region VMs

Recommendations

For Home Lab VPN (per ADR-0002)

  1. Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM

    • Cost: ~$10/month (vs ~$73/month for Cloud VPN)
    • Performance: Sufficient for network boot traffic
    • Simplicity: Easy to configure and maintain
  2. Single Region Deployment: Unless HA required, single VM adequate

    • Region Selection: Choose region closest to home lab for lowest latency
    • Zone: Single zone sufficient (boot server not mission-critical)
  3. UDM Pro Native WireGuard: Use built-in WireGuard client

    • Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
    • Route Injection: UDM Pro automatically routes GCP subnets
  4. Security Best Practices:

    • Store WireGuard private key in Secret Manager
    • Restrict WireGuard port to home public IP only
    • Use startup script to configure VM on boot
    • Enable Cloud Logging for VPN events
  5. Monitoring: Set up Cloud Monitoring alerts for:

    • VM down
    • High CPU usage (indicates traffic spike or issue)
    • Firewall rule blocks (indicates misconfiguration)

Future Enhancements

  • HA Setup: Deploy secondary WireGuard VM in different region
  • Automated Failover: Script on UDM Pro to switch endpoints
  • IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
  • Mesh VPN: Expand to mesh topology if multiple sites added
