Google Cloud Platform Analysis
Technical analysis of Google Cloud Platform capabilities for hosting network boot infrastructure
This section contains detailed analysis of Google Cloud Platform (GCP) for hosting the network boot server infrastructure, evaluating its support for TFTP, HTTP/HTTPS routing, and WireGuard VPN connectivity as required by ADR-0002.
Overview
Google Cloud Platform is Google’s suite of cloud computing services, offering compute, storage, networking, and managed services. This analysis focuses on GCP’s capabilities to support the network boot architecture decided in ADR-0002.
Key Services Evaluated
- Compute Engine: Virtual machine instances for hosting boot server
- Cloud VPN / VPC: Network connectivity and VPN capabilities
- Cloud Load Balancing: Layer 4 and Layer 7 load balancing for HTTP/HTTPS
- Cloud NAT: Network address translation for outbound connectivity
- VPC Network: Software-defined networking and routing
Documentation Sections
1 - Cloud Storage FUSE (gcsfuse)
Analysis of Google Cloud Storage FUSE for mounting GCS buckets as local filesystems in network boot infrastructure
Overview
Cloud Storage FUSE (gcsfuse) is a FUSE-based filesystem adapter that allows Google Cloud Storage (GCS) buckets to be mounted and accessed as local filesystems on Linux systems. This enables applications to interact with object storage using standard filesystem operations (open, read, write, etc.) rather than requiring GCS-specific APIs.
Project: GoogleCloudPlatform/gcsfuse
License: Apache 2.0
Status: Generally Available (GA)
Latest Version: v2.x (as of 2024)
How gcsfuse Works
gcsfuse translates filesystem operations into GCS API calls:
- Mount Operation: gcsfuse bucket-name /mount/point maps a GCS bucket to a local directory
- Directory Structure: Interprets / in object names as directory separators
- File Operations: Translates read(), write(), open(), etc. into GCS API requests
- Metadata: Maintains file attributes (size, modification time) via GCS metadata
- Caching: Optional stat, type, list, and file caching to reduce API calls
Example:
- GCS object: gs://boot-assets/kernels/talos-v1.6.0.img
- Mounted path: /mnt/boot-assets/kernels/talos-v1.6.0.img
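As a quick illustration (a minimal sketch, assuming the VM's service account already has read access to the boot-assets bucket), the bucket can be mounted and an asset read with ordinary file I/O:
# Minimal sketch: mount the bucket, read an asset with normal file I/O, unmount
mkdir -p /mnt/boot-assets
gcsfuse boot-assets /mnt/boot-assets
sha256sum /mnt/boot-assets/kernels/talos-v1.6.0.img   # each read is translated into GCS API calls
fusermount -u /mnt/boot-assets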
Relevance to Network Boot Infrastructure
In the context of ADR-0005 Network Boot Infrastructure, gcsfuse offers a potential approach for serving boot assets from Cloud Storage without custom integration code.
Potential Use Cases
- Boot Asset Storage: Mount gs://boot-assets/ to /var/lib/boot-server/assets/
- Configuration Sync: Access boot profiles and machine mappings from GCS as local files
- Matchbox Integration: Mount GCS bucket to /var/lib/matchbox/ for assets/profiles/groups
- Simplified Development: Eliminate custom Cloud Storage SDK integration in boot server code
Architecture Pattern
┌─────────────────────────┐
│ Boot Server Process │
│ (Cloud Run/Compute) │
└───────────┬─────────────┘
│ filesystem operations
│ (read, open, stat)
▼
┌─────────────────────────┐
│ gcsfuse mount point │
│ /var/lib/boot-assets │
└───────────┬─────────────┘
│ FUSE layer
│ (translates to GCS API)
▼
┌─────────────────────────┐
│ Cloud Storage Bucket │
│ gs://boot-assets/ │
└─────────────────────────┘
Latency
- Much higher latency than local filesystem: Every operation requires GCS API call(s)
- No default caching: Without caching enabled, every read re-fetches from GCS
- Network round-trip: Minimum ~10-50ms latency per operation (depending on region)
Throughput
Single Large File:
- Read: ~4.1 MiB/s (individual file), up to 63.3 MiB/s (archive files)
- Write: Comparable to gsutil cp for large files
- With parallel downloads: Up to 9x faster for single-threaded reads of large files
Small Files:
- Poor performance for random I/O on small files
- Bulk operations on many small files create significant bottlenecks
- ls on directories with thousands of objects can take minutes
Concurrent Access:
- Performance degrades significantly with parallel readers (8 instances: ~30 hours vs 16 minutes with local data)
- Not recommended for high-concurrency scenarios (web servers, NAS)
Streaming Writes (default): Upload data directly to GCS as written
- Up to 40% faster for large sequential writes
- Reduces local disk usage (no staging file)
Parallel Downloads: Download large files using multiple workers
- Up to 9x faster model load times
- Best for single-threaded reads of large files
File Cache: Cache file contents locally (Local SSD, Persistent Disk, or tmpfs)
- Up to 2.3x faster training time (AI/ML workloads)
- Up to 3.4x higher throughput
- Requires explicit cache directory configuration
Metadata Cache: Cache stat, type, and list operations
- Stat and type caches enabled by default
- Configurable TTL (default: 60s, set -1 for unlimited)
Caching Configuration
gcsfuse provides four types of caching:
1. Stat Cache
Caches file attributes (size, modification time, existence).
# Enable with unlimited size and TTL
gcsfuse \
--stat-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Reduces API calls for repeated stat() operations (e.g., checking file existence).
2. Type Cache
Caches file vs directory type information.
gcsfuse \
--type-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Speeds up directory traversal and ls operations.
3. List Cache
Caches directory listing results.
gcsfuse \
--max-conns-per-host=100 \
--metadata-cache-ttl-secs=-1 \
bucket-name /mount/point
Use case: Improves performance for applications that repeatedly list directory contents.
4. File Cache
Caches actual file contents locally.
gcsfuse \
--file-cache-max-size-mb=-1 \
--cache-dir=/mnt/local-ssd \
--file-cache-cache-file-for-range-read=true \
--file-cache-enable-parallel-downloads=true \
bucket-name /mount/point
Use case: Essential for AI/ML training, repeated reads of large files.
Recommended cache storage:
- Local SSD: Fast, but ephemeral (data lost on restart)
- Persistent Disk: Persistent but slower than Local SSD
- tmpfs (RAM disk): Fastest, but capacity limited by available memory
Production Configuration Example
# config.yaml for gcsfuse
metadata-cache:
  ttl-secs: -1                     # Never expire (use only if bucket is read-only or single-writer)
  stat-cache-max-size-mb: -1
  type-cache-max-size-mb: -1
file-cache:
  max-size-mb: -1                  # Unlimited (limited by disk space)
  cache-file-for-range-read: true
  enable-parallel-downloads: true
  parallel-downloads-per-file: 16
  download-chunk-size-mb: 50
write:
  create-empty-file: false         # Streaming writes (default)
logging:
  severity: info
  format: json
gcsfuse --config-file=config.yaml boot-assets /mnt/boot-assets
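To re-establish the mount across VM reboots, the bucket can also be declared in /etc/fstab (a hedged sketch; the mount options shown are illustrative and rely on the mount helper installed with the gcsfuse package):
# Append an fstab entry so the bucket is remounted at boot (options are illustrative)
sudo mkdir -p /mnt/boot-assets
echo "boot-assets /mnt/boot-assets gcsfuse rw,_netdev,file_mode=644,dir_mode=755" | sudo tee -a /etc/fstab
sudo mount /mnt/boot-assets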
Limitations and Considerations
Filesystem Semantics
gcsfuse provides approximate POSIX semantics but is not fully POSIX-compliant:
- No atomic rename: Rename operations are copy-then-delete (not atomic)
- No hard links: GCS doesn’t support hard links
- No file locking: flock() is a no-op
- Limited permissions: GCS has simpler ACLs than POSIX permissions
- No sparse files: Writes always materialize full file content
❌ Avoid:
- Serving web content or acting as NAS (concurrent connections)
- Random I/O on many small files (image datasets, text corpora)
- Reading during ML training loops (download first, then train)
- High-concurrency workloads (multiple parallel readers/writers)
✅ Good for:
- Sequential reads of large files (models, checkpoints, kernels)
- Infrequent writes of entire files
- Read-mostly workloads with caching enabled
- Single-writer scenarios
Consistency Trade-offs
With caching enabled:
- Stale reads possible if cache TTL > 0 and external modifications occur
- Safe only for:
- Read-only buckets
- Single-writer, single-mount scenarios
- Workloads tolerant of eventual consistency
Without caching:
- Strong consistency (every read fetches latest from GCS)
- Much slower performance
Resource Requirements
- Disk space: File cache and streaming writes require local storage
- File cache: Size of cached files (can be large for ML datasets)
- Streaming writes: Temporary staging (proportional to concurrent writes)
- Memory: Metadata caches consume RAM
- File handles: Can exceed system limits with high concurrency
- Network bandwidth: All data transfers via GCS API
Installation
On Compute Engine (Debian-based images)
# Install gcsfuse from the GitHub release package
# (Container-Optimized OS ships no package manager; run gcsfuse inside a container there instead)
export GCSFUSE_VERSION=2.x.x
curl -L -O https://github.com/GoogleCloudPlatform/gcsfuse/releases/download/v${GCSFUSE_VERSION}/gcsfuse_${GCSFUSE_VERSION}_amd64.deb
sudo dpkg -i gcsfuse_${GCSFUSE_VERSION}_amd64.deb
On Debian/Ubuntu
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
In Docker/Cloud Run
FROM ubuntu:22.04
# Install gcsfuse
RUN apt-get update && apt-get install -y \
curl \
gnupg \
lsb-release \
&& export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s) \
&& echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | tee /etc/apt/sources.list.d/gcsfuse.list \
&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
&& apt-get update \
&& apt-get install -y gcsfuse \
&& rm -rf /var/lib/apt/lists/*
# Create mount point
RUN mkdir -p /mnt/boot-assets
# Mount gcsfuse at startup
CMD gcsfuse --foreground boot-assets /mnt/boot-assets & \
/usr/local/bin/boot-server
Note: Cloud Run does not support FUSE filesystems (requires privileged mode). gcsfuse only works on Compute Engine or GKE.
Network Boot Infrastructure Evaluation
Applicability to ADR-0005
Based on the analysis, gcsfuse is not recommended for the network boot infrastructure for the following reasons:
❌ Cloud Run Incompatibility
- gcsfuse requires FUSE kernel module and privileged containers
- Cloud Run does not support FUSE or privileged mode
- ADR-0005 prefers Cloud Run deployment (HTTP-only boot enables serverless)
- Impact: Blocks Cloud Run deployment, forcing Compute Engine VM
❌ Boot Latency Requirements
- Boot file requests target < 100ms latency (ADR-0005 confirmation criteria)
- gcsfuse adds 10-50ms+ latency per operation (network round-trips)
- Kernel/initrd downloads are latency-sensitive (network boot timeout)
- Impact: May exceed boot timeout thresholds
❌ No Caching for Read-Write Workloads
- Boot server needs to write new assets and read existing ones
- File cache with unlimited TTL requires read-only or single-writer assumption
- Multiple boot server instances (autoscaling) violate single-writer constraint
- Impact: Either accept stale reads or disable caching (slow)
❌ Small File Performance
- Machine mapping configs, boot scripts, profiles are small files (KB range)
- gcsfuse performs poorly on small, random I/O
- ls operations on directories with many profiles can be slow
- Impact: Slow boot configuration lookups
✅ Alternative: Direct Cloud Storage SDK
Using cloud.google.com/go/storage SDK directly offers:
- Lower latency: Direct API calls without FUSE overhead
- Cloud Run compatible: No kernel module or privileged mode required
- Better control: Explicit caching, parallel downloads, streaming
- Simpler deployment: No mount management, no FUSE dependencies
- Cost: Similar API call costs to gcsfuse
Recommended approach (from ADR-0005):
// Custom boot server using Cloud Storage SDK (cloud.google.com/go/storage)
client, err := storage.NewClient(ctx) // create once at startup and reuse
if err != nil {
    log.Fatalf("storage.NewClient: %v", err)
}
bucket := client.Bucket("boot-assets")

// Stream kernel to boot client (inside the HTTP handler)
obj := bucket.Object("kernels/talos-v1.6.0.img")
reader, err := obj.NewReader(ctx)
if err != nil {
    http.Error(w, "boot asset not found", http.StatusNotFound)
    return
}
defer reader.Close()
io.Copy(w, reader) // Stream to HTTP response
When gcsfuse MIGHT Be Useful
Despite the above limitations, gcsfuse could be considered for:
Matchbox on Compute Engine:
- Matchbox expects filesystem paths for assets (/var/lib/matchbox/assets/)
- Compute Engine VM supports FUSE
- Read-heavy workload (boot assets rarely change)
- Could mount gs://boot-assets/ to /var/lib/matchbox/assets/ with file cache
Development/Testing:
- Quick prototyping without writing Cloud Storage integration
- Local development with production bucket access
- Not recommended for production deployment
Low-Throughput Scenarios:
- Home lab scale (< 10 boots/hour)
- File cache enabled with Local SSD
- Single Compute Engine VM (not autoscaled)
Configuration for Matchbox + gcsfuse:
#!/bin/bash
# Mount boot assets for Matchbox
BUCKET="boot-assets"
MOUNT_POINT="/var/lib/matchbox/assets"
CACHE_DIR="/mnt/disks/local-ssd/gcsfuse-cache"
mkdir -p "$MOUNT_POINT" "$CACHE_DIR"
gcsfuse \
--stat-cache-max-size-mb=-1 \
--type-cache-max-size-mb=-1 \
--metadata-cache-ttl-secs=-1 \
--file-cache-max-size-mb=-1 \
--cache-dir="$CACHE_DIR" \
--file-cache-cache-file-for-range-read=true \
--file-cache-enable-parallel-downloads=true \
--implicit-dirs \
--foreground \
"$BUCKET" "$MOUNT_POINT"
Monitoring and Troubleshooting
Metrics
gcsfuse exposes Prometheus metrics:
gcsfuse --prometheus --prometheus-port=9101 bucket /mnt/point
Key metrics:
- gcs_read_count: Number of GCS read operations
- gcs_write_count: Number of GCS write operations
- gcs_read_bytes: Bytes read from GCS
- gcs_write_bytes: Bytes written to GCS
- fs_ops_count: Filesystem operations by type (open, read, write, etc.)
- fs_ops_error_count: Filesystem operation errors
Logging
# JSON logging for Cloud Logging integration
gcsfuse --log-format=json --log-file=/var/log/gcsfuse.log bucket /mnt/point
Common Issues
Issue: ls on large directories takes minutes
Solution:
- Enable list caching with --metadata-cache-ttl-secs=-1
- Reduce directory depth (flatten object hierarchy)
- Consider prefix-based filtering instead of full listings
Issue: Stale reads after external bucket modifications
Solution:
- Reduce --metadata-cache-ttl-secs (default 60s)
- Disable caching entirely for strong consistency
- Use versioned object names (immutable assets)
Issue: Transport endpoint is not connected errors
Solution:
- Unmount cleanly before remounting: fusermount -u /mnt/point
- Check GCS bucket permissions (IAM roles)
- Verify network connectivity to storage.googleapis.com
Issue: High memory usage
Solution:
- Limit metadata cache sizes: --stat-cache-max-size-mb=1024
- Disable file cache if not needed
- Monitor with --prometheus metrics
Comparison to Alternatives
gcsfuse vs Direct Cloud Storage SDK
| Aspect | gcsfuse | Cloud Storage SDK |
|---|---|---|
| Latency | Higher (FUSE overhead + GCS API) | Lower (direct GCS API) |
| Cloud Run | ❌ Not supported | ✅ Fully supported |
| Development Effort | Low (standard filesystem code) | Medium (SDK integration) |
| Performance | Slower (filesystem abstraction) | Faster (optimized for use case) |
| Caching | Built-in (stat, type, list, file) | Manual (application-level) |
| Streaming | Automatic | Explicit (io.Copy) |
| Dependencies | FUSE kernel module, privileged mode | None (pure Go library) |
Recommendation: Use Cloud Storage SDK directly for production network boot infrastructure.
gcsfuse vs rsync/gsutil Sync
Periodic sync pattern:
# Sync bucket to local disk every 5 minutes
*/5 * * * * gsutil -m rsync -r gs://boot-assets /var/lib/boot-assets
| Aspect | gcsfuse | rsync/gsutil sync |
|---|---|---|
| Consistency | Eventual (with caching) | Strong (within sync interval) |
| Disk Usage | Minimal (file cache optional) | Full copy of assets |
| Latency | GCS API per request | Local disk (fast) |
| Sync Lag | Real-time (no caching) or TTL | Sync interval (minutes) |
| Deployment | Requires FUSE | Simple cron job |
Recommendation: For read-heavy, infrequent-write workloads on Compute Engine, rsync/gsutil sync is simpler and faster than gcsfuse.
Conclusion
Cloud Storage FUSE (gcsfuse) provides a convenient filesystem abstraction over GCS buckets, but is not recommended for the network boot infrastructure due to:
- Cloud Run incompatibility (requires FUSE kernel module)
- Added latency (FUSE overhead + network round-trips)
- Poor performance for small files and concurrent access
- Caching trade-offs (consistency vs performance)
Recommended alternatives:
- Custom Boot Server: Direct Cloud Storage SDK integration (cloud.google.com/go/storage)
- Matchbox on Compute Engine: rsync/gsutil sync to local disk
- Cloud Run Deployment: Direct SDK (no gcsfuse possible)
gcsfuse may be useful for development/testing or Matchbox prototyping on Compute Engine, but production deployments should use direct SDK integration or periodic sync for optimal performance and Cloud Run compatibility.
2 - GCP Network Boot Protocol Support
Analysis of Google Cloud Platform’s support for TFTP, HTTP, and HTTPS routing for network boot infrastructure
This document analyzes GCP’s capabilities for hosting network boot infrastructure, specifically focusing on TFTP, HTTP, and HTTPS protocol support.
TFTP (Trivial File Transfer Protocol) Support
Native Support
Status: ❌ Not natively supported by Cloud Load Balancing
GCP’s Cloud Load Balancing services (Application Load Balancer, Network Load Balancer) do not support TFTP protocol natively. TFTP operates on UDP port 69 and has unique protocol requirements that are not compatible with GCP’s load balancing services.
Implementation Options
Option 1: Direct VM Access (Recommended for VPN Scenario)
Since ADR-0002 specifies a VPN-based architecture, TFTP can be served directly from a Compute Engine VM without load balancing:
- Approach: Run a TFTP server (e.g., tftpd-hpa or dnsmasq) on a Compute Engine VM (see the dnsmasq sketch below)
- Access: Home lab connects via VPN tunnel to the VM's private IP
- Routing: VPC firewall rules allow UDP/69 from VPN subnet
- Pros:
- Simple implementation
- No need for load balancing (single boot server sufficient)
- TFTP traffic encrypted through VPN tunnel
- Direct VM-to-client communication
- Cons:
- Single point of failure (no load balancing/HA)
- Manual failover required if VM fails
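A minimal configuration sketch for the dnsmasq approach referenced above (paths are illustrative; DNS and DHCP are disabled so dnsmasq acts purely as a read-only TFTP server):
# Minimal sketch: dnsmasq as a TFTP-only server on the boot VM
sudo apt-get update && sudo apt-get install -y dnsmasq
cat <<'EOF' | sudo tee /etc/dnsmasq.d/tftp.conf
# TFTP only: disable DNS (port=0); no DHCP directives configured
port=0
enable-tftp
tftp-root=/var/lib/tftpboot
# Only serve files owned by the dnsmasq user
tftp-secure
EOF
sudo mkdir -p /var/lib/tftpboot
sudo systemctl restart dnsmasq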
Option 2: Network Load Balancer (NLB) Passthrough
While NLB doesn’t parse TFTP protocol, it can forward UDP traffic:
- Approach: Configure Network Load Balancer for UDP/69 passthrough
- Limitations:
- No protocol-aware health checks for TFTP
- Health checks would use TCP or HTTP on alternate port
- Adds complexity without significant benefit for single boot server
- Use Case: Only relevant for multi-region HA deployment (overkill for home lab)
TFTP Security Considerations
- Encryption: TFTP protocol itself is unencrypted, but VPN tunnel provides encryption
- Firewall Rules: Restrict UDP/69 to VPN subnet only (no public access)
- File Access Control: Configure TFTP server with restricted file access
- Read-Only Mode: Deploy TFTP server in read-only mode to prevent uploads
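A hedged example of such a firewall rule, assuming the default VPC network, a boot-server target tag, and the 10.200.0.0/24 WireGuard subnet used later in this analysis:
gcloud compute firewall-rules create allow-tftp-from-vpn \
  --direction=INGRESS \
  --priority=1000 \
  --network=default \
  --action=ALLOW \
  --rules=udp:69 \
  --source-ranges=10.200.0.0/24 \
  --target-tags=boot-server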
HTTP Support
Native Support
Status: ✅ Fully supported
GCP provides comprehensive HTTP support through multiple services:
Cloud Load Balancing - Application Load Balancer
- Protocol Support: HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
- Port: Any port (typically 80 for HTTP)
- Routing: URL-based routing, host-based routing, path-based routing
- Health Checks: HTTP health checks with configurable paths
- SSL Offloading: Can terminate SSL at load balancer and use HTTP backend
- Backend: Compute Engine VMs, instance groups, Cloud Run, GKE
Compute Engine Direct Access
For VPN scenario, HTTP can be served directly from VM:
- Approach: Run HTTP server (nginx, Apache, custom service) on Compute Engine VM
- Access: Home lab accesses via VPN tunnel to private IP
- Firewall: VPC firewall rules allow TCP/80 from VPN subnet
- Pros: Simpler than load balancer for single boot server
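A minimal nginx sketch for this direct-VM option (assumes assets are staged under /var/lib/boot-assets; the distro's default site may need to be removed or adjusted):
# Minimal sketch: nginx serving boot assets over HTTP from the VM
sudo apt-get install -y nginx
cat <<'EOF' | sudo tee /etc/nginx/conf.d/boot.conf
server {
    listen 80;
    root /var/lib/boot-assets;
    autoindex off;
    # Large kernel/initrd files served straight from disk
    sendfile on;
}
EOF
sudo nginx -t && sudo systemctl reload nginx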
HTTP Boot Flow for Network Boot
- PXE → TFTP: Initial bootloader (iPXE) loaded via TFTP
- iPXE → HTTP: iPXE chainloads boot files via HTTP from the same server (see the iPXE script sketch below)
- Kernel/Initrd: Large boot files served efficiently over HTTP
- Connection Pooling: HTTP/1.1 keep-alive reduces connection overhead
- Compression: gzip compression for text-based boot configs
- Caching: Cloud CDN can cache boot files for faster delivery
- TCP Optimization: GCP’s network optimized for low-latency TCP
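To make the chainload step concrete, the HTTP server could publish a small iPXE script alongside the boot assets; a hedged sketch in which the boot.internal hostname and the initramfs filename are placeholders:
# Minimal sketch: publish an iPXE script next to the boot assets
cat <<'EOF' | sudo tee /var/lib/boot-assets/boot.ipxe
#!ipxe
kernel http://boot.internal/kernels/talos-v1.6.0.img
initrd http://boot.internal/initrd/initramfs-amd64.xz
boot
EOF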
HTTPS Support
Native Support
Status: ✅ Fully supported with advanced features
GCP provides enterprise-grade HTTPS support:
Cloud Load Balancing - Application Load Balancer
- Protocol Support: HTTPS/1.1, HTTP/2 over TLS, HTTP/3 with QUIC
- SSL/TLS Termination: Terminate SSL at load balancer
- Certificate Management:
- Google-managed SSL certificates (automatic renewal)
- Self-managed certificates (bring your own)
- Certificate Map for multiple domains
- TLS Versions: TLS 1.0, 1.1, 1.2, 1.3 (configurable minimum version)
- Cipher Suites: Modern, compatible, or custom cipher suites
- mTLS Support: Mutual TLS authentication (client certificates)
Certificate Manager
- Managed Certificates: Automatic provisioning and renewal via Let’s Encrypt integration
- Private CA: Integration with Google Cloud Certificate Authority Service
- Certificate Maps: Route different domains to different backends based on SNI
- Certificate Monitoring: Automatic alerts before expiration
HTTPS for Network Boot
Use Case
Modern UEFI firmware and iPXE support HTTPS boot:
- iPXE HTTPS: iPXE compiled with DOWNLOAD_PROTO_HTTPS can fetch over HTTPS
- UEFI HTTP Boot: UEFI firmware natively supports HTTP/HTTPS boot (UEFI 2.5+ HTTP Boot)
- Security: Boot file integrity verified via HTTPS chain of trust
Implementation on GCP
Certificate Provisioning:
- Use Google-managed certificate for public domain (if boot server has public DNS)
- Use self-signed certificate for VPN-only access (add to iPXE trust store)
- Use private CA for internal PKI
Load Balancer Configuration:
- HTTPS frontend (port 443)
- Backend service to Compute Engine VM running boot server
- SSL policy with TLS 1.2+ minimum
Alternative: Direct VM HTTPS:
- Run nginx/Apache with TLS on Compute Engine VM
- Access via VPN tunnel to private IP with HTTPS
- Simpler setup for VPN-only scenario
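For the self-signed, VPN-only option, a certificate could be generated on the VM roughly as follows (hostname and tunnel IP are placeholders; as noted above, the certificate must also be added to the iPXE/UEFI trust store):
# Minimal sketch: self-signed certificate for VPN-only HTTPS (names/IPs are placeholders)
sudo openssl req -x509 -newkey rsa:4096 -sha256 -days 825 -nodes \
  -keyout /etc/ssl/private/boot-server.key \
  -out /etc/ssl/certs/boot-server.crt \
  -subj "/CN=boot.internal" \
  -addext "subjectAltName=DNS:boot.internal,IP:10.200.0.1"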
mTLS Support for Enhanced Security
GCP’s Application Load Balancer supports mutual TLS authentication:
- Client Certificates: Require client certificates for additional authentication
- Certificate Validation: Validate client certificates against trusted CA
- Use Case: Ensure only authorized home lab servers can access boot files
- Integration: Combine with VPN for defense-in-depth
Routing and Load Balancing Capabilities
VPC Routing
- Custom Routes: Define routes to direct traffic through VPN gateway
- Route Priority: Configure route priorities for failover scenarios
- BGP Support: Dynamic routing with Cloud Router (for advanced VPN setups)
Firewall Rules
- Ingress/Egress Rules: Fine-grained control over traffic
- Source/Destination Filters: IP ranges, tags, service accounts
- Protocol Filtering: Allow specific protocols (UDP/69, TCP/80, TCP/443)
- VPN Subnet Restriction: Limit access to VPN-connected home lab subnet
Cloud Armor (Optional)
For additional security if boot server has public access:
- DDoS Protection: Layer 3/4 DDoS mitigation
- WAF Rules: Application-level filtering
- IP Allowlisting: Restrict to known public IPs
- Rate Limiting: Prevent abuse
Cost Implications
Network Egress Costs
- VPN Traffic: Egress to VPN endpoint charged at standard internet egress rates
- Intra-Region: Free for traffic within same region
- Boot File Sizes: Typical kernel + initrd = 50-200MB per boot
- Monthly Estimate: 10 boots/month × 150MB = 1.5GB ≈ $0.18/month (US egress)
Load Balancing Costs
- Application Load Balancer: ~$0.025/hour + $0.008 per LCU-hour
- Network Load Balancer: ~$0.025/hour + data processing charges
- For VPN Scenario: Load balancer likely unnecessary (single VM sufficient)
Compute Costs
- e2-micro Instance: ~$6-7/month (suitable for boot server)
- f1-micro Instance: ~$4-5/month (even smaller, might suffice)
- Reserved/Committed Use: Discounts for long-term commitment
Comparison with Requirements
| Requirement | GCP Support | Implementation |
|---|---|---|
| TFTP | ⚠️ Via VM, not LB | Direct VM access via VPN |
| HTTP | ✅ Full support | VM or ALB |
| HTTPS | ✅ Full support | VM or ALB with Certificate Manager |
| VPN Integration | ✅ Native VPN | Cloud VPN or self-managed WireGuard |
| Load Balancing | ✅ ALB, NLB | Optional for HA |
| Certificate Mgmt | ✅ Managed certs | Certificate Manager |
| Cost Efficiency | ✅ Low-cost VMs | e2-micro sufficient |
Recommendations
For VPN-Based Architecture (per ADR-0002)
Compute Engine VM: Deploy single e2-micro VM with:
- TFTP server (tftpd-hpa or dnsmasq)
- HTTP server (nginx or simple Python HTTP server)
- Optional HTTPS with self-signed certificate
VPN Tunnel: Connect home lab to GCP via:
- Cloud VPN (IPsec) - easier setup, higher cost
- Self-managed WireGuard on Compute Engine - lower cost, more control
VPC Firewall: Restrict access to:
- UDP/69 (TFTP) from VPN subnet only
- TCP/80 (HTTP) from VPN subnet only
- TCP/443 (HTTPS) from VPN subnet only
No Load Balancer: For home lab scale, direct VM access is sufficient
Health Monitoring: Use Cloud Monitoring for VM and service health
If HA Required (Future Enhancement)
- Deploy multi-zone VMs with Network Load Balancer
- Use Cloud Storage as backend for boot files with VM serving as cache
- Implement failover automation with Cloud Functions
3 - GCP WireGuard VPN Support
Analysis of WireGuard VPN deployment options on Google Cloud Platform for secure site-to-site connectivity
This document analyzes options for deploying WireGuard VPN on GCP to establish secure site-to-site connectivity between the home lab and cloud-hosted network boot infrastructure.
WireGuard Overview
WireGuard is a modern VPN protocol that provides:
- Simplicity: Minimal codebase (~4,000 lines vs 100,000+ for IPsec)
- Performance: High throughput with low overhead
- Security: Modern cryptography (Curve25519, ChaCha20, Poly1305, BLAKE2s)
- Configuration: Simple key-based configuration
- Kernel Integration: Mainline Linux kernel support since 5.6
GCP Native VPN Support
Cloud VPN (IPsec)
Status: ❌ WireGuard not natively supported
GCP’s managed Cloud VPN service supports:
- IPsec VPN: IKEv1, IKEv2 with PSK or certificate authentication
- HA VPN: Highly available VPN with 99.99% SLA
- Classic VPN: Single-tunnel VPN (deprecated)
Limitation: Cloud VPN does not support WireGuard protocol natively.
Cost: Cloud VPN
- HA VPN: ~$0.05/hour per tunnel × 2 tunnels = ~$73/month
- Egress: Standard internet egress rates (~$0.12/GB for first 1TB)
- Total Estimate: ~$75-100/month for managed VPN
Self-Managed WireGuard on Compute Engine
Implementation Approach
Since GCP doesn’t offer managed WireGuard, deploy WireGuard on a Compute Engine VM:
Status: ✅ Fully supported via Compute Engine
Architecture
graph LR
A[Home Lab] -->|WireGuard Tunnel| B[GCP Compute Engine VM]
B -->|Private VPC Network| C[Boot Server VM]
B -->|IP Forwarding| C
subgraph "Home Network"
A
D[UDM Pro]
D -.WireGuard Client.- A
end
subgraph "GCP VPC"
B[WireGuard Gateway VM]
C[Boot Server VM]
end
VM Configuration
WireGuard Gateway VM:
- Instance Type: e2-micro or f1-micro ($4-7/month)
- OS: Ubuntu 22.04 LTS or Debian 12 (native WireGuard kernel support)
- IP Forwarding: Enable IP forwarding to route traffic to other VMs
- External IP: Static external IP for stable WireGuard endpoint
- Firewall: Allow UDP port 51820 (WireGuard) from home lab public IP
Boot Server VM:
- Network: Same VPC as WireGuard gateway
- Private IP Only: No external IP (accessed via VPN)
- Route Traffic: Through WireGuard gateway VM
Installation Steps
# On GCP Compute Engine VM (Ubuntu 22.04+)
sudo apt update
sudo apt install wireguard wireguard-tools
# Generate server keys
wg genkey | sudo tee /etc/wireguard/server_private.key | wg pubkey | sudo tee /etc/wireguard/server_public.key
sudo chmod 600 /etc/wireguard/server_private.key
# Configure WireGuard interface
sudo nano /etc/wireguard/wg0.conf
Example /etc/wireguard/wg0.conf on GCP VM:
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = <SERVER_PRIVATE_KEY>
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE
[Peer]
# Home Lab (UDM Pro)
PublicKey = <CLIENT_PUBLIC_KEY>
AllowedIPs = 10.200.0.2/32, 192.168.1.0/24
Corresponding config on UDM Pro:
[Interface]
Address = 10.200.0.2/24
PrivateKey = <CLIENT_PRIVATE_KEY>
[Peer]
PublicKey = <SERVER_PUBLIC_KEY>
Endpoint = <GCP_VM_EXTERNAL_IP>:51820
AllowedIPs = 10.200.0.0/24, 10.128.0.0/20
PersistentKeepalive = 25
Enable and Start WireGuard
# Enable IP forwarding permanently
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Enable WireGuard interface
sudo systemctl enable wg-quick@wg0
sudo systemctl start wg-quick@wg0
# Verify status
sudo wg show
GCP VPC Configuration
Firewall Rules
Create VPC firewall rule to allow WireGuard:
gcloud compute firewall-rules create allow-wireguard \
--direction=INGRESS \
--priority=1000 \
--network=default \
--action=ALLOW \
--rules=udp:51820 \
--source-ranges=<HOME_LAB_PUBLIC_IP>/32 \
--target-tags=wireguard-gateway
Tag the WireGuard VM:
gcloud compute instances add-tags wireguard-gateway-vm \
--tags=wireguard-gateway \
--zone=us-central1-a
Static External IP
Reserve static IP for stable WireGuard endpoint:
gcloud compute addresses create wireguard-gateway-ip \
--region=us-central1
gcloud compute instances delete-access-config wireguard-gateway-vm \
--access-config-name="external-nat" \
--zone=us-central1-a
gcloud compute instances add-access-config wireguard-gateway-vm \
--access-config-name="external-nat" \
--address=wireguard-gateway-ip \
--zone=us-central1-a
Cost: Static external IP ~$3-4/month while attached to a running VM; a reserved address left unattached is billed at a higher hourly rate.
Route Configuration
For traffic from boot server to reach home lab via WireGuard VM:
gcloud compute routes create route-to-homelab \
--network=default \
--priority=100 \
--destination-range=192.168.1.0/24 \
--next-hop-instance=wireguard-gateway-vm \
--next-hop-instance-zone=us-central1-a
This routes home lab subnet (192.168.1.0/24) through the WireGuard gateway VM.
UDM Pro WireGuard Integration
Native Support
Status: ✅ WireGuard supported natively (UniFi OS 1.12.22+)
The UniFi Dream Machine Pro includes native WireGuard VPN support:
- GUI Configuration: Web UI for WireGuard VPN setup
- Site-to-Site: Support for site-to-site VPN tunnels
- Performance: Hardware acceleration for encryption (if available)
- Routing: Automatic route injection for remote subnets
Configuration Steps on UDM Pro
Network Settings → VPN:
- Create new VPN connection
- Select “WireGuard”
- Generate key pair or import existing
Peer Configuration:
- Peer Public Key: GCP WireGuard VM’s public key
- Endpoint: GCP VM’s static external IP
- Port: 51820
- Allowed IPs: GCP VPC subnet (e.g., 10.128.0.0/20)
- Persistent Keepalive: 25 seconds
Route Injection:
- UDM Pro automatically adds routes to GCP subnets
- Home lab servers can reach GCP boot server via VPN
Firewall Rules:
- Add firewall rule to allow boot traffic (TFTP, HTTP) from LAN to VPN
Alternative: Manual WireGuard on UDM Pro
If native support is insufficient, use wireguard-go via udm-utilities:
- Repository: boostchicken/udm-utilities
- Script: on_boot.d script to start WireGuard
- Persistence: Survives firmware updates with on-boot script
Throughput
WireGuard on Compute Engine performance:
- e2-micro (2 vCPU, shared core): ~100-300 Mbps
- e2-small (2 vCPU): ~500-800 Mbps
- e2-medium (2 vCPU): ~1+ Gbps
For network boot (typical boot = 50-200MB), even e2-micro is sufficient:
- Boot Time: 150MB at 100 Mbps = ~12 seconds transfer time
- Recommendation: e2-micro adequate for home lab scale
Latency
- VPN Overhead: WireGuard adds minimal latency (~1-5ms overhead)
- GCP Network: Low-latency network to most regions
- Total Latency: Primarily dependent on home ISP and GCP region proximity
CPU Usage
- Encryption: ChaCha20 is CPU-efficient
- Kernel Module: Minimal CPU overhead in kernel space
- e2-micro: Sufficient CPU for home lab VPN throughput
Security Considerations
Key Management
- Private Keys: Store securely, never commit to version control
- Key Rotation: Rotate keys periodically (e.g., annually)
- Secret Manager: Store WireGuard private keys in GCP Secret Manager
- Retrieve at VM startup via startup script
- Avoid storing in VM metadata or disk images
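A hedged example of seeding that secret from the key generated earlier (the secret name matches the startup script shown below; creating it requires Secret Manager admin permissions):
# Create the secret once from the generated private key
sudo cat /etc/wireguard/server_private.key | \
  gcloud secrets create wireguard-server-key --data-file=-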
Firewall Hardening
- Source IP Restriction: Limit WireGuard port to home lab public IP only
- Least Privilege: Boot server firewall allows only VPN subnet
- No Public Access: Boot server has no external IP
Monitoring and Alerts
- Cloud Logging: Log WireGuard connection events
- Cloud Monitoring: Alert on VPN tunnel down
- Metrics: Monitor handshake failures, data transfer
DDoS Protection
- UDP Amplification: WireGuard resistant to DDoS amplification
- Cloud Armor: Optional layer for additional DDoS protection (overkill for VPN)
High Availability Options
Multi-Region Failover
Deploy WireGuard gateways in multiple regions:
- Primary: us-central1 WireGuard VM
- Secondary: us-east1 WireGuard VM
- Failover: UDM Pro switches endpoints if primary fails
- Cost: Doubles VM costs (~$8-14/month for 2 VMs)
Health Checks
Monitor WireGuard tunnel health:
# On UDM Pro (via SSH)
wg show wg0 latest-handshakes
# If handshake timestamp old (>3 minutes), tunnel may be down
Automate failover with script on UDM Pro or external monitoring.
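A minimal sketch of such a check (the peer public key is a placeholder and the failover/alert action is left as a stub):
#!/bin/sh
# Minimal sketch: flag a stale WireGuard handshake (>3 minutes old)
PEER_KEY="<SERVER_PUBLIC_KEY>"
LAST=$(wg show wg0 latest-handshakes | awk -v k="$PEER_KEY" '$1 == k {print $2}')
NOW=$(date +%s)
if [ -z "$LAST" ] || [ $((NOW - LAST)) -gt 180 ]; then
  logger -t wg-health "WireGuard tunnel to GCP appears down (no recent handshake)"
  # e.g. switch to a secondary endpoint or send a notification here
fi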
Startup Scripts for Auto-Healing
GCP VM startup script to ensure WireGuard starts on boot:
#!/bin/bash
# /etc/startup-script.sh
# Retrieve WireGuard private key from Secret Manager
gcloud secrets versions access latest --secret="wireguard-server-key" > /etc/wireguard/server_private.key
chmod 600 /etc/wireguard/server_private.key
# Start WireGuard
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
Attach as metadata:
gcloud compute instances add-metadata wireguard-gateway-vm \
--metadata-from-file startup-script=/path/to/startup-script.sh \
--zone=us-central1-a
Cost Analysis
Self-Managed WireGuard on Compute Engine
| Component | Cost |
|---|---|
| e2-micro VM (730 hrs/month) | ~$6.50 |
| Static External IP | ~$3.50 |
| Egress (1GB/month boot traffic) | ~$0.12 |
| Monthly Total | ~$10.12 |
| Annual Total | ~$121 |
Cloud VPN (IPsec - if WireGuard not used)
| Component | Cost |
|---|---|
| HA VPN Gateway (2 tunnels) | ~$73 |
| Egress (1GB/month) | ~$0.12 |
| Monthly Total | ~$73 |
| Annual Total | ~$876 |
Cost Savings: Self-managed WireGuard saves ~$755/year vs Cloud VPN.
Comparison with Requirements
| Requirement | GCP Support | Implementation |
|---|---|---|
| WireGuard Protocol | ✅ Via Compute Engine | Self-managed on VM |
| Site-to-Site VPN | ✅ Yes | WireGuard tunnel |
| UDM Pro Integration | ✅ Native support | WireGuard peer config |
| Cost Efficiency | ✅ Low cost | e2-micro ~$10/month |
| Performance | ✅ Sufficient | 100+ Mbps on e2-micro |
| Security | ✅ Modern crypto | ChaCha20, Curve25519 |
| HA (optional) | ⚠️ Manual setup | Multi-region VMs |
Recommendations
For Home Lab VPN (per ADR-0002)
Self-Managed WireGuard: Deploy on Compute Engine e2-micro VM
- Cost: ~$10/month (vs ~$73/month for Cloud VPN)
- Performance: Sufficient for network boot traffic
- Simplicity: Easy to configure and maintain
Single Region Deployment: Unless HA required, single VM adequate
- Region Selection: Choose region closest to home lab for lowest latency
- Zone: Single zone sufficient (boot server not mission-critical)
UDM Pro Native WireGuard: Use built-in WireGuard client
- Configuration: Add GCP VM as WireGuard peer in UDM Pro UI
- Route Injection: UDM Pro automatically routes GCP subnets
Security Best Practices:
- Store WireGuard private key in Secret Manager
- Restrict WireGuard port to home public IP only
- Use startup script to configure VM on boot
- Enable Cloud Logging for VPN events
Monitoring: Set up Cloud Monitoring alerts for:
- VM down
- High CPU usage (indicates traffic spike or issue)
- Firewall rule blocks (indicates misconfiguration)
Future Enhancements
- HA Setup: Deploy secondary WireGuard VM in different region
- Automated Failover: Script on UDM Pro to switch endpoints
- IPv6 Support: Enable WireGuard over IPv6 if home ISP supports
- Mesh VPN: Expand to mesh topology if multiple sites added