Services

Documentation for GCP Cloud Run microservices

1: Boot Service

1.1: GET /boot.ipxe
1.2: GET /asset/{boot_profile_id}/kernel
1.3: GET /asset/{boot_profile_id}/initrd
1.4: POST /api/v1/profiles
1.5: GET /api/v1/boot/{machine_id}/profile
1.6: PUT /api/v1/boot/{machine_id}/profile
1.7: DELETE /api/v1/boot/{machine_id}/profile
1.8: GET /health/startup
1.9: GET /health/liveness

2: Machine Service

2.1: POST /api/v1/machines
2.2: GET /api/v1/machines
2.3: GET /api/v1/machines/{id}
2.4: PUT /api/v1/machines/{id}
2.5: DELETE /api/v1/machines/{id}
2.6: GET /health/startup
2.7: GET /health/liveness

This section contains documentation for the Go microservices deployed to GCP Cloud Run as part of the home lab infrastructure.

Service Architecture

All services follow a consistent architecture pattern:

Framework: Built with the z5labs/humus framework, following an OpenAPI-first design
Runtime: Go 1.24+ deployed to GCP Cloud Run
Observability: OpenTelemetry metrics, traces, and logs
Health Checks: Standard /health/startup and /health/liveness endpoints
Configuration: Embedded config.yaml with OpenAPI specifications

1 - Boot Service

UEFI HTTP boot endpoints and boot profile management

The Boot Service is a custom Go microservice that provides UEFI HTTP boot endpoints for bare metal servers and manages boot profiles. It serves boot scripts, streams kernel/initrd assets, and handles boot profile administration (kernel/initrd upload, storage, and lifecycle management).

Architecture Overview

The Boot Service is deployed on GCP Cloud Run and accessed through a WireGuard VPN tunnel from bare metal servers. It integrates with:

Machine Service: Retrieves machine hardware profiles by MAC address
Cloud Storage: Stores and retrieves kernel/initrd blobs
Firestore: Stores boot profile metadata
Cloud Monitoring: OpenTelemetry observability with distributed tracing

Related documentation:

Machine Service - Machine hardware profile management
ADR-0005: Network Boot Infrastructure Implementation on Google Cloud - Architecture decision and design rationale
ADR-0002: Network Boot Architecture - Overall network boot strategy

API Endpoints

UEFI HTTP Boot Endpoints

Accessed by bare metal servers during the boot process (through the WireGuard VPN):

GET /boot.ipxe - Serves iPXE boot scripts customized for the requesting machine
GET /asset/{boot_profile_id}/kernel - Streams kernel images from Cloud Storage
GET /asset/{boot_profile_id}/initrd - Streams initrd images from Cloud Storage

Admin API

Boot profile management endpoints for administrators:

POST /api/v1/profiles - Create a new boot profile for a machine
GET /api/v1/boot/{machine_id}/profile - Retrieve the active boot profile for a machine
PUT /api/v1/boot/{machine_id}/profile - Update the boot profile for a machine
DELETE /api/v1/boot/{machine_id}/profile - Delete a machine’s boot profile

Health Check Endpoints

Standard Cloud Run health endpoints:

GET /health/startup - Startup probe endpoint
GET /health/liveness - Liveness probe endpoint

Security Model

VPN-Based Access Control

Since HP DL360 Gen 9 servers do not support client-side TLS certificates for UEFI HTTP boot, all boot traffic is secured via WireGuard VPN:

Boot Endpoints: Only accessible through WireGuard tunnel (source IP validation)
Transport Security: WireGuard provides mutual authentication and encryption

Authentication Methods

UEFI Boot Endpoints: VPN source IP validation (bare metal servers)
Health Checks: Unauthenticated (used by Cloud Run for liveness/startup probes)

Common Patterns

Error Responses

All API endpoints follow the RFC 7807 Problem Details standard (see ADR-0007):

{
  "type": "https://api.example.com/errors/resource-not-found",
  "title": "Resource Not Found",
  "status": 404,
  "detail": "Machine with MAC address aa:bb:cc:dd:ee:ff not found",
  "instance": "/api/v1/boot/aa:bb:cc:dd:ee:ff/profile",
  "mac_address": "aa:bb:cc:dd:ee:ff"
}

Error responses use Content-Type: application/problem+json.

Standard HTTP Status Codes

200 OK - Successful request
201 Created - Resource created successfully
204 No Content - Successful deletion
400 Bad Request - Invalid request parameters
401 Unauthorized - Missing or invalid authentication
403 Forbidden - Insufficient permissions
404 Not Found - Resource not found
409 Conflict - Resource already exists
422 Unprocessable Entity - Validation error
500 Internal Server Error - Server error

Content Types

application/json - JSON responses (admin API)
application/problem+json - RFC 7807 error responses
text/plain - iPXE boot scripts
application/octet-stream - Binary boot assets (kernel, initrd)
text/cloud-config - Cloud-init configuration files

1.1 - GET /boot.ipxe

Serves iPXE boot scripts customized for the requesting machine

Serves iPXE boot scripts customized for the requesting machine based on its MAC address. This endpoint is accessed by bare metal servers (HP DL360 Gen 9) during the UEFI HTTP boot process through the WireGuard VPN tunnel.

Sequence Diagram

sequenceDiagram
    participant Client as Bare Metal Server
    participant Boot as Boot Service
    participant MachineAPI as Machine Service
    participant DB as Firestore

    Client->>Boot: GET /boot.ipxe?mac=52:54:00:12:34:56
    Boot->>Boot: Validate MAC address format
    Boot->>MachineAPI: GET /api/v1/machines?mac=52:54:00:12:34:56
    MachineAPI->>DB: Query machine by NIC MAC
    DB-->>MachineAPI: Machine profile (machine_id)
    MachineAPI-->>Boot: Machine profile
    Boot->>DB: Query boot profile by machine_id
    DB-->>Boot: Boot profile (profile_id, kernel_id, initrd_id, kernel args)
    Boot->>Boot: Generate iPXE script with profile_id
    Boot-->>Client: 200 OK (iPXE script)

Request

Query Parameters:

Parameter	Type	Required	Description
`mac`	string	Yes	MAC address of the requesting machine (format: `aa:bb:cc:dd:ee:ff`)

Request Example:

GET /boot.ipxe?mac=52:54:00:12:34:56 HTTP/1.1
Host: boot.internal

Response

Response Example (200 OK):

#!ipxe

# Boot configuration for node-01 (52:54:00:12:34:56)
# Boot Profile ID: 018c7dbd-a1b2-7000-8000-987654321def
# Generated: 2025-11-19T06:00:00Z

kernel /asset/018c7dbd-a1b2-7000-8000-987654321def/kernel console=tty0 console=ttyS0 ip=dhcp
initrd /asset/018c7dbd-a1b2-7000-8000-987654321def/initrd
boot

Response Headers:

Content-Type: text/plain; charset=utf-8
Cache-Control: no-cache, no-store, must-revalidate

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

400 Bad Request - Missing or invalid MAC address:

{
  "type": "https://api.example.com/errors/invalid-mac-address",
  "title": "Invalid MAC Address",
  "status": 400,
  "detail": "MAC address must be in format aa:bb:cc:dd:ee:ff",
  "instance": "/boot.ipxe",
  "mac_address": "invalid-mac"
}

404 Not Found - No boot configuration found for MAC:

{
  "type": "https://api.example.com/errors/machine-not-configured",
  "title": "Machine Not Configured",
  "status": 404,
  "detail": "No boot configuration found for MAC address 52:54:00:12:34:56",
  "instance": "/boot.ipxe?mac=52:54:00:12:34:56",
  "mac_address": "52:54:00:12:34:56"
}

500 Internal Server Error - Database or template error:

{
  "type": "https://api.example.com/errors/internal-error",
  "title": "Internal Server Error",
  "status": 500,
  "detail": "Failed to generate boot script due to an internal error",
  "instance": "/boot.ipxe?mac=52:54:00:12:34:56"
}

Boot Script Variables

The iPXE script may include the following dynamic values:

Machine-specific kernel parameters
Asset download URLs (using boot profile ID format)
Network configuration parameters

Security Considerations

VPN Source IP Validation

All boot endpoints validate that requests originate from the WireGuard VPN subnet:

Allowed CIDR: 10.x.x.0/24 (WireGuard VPN network)
Validation: Performed at Cloud Run ingress or application layer
Rejection: Requests from outside VPN return 403 Forbidden

Rate Limiting

To prevent abuse, boot endpoints are rate-limited:

Boot Script: 10 requests/minute per MAC address

Observability

All boot endpoint requests are instrumented with OpenTelemetry following HTTP semantic conventions:

Metrics: OpenTelemetry HTTP server metrics (request count, duration, size)
- http.server.request.duration - Request duration histogram
- http.server.request.body.size - Request body size
- http.server.response.body.size - Response body size
Traces: End-to-end tracing from request to database retrieval
- HTTP server span captures request details (method, route, status code)
- Child spans for database queries and Machine Service API calls
Logs: Structured logs with MAC address, boot profile ID, response status

1.2 - GET /asset/{boot_profile_id}/kernel

Streams kernel images from Cloud Storage for the boot process

Streams kernel images from Cloud Storage for the boot process. This endpoint is accessed by bare metal servers during UEFI HTTP boot through the WireGuard VPN tunnel.

Sequence Diagram

sequenceDiagram
    participant Client as Bare Metal Server
    participant Boot as Boot Service
    participant Storage as Cloud Storage
    participant DB as Firestore

    Client->>Boot: GET /asset/018c7dbd-a1b2-7000-8000-987654321def/kernel
    Boot->>Boot: Validate UUIDv7 format
    Boot->>DB: Query boot profile by ID
    DB-->>Boot: Boot profile (kernel_id)
    Boot->>Storage: GET gs://bucket/blobs/{kernel_id}
    Storage-->>Boot: Kernel data stream
    Boot-->>Client: 200 OK (kernel stream)

Request

Path Parameters:

Parameter	Type	Required	Description
`boot_profile_id`	string (UUIDv7)	Yes	Boot profile identifier (UUIDv7 format: `018c7dbd-a1b2-7000-8000-987654321def`)

Request Example:

GET /asset/018c7dbd-a1b2-7000-8000-987654321def/kernel HTTP/1.1
Host: boot.internal
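
The UUIDv7 format check on the path parameter could be sketched as follows. The pattern checks the version nibble (7) and the RFC variant bits; `isUUIDv7` is a hypothetical helper, and accepting uppercase hex is an assumption of this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidV7Pattern requires version 7 in the third group and an RFC variant
// (8, 9, a, or b) as the first digit of the fourth group.
var uuidV7Pattern = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-7[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$`)

// isUUIDv7 reports whether s is a syntactically valid UUIDv7 string.
func isUUIDv7(s string) bool {
	return uuidV7Pattern.MatchString(s)
}

func main() {
	fmt.Println(isUUIDv7("018c7dbd-a1b2-7000-8000-987654321def"))
	fmt.Println(isUUIDv7("550e8400-e29b-41d4-a716-446655440000")) // version 4
}
```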

Response

Response Example (200 OK):

Binary kernel image streamed from Cloud Storage.

Response Headers:

Content-Type: application/octet-stream
Content-Length: 8388608 (actual kernel size in bytes)
Cache-Control: public, max-age=3600
ETag: "abc123..."

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Kernel image not found:

{
  "type": "https://api.example.com/errors/kernel-not-found",
  "title": "Kernel Not Found",
  "status": 404,
  "detail": "Kernel image not found for boot profile 018c7dbd-a1b2-7000-8000-987654321def",
  "instance": "/asset/018c7dbd-a1b2-7000-8000-987654321def/kernel",
  "boot_profile_id": "018c7dbd-a1b2-7000-8000-987654321def"
}

500 Internal Server Error - Cloud Storage error:

{
  "type": "https://api.example.com/errors/storage-error",
  "title": "Storage Error",
  "status": 500,
  "detail": "Failed to retrieve kernel from storage due to an internal error",
  "instance": "/asset/018c7dbd-a1b2-7000-8000-987654321def/kernel"
}

Performance Characteristics

Streaming: File is streamed directly from Cloud Storage (no buffering in memory)
Target Latency: < 100ms to first byte
Typical Size: 8-15 MB for Linux kernels

Security Considerations

VPN Source IP Validation

All boot endpoints validate that requests originate from the WireGuard VPN subnet:

Allowed CIDR: 10.x.x.0/24 (WireGuard VPN network)
Validation: Performed at Cloud Run ingress or application layer
Rejection: Requests from outside VPN return 403 Forbidden

Rate Limiting

To prevent abuse, asset download endpoints are rate-limited:

Asset Downloads: 5 concurrent downloads per MAC address

Asset Integrity

Boot assets are validated for integrity:

Checksums: SHA-256 checksums stored in Firestore
Verification: Computed on upload, verified on download (optional)
ETag Headers: Enable client-side caching and integrity checks

Observability

All boot endpoint requests are instrumented with OpenTelemetry following HTTP semantic conventions:

Metrics: OpenTelemetry HTTP server metrics
- http.server.request.duration - Request duration histogram
- http.server.response.body.size - Response body size (tracks bytes transferred)
Traces: End-to-end tracing from request to Cloud Storage retrieval
- HTTP server span captures request details (method, route, status code)
- Child spans for database queries and Cloud Storage operations
Logs: Structured logs with boot profile ID, kernel ID, response status

1.3 - GET /asset/{boot_profile_id}/initrd

Streams initial ramdisk images from Cloud Storage for the boot process

Streams initial ramdisk (initrd) images from Cloud Storage for the boot process. This endpoint is accessed by bare metal servers during UEFI HTTP boot through the WireGuard VPN tunnel.

Sequence Diagram

sequenceDiagram
    participant Client as Bare Metal Server
    participant Boot as Boot Service
    participant Storage as Cloud Storage
    participant DB as Firestore

    Client->>Boot: GET /asset/018c7dbd-a1b2-7000-8000-987654321def/initrd
    Boot->>Boot: Validate UUIDv7 format
    Boot->>DB: Query boot profile by ID
    DB-->>Boot: Boot profile (initrd_id)
    Boot->>Storage: GET gs://bucket/blobs/{initrd_id}
    Storage-->>Boot: Initrd data stream
    Boot-->>Client: 200 OK (initrd stream)

Request

Path Parameters:

Parameter	Type	Required	Description
`boot_profile_id`	string (UUIDv7)	Yes	Boot profile identifier (UUIDv7 format: `018c7dbd-a1b2-7000-8000-987654321def`)

Request Example:

GET /asset/018c7dbd-a1b2-7000-8000-987654321def/initrd HTTP/1.1
Host: boot.internal

Response

Response Example (200 OK):

Binary initrd image streamed from Cloud Storage.

Response Headers:

Content-Type: application/octet-stream
Content-Length: 52428800 (actual initrd size in bytes)
Cache-Control: public, max-age=3600
ETag: "def456..."

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Initrd image not found:

{
  "type": "https://api.example.com/errors/initrd-not-found",
  "title": "Initrd Not Found",
  "status": 404,
  "detail": "Initrd image not found for boot profile 018c7dbd-a1b2-7000-8000-987654321def",
  "instance": "/asset/018c7dbd-a1b2-7000-8000-987654321def/initrd",
  "boot_profile_id": "018c7dbd-a1b2-7000-8000-987654321def"
}

500 Internal Server Error - Cloud Storage error:

{
  "type": "https://api.example.com/errors/storage-error",
  "title": "Storage Error",
  "status": 500,
  "detail": "Failed to retrieve initrd from storage due to an internal error",
  "instance": "/asset/018c7dbd-a1b2-7000-8000-987654321def/initrd"
}

Performance Characteristics

Streaming: File is streamed directly from Cloud Storage (no buffering in memory)
Target Latency: < 100ms to first byte
Typical Size: 50-150 MB for Linux initrd images

Security Considerations

VPN Source IP Validation

All boot endpoints validate that requests originate from the WireGuard VPN subnet:

Allowed CIDR: 10.x.x.0/24 (WireGuard VPN network)
Validation: Performed at Cloud Run ingress or application layer
Rejection: Requests from outside VPN return 403 Forbidden

Rate Limiting

To prevent abuse, asset download endpoints are rate-limited:

Asset Downloads: 5 concurrent downloads per MAC address

Asset Integrity

Boot assets are validated for integrity:

Checksums: SHA-256 checksums stored in Firestore
Verification: Computed on upload, verified on download (optional)
ETag Headers: Enable client-side caching and integrity checks

Observability

All boot endpoint requests are instrumented with OpenTelemetry following HTTP semantic conventions:

Metrics: OpenTelemetry HTTP server metrics
- http.server.request.duration - Request duration histogram
- http.server.response.body.size - Response body size (tracks bytes transferred)
Traces: End-to-end tracing from request to Cloud Storage retrieval
- HTTP server span captures request details (method, route, status code)
- Child spans for database queries and Cloud Storage operations
Logs: Structured logs with boot profile ID, initrd ID, response status

1.4 - POST /api/v1/profiles

Create a new boot profile for a machine

Create a new boot profile for a machine. If the machine already has a boot profile, this operation will fail - use PUT to update instead.

Cloud Storage Structure

Kernel and initrd binaries are stored in Google Cloud Storage using their UUIDv7 identifiers as object keys:

gs://{bucket}/blobs/{kernel_id}
gs://{bucket}/blobs/{initrd_id}

For example:

gs://boot-server-blobs/blobs/018c7dbd-b100-7000-8000-123456789abc
gs://boot-server-blobs/blobs/018c7dbd-b200-7000-8000-987654321fed

The UUIDv7 identifiers are generated server-side during upload, ensuring:

Globally unique object keys
Time-ordered storage (UUIDv7 timestamp prefix)
No namespace collisions between profiles

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant Boot as Boot Service
    participant Storage as Cloud Storage
    participant DB as Firestore

    Client->>Boot: POST /api/v1/profiles (multipart/form-data)
    Boot->>DB: Check if machine already has a boot profile
    DB-->>Boot: No existing profile
    Boot->>Boot: Generate UUIDv7 for profile
    Boot->>Boot: Generate UUIDv7 for kernel blob
    Boot->>Boot: Generate UUIDv7 for initrd blob
    Boot->>Storage: PUT gs://bucket/blobs/{kernel_id}
    Storage-->>Boot: Kernel stored
    Boot->>Storage: PUT gs://bucket/blobs/{initrd_id}
    Storage-->>Boot: Initrd stored
    Boot->>DB: Store profile metadata (profile_id, kernel_id, initrd_id, machine_id)
    DB-->>Boot: Profile created
    Boot-->>Client: 201 Created (profile metadata with IDs)

Request

Request Body (multipart/form-data):

Form fields:

machine_id (text): Machine identifier (UUIDv7)
kernel (file): Kernel image file
initrd (file): Initrd image file
kernel_args (JSON array): Kernel command-line arguments

Example Request:

POST /api/v1/profiles HTTP/1.1
Host: boot.example.com
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW

------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="machine_id"

018c7dbd-c000-7000-8000-fedcba987654
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="kernel"; filename="vmlinuz"
Content-Type: application/octet-stream

<kernel binary data>
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="initrd"; filename="initrd.img"
Content-Type: application/octet-stream

<initrd binary data>
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="kernel_args"
Content-Type: application/json

["console=tty0", "console=ttyS0", "ip=dhcp"]
------WebKitFormBoundary7MA4YWxkTrZu0gW--

Request Headers:

Content-Type: multipart/form-data

Response

Response (201 Created):

{
  "id": "018c7dbd-a000-7000-8000-abcdef123456",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654",
  "kernel": {
    "id": "018c7dbd-b100-7000-8000-123456789abc",
    "args": ["console=tty0", "console=ttyS0", "ip=dhcp"]
  },
  "initrd": {
    "id": "018c7dbd-b200-7000-8000-987654321fed"
  }
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

400 Bad Request - Invalid request body or missing required fields:

{
  "type": "https://api.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "The request body failed validation",
  "instance": "/api/v1/profiles",
  "invalid_fields": [
    {
      "field": "machine_id",
      "reason": "required field is missing"
    }
  ]
}

409 Conflict - Machine already has a boot profile:

{
  "type": "https://api.example.com/errors/boot-profile-exists",
  "title": "Boot Profile Already Exists",
  "status": 409,
  "detail": "Machine 018c7dbd-c000-7000-8000-fedcba987654 already has a boot profile",
  "instance": "/api/v1/profiles",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654",
  "existing_profile_id": "018c7dbd-a000-7000-8000-abcdef123456"
}

422 Unprocessable Entity - Validation error (file too large, invalid JSON, machine_id not found):

{
  "type": "https://api.example.com/errors/file-too-large",
  "title": "File Too Large",
  "status": 422,
  "detail": "Kernel file exceeds maximum allowed size of 100MB",
  "instance": "/api/v1/profiles",
  "field": "kernel",
  "file_size": 125829120,
  "max_size": 104857600
}

Data Models

All data models are defined as Protocol Buffer (protobuf) messages and stored in Firestore.

Boot Profile

syntax = "proto3";

message Kernel {
  string id = 1;              // UUIDv7 blob identifier
  repeated string args = 2;   // Kernel command-line arguments
}

message Initrd {
  string id = 1;              // UUIDv7 blob identifier
}

message BootProfile {
  string id = 1;              // UUIDv7 identifier
  string machine_id = 2;      // Reference to machine (UUIDv7) - unique constraint
  Kernel kernel = 3;          // Kernel configuration
  Initrd initrd = 4;          // Initrd configuration
}

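Note: The machine_id field is treated as unique, so each machine has at most one active boot profile. Firestore does not provide unique constraints on fields natively, so this has to be enforced by the service (for example, with a transactional check or by deriving the document ID from machine_id).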

1.5 - GET /api/v1/boot/{machine_id}/profile

Retrieve the active boot profile for a specific machine.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant Boot as Boot Service
    participant DB as Firestore

    Client->>Boot: GET /api/v1/boot/{machine_id}/profile
    Boot->>DB: Query active boot profile for machine
    DB-->>Boot: Boot profile
    Boot-->>Client: 200 OK (boot profile)

Request

Path Parameters:

Parameter	Type	Required	Description
`machine_id`	string	Yes	Machine identifier (UUIDv7 format)

Example Request:

GET /api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile HTTP/1.1
Host: boot.example.com

Response

Response (200 OK):

{
  "id": "018c7dbd-a000-7000-8000-abcdef123456",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654",
  "kernel": {
    "id": "018c7dbd-b100-7000-8000-123456789abc",
    "args": ["console=tty0", "console=ttyS0", "ip=dhcp"]
  },
  "initrd": {
    "id": "018c7dbd-b200-7000-8000-987654321fed"
  }
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine not found or has no boot profile:

{
  "type": "https://api.example.com/errors/boot-profile-not-found",
  "title": "Boot Profile Not Found",
  "status": 404,
  "detail": "No boot profile found for machine 018c7dbd-c000-7000-8000-fedcba987654",
  "instance": "/api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

1.6 - PUT /api/v1/boot/{machine_id}/profile

Update the boot profile for a machine

Update the boot profile for a machine (replaces the existing profile).

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant Boot as Boot Service
    participant Storage as Cloud Storage
    participant DB as Firestore

    Client->>Boot: PUT /api/v1/boot/{machine_id}/profile
    Boot->>DB: Get current active profile
    DB-->>Boot: Current profile (old kernel_id, old initrd_id)
    Boot->>Boot: Generate UUIDv7 identifiers for new kernel/initrd blobs
    Boot->>Storage: PUT new kernel/initrd blobs
    Storage-->>Boot: Blobs stored
    Boot->>DB: Update boot profile (replace kernel_id, initrd_id, args)
    DB-->>Boot: Profile updated
    Boot->>Storage: DELETE old kernel/initrd blobs
    Boot-->>Client: 200 OK (updated profile)

Request

Path Parameters:

Parameter	Type	Required	Description
`machine_id`	string	Yes	Machine identifier (UUIDv7 format)

Request Body (multipart/form-data):

Form fields:

kernel (file): Kernel image file
initrd (file): Initrd image file
kernel_args (JSON array): Kernel command-line arguments

Example Request:

PUT /api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile HTTP/1.1
Host: boot.example.com
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW

------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="kernel"; filename="vmlinuz"
Content-Type: application/octet-stream

<kernel binary data>
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="initrd"; filename="initrd.img"
Content-Type: application/octet-stream

<initrd binary data>
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="kernel_args"
Content-Type: application/json

["console=tty0", "console=ttyS0", "ip=dhcp"]
------WebKitFormBoundary7MA4YWxkTrZu0gW--

Response

Response (200 OK):

{
  "id": "018c7dbd-a000-7000-8000-abcdef123456",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654",
  "kernel": {
    "id": "018c7dbd-b100-7000-8000-123456789abc",
    "args": ["console=tty0", "console=ttyS0", "ip=dhcp"]
  },
  "initrd": {
    "id": "018c7dbd-b200-7000-8000-987654321fed"
  }
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine not found or has no boot profile:

{
  "type": "https://api.example.com/errors/boot-profile-not-found",
  "title": "Boot Profile Not Found",
  "status": 404,
  "detail": "No boot profile found for machine 018c7dbd-c000-7000-8000-fedcba987654",
  "instance": "/api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

422 Unprocessable Entity - Validation error:

{
  "type": "https://api.example.com/errors/file-too-large",
  "title": "File Too Large",
  "status": 422,
  "detail": "Kernel file exceeds maximum allowed size of 100MB",
  "instance": "/api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile",
  "field": "kernel",
  "file_size": 125829120,
  "max_size": 104857600
}

1.7 - DELETE /api/v1/boot/{machine_id}/profile

Delete a machine’s boot profile and its associated blobs.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant Boot as Boot Service
    participant Storage as Cloud Storage
    participant DB as Firestore

    Client->>Boot: DELETE /api/v1/boot/{machine_id}/profile
    Boot->>DB: Get kernel_id and initrd_id
    DB-->>Boot: Blob IDs
    Boot->>Storage: DELETE gs://bucket/blobs/{kernel_id}
    Boot->>Storage: DELETE gs://bucket/blobs/{initrd_id}
    Boot->>DB: Delete boot profile
    Boot-->>Client: 204 No Content

Request

Path Parameters:

Parameter	Type	Required	Description
`machine_id`	string	Yes	Machine identifier (UUIDv7 format)

Example Request:

DELETE /api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile HTTP/1.1
Host: boot.example.com

Response

Response (204 No Content):

Empty response body.

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine not found or has no boot profile:

{
  "type": "https://api.example.com/errors/boot-profile-not-found",
  "title": "Boot Profile Not Found",
  "status": 404,
  "detail": "No boot profile found for machine 018c7dbd-c000-7000-8000-fedcba987654",
  "instance": "/api/v1/boot/018c7dbd-c000-7000-8000-fedcba987654/profile",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

1.8 - GET /health/startup

Startup probe endpoint for Cloud Run

Indicates whether the application has completed initialization and is ready to receive traffic.

Request

Request Example:

GET /health/startup HTTP/1.1
Host: boot.example.com

Response

Response (200 OK):

Empty response body with HTTP 200 status code.

Response (503 Service Unavailable):

Empty response body with HTTP 503 status code.

Response Headers:

Cache-Control: no-cache, no-store, must-revalidate

Startup Check Components

Firestore Connection - Verifies database connectivity
Cloud Storage Access - Validates access to boot image buckets

Cloud Run Configuration

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 10  # Cloud Run requires timeoutSeconds <= periodSeconds
  periodSeconds: 10
  failureThreshold: 3

Behavior

Success (200): Application is fully initialized and ready to serve requests
Failure (503): Application is still starting up or encountered initialization errors
Timeout: A probe that gets no response within the configured timeout counts as a failure; after 3 consecutive failures, Cloud Run considers startup failed

Observability

Metrics:

health_check_total{probe="startup",status="ok"} - Successful startup checks
health_check_total{probe="startup",status="error"} - Failed startup checks
health_check_duration_ms{probe="startup"} - Startup check duration

Structured Logs:

{
  "severity": "INFO",
  "timestamp": "2025-11-19T06:00:00Z",
  "message": "Health check completed",
  "probe": "startup",
  "status": "ok",
  "duration_ms": 15
}

Alerts:

Startup Failure: Alert if startup check fails for > 1 minute

Testing

Manual Testing

curl -v http://localhost:8080/health/startup

Automated Testing

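import (
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    "net/http"
    "testing"
)

func TestHealthStartup(t *testing.T) {
    resp, err := http.Get("http://localhost:8080/health/startup")
    require.NoError(t, err)
    defer resp.Body.Close()

    assert.Equal(t, http.StatusOK, resp.StatusCode)
}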

Troubleshooting

Startup Check Never Succeeds

Symptoms:

Container restarts repeatedly
Cloud Run shows “unhealthy” status
Startup probe returns 503

Debugging:

# Check Cloud Run logs for startup errors
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=boot-server" \
  --limit 50 --format json | jq '.[] | select(.jsonPayload.probe=="startup")'

# Test locally with debug logging
DEBUG=true go run main.go

Common Causes:

Firestore credentials not configured
Cloud Storage bucket permissions missing
Network connectivity issues
Timeout too short for slow dependencies

1.9 - GET /health/liveness

Liveness probe endpoint for Cloud Run

Indicates whether the application is alive and healthy. Used by Cloud Run to detect and restart unhealthy instances.

Request

Request Example:

GET /health/liveness HTTP/1.1
Host: boot.example.com

Response

Response (200 OK):

Empty response body with HTTP 200 status code.

Response (503 Service Unavailable):

Empty response body with HTTP 503 status code.

Response Headers:

Cache-Control: no-cache, no-store, must-revalidate

Liveness Check Components

HTTP Server Health - Verifies the HTTP server is responsive
Basic health validation - Ensures the application can handle requests

Cloud Run Configuration

livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 10  # Cloud Run requires timeoutSeconds <= periodSeconds
  periodSeconds: 10
  failureThreshold: 3

Behavior

Success (200): Application is healthy and functioning normally
Failure (503): Application is unhealthy and should be restarted
Consecutive Failures: After 3 consecutive failures (30 seconds), Cloud Run restarts the instance

Graceful Degradation

The health check is designed with graceful degradation in mind:

Critical Failures: Return 503 and trigger restart (e.g., database connection lost)
Non-Critical Failures: Log warnings but return 200 (e.g., temporary Cloud Storage timeout)
Transient Errors: Retry internally before reporting failure

Observability

Metrics:

health_check_total{probe="liveness",status="ok"} - Successful liveness checks
health_check_total{probe="liveness",status="error"} - Failed liveness checks
health_check_duration_ms{probe="liveness"} - Liveness check duration

Structured Logs:

{
  "severity": "INFO",
  "timestamp": "2025-11-19T06:00:00Z",
  "message": "Health check completed",
  "probe": "liveness",
  "status": "ok",
  "duration_ms": 15
}

Alerts:

Liveness Failure: Alert if liveness check fails 3+ times consecutively
High Restart Rate: Alert if container restarts > 3 times in 5 minutes

Testing

Manual Testing

curl -v http://localhost:8080/health/liveness

Load Testing

Health check endpoints should handle high request rates without degrading application performance:

Target: 100 requests/second sustained
Latency: < 10 ms average response time
Resource Impact: < 1% CPU, < 10MB memory overhead

Troubleshooting

Liveness Check Intermittent Failures

Symptoms:

Occasional container restarts
Liveness probe returns 503 sporadically
High request latency

Debugging:

# Check error rate in last 5 minutes
gcloud monitoring time-series list \
  --filter='metric.type="custom.googleapis.com/health_check_total" AND metric.labels.status="error"' \
  --interval-start-time="5 minutes ago"

# Check for resource exhaustion (Cloud Run)
gcloud run services describe boot-server --region=<region> --format=json | jq '.status'

Common Causes:

Database connection pool exhausted
Memory pressure triggering GC pauses
High request volume overwhelming server
Dependency timeouts

Security Considerations

Unauthenticated Access

Health check endpoints are intentionally unauthenticated to allow Cloud Run infrastructure to probe without credentials. This is safe because:

Endpoints return only HTTP status codes (no response body)
No sensitive data is returned
Rate limiting prevents abuse
Endpoints are read-only

Information Disclosure

Health checks return only HTTP status codes with no response body, ensuring:

No internal IP addresses disclosed
No error messages or stack traces exposed
No database connection strings revealed
No API keys or secrets leaked

Detailed diagnostics are logged internally (not returned in response):

{
  "severity": "ERROR",
  "message": "Firestore connection failed",
  "error": "rpc error: code = PermissionDenied desc = Missing or insufficient permissions"
}

2 - Machine Service

Service for managing machine hardware profiles

The Machine Service is a REST API that manages machine hardware profiles for the network boot infrastructure. It stores machine specifications (CPUs, memory, NICs, drives, accelerators) in Firestore and is queried by the Boot Service during boot operations and by administrators for configuration management.

Architecture

The service is responsible for:

Machine Profile Management: Creating, listing, retrieving, updating, and deleting machine hardware profiles
Hardware Specification Storage: Storing detailed hardware specifications in Firestore
Machine Lookup: Providing machine profile queries by ID or NIC MAC address

Components

Firestore: Stores machine hardware profiles
REST API: HTTP endpoints for machine profile management

Clients

The service is consumed by:

Boot Service: Queries machine profiles by MAC address during boot operations
Admin Tools: CLI or web interfaces for managing machine inventory
Monitoring Systems: Hardware inventory and asset management tools

Deployment

Platform: GCP Cloud Run
Scaling: Automatic scaling based on request load
Availability: Min instances = 1 for low-latency responses
Region: Same region as Boot Service for minimal latency

API Endpoints

Machine Management

POST /api/v1/machines - Register a new machine with hardware specifications
GET /api/v1/machines - List all registered machines
GET /api/v1/machines/{id} - Retrieve a specific machine by ID
PUT /api/v1/machines/{id} - Update a machine’s hardware profile
DELETE /api/v1/machines/{id} - Delete a machine registration

Rate Limiting

Admin API endpoints are rate-limited to prevent abuse:

Per User/Service Account: 100 requests/minute
Per IP Address: 300 requests/minute
Global: 1000 requests/minute

Rate limit headers are included in responses:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1700000000

When the rate limit is exceeded, the API returns 429 Too Many Requests in RFC 7807 Problem Details format (see ADR-0007):

{
  "type": "https://api.example.com/errors/rate-limit-exceeded",
  "title": "Rate Limit Exceeded",
  "status": 429,
  "detail": "Rate limit exceeded. Try again in 30 seconds.",
  "instance": "/api/v1/machines",
  "retry_after": 30
}

All error responses use Content-Type: application/problem+json.
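
One way to produce these headers and the 429 response is a fixed-window counter wrapped in HTTP middleware. The sketch below is an assumption-laden illustration: the source does not say which limiting algorithm is used, and the `limiter` type, keying by `RemoteAddr`, and the `lastStatus` helper are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
	"time"
)

// window tracks requests for one caller in the current fixed window.
type window struct {
	count int
	reset time.Time
}

// limiter is a fixed-window counter keyed by caller identity.
type limiter struct {
	mu      sync.Mutex
	limit   int
	period  time.Duration
	now     func() time.Time
	windows map[string]*window
}

func newLimiter(limit int, period time.Duration) *limiter {
	return &limiter{limit: limit, period: period, now: time.Now, windows: map[string]*window{}}
}

// allow records one request and reports the remaining quota and reset time.
func (l *limiter) allow(key string) (ok bool, remaining int, reset time.Time) {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := l.now()
	w := l.windows[key]
	if w == nil || !now.Before(w.reset) {
		w = &window{reset: now.Add(l.period)} // start a new window
		l.windows[key] = w
	}
	if w.count >= l.limit {
		return false, 0, w.reset
	}
	w.count++
	return true, l.limit - w.count, w.reset
}

// middleware sets the X-RateLimit-* headers and answers 429 with a
// problem document once the caller's quota is used up.
func (l *limiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ok, remaining, reset := l.allow(r.RemoteAddr) // keying by RemoteAddr is an assumption
		h := w.Header()
		h.Set("X-RateLimit-Limit", strconv.Itoa(l.limit))
		h.Set("X-RateLimit-Remaining", strconv.Itoa(remaining))
		h.Set("X-RateLimit-Reset", strconv.FormatInt(reset.Unix(), 10))
		if ok {
			next.ServeHTTP(w, r)
			return
		}
		retry := int(math.Ceil(reset.Sub(l.now()).Seconds()))
		h.Set("Content-Type", "application/problem+json")
		w.WriteHeader(http.StatusTooManyRequests)
		_ = json.NewEncoder(w).Encode(map[string]any{
			"type":        "https://api.example.com/errors/rate-limit-exceeded",
			"title":       "Rate Limit Exceeded",
			"status":      http.StatusTooManyRequests,
			"detail":      fmt.Sprintf("Rate limit exceeded. Try again in %d seconds.", retry),
			"instance":    r.URL.Path,
			"retry_after": retry,
		})
	})
}

// lastStatus sends n requests through a fresh limiter and returns the
// status code of the last one.
func lastStatus(limit, n int) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	h := newLimiter(limit, time.Minute).middleware(ok)
	code := 0
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/v1/machines", nil))
		code = rec.Code
	}
	return code
}

func main() {
	fmt.Println(lastStatus(2, 2), lastStatus(2, 3))
}
```

A fixed window is the simplest scheme that yields a well-defined reset timestamp; a token bucket or sliding window would smooth bursts at the cost of a less obvious `X-RateLimit-Reset` value.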

Versioning

The Admin API uses URL versioning (/api/v1/):

Current Version: v1
Deprecation Policy: Minimum 6 months' notice before a version is deprecated
Version Header: X-API-Version: v1 included in all responses

2.1 - POST /api/v1/machines

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: POST /api/v1/machines
    API->>API: Generate machine id (UUIDv7)
    API->>API: Validate machine profile
    API->>DB: Insert machine profile
    DB-->>API: Machine created
    API-->>Client: 201 Created (machine id)

Request

Request Body:

{
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Response

Response (201 Created):

{
  "id": "018c7dbd-c000-7000-8000-fedcba987654"
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

400 Bad Request - Invalid request body or missing required fields:

{
  "type": "https://api.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "The request body failed validation",
  "instance": "/api/v1/machines",
  "invalid_fields": [
    {
      "field": "nics",
      "reason": "at least one NIC is required"
    }
  ]
}

409 Conflict - Machine with the same NIC MAC address already exists:

{
  "type": "https://api.example.com/errors/duplicate-mac-address",
  "title": "Duplicate MAC Address",
  "status": 409,
  "detail": "A machine with MAC address 52:54:00:12:34:56 already exists",
  "instance": "/api/v1/machines",
  "mac_address": "52:54:00:12:34:56",
  "existing_machine_id": "018c7dbd-a000-7000-8000-fedcba987650"
}

Notes

The machine ID is generated server-side (UUIDv7)
MAC addresses must be unique across all machines
All size/capacity values are in bytes
Clock frequency is in hertz
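
The NIC rules above can be checked before anything is written to Firestore. The sketch below covers only what can be verified within a single request (at least one NIC, well-formed MACs, no duplicates inside the request); uniqueness across all machines still needs a datastore lookup. The `NIC` type and `validateNICs` function are hypothetical, and restricting addresses to 48 bits is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// NIC mirrors the "nics" entries of the request body.
type NIC struct {
	MAC string
}

// validateNICs checks the NIC list of one request and returns the MAC
// addresses in normalized form (lower-case, colon-separated).
func validateNICs(nics []NIC) ([]string, error) {
	if len(nics) == 0 {
		return nil, errors.New("at least one NIC is required")
	}
	seen := make(map[string]bool, len(nics))
	macs := make([]string, 0, len(nics))
	for _, n := range nics {
		hw, err := net.ParseMAC(n.MAC)
		if err != nil {
			return nil, fmt.Errorf("invalid MAC address %q: %w", n.MAC, err)
		}
		if len(hw) != 6 {
			return nil, fmt.Errorf("MAC address %q is not a 48-bit address", n.MAC)
		}
		mac := hw.String()
		if seen[mac] {
			return nil, fmt.Errorf("duplicate MAC address %s in request", mac)
		}
		seen[mac] = true
		macs = append(macs, mac)
	}
	return macs, nil
}

func main() {
	macs, err := validateNICs([]NIC{{MAC: "52-54-00-12-34-56"}})
	fmt.Println(macs, err)

	_, err = validateNICs(nil)
	fmt.Println(err)
}
```

Normalizing before comparing means `52-54-00-12-34-56` and `52:54:00:12:34:56` are treated as the same address.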

Data Models

All data models are defined as Protocol Buffer (protobuf) messages and stored in Firestore.

Machine

syntax = "proto3";

message CPU {
  string manufacturer = 1;
  int64 clock_frequency = 2;  // measured in hertz
  int64 cores = 3;            // number of cores
}

message MemoryModule {
  int64 size = 1;             // measured in bytes
}

message Accelerator {
  string manufacturer = 1;
}

message NIC {
  string mac = 1;             // mac address
}

message Drive {
  int64 capacity = 1;         // capacity in bytes
}

message Machine {
  string id = 1;              // UUIDv7 machine identifier
  repeated CPU cpus = 2;
  repeated MemoryModule memory_modules = 3;
  repeated Accelerator accelerators = 4;
  repeated NIC nics = 5;
  repeated Drive drives = 6;
}
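
For working with the JSON form of these messages, hand-written Go structs with matching field names are enough. This is only a sketch: the service most likely uses protobuf-generated types, and the struct definitions, `decodeMachine`, and `totalMemoryBytes` below are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type CPU struct {
	Manufacturer   string `json:"manufacturer"`
	ClockFrequency int64  `json:"clock_frequency"` // hertz
	Cores          int64  `json:"cores"`
}

type MemoryModule struct {
	Size int64 `json:"size"` // bytes
}

type Accelerator struct {
	Manufacturer string `json:"manufacturer"`
}

type NIC struct {
	MAC string `json:"mac"`
}

type Drive struct {
	Capacity int64 `json:"capacity"` // bytes
}

type Machine struct {
	ID            string         `json:"id,omitempty"`
	CPUs          []CPU          `json:"cpus"`
	MemoryModules []MemoryModule `json:"memory_modules"`
	Accelerators  []Accelerator  `json:"accelerators"`
	NICs          []NIC          `json:"nics"`
	Drives        []Drive        `json:"drives"`
}

// decodeMachine parses a JSON machine profile.
func decodeMachine(body string) (Machine, error) {
	var m Machine
	err := json.Unmarshal([]byte(body), &m)
	return m, err
}

// totalMemoryBytes sums the sizes of all memory modules.
func totalMemoryBytes(m Machine) int64 {
	var total int64
	for _, mod := range m.MemoryModules {
		total += mod.Size
	}
	return total
}

// sample is the request body from the POST example.
const sample = `{
  "cpus": [{"manufacturer": "Intel", "clock_frequency": 2400000000, "cores": 8}],
  "memory_modules": [{"size": 17179869184}, {"size": 17179869184}],
  "accelerators": [],
  "nics": [{"mac": "52:54:00:12:34:56"}],
  "drives": [{"capacity": 500107862016}]
}`

func main() {
	m, err := decodeMachine(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(totalMemoryBytes(m)) // 34359738368 bytes, i.e. 32 GiB
}
```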

2.2 - GET /api/v1/machines

List all registered machines

List all registered machines with optional filtering by MAC address.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: GET /api/v1/machines?mac=...
    API->>DB: Query machines with filters
    DB-->>API: Machine list
    API-->>Client: 200 OK (machines list)

Request

Query Parameters:

Parameter	Type	Required	Description	Default
`page`	integer	No	Page number (1-indexed)	1
`per_page`	integer	No	Results per page (1-100)	20
`mac`	string	No	Filter by NIC MAC address	-
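
The defaults and bounds in this table could be applied as in the following sketch. Rejecting out-of-range values instead of clamping them is an assumption, and the `listParams` type and `parseListParams` function are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// listParams holds the parsed query parameters of GET /api/v1/machines.
type listParams struct {
	Page    int
	PerPage int
	MAC     string
}

// parseListParams applies the documented defaults (page=1, per_page=20)
// and rejects values outside the documented ranges.
func parseListParams(rawQuery string) (listParams, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return listParams{}, err
	}
	p := listParams{Page: 1, PerPage: 20, MAC: q.Get("mac")}
	if v := q.Get("page"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return listParams{}, fmt.Errorf("page must be a positive integer, got %q", v)
		}
		p.Page = n
	}
	if v := q.Get("per_page"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 || n > 100 {
			return listParams{}, fmt.Errorf("per_page must be between 1 and 100, got %q", v)
		}
		p.PerPage = n
	}
	return p, nil
}

func main() {
	fmt.Println(parseListParams(""))
	fmt.Println(parseListParams("mac=52:54:00:12:34:56&page=2"))
	fmt.Println(parseListParams("per_page=101"))
}
```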

Example Request:

GET /api/v1/machines?page=1&per_page=20 HTTP/1.1
Host: machine.example.com

Example Request with MAC filter:

GET /api/v1/machines?mac=52:54:00:12:34:56 HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

{
  "machines": [
    {
      "id": "018c7dbd-c000-7000-8000-fedcba987654",
      "cpus": [
        {
          "manufacturer": "Intel",
          "clock_frequency": 2400000000,
          "cores": 8
        }
      ],
      "memory_modules": [
        {
          "size": 17179869184
        }
      ],
      "accelerators": [],
      "nics": [
        {
          "mac": "52:54:00:12:34:56"
        }
      ],
      "drives": [
        {
          "capacity": 500107862016
        }
      ]
    }
  ],
  "pagination": {
    "total": 1,
    "page": 1,
    "per_page": 20,
    "total_pages": 1
  }
}
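
The `pagination` values follow from simple integer arithmetic. The sketch below shows one way to derive `total_pages` and the offset of a page's first result; reporting zero pages for an empty result set is an assumption, and both function names are hypothetical.

```go
package main

import "fmt"

// totalPages returns ceil(total / perPage) using integer arithmetic.
func totalPages(total, perPage int) int {
	if total <= 0 || perPage <= 0 {
		return 0
	}
	return (total + perPage - 1) / perPage
}

// offset returns the zero-based index of the first result on a
// 1-indexed page.
func offset(page, perPage int) int {
	if page < 1 {
		page = 1
	}
	return (page - 1) * perPage
}

func main() {
	fmt.Println(totalPages(1, 20))  // 1, as in the example response
	fmt.Println(totalPages(41, 20)) // 3
	fmt.Println(offset(3, 20))      // 40
}
```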

2.3 - GET /api/v1/machines/{id}

Retrieve a specific machine by ID

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: GET /api/v1/machines/{id}
    API->>DB: Query machine by ID
    DB-->>API: Machine profile
    API-->>Client: 200 OK (machine profile)

Request

Path Parameters:

Parameter	Type	Required	Description
`id`	string	Yes	Machine identifier (UUIDv7 format)

Example Request:

GET /api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654 HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

{
  "id": "018c7dbd-c000-7000-8000-fedcba987654",
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine with specified ID not found:

{
  "type": "https://api.example.com/errors/machine-not-found",
  "title": "Machine Not Found",
  "status": 404,
  "detail": "Machine with ID 018c7dbd-c000-7000-8000-fedcba987654 not found",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}
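
A problem document like this one, including the `machine_id` extension member, can be written with a small helper. The sketch below is an illustration only: the `problem` type, `writeProblem`, `machineNotFound`, and `render` are hypothetical names, and the members are emitted in alphabetical order because they pass through a map.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// problem holds the standard RFC 7807 members plus extension members.
type problem struct {
	Type       string
	Title      string
	Status     int
	Detail     string
	Instance   string
	Extensions map[string]any
}

// MarshalJSON flattens the extension members into the top-level object.
// Standard members are written last so an extension cannot override them.
func (p problem) MarshalJSON() ([]byte, error) {
	m := make(map[string]any, len(p.Extensions)+5)
	for k, v := range p.Extensions {
		m[k] = v
	}
	m["type"] = p.Type
	m["title"] = p.Title
	m["status"] = p.Status
	m["detail"] = p.Detail
	m["instance"] = p.Instance
	return json.Marshal(m)
}

// writeProblem sends the problem with the documented content type.
func writeProblem(w http.ResponseWriter, p problem) {
	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(p.Status)
	_ = json.NewEncoder(w).Encode(p)
}

// machineNotFound builds the 404 problem for a machine ID.
func machineNotFound(id string) problem {
	return problem{
		Type:       "https://api.example.com/errors/machine-not-found",
		Title:      "Machine Not Found",
		Status:     http.StatusNotFound,
		Detail:     fmt.Sprintf("Machine with ID %s not found", id),
		Instance:   "/api/v1/machines/" + id,
		Extensions: map[string]any{"machine_id": id},
	}
}

// render writes the problem to an in-memory recorder and returns what a
// client would see.
func render(p problem) (status int, contentType string, body map[string]any) {
	rec := httptest.NewRecorder()
	writeProblem(rec, p)
	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
		panic(err)
	}
	return rec.Code, rec.Header().Get("Content-Type"), body
}

func main() {
	status, contentType, body := render(machineNotFound("018c7dbd-c000-7000-8000-fedcba987654"))
	fmt.Println(status, contentType, body["detail"])
}
```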

2.4 - PUT /api/v1/machines/{id}

Update a machine’s hardware profile

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: PUT /api/v1/machines/{id}
    API->>DB: Update machine profile
    DB-->>API: Machine updated
    API-->>Client: 200 OK (updated profile)

Request

Path Parameters:

Parameter	Type	Required	Description
`id`	string	Yes	Machine identifier (UUIDv7 format)

Request Body:

Full machine profile (same structure as POST /api/v1/machines):

{
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Response

Response (200 OK):

Full machine profile with updated fields:

{
  "id": "018c7dbd-c000-7000-8000-fedcba987654",
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

400 Bad Request - Invalid request body:

{
  "type": "https://api.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "The request body failed validation",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "invalid_fields": [
    {
      "field": "nics",
      "reason": "at least one NIC is required"
    }
  ]
}

404 Not Found - Machine with specified ID not found:

{
  "type": "https://api.example.com/errors/machine-not-found",
  "title": "Machine Not Found",
  "status": 404,
  "detail": "Machine with ID 018c7dbd-c000-7000-8000-fedcba987654 not found",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

2.5 - DELETE /api/v1/machines/{id}

Delete a machine registration

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: DELETE /api/v1/machines/{id}
    API->>DB: Delete machine by ID
    DB-->>API: Machine deleted
    API-->>Client: 204 No Content

Request

Path Parameters:

Parameter	Type	Required	Description
`id`	string	Yes	Machine identifier (UUIDv7 format)

Example Request:

DELETE /api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654 HTTP/1.1
Host: machine.example.com

Response

Response (204 No Content):

Empty response body.

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine with specified ID not found:

{
  "type": "https://api.example.com/errors/machine-not-found",
  "title": "Machine Not Found",
  "status": 404,
  "detail": "Machine with ID 018c7dbd-c000-7000-8000-fedcba987654 not found",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

2.6 - GET /health/startup

Startup probe endpoint for Cloud Run

Indicates whether the application has completed initialization and is ready to receive traffic.

Request

Request Example:

GET /health/startup HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

Empty response body with HTTP 200 status code.

Response (503 Service Unavailable):

Empty response body with HTTP 503 status code.

Response Headers:

Cache-Control: no-cache, no-store, must-revalidate

Startup Check Components

Firestore Connection - Verifies database connectivity
Machine Management Service Readiness - Validates service initialization

Cloud Run Configuration

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Behavior

Success (200): Application is fully initialized and ready to serve requests
Failure (503): Application is still starting up or encountered initialization errors
Timeout: A probe that gets no response within 30 seconds (timeoutSeconds) counts as a failure; after 3 failed probes (failureThreshold), Cloud Run considers startup failed
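
The 503-until-ready behavior can be sketched with an atomic flag that initialization code sets once it has finished. The `startupGate` type and `probeStartup` helper are hypothetical; the real service may wire readiness differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// startupGate answers 503 until markReady has been called.
type startupGate struct {
	ready atomic.Bool
}

// markReady is called once initialization (for example, the Firestore
// connection check) has succeeded.
func (g *startupGate) markReady() { g.ready.Store(true) }

// ServeHTTP implements the startup probe endpoint with an empty body.
func (g *startupGate) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Cache-Control", "no-cache, no-store, must-revalidate")
	if !g.ready.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probeStartup calls the handler in-process and returns the status code.
func probeStartup(g *startupGate) int {
	rec := httptest.NewRecorder()
	g.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health/startup", nil))
	return rec.Code
}

func main() {
	g := &startupGate{}
	fmt.Println(probeStartup(g)) // still initializing
	g.markReady()
	fmt.Println(probeStartup(g)) // initialization finished
}
```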

Observability

Metrics:

health_check_total{probe="startup",status="ok"} - Successful startup checks
health_check_total{probe="startup",status="error"} - Failed startup checks
health_check_duration_ms{probe="startup"} - Startup check duration

Structured Logs:

{
  "severity": "INFO",
  "timestamp": "2025-11-24T03:19:00Z",
  "message": "Health check completed",
  "probe": "startup",
  "status": "ok",
  "duration_ms": 15
}

Alerts:

Startup Failure: Alert if startup check fails for > 1 minute

Testing

Manual Testing

curl -v http://localhost:8080/health/startup

Automated Testing

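// Assumes the service is already running on localhost:8080.
// Imports: net/http, testing, and testify's assert and require packages.
func TestHealthStartup(t *testing.T) {
    resp, err := http.Get("http://localhost:8080/health/startup")
    require.NoError(t, err)
    defer resp.Body.Close()

    assert.Equal(t, http.StatusOK, resp.StatusCode)
}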

Troubleshooting

Startup Check Never Succeeds

Symptoms:

Container restarts repeatedly
Cloud Run shows “unhealthy” status
Startup probe returns 503

Debugging:

# Check Cloud Run logs for startup errors
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=machine-service" \
  --limit 50 --format json | jq '.[] | select(.jsonPayload.probe=="startup")'

# Test locally with debug logging
DEBUG=true go run main.go

Common Causes:

Firestore credentials not configured
Network connectivity issues
Timeout too short for slow dependencies

2.7 - GET /health/liveness

Liveness probe endpoint for Cloud Run

Indicates whether the application is alive and healthy. Used by Cloud Run to detect and restart unhealthy instances.

Request

Request Example:

GET /health/liveness HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

Empty response body with HTTP 200 status code.

Response (503 Service Unavailable):

Empty response body with HTTP 503 status code.

Response Headers:

Cache-Control: no-cache, no-store, must-revalidate

Liveness Check Components

HTTP Server Health - Verifies the HTTP server is responsive
Basic health validation - Ensures the application can handle requests

Cloud Run Configuration

livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Behavior

Success (200): Application is healthy and functioning normally
Failure (503): Application is unhealthy and should be restarted
Consecutive Failures: After 3 consecutive failures (30 seconds), Cloud Run restarts the instance
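
The consecutive-failure rule can be made concrete with a small simulation. This models only the behavior described above, not Cloud Run's internals; `restartAt` is a hypothetical helper.

```go
package main

import "fmt"

// restartAt returns the 1-based index of the probe at which the number of
// consecutive failures reaches failureThreshold, or 0 if it never does.
// A true entry in results is a successful probe.
func restartAt(results []bool, failureThreshold int) int {
	consecutive := 0
	for i, ok := range results {
		if ok {
			consecutive = 0 // any success resets the streak
			continue
		}
		consecutive++
		if consecutive >= failureThreshold {
			return i + 1
		}
	}
	return 0
}

func main() {
	// Two failures, a success, then three failures in a row.
	fmt.Println(restartAt([]bool{false, false, true, false, false, false}, 3))
	// Failures that never occur three times in a row.
	fmt.Println(restartAt([]bool{false, true, false, true}, 3))
}
```

With `periodSeconds: 10`, probe number n is sent roughly 10 × n seconds after probing starts, which is where the 30-second figure for three failures comes from.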

Graceful Degradation

The health check is designed with graceful degradation in mind:

Critical Failures: Return 503 and trigger restart (e.g., database connection lost)
Non-Critical Failures: Log warnings but return 200 (e.g., temporary Firestore timeout)
Transient Errors: Retry internally before reporting failure
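
Retrying a transient error before reporting failure might look like the sketch below. The attempt count, the doubling backoff, and the helper names `retry` and `attemptsUntilSuccess` are assumptions for illustration; the source does not specify a retry policy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry runs op up to attempts times, doubling the wait between tries.
// It returns nil on the first success, the context error if the context
// ends while waiting, and otherwise the last error from op.
func retry(ctx context.Context, attempts int, backoff time.Duration, op func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point in waiting after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return err
}

// attemptsUntilSuccess runs an operation that fails the first `failures`
// times and reports how often it was called and whether retry succeeded.
func attemptsUntilSuccess(failures, maxAttempts int) (calls int, ok bool) {
	op := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("transient error")
		}
		return nil
	}
	err := retry(context.Background(), maxAttempts, time.Millisecond, op)
	return calls, err == nil
}

func main() {
	fmt.Println(attemptsUntilSuccess(2, 3)) // recovers on the third attempt
	fmt.Println(attemptsUntilSuccess(5, 3)) // gives up after three attempts
}
```

The total retry time should stay well below the probe's `timeoutSeconds`, otherwise the probe times out before the handler can report its result.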

Observability

Metrics:

health_check_total{probe="liveness",status="ok"} - Successful liveness checks
health_check_total{probe="liveness",status="error"} - Failed liveness checks
health_check_duration_ms{probe="liveness"} - Liveness check duration

Structured Logs:

{
  "severity": "INFO",
  "timestamp": "2025-11-24T03:19:00Z",
  "message": "Health check completed",
  "probe": "liveness",
  "status": "ok",
  "duration_ms": 15
}

Alerts:

Liveness Failure: Alert if liveness check fails 3+ times consecutively
High Restart Rate: Alert if container restarts > 3 times in 5 minutes

Testing

Manual Testing

curl -v http://localhost:8080/health/liveness

Load Testing

Health check endpoints should handle high request rates without degrading application performance:

Target: 100 requests/second sustained
Latency: < 10 ms average response time
Resource Impact: < 1% CPU, < 10MB memory overhead

Troubleshooting

Liveness Check Intermittent Failures

Symptoms:

Occasional container restarts
Liveness probe returns 503 sporadically
High request latency

Debugging:

# Check error rate in last 5 minutes
gcloud monitoring time-series list \
  --filter='metric.type="custom.googleapis.com/health_check_total" AND metric.labels.status="error"' \
  --interval-start-time="5 minutes ago"

# Check for resource exhaustion (Cloud Run)
gcloud run services describe machine-service --region=<region> --format=json | jq '.status'

Common Causes:

Database connection pool exhausted
Memory pressure triggering GC pauses
High request volume overwhelming server
Dependency timeouts

Security Considerations

Unauthenticated Access

Health check endpoints are intentionally unauthenticated to allow Cloud Run infrastructure to probe without credentials. This is safe because:

Endpoints return only HTTP status codes (no response body)
No sensitive data is returned
Rate limiting prevents abuse
Endpoints are read-only

Information Disclosure

Health checks return only HTTP status codes with no response body, ensuring:

No internal IP addresses disclosed
No error messages or stack traces exposed
No database connection strings revealed
No API keys or secrets leaked

Detailed diagnostics are logged internally (not returned in response):

{
  "severity": "ERROR",
  "message": "Firestore connection failed",
  "error": "rpc error: code = PermissionDenied desc = Missing or insufficient permissions"
}
