Machine Service

Service for managing machine hardware profiles

The Machine Service is a REST API that manages machine hardware profiles for the network boot infrastructure. It stores machine specifications (CPUs, memory, NICs, drives, accelerators) in Firestore and is queried by the Boot Service during boot operations and by administrators for configuration management.

Architecture

The service is responsible for:

  • Machine Profile Management: Creating, listing, retrieving, updating, and deleting machine hardware profiles
  • Hardware Specification Storage: Storing detailed hardware specifications in Firestore
  • Machine Lookup: Providing machine profile queries by ID or NIC MAC address

Components

  • Firestore: Stores machine hardware profiles
  • REST API: HTTP endpoints for machine profile management

Clients

The service is consumed by:

  1. Boot Service: Queries machine profiles by MAC address during boot operations
  2. Admin Tools: CLI or web interfaces for managing machine inventory
  3. Monitoring Systems: Hardware inventory and asset management tools

Deployment

  • Platform: GCP Cloud Run
  • Scaling: Automatic scaling based on request load
  • Availability: Minimum instances set to 1, keeping one instance warm to avoid cold-start latency
  • Region: Same region as Boot Service for minimal latency

API Endpoints

Machine Management

Rate Limiting

Admin API endpoints are rate-limited to prevent abuse:

  • Per User/Service Account: 100 requests/minute
  • Per IP Address: 300 requests/minute
  • Global: 1000 requests/minute

Rate limit headers are included in responses:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1700000000

When a rate limit is exceeded, the API returns 429 Too Many Requests in RFC 7807 Problem Details format (see ADR-0007):

{
  "type": "https://api.example.com/errors/rate-limit-exceeded",
  "title": "Rate Limit Exceeded",
  "status": 429,
  "detail": "Rate limit exceeded. Try again in 30 seconds.",
  "instance": "/api/v1/machines",
  "retry_after": 30
}

All error responses use Content-Type: application/problem+json.

Versioning

The Admin API uses URL versioning (/api/v1/):

  • Current Version: v1
  • Deprecation Policy: Minimum 6 months notice before version deprecation
  • Version Header: X-API-Version: v1 included in all responses

1 - POST /api/v1/machines

Register a new machine with hardware specifications.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: POST /api/v1/machines
    API->>API: Generate machine id (UUIDv7)
    API->>API: Validate machine profile
    API->>DB: Insert machine profile
    DB-->>API: Machine created
    API-->>Client: 201 Created (machine id)

Request

Request Body:

{
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Response

Response (201 Created):

{
  "id": "018c7dbd-c000-7000-8000-fedcba987654"
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

400 Bad Request - Invalid request body or missing required fields:

{
  "type": "https://api.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "The request body failed validation",
  "instance": "/api/v1/machines",
  "invalid_fields": [
    {
      "field": "nics",
      "reason": "at least one NIC is required"
    }
  ]
}

409 Conflict - Machine with the same NIC MAC address already exists:

{
  "type": "https://api.example.com/errors/duplicate-mac-address",
  "title": "Duplicate MAC Address",
  "status": 409,
  "detail": "A machine with MAC address 52:54:00:12:34:56 already exists",
  "instance": "/api/v1/machines",
  "mac_address": "52:54:00:12:34:56",
  "existing_machine_id": "018c7dbd-a000-7000-8000-fedcba987650"
}

Notes

  • The machine ID is generated server-side (UUIDv7)
  • MAC addresses must be unique across all machines
  • All size/capacity values are in bytes
  • Clock frequency is in hertz
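The NIC rules above can be checked before anything is written to the database. This is a sketch under stated assumptions: only the "at least one NIC" rule and cross-machine MAC uniqueness come from this document, while the 48-bit format check and the duplicate check within a single request are assumptions, and validateNICs is a hypothetical name. Uniqueness across all machines needs a Firestore lookup, which is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// NIC mirrors the "nics" entries of the request body.
type NIC struct {
	MAC string `json:"mac"`
}

// validateNICs (hypothetical helper) enforces the documented "at least one
// NIC" rule and, as an assumption, rejects malformed or repeated MACs within
// a single request.
func validateNICs(nics []NIC) error {
	if len(nics) == 0 {
		return errors.New("at least one NIC is required")
	}
	seen := make(map[string]bool)
	for _, n := range nics {
		hw, err := net.ParseMAC(n.MAC)
		if err != nil || len(hw) != 6 {
			return fmt.Errorf("invalid MAC address %q", n.MAC)
		}
		key := hw.String() // canonical lower-case, colon-separated form
		if seen[key] {
			return fmt.Errorf("duplicate MAC address %s", key)
		}
		seen[key] = true
	}
	return nil
}

func main() {
	fmt.Println(validateNICs(nil))
	// at least one NIC is required
	fmt.Println(validateNICs([]NIC{{MAC: "52:54:00:12:34:56"}}))
	// <nil>
	fmt.Println(validateNICs([]NIC{{MAC: "52:54:00:12:34:56"}, {MAC: "52-54-00-12-34-56"}}))
	// duplicate MAC address 52:54:00:12:34:56
}
```

Normalizing through net.ParseMAC before comparing means that the same address written with hyphens or upper-case hex digits is still detected as a duplicate.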

Data Models

All data models are defined as Protocol Buffer (protobuf) messages and stored in Firestore.

Machine

syntax = "proto3";

message CPU {
  string manufacturer = 1;
  int64 clock_frequency = 2;  // measured in hertz
  int64 cores = 3;            // number of cores
}

message MemoryModule {
  int64 size = 1;             // measured in bytes
}

message Accelerator {
  string manufacturer = 1;
}

message NIC {
  string mac = 1;             // mac address
}

message Drive {
  int64 capacity = 1;         // capacity in bytes
}

message Machine {
  string id = 1;              // UUIDv7 machine identifier
  repeated CPU cpus = 2;
  repeated MemoryModule memory_modules = 3;
  repeated Accelerator accelerators = 4;
  repeated NIC nics = 5;
  repeated Drive drives = 6;
}

2 - GET /api/v1/machines

List all registered machines with optional filtering by MAC address.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: GET /api/v1/machines?mac=...
    API->>DB: Query machines with filters
    DB-->>API: Machine list
    API-->>Client: 200 OK (machines list)

Request

Query Parameters:

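| Parameter | Type    | Required | Description               | Default |
|-----------|---------|----------|---------------------------|---------|
| page      | integer | No       | Page number (1-indexed)   | 1       |
| per_page  | integer | No       | Results per page (1-100)  | 20      |
| mac       | string  | No       | Filter by NIC MAC address | -       |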

Example Request:

GET /api/v1/machines?page=1&per_page=20 HTTP/1.1
Host: machine.example.com

Example Request with MAC filter:

GET /api/v1/machines?mac=52:54:00:12:34:56 HTTP/1.1
Host: machine.example.com
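When a client builds these requests programmatically, a standard query encoder will percent-encode the colons in the MAC address. The sketch below assumes a Go client and an https base URL (the examples here only show the Host header); listURL is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// listURL (hypothetical helper) builds the list endpoint URL with the
// documented query parameters. The mac filter is omitted when empty.
func listURL(base string, page, perPage int, mac string) string {
	q := url.Values{}
	q.Set("page", strconv.Itoa(page))
	q.Set("per_page", strconv.Itoa(perPage))
	if mac != "" {
		q.Set("mac", mac)
	}
	// Encode sorts keys and escapes reserved characters such as ':'.
	return base + "/api/v1/machines?" + q.Encode()
}

func main() {
	fmt.Println(listURL("https://machine.example.com", 1, 20, "52:54:00:12:34:56"))
	// https://machine.example.com/api/v1/machines?mac=52%3A54%3A00%3A12%3A34%3A56&page=1&per_page=20
}
```

The encoded form (%3A) and the literal form shown in the example request decode to the same parameter value on the server.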

Response

Response (200 OK):

{
  "machines": [
    {
      "id": "018c7dbd-c000-7000-8000-fedcba987654",
      "cpus": [
        {
          "manufacturer": "Intel",
          "clock_frequency": 2400000000,
          "cores": 8
        }
      ],
      "memory_modules": [
        {
          "size": 17179869184
        }
      ],
      "accelerators": [],
      "nics": [
        {
          "mac": "52:54:00:12:34:56"
        }
      ],
      "drives": [
        {
          "capacity": 500107862016
        }
      ]
    }
  ],
  "pagination": {
    "total": 1,
    "page": 1,
    "per_page": 20,
    "total_pages": 1
  }
}
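The pagination object follows from the query parameters by simple arithmetic: total_pages is the ceiling of total divided by per_page. A minimal sketch, assuming out-of-range values are clamped rather than rejected (this document does not say which the service does); normalize and totalPages are hypothetical names.

```go
package main

import "fmt"

// normalize applies the documented defaults and bounds: page is 1-indexed
// (default 1) and per_page is 1-100 (default 20). A zero value stands for
// "parameter not supplied". Clamping instead of returning 400 is an
// assumption.
func normalize(page, perPage int) (int, int) {
	if page < 1 {
		page = 1
	}
	switch {
	case perPage == 0:
		perPage = 20
	case perPage < 1:
		perPage = 1
	case perPage > 100:
		perPage = 100
	}
	return page, perPage
}

// totalPages returns ceil(total / perPage) using integer arithmetic.
func totalPages(total, perPage int) int {
	if total <= 0 {
		return 0
	}
	return (total + perPage - 1) / perPage
}

func main() {
	page, perPage := normalize(0, 0)
	fmt.Println(page, perPage, totalPages(1, perPage)) // 1 20 1
	fmt.Println(totalPages(45, 20))                    // 3
}
```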

3 - GET /api/v1/machines/{id}

Retrieve a specific machine by ID.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: GET /api/v1/machines/{id}
    API->>DB: Query machine by ID
    DB-->>API: Machine profile
    API-->>Client: 200 OK (machine profile)

Request

Path Parameters:

| Parameter | Type   | Required | Description                        |
|-----------|--------|----------|------------------------------------|
| id        | string | Yes      | Machine identifier (UUIDv7 format) |
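A format check for the id parameter can be sketched as follows, assuming the RFC 9562 layout for UUIDv7: a 48-bit Unix timestamp in milliseconds, version nibble 7, and variant bits 10. parseUUIDv7 is a hypothetical helper, and how the service responds to a malformed ID is not specified in this document.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// parseUUIDv7 (hypothetical helper) validates the canonical 8-4-4-4-12
// layout, the version nibble, and the variant bits, then returns the
// timestamp embedded in the first 48 bits.
func parseUUIDv7(id string) (time.Time, error) {
	if len(id) != 36 || id[8] != '-' || id[13] != '-' || id[18] != '-' || id[23] != '-' {
		return time.Time{}, fmt.Errorf("malformed UUID %q", id)
	}
	raw, err := hex.DecodeString(strings.ReplaceAll(id, "-", ""))
	if err != nil || len(raw) != 16 {
		return time.Time{}, fmt.Errorf("malformed UUID %q", id)
	}
	if raw[6]>>4 != 7 {
		return time.Time{}, fmt.Errorf("UUID %q is not version 7", id)
	}
	if raw[8]>>6 != 0b10 {
		return time.Time{}, fmt.Errorf("UUID %q has an unexpected variant", id)
	}
	var ms int64
	for _, b := range raw[:6] {
		ms = ms<<8 | int64(b)
	}
	return time.UnixMilli(ms).UTC(), nil
}

func main() {
	ts, err := parseUUIDv7("018c7dbd-c000-7000-8000-fedcba987654")
	fmt.Println(ts.UnixMilli(), err) // 1702916636672 <nil>

	_, err = parseUUIDv7("018c7dbd-c000-4000-8000-fedcba987654")
	fmt.Println(err != nil) // true (version nibble is 4, not 7)
}
```

Because the timestamp occupies the most significant bits, UUIDv7 identifiers sort roughly by creation time.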

Example Request:

GET /api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654 HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

{
  "id": "018c7dbd-c000-7000-8000-fedcba987654",
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine with specified ID not found:

{
  "type": "https://api.example.com/errors/machine-not-found",
  "title": "Machine Not Found",
  "status": 404,
  "detail": "Machine with ID 018c7dbd-c000-7000-8000-fedcba987654 not found",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

4 - PUT /api/v1/machines/{id}

Update a machine’s hardware profile.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: PUT /api/v1/machines/{id}
    API->>DB: Update machine profile
    DB-->>API: Machine updated
    API-->>Client: 200 OK (updated profile)

Request

Path Parameters:

| Parameter | Type   | Required | Description                        |
|-----------|--------|----------|------------------------------------|
| id        | string | Yes      | Machine identifier (UUIDv7 format) |

Request Body:

Full machine profile (same structure as POST /api/v1/machines):

{
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Response

Response (200 OK):

Full machine profile with updated fields:

{
  "id": "018c7dbd-c000-7000-8000-fedcba987654",
  "cpus": [
    {
      "manufacturer": "Intel",
      "clock_frequency": 2400000000,
      "cores": 8
    }
  ],
  "memory_modules": [
    {
      "size": 17179869184
    },
    {
      "size": 17179869184
    }
  ],
  "accelerators": [],
  "nics": [
    {
      "mac": "52:54:00:12:34:56"
    }
  ],
  "drives": [
    {
      "capacity": 500107862016
    }
  ]
}

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

400 Bad Request - Invalid request body:

{
  "type": "https://api.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "The request body failed validation",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "invalid_fields": [
    {
      "field": "nics",
      "reason": "at least one NIC is required"
    }
  ]
}

404 Not Found - Machine with specified ID not found:

{
  "type": "https://api.example.com/errors/machine-not-found",
  "title": "Machine Not Found",
  "status": 404,
  "detail": "Machine with ID 018c7dbd-c000-7000-8000-fedcba987654 not found",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

5 - DELETE /api/v1/machines/{id}

Delete a machine registration.

Sequence Diagram

sequenceDiagram
    participant Client as Admin Client
    participant API as Machine Service
    participant DB as Firestore

    Client->>API: DELETE /api/v1/machines/{id}
    API->>DB: Delete machine by ID
    DB-->>API: Machine deleted
    API-->>Client: 204 No Content

Request

Path Parameters:

| Parameter | Type   | Required | Description                        |
|-----------|--------|----------|------------------------------------|
| id        | string | Yes      | Machine identifier (UUIDv7 format) |

Example Request:

DELETE /api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654 HTTP/1.1
Host: machine.example.com

Response

Response (204 No Content):

Empty response body.

Error Responses:

All error responses follow RFC 7807 Problem Details format (see ADR-0007) with Content-Type: application/problem+json.

404 Not Found - Machine with specified ID not found:

{
  "type": "https://api.example.com/errors/machine-not-found",
  "title": "Machine Not Found",
  "status": 404,
  "detail": "Machine with ID 018c7dbd-c000-7000-8000-fedcba987654 not found",
  "instance": "/api/v1/machines/018c7dbd-c000-7000-8000-fedcba987654",
  "machine_id": "018c7dbd-c000-7000-8000-fedcba987654"
}

6 - GET /health/startup

Startup probe endpoint for Cloud Run

Indicates whether the application has completed initialization and is ready to receive traffic.

Request

Request Example:

GET /health/startup HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

Empty response body with HTTP 200 status code.

Response (503 Service Unavailable):

Empty response body with HTTP 503 status code.

Response Headers:

  • Cache-Control: no-cache, no-store, must-revalidate

Startup Check Components

  1. Firestore Connection - Verifies database connectivity
  2. Machine Service Readiness - Validates service initialization

Cloud Run Configuration

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Behavior

  • Success (200): Application is fully initialized and ready to serve requests
  • Failure (503): Application is still starting up or encountered initialization errors
  • Timeout: A probe attempt that receives no response within 30 seconds counts as a failure; after 3 failed attempts (failureThreshold), Cloud Run considers startup failed

Observability

Metrics:

  • health_check_total{probe="startup",status="ok"} - Successful startup checks
  • health_check_total{probe="startup",status="error"} - Failed startup checks
  • health_check_duration_ms{probe="startup"} - Startup check duration

Structured Logs:

{
  "severity": "INFO",
  "timestamp": "2025-11-24T03:19:00Z",
  "message": "Health check completed",
  "probe": "startup",
  "status": "ok",
  "duration_ms": 15
}

Alerts:

  • Startup Failure: Alert if startup check fails for > 1 minute

Testing

Manual Testing

curl -v http://localhost:8080/health/startup

Automated Testing

func TestHealthStartup(t *testing.T) {
    resp, err := http.Get("http://localhost:8080/health/startup")
    require.NoError(t, err)
    defer resp.Body.Close()

    assert.Equal(t, http.StatusOK, resp.StatusCode)
}

Troubleshooting

Startup Check Never Succeeds

Symptoms:

  • Container restarts repeatedly
  • Cloud Run shows “unhealthy” status
  • Startup probe returns 503

Debugging:

# Check Cloud Run logs for startup errors
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=machine-service" \
  --limit 50 --format json | jq '.[] | select(.jsonPayload.probe=="startup")'

# Test locally with debug logging
DEBUG=true go run main.go

Common Causes:

  • Firestore credentials not configured
  • Network connectivity issues
  • Timeout too short for slow dependencies

7 - GET /health/liveness

Liveness probe endpoint for Cloud Run

Indicates whether the application is alive and healthy. Used by Cloud Run to detect and restart unhealthy instances.

Request

Request Example:

GET /health/liveness HTTP/1.1
Host: machine.example.com

Response

Response (200 OK):

Empty response body with HTTP 200 status code.

Response (503 Service Unavailable):

Empty response body with HTTP 503 status code.

Response Headers:

  • Cache-Control: no-cache, no-store, must-revalidate

Liveness Check Components

  1. HTTP Server Health - Verifies the HTTP server is responsive
  2. Basic Health Validation - Ensures the application can handle requests

Cloud Run Configuration

livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  initialDelaySeconds: 0
  timeoutSeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Behavior

  • Success (200): Application is healthy and functioning normally
  • Failure (503): Application is unhealthy and should be restarted
  • Consecutive Failures: After 3 consecutive failures (about 30 seconds at the 10-second probe period), Cloud Run restarts the instance

Graceful Degradation

The health check is designed with graceful degradation in mind:

  • Critical Failures: Return 503 and trigger restart (e.g., database connection lost)
  • Non-Critical Failures: Log warnings but return 200 (e.g., temporary Firestore timeout)
  • Transient Errors: Retry internally before reporting failure

Observability

Metrics:

  • health_check_total{probe="liveness",status="ok"} - Successful liveness checks
  • health_check_total{probe="liveness",status="error"} - Failed liveness checks
  • health_check_duration_ms{probe="liveness"} - Liveness check duration

Structured Logs:

{
  "severity": "INFO",
  "timestamp": "2025-11-24T03:19:00Z",
  "message": "Health check completed",
  "probe": "liveness",
  "status": "ok",
  "duration_ms": 15
}

Alerts:

  • Liveness Failure: Alert if liveness check fails 3+ times consecutively
  • High Restart Rate: Alert if container restarts > 3 times in 5 minutes

Testing

Manual Testing

curl -v http://localhost:8080/health/liveness

Load Testing

Health check endpoints should handle high request rates without degrading application performance:

  • Target: 100 requests/second sustained
  • Latency: < 10 ms average response time
  • Resource Impact: < 1% CPU, < 10MB memory overhead

Troubleshooting

Liveness Check Intermittent Failures

Symptoms:

  • Occasional container restarts
  • Liveness probe returns 503 sporadically
  • High request latency

Debugging:

# Check error rate in last 5 minutes
gcloud monitoring time-series list \
  --filter='metric.type="custom.googleapis.com/health_check_total" AND metric.labels.status="error"' \
  --interval-start-time="5 minutes ago"

# Check for resource exhaustion (Cloud Run)
gcloud run services describe machine-service --region=<region> --format=json | jq '.status'

Common Causes:

  • Database connection pool exhausted
  • Memory pressure triggering GC pauses
  • High request volume overwhelming server
  • Dependency timeouts

Security Considerations

Unauthenticated Access

Health check endpoints are intentionally unauthenticated to allow Cloud Run infrastructure to probe without credentials. This is safe because:

  1. Endpoints return only HTTP status codes (no response body)
  2. No sensitive data is returned
  3. Rate limiting prevents abuse
  4. Endpoints are read-only

Information Disclosure

Health checks return only HTTP status codes with no response body, ensuring:

  • No internal IP addresses disclosed
  • No error messages or stack traces exposed
  • No database connection strings revealed
  • No API keys or secrets leaked

Detailed diagnostics are logged internally (not returned in response):

{
  "severity": "ERROR",
  "message": "Firestore connection failed",
  "error": "rpc error: code = PermissionDenied desc = Missing or insufficient permissions"
}