Self-Hosted Application Platform

A reproducible, operator-controlled platform that hosts the operator’s other capabilities, so each one does not have to solve hosting on its own or fall back to a vendor.

One-line definition: Provide a reproducible, operator-controlled platform on which the operator’s other capabilities run by default, so that no capability has to depend on a vendor-specific hosting solution to be delivered.

Purpose & Business Outcome

What business outcome does this capability deliver? Why does it exist?

This capability exists so that the operator’s other capabilities (e.g. self-hosted personal media storage) have a well-defined, reproducible place to run that the operator controls end-to-end, instead of each capability independently choosing a vendor (e.g. a hosted Plex provider, a hosted Minecraft provider, a hosted Nextcloud provider). The outcomes it delivers, in order of importance:

  1. Default hosting target for the operator’s capabilities. Any capability the operator defines should be able to run here, so that “where does this run?” is a solved question rather than re-litigated per capability.
  2. Reproducibility. The platform itself can be rebuilt from its definitions; it is not a snowflake. A total loss does not mean a permanent loss of the platform.
  3. Independence from hosting vendors. The operator is not locked into any single provider’s product roadmap, pricing, or terms for the things their capabilities depend on.
  4. A coherent place to invest infrastructure effort. Improvements (resiliency, observability, backup) made once at the platform level benefit every tenant capability, instead of each capability re-solving them.

When these outcomes conflict: tenant adoption beats reproducibility (a perfect platform with no tenants is a failure); reproducibility beats vendor independence (a platform that can’t be rebuilt is worse than one that uses some vendor components); vendor independence beats minimizing operator effort.

Stakeholders

  • Owner / Accountable party: The operator. Sole accountable party for the platform existing, running, and continuing to run.
  • Primary actors (initiators): Capability owners — currently the operator wearing a different hat — who bring a capability to the platform to be hosted, or change what an already-hosted capability needs.
  • Secondary actors / consumers: The tenant capabilities themselves, while running, consume platform services (compute, storage, network, identity, backup, observability).
  • Affected parties (impacted but not directly involved): End users of the tenant capabilities (e.g. family and friends using self-hosted personal media storage). They never interact with the platform directly, but a platform outage or data loss directly affects them.

Triggers & Inputs

What initiates the capability, and what information must be available?

  • Triggers:
    • A capability owner brings a new capability to be hosted.
    • A capability owner changes the requirements of an already-hosted capability (more storage, new external endpoint, etc.).
    • The operator stands up the platform from scratch (initial build or full rebuild after loss).
    • The operator performs routine maintenance on the platform.
    • A tenant capability’s components fall behind what the platform supports and need to be updated.
  • Required inputs:
    • From the capability owner: the capability packaged in the form the platform accepts, a declaration of its resource needs (compute, storage, network reachability), and its availability expectations (one possible shape of this declaration is sketched after this list).
    • For tenants whose end users need to authenticate: either use of the platform-provided identity service, or a declared decision to bring their own.
  • Preconditions:
    • The operator has authorized the capability to run on the platform (no self-onboarding by tenants — the operator is the only person making this decision).
    • The capability accepts the platform’s contract (see Business Rules).
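
The required inputs above are the tenant's whole interface to the platform, so for illustration only, here is one possible shape of the capability owner's declaration as a small structured record. Every field name below is an assumption made for the sketch, not a platform API; the form the platform actually accepts is deliberately left unspecified.

# Hypothetical sketch of a tenant's resource-needs declaration.
# Field names and units are illustrative assumptions, not a platform API.
from dataclasses import dataclass, field

@dataclass
class ResourceNeeds:
    compute_cpus: float          # sustained vCPUs the capability needs
    memory_gib: float            # sustained memory
    storage_gib: float           # persistent storage for tenant data
    external_endpoints: list[str] = field(default_factory=list)  # reachable by end users
    internal_peers: list[str] = field(default_factory=list)      # other tenants it talks to

@dataclass
class TenantDeclaration:
    capability_name: str
    needs: ResourceNeeds
    identity: str                  # "platform-provided" or "bring-your-own"
    availability_expectation: str  # must accept the platform's best-effort characteristics

declaration = TenantDeclaration(
    capability_name="personal-media-storage",
    needs=ResourceNeeds(compute_cpus=2, memory_gib=4, storage_gib=500,
                        external_endpoints=["media.example.net"]),
    identity="platform-provided",
    availability_expectation="best-effort",
)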

Outputs & Deliverables

What does the capability produce? What changes in the world after it runs?

  • Direct outputs: For each tenant capability, the platform provides:
    • Compute — a place for the application to run.
    • Persistent storage — durable storage for the application’s data.
    • Network reachability — both internal (between tenants) and external (reachable by the tenant’s end users).
    • Identity & authentication for end users — available to any tenant that wants it; tenants may opt to bring their own.
    • Backup and disaster recovery — of tenant data, to a standard the platform defines.
    • Observability — the operator can tell whether each tenant is up and healthy without the tenant having to instrument that itself.
  • Downstream effects / state changes:
    • The operator’s capabilities have a default answer to “where does this run?” and stop being individually coupled to vendor choices.
    • Investments in resiliency, backup, and observability accrue across all tenants instead of being repeated per capability.
    • The operator accumulates operational knowledge of one platform rather than fragmented knowledge of many vendor products.

Business Rules & Constraints

  • Default hosting target. All capabilities defined in this repo are expected to run on the platform unless explicitly exempted. A capability owner may choose to host elsewhere, but the platform is the default and the burden of justification is on opting out.
  • Operator-only operation. Only the operator operates the platform and has administrative access to it. There are no co-operators and no delegated administration. A designated successor (see Operator succession) holds sealed/escrowed emergency credentials but does not exercise them while the operator is active — there is no shared day-to-day administration and no routine successor access.
  • Operator skill development is incidental, not an outcome. The operator may personally learn from building and running the platform, but skill development is not a stated outcome of this capability and must not influence buy-vs-build trade-offs. Those trade-offs are judged on convenience, resiliency, and cost only — “I want to learn this” is not, on its own, a valid reason to choose build over buy at the capability level.
  • Tenants must accept the platform’s contract. To be hosted, a tenant must be packaged in the form the platform accepts, declare its resource needs up front, and accept the platform’s availability characteristics. A tenant that needs guarantees stronger than the platform offers must host elsewhere.
  • Eviction is allowed when needs and capabilities diverge. The platform may decline to continue hosting a tenant whose requirements it cannot meet (e.g. specialized hardware, regulatory constraints, an availability target the platform does not offer). However, where the divergence is merely that the tenant’s components have fallen behind what the platform supports, the platform works with the tenant to bring them current rather than evicting.
  • Eviction threshold. A tenant is evicted when accommodating it would either push routine operation sustainably above twice the Operator maintenance budget KPI, or break the Reproducibility KPI (e.g. requires manual snowflake configuration that cannot be captured as definitions). Either condition alone is sufficient grounds for eviction. The numeric thresholds are whatever those KPIs currently say; this rule is not restated in absolute hours so it cannot drift from them. A sketch of this rule as a predicate follows this list.
  • Identity service honors tenant credential-recovery rules. Any identity implementation the platform offers to tenants must be capable of honoring a “lost credentials cannot be recovered” property (Signal-style), because at least one tenant (self-hosted personal media storage) requires it. An identity option that cannot honor this property is not eligible to be the platform-provided identity service.
  • Operator succession. The platform must support both (a) on-demand exportable archives so each tenant’s users can retrieve their own content without operator involvement while the platform is healthy — users are expected to pull these proactively (and may schedule periodic pulls), since on-demand export is only available when the platform is up — and (b) a designated successor operator who holds the credentials and runbook needed to keep the platform running if the primary operator becomes unavailable. Successor credentials are sealed/escrowed (e.g. via a password-manager handoff or physical envelope) and not used for routine operation; takeover is a discrete event triggered by operator unavailability, not ongoing shared administration. The two mechanisms are complementary: exports preserve user data even if no successor takes over; a successor preserves continuity of the platform itself. If the platform is down and no successor takes over, only previously-pulled exports survive — this is the accepted trade-off.
  • The platform may span public and private infrastructure. “Self-hosted” means the operator controls the platform end-to-end, not that every component runs on hardware the operator owns. Public-cloud components are allowed where the operator retains control of configuration, data, and the ability to leave.
  • No direct end-user access to the platform. End users of tenant capabilities reach the tenant, not the platform. The platform has no notion of “end users” of itself; its consumers are tenant capabilities (and behind them, the operator).
  • Cost is secondary to convenience and resiliency. Because there is one operator, added cost is acceptable when it buys meaningful convenience or resiliency. Cost should still be minimized where it does not cost convenience or resiliency.
  • The capability evolves with its tenants. When a tenant capability needs something the platform does not yet provide, the default response is to update this capability’s definition (and the platform) rather than push the requirement back onto the tenant.
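
A minimal sketch of the eviction-threshold rule as a predicate, assuming the KPI value is read from a single place so the rule cannot drift from it. The function and variable names are illustrative, not platform tooling.

# Illustrative predicate for the eviction-threshold rule above.
# Names are assumptions; the KPIs remain the single source of truth.

MAINTENANCE_BUDGET_HOURS_PER_WEEK = 2.0  # current Operator maintenance budget KPI

def eviction_warranted(sustained_routine_hours_per_week: float,
                       requires_unreproducible_config: bool) -> bool:
    """Either condition alone is sufficient grounds for eviction."""
    over_budget = (sustained_routine_hours_per_week
                   > 2 * MAINTENANCE_BUDGET_HOURS_PER_WEEK)
    return over_budget or requires_unreproducible_config

# Example: a tenant that settles at 5 h/week of routine accommodation is over
# 2 x the 2 h/week budget, even if it stays fully reproducible.
assert eviction_warranted(5.0, requires_unreproducible_config=False)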

Success Criteria & KPIs

  • Tenant adoption. Every implemented capability defined in this repo runs on this platform. A capability is “implemented” when it is deployed and serving its intended users in production — distinct from “defined” (a capability doc exists) and “designed” (a technical design exists but nothing is running). Only implemented capabilities count toward this KPI; defined-or-designed-only capabilities are neutral, neither success nor failure. An implemented capability that runs elsewhere counts negatively against this KPI: either the platform did not meet the tenant’s needs, or the tenant was never asked to use it.
  • Reproducibility. The platform can be stood up from its definitions in at most 1 hour, starting from no platform at all. This is the operational form of “reproducible” — if it takes longer than that, the platform is a snowflake regardless of how much of its config is in version control. A sketch of checking this mechanically follows this list.
  • Operator maintenance budget. Routine operation of the platform takes no more than 2 hours per week of the operator’s time. If maintenance regularly exceeds this, the platform is consuming more attention than it earns and must be simplified, not grown.
  • Cost stays proportional to value. Total operating cost remains within what the operator considers acceptable given the convenience and resiliency it delivers. There is no fixed dollar target; the test is whether the operator would still choose to run it knowing the bill.
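
Because the reproducibility KPI is operational, it can be checked mechanically: time a full stand-up from definitions and fail the check if it exceeds the hour. A minimal sketch, assuming a hypothetical ./rebuild-platform entry point that drives the definitions; the script name is the only thing invented here.

# Minimal sketch of checking the Reproducibility KPI: stand the platform up
# from its definitions and fail if the rebuild exceeds one hour.
# "./rebuild-platform" is a hypothetical entry point, not a real script.
import subprocess
import sys
import time

LIMIT_SECONDS = 60 * 60  # Reproducibility KPI: at most 1 hour from nothing

start = time.monotonic()
result = subprocess.run(["./rebuild-platform"])
elapsed = time.monotonic() - start

if result.returncode != 0:
    sys.exit("rebuild failed; the platform is not reproducible at all")
if elapsed > LIMIT_SECONDS:
    sys.exit(f"rebuild took {elapsed / 3600:.2f} h; over the 1-hour KPI, so the platform is a snowflake")
print(f"rebuild completed in {elapsed / 60:.1f} min; within the KPI")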

Out of Scope

  • Hosting for anyone other than the operator’s own capabilities. The platform does not offer hosting to third parties, the public, or family/friends directly. Family and friends reach the platform only as end users of a tenant capability (e.g. via self-hosted personal media storage), never as platform users.
  • Dictating the implementation. “Homelab,” “Kubernetes,” and any specific stack are possible implementations of this capability, not part of its definition. The capability is satisfied by anything that meets its rules and KPIs.
  • A specific availability or performance SLA. The platform offers whatever availability its current implementation can deliver within the operator’s maintenance budget. Tenants needing stronger guarantees host elsewhere (per Business Rules).
  • End-user-facing features of tenant capabilities. Photo viewing, game server gameplay, document editing, etc. are tenant concerns, not platform concerns.
  • Multi-operator administration, role delegation, or self-service onboarding. Explicitly excluded by the operator-only rule.

Open Questions

None at this time.

1 - User Experiences

End-to-end user journeys for the Self-Hosted Application Platform capability.

This section documents the user experiences for the Self-Hosted Application Platform capability — the end-to-end journeys taken by the actors named in the parent capability’s Stakeholders, in pursuit of the outcomes the capability promises.

1.1 - Host a Capability

A capability owner brings a fully-designed capability onto the platform, gets it provisioned and live, and continues to evolve its needs over time.

One-line definition: A capability owner brings a fully-designed capability onto the platform, gets it provisioned and live, and continues to evolve its needs over time.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is a capability owner — one of the people named in the parent capability’s Primary actors. Although the capability doc notes this is currently the operator wearing a different hat, this UX is written as if the capability owner were a separate person from the platform operator. The role boundary is treated as real: there is an interface, a handoff, and a contract between them.

  • Role: Capability owner. They have just finished defining one of the operator’s capabilities — its UX docs and its tech design are both complete. The tech design picked this platform as the host.
  • Context they come from: They are not building the platform; they are a customer of it. They arrive with a capability doc, a tech design that calls out which components must run on the platform, and (ideally, but not strictly required) a mapping from those components to specific platform offerings.
  • What they care about here: Getting their capability running on a controlled, reproducible substrate, declaring its needs once, understanding what they are signing up for, and having a clear path to change those needs later — without onboarding becoming a multi-day project for either side of the handoff.

Goal

“I want my capability running on the platform — with its compute, storage, network, and identity needs declared once, the platform’s contract understood, and a clear path to update those needs later — and I want it to stay running healthily as my capability evolves.”

This is a lifecycle goal, not just an onboarding one: the change-later branch lives in the same journey as the initial onboarding because it shares the same persona, surface, and contract.

Entry Point

The capability owner arrives at this experience having just finished the tech-design phase of their capability. Specifically:

  • Their capability’s UX docs are complete.
  • Their tech design is complete and explicitly designates this platform as the host for the components that need to run somewhere.
  • Their decision about whether to use platform-provided identity or bring their own has already been made and recorded in the tech design itself — it is not a fresh question at onboarding.

What they have in hand: the capability doc and the tech design. Nothing else is required. A tech design that already names specific platform offerings per component is nice; one that doesn’t can have those gaps filled during onboarding.

Their state of mind depends on what they’re asking for:

  • Fully confident if every component in their tech design maps to an offering the platform already provides.
  • Semi-confident if some component requires something the platform may or may not be able to support (e.g. their capability needs GPU compute, which the platform may never provide because no GPUs are installed and buying them is out of scope).

Journey

The capability owner’s journey is a single end-to-end flow with three branches that can occur during operator review (approved as-is, new-offering needed, declined) and one re-entry loop for changing requirements after going live.

1. File an “onboard my capability” issue on GitHub

The capability owner opens an issue against the infra repo using the onboard my capability issue type. GitHub issues are the only channel for engaging the platform — there is no self-service portal and no other front door — and this is the issue type for onboarding. They link or attach the capability doc and the tech design.

What they perceive: the issue is filed, and now they wait. There is no response-time guarantee — this is personal-scale, async by default.
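
Since a GitHub issue is the only front door, filing one can be scripted against the GitHub REST issues endpoint. In this sketch the repo slug, label name, and document URLs are placeholders, and the “onboard my capability” issue type is modeled as a label purely for illustration.

# Sketch: filing the onboarding issue through the GitHub REST API.
# Repo slug, label, and URLs are placeholder assumptions.
import os

import requests

REPO = "operator/infra"  # placeholder infra repo
issue = {
    "title": "Onboard: personal-media-storage",
    "body": (
        "Capability doc: https://example.net/docs/personal-media-storage\n"
        "Tech design:    https://example.net/designs/personal-media-storage\n\n"
        "Identity: platform-provided. Resource needs declared in the tech design."
    ),
    "labels": ["onboard-my-capability"],  # stands in for the issue type
}
resp = requests.post(
    f"https://api.github.com/repos/{REPO}/issues",
    json=issue,
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
print("Filed:", resp.json()["html_url"])  # now wait; async by default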

2. Operator review on the issue

The operator reviews the tech design with a deliberately narrow scope:

  • Does each platform-hosted component align with an existing platform offering?
  • Are there any components that would require a new platform offering to be added?

What the capability owner perceives: clarifying questions appear as comments on the issue, and possibly a meeting if the operator deems it necessary. They answer the questions in-thread.

3. Resolution — one of three branches

3a. Approved as-is. The operator comments “approved” on the issue. That comment is the moment the capability owner knows hosting is real. There is no separate contract-acceptance step at this point: the contract was accepted by virtue of the tech design already conforming to it (declared resource needs, identity choice, packaging, availability expectations).

3b. New offering needed. The operator agrees the right answer is to add a new platform offering to support the capability, and that the offering is still within the platform’s intended scope — meaning the platform can add it while keeping the offering reproducible within the parent capability’s Reproducibility KPI and routine operation within the Operator maintenance budget KPI. The operator does not commit to a timeline. The capability owner waits. While they wait, there is nothing for them to do on their side. Eventually the operator returns and the journey resumes at step 3a.

3c. Declined — host elsewhere. The operator closes the issue with a comment explaining why the request cannot be supported. That can be because it is simply impossible (e.g. the platform will never have GPUs because the hardware cannot be added), or because it is only technically possible and would require the platform to grow into an offering the operator does not want to carry as routine scope — specifically, one the platform could not keep reproducible within the parent capability’s Reproducibility KPI or operate within the Operator maintenance budget KPI. The capability owner now knows this capability has to be hosted somewhere else; the journey ends here.

4. Hand off packaged artifacts

For each component in the tech design that needs to be deployed, the capability owner provides a packaged artifact in the form the platform accepts. The capability owner does the packaging themselves; they do not hand over raw source for the operator to package.

What they perceive: they post or link the artifacts on the issue and wait.

5. Wait while the operator provisions

While the operator is actually wiring up compute, storage, networking, identity, backup, and observability for the new tenant, the capability owner does nothing. They are not pinged for DNS choices or secrets. They simply wait until the operator asks them to test.

6. Test on request

The operator comments asking the capability owner to test the deployed capability. The capability owner exercises it however they would normally validate that their capability works (this is their judgment — the platform doesn’t prescribe a test plan).

  • If something is wrong (the deployment doesn’t work right, networking can’t reach it, an artifact failed to deploy as-given), the capability owner comments on the issue and the two iterate back-and-forth in comments until it works.
  • If everything works, the capability owner says so on the issue.

7. Operator closes the issue

The operator closes the onboarding issue. The capability is now live on the platform.

8. Change-later loop (re-entry)

When the capability owner needs something different — more storage, a new external endpoint, a new component, a routine version bump, retirement of a component — they file a different issue type: modify my capability (distinct from the onboarding type, and the distinction is meaningful to the capability owner because the operator’s review scope differs).

Operator review on a modify issue covers only the delta, not a full re-evaluation. The platform contract is evergreen — the capability owner does not re-accept it on each modification. If the platform’s own contract changes, the operator is responsible for communicating the change ahead of time and migrating existing tenants; it is never sprung on the capability owner during a modify request.

The flow from issue → review → branches → artifact handoff → test → close repeats.

Flow Diagram

flowchart TD
    Start([Tech design complete & names this platform]) --> File[File 'onboard my capability' issue on GitHub]
    File --> Review[Operator reviews tech design:<br/>alignment to offerings + new-offering needs]
    Review --> Decision{Outcome}
    Decision -->|Approved as-is| Approved[Operator comments 'approved']
    Decision -->|New offering needed| Wait[Wait — no timeline guarantee]
    Decision -->|Declined| Decline[Issue closed with explanation —<br/>host elsewhere. Journey ends.]
    Wait --> Approved
    Approved --> Handoff[Capability owner hands off<br/>packaged artifacts on the issue]
    Handoff --> Provision[Wait while operator provisions]
    Provision --> Test[Operator asks capability owner to test]
    Test --> Works{Works?}
    Works -->|No| Iterate[Comment back-and-forth on the issue]
    Iterate --> Test
    Works -->|Yes| Close[Operator closes the issue —<br/>capability is live]
    Close --> Live((Hosted))
    Live -->|Needs change later| Modify[File 'modify my capability' issue]
    Modify --> ReviewDelta[Operator reviews delta only;<br/>contract is evergreen]
    ReviewDelta --> Decision
    Live -->|Operator initiates eviction| Eviction[Operator raises eviction issue<br/>with eviction date — see Edge Cases]

Success

When the onboarding issue closes, the capability owner walks away with:

  • Their capability is running on infrastructure they trust to be reproducible and operator-controlled.
  • The operator knows exactly what they signed up to host — needs were declared in the tech design and reviewed before approval.
  • A known, low-friction path back when needs change: file a modify my capability issue and run the same loop.
  • No surprises: there is no hidden ongoing obligation on their side beyond filing issues for changes.

For change-later iterations, success looks the same in miniature: the delta is reviewed, deployed, tested, and closed without re-litigating the entire capability.

Edge Cases & Failure Modes

  • Test step fails after provisioning. Capability owner sees their capability isn’t working post-deploy. Experience-level handling: the issue stays open and the two iterate via comments until the deployment works. The journey doesn’t reset to the start; it loops between test and operator action.
  • Operator goes silent / issue stalls. There is no response-time guarantee, so some waiting is normal. The signal that the silence has gone on too long is not a timer; it is the capability owner explicitly commenting that they are withdrawing the request and hosting elsewhere because they can no longer wait (or closing the issue saying so). Experience-level handling: that outcome is recorded on the issue itself and counts as a lost tenant against the parent capability’s Tenant adoption KPI. When the operator returns, the response is to acknowledge the loss in-thread and close the issue if it is still open — not to let the thread silently rot.
  • Handed-off artifact is broken or undeployable. Symmetric with the test-fails case: comment back-and-forth on the issue until a working artifact is in place.
  • New offering requested but no commitment. The capability owner’s request to add a new offering is accepted in principle but with no timeline. They wait indefinitely. If they cannot wait, they say so on the issue and host elsewhere; that is a tracked Tenant adoption KPI loss, not invisible churn.
  • Capability is evicted later. This is operator-initiated, not capability-owner-initiated, so it is not a step inside this journey. From the capability owner’s perspective: at some point the operator opens an eviction issue tagging them and naming the eviction date. The capability owner now knows they must move off the platform by that date. Eviction is governed by the parent capability’s Eviction threshold rule (the request would push routine maintenance sustainably above 2× the maintenance budget, or break reproducibility).
  • Operator-driven update because tenant components fell behind. Out of scope for this UX — see Out of Scope.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability’s Business Rules and Success Criteria — by name, so future readers can trace the lineage:

  • Operator-only operation. There is no self-service onboarding flow. The journey’s only engagement surface is a GitHub issue the capability owner files, which the operator personally services. No co-operator or delegated administration appears anywhere in the journey.
  • Tenants must accept the platform’s contract. Contract acceptance is implicit in the tech-design submission: the design declares resource needs, identity choice, packaging form, and availability expectations conforming to the platform’s contract. There is no explicit “I accept” gate — the design is the acceptance.
  • Identity service honors tenant credential-recovery rules. Whichever identity option is named in the capability owner’s tech design must be one the platform actually offers. The platform-provided identity service must be capable of honoring “lost credentials cannot be recovered.” If a capability needs that property and bring-your-own is chosen, it is the capability owner’s responsibility to honor it themselves.
  • Eviction threshold. The operator may raise eviction when routine accommodation would exceed 2× the operator-maintenance-budget KPI or break the reproducibility KPI. This UX surfaces eviction only as an external operator-initiated event affecting the capability owner — see Edge Cases.
  • The capability evolves with its tenants. The “new offering needed” branch in step 3 is the operationalization of this rule: the default response when a tenant needs something the platform doesn’t yet provide is to consider expanding the platform, not to refuse the tenant. But the operator is not obligated to grow the platform without bound. A request is declined once satisfying it would require a new ongoing offering the platform could not keep reproducible within the Reproducibility KPI or operate within the Operator maintenance budget KPI, even if the offering is technically buildable.
  • No specific availability or performance SLA. The journey does not include any negotiation of availability targets — tenants accept whatever the platform’s current implementation offers. A capability owner needing stronger guarantees should not have arrived here (their tech design would have picked a different host).
  • KPI: Tenant adoption. A capability owner who explicitly gives up on onboarding because the operator stayed silent too long is counted as a lost tenant, not waved away as “they changed their mind.” The signal is the GitHub issue itself: they say they are hosting elsewhere because waiting no longer works for them. The response is to leave that loss recorded in-thread and close the issue, so the KPI reflects what actually happened.
  • KPI: 1-hour reproducibility. Implication for this UX: provisioning during step 5 must be done by running the platform’s existing definitions, not by the operator hand-rolling per-tenant snowflake configuration. If onboarding requires bespoke manual config that cannot be captured as definitions, the platform itself has fallen out of compliance with this KPI — and the right response is to update the platform’s definitions, not to tolerate the snowflake.
  • KPI: 2-hr/week operator maintenance budget. Implication for this UX: change-later iterations (step 8) must remain quick enough that running them does not eat the operator’s weekly budget across all hosted tenants. A tenant whose modify requests routinely cost disproportionate operator time crosses into the eviction-threshold rule. The same KPI also bounds the admission of new offerings: “technically possible” is still a decline if the resulting routine platform scope would no longer fit inside this budget.

Out of Scope

  • Data migration of an existing tenant. Bringing data from a prior vendor or local install into the newly-provisioned tenant is a separate UX, not covered here. This UX is strictly about provisioning the capability on the platform.
  • Operator-initiated tenant updates (“your component has fallen behind”). When the operator notices a tenant’s components have aged out of platform support, the operator initiates the conversation — that is a different journey with the operator as the primary actor and the capability owner as the responder. It belongs in its own UX doc.
  • Running-tenant observability for the capability owner. This onboarding journey provisions observability as part of bringing the tenant live, but it does not cover the later “is my thing healthy right now?” monitoring journey itself. That ongoing experience belongs in Tenant-Facing Observability, not here.
  • Platform-side standup or rebuild. The operator standing up the platform from scratch is one of the parent capability’s other triggers, not this UX.
  • The capability owner’s tech-design phase. The decision to use this platform was made before this journey starts. How that decision is made (build vs. buy, host-here vs. host-elsewhere) is a tech-design concern, not a hosting-UX concern.

Open Questions

None at this time.

1.2 - Migrate Existing Data Into a Newly-Provisioned Tenant

A capability owner whose capability is already onboarded and running on the platform brings their existing end-user data over from the prior host by handing off a one-time migration process for the platform to run.

One-line definition: A capability owner whose capability is already onboarded and running on the platform brings their existing end-user data over from the prior host by handing off a one-time migration process for the platform to run.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the same capability owner described in Host a Capability. They are not a different role for this journey — they are mid-lifecycle, having already completed onboarding, and now coming back to deal with one specific concern: their pre-existing data.

  • Role: Capability owner. Their capability is already onboarded and live on the platform — compute, storage, network, identity, observability are all provisioned and running per the closed onboarding issue. The tenant is empty: no end-user data is in it yet.
  • Context they come from: Their capability has historical data living somewhere else — on a vendor (e.g. a hosted Plex provider), on a local install, on a previous self-hosted setup. End users are still on that old host. The capability owner is running the new tenant and the old host concurrently during this period; cutting end users over is their concern, deliberately separate from this UX.
  • What they care about here: Getting their existing data into the new tenant intact, so that when they decide to cut end users over, the new tenant is not a fresh-start regression. They want to do this with a defined, repeatable mechanism rather than ad-hoc — the operator’s 2-hr/week maintenance budget depends on migrations not becoming bespoke projects.

Goal

“I want my existing end-user data moved from my old host into my new tenant on the platform — using a migration process I wrote, run by the platform on my behalf — so that when I cut my users over (on my own schedule), the new tenant has everything they expect.”

This is a one-shot goal per migration: when the data has landed and the capability owner has validated it, the migration job is torn down. There is no ongoing sync.

Entry Point

The capability owner arrives at this experience after Host a Capability has fully completed for their tenant — onboarding issue closed, tenant live and empty. They have a parallel, still-running deployment of their capability on a prior host (vendor or self-managed), and they have written a migration process — a one-time job that reads from the prior host and writes into the new tenant via the new tenant’s normal interfaces — packaged in the form the platform accepts (same packaging as any other capability component).

What they have in hand:

  • A reference to the closed onboarding issue (so the destination tenant is unambiguous).
  • A packaged migration process artifact.
  • Credentials needed by the migration process to talk to the old host.
  • A rough sense of resource needs for the migration job (compute, network egress to the old host, expected runtime).

State of mind: pragmatic. They know this is bespoke to their capability — the platform is providing a runner for a process they wrote, not a magic mover.

Journey

1. Register old-host credentials with the platform secret management offering

Before filing the issue, the capability owner registers any credentials their migration process needs (to read from the old host) with the platform’s secret-management offering. The migration process artifact will reference these by name; the secrets themselves do not appear on the issue.

What they perceive: standard usage of the platform’s secret-management offering. This step exists outside the issue and is the capability owner’s responsibility to complete before handoff.
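
This UX does not specify the secret-management offering’s interface. As a purely hypothetical illustration, registration could look like handing a named secret to a platform CLI so the migration artifact can later reference it by name; the platform command and secret name below are invented.

# Hypothetical sketch: registering old-host credentials by name so the
# migration artifact can reference them without putting them on the issue.
# The "platform" CLI and the secret name are invented for illustration.
import getpass
import subprocess

SECRET_NAME = "media-migration/old-host-token"  # referenced by the artifact

token = getpass.getpass("Old-host API token: ")  # never echoed, never on the issue
subprocess.run(
    ["platform", "secrets", "set", SECRET_NAME],
    input=token.encode(),
    check=True,
)
print(f"Registered {SECRET_NAME}; reference it by name on the migration issue.")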

2. File a “migrate my data” issue on GitHub

The capability owner opens an issue against the infra repo using the migrate my data issue type — distinct from onboard my capability and modify my capability because the operator’s review scope and the lifecycle (one-shot, torn down on completion) differ. The issue contains:

  • A link to the closed onboarding issue (identifying the destination tenant).
  • A description of the source (old host, format, rough data volume).
  • The packaged migration process artifact (or a link to it).
  • A declaration of the migration job’s resource needs (compute, storage, network reachability — including egress to the old host), including any temporary migration-only spikes beyond the tenant’s steady-state footprint, and the names of the secrets it expects to read from the platform’s secret-management offering.
  • A declaration of the migration process’s re-run contract: whether it is safe to run against an already-populated destination tenant, or whether the destination must be wiped / empty before each run.

What they perceive: the issue is filed. They wait, async, just like onboarding.

3. Operator review on the issue

The operator reviews the migration request with a deliberately narrow scope — the delta the platform is being asked to support for this one-shot job. Specifically, the operator confirms with the capability owner:

  • Resources: the migration’s peak temporary footprint — the destination tenant’s steady-state compute and storage footprint plus any migration-only spike declared on the issue — is no more than 2× that steady-state footprint, and it fits within the platform’s currently available migration-process capacity. If either compute or storage exceeds the 2× threshold, the operator rejects the request as written and asks the capability owner to split the migration into smaller runs, reduce the spike, or resize the tenant first via modify my capability. (The 2× check is sketched after this step.)
  • Network: the migration job has the egress reachability it needs to talk to the old host, and ingress to the destination tenant’s storage interfaces.
  • Credentials: the named secrets are registered and the migration process is wired to read them correctly.
  • Re-run contract: the issue is explicit about whether retries or later top-up migrations can run against existing data, or whether each run requires an empty / wiped destination.

What the capability owner perceives: clarifying questions appear as comments on the issue. They answer in-thread. There is no review of the migration process’s internal logic — that is the capability owner’s domain. The operator is reviewing what the platform must provide to run it, not whether it does the right thing.
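
The resources check above is plain arithmetic, so it can be sketched directly: peak temporary footprint is steady state plus the declared spike, bounded by 2× steady state per dimension. Only the 2× rule comes from this UX; the names are illustrative, and the separate check against currently available migration-process capacity is omitted.

# Sketch of the step-3 resources check. Only the 2x rule comes from this UX;
# field names are illustrative, and the capacity check is omitted.
from dataclasses import dataclass

@dataclass
class Footprint:
    compute_cpus: float
    storage_gib: float

def within_migration_bounds(steady: Footprint, spike: Footprint) -> bool:
    """Peak (steady + declared spike) must not exceed 2x steady state on
    either dimension; exceeding either one rejects the request as written."""
    peak_cpu = steady.compute_cpus + spike.compute_cpus
    peak_storage = steady.storage_gib + spike.storage_gib
    return (peak_cpu <= 2 * steady.compute_cpus
            and peak_storage <= 2 * steady.storage_gib)

steady = Footprint(compute_cpus=2, storage_gib=500)
spike = Footprint(compute_cpus=1, storage_gib=600)   # storage peak 1100 > 1000
assert not within_migration_bounds(steady, spike)    # split, shrink, or resize first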

4. Operator onboards and starts the migration job

Once the review converges, the operator wires up the one-time migration job using the platform’s migration-process offering and starts it. The capability owner does nothing during this step — same as the provisioning step in host-a-capability. They simply wait for the migration job to be running.

Concurrent migrations across different tenants are supported. The capability owner should not expect exclusive use of the migration-process offering; if other tenants are migrating at the same time, their own journey still looks the same.

5. Capability owner observes the running job

While the migration job runs, the capability owner watches it through the platform’s observability — the same observability surface every other platform offering exposes to its tenant. They can see whether the job is making progress, whether it has errored, and whatever signals their migration process emits.

What they perceive: visibility into their own job, on their own time. There is no operator handholding during the run. Long migrations (hours, days) are normal — there is no SLA, just observability.

6. Operator reports the job’s terminal state on the issue

When the migration job finishes — successfully or with an error — the operator reports the terminal state on the issue and asks the capability owner to validate.

7. Resolution — one of two branches

7a. Success — capability owner validates data presence. The capability owner verifies the data landed correctly, per their capability’s own definition of correct (open the app, check counts, spot-check records — their judgment, not the platform’s). When they’re satisfied, they say so on the issue.

7b. Failure — capability owner provides the plan for next steps. If the migration job errored, or if validation reveals the data is incomplete or wrong, the capability owner is responsible for deciding what happens next — because this is their data and their migration process. Possible plans they may propose on the issue:

  • Wipe the destination tenant’s storage and re-run with a fixed migration process (re-handoff a new artifact).
  • Resume from where it failed (only viable if their migration process supports this).
  • Accept the partial state and run a follow-up migration for the remainder.
  • Abandon this migration attempt entirely.

The platform does not prescribe a recovery model. The operator executes whatever next-step plan the capability owner provides, looping back through the appropriate earlier step (re-handoff → re-review → re-run, or just re-run).

8. Operator tears down the migration job and closes the issue

Once the capability owner confirms validation success, the operator tears down the one-time migration job (it is not retained — re-running later means filing a fresh migrate my data issue) and closes the issue.

The new tenant now holds the migrated data. Cutting end users over from the old host to the new tenant is the capability owner’s separate concern, outside this UX.

Flow Diagram

flowchart TD
    Start([Onboarding complete; tenant live & empty]) --> Secrets[Register old-host credentials with<br/>platform secret-management offering]
    Secrets --> File[File 'migrate my data' issue<br/>linking the closed onboarding issue]
    File --> Review[Operator confirms resources,<br/>network, and credentials with CO]
    Review --> Run[Operator onboards and starts<br/>the one-time migration job]
    Run --> Observe[CO observes job via platform observability]
    Observe --> Terminal[Operator reports terminal state on issue]
    Terminal --> Validate{CO validates data?}
    Validate -->|Yes — data is present and correct| Teardown[Operator tears down migration job<br/>and closes the issue]
    Validate -->|No — failure or incomplete data| Plan[CO provides plan for next steps]
    Plan --> Branch{Plan}
    Branch -->|Re-handoff fixed artifact| Review
    Branch -->|Re-run as-is| Run
    Branch -->|Abandon| Teardown
    Teardown --> Done((Data migrated;<br/>cutover is CO's concern))

Success

When the issue closes, the capability owner walks away with:

  • Their existing end-user data sitting inside the new tenant, validated by them against their own capability’s definition of correctness.
  • A clean platform state: the one-time migration job is torn down, leaving only the tenant and its data behind.
  • Confidence that when they decide to cut their end users over, the new tenant will not look like a regression.
  • A known, repeatable path if they ever need to migrate again (file another migrate my data issue and declare the process’s re-run contract again).

Edge Cases & Failure Modes

  • Migration job errors out partway, leaving partial data in the tenant. Experience-level handling: the operator reports the error on the issue; the capability owner provides the plan (wipe-and-retry, resume, accept partial, abandon). The platform does not auto-clean — the data belongs to the capability owner and they decide what to do with it.
  • Validation reveals data is wrong even though the job reported success. Same as above — capability owner provides the plan. This is treated identically to a job-level failure from the journey’s perspective.
  • Migration takes far longer than the capability owner expected. Experience-level handling: there is no SLA, and the capability owner can see what is happening through the platform’s observability. They can decide whether to let it run or to file a plan to abort and re-approach.
  • Migration job needs more resources than declared (storage too small in the tenant, more compute, etc.). Experience-level handling: temporary migration-only spikes are allowed only if declared up front and approved during review, and approval is bounded by the step-3 rule that the migration’s peak temporary footprint can be at most 2× the destination tenant’s steady-state compute and storage footprint. If the real job exceeds what was declared, the operator surfaces this on the issue; the capability owner may need to file a separate modify my capability issue against the destination tenant first (e.g., to enlarge storage), split the migration into smaller runs, or re-file the migration with a corrected declaration. The two issues are explicitly distinct because they touch different review scopes.
  • Old host becomes unavailable mid-migration (vendor outage, account suspended, etc.). Experience-level handling: the migration job will fail; same as any other failure — capability owner provides the plan. The platform makes no attempt to resume on the capability owner’s behalf.
  • Capability owner registered the wrong secrets, or the migration process can’t authenticate to the old host. Same as any other failure mode — surfaces during the run, capability owner adjusts and the issue iterates.
  • Another tenant is migrating at the same time. Experience-level handling: no special branch. Concurrent migrations are part of the offering; the capability owner still files the same issue, waits through the same review, and observes only their own job.
  • Capability owner wants to re-run the migration months later (e.g., to top up data accumulated on the old host since the first migration). The experience is still: file a fresh migrate my data issue. The previous migration job is gone; the new one is a separate one-shot, and the capability owner must explicitly declare whether the process is safe against existing data or whether the destination must be wiped first.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability’s Business Rules and Success Criteria — by name, so future readers can trace the lineage:

  • Operator-only operation. As with host-a-capability, the only engagement surface is a GitHub issue the capability owner files; the operator personally services it. The capability owner has no direct access to start, stop, or observe migration jobs except through the platform’s observability surface, which is itself an offering the operator runs.
  • Tenants must accept the platform’s contract. The migration process is packaged in the same form the platform accepts for any tenant component — the contract does not relax for migration. A migration process that cannot be packaged this way cannot be run by the platform. Declaring the process’s resource needs and re-run contract up front is part of that contract.
  • The capability evolves with its tenants. The existence of a migration-process offering — a platform-provided one-shot-job runner with the platform’s standard observability — is itself an instance of this rule. The platform extends to support a need (migrating in pre-existing data) that tenants have, rather than refusing tenants whose data already exists somewhere.
  • Identity service honors tenant credential-recovery rules. Indirectly relevant: if the migration includes user-account or credential references from the old host, the capability owner’s migration process must produce data that respects whatever identity properties their capability requires (e.g. for self-hosted personal media storage, the “lost credentials cannot be recovered” property must still hold post-migration). This is the capability owner’s responsibility, embedded in their migration process — the platform does not enforce it.
  • KPI: 1-hour reproducibility. The migration offering itself must be reproducible from definitions, like every other offering. A specific migration job is per-tenant and not part of the platform’s reproducible state — it is a one-shot artifact that ceases to exist after teardown.
  • KPI: 2-hr/week operator maintenance budget. A migration that demands disproportionate operator time across the issue’s review-run-iterate loop pressures this budget. Repeated failed migrations from the same capability owner — or migrations that require the operator to deeply understand the capability owner’s data to make progress — would cross into the eviction-threshold rule’s territory.
  • Eviction threshold. Sustained migration friction is a possible (if unusual) path into eviction. The platform offers to run a migration process; it does not offer to write one, debug it, or shepherd a problem capability through repeated attempts.
  • No specific availability or performance SLA. No SLA on migration completion either. Migrations take however long they take; the capability owner sees progress through observability and decides what to do about long-running jobs. Supporting concurrent migrations does not imply exclusive capacity or a completion-time guarantee for any one tenant’s job.
  • Operator succession. The migration job’s lifespan is bounded — it exists only between steps 4 and 8 of this journey. If the operator becomes unavailable mid-migration, the successor’s takeover responsibility is to keep the platform running, not to finish in-flight migration jobs. A mid-migration tenant simply has a stalled job; the capability owner provides a plan when a successor (or recovered operator) is back.

Out of Scope

  • Cutting end users over from the old host to the new tenant. This is a capability-owner concern, deliberately outside the platform’s view. The capability owner runs old + new concurrently and cuts over on their own schedule using whatever mechanisms their capability provides for end users.
  • Ongoing sync or replication between the old host and the new tenant. This UX is one-shot. A capability that needs continuous sync is a different capability (and likely a different UX, if it ever exists).
  • Writing or debugging the capability owner’s migration process. The platform runs what is handed to it. Logic correctness, source-format handling, schema translation, and idempotency belong to the capability owner.
  • Helping the capability owner pull data out of the old host. The migration process must speak to the old host on its own. The platform does not maintain adapters or know about specific vendors.
  • Validation of data correctness. Per Move Off the Platform After Eviction, the platform provides bytes faithfully but does not validate semantic correctness. The same applies in reverse here — the capability owner is the only judge of “did the data land correctly.”
  • Rollback to the old host. The capability owner is already running the old host concurrently; “rollback” simply means they don’t cut over. There is no platform-side rollback because there was nothing to roll back from — end users were never on the new tenant during the migration window.

Open Questions

None at this time.

1.3 - Move Off the Platform After Eviction

A capability owner whose capability has been evicted gets their data out cleanly and walks away with no obligations and no tenant-accessible copy left on the platform once the retention window closes.

One-line definition: A capability owner whose capability has been evicted gets their data out cleanly and walks away with no obligations and no tenant-accessible copy left on the platform once the retention window closes.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is a capability owner whose capability has been evicted — a Primary actor (initiator) from the parent capability’s Stakeholders, on the way out. As elsewhere in this capability’s UX docs, the role is treated as separate from the operator’s even though today both hats are worn by the same person.

  • Role: Capability owner. The party who originally onboarded a capability onto the platform via host-a-capability, has been hosting it for some period, and is now being removed.
  • Context they come from: The parting is amicable. Eviction was triggered by a divergence the platform legitimately cannot meet — specialized hardware, regulatory constraints, an availability target stronger than the platform offers — not by a missed deadline in the operator-initiated-tenant-update flow. Negotiation over the eviction date has already happened upstream, before this UX begins. The capability owner accepts that they are leaving and has agreed to the date.
  • What they care about here: A clean exit. By the eviction date their capability is fully off the platform, their data is in their hands in a portable form they can verify, and nothing remains available for them to retrieve from the platform after the retention window ends. They are not asking the platform to help them figure out where to run next — that is their problem to solve.

Goal

“By the time the platform is finished with my capability, I have my data, I know it’s complete, and I have nothing left to chase down here.”

Entry Point

The capability owner arrives at this experience because the operator has filed an eviction issue against the infra repo tagging them. The issue contains exactly:

  • The eviction date (already negotiated upstream — not up for renegotiation in this journey).
  • The reason for eviction (so it is on the record and the parting stays amicable).
  • A link to the platform’s export tooling, with documentation on how to use it and what the export shape looks like for their tenant.

That is all the issue carries. The capability owner’s state of mind is “the date is set, I know where the export tool is, I have a window of time to get my data out and walk away cleanly.”

Journey

The journey runs in three phases keyed off the eviction date: a pre-eviction window where the tenant is still live, the eviction date itself when compute and network resources go away, and a 30-day grace window where data is held in an export-only, read-only state before tenant data is permanently deleted across all tiers at day 30.

Phase A — Before the eviction date (tenant still live)

1. Read the eviction issue and the export documentation

The capability owner reads the issue, follows the link to the export tooling, and reads its documentation. They learn what the export will produce — file layout, formats, what is included, what is not — and roughly how long an export of their dataset will take to run. No back-and-forth with the operator is expected here; the issue and the docs are meant to be self-sufficient.

2. Notify their own end users

The capability owner tells their end users that the capability is going away on the eviction date — separately from the platform, on whatever channel they use with their users. The platform plays no role here; end users of a tenant capability are not visible to the platform and the platform does not communicate with them. (See No direct end-user access to the platform in Constraints.)

3. Run the export and verify it themselves

The capability owner kicks off the export using the platform’s export tool. What they perceive is an archive of their tenant’s data, produced for them to download then and there, plus a checksum/hash and total size in bytes that the platform produces alongside it. Validation that the export is complete and correct is the capability owner’s responsibility, not the platform’s. Only the capability owner knows their data well enough to say “yes, this is all of it and it is intact.” The platform offers checksum/hash and total size as the ceiling of what it can verify on the capability owner’s behalf — anything beyond that (record counts, schema integrity, business invariants) is theirs.
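
Both halves of the platform’s verification ceiling, the checksum/hash and the byte count, can be checked mechanically on the downloaded archive; everything past that point is the capability owner’s own validation. A minimal sketch, assuming SHA-256 (the specific hash algorithm is not specified by this UX).

# Sketch: verify a downloaded export against the platform-provided checksum
# and size. SHA-256 is an assumption; this UX only says "checksum/hash and
# total size in bytes".
import hashlib
from pathlib import Path

def verify_export(archive: Path, expected_sha256: str, expected_bytes: int) -> None:
    actual_bytes = archive.stat().st_size
    if actual_bytes != expected_bytes:
        raise ValueError(f"size mismatch: {actual_bytes} != {expected_bytes}")
    digest = hashlib.sha256()
    with archive.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch: report it on the eviction issue")
    # Beyond this point (record counts, schema integrity, business invariants)
    # verification is the capability owner's responsibility, not the platform's.

# Example usage; the hash and size come from the platform, alongside the archive:
# verify_export(Path("tenant-export.tar.zst"),
#               expected_sha256="<hash from the platform>",
#               expected_bytes=123_456_789)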

4. (Optional) Run the export iteratively

Because end users may still be writing data while the tenant is live, an export taken in Phase A is not necessarily the final export. The capability owner may run multiple exports across Phase A — one early to validate that the tooling produces something usable, another later to capture more recent writes. Whether they do this is their call; the platform supports it because the export tool simply runs whenever invoked. Each run is ephemeral: if they want to keep an export, they download it when it is produced. The platform does not keep a history of prior exports around for them.

Phase B — The eviction date

5. Compute and network resources are torn down; the tenant stops serving

On the eviction date the operator deprovisions the tenant’s compute, network, and other live resources. From the capability owner’s seat: the tenant is no longer reachable by their end users. The data persists, but only in an export-only, read-only state — no further writes can occur, by anyone. A comment is posted on the eviction issue confirming the cutover and the start of the 30-day retention window.

What the capability owner perceives: the issue gets a status comment, and they now know their dataset is frozen. If they had not finished extracting data before this point, they still have 30 days — but the dataset they extract from now on is the final one.

Phase C — Post-eviction (30-day retention window)

6. Run the export of record (if not already taken)

In Phase C the export tool still works, but now against a stable, read-only snapshot. For capability owners with more data than they could extract during Phase A, or for those who deliberately deferred to avoid racing live writes, this is when the definitive export is pulled. As in Phase A, the generated export artifact is ephemeral: they re-run the same export tool, get back an archive plus checksum/hash and size, and must download it when it is produced rather than assuming the platform will keep that generated file around for later pickup. If they miss that download, they can run the export tool again at any point within the 30-day retention window and validate the newly generated archive the same way they validated in Phase A.

For capability owners who already pulled what they needed in Phase A, Phase C is a safety net — “I forgot a thing, let me grab it” — rather than the main event.

7. Walk away

Once the capability owner is satisfied they have everything, they comment on the issue indicating they are done. The operator closes the issue. After 30 days from the eviction date, the platform permanently deletes the tenant’s data — both the tenant-accessible copy and any platform-held backup-tier copies — regardless of whether the capability owner ever closed the loop. No residual copy survives day 30 in any tier the platform controls. There is no “are you sure?” — the 30-day clock is hard.

Flow Diagram

flowchart TD
    Start([Eviction issue filed by operator<br/>date already negotiated]) --> Read[Read issue + export tooling docs]
    Read --> Notify[Notify own end users<br/>off-platform]
    Notify --> ExportLive[Run export against live tenant<br/>verify checksum / size / contents]
    ExportLive --> Iter{More writes expected<br/>before eviction date?}
    Iter -->|Yes| ExportLive
    Iter -->|No| Wait[Wait for eviction date]
    Wait --> Cutover[Eviction date:<br/>compute/network torn down<br/>data → read-only<br/>comment posted on issue]
    Cutover --> PhaseC{Need more data<br/>from final snapshot?}
    PhaseC -->|Yes| ExportFinal[Run export against frozen snapshot<br/>download now + verify]
    PhaseC -->|No, already complete| Done
    ExportFinal --> Done[Comment 'done' on issue;<br/>operator closes it]
    Done --> RetentionEnds([30 days post-eviction:<br/>all tenant data permanently deleted<br/>including backup-tier copies])

Success

When the journey ends cleanly, the capability owner walks away with:

  • A verified, complete archive of their tenant’s data, sized and checksummed by the platform, validated by them.
  • A clear paper trail on the eviction issue showing the date, the reason, and confirmation that they pulled what they needed.
  • Nothing left to chase down on the platform. After the 30-day window the platform permanently deletes the tenant’s data across every tier it controls — no tenant-accessible copy and no deeper backup-tier copy survives.
  • An amicable ending. The operator filed the issue, the platform held the data the agreed amount of time, and the capability owner left under their own power. The relationship is intact for whatever comes next.

Edge Cases & Failure Modes

  • Capability owner asks for more time after the eviction date. Hard wall. The negotiation over the eviction date happened upstream of this journey; once that date is set, it is the date. The 30-day post-eviction retention is the only post-date slack and it is fixed.
  • Export takes longer than 30 days to actually run on a very large dataset. Same hard wall — the capability owner had Phase A plus 30 days of Phase C to extract; if that is not enough, they had advance warning during eviction-date negotiation and should have raised it then. The platform does not extend the retention window for slow extracts.
  • Export comes back wrong (checksum mismatch, missing files, corruption visible to the capability owner). The capability owner reports the problem on the eviction issue so that thread remains the coordination record. This is the one exception to the 30-day hard wall: if the failure is shown to be in the platform’s export tooling or its data hosting, the operator pauses that tenant’s retention-window countdown for removal of tenant-accessible data until the platform-side issue is resolved and a clean export has been produced. The capability owner can keep exporting during that pause (see the sketch after this list). No separate restoration SLA is promised in this UX; the issue stays open until the capability owner can pull a clean export. Failures rooted in the capability owner’s own validation steps do not pause the countdown.
  • Export tooling does not exist for this tenant’s data shape at the time of eviction. Cannot happen by design — export tooling is a core platform feature, present for every kind of data the platform hosts. If a hole is discovered, that is itself a platform bug, handled the same way as the previous bullet (eviction issue remains open, that tenant’s retention-window countdown for removal of tenant-accessible data is paused).
  • Capability owner ignores the issue entirely and never extracts anything. No special handling. The 30-day clock runs, tenant-accessible data is removed, the issue is closed by the operator. The capability owner may have made themselves whole through other means (their own backups, accepting the loss); the platform does not chase them.
  • End users keep hitting the tenant after the eviction date. They get whatever connection failure the underlying infra produces. The capability owner is responsible for having warned their end users; the platform does not present a “this tenant has been retired” page or otherwise communicate with end users — end users belong to the capability, and from the platform’s seat, the capability is the end user.
  • Capability owner wants to come back later (re-onboard the same capability after the divergence is resolved). That is a new host-a-capability journey, not a continuation of this one. It is not blocked, but nothing about this UX preserves state to make it easier.
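
The hard 30-day wall and its single pause exception (the export-failure bullet above) are precise enough to model. A minimal sketch, assuming calendar-day accounting and a pause that accrues only for confirmed platform-side faults; none of these names come from the platform itself.

    # Sketch of the post-eviction retention clock. The only thing that moves
    # the deletion date is confirmed platform-side pause time; owner-side
    # validation problems and requests for more time do not.
    from dataclasses import dataclass, field
    from datetime import date, timedelta

    RETENTION = timedelta(days=30)

    @dataclass
    class RetentionClock:
        eviction_date: date
        platform_fault_pauses: list = field(default_factory=list)

        def record_platform_fault_pause(self, days_paused: int) -> None:
            # Only faults shown to be in export tooling or data hosting qualify.
            self.platform_fault_pauses.append(timedelta(days=days_paused))

        def deletion_date(self) -> date:
            # Hard wall: eviction date + 30 days, extended only by pause time.
            return self.eviction_date + RETENTION + sum(self.platform_fault_pauses, timedelta())

    clock = RetentionClock(eviction_date=date(2025, 3, 1))
    assert clock.deletion_date() == date(2025, 3, 31)
    clock.record_platform_fault_pause(days_paused=4)   # export bug took 4 days to fix
    assert clock.deletion_date() == date(2025, 4, 4)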

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name:

  • Eviction is allowed when needs and capabilities diverge. This UX is the operationalization of the amicable form of that rule: the divergence is real (specialized hardware, regulatory constraint, availability target the platform cannot meet) and the parting is mutual. The fall-behind variant of eviction is handled separately via operator-initiated-tenant-update.
  • No direct end-user access to the platform. End users of the tenant capability are not visible to the platform and are not communicated with by the platform during eviction. Notification of end users is purely the capability owner’s responsibility.
  • Operator succession — on-demand exportable archives. The same export mechanism that the parent capability promises for operator-succession scenarios is what powers this journey. Export tooling is therefore not bespoke to eviction; it is a core platform feature that exists at all times for every tenant. This UX simply consumes it.
  • Operator-only operation. The capability owner has no administrative access during this journey. Everything they do — running exports, leaving comments — is done through the same surfaces an end-state non-operator has. The operator is the one who deprovisions resources and closes the issue.
  • Affected parties (end users of the tenant capability). End users feel this journey indirectly: their access to the capability ends on the eviction date. The platform does not surface this to them — the capability owner does, separately, on their own channels.
  • KPI: 2-hr/week operator maintenance budget. Implication: this journey must not require the operator to do bespoke per-tenant work. The export tool is generic and runs on demand; the operator’s only routine touchpoints are filing the issue, posting the cutover comment, and closing the issue at the end. A tenant whose eviction would require custom export work is itself a sign the platform’s export tooling has a gap that needs fixing — handled as a platform bug, not as an operator-effort overrun.
  • KPI: 1-hour reproducibility. Implication: the data formats produced by the export tool, and the way they relate to the platform’s definitions, should be expressible as part of the platform itself, not as snowflake per-tenant logic. (Standing the platform up should not require remembering “and here is the special export path for tenant X.”)

Out of Scope

  • The eviction-decision journey itself. Why the operator decided to evict, and the conversation that established the eviction date, happens before this UX. By the time this UX begins, the issue is filed, the date is set, and both parties have agreed.
  • The fall-behind eviction path. Eviction triggered by a missed extended date in operator-initiated-tenant-update is a different shape (less amicable, possibly compressed timelines). It enters a separate journey not covered here, even though the mechanics of getting data out via the export tool may overlap.
  • Helping the capability owner figure out where to run next. The platform does not point at alternative hosts, port the capability’s runtime, or assist with migration. The export tool produces data; the rest is the capability owner’s problem.
  • Application/runtime/configuration migration tooling. Only data export is provided. Capability code, container images, configuration, secrets management at the destination — none of this is the platform’s concern.
  • Re-onboarding the same capability later. If the capability owner wants to come back, that is a fresh host-a-capability journey with no special path inherited from having previously been here.
  • Operator’s side of this journey. This UX is written from the capability owner’s seat. The operator’s experience (filing the issue, deprovisioning on the date, posting the cutover comment, closing the issue, watching the 30-day clock) is captured here as a responder, not as a separate document.

Open Questions

None at this time.

1.4 - Operator-Initiated Tenant Update

The operator notices a hosted tenant’s components have fallen behind what the platform supports, opens the conversation, and works with the capability owner to bring them current — without evicting.

One-line definition: The operator notices a hosted tenant’s components have fallen behind what the platform supports, opens the conversation, and works with the capability owner to bring them current — without evicting.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the operator — the Owner / Accountable party from the parent capability’s Stakeholders. The capability owner is a responder in this journey, not the initiator. As with host-a-capability, this UX is written as if the operator and the capability owner were separate people: the role boundary is treated as real, even though today both hats are worn by the same person.

  • Role: Operator. Sole administrator of the platform; the only person who can see across tenants and notice that one of them has aged out of what the current platform offers.
  • Context they come from: They have just learned that something in the platform itself must change — a cloud provider is sunsetting a service the platform depends on; a CVE has landed against a platform component; a runtime version the platform offers is being retired upstream. The change forces an update on every tenant still using the affected component.
  • What they care about here: Getting affected tenants migrated with their capability owners, on a timeline driven by the real external pressure, without burning down the “we work with you, we don’t evict for fall-behind” promise — and without letting the situation drag past the point where the platform itself becomes unsafe or unsupportable.

Goal

“I want every tenant still on the falling-behind component to be moved onto what the platform now supports, on a timeline that fits the external pressure that forced this — and I want to do it by working with each capability owner rather than evicting them.”

Entry Point

The operator arrives at this experience because of a platform-level dependency event that is not under their control:

  • A cloud provider has announced a sunset date for a service the platform uses.
  • A CVE has been disclosed against a platform component, so the platform itself must update — and any tenant pinned to the affected component must update with it.
  • An upstream runtime, library, or base image the platform offers is reaching end-of-life.

The deadline is therefore inherited from the external event, not invented by the operator. The operator’s state of mind is “I have to do this anyway; how many tenants am I dragging through it with me, and what do they each need to ship?”

What they have in hand: knowledge of which platform offering is changing, by when, and which currently-hosted tenants are using it.

There is no formal tenant-facing pending-update view ahead of this moment. If the platform ever adds an earlier deprecation or pending-update signal for capability owners, that signal would live in Tenant-Facing Observability rather than in this operator-side journey. The operator-filed issue remains the first official signal that this journey has begun.

Journey

1. File a “platform update required” issue per affected tenant

For each affected tenant, the operator opens an issue against the infra repo using the platform update required issue type. This is a distinct issue type from onboard my capability and modify my capability — the distinct type is the signal to the capability owner that this is not optional cleanup, it is a required update with a real deadline behind it.

If the same tenant is hit by two unrelated forcing events at roughly the same time, the operator opens separate platform update required issues — one per event — and cross-links them if the remediation overlaps. The forcing event, reason, and deadline stay distinct even when the same code change may help satisfy more than one thread.

The issue tags the capability owner and contains:

  • What is falling behind (the specific platform offering / component / version).
  • What it is being replaced by, or what the new platform-supported version is.
  • The shape of the update being asked for — repackage against a new runtime, swap a dependency, rebuild against a new base, etc.
  • The deadline, with the external reason for it (sunset date, CVE remediation window, EOL date).

What the operator perceives at this point: the issue is filed and the capability owner has been notified. They wait for acknowledgment.
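
The required contents of a platform update required issue are fixed enough to treat as a record. A minimal sketch of that record and a plain-text rendering for the issue body; the field names are illustrative, not a platform-defined schema.

    # Sketch of the fields a 'platform update required' issue carries. One
    # instance per forcing event, even for the same tenant, so each deadline
    # and external reason stays legible. All names are hypothetical.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class PlatformUpdateRequired:
        tenant: str
        falling_behind: str     # the specific offering / component / version
        replacement: str        # what the platform now supports instead
        update_shape: str       # repackage, swap a dependency, rebuild, ...
        deadline: date
        external_reason: str    # sunset date, CVE remediation window, EOL date

        def issue_body(self) -> str:
            return (
                f"Falling behind: {self.falling_behind}\n"
                f"Replaced by: {self.replacement}\n"
                f"Update shape: {self.update_shape}\n"
                f"Deadline: {self.deadline.isoformat()} ({self.external_reason})\n"
            )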

2. Capability owner acknowledges and plans

The capability owner reads the issue, asks any clarifying questions in-thread, and indicates whether the requested shape of update is feasible within the deadline. The operator answers questions as they come.

If the capability owner needs more time than the inherited deadline allows, the conversation moves into step 4 (slack negotiation) before any artifacts are handed off. Otherwise it proceeds to step 3.

3. Run the modify inner-loop

From here the mechanics are identical to the modify my capability journey:

  • The capability owner hands off updated packaged artifacts on the issue.
  • The operator re-provisions against the platform’s new offering.
  • The operator asks the capability owner to test.
  • They iterate in comments until it works.
  • The operator closes the issue.

The inner loop is the same surface; only the initiator and the issue type differ. End-user impact during the test/redeploy step is the same as a routine modify — typically a brief outage during cutover, nothing more.
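
For readers who prefer the inner loop as a structure rather than prose, here is a minimal sketch of its states and transitions, read off the list above; it is not platform tooling.

    # Sketch of the modify inner-loop as a state machine. A failed test loops
    # back to a fresh artifact handoff rather than advancing.
    from enum import Enum, auto

    class InnerLoop(Enum):
        ARTIFACTS_HANDED_OFF = auto()   # capability owner posts updated artifacts
        PROVISIONED = auto()            # operator re-provisions on the new offering
        TESTING = auto()                # operator asks the capability owner to test
        CLOSED = auto()                 # it works; operator closes the issue

    def next_state(state: InnerLoop, test_passed: bool = False) -> InnerLoop:
        if state is InnerLoop.ARTIFACTS_HANDED_OFF:
            return InnerLoop.PROVISIONED
        if state is InnerLoop.PROVISIONED:
            return InnerLoop.TESTING
        if state is InnerLoop.TESTING:
            return InnerLoop.CLOSED if test_passed else InnerLoop.ARTIFACTS_HANDED_OFF
        return state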

4. Negotiate slack against the inherited deadline

If the capability owner cannot ship within the inherited deadline, the operator and capability owner first determine whether the external pressure leaves any safe slack at all. If it does, they negotiate an extended delivery date in the issue thread. The extension is not unbounded — the operator sets it based on how much slack the external pressure actually allows (a CVE with a known exploit allows much less slack than a vendor sunset announced 18 months out).

If the inherited deadline leaves no safe slack, the operator declines the extension and the original inherited deadline remains the operative date. The capability owner still gets the chance to ship against that date; they just do not get more time.

Whether extended or not, the date the operator and capability owner are now working against is recorded clearly on the issue. The journey then resumes at step 3.
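
The slack rule lends itself to a small decision sketch. The safety margins below are invented purely for illustration; the real judgment about how much slack a given pressure allows stays with the operator.

    # Sketch of the extension decision: an extension is grantable only if the
    # external pressure leaves safe slack before its own hard date. Margins
    # per pressure type are invented illustrations, not policy.
    from datetime import date, timedelta
    from typing import Optional

    SAFETY_MARGIN = {
        "cve_known_exploit": timedelta(days=2),     # almost no slack
        "cve_no_known_exploit": timedelta(days=14),
        "vendor_sunset": timedelta(days=60),        # announced far ahead
    }

    def grant_extension(inherited_deadline: date, requested: date,
                        external_hard_date: date, pressure: str) -> Optional[date]:
        ceiling = external_hard_date - SAFETY_MARGIN[pressure]
        if ceiling <= inherited_deadline:
            return None                 # no safe slack: original deadline stands
        return min(requested, ceiling)  # extend, but never past the safe ceiling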

5. Tip into eviction (after the last workable date is missed)

If the capability owner misses the operative delivery date — either the original inherited deadline when no extension was possible, or an agreed extended delivery date when one was — the operator opens a separate eviction issue (per the parent capability’s eviction journey — to be defined as its own UX) that links back to this issue for context. The eviction issue carries its own eviction date.

This update issue is then closed as superseded by the eviction. The journey ends here from the operator’s side; the capability owner’s experience continues in the Capability owner moves off the platform after eviction UX.

The decision to evict is governed by the parent capability’s Eviction threshold rule: continuing to accommodate this tenant would either push routine maintenance sustainably above 2× the operator-maintenance-budget KPI, or break the reproducibility KPI by leaving the platform stuck on a snowflake configuration to keep one tenant alive. A missed operative delivery date is the operational signal that the threshold has been crossed; it is not eviction-by-policy for being late.

Flow Diagram

flowchart TD
    Start([Platform dependency event:<br/>vendor sunset / CVE / EOL]) --> File[Operator files 'platform update required'<br/>issue per affected tenant]
    File --> Ack[Capability owner acknowledges<br/>and asks clarifying questions]
    Ack --> Feasible{Feasible by<br/>inherited deadline?}
    Feasible -->|Yes| Modify[Run modify inner-loop:<br/>artifacts → provision → test → close]
    Feasible -->|No| Slack{Safe slack for<br/>extension?}
    Slack -->|Yes| Extend[Negotiate extended delivery date<br/>recorded on the issue]
    Slack -->|No| NoExtend[No extension available;<br/>original deadline stands]
    Extend --> Modify
    NoExtend --> Modify
    Modify --> MetDeadline{Delivered by<br/>operative date?}
    MetDeadline -->|Yes| Done([Issue closed — tenant current])
    MetDeadline -->|No| Evict[Operator opens separate eviction issue,<br/>links back to this one]
    Evict --> Closed([This issue closed —<br/>superseded by eviction])

Success

When the issue closes cleanly, the operator walks away with:

  • Every affected tenant is now running on what the platform currently supports — no stragglers pinned to the retired offering.
  • The “we work with you, we don’t evict for fall-behind” promise was honored: each capability owner was given the chance to ship the update, with extension where the inherited deadline didn’t fit and safe slack existed.
  • The platform is free to actually retire the old offering, since there are no tenants left on it. The external pressure that started the whole journey can now be fully addressed.
  • A trail on each issue showing what was asked for, when, and what was shipped — useful the next time a similar dependency event happens.

Edge Cases & Failure Modes

  • Multiple tenants affected by the same platform event. Each gets its own issue, so each capability owner sees a request scoped to their capability. The operator coordinates timelines across all of them but does not bundle them into a single thread.
  • Capability owner goes silent. Same shape as silence in host-a-capability — there is no formal SLA in either direction. Experience-level handling: the operator can grant an extension only if safe slack exists, but if silence persists past the operative delivery date, step 5 applies.
  • Update cannot be shipped at all (capability fundamentally incompatible with the new offering). This is functionally the same as a missed operative delivery date: the operator opens an eviction issue. The right root response, per the parent capability’s “the capability evolves with its tenants” rule, is to consider whether the platform should keep supporting the old form — but if the external pressure (CVE, hard vendor sunset) makes that impossible, eviction is the honest outcome.
  • CVE with active exploit shortens the timeline aggressively. The operator may file the issue with very little slack, and the extension in step 4 may be much smaller than for a routine sunset — or unavailable entirely. The journey shape is unchanged; the deadlines just compress.
  • Update reveals a new requirement that the platform doesn’t yet offer. Hand off into the host-a-capability change-later loop’s “new offering needed” branch — the platform-update issue stays open while the new offering is added, then resumes at step 3.
  • Multiple overlapping platform updates against the same tenant. The operator opens one issue per forcing event, even for the same tenant, so each deadline and external reason stays legible. If the remediation overlaps, the issues cross-link and the operator coordinates them together.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name, so the lineage is traceable:

  • Eviction is allowed when needs and capabilities diverge — but fall-behind cases work with the tenant. This UX is the operationalization of that carve-out. The default outcome is “we update together,” not “we evict.” Eviction enters this journey only via the missed-final-date branch and only as a separate, linked issue.
  • Eviction threshold. A missed operative delivery date is the operational signal that continuing to accommodate this tenant would cross the 2×-maintenance-budget or reproducibility threshold. The numeric threshold lives with the KPI; this UX inherits whatever it currently is.
  • The capability evolves with its tenants. Before transitioning to eviction in step 5, the operator considers whether the platform should keep supporting the older form — sometimes the right answer is to absorb the maintenance, not push the tenant forward. That choice is constrained by what the external pressure actually allows (a CVE generally rules it out; a vendor sunset announced years ahead may not).
  • Operator-only operation. The operator is the only person who can see that a tenant has fallen behind, because cross-tenant visibility lives only with the operator. There is no automated tenant-side warning system surfacing this from the platform to the capability owner today.
  • KPI: 2-hr/week operator maintenance budget. A tenant that routinely needs hand-holding through these updates — repeatedly missing deadlines, repeatedly needing extensions — is consuming disproportionate operator time and crosses into the eviction-threshold rule on its own merits, even before any single missed operative delivery date.
  • KPI: 1-hour reproducibility. Implication for this UX: the re-provision step in the inner loop must run through the platform’s existing definitions (now updated to the new offering), not through a per-tenant snowflake patch. If the only way to keep a tenant alive is bespoke manual config, that is itself eviction-threshold material.

Out of Scope

  • The eviction journey itself. When step 5 fires, the operator opens a separate eviction issue and the capability owner’s experience continues in Capability owner moves off the platform after eviction — a sibling UX, not part of this one.
  • Platform-contract changes that aren’t forced by an external dependency event. When the operator decides to change the platform’s contract (retire a packaging form, alter availability characteristics) absent external pressure, that’s the platform-contract-change rollout UX, not this one. The seam: this UX is reactive (something outside the operator’s control forced the update); the contract-change UX is proactive (the operator chose to change something).
  • The capability owner’s side as a primary journey. This UX is written from the operator’s seat. The capability owner’s experience of receiving and responding to one of these issues is captured here as a responder, not as a separate document — it shares enough surface with modify my capability (artifacts, test, iterate, close) that a separate doc would mostly duplicate.
  • Detection of fall-behind itself. How the operator notices a tenant is on a falling-behind component (vendor announcement watching, CVE feeds, manual review) is operational detail, not part of the user experience. This UX starts at the moment the operator has decided to act.
  • Tenant-facing visibility into pending platform updates before the issue is filed. Capability owners do not get an official warning surface ahead of the issue in this UX. If the platform later adds an earlier deprecation or pending-update signal, that signal belongs in Tenant-Facing Observability, not here, and does not replace issue filing as the start of this journey.
  • Routine modify requests. A capability owner shipping a version bump on their own initiative is the change-later loop in host-a-capability, not this UX.

Open Questions

None at this time.

1.5 - Platform-Contract-Change Rollout

The operator proactively changes a term of the platform’s contract — retiring an offering, changing a packaging form, altering availability characteristics — communicates that change to every affected tenant ahead of time, and migrates them all onto the new contract before the old one is retired.

One-line definition: The operator proactively changes a term of the platform’s contract — retiring an offering, changing a packaging form, altering availability characteristics — communicates that change to every affected tenant ahead of time, and migrates them all onto the new contract before the old one is retired.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the operator — the Owner / Accountable party from the parent capability’s Stakeholders. The capability owners are responders in this journey, not initiators. As with host-a-capability and operator-initiated-tenant-update, this UX is written as if the operator and the capability owners were separate people: the role boundary is treated as real even though today both hats are worn by the same person.

  • Role: Operator. Sole administrator of the platform; the only person who can change the platform’s contract and the only person who can see across tenants to know which ones are affected.
  • Context they come from: They have decided to change a term of the platform’s contract — retire an offering, change a packaging form, alter availability characteristics or platform-imposed constraints. The decision has already been made and is not forced by external pressure. They have fleshed out the technical details of the change and (where applicable) prepared a migration guideline for tenants. Where the change replaces an offering with a new one, the replacement offering has already been implemented and is running on the platform alongside the old one.
  • What they care about here: Getting every affected tenant migrated onto the new contract by a hard deadline, without surprising anyone, while honoring the evergreen contract promise — change is communicated ahead of time and tenants are migrated, not sprung on.

Goal

“I want to change a term of the platform’s contract — retire an offering, change a packaging form, alter availability characteristics — communicate that change to every affected tenant ahead of time, and have them all migrated onto the new contract before the old one is retired, without surprising anyone.”

Entry Point

The operator arrives at this experience having chosen to change the contract. The decision-making (the why — cost, simplification, security posture, no longer wanting to maintain two runtimes, etc.) is upstream of this UX and not part of it. What they have in hand at step 0:

  • The full technical details of the change — what term is changing, what it is becoming, or that it is being removed entirely.
  • A migration guideline for tenants, where applicable (i.e. when a replacement offering exists and tenants need to repackage or reconfigure against it).
  • The replacement offering, if one exists, already implemented and running on the platform. Building the replacement is a precondition of this journey, not a step inside it.
  • Knowledge of which currently-hosted tenants are using the affected term.

The operator’s state of mind is “I have decided this is changing; how do I get everyone moved over by a date I’m choosing, without anyone being surprised?”

The seam with operator-initiated-tenant-update (UX #2) is sharp: that journey is reactive (an external event — vendor sunset, CVE, EOL — forced the update and dictated the deadline). This journey is proactive (the operator chose the change and is choosing the deadline).

Journey

1. File a “platform contract change” umbrella issue

The operator opens a single umbrella issue against the infra repo using the platform contract change issue type. This is a distinct issue type, separate from onboard my capability, modify my capability, and platform update required. The distinct type is the signal to capability owners that this is the operator changing the rules — not an externally-forced update and not optional cleanup.

A single umbrella issue is used (rather than one issue per tenant, as in UX #2) because the change applies identically to everyone, the migration guideline is shared, and tenants benefit from cross-tenant visibility — a clarifying question one tenant asks may be the answer another tenant needed.

The umbrella issue tags every affected capability owner and contains:

  • What term is changing, and what it is changing to (or that it is being removed entirely).
  • The migration guideline, if applicable.
  • The hard deadline by which all migrations must complete and after which the old form will be removed. Because this UX has no externally-imposed date to inherit, the operator picks a deadline that gives every affected tenant at least two full status-update cycles before cutoff: one cycle to acknowledge and start, and one cycle to finish or surface blockers while there is still time to respond (see the sketch at the end of this step).
  • The reason for the change. Even though the operator chose it, capability owners deserve to know why (cost, simplification, security posture, etc.) so they can plan and so the trail makes sense to readers later.
  • The status-update cadence the operator has chosen for this rollout (see step 3).

What the operator perceives at this point: the umbrella issue is filed and every affected capability owner has been notified. They wait for acknowledgments.
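
The two-full-cycles rule makes the earliest permissible deadline computable once the cadence is chosen. A minimal sketch under that reading; the names are invented.

    # Sketch of the minimum hard deadline for a contract-change rollout: the
    # filing date plus two full status-update cycles, one to acknowledge and
    # start, one to finish or surface blockers with time left to respond.
    from datetime import date, timedelta

    def minimum_deadline(filed_on: date, cadence: timedelta) -> date:
        return filed_on + 2 * cadence

    # A roughly month-long rollout with weekly status updates:
    assert minimum_deadline(date(2025, 6, 2), timedelta(weeks=1)) == date(2025, 6, 16)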

2. Capability owners acknowledge in-thread

Each tagged capability owner is required to acknowledge the change in-thread. Silence is not acceptable in an umbrella issue — silence in a multi-tenant thread is ambiguous (did they see it? are they planning?), so explicit acknowledgment is the contract.

Clarifying questions are asked in the umbrella thread, not in side channels, so that answers are visible to every other affected tenant at the same time. The operator answers questions as they come.

The deadline is not negotiable per-tenant. Capability owners do not get to ask for a slip — the deadline applies uniformly to everyone or it isn’t a deadline. (Whether the operator may globally push the deadline if the migration guideline turns out to be insufficient is covered in Edge Cases.)

3. Tenants migrate via separate modify my capability issues

Each affected tenant ships its migration as a separate modify my capability issue, linking back to the umbrella issue for context. The umbrella thread tracks acknowledgments, cross-tenant questions, and the global deadline; each modify issue tracks the actual artifact handoff / provision / test / close inner loop for one tenant. This keeps the umbrella thread readable as a coordination surface rather than a sprawling multi-tenant inner loop.

During the rollout window, the platform serves both the old and the new form of the contract concurrently — the replacement offering runs alongside the old one so tenants have time to migrate at their own pace within the deadline. The exception is a full offering removal: when there is no replacement, there is nothing to run alongside, and the change is effectively all-or-nothing at the deadline.

The operator posts status updates on a regular schedule in the umbrella thread. The cadence is chosen by the operator at the time the umbrella issue is filed and is sized to the overall timeline — daily for a roughly-week-long rollout, weekly for a roughly-month-long rollout, and so on. The current snapshot lives in the umbrella issue body so a reader landing cold can immediately see the latest state, and each scheduled update is also posted as a thread comment so the history of the rollout remains visible to watchers over time. Each update carries the same metrics: how many tenants are still on the old form, how many have migrated, which modify issues are open, and how much time remains until the deadline. Status updates are how every party — operator and capability owners alike — sees rollout progress without having to chase it.
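
Because every scheduled update carries the same metrics, the snapshot itself can be sketched as a record. Assumed shapes only; the platform defines no such structure.

    # Sketch of the per-cycle status snapshot. The same snapshot replaces the
    # umbrella issue body (so a cold reader sees the latest state first) and is
    # posted as a thread comment (so history stays visible). Names hypothetical.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class RolloutStatus:
        still_on_old_form: int
        migrated: int
        open_modify_issues: list    # e.g. ["infra#214", "infra#219"]
        deadline: date

        def render(self, today: date) -> str:
            days_left = (self.deadline - today).days
            return (
                f"Migrated: {self.migrated} | Still on old form: {self.still_on_old_form}\n"
                f"Open modify issues: {', '.join(self.open_modify_issues) or 'none'}\n"
                f"Days until deadline: {days_left}\n"
            )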

4. Deadline arrives

On the hard deadline:

  • For each tenant whose migration completed: the modify issue is closed in the normal way and that tenant is now on the new form.
  • The old form is removed from the platform regardless of whether anyone is still on it. Any tenant that has not migrated by the deadline is now broken on a removed offering — which is exactly why the operator must ensure laggards are moved into eviction before this point if it is clear they will not make it.
  • For each tenant that did not migrate by the deadline: the operator opens a separate eviction issue per laggard tenant, linking back to the umbrella issue for context. The eviction issue carries its own eviction date and is governed by the parent capability’s eviction journey (to be defined as its own UX).
  • The umbrella issue is closed. Its job ends here — every affected tenant has either completed migration (their modify issue closed) or has an eviction issue in-flight (linked from the umbrella). Subsequent activity for laggards lives on their respective eviction issues, not on the umbrella.

Flow Diagram

flowchart TD
    Start([Operator has decided to change<br/>a contract term; replacement<br/>offering already implemented]) --> File[Operator files 'platform contract change'<br/>umbrella issue, tags all affected tenants]
    File --> Ack[Each capability owner acknowledges<br/>in-thread; questions answered in-thread]
    Ack --> Modify[Each tenant ships a separate<br/>'modify my capability' issue,<br/>linked to the umbrella]
    Modify --> Concurrent[Old + new forms run concurrently<br/>during the rollout window<br/>except for full removals]
    Concurrent --> Status[Operator posts scheduled status<br/>updates with migration metrics<br/>in the umbrella thread]
    Status --> Deadline{Deadline<br/>arrives}
    Deadline --> Migrated[Migrated tenants:<br/>'modify' issue closes]
    Deadline --> Laggards[Non-migrated tenants:<br/>operator opens a separate<br/>eviction issue per laggard,<br/>linked to umbrella]
    Deadline --> Remove[Old form is removed<br/>from the platform]
    Migrated --> CloseUmbrella[Umbrella issue closed]
    Laggards --> CloseUmbrella
    Remove --> CloseUmbrella
    CloseUmbrella --> Done([Contract change has shipped])

Success

When the umbrella issue closes, the operator walks away with:

  • The contract change has shipped — the old form is gone from the platform, the new form is the only form.
  • Every affected tenant has either migrated onto the new contract or has an eviction issue in-flight; no tenant is silently broken on a removed offering.
  • The evergreen contract promise was honored: the change was announced ahead of time with a migration guideline and a hard deadline, no tenant was surprised at retirement, and tenants were given a coordinated window in which both old and new ran concurrently (except for full removals, where concurrency is impossible).
  • A trail across the umbrella issue, the per-tenant modify issues, and any linked eviction issues — showing what changed, why, who migrated when, and who didn’t. Useful the next time a contract change ships.

Edge Cases & Failure Modes

  • Capability owner does not acknowledge in the umbrella. Experience-level handling: the operator chases — in-thread mention, direct ping, separate message as the deadline approaches. If no acknowledgment arrives by the deadline, the missing acknowledgment is treated as non-engagement and the laggard branch (eviction issue per tenant) applies. Acknowledgment is required, but the consequence of withholding it is the same as failing to migrate.
  • Migration guideline turns out to be wrong or insufficient mid-rollout. Two sub-cases:
    • Isolated miss (the guideline doesn’t cover one tenant’s specific case): the fix is tenant-specific, every other tenant can keep migrating without changing their plan, the guideline is amended in the umbrella thread, and the deadline does not move.
    • Big miss (the shared guidance or replacement itself must change for the remaining tenants): the deadline is pushed out and the new deadline is announced in the umbrella thread. The hard-deadline rule still applies to the new date — extension is a global event, not a per-tenant slip.
  • Tenant says outright “we can’t migrate — the new contract makes our capability unviable.” Straight to eviction. The capability owner now has to find a new platform or revamp their capability so that it works with the new contract. The umbrella issue still tracks this tenant via the linked eviction issue at deadline time, but the migration itself is not going to happen.
  • Full offering removal (no replacement to run alongside). Step 3’s “old + new run concurrently” does not apply. The change is all-or-nothing at the deadline. Tenants must be off the offering by the deadline; there is no grace window during which both forms exist. Migration in this case usually means moving to a different offering entirely or moving the workload off-platform — whichever the migration guideline directs.
  • Many tenants miss the deadline at once. This is a signal that the operator picked a deadline that was too aggressive given the size of the change, or sized the status-update cadence poorly for the work involved. The hard-deadline rule still applies — the operator opens an eviction issue per laggard — but the operator should treat the cluster of evictions as a learning event for the next contract-change rollout.
  • Cross-tenant question reveals a conflict in the migration guideline. Same shape as the isolated-miss branch above: amend in-thread, continue. The umbrella thread is the source of truth for the guideline as it evolves during the rollout.
  • Two contract changes are in flight at once and the same tenant is affected by both. Each change still gets its own umbrella issue and the tenant is expected to acknowledge in each thread. If one migration satisfies both changes, the tenant may use one combined modify my capability issue, provided it links back to both umbrellas so each rollout can still be tracked and closed independently.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name, so the lineage is traceable:

  • Evergreen contract. This UX is the operationalization of the evergreen-contract promise made in host-a-capability. Contract changes are communicated ahead of time, tenants are migrated, and no tenant is sprung on. The hard deadline plus the rollout-window concurrency (where applicable) is what “communicated ahead of time and migrated” actually looks like in practice.
  • Operator-only operation. Only the operator can change the contract, and only the operator can see across tenants to know which ones are affected. The umbrella-issue mechanic is consistent with this — it is the operator’s tool, not a capability-owner-driven coordination surface.
  • Tenants must accept the platform’s contract. After this rollout completes, the new contract is the contract every remaining tenant has accepted. Acceptance is implicit in their having migrated; tenants that cannot accept the new contract end up evicted, which is consistent with the parent rule.
  • Eviction is allowed when needs and capabilities diverge. Laggards who miss the deadline are evicted via the parent capability’s eviction journey. This UX feeds eviction; it does not perform it.
  • Eviction threshold. A missed deadline is the operational signal that continuing to accommodate this tenant would either push routine maintenance sustainably above the 2×-budget threshold or break reproducibility (e.g. by forcing the platform to keep the old form running indefinitely just for one tenant). The numeric threshold lives with the KPI; this UX inherits whatever it currently is.
  • The capability evolves with its tenants. There is real tension between this rule and the present UX: this rule says the default response when a tenant needs something is to update the platform rather than push the requirement back. Yet this UX is the operator pushing change toward tenants. The reconciliation: this UX applies when the operator has already decided to change the contract — typically because the cost of continuing to support the old form (maintenance, security posture, complexity) has tipped against keeping it. The migration-guideline + concurrent-rollout shape is how the platform absorbs as much of the cost as it can. But once the deadline is set, it is set.
  • No specific availability or performance SLA. Contract changes that affect availability characteristics are in scope of this UX (the operator may alter availability characteristics under the rules of this rollout). Tenants needing stronger guarantees than the new contract offers are subject to the same eviction path as any other “fundamentally incompatible” case.
  • KPI: 2-hr/week operator maintenance budget. Implication for this UX: the rollout cadence — including status updates and per-tenant modify reviews — must fit within the operator’s weekly budget across the rollout window. A contract change that would clearly blow the budget is a signal to reduce the scope of the change, lengthen the deadline, or stage the rollout, before the umbrella issue is filed.
  • KPI: 1-hour reproducibility. Implication for this UX: the new contract must itself be reproducible from definitions. A contract change that ships with the platform itself stuck on a snowflake configuration to keep both forms running has failed the rule. Concurrent old/new during rollout is fine; permanent dual-form support is not the goal.

Out of Scope

  • Externally-forced updates. Vendor sunset, CVE remediation, runtime EOL — those are reactive updates with deadlines inherited from outside, and they belong in operator-initiated-tenant-update (UX #2). The seam: that UX is reactive; this UX is proactive.
  • Routine modify requests. A capability owner shipping a version bump or new component on their own initiative is the change-later loop in host-a-capability, not this UX.
  • The eviction journey itself. When a laggard misses the deadline, the operator opens a separate eviction issue per tenant. The capability owner’s experience continues in Capability owner moves off the platform after eviction — a sibling UX, not part of this one.
  • The decision to change the contract. Why the operator chose to retire an offering, change a packaging form, or alter availability characteristics is upstream of this UX. The journey starts the moment the operator has decided.
  • Building the replacement offering. Where the contract change replaces an old offering with a new one, the replacement must already be implemented and running on the platform before the umbrella issue is filed. Building it is a precondition, not a step in this UX. (The host-a-capability “new offering needed” branch is typically the path by which the replacement offering originally entered the platform.)
  • The capability owner’s responder side as a separate doc. As with UX #2, the capability owner’s experience of receiving and responding to the umbrella issue is captured here as a responder. The actual migration work runs through the existing modify my capability surface, which is already documented in host-a-capability. A separate doc would mostly duplicate.

Open Questions

None at this time.

1.6 - Stand Up the Platform

The operator rebuilds the platform from its definitions — back to ready-to-host-tenants — confidently and verifiably, whether it’s the first build ever, recovery after total loss, or a periodic drill.

One-line definition: The operator rebuilds the platform from its definitions — back to ready-to-host-tenants — confidently and verifiably, whether it’s the first build ever, recovery after total loss, or a periodic drill.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the operator — the parent capability’s Owner / Accountable party and sole administrator. There are no co-operators in this journey, and the sealed successor credentials are not in play during routine standup.

If a successor has taken over (because the primary operator is unavailable), they run this same journey. The act of breaking the seal and asserting takeover is a separate experience; once they have access to the operator’s context, the rebuild flow is identical. From this UX’s perspective there is one persona — whoever is currently the operator.

  • Role: The operator. Sole party with administrative access to the platform and accountable for it existing and running.
  • Context they come from: Either there is no platform yet (first-ever build) or the platform is gone / being rebuilt in parallel. Either way, what they have in hand is the definitions repo, root-level access to the underlying infrastructure (cloud account, home-lab), and — for disaster recovery — backups of tenant data sitting somewhere reachable.
  • What they care about here: Confidence that the platform really is reproducible from its definitions. Speed matters too (the Reproducibility KPI is 1 hour) but takes second place — a fast rebuild that leaves the operator unsure whether anything was missed is worse than a slower one that finishes verifiably clean.

Goal

“I want to rebuild the platform from nothing back to ready-to-host-tenants — confidently and verifiably, fast enough that the 1-hour KPI holds — so that total loss is recoverable, not catastrophic.”

Confidence beats speed when the two conflict. The operator is rebuilding the substrate that everything else of theirs depends on; a hurried rebuild that they don’t trust is its own kind of failure.

Entry Point

Three triggers converge on this same flow:

  • First-ever build. No platform has existed before; the operator is bringing it into being.
  • Disaster recovery. The platform existed and is now gone (cloud project lost, home-lab destroyed, ransomware, etc.); the operator is rebuilding on top of root-level access that survived the disaster.
  • Drift / reproducibility drill. The operator rebuilds the platform in parallel on scratch infrastructure after every significant platform change — meaning any change that would alter what they are rebuilding, what they must validate, or what they must trust before calling the platform ready again — and at least quarterly to prove the KPI still holds while the live platform keeps serving. The drill is identical to the real flow — only the underlying infrastructure differs.

What the operator has in hand at minute zero:

  • The definitions repo, pulled fresh.
  • Root-level access to the underlying infrastructure (cloud-provider account, home-lab access). Loss of these is not in scope for this UX — they are foundational and must already be in place before the platform can be (re)built.
  • For disaster recovery only: tenant-data backups. Restoring those into newly-provisioned tenants is a separate UX; this UX ends before that begins.

The operator’s state of mind is steady, not panicked: this journey exists precisely so total loss isn’t catastrophic, and a drill rehearses it on purpose.

What is not assumed at entry:

  • Definitions drift. Before any rebuild with prior platform state starts, the operator performs a required preflight drift check against the live platform or the last known-good environment. On a first-ever build, the check is vacuously clean because there is no prior platform state yet. The check passes only when the platform state the operator is treating as real still matches the definitions closely enough that no unexplained differences remain. If drift exists, it must be detected and fixed before this journey begins, not discovered partway through. (See Constraints.)
  • The sealed successor credentials. They stay sealed during routine standup, including DR.
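
One way to picture the preflight drift check: fingerprint what the definitions say should exist, fingerprint what actually exists in the reference environment, and pass only when nothing is unexplained. The fingerprinting below is entirely assumed; the real check depends on the platform's own tooling.

    # Sketch of the preflight drift check. 'desired' comes from rendering the
    # definitions repo; 'observed' from inspecting the reference environment
    # (live platform or last known-good). Both are placeholder fingerprints.
    def drift_check(desired: dict, observed: dict) -> bool:
        missing = desired.keys() - observed.keys()      # defined but absent
        unexpected = observed.keys() - desired.keys()   # present but undefined
        changed = {k for k in desired.keys() & observed.keys()
                   if desired[k] != observed[k]}        # present but diverged
        clean = not (missing or unexpected or changed)
        if not clean:
            print(f"drift: missing={missing} unexpected={unexpected} changed={changed}")
        return clean

    # On a first-ever build there is no reference environment, so the check is
    # vacuously clean and the rebuild simply proceeds.
    desired = {"network/vpc": "sha:ab12", "identity/svc": "sha:cd34"}
    observed = {"network/vpc": "sha:ab12", "identity/svc": "sha:ffff"}   # diverged
    assert not drift_check(desired, observed)   # rebuild must not start yet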

Journey

The rebuild is automated, with manual operator-validation checkpoints between phases. The operator is on standby throughout — watching log output and system-level signals, ready to validate at each checkpoint, but not driving each step by hand.

1. Decide to rebuild and confirm preconditions

The operator decides to rebuild — first build, DR, or scheduled drill — and confirms what they have in hand: a fresh pull of the definitions repo and root-level access to the target infrastructure (the live infra for first-build/DR, scratch infra for a drill). Before they kick anything off, they perform the required preflight drift check whenever prior platform state exists, using the live platform or the last known-good environment as the reference, and confirm the platform they intend to trust still matches the intended definitions closely enough to rebuild from them honestly. If the check fails because unexplained differences remain, they stop and resolve drift before starting the rebuild.

What they perceive: nothing yet on the target infrastructure; a clean definitions repo on their workstation; the underlying provider UIs (cloud console, IPMI) showing the empty starting state.

2. Kick off the top-level rebuild

The operator runs the single top-level entry point that drives the rebuild from the definitions repo. From here on, automation does the work of provisioning; the operator’s job is to validate at each checkpoint.

What they perceive: log output begins streaming. The first phase is underway.
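
The shape of the run, automation per phase with a manual gate between phases and all-or-nothing teardown on failure, is sketched below. The phase names mirror the journey; the provisioning and teardown calls are placeholders for the real definitions-driven tooling.

    # Sketch of the top-level rebuild driver: automation provisions each phase,
    # then pauses for the operator's manual validation. A failed checkpoint
    # means tear down everything and restart from the top; partial state is a
    # snowflake risk and is not trusted.
    PHASES = ["foundations", "core-services", "cross-cutting", "canary"]

    def provision_phase(phase: str) -> None:
        print(f"provisioning {phase} from definitions...")   # placeholder

    def teardown_all() -> None:
        print("tearing down all provisioned state")          # placeholder

    def operator_validates(phase: str) -> bool:
        # Manual gate: the operator checks provider UIs and expected signals.
        return input(f"phase '{phase}' complete -- validated? [y/N] ").lower() == "y"

    def rebuild() -> None:
        while True:
            for phase in PHASES:
                provision_phase(phase)
                if not operator_validates(phase):
                    teardown_all()
                    break              # restart the whole rebuild from the top
            else:
                return                 # every checkpoint passed: ready for tenants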

3. Phase 1 — Foundations

Automation provisions the underlying foundations: cloud project / home-lab base, network plumbing including the connectivity between cloud and home-lab. On completion the automation pauses and prints a phase summary.

The operator validates by checking the underlying provider’s UIs (cloud console, home-lab IPMI) and the expected signals for this phase. Only when they are satisfied that the foundations really are in place do they signal continue.

If validation fails, see Edge Cases — Phase fails.

4. Phase 2 — Core platform services

Automation provisions compute, persistent storage, and the platform-provided identity service on top of the foundations. Pauses. The operator validates the same way — provider UIs plus the expected signs that compute, storage, and identity are really available (e.g. the identity service is reachable and issuing tokens) — then signals continue.

5. Phase 3 — Cross-cutting services

Automation provisions backup and observability so they cover the platform itself before any tenant arrives. Pauses. The operator validates that backup is wired in and observability is collecting, then signals continue.

6. Phase 4 — Readiness verification and canary tenant

The platform deploys a purpose-built canary tenant maintained alongside the platform definitions end-to-end. It exists solely to prove the platform can host tenants without coupling readiness to any real tenant’s lifecycle. The trade-off: a purpose-built canary is less representative than a small real tenant and may miss tenant-specific workload quirks, but it keeps readiness verification deterministic and disposable. The canary is exercised (it should run, be reachable, store and read back data, authenticate against the platform-provided identity service, and be picked up by backup and observability), then torn down.

What the operator perceives: a clear pass/fail on the canary. The canary’s success is the readiness signal — “ready to host tenants” is operationally identical to “did host a tenant just now.”

If the canary fails, see Edge Cases — Canary fails.
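
The canary's pass/fail is the conjunction of a fixed set of end-to-end exercises. A sketch of that checklist follows; each probe is a placeholder for whatever the canary actually does against the real services.

    # Sketch of the canary readiness check: every platform-provided service is
    # exercised end-to-end, and readiness is the conjunction of all probes.
    def canary_is_green(probes: dict) -> bool:
        results = {name: probe() for name, probe in probes.items()}
        for name, ok in results.items():
            print(f"{'PASS' if ok else 'FAIL'}  {name}")
        return all(results.values())

    # "Ready to host tenants" is operationally "did host a tenant just now".
    green = canary_is_green({
        "runs and is reachable":      lambda: True,   # placeholder probes
        "stores and reads back data": lambda: True,
        "authenticates via identity": lambda: True,
        "picked up by backup":        lambda: True,
        "picked up by observability": lambda: True,
    })
    assert green   # only then is the canary torn down and readiness declared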

7. Note the wall-clock and close out

The operator records how long the rebuild took. If it came in under the 1-hour KPI, they’re done — the platform is ready for tenant restoration (a separate UX). If it took longer, the platform is still ready; the operator opens a GitHub issue capturing the cause of the slowdown so it can be analyzed and improved later. Either way, the journey ends here.

Flow Diagram

flowchart TD
    Start([Trigger: first build / DR / drill]) --> Confirm[Run required preflight drift check<br/>when prior state exists and confirm<br/>root-level access is in hand]
    Confirm --> Kickoff[Run top-level rebuild from definitions]
    Kickoff --> P1[Phase 1: Foundations<br/>cloud + home-lab base, networking]
    P1 --> V1{Validate via provider UIs<br/>+ expected signals}
    V1 -->|Fails| Halt[Halt, root-cause,<br/>tear down everything,<br/>fix definition, restart]
    V1 -->|OK| P2[Phase 2: Core services<br/>compute, storage, identity]
    P2 --> V2{Validate}
    V2 -->|Fails| Halt
    V2 -->|OK| P3[Phase 3: Cross-cutting<br/>backup, observability]
    P3 --> V3{Validate}
    V3 -->|Fails| Halt
    V3 -->|OK| Canary[Phase 4: Deploy purpose-built<br/>canary tenant, then tear down]
    Canary --> CanaryGreen{Canary green?}
    CanaryGreen -->|No| Halt
    CanaryGreen -->|Yes| Wallclock[Note wall-clock duration]
    Wallclock --> KPI{Under 1 hour?}
    KPI -->|Yes| Ready((Platform ready<br/>to host tenants))
    KPI -->|No| Issue[Open GitHub issue<br/>to analyze the slowdown]
    Issue --> Ready
    Halt --> Kickoff

Success

When the canary comes up green and is cleanly torn down, the operator walks away with:

  • A platform that is ready to host tenants — every platform-provided service has been exercised end-to-end by a purpose-built tenant deployment, not just by self-checks.
  • Confidence in reproducibility. The rebuild ran from the definitions repo, with no manual snowflake configuration, and produced a working platform. The KPI is honestly met (or, if not, the gap is captured for follow-up rather than papered over).
  • A clean handoff to tenant restoration. Any previously-hosted tenants come back via their own restoration journey; the platform-side standup ends cleanly without entangling itself in tenant data.
  • For drills specifically: a renewed assurance that “we can rebuild this in an hour” is a real property, not a hope, because the drill is run after every significant platform change and at least quarterly rather than whenever it feels convenient.

Edge Cases & Failure Modes

  • Phase fails mid-rebuild. The automation hits an error during one of the phases. The operator halts, root-causes the failure, fixes the underlying issue (typically a definition that needs updating), tears down everything that was provisioned so far, and restarts the rebuild from the top. Partial state is itself a snowflake risk and is not trusted. This implies each phase must be reversible — at minimum, “delete everything” must be a viable rollback. (See Constraints.)

  • Preflight drift check fails. The rebuild does not start. The operator treats this as a definitions integrity problem, reconciles the drift, and only re-enters this journey once the required preflight check passes.

  • Definitions are drifted despite the preflight check. Drift is supposed to be prevented by the platform’s enforcement of tracked changes and immutability, and detected/fixed before this journey starts. If drift still surfaces during the rebuild (e.g. the canary fails because something expected by the definitions is missing or inconsistent), the operator treats it as a definitions bug — fix the definition, tear down, restart.

  • 1-hour KPI is missed. The platform is still up and ready for tenants. The operator records the wall-clock and opens a GitHub issue to analyze why it took longer than it should have. The KPI is missed for that rebuild, but the platform doesn’t get blocked from going back into service; KPI improvement is a follow-up concern, not part of this journey.

  • Canary tenant fails to come up. The platform is not ready, regardless of how green every prior phase looked. The operator root-causes the canary failure, fixes the relevant definition, tears down, and restarts. Until the canary is green, the platform is not marked ready for tenants — even if the operator is under time pressure, this rule does not bend.

  • Successor at the keyboard. A successor who has taken over runs this same journey from the operator’s context. The act of breaking the sealed credentials and asserting takeover is a separate UX (not yet defined); once the successor is in, the rebuild flow does not differ.

  • First build has no backups. First-build and DR/drill produce the same platform-side outcome from this UX’s perspective. Tenant data restore is out of scope here, so the absence of backups during a first build is simply a non-event for this journey.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name, so future readers can trace the lineage:

  • KPI: 1-hour reproducibility. This is the journey the KPI is measured against. The 1-hour budget is a target, not a hard fail — missing it does not stop the platform from going into service, but it does generate a tracked follow-up issue. The KPI cannot be honestly evaluated unless drills run this same flow on parallel infrastructure after every significant platform change and at least quarterly.

  • Operator-only operation. No co-operators, no delegated administration, no shared driving of the rebuild. The sealed successor credentials are not used during routine standup, including DR. A successor uses them only after takeover, and from that point operates as “the operator” through this same UX.

  • The platform may span public and private infrastructure. Phase 1 (foundations) explicitly crosses cloud and home-lab boundaries — the rebuild is not a single-environment affair. Connectivity between the two is part of the foundation, not an afterthought.

  • Reproducibility beats vendor independence beats minimizing operator effort (the parent capability’s stated tiebreaker). It manifests here as the rule that partial state is not trusted: tearing down and restarting from scratch on any phase failure costs more operator effort than an incremental fix-and-resume, but it is what reproducibility honesty requires.

  • Operator succession. Successor takeover converges on this same UX — sealed credentials grant access to the operator’s context, after which the rebuild flow is identical. The seal-breaking event itself is a separate journey.

  • No specific availability or performance SLA. The journey ends at “ready to host tenants” — what tenants experience after that is governed by the platform’s normal availability characteristics, not by this UX.

  • Tracked changes and immutability across all platform UXs. The required preflight drift check is only meaningful if every UX that can introduce platform state enforces tracked changes and immutability rather than allowing ad-hoc modification. That property must hold across the platform’s definitions and operations; this UX does not invent a drift policy of its own, but it is the one that refuses to proceed until the policy is verified.

  • Each phase must be reversible. Implied by the “phase fails → tear down everything and restart” edge-case rule. The platform’s definitions must support a clean teardown of any partially-provisioned state. “Delete everything and start over” must be a viable, reliable option at every checkpoint.

  • Default hosting target for the operator’s capabilities. Readiness cannot be declared from infrastructure self-checks alone; the platform has to prove it can actually host a tenant. That is why this UX requires a purpose-built canary tenant maintained with the platform definitions.

Out of Scope

  • Tenant data restoration. Bringing previously-hosted tenants’ data back into newly-provisioned tenants is a separate UX. This journey ends at “platform is ready to host tenants” — full stop.

  • Re-onboarding tenants after rebuild. Each tenant’s return is governed by its own journey (likely a variant of Host a Capability, possibly seeded by a backup-restore step). Not handled here.

  • Migration to new underlying infrastructure. Moving the platform to a different cloud account or different home-lab hardware while the old one is still running is a different journey (the old platform serves traffic while the new one comes up). Out of scope until migration becomes a realistic case worth defining.

  • Sealed-credential takeover by the successor. The act of breaking the seal, asserting authority, and gaining access to the operator’s context belongs in its own UX. This UX picks up after takeover, where the successor is operating as the operator.

  • The broader drift-management process. This UX requires a preflight drift check before rebuild, but the wider machinery that continuously enforces tracked changes, detects drift between rebuilds, and maintains the last known-good reference is a cross-cutting concern, not the focus of this journey.

  • Loss of root-level foundations (cloud account itself, all home-lab access). These are assumed in place before the platform was deployed in the first place. Recovery from their loss is not part of the platform’s capability.

Open Questions

None at this time.

1.7 - Tenant-Facing Observability

One-line definition: A capability owner with a live tenant checks whether their hosted capability is healthy — either pulling the view themselves or being pushed an alert when something crosses a threshold they set.

Parent capability: Self-Hosted Application Platform

Persona

The actor is a capability owner whose capability has already been onboarded via Host a Capability and is currently running on the platform. As with that UX, this is written as if the capability owner were a separate person from the operator, even though today they are the same human wearing different hats.

  • Role: Capability owner of a live tenant. They are not operating the platform; they are operating their capability, which happens to be hosted here.
  • Context they come from: Their capability is live and serving its end users. They are not in the middle of onboarding or modifying — that’s a different journey. They want to know how their thing is doing right now, or they have just been pinged that something is wrong.
  • What they care about here: Knowing the health of their capability without depending on end users to report problems first, and without having to interrupt the operator to ask.

Goal

“I want to know whether my hosted capability is healthy right now — and I want the platform to ping me if it isn’t — so I find out before my end users do, and I can tell whether the problem is mine to fix or the platform’s.”

Two arrival modes share this goal: proactive pull (capability owner goes looking) and reactive push (an alert reaches them). The view they reach is the same in both cases.

Entry Point

Two distinct entries, converging on the same view.

Pull entry. The capability owner opens the observability offering’s tenant view. They might be doing this:

  • Routinely (e.g. before promoting a new release of their capability to end users).
  • Reactively, because an end user reported something looked off and they want to confirm.
  • Out of curiosity / habit.

They reach it by authenticating to the shared observability offering. After login they land directly in their own tenant’s view and stay confined there for the rest of the session. There is no separate URL per tenant — the same offering serves everyone, but capability owners do not browse across tenants or switch into an operator-wide view.

Push entry. The platform’s alerting reaches them by email. They are pulled away from whatever they were doing and now have a concrete “your capability looks unhealthy” signal in hand.

In both cases their access was provisioned automatically as part of the original onboard my capability flow (step 5 of Host a Capability) — observability is part of being hosted, not an add-on they request later.

Journey

1. Access is already in place (set up during onboarding)

By the time the capability owner has a live tenant, they already have:

  • A working login to the observability offering, scoped to their tenant.
  • Email alerting wired to the address they use for platform communication.
  • A platform-standard health bundle for their capability: availability, latency, error rate, resource saturation, and restart / deployment events.
  • A clear contract about trust: the tenant view is the source of truth for current health, while email alerts are a best-effort nudge that helps them notice trouble sooner.

Nothing in this UX requires them to set any of that up. Threshold tuning happens inside the observability offering, but any request to expand the signal bundle or add new delivery channels goes through modify my capability — not this journey.

2. (Pull mode) Capability owner opens the observability view

They authenticate to the observability offering and land on a view scoped to their tenant. They see the current state of the platform-standard health bundle for their capability: whether it is up, how quickly it is responding, how often it is failing, whether it is under resource pressure, and whether it has recently restarted or been redeployed.

What they perceive: a current-state read of their capability’s health, plus enough recent history to tell whether something is trending bad. They cannot see other tenants, and there is no mode-switch that broadens their scope — only the operator can do that.

3. (Pull mode) Capability owner tunes thresholds, if needed

While in the offering, the capability owner can self-serve their alert thresholds — the values that, when crossed, will fire an email alert to them. Thresholds are their call: the platform does not prescribe what’s unhealthy enough to wake them up.

This is the one self-service surface the platform exposes to capability owners. Everything else still goes through GitHub issues; thresholds are an exception because they are a tuning knob the capability owner needs to iterate on without operator involvement.
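
As a purely illustrative shape (the offering's real configuration surface is not specified in this UX), a tenant's self-serve thresholds over the standard health bundle could be as simple as a handful of owner-chosen bounds; every name and value below is hypothetical:

    # Hypothetical per-tenant alert thresholds over the platform-standard
    # health bundle. The values are the capability owner's call entirely.
    thresholds = {
        "availability":        {"below": 0.99},  # fraction of successful probes
        "latency_p95_ms":      {"above": 500},
        "error_rate":          {"above": 0.05},  # errors per request
        "resource_saturation": {"above": 0.90},  # fraction of quota in use
        "restarts_per_hour":   {"above": 3},
    }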

If the observability offering knows email delivery is degraded for this tenant, it says so in the tenant view. What the capability owner perceives: silence from email is not reassurance until the delivery path is healthy again; in the meantime, the pull view is the authoritative answer.

4. (Push mode) An alert reaches the capability owner

A signal crossed a threshold they set. The platform sends an email alert. The alert names which signal and which capability — enough for them to start without opening anything else.

What they perceive: their capability is unhealthy enough that they wanted to be told. They now have to figure out whose problem it is.

5. Root-cause — is this the tenant or the platform?

The capability owner investigates. They have two possible conclusions:

5a. It’s the tenant. The signals point at their capability — their code, their data, their config. They handle it the way they would handle any problem with their capability: fix on their side, ship a new artifact via modify my capability if it requires a deployment, or operate within the running tenant if the tools to do so exist.

5b. It’s the platform. The signals point at something below their capability — the host is gone, networking is broken, the storage offering is degraded. They look for an open operator-side issue tracking it. If the operator has already opened one, they watch that issue; the operator owns the fix, and the capability owner’s role from here is to stay aware so they can communicate to their own end users. If no such issue exists yet, the operator will probably open one shortly (the operator gets the same signals); the capability owner does not need to file anything themselves.

6. Resolution

Either:

  • They fix their side of it and signals return to healthy. The alert (if there was one) does not need to be acknowledged — the platform stops alerting because the threshold is no longer crossed.
  • The operator fixes the platform side of it and signals return to healthy. The operator-side issue closes. The capability owner has been a passive watcher.

In either case the capability owner walks away with the same end state: their capability is healthy again, and they knew about the unhealth without an end user telling them.

Flow Diagram

flowchart TD
    Onboarded([Capability is live — observability access<br/>and email alerts provisioned during onboarding]) --> Trigger{What brought<br/>them here?}
    Trigger -->|Routine check / end user reported| Pull[Open observability view —<br/>see tenant-scoped signals]
    Trigger -->|Push alert from platform| Alert[Receive email alert:<br/>which signal, which capability]
    Pull --> Tune{Want to adjust<br/>thresholds?}
    Tune -->|Yes| Self[Self-serve threshold change<br/>in the observability offering]
    Tune -->|No| Read[Read the signals]
    Self --> Read
    Read --> Healthy{Healthy?}
    Healthy -->|Yes| Done([Walk away — capability is fine])
    Healthy -->|No| RootCause
    Alert --> RootCause[Root-cause: tenant or platform?]
    RootCause -->|Tenant| FixSelf[Fix on their side —<br/>via 'modify my capability' if needed]
    RootCause -->|Platform| Watch[Watch the operator's issue —<br/>operator opens one, capability owner observes]
    FixSelf --> Recover([Signals recover])
    Watch --> Recover

Success

A successful experience looks like:

  • The capability owner learned about a health problem before their end users had to tell them, or confirmed health proactively before promoting a change.
  • They could tell, from the signals alone, whether the problem was theirs or the platform’s — without having to interrupt the operator to ask.
  • If it was theirs, they fixed it through the channels they already use (modify issue, in-tenant tools, redeploy).
  • If it was the platform’s, they had something concrete to watch (the operator’s issue) and could relay status to their own end users.
  • They understood that the tenant view was authoritative and email was an acceleration path, so silence from email was never the only evidence they relied on.
  • They did not have to set anything up to make this work — onboarding put it in place.

Edge Cases & Failure Modes

  • Alert fatigue / ignored alerts. A capability owner who stops responding to their own alerts is not the platform’s problem — alerts are a courtesy; tenant health is tenant responsibility. The platform keeps emitting; what the capability owner does with them is their call.
  • Threshold set too tight, capability owner spammed. Self-serve thresholds means the capability owner can fix this themselves. The platform does not intervene to “save them from themselves.”
  • Threshold set too loose, real problems missed. Same — their call, their consequence. The platform’s defaults (whatever the observability offering ships with) provide a starting point.
  • Operator hasn’t opened a platform-side issue yet when the capability owner is investigating. The capability owner does not need to file one themselves. The operator gets the same signals and will open one. If they don’t and the problem persists, that is an operator-side failure, not a capability-owner-side action.
  • Capability owner suspects the platform but signals look fine for the platform. They surface this on a modify my capability issue or a comment to the operator — same surface they would use for anything ambiguous. This UX does not introduce a new issue type for “I think it’s you, not me.”
  • Alert delivery is broken (email bounces, mailbox rule hides it, etc.). The capability owner does not treat silence from email as proof of health; the pull view remains the source of truth. If the offering knows delivery is failing, the tenant view shows alerting as degraded so the capability owner understands email is currently unavailable as a nudge.
  • Capability owner wants more than email alerts or wants a broader signal bundle. Goes through modify my capability, not this UX — it’s a contract change about what the platform delivers to the tenant, even if a small one.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability’s Business Rules and Success Criteria:

  • Operator-only operation. The capability owner is not an operator. Their access is scoped to their own tenant; the operator is the only role that sees across tenants. The one self-service surface (threshold tuning) does not violate this — it adjusts only their own email alerts, not platform configuration.
  • Direct outputs include observability. The parent capability lists observability as a direct output: “the operator can tell whether each tenant is up and healthy without the tenant having to instrument that itself.” This UX extends that same plumbing to the capability owner with tenant-scoped data access: observability is a shared offering, with cross-tenant visibility kept to the operator.
  • No direct end-user access to the platform. The capability owner’s end users do not get observability access. This view stops at the capability owner.
  • Tenants must accept the platform’s contract. The signals available are the platform-standard health bundle — availability, latency, error rate, resource saturation, and restart / deployment events. Capability owners do not ask the platform to instrument arbitrary tenant-specific metrics as part of this UX.
  • The capability evolves with its tenants. If multiple tenants need observability beyond the standard health bundle, the right response is to expand the offering’s category — not to push instrumentation back onto the tenant.
  • KPI: 2-hr/week operator maintenance budget. Implication: the alerting path must not produce so many false positives that the operator is constantly fielding “is this me or you?” questions from capability owners. Self-serve thresholds and “operator gets the same signals” are both pressure-reliefs on this — capability owners can tune their noise themselves, and they do not need to escalate “is this the platform?” questions to the operator because they can read the signals directly.

Out of Scope

  • Changing the signal bundle or adding non-email alert channels. Both go through Host a Capability’s modify loop, not here. They are contract changes about what the platform delivers.
  • End-user-facing observability. End users of a tenant do not get a “is the thing I use up?” view from the platform. If a tenant wants a status page for their end users, that is a feature of the tenant capability, not the platform.
  • Operator-side observability. The operator’s view across all tenants is its own surface, used during operator-driven journeys (rebuild, contract rollout, eviction decisions). This UX is strictly the capability owner’s slice.
  • Platform-side incident management. When the capability owner concludes “this is the platform’s problem and I’ll watch the operator’s issue,” what the operator does inside that issue is operator workflow — not part of this UX.
  • Threshold-tuning best practices. This UX provides the surface for self-serve threshold tuning; it does not document what thresholds a capability owner should pick. That belongs in the observability offering’s own documentation.

Open Questions

None at this time.

2 - Business Requirements

Business requirements extracted from the Self-Hosted Application Platform capability and its user experiences. Each requirement links back to its source. Technical requirements and decisions belong in tech-requirements.md and ADRs, not here.

Living document. This is regenerated from the capability and UX docs on demand. Numbering is append-only — once a BR is assigned, it keeps that number forever, even if removed (mark removed ones explicitly). Technical requirements cite BR-NN, so renumbering would silently break provenance.

Review gate. Set reviewed_at: in the frontmatter to today’s ISO date once you have read and edited this document. The define-technical-requirements skill will refuse to extract TRs until reviewed_at is newer than the file’s last modification.

Parent capability: Self-Hosted Application Platform

How to read this

Each requirement is forced by the capability or a user experience — it states, in business or user-outcome terms, what the system must guarantee. Decisions about the technical translation (cadences, durability levels, protocols) belong in tech-requirements.md. Decisions about how (which database, which library, which provider) belong in adrs/. If something in this list reads like a technical constraint or a chosen solution rather than a business demand, flag it for review.

Requirements

BR-01: Provide a single default hosting target for the operator’s capabilities

Source: Capability §Purpose & Business Outcome

Requirement: The platform must be the default place where the operator’s capabilities run, so that “where does this run?” is a solved question rather than re-litigated per capability. Any capability the operator defines must be eligible to run here unless it is explicitly exempted.

Why this is a requirement, not a TR or decision: This is the first stated business outcome of the capability and the rationale for the capability existing at all. It does not name a technology — it sets the demand that the platform exist and absorb every capability by default.

BR-02: Platform must be reproducible from its definitions

Source: Capability §Purpose & Business Outcome · UX: Stand Up the Platform §Journey

Requirement: The platform itself must be rebuildable from its definitions, with no manual snowflake configuration, so that a total loss does not mean a permanent loss of the platform. Any state the platform depends on must be expressible as part of the definitions.

Why this is a requirement, not a TR or decision: Reproducibility is one of the capability’s named outcomes and is the operative test of “self-hosted.” It does not specify how (which IaC tool, which packaging form) — only the demand that the platform be rebuildable from authoritative inputs.

BR-03: Operator must retain end-to-end control over platform components

Source: Capability §Business Rules & Constraints

Requirement: Wherever the platform uses third-party components (vendor services, public-cloud offerings), the operator must retain control of configuration, data, and the ability to leave. Vendor lock-in that prevents departure is unacceptable.

Why this is a requirement, not a TR or decision: “Self-hosted” is defined in the capability as operator-controlled end-to-end, not as forbidding all vendors. The BR sets the demand (retain control + ability to leave), not the implementation choice.

BR-04: Platform-level investments must accrue to all tenants

Source: Capability §Purpose & Business Outcome

Requirement: Improvements made at the platform level (resiliency, observability, backup, security) must benefit every tenant capability rather than be re-solved per tenant.

Why this is a requirement, not a TR or decision: This is one of the four stated outcomes the capability promises. It demands a property of the platform’s offerings (shared, not per-tenant), without choosing how that sharing is achieved.

BR-05: Only the operator may administer the platform

Source: Capability §Business Rules & Constraints

Requirement: There must be no co-operators, no delegated administration, and no shared day-to-day administration of the platform. The operator is the sole administrator.

Why this is a requirement, not a TR or decision: The capability’s “Operator-only operation” rule states this in absolute terms. It is a forced constraint — every UX is shaped around the operator being the only one with administrative reach.

BR-06: End users of tenants must have no direct access to the platform

Source: Capability §Business Rules & Constraints · UX: Move Off the Platform After Eviction §Constraints Inherited · UX: Move Off the Platform After Eviction §Journey

Requirement: End users of tenant capabilities reach the tenant, not the platform. The platform must have no notion of “end users” of itself, no UI for them, and no communication channel to them — including during eviction. Capability owners, not the platform, are responsible for notifying their own end users of any tenant lifecycle change (such as an impending shutdown).

Why this is a requirement, not a TR or decision: The capability explicitly excludes direct end-user access; the eviction UX reinforces it (the platform never tells end users “this tenant has been retired”) and assigns the notification responsibility to the capability owner. The BR forbids a class of behavior, not a specific implementation.

BR-07: A designated successor must be able to take over operation if the primary operator becomes unavailable

Source: Capability §Business Rules & Constraints · UX: Stand Up the Platform §Constraints Inherited

Requirement: The platform must support a designated successor operator who holds sealed/escrowed credentials and a runbook sufficient to keep the platform running if the primary operator becomes unavailable. Successor credentials are not used for routine operation.

Why this is a requirement, not a TR or decision: Operator succession is one of the capability’s business rules and the standup UX assumes a successor can run the rebuild flow. The BR demands that takeover be possible; how the seal works is a downstream concern.

BR-08: Platform must produce on-demand exportable archives of tenant data while healthy

Source: Capability §Business Rules & Constraints · UX: Move Off the Platform After Eviction §Journey

Requirement: While the platform is up, every tenant’s users (via the capability owner) must be able to retrieve their content as a portable archive without operator involvement. Export availability is conditional only on the platform being healthy.

Why this is a requirement, not a TR or decision: This is half of the capability’s “Operator succession” rule (the other half is the successor) and the central mechanism of the eviction UX. It states what the user must be able to obtain, not how the export is implemented. Pairs with BR-09, which forbids gaps in export-tooling coverage.

BR-09: Export tooling must exist for every kind of data the platform hosts

Source: UX: Move Off the Platform After Eviction §Edge Cases

Requirement: Export tooling must be a core platform feature, available for every data shape the platform hosts. There must be no tenant whose data shape lacks an export path at the time of eviction.

Why this is a requirement, not a TR or decision: The eviction UX explicitly states this cannot-happen-by-design property and treats any gap as a platform bug. The BR forbids a class of failure rather than naming a tool.

BR-10: Exports must include platform-produced verification material

Source: UX: Move Off the Platform After Eviction §Journey

Requirement: Each export the platform produces must be accompanied by a checksum/hash and total size in bytes, so the capability owner can verify integrity. Semantic correctness validation remains the capability owner’s responsibility; the platform’s verification is the ceiling of what it can offer on the user’s behalf.

Why this is a requirement, not a TR or decision: The eviction UX makes this guarantee explicit — the platform produces the integrity envelope; the user judges semantic correctness. The BR demands the envelope, not a specific hash function.
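
A minimal sketch of producing that envelope, assuming SHA-256 as one plausible hash choice (the BR deliberately does not mandate a function) and a local archive path:

    import hashlib
    import json
    import os

    def write_integrity_envelope(archive_path: str) -> dict:
        """Compute the checksum and total size the platform attaches
        to an export, and write them next to the archive."""
        digest = hashlib.sha256()
        with open(archive_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        envelope = {"sha256": digest.hexdigest(),
                    "size_bytes": os.path.getsize(archive_path)}
        with open(archive_path + ".envelope.json", "w") as f:
            json.dump(envelope, f, indent=2)
        return envelope

The capability owner reruns the same hash over their downloaded copy and compares; judging whether the contents are semantically right stays on their side of the seam.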

BR-11: Tenant data must remain retrievable for 30 days after eviction in a read-only state

Source: UX: Move Off the Platform After Eviction §Journey

Requirement: From the eviction date forward, the platform must hold the tenant’s data in an export-only, read-only state for 30 days, during which the export tool must continue to work. After 30 days, the platform must stop offering any tenant-accessible copy of that data.

Why this is a requirement, not a TR or decision: The eviction UX defines the 30-day window as a hard tenant-facing guarantee. It is a business commitment to the departing capability owner, not a technical translation.

BR-12: Export-tooling defects must pause the post-eviction retention countdown

Source: UX: Move Off the Platform After Eviction §Edge Cases

Requirement: If a failure rooted in the platform’s export tooling or data hosting prevents a clean export, the operator must pause that tenant’s retention-window countdown until a clean export can be produced. Failures rooted in the capability owner’s own validation steps must not pause the countdown.

Why this is a requirement, not a TR or decision: This is the only carve-out in the eviction UX’s otherwise-hard 30-day rule, and it allocates accountability — the platform absorbs slippage caused by its own defects, never the user’s. That is a business commitment.
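
Read together with BR-11, the commitment behaves like a 30-day clock that ticks only while a clean export is actually possible. A sketch of the bookkeeping, with hypothetical inputs (a real implementation would derive platform_fault_days from the dated record of the export defect):

    from datetime import date, timedelta

    RETENTION = timedelta(days=30)

    def retention_expiry(eviction_date: date, platform_fault_days: int = 0) -> date:
        """Date the export-only, read-only copy disappears: 30 days after
        eviction, extended day-for-day by any platform-side export defect
        (BR-12). Delays in the owner's own validation add nothing."""
        return eviction_date + RETENTION + timedelta(days=platform_fault_days)

    # Example: a 5-day export-tooling outage moves the cutoff from
    # 2026-05-31 to 2026-06-05.
    assert retention_expiry(date(2026, 5, 1), platform_fault_days=5) == date(2026, 6, 5)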

BR-13: Tenants must declare resource needs, packaging form, identity choice, and availability expectations up front

Source: Capability §Business Rules & Constraints · UX: Host a Capability §Constraints Inherited

Requirement: To be hosted, a tenant must arrive packaged in the form the platform accepts, with declared resource needs, an identity-service choice, and acceptance of the platform’s current availability characteristics. The declarations are made in the tech design and reviewed before approval.

Why this is a requirement, not a TR or decision: The capability’s “Tenants must accept the platform’s contract” rule and the host-a-capability UX both make this the price of admission. The BR demands the declaration; what shape “packaging” takes is a downstream decision.

BR-14: Tenant onboarding must require explicit operator authorization

Source: Capability §Triggers & Inputs · UX: Host a Capability §Journey

Requirement: No capability may begin running on the platform without the operator explicitly authorizing it. There must be no self-service onboarding path.

Why this is a requirement, not a TR or decision: The capability lists this as a precondition; the UX’s “approved” comment is its operationalization. It is a control demand, not a technical translation.

BR-15: All capability-owner ↔ platform engagement must occur through a single, recorded, asynchronous issue-thread workflow

Source: UX: Host a Capability §Journey · UX: Migrate Existing Data §Journey · UX: Operator-Initiated Tenant Update §Journey · UX: Platform-Contract-Change Rollout §Journey · UX: Move Off the Platform After Eviction §Entry Point

Requirement: Every operator/capability-owner exchange (onboarding, modification, migration, platform-driven update, contract change, eviction) must occur on a single recorded, asynchronous-by-default issue thread that both parties can read and append to. There must be no self-service portal and no other front door, and exchanges must not happen over ephemeral channels (chat, email-only, voice) where the trail is lost.

Why this is a requirement, not a TR or decision: The UXes demand the properties — single thread, recorded, asynchronous, no other front door — without naming a tracker. Choosing a specific tracker (e.g. GitHub Issues) is a downstream decision recorded in an ADR, not here.

BR-16: Issue types must distinguish review scopes legibly

Source: UX: Host a Capability §Journey · UX: Migrate Existing Data §Journey · UX: Operator-Initiated Tenant Update §Journey · UX: Platform-Contract-Change Rollout §Journey

Requirement: Distinct issue types must exist for the distinct conversations the platform has: onboarding a capability, modifying a hosted capability, migrating data, operator-initiated forced updates, platform contract changes, and eviction. The type itself signals the operator’s review scope and the capability owner’s expectations.

Why this is a requirement, not a TR or decision: Each UX names its issue type explicitly and explains why it is distinct. It is a coordination demand on the platform, not a tooling decision.

BR-17: The platform contract must be evergreen for already-hosted tenants

Source: UX: Host a Capability §Journey · UX: Platform-Contract-Change Rollout §Constraints Inherited

Requirement: A capability owner must not be required to re-accept the platform’s contract on each modify request. Changes to the platform’s contract are the platform’s responsibility to communicate ahead of time and migrate tenants through; the contract is never sprung on a tenant during a modify request.

Why this is a requirement, not a TR or decision: The host-a-capability UX states the evergreen property; the contract-change rollout UX is its operationalization. It is a promise to the user, not a technical translation.

BR-18: Capability owners must be able to update tenant needs after onboarding via a delta-only review

Source: UX: Host a Capability §Journey

Requirement: Once a tenant is live, its capability owner must be able to file a modify request that the operator reviews scoped to the delta only — not as a full re-evaluation of the tenant.

Why this is a requirement, not a TR or decision: The host-a-capability UX’s change-later loop is built around this property. It is a user-experience demand on the modify path, not a tooling choice.

BR-19: Platform must work with tenants on fall-behind cases rather than evict

Source: Capability §Business Rules & Constraints · UX: Operator-Initiated Tenant Update §Journey

Requirement: When a tenant’s components have fallen behind what the platform supports, the default operator response must be to bring the tenant current rather than evict. Eviction in fall-behind cases must occur only as a downstream consequence of a missed operative delivery date, never as the first response.

Why this is a requirement, not a TR or decision: The capability’s eviction rule carves out this behavior explicitly; the operator-initiated-tenant-update UX is the carve-out’s operationalization. It is a user-facing commitment, not a technical translation.

BR-20: Forced-update issues must carry the external reason and inherited deadline

Source: UX: Operator-Initiated Tenant Update §Journey

Requirement: When the operator opens a platform update required issue, it must name the external pressure forcing the change (vendor sunset, CVE, EOL) and the deadline inherited from that pressure. Each forcing event gets its own issue, even when the same tenant is hit by multiple events at once.

Why this is a requirement, not a TR or decision: The UX makes both properties explicit and motivates them — the capability owner needs to see why and by when. It is a transparency demand.

BR-21: Extensions to inherited deadlines must be bounded by the external pressure’s safe slack

Source: UX: Operator-Initiated Tenant Update §Journey

Requirement: When a capability owner cannot ship within an inherited deadline, any extension granted must be sized to the slack the external pressure actually allows — never invented by the operator independent of that pressure. If the pressure leaves no safe slack, extensions must be refused.

Why this is a requirement, not a TR or decision: The UX explicitly states this constraint. It is a control on operator discretion to prevent the platform from absorbing risk it cannot honestly carry.

BR-22: A missed operative delivery date must result in a separate, linked eviction issue

Source: UX: Operator-Initiated Tenant Update §Journey

Requirement: When a capability owner misses the operative date for a forced update (the original inherited deadline or an agreed extension), the operator must open a separate eviction issue linked back to the update issue, and close the update issue as superseded. Eviction must not be re-policed inside the update flow.

Why this is a requirement, not a TR or decision: The UX prescribes this exact split, with the rationale that update-flow scope and eviction-flow scope must remain distinct. It is a coordination demand on the platform.

BR-23: Operator-driven contract changes must be communicated ahead of time via a single umbrella issue

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: When the operator chooses to retire an offering, change a packaging form, or alter availability characteristics, the change must be announced via a single umbrella issue tagging every affected capability owner, containing what is changing, what it is changing to, the deadline, the reason, and the migration guideline (where applicable).

Why this is a requirement, not a TR or decision: The contract-change rollout UX names this shape explicitly and motivates the umbrella over per-tenant issues. It is the operationalization of the evergreen promise.

BR-24: Contract-change deadlines must give every affected tenant at least two status-update cycles

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: When the operator picks a contract-change deadline, it must allow every affected tenant at least two full status-update cycles before cutoff — one to acknowledge and start, one to finish or surface blockers with time still to respond.

Why this is a requirement, not a TR or decision: The UX states this minimum explicitly. It is a fairness commitment that bounds operator discretion.
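
Assuming the status-update cadence is expressed in days (the rollout UX sizes cadence to the timeline), the bound reduces to simple arithmetic; the function name is hypothetical:

    from datetime import date, timedelta

    def earliest_valid_deadline(announced: date, cadence_days: int) -> date:
        """Smallest deadline the operator may pick: two full status-update
        cycles after the umbrella issue is announced, one cycle to
        acknowledge and start, one to finish or surface blockers in time."""
        return announced + timedelta(days=2 * cadence_days)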

BR-25: Contract-change deadlines must not be negotiable per-tenant

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: Capability owners must not be able to negotiate per-tenant slips of a contract-change deadline. The deadline applies uniformly; only a global extension (covering every affected tenant) is available, and only when the migration guideline itself proves insufficient.

Why this is a requirement, not a TR or decision: The UX makes this rule absolute. It enforces that the deadline remains a deadline rather than degrading into a per-tenant negotiation.

BR-26: Capability owners must explicitly acknowledge contract-change umbrella issues

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: Each tagged capability owner on a contract-change umbrella issue must explicitly acknowledge the change in-thread. Silence in a multi-tenant thread is treated as non-engagement and feeds the same laggard branch as failing to migrate.

Why this is a requirement, not a TR or decision: The UX states the acknowledgment requirement and its consequence. It is an engagement contract, not a technical mechanism.

BR-27: During contract-change rollout, old and new forms must run concurrently when a replacement exists

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: When a contract change replaces an old offering with a new one, the platform must serve both forms concurrently throughout the rollout window. Full offering removals (no replacement) are exempt — the change is all-or-nothing at the deadline.

Why this is a requirement, not a TR or decision: The UX prescribes the concurrent rollout window and names the carve-out. It is a user-facing commitment that gives tenants room to migrate at their own pace.

BR-28: Replacement offerings must be implemented and running before a contract-change umbrella issue is filed

Source: UX: Platform-Contract-Change Rollout §Entry Point

Requirement: Where a contract change replaces an old offering with a new one, the replacement must already be implemented and running on the platform alongside the old one before the umbrella issue is filed. Tenants must never be asked to migrate against an unbuilt replacement.

Why this is a requirement, not a TR or decision: The UX states this as a precondition of the journey. It is a quality-of-rollout commitment.

BR-29: Operator must post regular status updates throughout a contract-change rollout

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: During a contract-change rollout the operator must post status updates on a regular schedule (cadence sized to the timeline), with the current snapshot in the umbrella issue body and each scheduled update also as a comment. Each update must report how many tenants are still on the old form, how many have migrated, which modify issues are open, and how much time remains.

Why this is a requirement, not a TR or decision: The UX prescribes both the cadence shape and the metrics. It is a transparency demand on rollout coordination.
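
The required content is enumerable, so the per-cycle snapshot is mechanical to render. A sketch, with all field names hypothetical:

    def rollout_status(tenants: dict, days_remaining: int) -> str:
        """Render the snapshot for the umbrella issue body and the
        scheduled comment: laggard count, migrated count, open modify
        issues, and time left."""
        migrated = [name for name, t in tenants.items() if t["migrated"]]
        on_old_form = [name for name in tenants if name not in migrated]
        open_issues = [t["modify_issue"] for t in tenants.values()
                       if t.get("modify_issue") and not t["migrated"]]
        return (f"{len(on_old_form)} tenant(s) still on the old form, "
                f"{len(migrated)} migrated; open modify issues: {open_issues or 'none'}; "
                f"{days_remaining} day(s) to the deadline")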

BR-30: At the contract-change deadline, the old form must be removed and laggards must transition to eviction

Source: UX: Platform-Contract-Change Rollout §Journey

Requirement: On the contract-change deadline, the old form must be removed regardless of remaining tenants on it. For each tenant that has not migrated, the operator must open a separate eviction issue (linked to the umbrella) and the umbrella must close. No tenant may be silently broken on a removed offering.

Why this is a requirement, not a TR or decision: The UX prescribes this exact closeout behavior. It is the inverse of the evergreen promise — once communicated and given time, the deadline is real.

BR-31: Eviction must be allowed when needs and capabilities fundamentally diverge

Source: Capability §Business Rules & Constraints · UX: Move Off the Platform After Eviction §Persona

Requirement: The platform must be able to decline continued hosting for a tenant whose requirements it cannot meet — specialized hardware, regulatory constraints, an availability target stronger than the platform offers. Eviction is initiated by the operator, not the capability owner.

Why this is a requirement, not a TR or decision: The capability defines this rule and the eviction UX operationalizes it. It is a control demand on what the platform is allowed to refuse.

BR-32: Eviction-date negotiation must occur upstream of the eviction journey

Source: UX: Move Off the Platform After Eviction §Entry Point

Requirement: By the time the eviction issue is filed, the eviction date must already be agreed and not subject to renegotiation inside the eviction journey. The 30-day post-eviction retention is the only post-date slack and is fixed.

Why this is a requirement, not a TR or decision: The UX states this as a hard wall. It is a coordination commitment that protects both parties from re-litigation.

BR-33: Eviction issues must contain the date, the reason, and a link to export tooling with documentation

Source: UX: Move Off the Platform After Eviction §Entry Point

Requirement: An eviction issue must carry exactly the eviction date, the reason for eviction, and a link to the export tool with documentation describing how to use it and the export shape. Nothing else is required of the issue.

Why this is a requirement, not a TR or decision: The UX names these contents and treats the issue as self-sufficient. It is a content commitment to the departing user.

BR-34: Tenant compute and network must be torn down on the eviction date

Source: UX: Move Off the Platform After Eviction §Journey

Requirement: On the eviction date, compute and network for the tenant must be torn down. Tenant data then enters the export-only, read-only state covered by BR-11 for the duration of the retention window — no further writes by anyone.

Why this is a requirement, not a TR or decision: The UX prescribes this teardown distinctly from the data-state guarantee. Compute/network teardown is what makes the dataset stable for export; the read-only data window itself is BR-11’s commitment, referenced here rather than restated.

BR-35: Capability owner — not the platform — must notify their own end users of eviction

Status: Removed on 2026-04-28 — absorbed into BR-06.

The platform-no-communication-with-end-users rule was already absolute in BR-06; the eviction-context clarifier and the capability-owner-notifies content are now folded into BR-06 directly. Number retained per the doc’s append-only rule so existing TR citations (if any) stay valid.

BR-36: Platform must offer a one-shot migration-process runner for capability-owner-supplied jobs

Source: UX: Migrate Existing Data §Goal · UX: Migrate Existing Data §Constraints Inherited

Requirement: The platform must provide a runner for one-time migration jobs that the capability owner writes and packages. The platform runs the process; it does not write, debug, or shepherd it.

Why this is a requirement, not a TR or decision: The migration UX is built around this offering and is explicit about the seam — platform runs, owner authors. It is a service commitment that bounds platform responsibility.

BR-37: Migration jobs must be packaged in the same form as any other tenant component

Source: UX: Migrate Existing Data §Constraints Inherited

Requirement: The BR-13 packaging requirement applies to migration jobs without exception — the contract must not relax for migration. A process that cannot be packaged in the form the platform accepts cannot be run by the platform.

Why this is a requirement, not a TR or decision: The UX states this no-carve-out constraint explicitly. BR-13 establishes the packaging form; this BR forbids relaxing it for the migration case, which is the only place a relaxation might plausibly be argued for.

BR-38: Migration jobs must declare their re-run contract and any temporary spikes up front

Source: UX: Migrate Existing Data §Journey

Requirement: A migration request must declare whether the process is safe to re-run against an already-populated destination (or requires a wiped destination), and any temporary migration-only spike beyond the tenant’s steady-state footprint. Approval of spikes is bounded by what the platform can accommodate.

Why this is a requirement, not a TR or decision: The UX names both declarations as part of the operator’s review scope. It is a content demand on what tenants must communicate, not a technical translation.

BR-39: Migration peak footprint must not exceed twice the destination tenant’s steady-state footprint

Source: UX: Migrate Existing Data §Journey

Requirement: The peak temporary footprint of a migration (steady-state plus declared spike) must be no more than 2× the destination tenant’s steady-state compute and storage. If either dimension exceeds that threshold, the request must be rejected as written; the capability owner is asked to split, reduce, or resize the tenant first.

Why this is a requirement, not a TR or decision: The UX states the 2× limit as a hard review rule. It bounds the burden one tenant’s migration may place on the platform.
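
The review rule is a per-dimension comparison of declared numbers; a sketch, with hypothetical dimension names:

    def migration_within_budget(steady: dict, spike: dict) -> bool:
        """Peak footprint (steady state plus declared spike) must stay
        within 2x steady state on every dimension, independently."""
        return all(steady[dim] + spike.get(dim, 0) <= 2 * steady[dim]
                   for dim in steady)

    # At the limit: a migration that temporarily doubles storage passes;
    # anything beyond that is rejected as written.
    assert migration_within_budget(steady={"cpu_cores": 2, "storage_gb": 100},
                                   spike={"storage_gb": 100})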

BR-40: Concurrent migrations across tenants must be supported

Source: UX: Migrate Existing Data §Journey

Requirement: The platform must support multiple migrations running at once across different tenants without changing each tenant’s experience of their own journey. Tenants must not expect exclusive use of the migration runner.

Why this is a requirement, not a TR or decision: The UX specifies this property explicitly. It is a capacity commitment that prevents migrations from serializing.

BR-41: Recovery from migration failure must follow the capability owner’s plan, not a platform-prescribed model

Source: UX: Migrate Existing Data §Journey

Requirement: When a migration job fails or its output fails validation, the next step must be whatever plan the capability owner provides (wipe-and-retry, resume, accept partial, abandon). The platform must not auto-clean, auto-retry, or prescribe a recovery model.

Why this is a requirement, not a TR or decision: The UX places the recovery decision squarely with the data owner. It is an allocation-of-responsibility commitment.

BR-42: A migration job artifact must be torn down on completion

Source: UX: Migrate Existing Data §Journey

Requirement: Once a migration job is closed (successful or abandoned), the platform must tear down the job. The platform must not retain it; re-running later means filing a fresh migration issue.

Why this is a requirement, not a TR or decision: The UX states the one-shot lifespan and tear-down explicitly. It is a lifecycle commitment that prevents migrations from accumulating into unmanaged state.

BR-43: Platform must provide secret management for tenant-supplied credentials referenced by their components

Source: UX: Migrate Existing Data §Journey

Requirement: The platform must offer a secret-management surface that capability owners can populate independently of the operator, so their components and migration processes can reference credentials by name without leaking the secrets through engagement-thread comments.

Why this is a requirement, not a TR or decision: The migration UX assumes such an offering and operationalizes its use. It is a capability-level demand for handling credentials safely.
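
The demanded shape is reference-by-name: a component declares a secret's name, the platform resolves it at run time, and the value never transits the issue thread. A sketch under those assumptions; the secret:// scheme and the store object are hypothetical:

    import re

    SECRET_REF = re.compile(r"^secret://(?P<name>[\w-]+)$")

    def resolve_secrets(env: dict, store) -> dict:
        """Swap secret://NAME references for values from the
        capability-owner-populated secret store; everything else
        passes through untouched."""
        resolved = {}
        for key, value in env.items():
            match = SECRET_REF.match(str(value))
            resolved[key] = store.get(match["name"]) if match else value
        return resolved

    # A migration job's environment might declare
    #     {"SOURCE_DB_URL": "secret://legacy-db-url"}
    # so the credential itself never appears in an issue comment.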

BR-44: Each tenant must be provided compute, persistent storage, network reachability, identity, backup/DR, and observability

Source: Capability §Outputs & Deliverables

Requirement: For each hosted tenant, the platform must provide compute (a place for the application to run), persistent storage durable to the platform’s defined standard, network reachability both internal and external, identity and authentication for the tenant’s end users, backup and disaster recovery for tenant data, and observability that lets the operator and capability owner tell whether the tenant is healthy.

Why this is a requirement, not a TR or decision: This is the capability’s stated direct outputs. It is the inventory of what every tenant must receive, named in business terms. The identity entry has a tenant-choice carve-out captured separately in BR-46 (BYO identity); this BR commits to availability of the inventory, not to the platform being the sole source of identity.

BR-45: Platform-provided identity service must support the “lost credentials cannot be recovered” property

Source: Capability §Business Rules & Constraints · UX: Host a Capability §Constraints Inherited

Requirement: Any identity option the platform offers to tenants must be capable of honoring a Signal-style “lost credentials cannot be recovered” property. An identity option that cannot honor this property is not eligible to be the platform-provided identity service.

Why this is a requirement, not a TR or decision: The capability rule names this property explicitly because at least one tenant requires it. It is a forced constraint on the identity offering, not a vendor selection.

BR-46: Tenants must be able to bring their own identity if they choose

Source: Capability §Outputs & Deliverables · Capability §Triggers & Inputs

Requirement: Tenants must have the option to bring their own identity service rather than use the platform-provided one. Their decision is recorded in their tech design, not at onboarding time.

Why this is a requirement, not a TR or decision: The capability lists BYO identity as a tenant choice and the host-a-capability UX confirms it is recorded upstream of onboarding. It is a flexibility commitment.

BR-47: Platform must be rebuildable to “ready to host tenants” within 1 hour

Source: Capability §Success Criteria & KPIs · UX: Stand Up the Platform §Goal

Requirement: Starting from no platform at all (with definitions repo and root-level access in hand), the platform must be rebuildable to a ready-to-host-tenants state within 1 hour. The KPI is a target — exceeding it does not block the platform from going into service, but it must be tracked as a follow-up.

Why this is a requirement, not a TR or decision: This is the stated Reproducibility KPI of the capability and the standup UX is the journey it is measured against. It is a business commitment to recovery speed.

BR-48: Rebuild readiness must be validated end-to-end by a purpose-built canary tenant

Source: UX: Stand Up the Platform §Journey

Requirement: Standup must conclude with the deployment, exercise, and teardown of a purpose-built canary tenant maintained alongside the platform definitions. “Ready to host tenants” must be demonstrated by hosting a tenant — not declared from infrastructure self-checks alone.

Why this is a requirement, not a TR or decision: The standup UX defines this as the binding readiness signal. It is a confidence demand that infrastructure-only checks would not satisfy.

BR-49: Each rebuild phase must support clean teardown of partial state

Source: UX: Stand Up the Platform §Edge Cases · UX: Stand Up the Platform §Constraints Inherited

Requirement: Every phase of the rebuild must be reversible — “delete everything provisioned so far and start over” must be a viable, reliable option at every checkpoint. Partial state must not be trusted across a phase failure.

Why this is a requirement, not a TR or decision: The standup UX prescribes this property as part of “phase fails → tear down everything and restart.” It is a reliability demand on the rebuild flow.

BR-50: A reproducibility drill must run after every significant platform change and at least quarterly

Source: UX: Stand Up the Platform §Entry Point · UX: Stand Up the Platform §Constraints Inherited

Requirement: The reproducibility KPI must be honestly evaluated by running a parallel rebuild drill on scratch infrastructure after every significant platform change (any change that would alter what is rebuilt, what must be validated, or what must be trusted) and at least quarterly while the live platform keeps serving.

Why this is a requirement, not a TR or decision: The standup UX names this cadence explicitly as the integrity check on the KPI. It is a discipline commitment, not a tool.

BR-51: Platform must enforce tracked changes and immutability so drift can be detected before rebuild

Source: UX: Stand Up the Platform §Entry Point · UX: Stand Up the Platform §Constraints Inherited

Requirement: Every platform UX that can introduce platform state must enforce tracked changes and immutability. The standup journey must perform a preflight drift check whenever prior platform state exists; the check must pass (no unexplained differences) before rebuild begins.

Why this is a requirement, not a TR or decision: The standup UX prescribes the preflight drift check and is explicit that drift must be prevented and detected outside the rebuild flow. It is an integrity commitment.
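
A preflight check of this shape compares a snapshot of live state against what the definitions say should exist and refuses to start the rebuild on any unexplained difference. A deliberately tool-agnostic sketch (a real platform would more likely lean on its IaC tool's plan/diff; every name here is hypothetical):

    def preflight_drift_check(defined: dict, live: dict) -> None:
        """Gate on entry to the rebuild: refuse to proceed if live state
        and the definitions disagree in any unexplained way."""
        missing = defined.keys() - live.keys()      # defined but absent
        untracked = live.keys() - defined.keys()    # present but not in definitions
        changed = {k for k in defined.keys() & live.keys() if defined[k] != live[k]}
        drift = missing | untracked | changed
        if drift:
            raise RuntimeError(f"Drift detected, rebuild refused: {sorted(drift)}")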

BR-52: Rebuild must span both public and private infrastructure as part of foundations

Source: Capability §Business Rules & Constraints · UX: Stand Up the Platform §Journey

Requirement: The platform may span public-cloud and home-lab infrastructure, and rebuild must establish the foundations — including connectivity between the two — as part of the standard standup flow. Cross-environment connectivity is foundational, not an afterthought.

Why this is a requirement, not a TR or decision: The capability allows the span; the standup UX makes Phase 1 explicitly cross-environment. It is a scope demand on what the rebuild must produce.

BR-53: Tenant-facing observability must include a platform-standard health bundle

Source: UX: Tenant-Facing Observability §Journey

Requirement: Each capability owner with a live tenant must receive, automatically, a tenant-scoped view of a platform-standard health bundle: availability, latency, error rate, resource saturation, and restart/deployment events. Capability owners must not have to instrument their own capability to see these signals.

Why this is a requirement, not a TR or decision: The observability UX defines the bundle and names automatic provisioning. It is a content commitment of the observability offering.

BR-54: Capability owners must be able to self-serve their own alert thresholds

Source: UX: Tenant-Facing Observability §Journey

Requirement: Within the observability offering, the capability owner must be able to tune the thresholds at which alerts are sent to them, without operator involvement. The platform must not prescribe what counts as unhealthy enough to alert on.

Why this is a requirement, not a TR or decision: The observability UX names this as the one self-service surface and motivates it as a maintenance-budget pressure-relief. It is an authority demand — the user decides their own alerting.

BR-55: Platform must push alerts to capability owners when their thresholds are crossed

Source: UX: Tenant-Facing Observability §Journey

Requirement: When a tenant signal crosses a capability-owner-set threshold, the platform must send an alert to the capability owner that names which signal and which capability. The alert path is a best-effort nudge, not the source of truth.

Why this is a requirement, not a TR or decision: The observability UX names email as the current channel but the BR captures the demand (push alerts on threshold crossings, name the signal and capability). The channel is a downstream decision.
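
The push demand reduces to comparing current signals against the owner-set bounds and naming the signal and capability in the message. A sketch reusing the hypothetical threshold shape from the observability UX above:

    def evaluate_thresholds(capability: str, signals: dict, thresholds: dict, send_alert) -> None:
        """Best-effort push path: for each signal crossing its owner-set
        bound, send an alert naming the signal and the capability."""
        for name, rule in thresholds.items():
            value = signals.get(name)
            if value is None:
                continue
            crossed = (("above" in rule and value > rule["above"])
                       or ("below" in rule and value < rule["below"]))
            if crossed:
                send_alert(f"[{capability}] {name} crossed its threshold: {value}")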

BR-56: Tenant view must indicate degraded alert delivery when known

Source: UX: Tenant-Facing Observability §Journey · UX: Tenant-Facing Observability §Edge Cases

Requirement: When the observability offering knows its alert delivery to a tenant is degraded, the tenant view must surface that fact, so silence from the alert path is not mistaken for evidence of health. The pull view must remain authoritative for current health.

Why this is a requirement, not a TR or decision: The UX states this property explicitly and motivates it as a trust commitment. It is a transparency demand.

BR-57: Tenant observability access must be scoped to the tenant; cross-tenant visibility is operator-only

Source: UX: Tenant-Facing Observability §Entry Point · UX: Tenant-Facing Observability §Constraints Inherited

Requirement: A capability owner authenticated to the observability offering must land directly in their own tenant’s view and stay confined there for the rest of the session. There must be no mode-switch that broadens scope; only the operator sees across tenants.

Why this is a requirement, not a TR or decision: The UX names this isolation property and the capability’s operator-only rule reinforces it. It is a confidentiality commitment.

BR-58: Tenant observability access must be provisioned automatically as part of onboarding

Source: UX: Host a Capability §Journey · UX: Tenant-Facing Observability §Entry Point

Requirement: A capability owner whose onboarding has closed must already have a working login to the observability offering and a wired alert-delivery address — without filing a separate request. Observability is part of being hosted.

Why this is a requirement, not a TR or decision: Both UXes assume this is true at the moment a tenant is live. It is an integration commitment between onboarding and observability.

BR-59: Routine operator maintenance must remain within 2 hours per week

Source: Capability §Success Criteria & KPIs

Requirement: The total routine operation of the platform — across all hosted tenants and platform-internal work — must take no more than 2 hours per week of the operator’s time. If maintenance regularly exceeds this, the platform must be simplified, not grown.

Why this is a requirement, not a TR or decision: This is the Operator maintenance budget KPI. It is a hard upper bound on what the platform may demand of its operator and is referenced in nearly every UX as a pressure constraint.

BR-60: A tenant whose accommodation would push routine maintenance sustainably above twice the maintenance budget must be evictable

Source: Capability §Business Rules & Constraints · UX: Operator-Initiated Tenant Update §Journey · UX: Migrate Existing Data §Constraints Inherited · UX: Platform-Contract-Change Rollout §Constraints Inherited

Requirement: When continuing to accommodate a tenant would push routine maintenance sustainably above 2× the maintenance budget, or break reproducibility (require manual snowflake configuration), the platform must be able to evict that tenant. Either condition alone must be sufficient grounds.

Why this is a requirement, not a TR or decision: The capability defines the eviction threshold and several UXes name it as the operative trigger. It is the control that prevents the maintenance budget from being eroded indefinitely.

BR-61: Tenant adoption must be measured against implemented capabilities, with explicit-loss capture

Source: Capability §Success Criteria & KPIs · UX: Host a Capability §Edge Cases

Requirement: Adoption is measured by counting only implemented capabilities (deployed and serving end users in production) — defined-only or designed-only capabilities are neutral. An implemented capability that runs elsewhere counts negatively, and a tenant lost because the operator went silent must be recorded explicitly on the issue rather than being silently dropped.

Why this is a requirement, not a TR or decision: The capability defines the KPI’s mechanic; the host-a-capability UX defines the explicit-loss recording. It is a measurement-discipline commitment.

BR-62: Operating cost must remain proportional to delivered convenience and resiliency

Source: Capability §Success Criteria & KPIs

Requirement: Total operating cost must remain within what the operator considers acceptable given the convenience and resiliency the platform delivers. There is no fixed dollar target; the test is whether the operator would still choose to run the platform knowing the bill.

Why this is a requirement, not a TR or decision: This is the Cost stays proportional to value KPI. It is a business commitment that bounds investment without prescribing a number.

BR-63: Buy-vs-build trade-offs must be judged on convenience, resiliency, and cost only

Source: Capability §Business Rules & Constraints

Requirement: When the platform decides between buying and building a component, the inputs must be convenience, resiliency, and cost. Operator skill development must not influence the trade-off; “I want to learn this” is not, on its own, a valid reason to choose build over buy.

Why this is a requirement, not a TR or decision: The capability rule explicitly forbids skill-development as an input. It is a control on decision-making, not a technical translation.

BR-64: When a tenant needs something the platform does not yet provide, the default response must be to evolve the platform

Source: Capability §Business Rules & Constraints · UX: Host a Capability §Journey · UX: Migrate Existing Data §Constraints Inherited

Requirement: When a tenant capability requires something the platform does not yet offer, the default response must be to update the platform to provide it — bounded by the reproducibility and maintenance KPIs. The platform must not push the requirement back onto the tenant as a first response, but is not obligated to grow without bound.

Why this is a requirement, not a TR or decision: The capability defines this rule and the host-a-capability UX’s “new offering needed” branch operationalizes it. It is the rule that keeps the platform tenant-aligned without making it infinitely extensible.

BR-65: All tenant data — including platform-held backups — must be permanently deleted at the end of the 30-day post-eviction retention window

Source: UX: Move Off the Platform After Eviction §Journey · UX: Move Off the Platform After Eviction §Success

Requirement: When the 30-day post-eviction retention window ends, the platform must permanently delete the tenant’s data across every tier it controls — the tenant-accessible export-only copy and any deeper backup-tier copies. No residual platform-held copy of an evicted tenant’s data may survive day 30 in any tier, and no operator-only access path may persist past that point.

Why this is a requirement, not a TR or decision: BR-11 commits only to the tenant-accessible-copy side; this BR closes the symmetric question for the platform’s own backup tier. It is a privacy commitment to the departing tenant — the eviction date plus 30 days is the last day any platform-controlled copy of their data exists, full stop. The BR forbids residual copies as a class; how backups are pruned (retention-policy mechanics, deletion verification) is a downstream concern.

Open Questions

  • Volunteered-but-parked technical translations. None volunteered during this extraction. Placeholder so re-extractions have a home for things like specific cadences, durability levels, or protocols that surface during conversation.

3 - Technical Requirements

Technical requirements derived from the Self-Hosted Application Platform capability’s business requirements, with the capability and UX docs as supporting context. Each TR cites the BR-NN it derives from. Decisions belong in ADRs, not here.

Living document. This is regenerated from business-requirements.md (and the capability/UX docs) on demand. Numbering is append-only — once a TR is assigned, it keeps that number forever, even if removed (mark removed ones explicitly). ADRs cite TR-NN, so renumbering would silently break provenance.

Review gate. Set reviewed_at: in the frontmatter to today’s ISO date once you have read and edited this document. The plan-adrs skill will refuse to enumerate decisions until reviewed_at is newer than the file’s last modification.

Parent capability: Self-Hosted Application Platform · Business requirements: business-requirements.md

How to read this

Each TR is forced — by a BR (the primary case), by a prior shared ADR, or by a repo-wide constraint. It says what the technical solution must do, not how. Decisions about how (which database, which protocol, which library) belong in adrs/, not here. If something in this list reads like a chosen solution rather than a constraint, flag it for review. If something has no BR or inherited-constraint source, raise a missing BR back to extract-business-requirements.

Requirements

TR-01: Platform state must be entirely expressible as version-controlled definitions

Source: BR-02 · BR-51 · UX: Stand Up the Platform §Constraints Inherited

Requirement: Every piece of platform runtime state — each offering, every per-tenant binding, every shared piece of configuration the platform depends on — must be expressible in a tracked-changes definitions repository. Anything modifiable outside that repository is drift, and any UX that introduces platform state must route through the same recorded-change surface.

Why this is a TR, not a BR or decision: BR-02 demands reproducibility; BR-51 demands tracked changes and immutability so drift can be detected. The technical translation is that the definitions repository is the only authoritative surface for platform-modifying writes. Which repository, which tracked-changes mechanism, and which immutability discipline are downstream decisions.

TR-02: Platform must expose a single top-level rebuild entry point that runs end-to-end from definitions in ≤60 minutes

Source: BR-02 · BR-47 · UX: Stand Up the Platform §Journey

Requirement: A single operator-invocable entry point must drive the rebuild from a fresh pull of the definitions repository, sequence the foundations → core services → cross-cutting → canary phases automatically, and be capable of completing within 60 minutes of wall-clock time on the target infrastructure when run end-to-end. Manual checkpoints between phases are permitted; manual driving of each step is not.

Why this is a TR, not a BR or decision: BR-47 is the 1-hour rebuild target; BR-02 is the demand that rebuild is from definitions only. The TR is the operative property — one entry point, automated, time-bounded — without naming a specific automation tool, language, or orchestrator.

TR-03: Rebuild Phase 1 must establish foundations across both public-cloud and home-lab environments and the connectivity between them

Source: BR-52 · UX: Stand Up the Platform §Journey

Requirement: The first rebuild phase must provision the public-cloud-side and home-lab-side foundations and the cross-environment connectivity between them, before any later phase proceeds. Single-environment standup (public-only or home-lab-only) is not a supported rebuild outcome.

Why this is a TR, not a BR or decision: BR-52 asserts the platform may span both environments and that connectivity is part of foundations, not an afterthought. The TR forces foundations-phase scope to include both sides plus the link; choosing the specific cloud, home-lab hardware, or tunnel mechanism is downstream.

TR-04: Each rebuild phase must support a deterministic, definitions-driven teardown of all state it produced

Source: BR-49 · UX: Stand Up the Platform §Edge Cases

Requirement: Every rebuild phase must expose a deterministic, definitions-driven teardown that removes every resource the phase produced, callable at every checkpoint. “Delete everything provisioned so far and start over” must be a viable, reliable option at each phase boundary. Partial state must not be carried across a phase failure into the next phase.

Why this is a TR, not a BR or decision: BR-49 demands that partial state never be trusted across a phase failure. The TR is the operative property — every phase has a clean teardown — without prescribing a teardown mechanism.
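
To make the interlock between TR-02's phase sequencing and this teardown obligation concrete, here is a minimal illustrative sketch in Python; every name is hypothetical, and the actual orchestrator, unwind policy, and checkpoint handling are downstream decisions:

```python
# Illustrative only: a rebuild driver in which every phase pairs an
# `apply` with a deterministic, definitions-driven `teardown`, so a
# failure never carries partial state into the next phase.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    apply: Callable[[], None]      # provision this phase from definitions
    teardown: Callable[[], None]   # remove every resource this phase produced

def rebuild(phases: list[Phase]) -> None:
    completed: list[Phase] = []
    for phase in phases:
        try:
            phase.apply()
            completed.append(phase)
        except Exception:
            # "Delete everything provisioned so far and start over":
            # unwind the failed phase, then earlier phases in reverse.
            for done in [phase, *reversed(completed)]:
                done.teardown()
            raise
```

The sketch shows the full-unwind path; since teardown must be callable at every checkpoint, an operator could equally stop at a phase boundary and unwind from there.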

TR-05: Rebuild flow must perform a preflight drift check that fails closed when prior platform state exists and unexplained differences remain

Source: BR-51 · UX: Stand Up the Platform §Entry Point

Requirement: Before any rebuild begins, the platform must compare current platform state against a last-known-good reference and refuse to proceed if unexplained differences remain. On a first-ever build the check is vacuously satisfied; in every other case the check must pass before later phases run.

Why this is a TR, not a BR or decision: BR-51 demands that drift be detected outside the rebuild flow rather than discovered partway through it. The TR makes the preflight check a property of the rebuild entry point. The mechanism by which “last-known-good reference” is captured and compared is a downstream decision.
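
A minimal sketch of the fail-closed shape, purely for illustration; the identifiers are hypothetical, and the form of the last-known-good reference remains the downstream decision noted above:

```python
# Illustrative only: refuse to rebuild while unexplained differences exist
# between observed platform state and a last-known-good reference.
def preflight_drift_check(
    observed: dict[str, str] | None,         # fingerprints of current state; None on first-ever build
    last_known_good: dict[str, str] | None,  # reference fingerprints; None on first-ever build
) -> None:
    if observed is None and last_known_good is None:
        return  # first-ever build: the check is vacuously satisfied
    if observed is None or last_known_good is None:
        raise RuntimeError("prior platform state exists but cannot be compared; refusing to rebuild")
    drifted = {key for key in observed.keys() | last_known_good.keys()
               if observed.get(key) != last_known_good.get(key)}
    if drifted:
        raise RuntimeError(f"unexplained drift, rebuild blocked: {sorted(drifted)}")
```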

TR-06: Rebuild flow must be runnable on parallel/scratch infrastructure without affecting the live platform

Source: BR-50 · UX: Stand Up the Platform §Entry Point

Requirement: The same definitions and the same rebuild entry point must be invocable against scratch infrastructure, supporting drills after every significant change and at least quarterly, without touching live platform state. Drill mode and live mode must differ only in the underlying target.

Why this is a TR, not a BR or decision: BR-50 demands honest evaluation of the reproducibility KPI via parallel rebuilds. The TR forces drill-vs-live parity at the entry-point level; how target selection is parameterized is a downstream decision.

TR-07: Platform must include a purpose-built canary tenant maintained alongside the definitions, used as the rebuild’s binding readiness signal

Source: BR-48 · UX: Stand Up the Platform §Journey

Requirement: A canary tenant must be maintained alongside the platform definitions and must be deployed, exercised end-to-end against every platform-provided service, and torn down by the rebuild’s final phase. Readiness must not be declared on infrastructure self-checks alone; the canary’s pass/fail is the binding signal.

Why this is a TR, not a BR or decision: BR-48 names the canary as the readiness mechanism. The TR makes the canary a first-class artifact of the platform definitions. What the canary’s workload looks like, and which signals it must produce, are downstream decisions.

TR-08: Platform must accept tenant components — including migration jobs — only in a single pre-declared packaging form, with no carve-outs

Source: BR-13 · BR-37 · UX: Migrate Existing Data §Constraints Inherited

Requirement: Exactly one packaging form is admissible to the platform. Any tenant component, including migration job artifacts, must arrive in that form to be runnable; the migration path must not relax it. Components that cannot be packaged this way cannot run on the platform.

Why this is a TR, not a BR or decision: BR-13 commits the platform to a packaging form; BR-37 forbids relaxing it for migration. The TR forces single-form admission; the actual form (container image, OCI bundle, archive, etc.) is an ADR.

TR-09: Onboarding must require machine-readable declarations of resource needs, packaged artifact, identity choice, and availability acceptance before any provisioning is possible

Source: BR-13 · BR-46 · UX: Host a Capability §Constraints Inherited

Requirement: A tenant onboarding submission must, before provisioning is possible, include machine-readable declarations of (a) the tenant’s resource needs (compute, storage, network), (b) the packaged artifact in the platform’s accepted form, (c) the identity choice — platform-provided or BYO — recorded in the tech design, and (d) acceptance of the platform’s current availability characteristics. Approval binds the runtime to those declarations.

Why this is a TR, not a BR or decision: BR-13 names the four declarations as the price of admission; BR-46 makes the identity choice one of them. The TR makes the declarations a hard precondition of the provisioning gate; the schema and review surface are downstream decisions.
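
A sketch of the declarations as a hard precondition of the provisioning gate, with all field names hypothetical; the real schema and review surface remain the downstream decisions named above:

```python
# Illustrative only: provisioning is impossible until every declaration
# is present and the availability contract is accepted (TR-09, TR-10).
from dataclasses import dataclass

@dataclass(frozen=True)
class OnboardingDeclarations:
    compute: str                 # e.g. "2 vCPU / 4 GiB"
    storage_gib: int
    network: tuple[str, ...]     # required internal/external reachability
    artifact_ref: str            # packaged artifact in the platform's single accepted form
    identity_choice: str         # "platform-provided" or "byo"
    availability_accepted: bool  # acceptance of current availability characteristics

def may_provision(d: OnboardingDeclarations) -> bool:
    return (bool(d.compute) and d.storage_gib > 0 and bool(d.artifact_ref)
            and d.identity_choice in {"platform-provided", "byo"}
            and d.availability_accepted)
```

Approval binding the runtime to the declarations then means the provisioned tenant is derived from this record and nothing else.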

TR-10: Tenant runtime must not be provisioned without an explicit operator-issued authorization signal tied to the onboarding artifact

Source: BR-14 · UX: Host a Capability §Journey

Requirement: Provisioning of a new tenant runtime must be gated on a per-tenant authorization signal issued by the operator’s identity and bound to the specific onboarding submission. There must be no provisioning path that bypasses this gate, and there must be no self-service onboarding path.

Why this is a TR, not a BR or decision: BR-14 forbids self-onboarding and demands explicit authorization. The TR forces a control point on the provisioning surface; how the authorization signal is represented (issue comment, signed approval, etc.) is downstream.

TR-11: Onboarding flow must support a “new offering needed” hold and resume without requiring the capability owner to refile

Source: BR-64 · UX: Host a Capability §Journey

Requirement: When an onboarding tenant requires an offering the platform does not yet provide, the onboarding record must be holdable in a pending state and resumable from that point once the offering is added — without the capability owner refiling, restarting, or re-accepting the contract. The hold is bounded by the reproducibility (TR-02) and maintenance-budget (TR-54) limits.

Why this is a TR, not a BR or decision: BR-64 makes platform evolution the default response to a tenant need; the host-a-capability UX names the hold as the operationalization. The TR forces the hold-and-resume property on the onboarding flow without prescribing how the pending state is represented.

TR-12: BYO-identity declarations must produce a tenant runtime with no platform-side binding to the platform-provided identity offering

Source: BR-46 · UX: Host a Capability §Constraints Inherited

Requirement: A tenant whose declaration is “BYO identity” must be provisionable without the platform-provided identity offering being wired in for end-user authentication. The platform’s responsibility for that tenant’s identity is limited to network reachability to the chosen external identity service.

Why this is a TR, not a BR or decision: BR-46 commits the platform to BYO as a real option. The TR forces the provisioning flow to honor the choice without coupling tenants to the platform-provided service; which external services are reachable is downstream.

TR-13: Platform-provided identity offering must support a no-recovery credential property

Source: BR-45 · Capability §Business Rules

Requirement: The platform-provided identity offering must, per tenant electing it, support a configuration where no actor (operator, platform, third-party vendor) can recover a lost end-user credential — no reset email, no recovery code, no admin override. An identity option that cannot honor this configuration is not eligible to be the platform-provided service.

Why this is a TR, not a BR or decision: BR-45 forces the property because at least one tenant requires it. The TR is the operative constraint on the offering’s surface. Which identity service is selected to satisfy it is an ADR.

TR-14: All platform administrative interfaces must reject any principal other than the operator

Source: BR-05 · Capability §Business Rules

Requirement: Every platform-administrative surface — provisioning, deprovisioning, contract change, eviction issuance, secret rotation, drift reconciliation, etc. — must authenticate the caller as the operator’s identity (or, once the sealed successor credentials have been invoked, the successor’s). No delegated-administrator role, co-operator role, or shared admin credential exists.

Why this is a TR, not a BR or decision: BR-05 makes operator-only operation absolute. The TR closes the surface at the authentication layer rather than restating the rule. Which authentication mechanism is chosen is downstream.

TR-15: Platform offerings must expose no end-user-addressable surface

Source: BR-06 · UX: Move Off the Platform After Eviction §Constraints Inherited

Requirement: Platform offerings — observability, secret management, export tool, identity, migration runner, etc. — must expose no UI, API endpoint, or notification channel addressable by tenant end users. Authenticated principals on platform offerings are limited to operator and capability-owner roles; communication to end users about tenant lifecycle (including eviction) is the capability owner’s responsibility, not the platform’s.

Why this is a TR, not a BR or decision: BR-06 forbids any end-user surface on the platform. The TR translates this into the platform’s per-offering surface design. What the operator and capability-owner surfaces actually look like is downstream.

TR-16: Platform must hold a sealed/escrowed successor credential set sufficient to assume full operator authority

Source: BR-07 · Capability §Business Rules · UX: Stand Up the Platform §Persona

Requirement: A sealed credential set must exist that, when invoked by the designated successor, grants full operator authority — including running the rebuild flow and exercising every administrative interface covered by TR-14. The seal must be unsealable by the successor without participation from the primary operator. Routine operations must not exercise these credentials.

Why this is a TR, not a BR or decision: BR-07 forces successor capability and the seal-vs-routine distinction. The TR forces the credential-set property. The specific seal mechanism (password manager handoff, physical envelope, escrow service) is an ADR.

TR-17: Each tenant must receive provisioned compute, persistent storage, internal/external network reachability, identity (or BYO binding), backup/DR, and observability — implemented as shared platform offerings

Source: BR-04 · BR-44 · Capability §Outputs

Requirement: For every approved tenant, the platform must provision and operate, for the tenant’s lifetime, the full inventory: compute, persistent storage, internal and external network reachability, identity binding (platform-provided or per TR-12 BYO), backup with disaster recovery for tenant data, and observability. Each must be implemented as a shared platform offering consumed by every tenant — not duplicated per tenant.

Why this is a TR, not a BR or decision: BR-44 lists the inventory; BR-04 demands platform-level investments accrue to every tenant. The TR fixes the inventory and the shared-offering shape. Specific durability levels, network protocols, and backup retention windows are downstream.

TR-18: Third-party components admissible to the platform must allow control of configuration, data export, and credential revocation/rotation without vendor cooperation

Source: BR-03 · Capability §Business Rules

Requirement: Any third-party component the platform integrates must allow the operator to (a) read and modify configuration through the platform’s tracked-changes surface, (b) export platform-held data in a portable form, and (c) revoke or rotate platform-held credentials without vendor cooperation. Components that fail any of these are not admissible.

Why this is a TR, not a BR or decision: BR-03 forbids vendor lock-in that prevents departure. The TR turns “self-hosted” into a per-component admissibility test. Which vendors are chosen is downstream.

TR-19: All operator/capability-owner engagement must occur on a single durable, append-only, ordered thread per lifecycle event, accessible asynchronously

Source: BR-15 · UX: Host a Capability §Journey

Requirement: Every operator/capability-owner exchange (onboarding, modify, migration, forced update, contract change, eviction) must occur on exactly one durable engagement thread per event, append-only and ordered, accessible asynchronously to both parties, with the full history preserved. Ephemeral channels (chat, voice, email-only) are not acceptable as the channel of record.

Why this is a TR, not a BR or decision: BR-15 demands single-thread, recorded, asynchronous engagement. The TR fixes the channel properties without choosing a tracker.

TR-20: Engagement channel must distinguish onboarding, modify, migration, forced-update, contract-change, and eviction at the type level

Source: BR-16 · BR-22

Requirement: The engagement channel must support categorization that legibly separates onboarding, modification, data migration, operator-initiated forced update, platform-contract change, and eviction. Issue types must not collapse review scopes; an eviction triggered by a missed forced-update or contract-change deadline must always be a separate, linked issue from the issue that motivated it.

Why this is a TR, not a BR or decision: BR-16 demands distinct issue types per review scope; BR-22 forbids re-policing eviction inside the update flow. The TR is the typing constraint on the engagement surface; the type names and tracker semantics are downstream.

TR-21: Modify-request review must surface only the delta from the tenant’s currently-accepted declarations

Source: BR-17 · BR-18 · UX: Host a Capability §Journey

Requirement: The modify flow must support reviewing the proposed delta from the tenant’s currently-accepted declarations, without requiring the capability owner to re-accept the platform contract or the operator to re-evaluate the tenant’s full prior state.

Why this is a TR, not a BR or decision: BR-17 makes the contract evergreen; BR-18 makes modify review delta-only. The TR forces the property on the modify-review surface; how the delta is computed and displayed is downstream.

TR-22: Forced-update issues must record external pressure name and inherited deadline; one issue per forcing event per affected tenant

Source: BR-19 · BR-20 · UX: Operator-Initiated Tenant Update §Journey

Requirement: The forced-update issue type must require fields for (a) the external pressure forcing the change (vendor sunset, CVE, EOL) and (b) the deadline inherited from that pressure. When the same tenant is hit by multiple unrelated forcing events at once, each event must produce its own issue, even when remediation overlaps. Forced-update issues must remain open across multiple artifact handoffs and not progress toward eviction until the operative delivery date is missed.

Why this is a TR, not a BR or decision: BR-20 demands the two fields and the per-event split; BR-19 forbids early eviction. The TR fixes the issue-type schema and the lifecycle property; what tracker enforces the schema is downstream.

TR-23: Forced-update flow must record the inherited deadline, any extended operative date, and the extension’s external-slack justification, with both dates queryable by the eviction trigger

Source: BR-21 · UX: Operator-Initiated Tenant Update §Journey

Requirement: The forced-update issue must record the original inherited deadline, any negotiated extended operative date, and the external slack that justifies any extension. Both dates must be queryable as inputs to the eviction trigger. Extensions exceeding the named safe slack must be refused; if the external pressure leaves no safe slack, no extension is offered.

Why this is a TR, not a BR or decision: BR-21 bounds extensions by external slack. The TR makes the bounds machine-checkable. The shape of the slack record is downstream.
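
A sketch of the date record and the slack-bounded extension rule the two TRs above force, with hypothetical field names; the tracker that stores and queries the record is downstream:

```python
# Illustrative only: both dates stay queryable for the eviction trigger,
# and extensions beyond the externally-justified safe slack are refused.
from dataclasses import dataclass
from datetime import date

@dataclass
class ForcedUpdateRecord:
    pressure: str                      # e.g. "vendor sunset", "CVE-2026-0001"
    inherited_deadline: date           # deadline inherited from the external pressure
    extended_deadline: date | None = None
    slack_justification: str | None = None

    def operative_date(self) -> date:
        return self.extended_deadline or self.inherited_deadline

    def extend(self, to: date, safe_slack_until: date, justification: str) -> None:
        if to > safe_slack_until:
            # No safe slack, or not enough: no extension is offered.
            raise ValueError("extension exceeds the externally-justified safe slack")
        self.extended_deadline = to
        self.slack_justification = justification
```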

TR-24: Contract-change rollout must be initiated as a single multi-recipient umbrella issue carrying change, replacement, deadline, reason, and migration guideline

Source: BR-23 · UX: Platform-Contract-Change Rollout §Journey

Requirement: The umbrella-issue type must support a single artifact tagging every affected capability owner and carrying (a) what is changing, (b) what it is changing to (or that it is being removed), (c) the deadline, (d) the reason, and (e) the migration guideline where applicable. Per-tenant fanout for the rollout coordination itself is forbidden in this flow.

Why this is a TR, not a BR or decision: BR-23 names the umbrella shape and its contents. The TR fixes the issue-type schema; how multi-recipient tagging is implemented is downstream.

TR-25: Contract-change deadlines must be at least 2× the chosen status-update cadence after filing

Source: BR-24 · UX: Platform-Contract-Change Rollout §Journey

Requirement: The umbrella issue must record both the deadline and the operator-chosen status-update cadence. The interval between filing and the deadline must be no less than two full cadence cycles. Combinations of cadence and deadline that violate this must be rejected by the rollout flow.

Why this is a TR, not a BR or decision: BR-24 demands at least two status cycles before cutoff. The TR makes the relationship machine-checkable; the cadence values themselves are operator decisions per rollout.
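
The relationship is small enough to state as a check; a sketch, with the cadence and dates as hypothetical inputs:

```python
# Illustrative only: the filing-to-deadline interval must cover at least
# two full status-update cycles, whatever cadence the operator chooses.
from datetime import date, timedelta

def validate_rollout_dates(filed: date, deadline: date, cadence: timedelta) -> None:
    if deadline - filed < 2 * cadence:
        raise ValueError("deadline leaves fewer than two status-update cycles; "
                         "pick a later deadline or a shorter cadence")
```

For example, a weekly cadence forces a deadline at least 14 days after filing.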

TR-26: Contract-change deadline must be a single global value; per-tenant overrides are not supported

Source: BR-25 · UX: Platform-Contract-Change Rollout §Journey

Requirement: The umbrella issue type must store exactly one deadline applicable uniformly to all tagged tenants. There is no schema for per-tenant deadline overrides; only a global extension covering every tagged tenant may modify the deadline value.

Why this is a TR, not a BR or decision: BR-25 forbids per-tenant slips. The TR closes off the schema-level path to one. How global extensions are reflected in-thread is downstream.

TR-27: Umbrella issues must track per-tenant acknowledgment state; at deadline, the rollout flow must atomically remove the old form, close migrated modify issues, file linked eviction issues per laggard, and close the umbrella

Source: BR-26 · BR-30 · UX: Platform-Contract-Change Rollout §Journey

Requirement: Each tagged tenant must have an acknowledgment state on the umbrella issue. At the deadline, the rollout flow must atomically (a) remove the old offering from the platform regardless of remaining occupants, (b) close the migrated tenants’ modify issues in the normal way, (c) file a separate eviction issue per laggard tenant (including non-acknowledgers) linked to the umbrella, and (d) close the umbrella. No tenant may be silently broken on a removed offering.

Why this is a TR, not a BR or decision: BR-26 demands explicit acknowledgment; BR-30 prescribes the deadline closeout shape. The TR consolidates them into the rollout flow’s atomic close behavior; how atomicity is achieved is downstream.
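
A sketch of the closeout's shape only (not its atomicity mechanism, which stays downstream); every name is hypothetical, and the platform and tracker actions are passed in as callables to keep the sketch self-contained:

```python
# Illustrative only: the four deadline actions run as one unit, and a
# non-acknowledging tenant is treated as a laggard, never silently broken.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TaggedTenant:
    name: str
    migrated: bool = False  # acknowledged and moved to the replacement

@dataclass
class UmbrellaIssue:
    old_offering: str
    tagged: list[TaggedTenant]
    linked_evictions: list[str] = field(default_factory=list)
    closed: bool = False

def close_out(u: UmbrellaIssue,
              remove_offering: Callable[[str], None],
              close_modify: Callable[[TaggedTenant], None],
              file_eviction: Callable[[TaggedTenant], str]) -> None:
    remove_offering(u.old_offering)                      # (a) regardless of occupants
    for t in u.tagged:
        if t.migrated:
            close_modify(t)                              # (b) normal close
        else:
            u.linked_evictions.append(file_eviction(t))  # (c) separate, linked issue per laggard
    u.closed = True                                      # (d) close the umbrella
```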

TR-28: Replacement offering must already be a live, hosted offering on the platform before the umbrella issue may be filed

Source: BR-28 · UX: Platform-Contract-Change Rollout §Entry Point

Requirement: The contract-change flow must refuse to file an umbrella issue when (a) the change replaces an old offering with a new one and (b) the replacement is not yet a live, hosted offering on the platform. Full-removal contract changes (no replacement) are exempt from this gate.

Why this is a TR, not a BR or decision: BR-28 makes the precondition absolute. The TR makes it a filing gate; how “live” is verified is downstream.

TR-29: Platform must support running an old offering and its replacement concurrently for the rollout window when a replacement exists

Source: BR-27 · UX: Platform-Contract-Change Rollout §Journey

Requirement: For replacement-style contract changes, the platform must support tenants running on the old offering and the new offering simultaneously throughout the rollout window. The old offering must be removable on the deadline regardless of remaining occupants. Full-removal changes (no replacement) are exempt.

Why this is a TR, not a BR or decision: BR-27 commits to concurrent rollout windows. The TR forces the dual-form runtime property; whether concurrency is achieved by side-by-side instances, traffic splitting, or other means is downstream.

TR-30: Rollout view must produce, on the operator-chosen cadence, both a refreshed in-issue snapshot and a thread comment carrying tenants-on-old, tenants-migrated, open modifies, and time-remaining

Source: BR-29 · UX: Platform-Contract-Change Rollout §Journey

Requirement: On the operator-chosen cadence, the contract-change rollout flow must (a) refresh the umbrella issue body with the current snapshot — tenants on the old form, tenants migrated, open modify issues, time remaining — and (b) post a thread comment carrying the same metrics. Both surfaces must be present so a reader landing cold and a watcher tracking history see consistent rollout state.

Why this is a TR, not a BR or decision: BR-29 demands both the live snapshot in the issue body and a historical-comment trail. The TR fixes the dual-surface property and the metric set; how the snapshot is computed and rendered is downstream.

TR-31: Eviction issuance must be operator-only; eviction issues must be locked to their date at filing; required content is exactly date, reason, and link to export-tool documentation

Source: BR-31 · BR-32 · BR-33 · UX: Move Off the Platform After Eviction §Entry Point

Requirement: Filing an eviction issue must be restricted to the operator role; capability owners must have no path to initiate eviction. The eviction date must be set at filing and must not be mutable by either party afterward. Required content is exactly (a) eviction date, (b) reason, (c) link to the export-tool documentation; no other field is required.

Why this is a TR, not a BR or decision: BR-31, BR-32, and BR-33 together fix the issue’s authorship, immutability, and contents. The TR consolidates all three into a single constraint on the eviction-issue schema and authorization gate; the issue-type implementation is downstream.

TR-32: On the eviction date, tenant compute and network reachability must be deprovisioned and tenant data must transition to a read-only, export-only state

Source: BR-34 · UX: Move Off the Platform After Eviction §Journey

Requirement: On the eviction date, the platform must (a) deprovision the tenant’s compute and network reachability so end users can no longer reach it, and (b) transition tenant data to a read-only state in which no actor — including the capability owner, the operator, and tenant components — can write to it, while the export tool continues to function.

Why this is a TR, not a BR or decision: BR-34 prescribes the day-zero state transition. The TR is the operative property; whether read-only is enforced via permissions, snapshots, immutable storage, or another mechanism is downstream.

TR-33: Tenant data must remain readable via the export tool for 30 days post-eviction; on day 30, all tenant data must be permanently deleted across every storage tier the platform controls, with deletion verifiable to the operator

Source: BR-11 · BR-65 · UX: Move Off the Platform After Eviction §Journey

Requirement: For 30 days after the eviction date, tenant data must remain accessible only through the export tool. On day 30 the platform must permanently delete the tenant’s data across every storage tier it controls — the tenant-accessible export-only copy and any deeper backup-tier copies — with the deletion verifiable to the operator. No residual platform-controlled copy of an evicted tenant’s data may survive day 30 in any tier, and no operator-only access path may persist past that point.

Why this is a TR, not a BR or decision: BR-11 commits to the tenant-accessible side; BR-65 closes the symmetric question for backup-tier copies. The TR consolidates both into the cross-tier deletion property; how deletion is performed and verified per tier is downstream.

TR-34: Per-tenant 30-day retention countdown must be operator-pausable, with the pause distinguishing platform-side defects from capability-owner-side issues

Source: BR-12 · UX: Move Off the Platform After Eviction §Edge Cases

Requirement: The post-eviction 30-day retention countdown must be operator-pausable per tenant. The pause/resume action must record which class triggered it — platform-side defect (pauses the clock) or capability-owner-side issue (does not pause) — and the action must be auditable. Resumption must continue the countdown from the remaining retention window, not restart the full 30 days.

Why this is a TR, not a BR or decision: BR-12 carves out exactly this pause behavior and allocates accountability. The TR makes the pause a controllable property of the retention-clock surface; the audit-record format is downstream.
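
A sketch of a countdown with exactly this pause semantics, under hypothetical names; the audit-record format stays downstream:

```python
# Illustrative only: only a platform-side defect stops the clock, every
# action is recorded, and resumption preserves the remaining window.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RetentionClock:
    deletion_due: date                       # eviction date + 30 days at filing
    paused_on: date | None = None
    audit: list[str] = field(default_factory=list)

    def pause(self, today: date, trigger: str) -> None:
        self.audit.append(f"{today}: pause requested, trigger={trigger}")
        if trigger == "platform-defect":     # owner-side issues do not pause
            self.paused_on = today

    def resume(self, today: date) -> None:
        if self.paused_on is not None:
            self.deletion_due += today - self.paused_on  # remaining window, not a fresh 30 days
            self.audit.append(f"{today}: resumed, deletion due {self.deletion_due}")
            self.paused_on = None
```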

TR-35: Platform must expose a per-tenant export tool callable without operator participation throughout the tenant’s hosted lifetime and the post-eviction retention window

Source: BR-08 · Capability §Business Rules · UX: Move Off the Platform After Eviction §Journey

Requirement: The platform must expose, per tenant, an export-tool invocation that produces a portable archive of that tenant’s data. The invocation must be available throughout the tenant’s hosted lifetime and across the 30-day post-eviction retention window without operator participation, and must be re-invocable on demand any number of times. The platform need not retain previously-generated archives between invocations.

Why this is a TR, not a BR or decision: BR-08 forces the on-demand, no-operator-needed export property. The TR fixes the invocation surface and re-invokability without prescribing an archive format.

TR-36: Each export must be accompanied by a platform-produced content checksum/hash and total byte count

Source: BR-10 · UX: Move Off the Platform After Eviction §Journey

Requirement: Every export artifact produced by the platform must be paired with (a) a content checksum or hash and (b) a total byte count, both produced by the platform at export time and delivered alongside the artifact. Semantic correctness validation remains the capability owner’s responsibility; the platform’s verification is bounded to the integrity envelope.

Why this is a TR, not a BR or decision: BR-10 forces the verification envelope. The TR fixes the two integrity outputs; which hash function is chosen is an ADR.
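
The envelope itself is simple enough to sketch; SHA-256 appears here purely as a placeholder for whatever hash the ADR selects:

```python
# Illustrative only: stream the export artifact once, producing the
# content hash and the total byte count delivered alongside it.
import hashlib

def integrity_envelope(path: str) -> tuple[str, int]:
    digest = hashlib.sha256()
    total_bytes = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
            total_bytes += len(chunk)
    return digest.hexdigest(), total_bytes
```

Publishing the byte count next to the hash lets a capability owner detect a truncated download before bothering to hash it.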

TR-37: Tenant admission must verify that an export-tool path covers every data shape the tenant will introduce

Source: BR-09 · UX: Move Off the Platform After Eviction §Edge Cases

Requirement: Tenant admission must verify that an export-tool path covers every data shape the tenant will introduce. A gap in export-tooling coverage must be treated as a platform defect that blocks admission until closed; admission may not proceed on the assumption that the gap can be filled later.

Why this is a TR, not a BR or decision: BR-09 forbids gaps in export coverage at eviction time. The TR moves the verification earlier — into admission — so eviction never discovers a gap. How coverage is enumerated and verified is downstream.

TR-38: Platform must offer a one-shot job runner distinct from long-running tenant components, with progress visible through standard observability

Source: BR-36 · UX: Migrate Existing Data §Journey

Requirement: The platform must offer a job-runner offering — distinct from the long-running tenant component runtime — that executes a packaged artifact end-to-end against a single tenant, exposes progress through the platform’s standard observability surfaces, and is bounded in lifetime by a single migration request. The platform runs the job; it does not write, debug, or shepherd it.

Why this is a TR, not a BR or decision: BR-36 commits to a one-shot job-runner offering. The TR fixes the offering’s separation from long-running runtime and the progress-visibility property; the runner’s implementation is downstream.

TR-39: Migration requests must declare re-run safety and any temporary-spike footprint up front

Source: BR-38 · UX: Migrate Existing Data §Journey

Requirement: A migration request must, at filing, declare (a) whether the migration process is safe to re-run against an already-populated destination tenant or requires a wiped destination, and (b) any temporary footprint spike beyond the destination tenant’s steady-state. Approval is bounded by available platform capacity for the declared spike.

Why this is a TR, not a BR or decision: BR-38 names both declarations as part of the operator’s review scope. The TR fixes the migration-issue schema; the schema’s representation is downstream.

TR-40: Migration approval must reject any request whose declared peak (steady-state plus spike) exceeds 2× the destination tenant’s steady-state in either compute or storage

Source: BR-39 · UX: Migrate Existing Data §Journey

Requirement: The migration review flow must reject — without negotiation — any request where steady-state plus declared spike exceeds 2× the destination tenant’s steady-state compute or storage. Resolution requires the capability owner to split the migration, reduce the spike, or resize the tenant first via the modify flow.

Why this is a TR, not a BR or decision: BR-39 makes the 2× cap a hard review rule. The TR makes the rule machine-checkable in the review surface; how steady-state is measured is downstream.
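
A sketch of the cap as the machine-checkable rule it is, applied independently per dimension; the measurement of steady-state stays downstream:

```python
# Illustrative only: reject when steady-state plus declared spike exceeds
# 2x steady-state, checked separately for compute and for storage.
def within_migration_cap(steady_state: float, declared_spike: float,
                         cap: float = 2.0) -> bool:
    return steady_state + declared_spike <= cap * steady_state

# e.g. a tenant at 100 GiB steady-state declaring a 120 GiB spike is
# rejected outright (220 > 200); the sanctioned fixes are splitting the
# migration, shrinking the spike, or resizing via the modify flow first.
```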

TR-41: Migration runner must support concurrent migrations across distinct tenants without serialization or per-tenant exclusivity

Source: BR-40 · UX: Migrate Existing Data §Journey

Requirement: The migration runner must support multiple migrations running concurrently across different tenants without serializing them or coupling their progress. Tenants must not depend on exclusive use of the runner for their own migration to proceed.

Why this is a TR, not a BR or decision: BR-40 commits to concurrent migrations. The TR forces the no-serialization property; how concurrency is implemented (shared infrastructure, per-tenant isolation, queueing) is downstream.

TR-42: Migration runner must not auto-clean, auto-retry, or auto-progress on job failure; subsequent action must be operator-driven against an explicit capability-owner plan

Source: BR-41 · UX: Migrate Existing Data §Journey

Requirement: On migration job failure or invalid output, the runner must hold the tenant data in whatever state the failed job left it. The next action — wipe-and-retry, resume, accept partial, abandon — must be operator-driven against a plan the capability owner provides on the issue. The platform must not auto-clean, auto-retry, or auto-prescribe a recovery model.

Why this is a TR, not a BR or decision: BR-41 places the recovery decision squarely with the data owner. The TR forbids the runner from acting on its own; the plan-record format is downstream.

TR-43: Migration runner must deprovision job artifacts on issue closure; re-running requires fresh job creation

Source: BR-42 · UX: Migrate Existing Data §Journey

Requirement: On migration issue closure (success or abandonment), the runner must remove all per-job artifacts. Subsequent re-runs must require a fresh migration issue and fresh approval; the platform must not retain a migration job past closure.

Why this is a TR, not a BR or decision: BR-42 fixes the one-shot lifespan. The TR makes the teardown an obligation of the closure flow; what counts as a “per-job artifact” is bounded by the runner’s design.

TR-44: Platform must offer a secret-management surface populated by capability owners and consumed by their components, with secret values not readable by any non-consuming party

Source: BR-43 · UX: Migrate Existing Data §Journey

Requirement: The platform must offer a secret-management surface where capability owners deposit credentials referenced by name from their tenant components and migration processes. Secret values must not appear in engagement-thread comments or in any operator-facing surface, and must not be readable by any party other than the platform components that consume them on the tenant’s behalf. Population must be doable by the capability owner without operator involvement.

Why this is a TR, not a BR or decision: BR-43 commits to this surface and motivates it as a leak-prevention measure for credentials. The TR makes the secrecy property and capability-owner population first-class. The implementation (key store, secret manager) is an ADR.

TR-45: Tenant-facing observability must expose, automatically per tenant, the platform-standard health bundle: availability, latency, error rate, resource saturation, and restart/deployment events

Source: BR-44 · BR-53 · UX: Tenant-Facing Observability §Journey

Requirement: For each live tenant, observability must surface — without capability-owner instrumentation — at minimum: availability, latency, error rate, resource saturation, and restart/deployment events. The bundle must be present from the moment the tenant goes live and must remain present for the tenant’s lifetime.

Why this is a TR, not a BR or decision: BR-53 defines the bundle’s content and the no-tenant-instrumentation property. The TR fixes both. Specific signal definitions, sample rates, and visualization shape are downstream.

TR-46: Capability owners must be able to mutate their own tenant’s alert thresholds without operator participation; cross-tenant threshold mutation is operator-only

Source: BR-54 · BR-57 · UX: Tenant-Facing Observability §Journey

Requirement: The observability offering must allow each capability owner to mutate alert thresholds for the signals on their own tenant, without operator involvement. Mutation of cross-tenant or platform-wide thresholds must be limited to the operator role.

Why this is a TR, not a BR or decision: BR-54 makes thresholds the one self-service surface for capability owners; BR-57 keeps cross-tenant scope operator-only. The TR forces the role-scoped mutation property on the threshold surface.

TR-47: On threshold crossings, observability must push an alert naming both the signal and the capability

Source: BR-55 · UX: Tenant-Facing Observability §Journey

Requirement: When a tenant signal crosses a capability-owner-set threshold, the observability offering must send an alert to the capability owner’s registered delivery address. The alert payload must name (a) which signal crossed and (b) which capability is affected. The alert path is best-effort; the pull view is authoritative.

Why this is a TR, not a BR or decision: BR-55 commits to threshold-driven push alerts and the content. The TR fixes the property and payload contents; the delivery channel is an ADR.
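
The minimum payload can be sketched directly, with hypothetical names; the delivery channel stays an ADR:

```python
# Illustrative only: the alert must name which signal crossed and which
# capability is affected; everything else is optional enrichment.
from dataclasses import dataclass

@dataclass(frozen=True)
class ThresholdAlert:
    capability: str    # the affected tenant capability
    signal: str        # e.g. "error_rate", "latency_p99"
    observed: float
    threshold: float   # the capability-owner-set value that was crossed

    def subject(self) -> str:
        return (f"[{self.capability}] {self.signal} crossed your threshold: "
                f"{self.observed} vs {self.threshold}")
```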

TR-48: Tenant view must surface alert-delivery health when degradation is detectable, while remaining the authoritative read of current health

Source: BR-56 · UX: Tenant-Facing Observability §Journey

Requirement: When the observability offering detects that alert delivery to a tenant is degraded, the tenant view must surface that fact so silence on the alert path is not interpreted as evidence of health. The pull view must remain authoritative for current health regardless of alert-path state.

Why this is a TR, not a BR or decision: BR-56 demands both the degradation indicator and the pull-authoritative property. The TR fixes them; how degradation is detected is downstream.

TR-49: Authentication to the observability offering must place a capability owner directly in their tenant scope with no mode-switch broadening it

Source: BR-57 · UX: Tenant-Facing Observability §Entry Point

Requirement: A non-operator authenticated session on the observability offering must land in the authenticated capability owner’s tenant scope and remain confined to it for the session’s lifetime. There must be no UI or API path that broadens scope to another tenant or to a cross-tenant view from a non-operator session.

Why this is a TR, not a BR or decision: BR-57 names the isolation property and the operator-only carve-out. The TR forces the session-scope property without choosing an authorization mechanism.

TR-50: Onboarding closure must produce a working observability login and a wired alert-delivery address as part of provisioning

Source: BR-58 · UX: Host a Capability §Journey · UX: Tenant-Facing Observability §Entry Point

Requirement: Closure of an onboarding issue must produce, as part of the same provisioning flow, a working observability login for the capability owner and a wired alert-delivery address. The capability owner must not need to file a separate request to obtain either.

Why this is a TR, not a BR or decision: BR-58 demands automatic provisioning of observability access at onboarding. The TR forces the bundle-with-onboarding property; the specific identity and delivery-channel mechanics are downstream.

TR-51: Onboarding-issue close-out must support a “lost — operator silence” outcome distinct from approved and declined, recorded in-thread

Source: BR-61 · UX: Host a Capability §Edge Cases

Requirement: The onboarding flow must support recording, at issue closure, three distinct terminal outcomes: approved (live tenant), declined (host elsewhere), and lost-to-operator-silence. Each must be a first-class queryable outcome, recorded in-thread, so adoption metrics can distinguish silent-loss from any other failure mode.

Why this is a TR, not a BR or decision: BR-61 prescribes the explicit-loss capture. The TR makes the outcome a first-class artifact rather than free-form text. The label and query surface are downstream.

TR-52: Eviction trigger must be invocable on the basis of either the maintenance-budget condition or the reproducibility-break condition; either alone is sufficient

Source: BR-60 · Capability §Business Rules

Requirement: The eviction-issuance flow must accept either grounds — projected routine maintenance sustainably exceeding 2× the maintenance-budget KPI, or any required snowflake configuration that cannot be expressed as definitions — as sufficient justification. The recorded grounds must be queryable for later review; both conditions need not be present together.

Why this is a TR, not a BR or decision: BR-60 makes either condition independently sufficient. The TR makes the grounds-record property explicit on the eviction surface; how the conditions are measured is downstream.
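
A sketch of the disjunction, with the two grounds reduced to hypothetical inputs; how each condition is measured is the downstream question noted above:

```python
# Illustrative only: either ground alone justifies filing an eviction
# issue; both recorded values stay queryable for later review.
from dataclasses import dataclass

MAINTENANCE_BUDGET_HOURS = 2.0  # the BR-59 weekly budget

@dataclass(frozen=True)
class EvictionGrounds:
    projected_weekly_hours: float    # sustained routine maintenance if the tenant stays
    requires_snowflake_config: bool  # accommodation cannot be expressed as definitions

def eviction_justified(g: EvictionGrounds) -> bool:
    over_budget = g.projected_weekly_hours > 2 * MAINTENANCE_BUDGET_HOURS
    return over_budget or g.requires_snowflake_config
```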

TR-53: Platform must produce queryable per-component cost data sufficient for the operator to judge cost-vs-value

Source: BR-62 · Capability §Success Criteria

Requirement: The platform must produce queryable per-component cost data on a regular cadence, sufficient for the operator to judge whether continuing operation is worth its bill. There is no fixed numeric target; the operator is the judge. Per-tenant attribution where attributable is desirable but not required by this TR.

Why this is a TR, not a BR or decision: BR-62 commits the platform to a cost-judgment surface without naming a target. The TR fixes the queryable-cost property; refresh cadence, granularity, and per-tenant attribution are downstream.

TR-54: Platform’s expected weekly operator-facing work, summed across the currently-hosted tenant set, must be designed to fit within a 2-hour weekly budget

Source: BR-59 · Capability §Success Criteria

Requirement: The set of routine operator-facing surfaces (alert handling, status updates, modify reviews, periodic checks) must be designed such that the platform’s expected weekly operator work, summed across the currently-hosted tenant set, fits within a 2-hour weekly budget. New surfaces whose costs are not predictable enough to bound this way must not be added without redesign or scope reduction.

Why this is a TR, not a BR or decision: BR-59 fixes the maintenance budget. The TR is the design-time obligation that follows: every operator-facing surface is bounded by its share of the 2-hour weekly envelope. How the budgeting is performed is downstream.

TR-55: Platform-managed resources must use the universal resource identifier standard

🗑️ removed on 2026-04-28 — sourced only to ADR-0006, which was deleted when the repo’s existing ADRs were cleared in preparation for the new capability development workflow. Number is reserved and will not be reused.

TR-56: Platform APIs must use the standard API error response format

🗑️ removed on 2026-04-28 — sourced only to ADR-0007, which was deleted when the repo’s existing ADRs were cleared in preparation for the new capability development workflow. Number is reserved and will not be reused.

Open Questions

Things volunteered as solutions during extraction (parked for the ADR stage), or constraints the capability/UX docs don’t yet make explicit.

  • Buy-vs-build decision discipline (BR-63). BR-63 constrains the decision process, not the runtime system — it forbids citing operator-skill development as a justification in buy-vs-build trade-offs. It is not surfaced as a TR because there is no runtime obligation it forces; it is parked here for the ADR stage so per-component selection ADRs cite convenience/resiliency/cost evidence.
  • Cost-data refresh cadence and granularity (TR-53). BR-62 demands cost-vs-value judgment but doesn’t quantify “regular,” “queryable,” or how granular per-component cost must be. Treat as ADR input alongside the observability-offering decisions.
  • Status-update cadence sizing rules (TR-30, TR-25). BR-29 prescribes a regular cadence “sized to the timeline” and BR-24 imposes the ≥2-cycle deadline rule, but neither fixes a procedure for picking the cadence. Treat as ADR input or per-rollout operator guidance.
  • Last-known-good reference for preflight drift check (TR-05). The standup UX names “the live platform or the last known-good environment” as the comparison surface but does not specify the form of “last known-good” (snapshot ID, signed manifest, etc.). Treat as ADR input for the drift-detection design.
  • Topology adoption (TR-17, TR-03). The current repo pattern places an Internet-facing edge layer (with mutual-authentication and traffic-control duties) in front of a private home-lab environment connected to a public-cloud environment through a secure cross-environment tunnel. Whether the platform formally inherits this shape — or selects a different one for the cross-environment foundations — is an ADR decision; this TR doc deliberately does not assume the inherited pattern, and the specific vendors that currently realize each layer are out of scope here.
  • Maintained checklist (UX: Stand Up the Platform). The standup UX references a “maintained checklist” used during phase validation. Its shape is unspecified; capture as ADR input alongside the rebuild-flow design.
  • Public-cloud account vs. home-lab boundary in TR-18. Whether a public-cloud account itself counts as a “third-party component” for the BR-03 admissibility test (read/modify config, export data, revoke credentials without vendor cooperation) is ambiguous; the cloud is named in BR-52 as part of foundations and in the capability rules as allowed. Treat as ADR input when picking specific cloud-side components.