User Experiences

End-to-end user journeys for the Self-Hosted Application Platform capability.

This section documents the user experiences for the Self-Hosted Application Platform capability — the end-to-end journeys taken by the actors named in the parent capability’s Stakeholders, in pursuit of the outcomes the capability promises.

1 - Host a Capability

One-line definition: A capability owner brings a fully-designed capability onto the platform, gets it provisioned and live, and continues to evolve its needs over time.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is a capability owner — one of the people named in the parent capability’s Primary actors. Although the capability doc notes this is currently the operator wearing a different hat, this UX is written as if the capability owner were a separate person from the platform operator. The role boundary is treated as real: there is an interface, a handoff, and a contract between them.

  • Role: Capability owner. They have just finished defining one of the operator’s capabilities — its UX docs and its tech design are both complete. The tech design picked this platform as the host.
  • Context they come from: They are not building the platform; they are a customer of it. They arrive with a capability doc, a tech design that calls out which components must run on the platform, and (ideally, but not strictly required) a mapping from those components to specific platform offerings.
  • What they care about here: Getting their capability running on a controlled, reproducible substrate, declaring its needs once, understanding what they are signing up for, and having a clear path to change those needs later — without onboarding becoming a multi-day project for either side of the handoff.

Goal

“I want my capability running on the platform — with its compute, storage, network, and identity needs declared once, the platform’s contract understood, and a clear path to update those needs later — and I want it to stay running healthily as my capability evolves.”

This is a lifecycle goal, not just an onboarding one: the change-later branch lives in the same journey as the initial onboarding because it shares the same persona, surface, and contract.

Entry Point

The capability owner arrives at this experience having just finished the tech-design phase of their capability. Specifically:

  • Their capability’s UX docs are complete.
  • Their tech design is complete and explicitly designates this platform as the host for the components that need to run somewhere.
  • Their decision about whether to use platform-provided identity or bring their own has already been made and recorded in the tech design itself — it is not a fresh question at onboarding.

What they have in hand: the capability doc and the tech design. Nothing else is required. A tech design that already names specific platform offerings per component is nice; one that doesn’t can have those gaps filled during onboarding.

Their state of mind depends on what they’re asking for:

  • Fully confident if every component in their tech design maps to an offering the platform already provides.
  • Semi-confident if some component requires something the platform may or may not be able to support (e.g. their capability needs GPU compute, which the platform may never provide because no GPUs are installed and buying them is out of scope).

Journey

The capability owner’s journey is a single end-to-end flow with three branches that can occur during operator review (approved as-is, new-offering needed, declined) and one re-entry loop for changing requirements after going live.

1. File an “onboard my capability” issue on GitHub

The capability owner opens an issue against the infra repo using the onboard my capability issue type. GitHub issues are the only channel for engaging the platform — there is no self-service portal and no other front door — and this is the issue type for onboarding. They link or attach the capability doc and the tech design.

What they perceive: the issue is filed, and now they wait. There is no response-time guarantee — this is personal-scale, async by default.
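
Because the front door is just GitHub, filing can be scripted as well as clicked. Below is a minimal sketch in Python against the GitHub REST issues API; the OWNER/infra repo path, the label used to mark the issue type, and the GH_TOKEN environment variable are all assumptions for illustration, not documented platform values.

    import os
    import requests

    # Sketch: file the "onboard my capability" issue against the infra repo.
    # OWNER/infra and the label are placeholders; the platform may model
    # issue types as labels, issue forms, or GitHub issue types.
    resp = requests.post(
        "https://api.github.com/repos/OWNER/infra/issues",
        headers={
            "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": "Onboard my capability: <capability name>",
            "body": "Capability doc: <link>\nTech design: <link>\n\n"
                    "The tech design designates this platform as host; "
                    "component-to-offering mappings are in the design where known.",
            "labels": ["onboard-my-capability"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    print("Filed:", resp.json()["html_url"])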

2. Operator review on the issue

The operator reviews the tech design with a deliberately narrow scope:

  • Does each platform-hosted component align with an existing platform offering?
  • Are there any components that would require a new platform offering to be added?

What the capability owner perceives: clarifying questions appear as comments on the issue, and possibly a meeting if the operator deems it necessary. They answer the questions in-thread.

3. Resolution — one of three branches

3a. Approved as-is. The operator comments “approved” on the issue. That comment is the moment the capability owner knows hosting is real. There is no separate contract-acceptance step at this point: the contract was accepted by virtue of the tech design already conforming to it (declared resource needs, identity choice, packaging, availability expectations).

3b. New offering needed. The operator agrees the right answer is to add a new platform offering to support the capability, and that the offering is still within the platform’s intended scope — meaning the platform can add it while keeping the offering reproducible within the parent capability’s Reproducibility KPI and routine operation within the Operator maintenance budget KPI. The operator does not commit to a timeline. The capability owner waits. While they wait, there is nothing for them to do on their side. Eventually the operator returns and the journey resumes at step 3a.

3c. Declined — host elsewhere. The operator closes the issue with a comment explaining why the request cannot be supported. That can be because it is simply impossible (e.g. the platform will never have GPUs because the hardware cannot be added), or because it is only technically possible and would require the platform to grow into an offering the operator does not want to carry as routine scope — specifically, one the platform could not keep reproducible within the parent capability’s Reproducibility KPI or operate within the Operator maintenance budget KPI. The capability owner now knows this capability has to be hosted somewhere else; the journey ends here.

4. Hand off packaged artifacts

For each component in the tech design that needs to be deployed, the capability owner provides a packaged artifact in the form the platform accepts. The capability owner does the packaging themselves; they do not hand over raw source for the operator to package.

What they perceive: they post or link the artifacts on the issue and wait.

5. Wait while the operator provisions

While the operator is actually wiring up compute, storage, networking, identity, backup, and observability for the new tenant, the capability owner does nothing. They are not pinged for DNS choices or secrets. They simply wait until the operator asks them to test.

6. Test on request

The operator comments asking the capability owner to test the deployed capability. The capability owner exercises it however they would normally validate that their capability works (this is their judgment — the platform doesn’t prescribe a test plan).

  • If something is wrong (the deployment doesn’t work right, networking can’t reach it, an artifact failed to deploy as-given), the capability owner comments on the issue and the two iterate back-and-forth in comments until it works.
  • If everything works, the capability owner says so on the issue.

7. Operator closes the issue

The operator closes the onboarding issue. The capability is now live on the platform.

8. Change-later loop (re-entry)

When the capability owner needs something different — more storage, a new external endpoint, a new component, a routine version bump, retirement of a component — they file a different issue type: modify my capability (distinct from the onboarding type, and the distinction is meaningful to the capability owner because the operator’s review scope differs).

Operator review on a modify issue covers only the delta, not a full re-evaluation. The platform contract is evergreen — the capability owner does not re-accept it on each modification. If the platform’s own contract changes, the operator is responsible for communicating the change ahead of time and migrating existing tenants; it is never sprung on the capability owner during a modify request.

The flow from issue → review → branches → artifact handoff → test → close repeats.
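
Since the issue type carries the routing (it determines the operator's review scope), the vocabulary is worth pinning down. A sketch of the types used by the journeys in this section; the names are illustrative, and the platform may realize them as labels or issue forms rather than literal types:

    from enum import Enum

    class PlatformIssueType(Enum):
        # Capability-owner-initiated:
        ONBOARD_MY_CAPABILITY = "onboard my capability"    # full tech-design review
        MODIFY_MY_CAPABILITY = "modify my capability"      # delta-only review
        MIGRATE_MY_DATA = "migrate my data"                # one-shot job, torn down after
        # Operator-initiated:
        PLATFORM_UPDATE_REQUIRED = "platform update required"  # deadline-backed update
        EVICTION = "eviction"                              # carries the eviction date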

Flow Diagram

flowchart TD
    Start([Tech design complete & names this platform]) --> File[File 'onboard my capability' issue on GitHub]
    File --> Review[Operator reviews tech design:<br/>alignment to offerings + new-offering needs]
    Review --> Decision{Outcome}
    Decision -->|Approved as-is| Approved[Operator comments 'approved']
    Decision -->|New offering needed| Wait[Wait — no timeline guarantee]
    Decision -->|Declined| Decline[Issue closed with explanation —<br/>host elsewhere. Journey ends.]
    Wait --> Approved
    Approved --> Handoff[Capability owner hands off<br/>packaged artifacts on the issue]
    Handoff --> Provision[Wait while operator provisions]
    Provision --> Test[Operator asks capability owner to test]
    Test --> Works{Works?}
    Works -->|No| Iterate[Comment back-and-forth on the issue]
    Iterate --> Test
    Works -->|Yes| Close[Operator closes the issue —<br/>capability is live]
    Close --> Live((Hosted))
    Live -->|Needs change later| Modify[File 'modify my capability' issue]
    Modify --> ReviewDelta[Operator reviews delta only;<br/>contract is evergreen]
    ReviewDelta --> Decision
    Live -->|Operator initiates eviction| Eviction[Operator raises eviction issue<br/>with eviction date — see Edge Cases]

Success

When the onboarding issue closes, the capability owner walks away with:

  • Their capability running on infrastructure they trust to be reproducible and operator-controlled.
  • An operator who knows exactly what they signed up to host — needs were declared in the tech design and reviewed before approval.
  • A known, low-friction path back when needs change: file a modify my capability issue and run the same loop.
  • No surprises: there is no hidden ongoing obligation on their side beyond filing issues for changes.

For change-later iterations, success looks the same in miniature: the delta is reviewed, deployed, tested, and closed without re-litigating the entire capability.

Edge Cases & Failure Modes

  • Test step fails after provisioning. Capability owner sees their capability isn’t working post-deploy. Experience-level handling: the issue stays open and the two iterate via comments until the deployment works. The journey doesn’t reset to the start; it loops between test and operator action.
  • Operator goes silent / issue stalls. There is no response-time guarantee, so some waiting is normal. The signal that the silence has gone on too long is not a timer; it is the capability owner explicitly commenting that they are withdrawing the request and hosting elsewhere because they can no longer wait (or closing the issue saying so). Experience-level handling: that outcome is recorded on the issue itself and counts as a lost tenant against the parent capability’s Tenant adoption KPI. When the operator returns, the response is to acknowledge the loss in-thread and close the issue if it is still open — not to let the thread silently rot.
  • Handed-off artifact is broken or undeployable. Symmetric with the test-fails case: comment back-and-forth on the issue until a working artifact is in place.
  • New offering requested but no commitment. The capability owner’s request to add a new offering is accepted in principle but with no timeline. They wait indefinitely. If they cannot wait, they say so on the issue and host elsewhere; that is a tracked Tenant adoption KPI loss, not invisible churn.
  • Capability is evicted later. This is operator-initiated, not capability-owner-initiated, so it is not a step inside this journey. From the capability owner’s perspective: at some point the operator opens an eviction issue tagging them and naming the eviction date. The capability owner now knows they must move off the platform by that date. Eviction is governed by the parent capability’s Eviction threshold rule (the request would push routine maintenance sustainably above 2× the maintenance budget, or break reproducibility).
  • Operator-driven update because tenant components fell behind. Out of scope for this UX — see Out of Scope.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability’s Business Rules and Success Criteria — by name, so future readers can trace the lineage:

  • Operator-only operation. There is no self-service onboarding flow. The journey’s only engagement surface is a GitHub issue the capability owner files, which the operator personally services. No co-operator or delegated administration appears anywhere in the journey.
  • Tenants must accept the platform’s contract. Contract acceptance is implicit in the tech-design submission: the design declares resource needs, identity choice, packaging form, and availability expectations conforming to the platform’s contract. There is no explicit “I accept” gate — the design is the acceptance.
  • Identity service honors tenant credential-recovery rules. Whichever identity option is named in the capability owner’s tech design must be one the platform actually offers. The platform-provided identity service must be capable of honoring “lost credentials cannot be recovered.” If a capability needs that property and bring-your-own is chosen, it is the capability owner’s responsibility to honor it themselves.
  • Eviction threshold. The operator may raise eviction when routine accommodation would exceed 2× the operator-maintenance-budget KPI or break the reproducibility KPI. This UX surfaces eviction only as an external operator-initiated event affecting the capability owner — see Edge Cases.
  • The capability evolves with its tenants. The “new offering needed” branch in step 3 is the operationalization of this rule: the default response when a tenant needs something the platform doesn’t yet provide is to consider expanding the platform, not to refuse the tenant. But the operator is not obligated to grow the platform without bound. A request is declined once satisfying it would require a new ongoing offering the platform could not keep reproducible within the Reproducibility KPI or operate within the Operator maintenance budget KPI, even if the offering is technically buildable.
  • No specific availability or performance SLA. The journey does not include any negotiation of availability targets — tenants accept whatever the platform’s current implementation offers. A capability owner needing stronger guarantees should not have arrived here (their tech design would have picked a different host).
  • KPI: Tenant adoption. A capability owner who explicitly gives up on onboarding because the operator stayed silent too long is counted as a lost tenant, not waved away as “they changed their mind.” The signal is the GitHub issue itself: they say they are hosting elsewhere because waiting no longer works for them. The response is to leave that loss recorded in-thread and close the issue, so the KPI reflects what actually happened.
  • KPI: 1-hour reproducibility. Implication for this UX: provisioning during step 5 must be done by running the platform’s existing definitions, not by the operator hand-rolling per-tenant snowflake configuration. If onboarding requires bespoke manual config that cannot be captured as definitions, the platform itself has fallen out of compliance with this KPI — and the right response is to update the platform’s definitions, not to tolerate the snowflake.
  • KPI: 2-hr/week operator maintenance budget. Implication for this UX: change-later iterations (step 8) must remain quick enough that running them does not eat the operator’s weekly budget across all hosted tenants. A tenant whose modify requests routinely cost disproportionate operator time crosses into the eviction-threshold rule. The same KPI also bounds the admission of new offerings: “technically possible” is still a decline if the resulting routine platform scope would no longer fit inside this budget.

Out of Scope

  • Data migration of an existing tenant. Bringing data from a prior vendor or local install into the newly-provisioned tenant is a separate UX, not covered here. This UX is strictly about provisioning the capability on the platform.
  • Operator-initiated tenant updates (“your component has fallen behind”). When the operator notices a tenant’s components have aged out of platform support, the operator initiates the conversation — that is a different journey with the operator as the primary actor and the capability owner as the responder. It belongs in its own UX doc.
  • Running-tenant observability for the capability owner. This onboarding journey provisions observability as part of bringing the tenant live, but it does not cover the later “is my thing healthy right now?” monitoring journey itself. That ongoing experience belongs in Tenant-Facing Observability, not here.
  • Platform-side standup or rebuild. The operator standing up the platform from scratch is one of the parent capability’s other triggers, not this UX.
  • The capability owner’s tech-design phase. The decision to use this platform was made before this journey starts. How that decision is made (build vs. buy, host-here vs. host-elsewhere) is a tech-design concern, not a hosting-UX concern.

Open Questions

None at this time.

2 - Migrate Existing Data Into a Newly-Provisioned Tenant

One-line definition: A capability owner whose capability is already onboarded and running on the platform brings their existing end-user data over from the prior host by handing off a one-time migration process for the platform to run.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the same capability owner described in Host a Capability. They are not a different role for this journey — they are mid-lifecycle, having already completed onboarding, and now coming back to deal with one specific concern: their pre-existing data.

  • Role: Capability owner. Their capability is already onboarded and live on the platform — compute, storage, network, identity, observability are all provisioned and running per the closed onboarding issue. The tenant is empty: no end-user data is in it yet.
  • Context they come from: Their capability has historical data living somewhere else — on a vendor (e.g. a hosted Plex provider), on a local install, on a previous self-hosted setup. End users are still on that old host. The capability owner is running the new tenant and the old host concurrently during this period; cutting end users over is their concern, deliberately separate from this UX.
  • What they care about here: Getting their existing data into the new tenant intact, so that when they decide to cut end users over, the new tenant is not a fresh-start regression. They want to do this with a defined, repeatable mechanism rather than ad hoc — the operator’s 2-hr/week maintenance budget depends on migrations not becoming bespoke projects.

Goal

“I want my existing end-user data moved from my old host into my new tenant on the platform — using a migration process I wrote, run by the platform on my behalf — so that when I cut my users over (on my own schedule), the new tenant has everything they expect.”

This is a one-shot goal per migration: when the data has landed and the capability owner has validated it, the migration job is torn down. There is no ongoing sync.

Entry Point

The capability owner arrives at this experience after Host a Capability has fully completed for their tenant — onboarding issue closed, tenant live and empty. They have a parallel, still-running deployment of their capability on a prior host (vendor or self-managed), and they have written a migration process — a one-time job that reads from the prior host and writes into the new tenant via the new tenant’s normal interfaces — packaged in the form the platform accepts (same packaging as any other capability component).

What they have in hand:

  • A reference to the closed onboarding issue (so the destination tenant is unambiguous).
  • A packaged migration process artifact.
  • Credentials needed by the migration process to talk to the old host.
  • A rough sense of resource needs for the migration job (compute, network egress to the old host, expected runtime).

State of mind: pragmatic. They know this is bespoke to their capability — the platform is providing a runner for a process they wrote, not a magic mover.

Journey

1. Register old-host credentials with the platform secret management offering

Before filing the issue, the capability owner registers any credentials their migration process needs (to read from the old host) with the platform’s secret-management offering. The migration process artifact will reference these by name; the secrets themselves do not appear on the issue.

What they perceive: standard usage of the platform’s secret-management offering. This step exists outside the issue and is the capability owner’s responsibility to complete before handoff.

2. File a “migrate my data” issue on GitHub

The capability owner opens an issue against the infra repo using the migrate my data issue type — distinct from onboard my capability and modify my capability because the operator’s review scope and the lifecycle (one-shot, torn down on completion) differ. The issue contains:

  • A link to the closed onboarding issue (identifying the destination tenant).
  • A description of the source (old host, format, rough data volume).
  • The packaged migration process artifact (or a link to it).
  • A declaration of the migration job’s resource needs (compute, storage, network reachability — including egress to the old host), including any temporary migration-only spikes beyond the tenant’s steady-state footprint, and the names of the secrets it expects to read from the platform’s secret-management offering.
  • A declaration of the migration process’s re-run contract: whether it is safe to run against an already-populated destination tenant, or whether the destination must be wiped / empty before each run.

What they perceive: the issue is filed. They wait, async, just like onboarding.
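
The declaration in that issue body is structured data written as prose. A sketch of the same fields as a data structure, purely to make the required shape explicit; none of these names are a platform-defined schema:

    from dataclasses import dataclass
    from enum import Enum

    class RerunContract(Enum):
        SAFE_AGAINST_EXISTING_DATA = "safe"        # retries / top-ups may run as-is
        REQUIRES_EMPTY_DESTINATION = "wipe-first"  # destination must be wiped or empty first

    @dataclass
    class MigrateMyDataRequest:
        onboarding_issue_url: str      # identifies the destination tenant
        source_description: str        # old host, format, rough data volume
        migration_artifact_url: str    # packaged migration process
        secret_names: list[str]        # names only; values live in secret management
        compute_spike: str             # migration-only needs beyond steady state
        storage_spike: str
        egress_targets: list[str]      # reachability needed to the old host
        rerun_contract: RerunContract = RerunContract.REQUIRES_EMPTY_DESTINATION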

3. Operator review on the issue

The operator reviews the migration request with a deliberately narrow scope — the delta the platform is being asked to support for this one-shot job. Specifically, the operator confirms with the capability owner:

  • Resources: the migration’s peak temporary footprint — the destination tenant’s steady-state compute and storage footprint plus any migration-only spike declared on the issue — is no more than 2× that steady-state footprint, and it fits within the platform’s currently available migration-process capacity. If either compute or storage exceeds that threshold, the operator rejects the request as written and asks the capability owner to split the migration into smaller runs, reduce the spike, or resize the tenant first via modify my capability.
  • Network: the migration job has the egress reachability it needs to talk to the old host, and ingress to the destination tenant’s storage interfaces.
  • Credentials: the named secrets are registered and the migration process is wired to read them correctly.
  • Re-run contract: the issue is explicit about whether retries or later top-up migrations can run against existing data, or whether each run requires an empty / wiped destination.

What the capability owner perceives: clarifying questions appear as comments on the issue. They answer in-thread. There is no review of the migration process’s internal logic — that is the capability owner’s domain. The operator is reviewing what the platform must provide to run it, not whether it does the right thing.
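
The resource bullet above is mechanical enough to express as a check. A sketch with illustrative numbers; the unit choices, and the reading that the declared spike must fit within the platform's spare migration capacity, are assumptions:

    # Step-3 resource rule: steady state plus the declared migration-only spike
    # (the peak temporary footprint) must stay within 2x steady state per
    # dimension, and the spike must fit in currently available capacity.
    def migration_fits(steady: dict, spike: dict, available: dict) -> bool:
        for dim in ("compute", "storage"):
            peak = steady[dim] + spike.get(dim, 0)
            if peak > 2 * steady[dim]:                     # the 2x ceiling
                return False
            if spike.get(dim, 0) > available.get(dim, 0):  # platform headroom
                return False
        return True

    # Worked example: 4 vCPU / 100 GB steady state with a 3 vCPU / 80 GB spike
    # peaks at 7 vCPU / 180 GB, under the 8 vCPU / 200 GB ceiling, so it passes
    # review if the platform has that much spare migration capacity.
    print(migration_fits(
        {"compute": 4, "storage": 100},
        {"compute": 3, "storage": 80},
        {"compute": 6, "storage": 120},
    ))  # True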

4. Operator onboards and starts the migration job

Once the review converges, the operator wires up the one-time migration job using the platform’s migration-process offering and starts it. The capability owner does nothing during this step — same as the provisioning step in host-a-capability. They simply wait for the migration job to be running.

Concurrent migrations across different tenants are supported. The capability owner should not expect exclusive use of the migration-process offering; if other tenants are migrating at the same time, their own journey still looks the same.

5. Capability owner observes the running job

While the migration job runs, the capability owner watches it through the platform’s observability — the same observability surface every other platform offering exposes to its tenant. They can see whether the job is making progress, whether it has errored, and whatever signals their migration process emits.

What they perceive: visibility into their own job, on their own time. There is no operator handholding during the run. Long migrations (hours, days) are normal — there is no SLA, just observability.

6. Operator reports the job’s terminal state on the issue

When the migration job finishes — successfully or with an error — the operator reports the terminal state on the issue and asks the capability owner to validate.

7. Resolution — one of two branches

7a. Success — capability owner validates data presence. The capability owner verifies the data landed correctly, per their capability’s own definition of correct (open the app, check counts, spot-check records — their judgment, not the platform’s). When they’re satisfied, they say so on the issue.

7b. Failure — capability owner provides the plan for next steps. If the migration job errored, or if validation reveals the data is incomplete or wrong, the capability owner is responsible for deciding what happens next — because this is their data and their migration process. Possible plans they may propose on the issue:

  • Wipe the destination tenant’s storage and re-run with a fixed migration process (re-handoff a new artifact).
  • Resume from where it failed (only viable if their migration process supports this).
  • Accept the partial state and run a follow-up migration for the remainder.
  • Abandon this migration attempt entirely.

The platform does not prescribe a recovery model. The operator executes whatever next-step plan the capability owner provides, looping back through the appropriate earlier step (re-handoff → re-review → re-run, or just re-run).

8. Operator tears down the migration job and closes the issue

Once the capability owner confirms validation success, the operator tears down the one-time migration job (it is not retained — re-running later means filing a fresh migrate my data issue) and closes the issue.

The new tenant now holds the migrated data. Cutting end users over from the old host to the new tenant is the capability owner’s separate concern, outside this UX.

Flow Diagram

flowchart TD
    Start([Onboarding complete; tenant live & empty]) --> Secrets[Register old-host credentials with<br/>platform secret-management offering]
    Secrets --> File[File 'migrate my data' issue<br/>linking the closed onboarding issue]
    File --> Review[Operator confirms resources,<br/>network, and credentials with CO]
    Review --> Run[Operator onboards and starts<br/>the one-time migration job]
    Run --> Observe[CO observes job via platform observability]
    Observe --> Terminal[Operator reports terminal state on issue]
    Terminal --> Validate{CO validates data?}
    Validate -->|Yes — data is present and correct| Teardown[Operator tears down migration job<br/>and closes the issue]
    Validate -->|No — failure or incomplete data| Plan[CO provides plan for next steps]
    Plan --> Branch{Plan}
    Branch -->|Re-handoff fixed artifact| Review
    Branch -->|Re-run as-is| Run
    Branch -->|Abandon| Teardown
    Teardown --> Done((Data migrated;<br/>cutover is CO's concern))

Success

When the issue closes, the capability owner walks away with:

  • Their existing end-user data sitting inside the new tenant, validated by them against their own capability’s definition of correctness.
  • A clean platform state: the one-time migration job is torn down, leaving only the tenant and its data behind.
  • Confidence that when they decide to cut their end users over, the new tenant will not look like a regression.
  • A known, repeatable path if they ever need to migrate again (file another migrate my data issue and declare the process’s re-run contract again).

Edge Cases & Failure Modes

  • Migration job errors out partway, leaving partial data in the tenant. Experience-level handling: the operator reports the error on the issue; the capability owner provides the plan (wipe-and-retry, resume, accept partial, abandon). The platform does not auto-clean — the data belongs to the capability owner and they decide what to do with it.
  • Validation reveals data is wrong even though the job reported success. Same as above — capability owner provides the plan. This is treated identically to a job-level failure from the journey’s perspective.
  • Migration takes far longer than the capability owner expected. Experience-level handling: there is no SLA, and the capability owner can see what is happening through the platform’s observability. They can decide whether to let it run or to file a plan to abort and re-approach.
  • Migration job needs more resources than declared (storage too small in the tenant, more compute, etc.). Experience-level handling: temporary migration-only spikes are allowed only if declared up front and approved during review, and approval is bounded by the step-3 rule that the migration’s peak temporary footprint can be at most 2× the destination tenant’s steady-state compute and storage footprint. If the real job exceeds what was declared, the operator surfaces this on the issue; the capability owner may need to file a separate modify my capability issue against the destination tenant first (e.g., to enlarge storage), split the migration into smaller runs, or re-file the migration with a corrected declaration. The two issues are explicitly distinct because they touch different review scopes.
  • Old host becomes unavailable mid-migration (vendor outage, account suspended, etc.). Experience-level handling: the migration job will fail; same as any other failure — capability owner provides the plan. The platform makes no attempt to resume on the capability owner’s behalf.
  • Capability owner registered the wrong secrets, or the migration process can’t authenticate to the old host. Same as any other failure mode — surfaces during the run, capability owner adjusts and the issue iterates.
  • Another tenant is migrating at the same time. Experience-level handling: no special branch. Concurrent migrations are part of the offering; the capability owner still files the same issue, waits through the same review, and observes only their own job.
  • Capability owner wants to re-run the migration months later (e.g., to top up data accumulated on the old host since the first migration). The experience is still: file a fresh migrate my data issue. The previous migration job is gone; the new one is a separate one-shot, and the capability owner must explicitly declare whether the process is safe against existing data or whether the destination must be wiped first.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability’s Business Rules and Success Criteria — by name, so future readers can trace the lineage:

  • Operator-only operation. As with host-a-capability, the only engagement surface is a GitHub issue the capability owner files; the operator personally services it. The capability owner has no direct access to start, stop, or observe migration jobs except through the platform’s observability surface, which is itself an offering the operator runs.
  • Tenants must accept the platform’s contract. The migration process is packaged in the same form the platform accepts for any tenant component — the contract does not relax for migration. A migration process that cannot be packaged this way cannot be run by the platform. Declaring the process’s resource needs and re-run contract up front is part of that contract.
  • The capability evolves with its tenants. The existence of a migration-process offering — a platform-provided one-shot-job runner with the platform’s standard observability — is itself an instance of this rule. The platform extends to support a need (migrating in pre-existing data) that tenants have, rather than refusing tenants whose data already exists somewhere.
  • Identity service honors tenant credential-recovery rules. Indirectly relevant: if the migration includes user-account or credential references from the old host, the capability owner’s migration process must produce data that respects whatever identity properties their capability requires (e.g. for self-hosted personal media storage, the “lost credentials cannot be recovered” property must still hold post-migration). This is the capability owner’s responsibility, embedded in their migration process — the platform does not enforce it.
  • KPI: 1-hour reproducibility. The migration offering itself must be reproducible from definitions, like every other offering. A specific migration job is per-tenant and not part of the platform’s reproducible state — it is a one-shot artifact that ceases to exist after teardown.
  • KPI: 2-hr/week operator maintenance budget. A migration that demands disproportionate operator time across the issue’s review-run-iterate loop pressures this budget. Repeated failed migrations from the same capability owner — or migrations that require the operator to deeply understand the capability owner’s data to make progress — would cross into the eviction-threshold rule’s territory.
  • Eviction threshold. Sustained migration friction is a possible (if unusual) path into eviction. The platform offers to run a migration process; it does not offer to write one, debug it, or shepherd a problem capability through repeated attempts.
  • No specific availability or performance SLA. No SLA on migration completion either. Migrations take however long they take; the capability owner sees progress through observability and decides what to do about long-running jobs. Supporting concurrent migrations does not imply exclusive capacity or a completion-time guarantee for any one tenant’s job.
  • Operator succession. The migration job’s lifespan is bounded — it exists only between steps 4 and 8 of this journey. If the operator becomes unavailable mid-migration, the successor’s takeover responsibility is to keep the platform running, not to finish in-flight migration jobs. A mid-migration tenant simply has a stalled job; the capability owner provides a plan when a successor (or recovered operator) is back.

Out of Scope

  • Cutting end users over from the old host to the new tenant. This is a capability-owner concern, deliberately outside the platform’s view. The capability owner runs old + new concurrently and cuts over on their own schedule using whatever mechanisms their capability provides for end users.
  • Ongoing sync or replication between the old host and the new tenant. This UX is one-shot. A capability that needs continuous sync is a different capability (and likely a different UX, if it ever exists).
  • Writing or debugging the capability owner’s migration process. The platform runs what is handed to it. Logic correctness, source-format handling, schema translation, and idempotency belong to the capability owner.
  • Helping the capability owner pull data out of the old host. The migration process must speak to the old host on its own. The platform does not maintain adapters or know about specific vendors.
  • Validation of data correctness. Per Move Off the Platform After Eviction, the platform provides bytes faithfully but does not validate semantic correctness. The same applies in reverse here — the capability owner is the only judge of “did the data land correctly.”
  • Rollback to the old host. The capability owner is already running the old host concurrently; “rollback” simply means they don’t cut over. There is no platform-side rollback because there was nothing to roll back from — end users were never on the new tenant during the migration window.

Open Questions

None at this time.

3 - Move Off the Platform After Eviction

One-line definition: A capability owner whose capability has been evicted gets their data out cleanly and walks away with no obligations and no tenant-accessible copy left on the platform once the retention window closes.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is a capability owner whose capability has been evicted — a Primary actor (initiator) from the parent capability’s Stakeholders, on the way out. As elsewhere in this capability’s UX docs, the role is treated as separate from the operator’s even though today both hats are worn by the same person.

  • Role: Capability owner. The party who originally onboarded a capability onto the platform via host-a-capability, has been hosting it for some period, and is now being removed.
  • Context they come from: The parting is amicable. Eviction was triggered by a divergence the platform legitimately cannot meet — specialized hardware, regulatory constraints, an availability target stronger than the platform offers — not by a missed deadline in the operator-initiated-tenant-update flow. Negotiation over the eviction date has already happened upstream, before this UX begins. The capability owner accepts that they are leaving and has agreed to the date.
  • What they care about here: A clean exit. By the eviction date their capability is fully off the platform, their data is in their hands in a portable form they can verify, and nothing remains available for them to retrieve from the platform after the retention window ends. They are not asking the platform to help them figure out where to run next — that is their problem to solve.

Goal

“By the time the platform is finished with my capability, I have my data, I know it’s complete, and I have nothing left to chase down here.”

Entry Point

The capability owner arrives at this experience because the operator has filed an eviction issue against the infra repo tagging them. The issue contains exactly:

  • The eviction date (already negotiated upstream — not up for renegotiation in this journey).
  • The reason for eviction (so it is on the record and the parting stays amicable).
  • A link to the platform’s export tooling, with documentation on how to use it and what the export shape looks like for their tenant.

That is all the issue carries. The capability owner’s state of mind is “the date is set, I know where the export tool is, I have a window of time to get my data out and walk away cleanly.”

Journey

The journey runs in three phases keyed off the eviction date: a pre-eviction window where the tenant is still live, the eviction date itself when compute and network resources go away, and a 30-day grace window where data is held in an export-only, read-only state before tenant data is permanently deleted across all tiers at day 30.
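
All three phases key off a single date, so the whole calendar is derivable from the eviction issue. A trivial sketch; the date itself is illustrative:

    from datetime import date, timedelta

    eviction_date = date(2025, 6, 1)             # illustrative; negotiated upstream
    retention_ends = eviction_date + timedelta(days=30)

    print("Phase A: until", eviction_date)       # tenant live; iterative exports
    print("Phase B:", eviction_date)             # compute/network torn down; data frozen read-only
    print("Phase C:", eviction_date, "to", retention_ends)  # export-only window
    print("Hard deletion:", retention_ends)      # all tiers, no extension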

Phase A — Before the eviction date (tenant still live)

1. Read the eviction issue and the export documentation

The capability owner reads the issue, follows the link to the export tooling, and reads its documentation. They learn what the export will produce — file layout, formats, what is included, what is not — and roughly how long an export of their dataset will take to run. No back-and-forth with the operator is expected here; the issue and the docs are meant to be self-sufficient.

2. Notify their own end users

The capability owner tells their end users that the capability is going away on the eviction date — separately from the platform, on whatever channel they use with their users. The platform plays no role here; end users of a tenant capability are not visible to the platform and the platform does not communicate with them. (See No direct end-user access to the platform in Constraints.)

3. Run the export and verify it themselves

The capability owner kicks off the export using the platform’s export tool. What they perceive is an archive of their tenant’s data, produced for them to download then and there, plus a checksum/hash and total size in bytes that the platform generates alongside it. Validation that the export is complete and correct is the capability owner’s responsibility, not the platform’s. Only the capability owner knows their data well enough to say “yes, this is all of it and it is intact.” The platform offers checksum/hash and total size as the ceiling of what it can verify on the capability owner’s behalf — anything beyond that (record counts, schema integrity, business invariants) is theirs.
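
The checksum-and-size comparison is the one part of that validation the platform can assist with, and it is easy to script. A sketch assuming the checksum is SHA-256; the doc only promises "a checksum/hash and total size", so the actual algorithm is whatever the export tooling documents:

    import hashlib
    import os

    # Verify a downloaded export archive against the platform-provided
    # checksum and byte size. Everything beyond this (record counts, schema
    # integrity, business invariants) is the capability owner's own judgment.
    def verify_export(path: str, expected_sha256: str, expected_bytes: int) -> bool:
        if os.path.getsize(path) != expected_bytes:
            return False
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                digest.update(chunk)
        return digest.hexdigest() == expected_sha256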

4. (Optional) Run the export iteratively

Because end users may still be writing data while the tenant is live, an export taken in Phase A is not necessarily the final export. The capability owner may run multiple exports across Phase A — one early to validate that the tooling produces something usable, another later to capture more recent writes. Whether they do this is their call; the platform supports it because the export tool simply runs whenever invoked. Each run is ephemeral: if they want to keep an export, they download it when it is produced. The platform does not keep a history of prior exports around for them.

Phase B — The eviction date

5. Compute and network resources are torn down; the tenant stops serving

On the eviction date the operator deprovisions the tenant’s compute, network, and other live resources. From the capability owner’s seat: the tenant is no longer reachable by their end users. The data persists, but only in an export-only, read-only state — no further writes can occur, by anyone. A comment is posted on the eviction issue confirming the cutover and the start of the 30-day retention window.

What the capability owner perceives: the issue gets a status comment, and they now know their dataset is frozen. If they had not finished extracting data before this point, they still have 30 days — but the dataset they extract from now on is the final one.

Phase C — Post-eviction (30-day retention window)

6. Run the export of record (if not already taken)

In Phase C the export tool still works, but now against a stable, read-only snapshot. For capability owners with more data than they could extract during Phase A, or for those who deliberately deferred to avoid racing live writes, this is when the definitive export is pulled. As in Phase A, the generated export artifact is ephemeral: they re-run the same export tool, get back an archive plus checksum/hash and size, and must download it when it is produced rather than assuming the platform will keep that generated file around for later pickup. If they miss that download, they can run the export tool again at any point within the 30-day retention window and validate the newly generated archive the same way they validated in Phase A.

For capability owners who already pulled what they needed in Phase A, Phase C is a safety net — “I forgot a thing, let me grab it” — rather than the main event.

7. Walk away

Once the capability owner is satisfied they have everything, they comment on the issue indicating they are done. The operator closes the issue. After 30 days from the eviction date, the platform permanently deletes the tenant’s data — both the tenant-accessible copy and any platform-held backup-tier copies — regardless of whether the capability owner ever closed the loop. No residual copy survives day 30 in any tier the platform controls. There is no “are you sure?” — the 30-day clock is hard.

Flow Diagram

flowchart TD
    Start([Eviction issue filed by operator<br/>date already negotiated]) --> Read[Read issue + export tooling docs]
    Read --> Notify[Notify own end users<br/>off-platform]
    Notify --> ExportLive[Run export against live tenant<br/>verify checksum / size / contents]
    ExportLive --> Iter{More writes expected<br/>before eviction date?}
    Iter -->|Yes| ExportLive
    Iter -->|No| Wait[Wait for eviction date]
    Wait --> Cutover[Eviction date:<br/>compute/network torn down<br/>data → read-only<br/>comment posted on issue]
    Cutover --> PhaseC{Need more data<br/>from final snapshot?}
    PhaseC -->|Yes| ExportFinal[Run export against frozen snapshot<br/>download now + verify]
    PhaseC -->|No, already complete| Done
    ExportFinal --> Done[Comment 'done' on issue;<br/>operator closes it]
    Done --> RetentionEnds([30 days post-eviction:<br/>all tenant data permanently deleted<br/>including backup-tier copies])

Success

When the journey ends cleanly, the capability owner walks away with:

  • A verified, complete archive of their tenant’s data, sized and checksummed by the platform, validated by them.
  • A clear paper trail on the eviction issue showing the date, the reason, and confirmation that they pulled what they needed.
  • Nothing left to chase down on the platform. After the 30-day window the platform permanently deletes the tenant’s data across every tier it controls — no tenant-accessible copy and no deeper backup-tier copy survives.
  • An amicable ending. The operator filed the issue, the platform held the data the agreed amount of time, and the capability owner left under their own power. The relationship is intact for whatever comes next.

Edge Cases & Failure Modes

  • Capability owner asks for more time after the eviction date. Hard wall. The negotiation over the eviction date happened upstream of this journey; once that date is set, it is the date. The 30-day post-eviction retention is the only post-date slack and it is fixed.
  • Export takes longer than 30 days to actually run on a very large dataset. Same hard wall — the capability owner had Phase A plus 30 days of Phase C to extract; if that is not enough, they had advance warning during eviction-date negotiation and should have raised it then. The platform does not extend the retention window for slow extracts.
  • Export comes back wrong (checksum mismatch, missing files, corruption visible to the capability owner). The capability owner reports the problem on the eviction issue so that thread remains the coordination record. This is the one exception to the 30-day hard wall: if the failure is shown to be in the platform’s export tooling or its data hosting, the operator pauses that tenant’s retention-window countdown for removal of tenant-accessible data until the platform-side issue is resolved and a clean export has been produced, so the capability owner can continue exporting during that pause. No separate restoration SLA is promised in this UX; the issue stays open until the capability owner can pull a clean export. Failures rooted in the capability owner’s own validation steps do not pause that retention-window countdown.
  • Export tooling does not exist for this tenant’s data shape at the time of eviction. Cannot happen by design — export tooling is a core platform feature, present for every kind of data the platform hosts. If a hole is discovered, that is itself a platform bug, handled the same way as the previous bullet (eviction issue remains open, that tenant’s retention-window countdown for removal of tenant-accessible data is paused).
  • Capability owner ignores the issue entirely and never extracts anything. No special handling. The 30-day clock runs, tenant-accessible data is removed, the issue is closed by the operator. The capability owner may have made themselves whole through other means (their own backups, accepting the loss); the platform does not chase them.
  • End users keep hitting the tenant after the eviction date. They get whatever connection failure the underlying infra produces. The capability owner is responsible for having warned their end users; the platform does not present a “this tenant has been retired” page or otherwise communicate with end users — end users belong to the capability, and from the platform’s seat, the capability is the end user.
  • Capability owner wants to come back later (re-onboard the same capability after the divergence is resolved). That is a new host-a-capability journey, not a continuation of this one. It is not blocked, but nothing about this UX preserves state to make it easier.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name:

  • Eviction is allowed when needs and capabilities diverge. This UX is the operationalization of the amicable form of that rule: the divergence is real (specialized hardware, regulatory constraint, availability target the platform cannot meet) and the parting is mutual. The fall-behind variant of eviction is handled separately via operator-initiated-tenant-update.
  • No direct end-user access to the platform. End users of the tenant capability are not visible to the platform and are not communicated with by the platform during eviction. Notification of end users is purely the capability owner’s responsibility.
  • Operator succession — on-demand exportable archives. The same export mechanism that the parent capability promises for operator-succession scenarios is what powers this journey. Export tooling is therefore not bespoke to eviction; it is a core platform feature that exists at all times for every tenant. This UX simply consumes it.
  • Operator-only operation. The capability owner has no administrative access during this journey. Everything they do — running exports, leaving comments — is done through the same surfaces an end-state non-operator has. The operator is the one who deprovisions resources and closes the issue.
  • Affected parties (end users of the tenant capability). End users feel this journey indirectly: their access to the capability ends on the eviction date. The platform does not surface this to them — the capability owner does, separately, on their own channels.
  • KPI: 2-hr/week operator maintenance budget. Implication: this journey must not require the operator to do bespoke per-tenant work. The export tool is generic and runs on demand; the operator’s only routine touchpoints are filing the issue, posting the cutover comment, and closing the issue at the end. A tenant whose eviction would require custom export work is itself a sign the platform’s export tooling has a gap that needs fixing — handled as a platform bug, not as an operator-effort overrun.
  • KPI: 1-hour reproducibility. Implication: the data formats produced by the export tool, and the way they relate to the platform’s definitions, should be expressible as part of the platform itself, not as snowflake per-tenant logic. (Standing the platform up should not require remembering “and here is the special export path for tenant X.”)

Out of Scope

  • The eviction-decision journey itself. Why the operator decided to evict, and the conversation that established the eviction date, happens before this UX. By the time this UX begins, the issue is filed, the date is set, and both parties have agreed.
  • The fall-behind eviction path. Eviction triggered by a missed extended date in operator-initiated-tenant-update is a different shape (less amicable, possibly compressed timelines). It enters a separate journey not covered here, even though the mechanics of getting data out via the export tool may overlap.
  • Helping the capability owner figure out where to run next. The platform does not point at alternative hosts, port the capability’s runtime, or assist with migration. The export tool produces data; the rest is the capability owner’s problem.
  • Application/runtime/configuration migration tooling. Only data export is provided. Capability code, container images, configuration, secrets management at the destination — none of this is the platform’s concern.
  • Re-onboarding the same capability later. If the capability owner wants to come back, that is a fresh host-a-capability journey with no special path inherited from having previously been here.
  • Operator’s side of this journey. This UX is written from the capability owner’s seat. The operator’s experience (filing the issue, deprovisioning on the date, posting the cutover comment, closing the issue, watching the 30-day clock) is captured here in the responder role, not as a separate document.

Open Questions

None at this time.

4 - Operator-Initiated Tenant Update

One-line definition: The operator notices a hosted tenant’s components have fallen behind what the platform supports, opens the conversation, and works with the capability owner to bring them current — without evicting.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the operator — the Owner / Accountable party from the parent capability’s Stakeholders. The capability owner is a responder in this journey, not the initiator. As with host-a-capability, this UX is written as if the operator and the capability owner were separate people: the role boundary is treated as real, even though today both hats are worn by the same person.

  • Role: Operator. Sole administrator of the platform; the only person who can see across tenants and notice that one of them has aged out of what the current platform offers.
  • Context they come from: They have just learned that something in the platform itself must change — a cloud provider is sunsetting a service the platform depends on; a CVE has landed against a platform component; a runtime version the platform offers is being retired upstream. The change forces an update on every tenant still using the affected component.
  • What they care about here: Getting affected tenants migrated with their capability owners, on a timeline driven by the real external pressure, without burning down the “we work with you, we don’t evict for fall-behind” promise — and without letting the situation drag past the point where the platform itself becomes unsafe or unsupportable.

Goal

“I want every tenant still on the falling-behind component to be moved onto what the platform now supports, on a timeline that fits the external pressure that forced this — and I want to do it by working with each capability owner rather than evicting them.”

Entry Point

The operator arrives at this experience because of a platform-level dependency event that is not under their control:

  • A cloud provider has announced a sunset date for a service the platform uses.
  • A CVE has been disclosed against a platform component, so the platform itself must update — and any tenant pinned to the affected component must update with it.
  • An upstream runtime, library, or base image the platform offers is reaching end-of-life.

The deadline is therefore inherited from the external event, not invented by the operator. The operator’s state of mind is “I have to do this anyway; how many tenants am I dragging through it with me, and what do they each need to ship?”

What they have in hand: knowledge of which platform offering is changing, by when, and which currently-hosted tenants are using it.

There is no formal tenant-facing pending-update view ahead of this moment. If the platform ever adds an earlier deprecation or pending-update signal for capability owners, that signal would live in Tenant-Facing Observability rather than in this operator-side journey. The operator-filed issue remains the first official signal that this journey has begun.

Journey

1. File a “platform update required” issue per affected tenant

For each affected tenant, the operator opens an issue against the infra repo using the platform update required issue type. This is a distinct issue type from onboard my capability and modify my capability — the distinct type is the signal to the capability owner that this is not optional cleanup, it is a required update with a real deadline behind it.

If the same tenant is hit by two unrelated forcing events at roughly the same time, the operator opens separate platform update required issues — one per event — and cross-links them if the remediation overlaps. The forcing event, reason, and deadline stay distinct even when the same code change may help satisfy more than one thread.

The issue tags the capability owner and contains:

  • What is falling behind (the specific platform offering / component / version).
  • What it is being replaced by, or what the new platform-supported version is.
  • The shape of the update being asked for — repackage against a new runtime, swap a dependency, rebuild against a new base, etc.
  • The deadline, with the external reason for it (sunset date, CVE remediation window, EOL date).
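
To make the handoff concrete, here is a minimal sketch of how such an issue body could be composed. Everything in it is illustrative (the PlatformUpdateRequired shape, the field names, the rendering); the platform's real issue template may differ.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class PlatformUpdateRequired:
        """Illustrative shape of a 'platform update required' issue."""
        tenant: str
        owner: str            # capability owner handle, tagged in the issue
        falling_behind: str   # the specific offering / component / version
        replacement: str      # the new platform-supported form
        update_shape: str     # repackage / swap a dependency / rebuild ...
        deadline: date
        external_reason: str  # sunset date, CVE window, EOL date

        def render(self) -> str:
            # Body fields in the order the journey lists them.
            return "\n".join([
                f"@{self.owner}: platform update required for {self.tenant}.",
                f"- Falling behind: {self.falling_behind}",
                f"- Replaced by: {self.replacement}",
                f"- Shape of the update: {self.update_shape}",
                f"- Deadline: {self.deadline.isoformat()} ({self.external_reason})",
            ])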

What the operator perceives at this point: the issue is filed and the capability owner has been notified. They wait for acknowledgment.

2. Capability owner acknowledges and plans

The capability owner reads the issue, asks any clarifying questions in-thread, and indicates whether the requested shape of update is feasible within the deadline. The operator answers questions as they come.

If the capability owner needs more time than the inherited deadline allows, the conversation moves into step 4 (slack negotiation) before any artifacts are handed off. Otherwise it proceeds to step 3.

3. Run the modify inner-loop

From here the mechanics are identical to the modify my capability journey:

  • The capability owner hands off updated packaged artifacts on the issue.
  • The operator re-provisions against the platform’s new offering.
  • The operator asks the capability owner to test.
  • They iterate in comments until it works.
  • The operator closes the issue.

The inner loop is the same surface; only the initiator and the issue type differ. End-user impact during the test/redeploy step is the same as a routine modify — typically a brief outage during cutover, nothing more.

4. Negotiate slack against the inherited deadline

If the capability owner cannot ship within the inherited deadline, the operator and capability owner first determine whether the external pressure leaves any safe slack at all. If it does, they negotiate an extended delivery date in the issue thread. The extension is not unbounded — the operator sets it based on how much slack the external pressure actually allows (a CVE with a known exploit allows much less slack than a vendor sunset announced 18 months out).

If the inherited deadline leaves no safe slack, the operator declines the extension and the original inherited deadline remains the operative date. The capability owner still gets the chance to ship against that date; they just do not get more time.

Whether extended or not, the date the operator and capability owner are now working against is recorded clearly on the issue. The journey then resumes at step 3.
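
The decision rule in this step is small enough to state precisely. A minimal sketch, with safe_slack_until standing in for the operator's judgment about how much slack the external pressure allows (all names here are illustrative):

    from datetime import date
    from typing import Optional

    def operative_date(inherited: date,
                       safe_slack_until: Optional[date],
                       requested: Optional[date]) -> date:
        """Date the issue is worked against after slack negotiation.

        safe_slack_until: latest date the external pressure still tolerates
        (None when there is no safe slack at all).
        requested: extension asked for by the capability owner (None if none).
        """
        if requested is None or safe_slack_until is None:
            # No extension requested, or no safe slack: inherited date stands.
            return inherited
        # Extensions are bounded by what the external pressure allows.
        return min(requested, safe_slack_until)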

5. Tip into eviction (after the last workable date is missed)

If the capability owner misses the operative delivery date — either the original inherited deadline when no extension was possible, or an agreed extended delivery date when one was — the operator opens a separate eviction issue (per the parent capability’s eviction journey — to be defined as its own UX) that links back to this issue for context. The eviction issue carries its own eviction date.

This update issue is then closed as superseded by the eviction. The journey ends here from the operator’s side; the capability owner’s experience continues in the Capability owner moves off the platform after eviction UX.

The decision to evict is governed by the parent capability’s Eviction threshold rule: continuing to accommodate this tenant would either push routine maintenance sustainably above 2× the operator-maintenance-budget KPI, or break the reproducibility KPI by leaving the platform stuck on a snowflake configuration to keep one tenant alive. A missed operative delivery date is the operational signal that the threshold has been crossed; it is not eviction-by-policy for being late.

Flow Diagram

flowchart TD
    Start([Platform dependency event:<br/>vendor sunset / CVE / EOL]) --> File[Operator files 'platform update required'<br/>issue per affected tenant]
    File --> Ack[Capability owner acknowledges<br/>and asks clarifying questions]
    Ack --> Feasible{Feasible by<br/>inherited deadline?}
    Feasible -->|Yes| Modify[Run modify inner-loop:<br/>artifacts → provision → test → close]
    Feasible -->|No| Slack{Safe slack for<br/>extension?}
    Slack -->|Yes| Extend[Negotiate extended delivery date<br/>recorded on the issue]
    Slack -->|No| NoExtend[No extension available;<br/>original deadline stands]
    Extend --> Modify
    NoExtend --> Modify
    Modify --> MetDeadline{Delivered by<br/>operative date?}
    MetDeadline -->|Yes| Done([Issue closed — tenant current])
    MetDeadline -->|No| Evict[Operator opens separate eviction issue,<br/>links back to this one]
    Evict --> Closed([This issue closed —<br/>superseded by eviction])

Success

When the issue closes cleanly, the operator walks away with:

  • Every affected tenant is now running on what the platform currently supports — no stragglers pinned to the retired offering.
  • The “we work with you, we don’t evict for fall-behind” promise was honored: each capability owner was given the chance to ship the update, with extension where the inherited deadline didn’t fit and safe slack existed.
  • The platform is free to actually retire the old offering, since there are no tenants left on it. The external pressure that started the whole journey can now be fully addressed.
  • A trail on each issue showing what was asked for, when, and what was shipped — useful the next time a similar dependency event happens.

Edge Cases & Failure Modes

  • Multiple tenants affected by the same platform event. Each gets its own issue, so each capability owner sees a request scoped to their capability. The operator coordinates timelines across all of them but does not bundle them into a single thread.
  • Capability owner goes silent. Same shape as silence in host-a-capability — there is no formal SLA in either direction. Experience-level handling: the operator can grant an extension only if safe slack exists, but if silence persists past the operative delivery date, step 5 applies.
  • Update cannot be shipped at all (capability fundamentally incompatible with the new offering). This is functionally the same as a missed operative delivery date: the operator opens an eviction issue. The right root response, per the parent capability’s “the capability evolves with its tenants” rule, is to consider whether the platform should keep supporting the old form — but if the external pressure (CVE, hard vendor sunset) makes that impossible, eviction is the honest outcome.
  • CVE with active exploit shortens the timeline aggressively. The operator may file the issue with very little slack, and the extension in step 4 may be much smaller than for a routine sunset — or unavailable entirely. The journey shape is unchanged; the deadlines just compress.
  • Update reveals a new requirement that the platform doesn’t yet offer. Hand off into the host-a-capability change-later loop’s “new offering needed” branch — the platform-update issue stays open while the new offering is added, then resumes at step 3.
  • Multiple overlapping platform updates against the same tenant. The operator opens one issue per forcing event, even for the same tenant, so each deadline and external reason stays legible. If the remediation overlaps, the issues cross-link and the operator coordinates them together.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name, so the lineage is traceable:

  • Eviction is allowed when needs and capabilities diverge — but fall-behind cases work with the tenant. This UX is the operationalization of that carve-out. The default outcome is “we update together,” not “we evict.” Eviction enters this journey only via the missed-final-date branch and only as a separate, linked issue.
  • Eviction threshold. A missed operative delivery date is the operational signal that continuing to accommodate this tenant would cross the 2×-maintenance-budget or reproducibility threshold. The numeric threshold lives with the KPI; this UX inherits whatever it currently is.
  • The capability evolves with its tenants. Before transitioning to eviction in step 5, the operator considers whether the platform should keep supporting the older form — sometimes the right answer is to absorb the maintenance, not push the tenant forward. That choice is constrained by what the external pressure actually allows (a CVE generally rules it out; a vendor sunset announced years ahead may not).
  • Operator-only operation. The operator is the only person who can see that a tenant has fallen behind, because cross-tenant visibility lives only with the operator. There is no automated tenant-side warning system surfacing this from the platform to the capability owner today.
  • KPI: 2-hr/week operator maintenance budget. A tenant that routinely needs hand-holding through these updates — repeatedly missing deadlines, repeatedly needing extensions — is consuming disproportionate operator time and crosses into the eviction-threshold rule on its own merits, even before any single missed operative delivery date.
  • KPI: 1-hour reproducibility. Implication for this UX: the re-provision step in the inner loop must run through the platform’s existing definitions (now updated to the new offering), not through a per-tenant snowflake patch. If the only way to keep a tenant alive is bespoke manual config, that is itself eviction-threshold material.

Out of Scope

  • The eviction journey itself. When step 5 fires, the operator opens a separate eviction issue and the capability owner’s experience continues in Capability owner moves off the platform after eviction — a sibling UX, not part of this one.
  • Platform-contract changes that aren’t forced by an external dependency event. When the operator decides to change the platform’s contract (retire a packaging form, alter availability characteristics) absent external pressure, that’s the platform-contract-change rollout UX, not this one. The seam: this UX is reactive (something outside the operator’s control forced the update); the contract-change UX is proactive (the operator chose to change something).
  • The capability owner’s side as a primary journey. This UX is written from the operator’s seat. The capability owner’s experience of receiving and responding to one of these issues is captured here as a responder, not as a separate document — it shares enough surface with modify my capability (artifacts, test, iterate, close) that a separate doc would mostly duplicate.
  • Detection of fall-behind itself. How the operator notices a tenant is on a falling-behind component (vendor announcement watching, CVE feeds, manual review) is operational detail, not part of the user experience. This UX starts at the moment the operator has decided to act.
  • Tenant-facing visibility into pending platform updates before the issue is filed. Capability owners do not get an official warning surface ahead of the issue in this UX. If the platform later adds an earlier deprecation or pending-update signal, that signal belongs in Tenant-Facing Observability, not here, and does not replace issue filing as the start of this journey.
  • Routine modify requests. A capability owner shipping a version bump on their own initiative is the change-later loop in host-a-capability, not this UX.

Open Questions

None at this time.

5 - Platform-Contract-Change Rollout

The operator proactively changes a term of the platform’s contract — retiring an offering, changing a packaging form, altering availability characteristics — communicates that change to every affected tenant ahead of time, and migrates them all onto the new contract before the old one is retired.

One-line definition: The operator proactively changes a term of the platform’s contract — retiring an offering, changing a packaging form, altering availability characteristics — communicates that change to every affected tenant ahead of time, and migrates them all onto the new contract before the old one is retired.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the operator — the Owner / Accountable party from the parent capability’s Stakeholders. The capability owners are responders in this journey, not initiators. As with host-a-capability and operator-initiated-tenant-update, this UX is written as if the operator and the capability owners were separate people: the role boundary is treated as real even though today both hats are worn by the same person.

  • Role: Operator. Sole administrator of the platform; the only person who can change the platform’s contract and the only person who can see across tenants to know which ones are affected.
  • Context they come from: They have decided to change a term of the platform’s contract — retire an offering, change a packaging form, alter availability characteristics, alter platform-imposed constraints. The decision has already been made and is not forced by external pressure. They have fleshed out the technical details of the change and (where applicable) prepared a migration guideline for tenants. Where the change replaces an offering with a new one, the replacement offering has already been implemented and is running on the platform alongside the old one.
  • What they care about here: Getting every affected tenant migrated onto the new contract by a hard deadline, without surprising anyone, while honoring the evergreen contract promise — change is communicated ahead of time and tenants are migrated, not sprung on.

Goal

“I want to change a term of the platform’s contract — retire an offering, change a packaging form, alter availability characteristics — communicate that change to every affected tenant ahead of time, and have them all migrated onto the new contract before the old one is retired, without surprising anyone.”

Entry Point

The operator arrives at this experience having chosen to change the contract. The decision itself (the why — cost, simplification, security posture, no longer wanting to maintain two runtimes, etc.) is upstream of this UX and not part of it. What they have in hand at step 0:

  • The full technical details of the change — what term is changing, what it is becoming, or that it is being removed entirely.
  • A migration guideline for tenants, where applicable (i.e. when a replacement offering exists and tenants need to repackage or reconfigure against it).
  • The replacement offering, if one exists, already implemented and running on the platform. Building the replacement is a precondition of this journey, not a step inside it.
  • Knowledge of which currently-hosted tenants are using the affected term.

The operator’s state of mind is “I have decided this is changing; how do I get everyone moved over by a date I’m choosing, without anyone being surprised?”

The seam with operator-initiated-tenant-update is sharp: that journey is reactive (an external event — vendor sunset, CVE, EOL — forced the update and dictated the deadline). This journey is proactive (the operator chose the change and is choosing the deadline).

Journey

1. File a “platform contract change” umbrella issue

The operator opens a single umbrella issue against the infra repo using the platform contract change issue type. This is a distinct issue type, separate from onboard my capability, modify my capability, and platform update required. The distinct type is the signal to capability owners that this is the operator changing the rules — not an externally-forced update and not optional cleanup.

A single umbrella issue is used (rather than one issue per tenant, as in operator-initiated-tenant-update) because the change applies identically to everyone, the migration guideline is shared, and tenants benefit from cross-tenant visibility — a clarifying question one tenant asks may be the answer another tenant needed.

The umbrella issue tags every affected capability owner and contains:

  • What term is changing, and what it is changing to (or that it is being removed entirely).
  • The migration guideline, if applicable.
  • The hard deadline by which all migrations must complete and after which the old form will be removed. Because this UX has no externally-imposed date to inherit, the operator picks a deadline that gives every affected tenant at least two full status-update cycles before cutoff: one cycle to acknowledge and start, and one cycle to finish or surface blockers while there is still time to respond.
  • The reason for the change. Even though the operator chose it, capability owners deserve to know why (cost, simplification, security posture, etc.) so they can plan and so the trail makes sense to readers later.
  • The status-update cadence the operator has chosen for this rollout (see step 3).
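
The two-cycle deadline rule above is mechanical enough to state precisely. A minimal sketch, assuming the cadence is expressed in days (the function name and example dates are illustrative):

    from datetime import date, timedelta

    def minimum_deadline(filed_on: date, cadence_days: int) -> date:
        """Earliest permissible cutoff: two full status-update cycles,
        one to acknowledge and start, one to finish or surface blockers
        while there is still time to respond."""
        return filed_on + timedelta(days=2 * cadence_days)

    # A weekly cadence filed on 2025-03-03 cannot cut off before 2025-03-17.
    assert minimum_deadline(date(2025, 3, 3), 7) == date(2025, 3, 17)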

What the operator perceives at this point: the umbrella issue is filed and every affected capability owner has been notified. They wait for acknowledgments.

2. Capability owners acknowledge in-thread

Each tagged capability owner is required to acknowledge the change in-thread. Silence is not acceptable in an umbrella issue — silence in a multi-tenant thread is ambiguous (did they see it? are they planning?), so explicit acknowledgment is the contract.

Clarifying questions are asked in the umbrella thread, not in side channels, so that answers are visible to every other affected tenant at the same time. The operator answers questions as they come.

The deadline is not negotiable per-tenant. Capability owners do not get to ask for a slip — the deadline applies uniformly to everyone or it isn’t a deadline. (Whether the operator may globally push the deadline if the migration guideline turns out to be insufficient is covered in Edge Cases.)

3. Tenants migrate via separate modify my capability issues

Each affected tenant ships its migration as a separate modify my capability issue, linking back to the umbrella issue for context. The umbrella thread tracks acknowledgments, cross-tenant questions, and the global deadline; each modify issue tracks the actual artifact handoff / provision / test / close inner loop for one tenant. This keeps the umbrella thread readable as a coordination surface rather than a sprawling multi-tenant inner loop.

During the rollout window, the platform serves both the old and the new form of the contract concurrently — the replacement offering runs alongside the old one so tenants have time to migrate at their own pace within the deadline. The exception is a full offering removal: when there is no replacement, there is nothing to run alongside, and the change is effectively all-or-nothing at the deadline.

The operator posts status updates on a regular schedule in the umbrella thread. The cadence is chosen by the operator at the time the umbrella issue is filed and is sized to the overall timeline — daily for a roughly-week-long rollout, weekly for a roughly-month-long rollout, and so on. The current snapshot lives in the umbrella issue body so a reader landing cold can immediately see the latest state, and each scheduled update is also posted as a thread comment so the history of the rollout remains visible to watchers over time. Each update carries the same metrics: how many tenants are still on the old form, how many have migrated, which modify issues are open, and how much time remains until the deadline. Status updates are how every party — operator and capability owners alike — sees rollout progress without having to chase it.
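
As an illustration of what one scheduled update could look like, here is a minimal sketch that derives the metrics above from per-tenant state. The TenantStatus shape and the snapshot wording are assumptions, not the platform's actual tooling.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class TenantStatus:
        tenant: str
        modify_issue: str   # link to the tenant's 'modify my capability' issue
        migrated: bool

    def status_snapshot(tenants: list[TenantStatus],
                        deadline: date, today: date) -> str:
        """Compose one scheduled status update for the umbrella thread."""
        pending = [t for t in tenants if not t.migrated]
        return "\n".join([
            f"Migrated: {len(tenants) - len(pending)}/{len(tenants)}",
            "Still on the old form: "
            + (", ".join(t.tenant for t in pending) or "none"),
            "Open modify issues: "
            + (", ".join(t.modify_issue for t in pending) or "none"),
            f"Days until deadline: {(deadline - today).days}",
        ])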

4. Deadline arrives

On the hard deadline:

  • For each tenant whose migration completed: the modify issue is closed in the normal way and that tenant is now on the new form.
  • The old form is removed from the platform regardless of whether anyone is still on it. Any tenant that has not migrated by the deadline is now broken on a removed offering — which is exactly why the operator must ensure laggards are moved into eviction before this point if it is clear they will not make it.
  • For each tenant that did not migrate by the deadline: the operator opens a separate eviction issue per laggard tenant, linking back to the umbrella issue for context. The eviction issue carries its own eviction date and is governed by the parent capability’s eviction journey (to be defined as its own UX).
  • The umbrella issue is closed. Its job ends here — every affected tenant has either completed migration (their modify issue closed) or has an eviction issue in-flight (linked from the umbrella). Subsequent activity for laggards lives on their respective eviction issues, not on the umbrella.

Flow Diagram

flowchart TD
    Start([Operator has decided to change<br/>a contract term; replacement<br/>offering already implemented]) --> File[Operator files 'platform contract change'<br/>umbrella issue, tags all affected tenants]
    File --> Ack[Each capability owner acknowledges<br/>in-thread; questions answered in-thread]
    Ack --> Modify[Each tenant ships a separate<br/>'modify my capability' issue,<br/>linked to the umbrella]
    Modify --> Concurrent[Old + new forms run concurrently<br/>during the rollout window<br/>except for full removals]
    Concurrent --> Status[Operator posts scheduled status<br/>updates with migration metrics<br/>in the umbrella thread]
    Status --> Deadline{Deadline<br/>arrives}
    Deadline --> Migrated[Migrated tenants:<br/>'modify' issue closes]
    Deadline --> Laggards[Non-migrated tenants:<br/>operator opens a separate<br/>eviction issue per laggard,<br/>linked to umbrella]
    Deadline --> Remove[Old form is removed<br/>from the platform]
    Migrated --> CloseUmbrella[Umbrella issue closed]
    Laggards --> CloseUmbrella
    Remove --> CloseUmbrella
    CloseUmbrella --> Done([Contract change has shipped])

Success

When the umbrella issue closes, the operator walks away with:

  • The contract change has shipped — the old form is gone from the platform, the new form is the only form.
  • Every affected tenant has either migrated onto the new contract or has an eviction issue in-flight; no tenant is silently broken on a removed offering.
  • The evergreen contract promise was honored: the change was announced ahead of time with a migration guideline and a hard deadline, no tenant was surprised at retirement, and tenants were given a coordinated window in which both old and new ran concurrently (except for full removals, where concurrency is impossible).
  • A trail across the umbrella issue, the per-tenant modify issues, and any linked eviction issues — showing what changed, why, who migrated when, and who didn’t. Useful the next time a contract change ships.

Edge Cases & Failure Modes

  • Capability owner does not acknowledge in the umbrella. Experience-level handling: the operator chases — in-thread mention, direct ping, separate message as the deadline approaches. If no acknowledgment arrives by the deadline, the missing acknowledgment is treated as non-engagement and the laggard branch (eviction issue per tenant) applies. Acknowledgment is required, but the consequence of withholding it is the same as failing to migrate.
  • Migration guideline turns out to be wrong or insufficient mid-rollout. Two sub-cases:
    • Isolated miss (the guideline doesn’t cover one tenant’s specific case): the fix is tenant-specific, every other tenant can keep migrating without changing their plan, the guideline is amended in the umbrella thread, and the deadline does not move.
    • Big miss (the shared guidance or replacement itself must change for the remaining tenants): the deadline is pushed out and the new deadline is announced in the umbrella thread. The hard-deadline rule still applies to the new date — extension is a global event, not a per-tenant slip.
  • Tenant says outright “we can’t migrate — the new contract makes our capability unviable.” Straight to eviction. The capability owner now has to find a new platform or revamp their capability so that it works with the new contract. The umbrella issue still tracks this tenant via the linked eviction issue at deadline time, but the migration itself is not going to happen.
  • Full offering removal (no replacement to run alongside). Step 3’s “old + new run concurrently” does not apply. The change is all-or-nothing at the deadline. Tenants must be off the offering by the deadline; there is no grace window during which both forms exist. Migration in this case usually means moving to a different offering entirely or moving the workload off-platform — whichever the migration guideline directs.
  • Many tenants miss the deadline at once. This is a signal that the operator picked a deadline that was too aggressive given the size of the change, or sized the status-update cadence poorly for the work involved. The hard-deadline rule still applies — the operator opens an eviction issue per laggard — but the operator should treat the cluster of evictions as a learning event for the next contract-change rollout.
  • Cross-tenant question reveals a conflict in the migration guideline. Same shape as the isolated-miss branch above: amend in-thread, continue. The umbrella thread is the source of truth for the guideline as it evolves during the rollout.
  • Two contract changes are in flight at once and the same tenant is affected by both. Each change still gets its own umbrella issue and the tenant is expected to acknowledge in each thread. If one migration satisfies both changes, the tenant may use one combined modify my capability issue, provided it links back to both umbrellas so each rollout can still be tracked and closed independently.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name, so the lineage is traceable:

  • Evergreen contract. This UX is the operationalization of the evergreen-contract promise made in host-a-capability. Contract changes are communicated ahead of time, tenants are migrated, and no tenant is sprung on. The hard deadline plus the rollout-window concurrency (where applicable) is what “communicated ahead of time and migrated” actually looks like in practice.
  • Operator-only operation. Only the operator can change the contract, and only the operator can see across tenants to know which ones are affected. The umbrella-issue mechanic is consistent with this — it is the operator’s tool, not a capability-owner-driven coordination surface.
  • Tenants must accept the platform’s contract. After this rollout completes, the new contract is the contract every remaining tenant has accepted. Acceptance is implicit in their having migrated; tenants that cannot accept the new contract end up evicted, which is consistent with the parent rule.
  • Eviction is allowed when needs and capabilities diverge. Laggards who miss the deadline are evicted via the parent capability’s eviction journey. This UX feeds eviction; it does not perform it.
  • Eviction threshold. A missed deadline is the operational signal that continuing to accommodate this tenant would either push routine maintenance sustainably above the 2×-budget threshold or break reproducibility (e.g. by forcing the platform to keep the old form running indefinitely just for one tenant). The numeric threshold lives with the KPI; this UX inherits whatever it currently is.
  • The capability evolves with its tenants. There is real tension between this rule and the present UX: this rule says the default response when a tenant needs something is to update the platform rather than push the requirement back. Yet this UX is the operator pushing change toward tenants. The reconciliation: this UX applies when the operator has already decided to change the contract — typically because the cost of continuing to support the old form (maintenance, security posture, complexity) has tipped against keeping it. The migration-guideline + concurrent-rollout shape is how the platform absorbs as much of the cost as it can. But once the deadline is set, it is set.
  • No specific availability or performance SLA. Contract changes that affect availability characteristics are in scope of this UX (the operator may alter availability characteristics under the rules of this rollout). Tenants needing stronger guarantees than the new contract offers are subject to the same eviction path as any other “fundamentally incompatible” case.
  • KPI: 2-hr/week operator maintenance budget. Implication for this UX: the rollout cadence — including status updates and per-tenant modify reviews — must fit within the operator’s weekly budget across the rollout window. A contract change that would clearly blow the budget is a signal to reduce the scope of the change, lengthen the deadline, or stage the rollout, before the umbrella issue is filed.
  • KPI: 1-hour reproducibility. Implication for this UX: the new contract must itself be reproducible from definitions. A contract change that ships with the platform itself stuck on a snowflake configuration to keep both forms running has failed the rule. Concurrent old/new during rollout is fine; permanent dual-form support is not the goal.

Out of Scope

  • Externally-forced updates. Vendor sunset, CVE remediation, runtime EOL — those are reactive updates with deadlines inherited from outside, and they belong in operator-initiated-tenant-update. The seam: that UX is reactive; this UX is proactive.
  • Routine modify requests. A capability owner shipping a version bump or new component on their own initiative is the change-later loop in host-a-capability, not this UX.
  • The eviction journey itself. When a laggard misses the deadline, the operator opens a separate eviction issue per tenant. The capability owner’s experience continues in Capability owner moves off the platform after eviction — a sibling UX, not part of this one.
  • The decision to change the contract. Why the operator chose to retire an offering, change a packaging form, or alter availability characteristics is upstream of this UX. The journey starts the moment the operator has decided.
  • Building the replacement offering. Where the contract change replaces an old offering with a new one, the replacement must already be implemented and running on the platform before the umbrella issue is filed. Building it is a precondition, not a step in this UX. (Following on from host-a-capability’s “new offering needed” branch — which was the path by which the new offering may originally have entered the platform.)
  • The capability owner’s responder side as a separate doc. As with operator-initiated-tenant-update, the capability owner’s experience of receiving and responding to the umbrella issue is captured here as a responder. The actual migration work runs through the existing modify my capability surface, which is already documented in host-a-capability. A separate doc would mostly duplicate.

Open Questions

None at this time.

6 - Stand Up the Platform

The operator rebuilds the platform from its definitions — back to ready-to-host-tenants — confidently and verifiably, whether it’s the first build ever, recovery after total loss, or a periodic drill.

One-line definition: The operator rebuilds the platform from its definitions — back to ready-to-host-tenants — confidently and verifiably, whether it’s the first build ever, recovery after total loss, or a periodic drill.

Parent capability: Self-Hosted Application Platform

Persona

The actor here is the operator — the parent capability’s Owner / Accountable party and sole administrator. There are no co-operators in this journey, and the sealed successor credentials are not in play during routine standup.

If a successor has taken over (because the primary operator is unavailable), they run this same journey. The act of breaking the seal and asserting takeover is a separate experience; once they have access to the operator’s context, the rebuild flow is identical. From this UX’s perspective there is one persona — whoever is currently the operator.

  • Role: The operator. Sole party with administrative access to the platform and accountable for it existing and running.
  • Context they come from: Either there is no platform yet (first-ever build) or the platform is gone / being rebuilt in parallel. Either way, what they have in hand is the definitions repo, root-level access to the underlying infrastructure (cloud account, home-lab), and — for disaster recovery — backups of tenant data sitting somewhere reachable.
  • What they care about here: Confidence that the platform really is reproducible from its definitions. Speed matters too (the Reproducibility KPI is 1 hour) but takes second place — a fast rebuild that leaves the operator unsure whether anything was missed is worse than a slower one that finishes verifiably clean.

Goal

“I want to rebuild the platform from nothing back to ready-to-host-tenants — confidently and verifiably, fast enough that the 1-hour KPI holds — so that total loss is recoverable, not catastrophic.”

Confidence beats speed when the two conflict. The operator is rebuilding the substrate that everything else of theirs depends on; a hurried rebuild that they don’t trust is its own kind of failure.

Entry Point

Three triggers converge on this same flow:

  • First-ever build. No platform has existed before; the operator is bringing it into being.
  • Disaster recovery. The platform existed and is now gone (cloud project lost, home-lab destroyed, ransomware, etc.); the operator is rebuilding on top of root-level access that survived the disaster.
  • Drift / reproducibility drill. The operator rebuilds the platform in parallel on scratch infrastructure after every significant platform change — meaning any change that would alter what they are rebuilding, what they must validate, or what they must trust before calling the platform ready again — and at least quarterly to prove the KPI still holds while the live platform keeps serving. The drill is identical to the real flow — only the underlying infrastructure differs.

What the operator has in hand at minute zero:

  • The definitions repo, pulled fresh.
  • Root-level access to the underlying infrastructure (cloud-provider account, home-lab access). Loss of these is not in scope for this UX — they are foundational and must already be in place before the platform can be (re)built.
  • For disaster recovery only: tenant-data backups. Restoring those into newly-provisioned tenants is a separate UX; this UX ends before that begins.

The operator’s state of mind is steady, not panicked: this journey exists precisely so total loss isn’t catastrophic, and a drill rehearses it on purpose.

What is not assumed at entry:

  • Definitions drift. Before any rebuild with prior platform state starts, the operator performs a required preflight drift check against the live platform or the last known-good environment. On a first-ever build, the check is vacuously clean because there is no prior platform state yet. The check passes only when the platform state the operator is treating as real still matches the definitions closely enough that no unexplained differences remain. If drift exists, it must be detected and fixed before this journey begins, not discovered partway through. (See Constraints.)
  • The sealed successor credentials. They stay sealed during routine standup, including DR.
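
The "no unexplained differences remain" test in the drift-check bullet above could be as small as the following sketch, assuming the definitions and the live state can each be exported as comparable key-value maps (the function and parameter names are illustrative):

    def drift_check(defined: dict[str, str], live: dict[str, str],
                    explained: set[str]) -> list[str]:
        """Keys that differ between definitions and live state, minus
        differences the operator has already explained. The check passes
        only when the returned list is empty."""
        keys = set(defined) | set(live)
        return sorted(k for k in keys
                      if defined.get(k) != live.get(k) and k not in explained)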

Journey

The rebuild is automated, with manual operator-validation checkpoints between phases. The operator is on standby throughout — watching log output and system-level signals, ready to validate at each checkpoint, but not driving each step by hand.
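
As a concrete illustration, the single top-level entry point described below could drive the phases with a loop like this sketch. The rebuild.sh script, the phase names, and the y/N prompt are assumptions standing in for whatever the definitions repo actually provides.

    import subprocess
    import sys

    # Phase names are assumed to map onto targets in the definitions repo.
    PHASES = ["foundations", "core-services", "cross-cutting", "canary"]

    def provision(phase: str) -> None:
        # rebuild.sh is a hypothetical stand-in for the repo's automation.
        subprocess.run(["./rebuild.sh", phase], check=True)

    def teardown_everything() -> None:
        subprocess.run(["./rebuild.sh", "destroy-all"], check=True)

    def operator_validates(phase: str) -> bool:
        # Manual checkpoint: the operator checks provider UIs and signals.
        answer = input(f"Phase '{phase}' done. Validated? [y/N] ")
        return answer.strip().lower() == "y"

    def standup() -> None:
        while True:  # restart-from-the-top loop
            try:
                for phase in PHASES:
                    provision(phase)
                    if not operator_validates(phase):
                        raise RuntimeError(f"validation failed at {phase}")
            except Exception as err:
                # Partial state is a snowflake risk: never fix-and-resume.
                print(f"halted: {err}; tearing down and restarting",
                      file=sys.stderr)
                teardown_everything()
                continue
            break  # all phases green, canary included

    if __name__ == "__main__":
        standup()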

1. Decide to rebuild and confirm preconditions

The operator decides to rebuild — first build, DR, or scheduled drill — and confirms what they have in hand: a fresh pull of the definitions repo and root-level access to the target infrastructure (the live infra for first-build/DR, scratch infra for a drill). Before they kick anything off, and whenever prior platform state exists, they run the required preflight drift check against the live platform or the last known-good environment, confirming that the state they intend to trust still matches the definitions closely enough to rebuild from them honestly. If the check fails because unexplained differences remain, they stop and resolve the drift before starting the rebuild.

What they perceive: nothing yet on the target infrastructure; a clean definitions repo on their workstation; the underlying provider UIs (cloud console, IPMI) showing the empty starting state.

2. Kick off the top-level rebuild

The operator runs the single top-level entry point that drives the rebuild from the definitions repo. From here on, automation does the work of provisioning; the operator’s job is to validate at each checkpoint.

What they perceive: log output begins streaming. The first phase is underway.

3. Phase 1 — Foundations

Automation provisions the underlying foundations: cloud project / home-lab base, network plumbing including the connectivity between cloud and home-lab. On completion the automation pauses and prints a phase summary.

The operator validates by checking the underlying provider’s UIs (cloud console, home-lab IPMI) and the expected signals for this phase. Only when they are satisfied that the foundations really are in place do they signal continue.

If validation fails, see Edge Cases — Phase fails.

4. Phase 2 — Core platform services

Automation provisions compute, persistent storage, and the platform-provided identity service on top of the foundations. Pauses. The operator validates the same way — provider UIs plus the expected signs that compute, storage, and identity are really available (e.g. the identity service is reachable and issuing tokens) — then signals continue.

5. Phase 3 — Cross-cutting services

Automation provisions backup and observability so they cover the platform itself before any tenant arrives. Pauses. The operator validates that backup is wired in and observability is collecting, then signals continue.

6. Phase 4 — Readiness verification and canary tenant

The platform deploys, end-to-end, a purpose-built canary tenant that is maintained alongside the platform definitions. It exists solely to prove the platform can host tenants without coupling readiness to any real tenant’s lifecycle. The trade-off is that a purpose-built canary is less representative than a small real tenant, so it may miss tenant-specific workload quirks; it is preferred anyway because it keeps readiness verification deterministic and disposable. The canary is exercised (it should run, be reachable, store and read back data, authenticate against the platform-provided identity service, and be picked up by backup and observability), then torn down.

What the operator perceives: a clear pass/fail on the canary. The canary’s success is the readiness signal — “ready to host tenants” is operationally identical to “did host a tenant just now.”
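
The canary checks themselves could be as simple as this sketch. The endpoint URL and every helper below are hypothetical stand-ins for real probes against the corresponding platform offerings.

    import urllib.request

    CANARY_URL = "http://canary.platform.internal/health"  # hypothetical

    def reachable() -> bool:
        try:
            with urllib.request.urlopen(CANARY_URL, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    def stores_and_reads_back() -> bool:
        return True  # stand-in: write a record through the canary, read it back

    def authenticates() -> bool:
        return True  # stand-in: obtain and use a token from platform identity

    def covered_by(offering: str) -> bool:
        return True  # stand-in: confirm the named offering picked the canary up

    def canary_is_green() -> bool:
        checks = {
            "reachable": reachable(),
            "stores and reads back data": stores_and_reads_back(),
            "authenticates via platform identity": authenticates(),
            "picked up by backup": covered_by("backup"),
            "picked up by observability": covered_by("observability"),
        }
        for name, ok in checks.items():
            print(("PASS" if ok else "FAIL") + "  " + name)
        return all(checks.values())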

If the canary fails, see Edge Cases — Canary fails.

7. Note the wall-clock and close out

The operator records how long the rebuild took. If it came in under the 1-hour KPI, they’re done — the platform is ready for tenant restoration (a separate UX). If it took longer, the platform is still ready; the operator opens a GitHub issue capturing the cause of the slowdown so it can be analyzed and improved later. Either way, the journey ends here.

Flow Diagram

flowchart TD
    Start([Trigger: first build / DR / drill]) --> Confirm[Run required preflight drift check<br/>when prior state exists and confirm<br/>root-level access is in hand]
    Confirm --> Kickoff[Run top-level rebuild from definitions]
    Kickoff --> P1[Phase 1: Foundations<br/>cloud + home-lab base, networking]
    P1 --> V1{Validate via provider UIs<br/>+ expected signals}
    V1 -->|Fails| Halt[Halt, root-cause,<br/>tear down everything,<br/>fix definition, restart]
    V1 -->|OK| P2[Phase 2: Core services<br/>compute, storage, identity]
    P2 --> V2{Validate}
    V2 -->|Fails| Halt
    V2 -->|OK| P3[Phase 3: Cross-cutting<br/>backup, observability]
    P3 --> V3{Validate}
    V3 -->|Fails| Halt
    V3 -->|OK| Canary[Phase 4: Deploy purpose-built<br/>canary tenant, then tear down]
    Canary --> CanaryGreen{Canary green?}
    CanaryGreen -->|No| Halt
    CanaryGreen -->|Yes| Wallclock[Note wall-clock duration]
    Wallclock --> KPI{Under 1 hour?}
    KPI -->|Yes| Ready((Platform ready<br/>to host tenants))
    KPI -->|No| Issue[Open GitHub issue<br/>to analyze the slowdown]
    Issue --> Ready
    Halt --> Kickoff

Success

When the canary comes up green and is cleanly torn down, the operator walks away with:

  • A platform that is ready to host tenants — every platform-provided service has been exercised end-to-end by a purpose-built tenant deployment, not just by self-checks.
  • Confidence in reproducibility. The rebuild ran from the definitions repo, with no manual snowflake configuration, and produced a working platform. The KPI is honestly met (or, if not, the gap is captured for follow-up rather than papered over).
  • A clean handoff to tenant restoration. Any previously-hosted tenants come back via their own restoration journey; the platform-side standup ends cleanly without entangling itself in tenant data.
  • For drills specifically: a renewed assurance that “we can rebuild this in an hour” is a real property, not a hope, because the drill is run after every significant platform change and at least quarterly rather than whenever it feels convenient.

Edge Cases & Failure Modes

  • Phase fails mid-rebuild. The automation hits an error during one of the phases. The operator halts, root-causes the failure, fixes the underlying issue (typically a definition that needs updating), tears down everything that was provisioned so far, and restarts the rebuild from the top. Partial state is itself a snowflake risk and is not trusted. This implies each phase must be reversible — at minimum, “delete everything” must be a viable rollback. (See Constraints.)

  • Preflight drift check fails. The rebuild does not start. The operator treats this as a definitions integrity problem, reconciles the drift, and only re-enters this journey once the required preflight check passes.

  • Definitions are drifted despite the preflight check. Drift is supposed to be prevented by the platform’s enforcement of tracked changes and immutability, and detected/fixed before this journey starts. If drift still surfaces during the rebuild (e.g. the canary fails because something expected by the definitions is missing or inconsistent), the operator treats it as a definitions bug — fix the definition, tear down, restart.

  • 1-hour KPI is missed. The platform is still up and ready for tenants. The operator records the wall-clock and opens a GitHub issue to analyze why it took longer than it should have. The KPI is missed for that rebuild, but the platform doesn’t get blocked from going back into service; KPI improvement is a follow-up concern, not part of this journey.

  • Canary tenant fails to come up. The platform is not ready, regardless of how green every prior phase looked. The operator root-causes the canary failure, fixes the relevant definition, tears down, and restarts. Until the canary is green, the platform is not marked ready for tenants — even if the operator is under time pressure, this rule does not bend.

  • Successor at the keyboard. A successor who has taken over runs this same journey from the operator’s context. The act of breaking the sealed credentials and asserting takeover is a separate UX (not yet defined); once the successor is in, the rebuild flow does not differ.

  • First build has no backups. First-build and DR/drill produce the same platform-side outcome from this UX’s perspective. Tenant data restore is out of scope here, so the absence of backups during a first build is simply a non-event for this journey.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability — by name, so future readers can trace the lineage:

  • KPI: 1-hour reproducibility. This is the journey the KPI is measured against. The 1-hour budget is a target, not a hard fail — missing it does not stop the platform from going into service, but it does generate a tracked follow-up issue. The KPI cannot be honestly evaluated unless drills run this same flow on parallel infrastructure after every significant platform change and at least quarterly.

  • Operator-only operation. No co-operators, no delegated administration, no shared driving of the rebuild. The sealed successor credentials are not used during routine standup, including DR. A successor uses them only after takeover, and from that point operates as “the operator” through this same UX.

  • The platform may span public and private infrastructure. Phase 1 (foundations) explicitly crosses cloud and home-lab boundaries — the rebuild is not a single-environment affair. Connectivity between the two is part of the foundation, not an afterthought.

  • Reproducibility beats vendor independence beats minimizing operator effort (the parent capability’s stated tiebreaker). It manifests here as the rule that partial state is not trusted: tearing down and restarting from scratch on any phase failure is more operator effort than incremental fix-and-resume, but it is what reproducibility honesty requires.

  • Operator succession. Successor takeover converges on this same UX — sealed credentials grant access to the operator’s context, after which the rebuild flow is identical. The seal-breaking event itself is a separate journey.

  • No specific availability or performance SLA. The journey ends at “ready to host tenants” — what tenants experience after that is governed by the platform’s normal availability characteristics, not by this UX.

  • Tracked changes and immutability across all platform UXs. The required preflight drift check is only meaningful if every UX that can introduce platform state enforces tracked changes and immutability rather than allowing ad-hoc modification. This is a property the platform’s definitions and operations must hold, not a step that invents drift policy on its own — but this UX is the one that refuses to proceed until that policy is verified.

  • Each phase must be reversible. Implied by the “phase fails → tear down everything and restart” edge-case rule. The platform’s definitions must support a clean teardown of any partially-provisioned state. “Delete everything and start over” must be a viable, reliable option at every checkpoint.

  • Default hosting target for the operator’s capabilities. Readiness cannot be declared from infrastructure self-checks alone; the platform has to prove it can actually host a tenant. That is why this UX requires a purpose-built canary tenant maintained with the platform definitions.

Out of Scope

  • Tenant data restoration. Bringing previously-hosted tenants’ data back into newly-provisioned tenants is a separate UX. This journey ends at “platform is ready to host tenants” — full stop.

  • Re-onboarding tenants after rebuild. Each tenant’s return is governed by its own journey (likely a variant of Host a Capability, possibly seeded by a backup-restore step). Not handled here.

  • Migration to new underlying infrastructure. Moving the platform to a different cloud account or different home-lab hardware while the old one is still running is a different journey (the old platform serves traffic while the new one comes up). Out of scope until migration becomes a realistic case worth defining.

  • Sealed-credential takeover by the successor. The act of breaking the seal, asserting authority, and gaining access to the operator’s context belongs in its own UX. This UX picks up after takeover, where the successor is operating as the operator.

  • The broader drift-management process. This UX requires a preflight drift check before rebuild, but the wider machinery that continuously enforces tracked changes, detects drift between rebuilds, and maintains the last known-good reference is a cross-cutting concern, not the focus of this journey.

  • Loss of root-level foundations (cloud account itself, all home-lab access). These are assumed in place before the platform was deployed in the first place. Recovery from their loss is not part of the platform’s capability.

Open Questions

None at this time.

7 - Tenant-Facing Observability

A capability owner with a live tenant checks whether their hosted capability is healthy — either pulling the view themselves or being pushed an alert when something crosses a threshold they set.

One-line definition: A capability owner with a live tenant checks whether their hosted capability is healthy — either pulling the view themselves or being pushed an alert when something crosses a threshold they set.

Parent capability: Self-Hosted Application Platform

Persona

The actor is a capability owner whose capability has already been onboarded via Host a Capability and is currently running on the platform. As with that UX, this is written as if the capability owner were a separate person from the operator, even though today they are the same human wearing different hats.

  • Role: Capability owner of a live tenant. They are not operating the platform; they are operating their capability, which happens to be hosted here.
  • Context they come from: Their capability is live and serving its end users. They are not in the middle of onboarding or modifying — that’s a different journey. They want to know how their thing is doing right now, or they have just been pinged that something is wrong.
  • What they care about here: Knowing the health of their capability without depending on end users to report problems first, and without having to interrupt the operator to ask.

Goal

“I want to know whether my hosted capability is healthy right now — and I want the platform to ping me if it isn’t — so I find out before my end users do, and I can tell whether the problem is mine to fix or the platform’s.”

Two arrival modes share this goal: proactive pull (capability owner goes looking) and reactive push (an alert reaches them). The view they reach is the same in both cases.

Entry Point

Two distinct entries, converging on the same view.

Pull entry. The capability owner opens the observability offering’s tenant view. They might be doing this:

  • Routinely (e.g. before promoting a new release of their capability to end users).
  • Reactively, because an end user reported something looked off and they want to confirm.
  • Out of curiosity / habit.

They reach it by authenticating to the shared observability offering. After login they land directly in their own tenant’s view and stay confined there for the rest of the session. There is no separate URL per tenant — the same offering serves everyone, but capability owners do not browse across tenants or switch into an operator-wide view.

Push entry. The platform’s alerting reaches them by email. They are pulled away from whatever they were doing and now have a concrete “your capability looks unhealthy” signal in hand.

In both cases their access was provisioned automatically as part of the original onboard my capability flow (step 5 of Host a Capability) — observability is part of being hosted, not an add-on they request later.

Journey

1. Access is already in place (set up during onboarding)

By the time the capability owner has a live tenant, they already have:

  • A working login to the observability offering, scoped to their tenant.
  • Email alerting wired to the address they use for platform communication.
  • A platform-standard health bundle for their capability: availability, latency, error rate, resource saturation, and restart / deployment events.
  • A clear contract about trust: the tenant view is the source of truth for current health, while email alerts are a best-effort nudge that helps them notice trouble sooner.

Nothing in this UX requires them to set any of that up. Threshold tuning happens inside the observability offering, but any request to expand the signal bundle or add new delivery channels goes through modify my capability — not this journey.

2. (Pull mode) Capability owner opens the observability view

They authenticate to the observability offering and land on a view scoped to their tenant. They see the current state of the platform-standard health bundle for their capability: whether it is up, how quickly it is responding, how often it is failing, whether it is under resource pressure, and whether it has recently restarted or been redeployed.

What they perceive: a current-state read of their capability’s health, plus enough recent history to tell whether something is trending bad. They cannot see other tenants, and there is no mode-switch that broadens their scope — only the operator can do that.
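
A minimal sketch of what that confinement could look like server-side, assuming the offering binds one tenant to each authenticated session (the data shape and all names are illustrative, not the offering's real API):

    # Assumed shape: signals keyed by tenant, one tenant bound per session.
    SIGNALS: dict[str, dict[str, float]] = {
        "tenant-a": {"availability": 99.9, "error_rate": 0.2},
        "tenant-b": {"availability": 98.7, "error_rate": 1.4},
    }

    def tenant_view(session_tenant: str) -> dict[str, float]:
        """Return only the session's tenant. There is deliberately no
        parameter that widens the scope: cross-tenant reads are an
        operator-only path, not a tenant-facing option."""
        return SIGNALS[session_tenant]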

3. (Pull mode) Capability owner tunes thresholds, if needed

While in the offering, the capability owner can self-serve their alert thresholds — the values that, when crossed, will fire an email alert to them. Thresholds are their call: the platform does not prescribe what’s unhealthy enough to wake them up.

This is the one self-service surface the platform exposes to capability owners. Everything else still goes through GitHub issues; thresholds are an exception because they are a tuning knob the capability owner needs to iterate on without operator involvement.

If the observability offering knows email delivery is degraded for this tenant, it says so in the tenant view. What the capability owner takes away: silence from email is not reassurance until the delivery path is healthy again; the pull view is the authoritative answer in the meantime.
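
The thresholds tuned here are what drive the push mode described next. A minimal sketch of how crossing could be evaluated, assuming stateless alerting (the Threshold shape, signal names, and message format are illustrative):

    from dataclasses import dataclass

    @dataclass
    class Threshold:
        signal: str            # e.g. "error_rate"
        limit: float           # the capability owner's own choice
        above_is_bad: bool = True

    def alerts_to_send(thresholds: list[Threshold],
                       current: dict[str, float],
                       capability: str) -> list[str]:
        """Stateless evaluation: an alert fires while a threshold stays
        crossed and stops on its own once the signal recovers, which is
        why the journey has no acknowledgment step."""
        alerts = []
        for t in thresholds:
            value = current.get(t.signal)
            if value is None:
                continue
            crossed = value > t.limit if t.above_is_bad else value < t.limit
            if crossed:
                alerts.append(f"[{capability}] {t.signal}={value} crossed "
                              f"your threshold of {t.limit}")
        return alerts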

4. (Push mode) An alert reaches the capability owner

A signal crossed a threshold they set. The platform sends an email alert. The alert names which signal and which capability — enough for them to start without opening anything else.

What they perceive: their capability is unhealthy enough that they wanted to be told. They now have to figure out whose problem it is.

5. Root-cause — is this the tenant or the platform?

The capability owner investigates. They have two possible conclusions:

5a. It’s the tenant. The signals point at their capability — their code, their data, their config. They handle it the way they would handle any problem with their capability: fix on their side, ship a new artifact via modify my capability if it requires a deployment, or operate within the running tenant if the tools to do so exist.

5b. It’s the platform. The signals point at something below their capability — the host is gone, networking is broken, the storage offering is degraded. They look for an open operator-side issue tracking it. If the operator has already opened one, they watch that issue; the operator owns the fix, and the capability owner’s role from here is to stay aware so they can communicate to their own end users. If no such issue exists yet, the operator will probably open one shortly (the operator gets the same signals); the capability owner does not need to file anything themselves.

6. Resolution

Either:

  • They fix their side of it and signals return to healthy. The alert (if there was one) does not need to be acknowledged — the platform stops alerting because the threshold is no longer crossed.
  • The operator fixes the platform side of it and signals return to healthy. The operator-side issue closes. The capability owner has been a passive watcher.

In either case the capability owner walks away with the same end state: their capability is healthy again, and they knew about the unhealth without an end user telling them.

Flow Diagram

flowchart TD
    Onboarded([Capability is live — observability access<br/>and email alerts provisioned during onboarding]) --> Trigger{What brought<br/>them here?}
    Trigger -->|Routine check / end user reported| Pull[Open observability view —<br/>see tenant-scoped signals]
    Trigger -->|Push alert from platform| Alert[Receive email alert:<br/>which signal, which capability]
    Pull --> Tune{Want to adjust<br/>thresholds?}
    Tune -->|Yes| Self[Self-serve threshold change<br/>in the observability offering]
    Tune -->|No| Read[Read the signals]
    Self --> Read
    Read --> Healthy{Healthy?}
    Healthy -->|Yes| Done([Walk away — capability is fine])
    Healthy -->|No| RootCause
    Alert --> RootCause[Root-cause: tenant or platform?]
    RootCause -->|Tenant| FixSelf[Fix on their side —<br/>via 'modify my capability' if needed]
    RootCause -->|Platform| Watch[Watch the operator's issue —<br/>operator opens one, capability owner observes]
    FixSelf --> Recover([Signals recover])
    Watch --> Recover

Success

A successful experience looks like:

  • The capability owner learned about a health problem before their end users had to tell them, or confirmed health proactively before promoting a change.
  • They could tell, from the signals alone, whether the problem was theirs or the platform’s — without having to interrupt the operator to ask.
  • If it was theirs, they fixed it through the channels they already use (modify issue, in-tenant tools, redeploy).
  • If it was the platform’s, they had something concrete to watch (the operator’s issue) and could relay status to their own end users.
  • They understood that the tenant view was authoritative and email was an acceleration path, so silence from email was never the only evidence they relied on.
  • They did not have to set anything up to make this work — onboarding put it in place.

Edge Cases & Failure Modes

  • Alert fatigue / ignored alerts. A capability owner who stops responding to their own alerts is not the platform’s problem — alerts are a courtesy; tenant health is tenant responsibility. The platform keeps emitting alerts; what the capability owner does with them is their call.
  • Threshold set too tight, capability owner spammed. Self-serve thresholds means the capability owner can fix this themselves. The platform does not intervene to “save them from themselves.”
  • Threshold set too loose, real problems missed. Same — their call, their consequence. The platform’s defaults (whatever the observability offering ships with) provide a starting point.
  • Operator hasn’t opened a platform-side issue yet when the capability owner is investigating. The capability owner does not need to file one themselves. The operator gets the same signals and will open one. If they don’t and the problem persists, that is an operator-side failure, not a capability-owner-side action.
  • Capability owner suspects the platform, but the platform’s signals look fine. They raise it on a modify my capability issue or in a comment to the operator — the same surfaces they would use for anything ambiguous. This UX does not introduce a new issue type for “I think it’s you, not me.”
  • Alert delivery is broken (email bounces, mailbox rule hides it, etc.). The capability owner does not treat silence from email as proof of health; the pull view remains the source of truth. If the offering knows delivery is failing, the tenant view shows alerting as degraded so the capability owner understands email is currently unavailable as a nudge.
  • Capability owner wants more than email alerts or wants a broader signal bundle. Goes through modify my capability, not this UX — it’s a contract change about what the platform delivers to the tenant, even if a small one.

Constraints Inherited from the Capability

This UX must respect the following items from the parent capability’s Business Rules and Success Criteria:

  • Operator-only operation. The capability owner is not an operator. Their access is scoped to their own tenant; the operator is the only role that sees across tenants (see the sketch after this list). The one self-service surface (threshold tuning) does not violate this — it adjusts only their own email alerts, not platform configuration.
  • Direct outputs include observability. The parent capability lists observability as a direct output: “the operator can tell whether each tenant is up and healthy without the tenant having to instrument that itself.” This UX extends that same plumbing to the capability owner with tenant-scoped data access — observability is surfaced as an offering, with cross-tenant visibility kept to the operator.
  • No direct end-user access to the platform. The capability owner’s end users do not get observability access. This view stops at the capability owner.
  • Tenants must accept the platform’s contract. The signals available are the platform-standard health bundle — availability, latency, error rate, resource saturation, and restart / deployment events. Capability owners do not ask the platform to instrument arbitrary tenant-specific metrics as part of this UX.
  • The capability evolves with its tenants. If multiple tenants need observability beyond the standard health bundle, the right response is to expand the offering’s category — not to push instrumentation back onto the tenant.
  • KPI: 2-hr/week operator maintenance budget. Implication: the alerting path must not produce so many false positives that the operator is constantly fielding “is this me or you?” questions from capability owners. Self-serve thresholds and “operator gets the same signals” are both pressure-reliefs on this — capability owners can tune their noise themselves, and they do not need to escalate “is this the platform?” questions to the operator because they can read the signals directly.
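
To illustrate the first two constraints (without claiming anything about the real implementation), tenant scoping can be pictured as an identity-derived filter at the read path; every name below is invented:

    # Hypothetical read-path scoping: the tenant comes from the
    # authenticated identity, never from a request parameter, so no
    # capability-owner input can widen the view. "store" and the shape
    # of "identity" are both assumptions.
    def query_signal(identity, store, signal: str):
        if identity.role == "operator":
            return store.read(signal)                       # cross-tenant view
        return store.read(signal, tenant=identity.tenant)   # own tenant only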

Out of Scope

  • Changing the signal bundle or adding non-email alert channels. Both go through Host a Capability’s modify loop, not here. They are contract changes about what the platform delivers.
  • End-user-facing observability. End users of a tenant do not get a “is the thing I use up?” view from the platform. If a tenant wants a status page for their end users, that is a feature of the tenant capability, not the platform.
  • Operator-side observability. The operator’s view across all tenants is its own surface, used during operator-driven journeys (rebuild, contract rollout, eviction decisions). This UX is strictly the capability owner’s slice.
  • Platform-side incident management. When the capability owner concludes “this is the platform’s problem and I’ll watch the operator’s issue,” what the operator does inside that issue is operator workflow — not part of this UX.
  • Threshold-tuning best practices. This UX provides the surface for self-serve threshold tuning; it does not document what thresholds a capability owner should pick. That belongs in the observability offering’s own documentation.

Open Questions

None at this time.