DPF Install Verification Runbook

DPF Install Verification Runbook

For early adopters with hardware in hand. The CI gates in .github/workflows/release-gates.yml and install-verification.yml cover what GitHub-hosted runners can run. This runbook covers everything CI can’t reach — and you’re the person who can close those gaps.

macOS Apple Silicon is GA — validated end-to-end on real hardware. The Linux and cloud installers are code-complete and statically CI-green and graduate to GA when the community sends us verification reports from real hardware. Reports still widen the tested matrix on every platform, including additional Mac models / macOS versions. If you have an Apple Silicon Mac, a non-Ubuntu Linux box, a TAPPaaS environment, or a cloud VM you can spare for an hour, please pick a section below and run through it. Both happy-path and failure reports are valuable.

How to report: open an issue using the Install verification report template (it pre-fills the title prefix Install verification — and the install-verification label). Paste your environment fingerprint, tick the checklist items you observed, and attach the doctor bundle (bash install-dpf.sh doctor~/.dpf/doctor-<timestamp>.tar.gz). Secrets are redacted automatically.

We don’t need every checkbox before reading your report — partial reports are useful too.

Verification matrix

Path What CI proves What this runbook covers Status
Repository unit tests .github/workflows/ci.yml runs pnpm test with an ephemeral Postgres service and prisma migrate deploy before the package tests Local package-specific repros when a test is tied to host Docker state, platform credentials, or a dirty install 🧪 CI informational until issue #104 is drained
Windows installer n/a (no CI gate today) One-command wrapper now available (verify-install-windows.ps1); community reports wanted ✓ verified — formal reports now enabled
Linux end-to-end install install-verification.yml (ubuntu-latest, dev + release modes) — full compose up, /api/health=200, doctor bundle Distro coverage beyond Ubuntu: Debian 12, Fedora 39. Autostart-after-reboot. 🙋 reports wanted
macOS end-to-end install dry-run only (macos-14 can’t nest-virt Docker Desktop) The actual .dmg install + Docker Desktop boot + portal up + LaunchAgent reboot survival 🙋 reports wanted
Discovery collectors (darwin) unit tests with mocked deps Real pkgutil --pkgs / brew list enumeration emits sensible discovery items 🙋 reports wanted
Observability stack compose-render only Prometheus actually scrapes metrics; Grafana dashboards populate; linux-monitoring profile cAdvisor / node-exporter on real Linux 🙋 reports wanted
LLM provider — Docker Model Runner dry-run Real Model Runner serves chat completions to portal 🙋 reports wanted
LLM provider — Ollama (Linux) compose-render ollama service actually pulls + serves a model 🙋 reports wanted
LLM provider — external code review Real LLM_BASE_URL (Anthropic / OpenAI / hosted Ollama) round-trips 🙋 reports wanted
TAPPaaS deployment none — spec only Pilot deploy into a real TAPPaaS environment 🧪 design partner wanted
DPF Edge Node enrollment (single-host) none — spec only First-draft enrollment ceremony executed end-to-end 🙋 reports wanted
DPF Edge Node — multi-host LAN (T2) none — code-complete via T2.1-T2.4 Authority on Host A, Edge Node on Host B over a real LAN with a switch; non-loopback IP attribution; ARP / nmap / SNMP collector output reaching Postgres + Neo4j 🙋 reports wanted
Cloud — Single VM substrate (AWS / GCP / Azure) runbook + verify-wrapper ready (cloud-single-vm.md) First real-cloud pilot report on each major cloud 🙋 reports wanted
Cloud — Managed Container service / Managed k8s none — spec only Substrate pilot per packaging target 🧪 design partner wanted

Fastest path: one command for the whole sweep

Linux + macOS

Run the bash wrapper to verify the full install + Edge Node path and produce one tarball:

# On a host that already has DPF installed and running:
bash scripts/verify-install-edge.sh

# Or, on a fresh host (clones + installs first):
git clone https://github.com/OpenDigitalProductFactory/opendigitalproductfactory ~/dpf
cd ~/dpf
bash install-dpf.sh --headless --release --no-autostart
bash scripts/verify-install-edge.sh

The wrapper covers ledger rows 1-4 in a single sweep:

Windows

Run the PowerShell wrapper from the repo root:

# On a host that already has DPF installed and running:
.\scripts\verify-install-windows.ps1

# Override the portal URL (e.g. remote host):
.\scripts\verify-install-windows.ps1 -AuthorityUrl http://my-host:3000

The wrapper covers the same surface as the bash version, adapted for Windows:

Filing a report

Attach the resulting bundle (.tar.gz on Linux/macOS, .zip on Windows) to a new install-verification issue and paste the printed summary into the body.

The wrappers do not cover real-LAN multi-host, TAPPaaS, or cloud substrate pilots. Those still need the manual sections below.

Local scratch rehearsal before promotion

For Windows development machines that already have a production-served local DPF install running, use the non-destructive scratch rehearsal instead of resetting the real stack:

.\scripts\scratch-install-rehearsal.ps1 -Execute

See Scratch Install Rehearsal. The script creates a separate worktree, a separate Compose project, alternate ports, and scratch-only secrets, then records evidence under the scratch directory.

For install, setup, provider, Build Studio, Work Capsule, or promotion-sensitive branches, include the browser-driven new-customer path:

.\scripts\scratch-install-rehearsal.ps1 -Execute -RunFirstRunWalkthrough

That mode verifies /setup, owner account creation, provider setup access, and /build/work access from the scratch install before cleanup. The browser walkthrough uses a 120-second per-step timeout for cold production containers; raise -WalkthroughTimeoutSeconds only when the machine is known to be slower.

How to run each verification (manual sections)

Each section below is paste-able. Copy the block, fill in the prompts, capture the artifacts named at the bottom of the section, and check the corresponding row off the matrix.

1. Linux end-to-end install (real distro coverage)

Hardware: any of Ubuntu 22.04+ / Debian 12+ / Fedora 39+. A fresh VM is ideal so you exercise the “user has nothing installed” path.

# Capture environment fingerprint before starting.
uname -a > /tmp/dpf-verify-host.txt
[ -r /etc/os-release ] && cat /etc/os-release >> /tmp/dpf-verify-host.txt

# Prerequisites the installer expects you to bring.
which git curl bash
node -v          # must report v20.x or higher
pnpm -v          # any recent version

# Clone + install.
git clone https://github.com/OpenDigitalProductFactory/opendigitalproductfactory ~/dpf
cd ~/dpf
bash install-dpf.sh                 # interactive — say "Ready to go"
# Or: bash install-dpf.sh --headless --release --no-autostart

Expected outcomes (check each):

Artifacts to capture:

2. macOS Apple Silicon end-to-end install

Hardware: a real Apple Silicon Mac (M1 / M2 / M3 / M4). macOS 14 (Sonoma) or newer. A fresh user account or a VM is ideal — you need to exercise the “no Docker Desktop installed yet” path.

# Capture environment fingerprint.
sw_vers > /tmp/dpf-verify-host.txt
uname -a >> /tmp/dpf-verify-host.txt

# Prerequisites.
xcode-select --install            # one-time, if not already installed
which git curl bash
node -v                           # v20.x or higher; brew install node or use nvm
pnpm -v                           # npm install -g pnpm

# Clone + install. Do NOT pre-install Docker Desktop — let the
# installer exercise the `.dmg` flow.
git clone https://github.com/OpenDigitalProductFactory/opendigitalproductfactory ~/dpf
cd ~/dpf
bash install-dpf.sh

Expected outcomes:

Artifacts to capture:

3. Discovery collectors — real macOS data

After the macOS install above is green, verify the discovery pipeline emits real data, not just unit-test fixtures.

# Trigger a bootstrap discovery run from the portal:
#   Settings -> Discovery -> Run Bootstrap

# Then inspect the database for host evidence.
docker compose -p dpf exec postgres psql -U dpf -c \
  "SELECT count(*), \"evidenceSource\", \"packageManager\"
   FROM \"DiscoveredSoftware\"
   GROUP BY \"evidenceSource\", \"packageManager\";"

Expected outcomes:

4. Observability stack — real metrics flow

# After a Linux install with the linux-monitoring profile active.
curl --silent http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job, health}'

Expected outcomes:

5. LLM providers

Provider Verification
Docker Model Runner (macOS DD ≥ 4.40) Settings → AI → run a prompt against the local model. Portal logs show POST http://model-runner.docker.internal/engines/v1/chat/completions returning 200.
Ollama (Linux native Docker) docker compose -p dpf exec portal curl http://ollama:11434/api/tags lists at least one model after first portal use. A chat completion round-trips.
External Set LLM_BASE_URL=https://api.anthropic.com/v1 and a real API_KEY in .env; restart portal; verify a chat completion succeeds.

6. TAPPaaS pilot deployment

Status: spec-only — no code shipped. This row stays “not started” until a Phase-0 spike lands.

The spike should:

Linked spec: docs/superpowers/specs/2026-05-09-cloud-deployment-design.md.

7. DPF Edge Node enrollment

Status: Phase 0 code shipped — Authority Core surface (/api/v1/edge/enroll, /heartbeat, /discovery-runs), service skeleton at services/edge-node, Admin UI at /platform/edge-nodes, lifecycle verification script. Verification reports wanted for a real Edge Node enrolling against a running Authority Core on each platform.

How to run the lifecycle verification script

The script exercises the spec’s reference flow (enroll → heartbeat → submit → idempotent replay → stale_observation → rate limit) against a running Authority Core and exits non-zero on any assertion failure.

# 1. Bring up the platform.
bash dpf-start.sh   # or dpf-start on Windows

# 2. Sign in as an HR-000 / superuser at /platform/edge-nodes and
#    issue a bootstrap token. Copy the plaintext shown ONCE.

# 3. Run the lifecycle verification.
cd services/edge-node
DPF_AUTHORITY_URL=http://localhost:3000 \
DPF_BOOTSTRAP_TOKEN='dpfboot_YOURPLAINTEXTHERE' \
DPF_VERIFY_NODE_NAME='verify-2026-05-12-mike-laptop' \
  pnpm verify-lifecycle

# 4. The script enrolls; if your Approval policy is "operator
#    approval required" (the default for paste-provisioned tokens),
#    it pauses at Phase 2 and prints a "node needs approval" message.
#    Go back to /platform/edge-nodes, click Approve on the new
#    pending node, then re-run the script. (Phases 1-2 will run
#    again because the script issues a fresh runKey per invocation.)
#
#    On a "trusted" enroll path (auto-approve metadata on the
#    bootstrap token), the script runs all six phases in one pass.

# 5. The script prints
#       Results: N passed, M failed
#    and exits 0 if M = 0.

Then file the verification report (template below) and check off the matrix row for your platform.

Phase 0 verification matrix

Linked spec: docs/superpowers/specs/2026-05-09-dpf-edge-node-design.md. Lifecycle script: services/edge-node/scripts/verify-lifecycle.ts.

Clock sync prerequisite (NTP)

The Authority’s /api/v1/edge/discovery-runs route enforces a freshness window on submitted observations. The default bounds are asymmetric:

Prerequisite: every Edge Node host must run NTP. On a freshly provisioned VM:

# Debian / Ubuntu — systemd-timesyncd is the default
sudo timedatectl set-ntp true
timedatectl status                # expect: System clock synchronized: yes

# Fedora / RHEL — chrony is the default
sudo systemctl enable --now chronyd
chronyc tracking                  # expect: Leap status: Normal

If you see stale_observation errors with the message “must be at most 5.0min ahead of server time (likely cause: NTP skew on the Edge Node host)” or “must be at most 24.0h before server time”, that’s the freshness gate firing. Fix NTP first, then re-check; widening the bounds with the env vars below should be a last resort.

Operator-configurable bounds:

The Authority Core honors two env vars (set on the Authority host, not on the Edge Node — the check runs server-side):

Env var Default Purpose
DPF_EDGE_FRESHNESS_PAST_SEC 86400 (24h) How old observedAt may be. Tighten for air-gap deployments where 24h-stale data is suspicious by definition.
DPF_EDGE_FRESHNESS_FUTURE_SEC 300 (5min) How far ahead observedAt may be. Widen only if you have chronic clock skew that’s expected (rare).

Either bound set to a non-numeric value or <= 0 falls back to the default — operator typos don’t accidentally disable the check.

Audit rows for rejections include the signed deltaMs, direction (past or future), and the bounds in force at the time:

SELECT parameters->'summary' AS summary
FROM "ToolExecution"
WHERE "toolName" = 'edge.discovery_runs.submit'
  AND "result"->>'error' = 'stale_observation'
ORDER BY "createdAt" DESC LIMIT 5;

7b. Edge Node multi-host LAN verification (T2)

Status: T2 code-complete per T2.1 through T2.4. Real-hardware reports wanted for a real second-host Edge Node enrolling against a remote Authority Core across a LAN.

This is the next verification rung up from § 7 (which covers single-host enrollment). The bar:

Authority Core on Host A
   │  (over a real LAN, separated by at least one switch)
   ▼
Edge Node on Host B   ← a different machine, native Linux Docker

The Authority portal shows the Edge Node enrolled with a non-loopback IP attributed to it, and the discovery rows include real LAN-side IPs plus at least one switch / gateway / non-portal-host item with osiLayer >= 2.

Full runbook

The end-to-end ceremony — Authority preflight, bootstrap token, operator approval, first sweep, and the SQL that verifies it landed — is in docs/install/edge-node-multi-host.md.

It also walks the optional HTTPS path (T2.2 — Caddy sidecar on the Authority, NODE_EXTRA_CA_CERTS on the Edge Node).

Operational notes you’ll want before starting

These are the gotchas the T2 verification gap list (G4 + G7) called out — they bite operators who jump straight to docker compose up -d:

1. Authority Core URL stability (G4). The Edge Node persists the DPF_AUTHORITY_URL it enrolled against. If Host A reboots and DHCP hands it a different IP, every enrolled Edge Node on the LAN loses contact until you reconfigure. Use one of:

A dynamic DHCP IP is fine for the first verification run, but plan to pin it before you treat the deployment as durable. Authority auto-discovery (mDNS-SD, Zeroconf, etc.) is deferred to a later thread — T2’s expectation is “operator pins it.”

2. Operator approval gate (G7). Paste-provisioned bootstrap tokens always land in trustState=pending per spec § Approval policy. The node enrolls successfully but cannot submit observations until an operator clicks Approve in /platform/edge-nodes. This is the friction that makes Edge Node enrollment opt-in, not silent.

The Phase 0 single-host demo glosses over this because the local installer-issued bootstrap token auto-approves. The multi-host path does NOT — if you skip the Approve click, your sweeps will keep failing with node_not_trusted errors and your DiscoveryRun table will stay empty. Watch for this in your first run.

3. NTP on both hosts. See § 7’s “Clock sync prerequisite (NTP)” subsection above. The Authority’s /api/v1/edge/discovery-runs route rejects submissions whose observedAt falls outside the asymmetric freshness window. Fresh VMs without NTP can drift tens of seconds and silently fail their first sweep.

Verification gate

Same checklist as § 7’s Phase 0 matrix, plus the multi-host-specific rows from the Edge Node multi-host verification template:

When that happens, T2 flips to verified on real LAN in the parent thread ledger.

File a report

Open an issue with the Edge Node multi-host verification template. The template asks for LAN topology, the path you took (HTTP / HTTPS), the T2 checklist tick-list, the SQL output, and a doctor bundle from each host. Partial reports are valuable — failure reports name the symptom and the LAN topology and are more useful than no report.

8. Cloud deployment patterns

Three substrates per the cloud deployment design; each needs its own pilot to flip from spec to verified.

Reporting verification results

When you’ve run a section, file your report through the Install verification report template. The form mirrors the checklist above and pre-fills:

The template prompts for everything maintainers need:

Maintainers will:

The bar for ✅ verified is “we have evidence on real hardware”, not “we believe”. Partial reports still count — even a “got to step N and failed” failure report is more valuable than no report.