DPF Edge Node — Fleet Operations
What this is. The operator runbook for running many edge nodes against one DPF Authority Core: rollout, versioning, rotation, quarantine, decommission, and the health signals that keep a fleet from overwhelming the portal. For where nodes run, see the deployment topology guide; for token/data handling, see security & sovereignty.
Design: topology spec §7, §8A.3, §11.2 · SysML: PART-EDGE-fleetops, SM-LIFECYCLE, VC-EDGE-SCALE
The fleet contract in one paragraph
Every edge node is an independently enrolled, scoped, trusted, and reaped client of one Authority Core. The Authority Core is the single pane of glass and the only system of record. Nodes observe and submit; the Authority authenticates, scopes, persists, decides, and presents. Everything below exists so that scaling the number of nodes does not turn the portal into a bottleneck or the edge into a second management platform.
Node operational lifecycle
A node’s operational lifecycle is wider than its trust state:
created → pending → trusted → degraded → quarantined → revoked → retired
- pending / trusted / quarantined / revoked are real EdgeNode.trustState values (operator-driven).
- degraded is a health/observability concept (not a DB trust value): a node that is trusted yet behind on version, missing a capability, failing a collector, over its cardinality budget, or running on stale policy. Surface it on the fleet view; act on it before it becomes an incident.
- created → retired bracket the row: a decommissioned node is revoked, then its row is either retained for audit (its evidence stays scoped) or archived per retention policy.
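The split between stored trust state and derived health can be sketched as follows. This is a minimal illustration, not the Authority's implementation: the `NodeHealth` fields and the `fleet_view_status` helper are hypothetical names chosen to mirror the degraded conditions listed above.

```python
from dataclasses import dataclass
from enum import Enum


class TrustState(Enum):
    # The four values the runbook names as real EdgeNode.trustState values.
    PENDING = "pending"
    TRUSTED = "trusted"
    QUARANTINED = "quarantined"
    REVOKED = "revoked"


@dataclass
class NodeHealth:
    # Hypothetical health flags; one per degraded condition in the list above.
    version_behind: bool = False
    missing_capability: bool = False
    collector_failing: bool = False
    over_cardinality_budget: bool = False
    stale_policy: bool = False


def is_degraded(trust: TrustState, health: NodeHealth) -> bool:
    """Degraded is derived, never stored: a trusted node with any flag set."""
    if trust is not TrustState.TRUSTED:
        return False
    return any(vars(health).values())


def fleet_view_status(trust: TrustState, health: NodeHealth) -> str:
    """Label for a fleet view: the trust state, or 'degraded' when derived."""
    return "degraded" if is_degraded(trust, health) else trust.value
```

Keeping degraded out of the enum means an operator action never has to "undo" it; it clears when the underlying health condition clears.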
Rollout (adding nodes at scale)
Add nodes through the portal flow, not by hand-cloning the repo (see topology §8). For a fleet:
- Stage by scope. Roll out customer-by-customer (MSP) or region-by-region (retail), not all at once — a staged rollout keeps the first-heartbeat thundering herd bounded and lets you catch a bad build early.
- One node per context. One node per customer × site (MSP) or per location (retail). Do not multi-home one node across contexts — scope is per node and enforced server-side.
- Expect pending. Remote nodes land pending; approve them deliberately. A burst of unexpected pending nodes with unfamiliar hostnames is a signal, not a chore: investigate before approving.
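Staging by scope amounts to grouping the planned nodes by customer or region and releasing one group (or a few) per wave. The sketch below assumes the plan is available as `(scope, node_id)` pairs; `rollout_waves` and its parameters are hypothetical and not part of any DPF tooling.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def rollout_waves(
    nodes: Iterable[tuple[str, str]], scopes_per_wave: int = 1
) -> Iterator[list[str]]:
    """Yield node IDs one wave at a time, never splitting a scope across waves.

    nodes: (scope, node_id) pairs, e.g. ("acme", "acme-site-1").
    scopes_per_wave: how many scopes to enroll together in each wave.
    """
    if scopes_per_wave < 1:
        raise ValueError("scopes_per_wave must be at least 1")

    by_scope: dict[str, list[str]] = defaultdict(list)
    for scope, node_id in nodes:
        by_scope[scope].append(node_id)

    scopes = list(by_scope)  # dicts preserve first-seen order
    for start in range(0, len(scopes), scopes_per_wave):
        wave: list[str] = []
        for scope in scopes[start:start + scopes_per_wave]:
            wave.extend(by_scope[scope])
        yield wave
```

For example, `list(rollout_waves(plan, scopes_per_wave=2))` returns one list of node IDs per wave, so each wave can be approved and observed before the next one starts.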
Versioning and upgrades
- Signed, staged. Edge binaries/images are release artifacts (deployment Contract 1) — pull by pinned tag, verify the checksum/signature, and roll forward by scope.
- Compatibility window. The agent and the Authority /api/v1/edge/* contract maintain a version compatibility window; a fleet will always be briefly mixed-version. Track version skew on the fleet dashboard and treat large skew as degraded.
- Rollback path. Keep the prior pinned tag reachable so a bad edge release rolls back without re-enrollment (state survives in the node’s state volume/dir).
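The checksum half of "verify the checksum/signature" can be illustrated with the standard library. This sketch assumes the release publishes a SHA-256 hex digest alongside the artifact; that is an assumption, as is every function name here. Signature verification depends on the release tooling and is not shown.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the lowercase SHA-256 hex digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def artifact_matches(data: bytes, expected_sha256_hex: str) -> bool:
    """Compare the computed digest with the published one.

    The expected value is normalised (whitespace stripped, lowercased) and
    compared in constant time. The published digest format is an assumption.
    """
    expected = expected_sha256_hex.strip().lower()
    return hmac.compare_digest(sha256_hex(data), expected)
```

A caller would read the downloaded artifact in binary mode and refuse to install it when `artifact_matches` returns `False`.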
Rotation, quarantine, decommission
| Action | When | How | Effect |
|---|---|---|---|
| Rotate node token | suspected token exposure; periodic hygiene | portal node action | new dpfedge_*; old token invalid on next call |
| Re-issue bootstrap token | a generated install command leaked / expired | portal “issue token” | new one-use short-TTL dpfboot_*; old one already single-use |
| Quarantine | anomalous submissions; investigation | portal node action | node may heartbeat (stays visible) but discovery/metrics are rejected/diverted at the route layer |
| Revoke | decommission; compromise | portal node action | node clears state and exits on next heartbeat; evidence retained, scoped |
| Retire | row no longer needed | retention workflow | archived per retention class |
Quarantine being route-effective (not just a label) is the invariant: a quarantined node cannot keep feeding the inventory graph while you investigate.
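A route-layer gate that makes quarantine effective might look like the sketch below. The trusted and quarantined rows follow the table above. The fail-closed handling of pending, revoked, and unknown states is an assumption, as are the route names and the `route_decision` function; how a revoked node is told to clear state and exit is out of scope here.

```python
ROUTE_HEARTBEAT = "heartbeat"
ROUTE_DISCOVERY = "discovery"
ROUTE_METRICS = "metrics"
ALL_ROUTES = {ROUTE_HEARTBEAT, ROUTE_DISCOVERY, ROUTE_METRICS}


def route_decision(trust_state: str, route: str) -> str:
    """Return 'accept' or 'reject' for a submission on the given route."""
    if route not in ALL_ROUTES:
        raise ValueError(f"unknown route: {route}")

    if trust_state == "trusted":
        return "accept"

    if trust_state == "quarantined":
        # Heartbeats keep the node visible; data routes are blocked so the
        # node cannot feed the inventory graph during an investigation.
        return "accept" if route == ROUTE_HEARTBEAT else "reject"

    # Assumption: pending, revoked, and unrecognised states fail closed.
    return "reject"
```

Because the decision is keyed on the stored trust state at request time, flipping a node to quarantined takes effect on its next submission without any action on the node itself.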
Health signals to watch (and alert on)
These come from the Authority-side observability surface, not from scraping remote nodes (see security & sovereignty and topology §8A.4):
- Missed-heartbeat rate by scope — alert on the rate across a scope, not only per-node misses.
- Ingest error / discovery payload rejection rate — rising rejections mean a bad collector, clock skew (freshness window), or a misconfigured node.
- Ingest backlog / projection lag — the graph/reporting projection is async; a growing backlog is the early sign of fan-in pressure.
- Metrics cardinality budget — per-interface/per-host labels explode Prometheus series; keep them in inventory tables, alert when the budget is approached.
- Version skew — share of the fleet off the current pinned release.
- Stale pending nodes / tokens: enrolled-but-never-approved nodes and unused tokens are both cleanup and security signals.
- Token reuse / quarantine attempts: security signals worth a dedicated alert.
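Two of the signals above reduce to simple ratios. The sketch below assumes the inputs are already available on the Authority side as plain Python values; the function names and input shapes are hypothetical.

```python
from collections import Counter
from typing import Iterable


def missed_heartbeat_rate_by_scope(
    nodes: Iterable[tuple[str, bool]]
) -> dict[str, float]:
    """Fraction of nodes per scope that missed their last heartbeat.

    nodes: (scope, missed_last_heartbeat) pairs, one per node.
    """
    totals: Counter = Counter()
    missed: Counter = Counter()
    for scope, did_miss in nodes:
        totals[scope] += 1
        if did_miss:
            missed[scope] += 1
    return {scope: missed[scope] / totals[scope] for scope in totals}


def version_skew(versions: Iterable[str], pinned: str) -> float:
    """Share of the fleet that is not on the current pinned release."""
    seen = list(versions)
    if not seen:
        return 0.0
    return sum(v != pinned for v in seen) / len(seen)
```

Alerting on the per-scope rate rather than on individual misses separates a site-wide outage or a bad build from ordinary single-node noise.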
Fan-in controls (why the fleet stays healthy)
The Authority Core is the fan-in point by design. The controls that keep that correct rather than
fragile (topology §8A.3): server-assigned heartbeat intervals with jitter, per-scope concurrency
caps, payload size/rate caps, runKey idempotency, async projection to Neo4j/Qdrant/reporting,
bounded offline queues with drop-oldest-by-class, and visible backlog gauges. Validate them with the
synthetic fleet harness (100 / 1,000-node profiles, VC-EDGE-SCALE) before a broad rollout.
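Three of these controls are small enough to sketch. Everything below is illustrative and assumes details the spec may define differently: the jitter is a symmetric fraction of the base interval, runKey deduplication uses an in-memory set, and "drop-oldest-by-class" is read as one bounded queue per payload class. All names are hypothetical.

```python
import random
from collections import deque


def jittered_interval(base_seconds: float, jitter_fraction: float = 0.1, rng=None) -> float:
    """Heartbeat interval spread around the base so nodes do not align."""
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    rng = rng or random
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)


class IdempotentIngest:
    """Accept each runKey once; a retried submission is a no-op."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def submit(self, run_key: str) -> bool:
        """Return True on first sight of run_key, False on a duplicate."""
        if run_key in self._seen:
            return False
        self._seen.add(run_key)
        return True


class OfflineQueue:
    """Bounded per-class queues; a full class drops its own oldest item."""

    def __init__(self, limits: dict[str, int]) -> None:
        if any(limit < 1 for limit in limits.values()):
            raise ValueError("each class limit must be at least 1")
        self._queues = {cls: deque(maxlen=limit) for cls, limit in limits.items()}
        self.dropped = {cls: 0 for cls in limits}  # visible drop counters

    def push(self, cls: str, item: object) -> None:
        queue = self._queues[cls]
        if len(queue) == queue.maxlen:
            self.dropped[cls] += 1  # deque evicts the oldest on append
        queue.append(item)

    def items(self, cls: str) -> list:
        return list(self._queues[cls])
```

Bounding each class separately means a burst of metrics cannot evict queued discovery payloads, and the `dropped` counters give the backlog gauge something concrete to report.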