DPF Edge Node — Fleet Operations

What this is. The operator runbook for running many edge nodes against one DPF Authority Core: rollout, versioning, rotation, quarantine, decommission, and the health signals that keep a fleet from overwhelming the portal. For where nodes run, see the deployment topology guide; for token and data handling, see security & sovereignty.

Design: topology spec §7, §8A.3, §11.2 · SysML: PART-EDGE-fleetops, SM-LIFECYCLE, VC-EDGE-SCALE

The fleet contract in one paragraph

Every edge node is an independently enrolled, scoped, trusted, and reaped client of one Authority Core. The Authority Core is the single pane of glass and the only system of record. Nodes observe and submit; the Authority authenticates, scopes, persists, decides, and presents. Everything below exists so that scaling the number of nodes does not turn the portal into a bottleneck or the edge into a second management platform.

Node operational lifecycle

A node’s operational lifecycle is wider than its trust state:

created → pending → trusted → degraded → quarantined → revoked → retired

pending / trusted / quarantined / revoked are real EdgeNode.trustState values, and transitions between them are operator-driven.
degraded is a health/observability concept (not a DB trust value): a node that is trusted yet behind on version, missing a capability, failing a collector, over its cardinality budget, or running on stale policy. Surface it on the fleet view; act on it before it becomes an incident.
created and retired bracket the row: a decommissioned node is revoked, then its row is either retained for audit (its evidence stays scoped) or archived per retention policy.
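The split between stored trust state and derived health can be sketched as follows. This is a minimal illustration, not the Authority's implementation: the NodeHealth fields and the display_state helper are hypothetical names, chosen to mirror the degraded causes listed above.

```python
from dataclasses import dataclass
from enum import Enum


class TrustState(Enum):
    # The four stored EdgeNode.trustState values named in this runbook.
    PENDING = "pending"
    TRUSTED = "trusted"
    QUARANTINED = "quarantined"
    REVOKED = "revoked"


@dataclass
class NodeHealth:
    # Hypothetical observation flags, one per degraded cause listed above.
    behind_on_version: bool = False
    missing_capability: bool = False
    failing_collector: bool = False
    over_cardinality_budget: bool = False
    stale_policy: bool = False


def display_state(trust: TrustState, health: NodeHealth) -> str:
    """Return the label a fleet view might show for a node.

    "degraded" is derived at read time and never stored: it only applies
    to a node that is trusted and has at least one health flag set.
    """
    degraded = any(vars(health).values())
    if trust is TrustState.TRUSTED and degraded:
        return "degraded"
    return trust.value
```

Deriving the label on read keeps the stored trust value untouched, so clearing a health problem needs no operator action on the row.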

Rollout (adding nodes at scale)

Add nodes through the portal flow, not by hand-cloning the repo (see topology §8). For a fleet:

Stage by scope. Roll out customer-by-customer (MSP) or region-by-region (retail), not all at once — a staged rollout keeps the first-heartbeat thundering herd bounded and lets you catch a bad build early.
One node per context. One node per customer × site (MSP) or per location (retail). Do not multi-home one node across contexts — scope is per node and enforced server-side.
Expect pending. Remote nodes land pending; approve them deliberately. A burst of unexpected pending nodes with unfamiliar hostnames is a signal, not a chore — investigate before approving.
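One way to plan a rollout staged by scope is to group nodes into waves so that a whole customer or region lands together. The sketch below is an assumption about how such a plan could be computed offline; rollout_waves and its parameters are hypothetical and not part of the portal flow.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def rollout_waves(
    nodes: List[Tuple[str, str]], scopes_per_wave: int = 1
) -> Iterator[List[str]]:
    """Yield node IDs wave by wave, keeping each scope in a single wave.

    nodes is a list of (node_id, scope) pairs, where scope is a customer
    (MSP) or a region (retail). Scopes are processed in sorted order.
    """
    by_scope: Dict[str, List[str]] = defaultdict(list)
    for node_id, scope in nodes:
        by_scope[scope].append(node_id)

    scopes = sorted(by_scope)
    for start in range(0, len(scopes), scopes_per_wave):
        wave: List[str] = []
        for scope in scopes[start:start + scopes_per_wave]:
            wave.extend(by_scope[scope])
        yield wave
```

For example, three nodes across the scopes "acme" and "globex" produce two waves with the default of one scope per wave, which bounds how many first heartbeats arrive at once.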

Versioning and upgrades

Signed, staged. Edge binaries/images are release artifacts (deployment Contract 1) — pull by pinned tag, verify the checksum/signature, and roll forward by scope.
Compatibility window. The agent and the Authority /api/v1/edge/* contract maintain a version compatibility window; a fleet will always be briefly mixed-version. Track version skew on the fleet dashboard and treat large skew as degraded.
Rollback path. Keep the prior pinned tag reachable so a bad edge release rolls back without re-enrollment (state survives in the node’s state volume/dir).
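The checksum half of "verify the checksum/signature" can be sketched with the Python standard library. This assumes the release publishes a SHA-256 hex digest alongside the artifact, which this runbook does not specify; it does not cover signature verification, and the function names are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest against the published one.

    Assumption: expected_hex is a SHA-256 hex digest taken from the
    release notes for the pinned tag.
    """
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A mismatch should stop the roll-forward for that scope rather than trigger a retry against a different tag.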

Rotation, quarantine, decommission

Action	When	How	Effect
Rotate node token	suspected token exposure; periodic hygiene	portal node action	new `dpfedge_*`; old token invalid on next call
Re-issue bootstrap token	a generated install command leaked / expired	portal “issue token”	new one-use short-TTL `dpfboot_*`; old one already single-use
Quarantine	anomalous submissions; investigation	portal node action	node may heartbeat (stays visible) but discovery/metrics are rejected/diverted at the route layer
Revoke	decommission; compromise	portal node action	node clears state and exits on next heartbeat; evidence retained, scoped
Retire	row no longer needed	retention workflow	archived per retention class

The invariant is that quarantine is enforced at the route layer, not just shown as a label: a quarantined node cannot keep feeding the inventory graph while you investigate.
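A route-layer gate of this kind might look like the sketch below. The route names and the route_decision function are hypothetical, and the handling of pending and revoked nodes is an assumption made for illustration; this runbook only states the quarantined behavior.

```python
# Routes that carry data into the inventory graph (hypothetical names).
DATA_ROUTES = {"discovery", "metrics"}


def route_decision(trust_state: str, route: str) -> str:
    """Return "accept", "divert", or "reject" for one incoming request."""
    if trust_state == "trusted":
        return "accept"
    if trust_state == "quarantined":
        # Heartbeats still land so the node stays visible on the fleet view;
        # data submissions are kept away from the inventory graph.
        return "divert" if route in DATA_ROUTES else "accept"
    # Assumption: every other state is refused here. How pending and revoked
    # nodes are actually answered is not specified in this runbook.
    return "reject"
```

Putting the check in front of the handlers, rather than inside each one, is what makes the state effective for every data route at once.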

Health signals to watch (and alert on)

These come from the Authority-side observability surface, not from scraping remote nodes (see security & sovereignty and topology §8A.4):

Missed-heartbeat rate by scope — alert on the rate across a scope, not only per-node misses.
Ingest error / discovery payload rejection rate — rising rejections mean a bad collector, clock skew (freshness window), or a misconfigured node.
Ingest backlog / projection lag — the graph/reporting projection is async; a growing backlog is the early sign of fan-in pressure.
Metrics cardinality budget — per-interface/per-host labels explode Prometheus series; keep that detail in inventory tables instead, and alert when the budget is approached.
Version skew — share of the fleet off the current pinned release.
Stale pending nodes / tokens — enrolled-but-never-approved nodes and unused tokens are both cleanup and security signals.
Token reuse / submission attempts by quarantined nodes — security signals worth a dedicated alert.
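Two of the signals above, the missed-heartbeat rate by scope and version skew, reduce to simple ratios. The sketch below assumes the Authority can supply each node's scope, last heartbeat time, and version; the function names, the input shapes, and the grace factor are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def missed_heartbeat_rate(
    last_seen: Dict[str, Tuple[str, float]],
    now: float,
    interval_s: float,
    grace: float = 2.0,
) -> Dict[str, float]:
    """Return the share of nodes per scope whose heartbeat is overdue.

    last_seen maps node_id -> (scope, unix time of the last heartbeat).
    A node counts as missed once it is silent for interval_s * grace.
    """
    totals: Dict[str, int] = defaultdict(int)
    missed: Dict[str, int] = defaultdict(int)
    for scope, seen_at in last_seen.values():
        totals[scope] += 1
        if now - seen_at > interval_s * grace:
            missed[scope] += 1
    return {scope: missed[scope] / totals[scope] for scope in totals}


def version_skew(versions: List[str], pinned: str) -> float:
    """Return the share of the fleet that is off the pinned release."""
    if not versions:
        return 0.0
    return sum(1 for v in versions if v != pinned) / len(versions)
```

Alerting on the per-scope ratio, rather than on each silent node, separates one dead node from a site-wide or customer-wide outage.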

Fan-in controls (why the fleet stays healthy)

The Authority Core is the fan-in point by design. The controls that keep that design robust rather than fragile (topology §8A.3) are server-assigned heartbeat intervals with jitter, per-scope concurrency caps, payload size/rate caps, runKey idempotency, async projection to Neo4j/Qdrant/reporting, bounded offline queues with drop-oldest-by-class, and visible backlog gauges. Validate them with the synthetic fleet harness (100-node and 1,000-node profiles, VC-EDGE-SCALE) before a broad rollout.
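Three of these controls are small enough to sketch: heartbeat jitter, a bounded offline queue, and runKey idempotency. The code below is illustrative only. All class and function names are hypothetical, and it reads "drop-oldest-by-class" as each payload class having its own bound with the oldest item of that class dropped first, which is an assumption about the topology spec.

```python
import random
from collections import deque
from typing import Deque, Dict, List, Optional, Set


def jittered_interval(
    base_s: float, jitter_fraction: float = 0.2,
    rng: Optional[random.Random] = None,
) -> float:
    """Spread heartbeats by +/- jitter_fraction around the assigned interval."""
    rng = rng or random.Random()
    return base_s * (1 + rng.uniform(-jitter_fraction, jitter_fraction))


class OfflineQueue:
    """Bounded per-class queue; a full class drops its oldest item."""

    def __init__(self, capacity_by_class: Dict[str, int]) -> None:
        self._queues: Dict[str, Deque[object]] = {
            cls: deque(maxlen=cap) for cls, cap in capacity_by_class.items()
        }

    def push(self, cls: str, item: object) -> None:
        # deque(maxlen=...) discards from the left (oldest) when full.
        self._queues[cls].append(item)

    def drain(self, cls: str) -> List[object]:
        queue = self._queues[cls]
        items = list(queue)
        queue.clear()
        return items


class IdempotentIngest:
    """Accept each runKey once so a retried submission is not applied twice."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def submit(self, run_key: str) -> bool:
        if run_key in self._seen:
            return False  # duplicate: already persisted
        self._seen.add(run_key)
        return True
```

Bounding each class separately means a burst of metrics cannot evict queued discovery results while a node is offline. An in-memory set is enough to show the idea; a real ingest path would need the seen keys persisted and expired.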