DPF Edge Node — Fleet Operations

What this is. The operator runbook for running many edge nodes against one DPF Authority Core: rollout, versioning, rotation, quarantine, decommission, and the health signals that keep a fleet from overwhelming the portal. For where nodes run, see the deployment topology guide; for token/data handling, see security & sovereignty.

Design: topology spec §7, §8A.3, §11.2 · SysML: PART-EDGE-fleetops, SM-LIFECYCLE, VC-EDGE-SCALE

The fleet contract in one paragraph

Every edge node is an independently enrolled, scoped, trusted, and reaped client of one Authority Core. The Authority Core is the single pane of glass and the only system of record. Nodes observe and submit; the Authority authenticates, scopes, persists, decides, and presents. Everything below exists so that scaling the number of nodes does not turn the portal into a bottleneck or the edge into a second management platform.

Node operational lifecycle

A node’s operational lifecycle is wider than its trust state:

created → pending → trusted → degraded → quarantined → revoked → retired

Rollout (adding nodes at scale)

Add nodes through the portal flow, not by hand-cloning the repo (see topology §8). For a fleet:

Versioning and upgrades

Rotation, quarantine, decommission

| Action | When | How | Effect |
|---|---|---|---|
| Rotate node token | suspected token exposure; periodic hygiene | portal node action | new dpfedge_*; old token invalid on next call |
| Re-issue bootstrap token | a generated install command leaked / expired | portal “issue token” | new one-use short-TTL dpfboot_*; old one already single-use |
| Quarantine | anomalous submissions; investigation | portal node action | node may heartbeat (stays visible) but discovery/metrics are rejected/diverted at the route layer |
| Revoke | decommission; compromise | portal node action | node clears state and exits on next heartbeat; evidence retained, scoped |
| Retire | row no longer needed | retention workflow | archived per retention class |

The invariant is that quarantine takes effect at the route layer rather than being just a label: a quarantined node cannot keep feeding the inventory graph while you investigate.
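To make the route-layer behavior concrete, here is a minimal sketch of a per-request admission check keyed on node state. It is illustrative only, not the Authority Core's implementation: `NodeState`, `Route`, `decide`, and the decision strings are hypothetical names. Only the quarantined and revoked rows follow the table above; the handling of the other states is an assumption, and "reject" stands in for the documented "rejected/diverted" behavior.

```python
from enum import Enum


class NodeState(Enum):
    # Lifecycle states as listed in this runbook.
    CREATED = "created"
    PENDING = "pending"
    TRUSTED = "trusted"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    REVOKED = "revoked"
    RETIRED = "retired"


class Route(Enum):
    # Hypothetical route names for the three submission kinds mentioned above.
    HEARTBEAT = "heartbeat"
    DISCOVERY = "discovery"
    METRICS = "metrics"


def decide(state: NodeState, route: Route) -> str:
    """Return 'accept', 'reject', or 'exit' for one request."""
    if state is NodeState.QUARANTINED:
        # From the table: heartbeats keep the node visible,
        # discovery/metrics do not reach the inventory graph.
        return "accept" if route is Route.HEARTBEAT else "reject"
    if state is NodeState.REVOKED:
        # From the table: the node learns on its next heartbeat
        # that it must clear state and exit.
        return "exit" if route is Route.HEARTBEAT else "reject"
    if state in (NodeState.TRUSTED, NodeState.DEGRADED):
        # Assumption: degraded nodes may still submit everything.
        return "accept"
    # Assumption: created, pending, and retired nodes submit nothing.
    return "reject"


if __name__ == "__main__":
    for route in Route:
        print(route.value, "->", decide(NodeState.QUARANTINED, route))
```

Putting the check in front of every submission route, rather than filtering later in a projection step, is what makes quarantine effective immediately in this sketch.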

Health signals to watch (and alert on)

These come from the Authority-side observability surface, not from scraping remote nodes (see security & sovereignty and topology §8A.4):

Fan-in controls (why the fleet stays healthy)

The Authority Core is the fan-in point by design. The controls that keep this fan-in robust rather than fragile (topology §8A.3) are: server-assigned heartbeat intervals with jitter, per-scope concurrency caps, payload size/rate caps, runKey idempotency, async projection to Neo4j/Qdrant/reporting, bounded offline queues with drop-oldest-by-class, and visible backlog gauges. Validate them with the synthetic fleet harness (100 / 1,000-node profiles, VC-EDGE-SCALE) before a broad rollout.
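Two of these controls are easy to illustrate in isolation. The sketch below shows one plausible shape for a server-assigned heartbeat interval with jitter and for runKey idempotency. It is a sketch under stated assumptions, not the Authority Core's code: the function and class names, the 20% jitter fraction, and the in-memory dictionary are all hypothetical, and a real deployment would presumably persist run keys and scope them per node.

```python
import random
from typing import Any, Callable, Dict, Optional, Tuple


def assign_heartbeat_interval(
    base_seconds: float,
    jitter_fraction: float = 0.2,  # assumption: +/-20% spread
    rng: Optional[random.Random] = None,
) -> float:
    """Pick an interval uniformly within base * (1 +/- jitter_fraction).

    Spreading intervals keeps a large fleet from heartbeating in lockstep.
    """
    rng = rng or random.Random()
    low = base_seconds * (1.0 - jitter_fraction)
    high = base_seconds * (1.0 + jitter_fraction)
    return rng.uniform(low, high)


class RunKeyDeduplicator:
    """In-memory stand-in for runKey idempotency (hypothetical helper)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def submit(
        self, run_key: str, payload: Any, handler: Callable[[Any], Any]
    ) -> Tuple[Any, bool]:
        """Process payload once per run_key.

        Returns (result, processed_now). A retry with the same run_key
        gets the stored result and the handler is not called again.
        """
        if run_key in self._results:
            return self._results[run_key], False
        result = handler(payload)
        self._results[run_key] = result
        return result, True


if __name__ == "__main__":
    print(assign_heartbeat_interval(60.0))
    dedup = RunKeyDeduplicator()
    print(dedup.submit("run-1", {"hosts": 3}, lambda p: p["hosts"]))
    print(dedup.submit("run-1", {"hosts": 3}, lambda p: p["hosts"]))
```

With idempotent submissions, a node that retries after a timeout cannot double-count a run, which is what makes retries from a bounded offline queue safe to replay.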

See also