0005: OpenFeature SDK + DB-backed provider with cron circuit breaker
- Status: accepted
- Date: 2026-04-26
- Deciders: @kackey621, Willen Federation contributors
Context and Problem Statement
Several M3-M5 features ship behind controlled rollouts:
- OIDC / SAML providers (M4) — operators may want to enable Auth0 for staff before letting external partners in.
- Vendor-bundled release ZIP installer flow (M5): one wrong step changes the entire deploy story for shared-hosting users.
- Domain rewrites under the Strangler Fig migration — the new code path for "create item" can cohabit with the legacy path for a release before the legacy path is removed.
- Tracking experiments and gradual percentage rollouts as the system grows.
We need a runtime switching mechanism that:
- lets operators flip a feature without deploying code or restarting Apache;
- auto-disables a feature when its error rate crosses a configured threshold (so a bad rollout self-heals);
- exposes a clean API to call sites: typed flag keys, not hard-coded if (config.foo) branches; and
- works on shared hosting where cron may or may not be available.
Decision Drivers
- Operator-facing UI — flips happen from the admin console (M4), not by editing config files.
- Audit trail — every flip is logged with who/when/why. "We just turned off X to recover from an incident" must be reconstructable from the database.
- Self-healing on errors — a flag whose code path is throwing should disable itself before an operator notices, then alert.
- Cron-optional — shared hosting that lacks cron must still get circuit-breaker behaviour, even at reduced freshness.
- Standards alignment — flag evaluation should follow a published interface so we are not inventing a fourth flag-evaluation API on top of the three already in the wild.
- Compliance with ADR 0001: the SDK lives in Infrastructure/; call sites depend on a domain-level interface, not on the SDK.
Considered Options
Option A: Hand-rolled boolean column on system_setting
Reuse the M4 system_setting table. Each flag is a row; reads happen on every request through the existing repository.
- (+) Smallest possible footprint. Zero new dependencies.
- (−) No standard evaluation API. Each call site queries a setting by key — easy to misuse, hard to mock in tests.
- (−) No structured rollout (percentages, target groups) without bolting it on.
- (−) Audit log and circuit breaker would be hand-rolled too. Three new bespoke systems on a tiny base.
Option B: Third-party SaaS (LaunchDarkly, GrowthBook Cloud, …)
- (+) Mature dashboards, percentage rollouts, targeting rules out of the box.
- (−) Each operator deployment now has a runtime dependency on a third-party SaaS. SASO is self-hosted by intent.
- (−) Many target deployments are air-gapped or behind corporate proxies that block outbound traffic.
- (−) Cost.
Option C: OpenFeature PHP SDK + a SASO-owned DB provider
OpenFeature is an open standard for flag evaluation; the PHP SDK exposes a stable Client API (getBooleanValue, getStringValue, getIntegerValue, getFloatValue, getObjectValue). We implement a Saso\Infrastructure\FeatureFlag\DbProvider that reads from a feature_flag table; the SDK handles the surrounding ergonomics (variant resolution, error handling, hooks, evaluation context).
Self-healing is a separate concern: an error_log_aggregate table records errors per feature_key, a cron script (or, on cron-less hosts, a synchronous "tail-of-request" trigger) sweeps the aggregate and toggles feature_flag.enabled = 0 when a threshold is crossed, writing the reason to feature_flag_audit.
- (+) Standard API. Call sites depend on the OpenFeature client interface, not on our table schema.
- (+) The SDK gives us hooks (logging, metrics) without bespoke wiring.
- (+) Provider implementation stays small (a few hundred lines) — most complexity already lives in the SDK.
- (+) We control the storage, so we can add domain-specific columns (error_threshold, auto_disabled_at, auto_disable_reason) without forking anything.
- (−) New dependency. The OpenFeature PHP SDK is at 1.x but ecosystem maturity lags Java / Node.
Option D: OpenFeature SDK + an existing community provider (e.g. flagd, Unleash, GrowthBook OSS)
- (+) Reuses an external evaluation engine; gets percentage rollouts + audit "for free".
- (−) Adds a second deployable component (flagd / Unleash / GrowthBook server). Shared-hosting operators run a single PHP app; a second daemon is a non-starter.
- (−) Self-healing on application errors still has to be glued in — these engines do not know about our error catalogue.
Decision Outcome
Chosen option: C — OpenFeature PHP SDK + a SASO-owned DB provider with a cron circuit breaker.
Tables
-- The flag itself.
CREATE TABLE feature_flag (
    id                  BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    key_name            VARCHAR(120) NOT NULL UNIQUE,
    description         VARCHAR(500) NOT NULL,
    enabled             TINYINT(1) NOT NULL DEFAULT 0,
    rollout_percent     TINYINT UNSIGNED NOT NULL DEFAULT 0,
    conditions          JSON NULL,                        -- targeting rules
    error_threshold     INT UNSIGNED NOT NULL DEFAULT 0,  -- 0 = never auto-disable
    error_window_min    INT UNSIGNED NOT NULL DEFAULT 60,
    auto_disabled_at    DATETIME NULL,
    auto_disable_reason VARCHAR(500) NULL,
    created_at          DATETIME NOT NULL,
    updated_at          DATETIME NOT NULL
);

-- Per-flag error counts in time buckets, written from the global
-- exception handler (cf. ADR 0004) when ProblemExceptionHandler can
-- attribute the failure to a feature.
CREATE TABLE error_log_aggregate (
    id           BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    feature_key  VARCHAR(120) NOT NULL,
    error_code   VARCHAR(40) NOT NULL,
    count        INT UNSIGNED NOT NULL,
    window_start DATETIME NOT NULL,
    window_end   DATETIME NOT NULL,
    KEY idx_feature_window (feature_key, window_start)
);

-- Append-only audit of every flip, including circuit-breaker events.
CREATE TABLE feature_flag_audit (
    id          BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    flag_key    VARCHAR(120) NOT NULL,
    old_enabled TINYINT(1) NOT NULL,
    new_enabled TINYINT(1) NOT NULL,
    changed_by  VARCHAR(120) NOT NULL,  -- 'circuit_breaker' | member id
    changed_at  DATETIME NOT NULL,
    reason      VARCHAR(500) NULL,
    KEY idx_flag (flag_key, changed_at)
);
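As an illustration of how the exception handler might fold individual errors into error_log_aggregate, the following is a minimal sketch in Python, with SQLite standing in for the production database. The 5-minute bucket size, the record_error and bucket_bounds helpers, and the update-then-insert upsert are assumptions made for the example; the ADR does not specify them.

```python
import sqlite3
from datetime import datetime, timedelta

BUCKET_MINUTES = 5  # assumption: the ADR does not fix a bucket size


def bucket_bounds(ts: datetime, minutes: int = BUCKET_MINUTES):
    """Return (window_start, window_end) of the fixed bucket containing ts."""
    start = ts.replace(minute=ts.minute - ts.minute % minutes,
                       second=0, microsecond=0)
    return start, start + timedelta(minutes=minutes)


def record_error(db, feature_key: str, error_code: str, ts: datetime) -> None:
    """Increment the counter for (feature_key, error_code) in ts's bucket."""
    start, end = bucket_bounds(ts)
    start_s, end_s = start.isoformat(sep=" "), end.isoformat(sep=" ")
    cur = db.execute(
        "UPDATE error_log_aggregate SET count = count + 1 "
        "WHERE feature_key = ? AND error_code = ? AND window_start = ?",
        (feature_key, error_code, start_s),
    )
    if cur.rowcount == 0:  # no row for this bucket yet: create it
        db.execute(
            "INSERT INTO error_log_aggregate "
            "(feature_key, error_code, count, window_start, window_end) "
            "VALUES (?, ?, 1, ?, ?)",
            (feature_key, error_code, start_s, end_s),
        )
    db.commit()


def open_demo_db():
    """In-memory SQLite approximation of the error_log_aggregate table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE error_log_aggregate ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " feature_key TEXT NOT NULL, error_code TEXT NOT NULL,"
        " count INTEGER NOT NULL,"
        " window_start TEXT NOT NULL, window_end TEXT NOT NULL)"
    )
    return db
```

Under concurrent writers the update-then-insert pair can race; a production version would more likely rely on a unique key over (feature_key, error_code, window_start) and a single atomic upsert statement.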
Code shape
src/
├── Domain/
│   └── Feature/
│       ├── FeatureKey.php                  # value object
│       ├── FeatureFlag.php                 # aggregate
│       ├── EvaluationContext.php           # member + tenant + locale
│       └── Repository/
│           └── FeatureFlagRepository.php   # interface
└── Infrastructure/
    └── FeatureFlag/
        ├── DbProvider.php                  # OpenFeature\Provider implementation
        ├── PdoFeatureFlagRepository.php
        ├── ErrorAggregator.php             # writes error_log_aggregate
        └── CircuitBreaker.php              # invoked by cron + fallback
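To make the split between the domain interface and its infrastructure implementation concrete, here is a language-neutral sketch in Python of the Domain/Feature types. The key format (lowercase dotted segments), the find method name, and the in-memory repository are hypothetical; only the 120-character limit comes from the key_name column.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Protocol

# Hypothetical key format: lowercase segments separated by dots.
_KEY_PATTERN = re.compile(r"[a-z0-9_]+(\.[a-z0-9_]+)*")


@dataclass(frozen=True)
class FeatureKey:
    """Value object: a validated flag key (max 120 chars, as in key_name)."""
    value: str

    def __post_init__(self) -> None:
        if len(self.value) > 120 or not _KEY_PATTERN.fullmatch(self.value):
            raise ValueError(f"invalid feature key: {self.value!r}")


@dataclass(frozen=True)
class FeatureFlag:
    """Aggregate: only the fields needed for this sketch."""
    key: FeatureKey
    enabled: bool
    rollout_percent: int = 0


class FeatureFlagRepository(Protocol):
    """Domain-level interface; the PDO-backed class would implement it."""

    def find(self, key: FeatureKey) -> Optional[FeatureFlag]: ...


class InMemoryFeatureFlagRepository:
    """Test double that satisfies FeatureFlagRepository without a database."""

    def __init__(self, flags: Iterable[FeatureFlag] = ()) -> None:
        self._flags: Dict[FeatureKey, FeatureFlag] = {f.key: f for f in flags}

    def find(self, key: FeatureKey) -> Optional[FeatureFlag]:
        return self._flags.get(key)
```

Because FeatureKey rejects malformed input at construction time, a typo such as a key containing spaces fails fast instead of silently evaluating to the default.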
Circuit breaker: cron + fallback
- scripts/feature_flag_circuit_breaker.php runs every 5 minutes from cron. It scans error_log_aggregate against each flag's error_threshold over the configured error_window_min, flips offending flags off, and writes the audit row.
- For cron-less hosts (some shared-hosting deployments), a tail-of-request middleware checks "have we run the breaker in the last 60 minutes?" and if not, triggers it synchronously after the response is flushed. This is best-effort: a low-traffic instance may go longer without a check, but the breaker still runs eventually.
- Operators can disable the synchronous fallback via FEATURE_FLAG_INLINE_BREAKER=0 in .env once they confirm cron is wired.
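The sweep and the cron-less fallback check described above can be sketched as follows, in Python against any DB-API connection whose tables match the schema in this ADR. The function names, the reason string, and the choice to trip when the error count reaches (rather than strictly exceeds) the threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def run_breaker(db, now: datetime) -> List[str]:
    """Disable each enabled flag whose recent error count meets its threshold.

    Returns the keys of the flags that were flipped off.
    """
    now_s = now.isoformat(sep=" ")
    flags = db.execute(
        "SELECT key_name, error_threshold, error_window_min FROM feature_flag "
        "WHERE enabled = 1 AND error_threshold > 0"  # 0 = never auto-disable
    ).fetchall()
    tripped = []
    for key, threshold, window_min in flags:
        since = (now - timedelta(minutes=window_min)).isoformat(sep=" ")
        (errors,) = db.execute(
            "SELECT COALESCE(SUM(count), 0) FROM error_log_aggregate "
            "WHERE feature_key = ? AND window_start >= ?",
            (key, since),
        ).fetchone()
        if errors < threshold:
            continue
        reason = f"{errors} errors in last {window_min} min (threshold {threshold})"
        db.execute(
            "UPDATE feature_flag SET enabled = 0, auto_disabled_at = ?, "
            "auto_disable_reason = ?, updated_at = ? WHERE key_name = ?",
            (now_s, reason, now_s, key),
        )
        # Audit row so the flip is reconstructable later.
        db.execute(
            "INSERT INTO feature_flag_audit "
            "(flag_key, old_enabled, new_enabled, changed_by, changed_at, reason) "
            "VALUES (?, 1, 0, 'circuit_breaker', ?, ?)",
            (key, now_s, reason),
        )
        tripped.append(key)
    db.commit()
    return tripped


def inline_breaker_due(last_run: Optional[datetime], now: datetime,
                       max_age_min: int = 60,
                       inline_enabled: bool = True) -> bool:
    """Tail-of-request check: True if the fallback should run the breaker now."""
    if not inline_enabled:  # mirrors FEATURE_FLAG_INLINE_BREAKER=0
        return False
    return last_run is None or now - last_run >= timedelta(minutes=max_age_min)
```

Where last_run is persisted (a settings row, a lock file) is left open here; the sketch only shows the decision, not the storage.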
Evaluation pipeline
Application code
  └─> OpenFeature\Client::getBooleanValue('checkout.new_flow', false, ctx)
        └─> Saso\Infrastructure\FeatureFlag\DbProvider::resolveBooleanValue(...)
              ├─> reads from PdoFeatureFlagRepository (request-scoped cache)
              └─> applies rollout_percent + conditions against ctx
Call sites depend on the OpenFeature Client, not on our provider. Tests inject the SDK's InMemoryProvider.
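One way the provider could apply rollout_percent is deterministic bucketing, sketched below in Python. The SHA-256 hash, the use of a member identifier as the targeting key, and the rule that a disabled flag resolves to False are assumptions; the ADR does not prescribe a bucketing scheme, and conditions are left out of the sketch.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flag:
    key_name: str
    enabled: bool
    rollout_percent: int  # 0..100


def bucket(flag_key: str, targeting_key: str) -> int:
    """Map (flag, subject) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{targeting_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def resolve_boolean(flag: Optional[Flag], targeting_key: str, default: bool) -> bool:
    """Resolve a boolean flag for one subject."""
    if flag is None:          # unregistered flag: fall back to the caller's default
        return default
    if not flag.enabled:      # switched off by an operator or the circuit breaker
        return False
    return bucket(flag.key_name, targeting_key) < flag.rollout_percent
```

Hashing the flag key together with the targeting key keeps each subject's assignment stable across requests while preventing the same subjects from always landing in the first cohort of every flag.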
Observability
- Every evaluation that returns the default value (provider couldn't resolve the flag) emits a Monolog warning with the flag key. Misspelled or unregistered flags surface in the log instead of silently using the default.
- Circuit-breaker activations emit a structured Monolog error and (in M4) a Slack/email notification.
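The default-value warning can be illustrated with a small sketch, using Python's standard logging module in place of Monolog. The get_boolean_value wrapper, the logger name, and the plain dict of resolved flags are hypothetical stand-ins; in the real system this would more plausibly be an OpenFeature hook.

```python
import logging
from typing import Mapping

logger = logging.getLogger("feature_flag")  # hypothetical channel name


def get_boolean_value(flags: Mapping[str, bool], key: str, default: bool) -> bool:
    """Return the flag value, warning when the default has to be used."""
    if key not in flags:
        # Structured field so log tooling can filter by flag key.
        logger.warning("feature flag not resolved, using default",
                       extra={"flag_key": key})
        return default
    return flags[key]
```

A misspelled key such as "chekout.new_flow" then produces one warning per evaluation carrying the offending key, which is what makes the typo findable in the log.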
Consequences
- Application code learns one API (OpenFeature Client) and never touches the database directly.
- Operators get a single admin screen (M4) to flip flags; the audit row makes accidental flips reviewable.
- Bad rollouts auto-disable, and the audit log records the breaker event so post-mortems do not start with "wait, who turned this off?".
- Self-healing works on cron-equipped and cron-less hosts alike. We accept that cron-less hosts have weaker freshness guarantees.
- Adopting the OpenFeature standard means future migration to a different provider (flagd, GrowthBook OSS) is a wiring change, not a domain change.
- The feature_flag + error_log_aggregate + feature_flag_audit tables add ~40 KB per 10 K aggregate rows. We accept this as a cost of operability.