ADR-0022: Learning Loops via Outcome Labeling

Status

proposed

Context

An agent system without a feedback loop does not learn. Problems:

  • Agents' confident claims are not verified against real outcomes
  • Confidence calibration drifts over time
  • Success / failure patterns are not reused
  • Quality metrics (per Manifesto: task success rate ≥85%) can only be computed if outcomes are labeled
  • Agent-Judge performs per-task quality checks, but the post-hoc outcome loop is missing

Existing principles and infrastructure:

  • Manifesto principle #3 (observability), #6 (separate judge)
  • Agent-Judge for per-task quality
  • Decision Log: ADR-0019 (cross-model validation pattern, Judge-as-evaluator)
  • Observability — metrics pipeline

We need to formalize continuous learning: outcomes → retro → tuning → improved guardrails.

Decision

Adopt a 4-stage learning loop:

  1. Outcome labeling — every scenario / task receives an explicit outcome (success / partial / fail / killed) with reasoning
  2. Retro analysis — Agent-Judge or Chamber-for-Strategic analyzes the correlation between agent decisions and outcomes
  3. Threshold / prompt tuning — identified weak points → updates to agent manifests
  4. New playbook / guardrail rules — emerging patterns → new rules in Rules-* or event subscriptions

Cadence:

  • Outcome labeling: real-time at scenario completion
  • Retro analysis: weekly batch (Judge) + ad-hoc (Chamber for significant cases)
  • Threshold tuning: monthly review
  • Playbook rules: quarterly (via ADR if material)

Minimum Viable Loop (MVL) starter:

  • 1-click outcome input via HITL-Gateway
  • Weekly Judge batch on a subset of scenarios
  • Quarterly tuning cycle

Alternatives Considered

Option A: No formal loop, continuous manual tuning

The Founder tunes agents ad hoc based on observed issues. No systematic retros.

  • Pros:
    • Zero overhead
    • Full Founder control
  • Cons:
    • Does not scale to 30-50 agents
    • Patterns are missed (systematic issues are invisible when handled one at a time)
    • Confidence calibration drift goes undetected
    • Dependent on Founder bandwidth

Option B: Per-agent self-learning (agents update own prompts)

Each agent analyzes own outcomes, self-tunes.

  • Pros:
    • Distributed, self-organizing
    • No central bottleneck
  • Cons:
    • Violates Manifesto principle #6 (separate judge, not self-critique)
    • Risk of agent drift / Goodhart's law (optimizing what is measured, not what is desired)
    • Hard to audit (which agent changed what, and when)
    • Loss of overview of cross-agent patterns

Option C: 4-stage loop with MVL ← chosen

Outcome labeling → retro → tuning → playbook rules. Minimum viable setup now, richer setup as the system grows.

  • Pros:
    • Aligned with the Manifesto principles of observability + separate judge
    • Scales to 30-50 agents
    • Clear audit trail (which changes were made when, based on which data)
    • Can start with MVL now, enrich as needed
    • Reuses existing infrastructure (Agent-Judge, HITL-Gateway, Observability)
  • Cons:
    • Requires discipline in outcome labeling (a human touches every scenario)
    • Weekly Judge retros = ongoing cost
    • Changes to manifests require a version control / ADR process
    • Risk of over-engineering measurement
  • Why chosen: the MVL approach allows starting simple and enriching incrementally. Option A does not scale; Option B violates a foundational principle.

Consequences

Positive:

  • Manifesto metrics (task success rate ≥85%, QoQ cost decline) measurable
  • Confidence calibration tracked, drift corrected
  • Systematic patterns vs one-off issues distinguishable
  • Agent manifest changes auditable (what / when / why / outcome)
  • New guardrails grounded in data, not speculation
  • Scales to the target of 30-50 agents

Negative / Trade-offs:

  • Outcome labeling overhead for the Founder / Directors
  • Weekly Judge retro batch adds ongoing cost
  • Manifest version control discipline required
  • Risk of measuring proxies (Goodhart’s law)
  • Slow feedback (weekly retros mean a 1-2 week lag to insights)

Mitigations:

  • 1-click outcome labeling in HITL-Gateway (low friction)
  • Batched labeling if the Founder is overloaded (weekly review)
  • Judge prompt emphasizes “does this pattern generalize?” (prevents over-fitting)
  • A pattern must appear in 3+ scenarios before triggering a playbook rule change
  • Chamber review for material changes (prevents rubber-stamping)
  • Success rate monitoring tracks trends, not single-point metrics

Follow-ups

  • Create Process-OutcomeLabeling — mechanics
  • Update Observability — learning loop metrics
  • Update Agent-Judge — retro batch mode spec
  • HITL-Gateway 1-click labeling UI (depends on Gateway implementation)
  • Weekly retro cadence owner — Agent-CEO or a dedicated Judge agent?
  • Threshold tuning governance — who can change manifest values without an ADR?
  • Outcome taxonomy — are success / partial / fail / killed sufficient categories?

References