ADR-0022: Learning Loops via Outcome Labeling
Status
proposed
Context
An agent system without a feedback loop does not learn. Problems:
- Agents' confident claims are not verified against real outcomes
- Confidence calibration drifts over time
- Success / failure patterns are not reused
- Quality metrics (per Manifesto: task success rate ≥85%) can only be computed if outcomes are labeled
- Agent-Judge performs per-task quality checks, but a post-hoc outcome loop is missing
Existing principles and infrastructure:
- Manifesto principles #3 (observability) and #6 (separate judge)
- Agent-Judge for per-task quality
- Decision Log: ADR-0019 (cross-model validation pattern, Judge-as-evaluator)
- Observability — metrics pipeline
We need to formalize continuous learning: outcomes → retro → tuning → improved guardrails.
Decision
Adopt a 4-stage learning loop:
- Outcome labeling — every scenario / task receives an explicit outcome (success / partial / fail / killed) with reasoning
- Retro analysis — Agent-Judge or Chamber-for-Strategic analyzes the correlation between agent decisions and outcomes
- Threshold / prompt tuning — identified weak points → updates to agent manifests
- New playbook / guardrail rules — emerging patterns → new rules in Rules-* or event subscriptions
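The stage-1 outcome label could be captured as a simple record. A minimal sketch: the taxonomy values come from this ADR, but the field names (`scenario_id`, `reasoning`, `labeled_by`) are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Outcome(Enum):
    """The four-value taxonomy from the decision above."""
    SUCCESS = "success"
    PARTIAL = "partial"
    FAIL = "fail"
    KILLED = "killed"

@dataclass
class OutcomeLabel:
    """One stage-1 label: an explicit outcome plus reasoning for a scenario."""
    scenario_id: str
    outcome: Outcome
    reasoning: str   # required: why this outcome was assigned
    labeled_by: str  # e.g. the Founder, via HITL-Gateway 1-click input
    labeled_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

label = OutcomeLabel("scn-042", Outcome.PARTIAL,
                     "Shipped, but missed the latency budget", "founder")
```

Requiring `reasoning` at label time is what makes the weekly retro batch analyzable, rather than a bare pass/fail count.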
Cadence:
- Outcome labeling: real-time at scenario completion
- Retro analysis: weekly batch (Judge) + ad-hoc (Chamber for significant cases)
- Threshold tuning: monthly review
- Playbook rules: quarterly (via ADR if material)
Minimum Viable Loop (MVL) starter:
- 1-click outcome input via HITL-Gateway
- Weekly Judge batch over a subset of scenarios
- Quarterly tuning cycle
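To illustrate the weekly Judge batch, the Manifesto's task success rate can be computed directly from a week's labels. A minimal sketch; how `partial` outcomes are weighted is an assumption, not something this ADR fixes:

```python
from collections import Counter

def weekly_success_rate(labels: list[str]) -> float:
    """Task success rate over one weekly batch of outcome labels.

    Assumption (not fixed by this ADR): only "success" counts toward
    the >=85% target; "partial", "fail", and "killed" count against it.
    """
    if not labels:
        return 0.0
    return Counter(labels)["success"] / len(labels)

# One week's labeled outcomes from the HITL-Gateway 1-click input:
batch = ["success", "success", "partial", "success", "fail", "success"]
print(f"{weekly_success_rate(batch):.0%}")
```

A week like the one above sits below the ≥85% target, which is exactly the signal the retro stage should pick up as a trend rather than react to from a single batch.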
Alternatives Considered
Option A: No formal loop, continuous manual tuning
The Founder tunes agents ad hoc based on observed issues. No systematic retros.
- Pros:
- Zero overhead
- Full Founder control
- Cons:
- Does not scale to 30-50 agents
- Patterns missed (systematic issues are invisible one at a time)
- Confidence calibration drift goes undetected
- Dependent on Founder bandwidth
Option B: Per-agent self-learning (agents update own prompts)
Each agent analyzes its own outcomes and self-tunes.
- Pros:
- Distributed, self-organizing
- No central bottleneck
- Cons:
- Violates Manifesto principle #6 (separate judge, not self-critique)
- Risk of agent drift / Goodhart's law (optimizing what is measured, not what is desired)
- Hard to audit (which agent changed what, and when)
- Loss of visibility into cross-agent patterns
Option C: 4-stage loop with MVL ← chosen
Outcome labeling → retro → tuning → playbook rules. Minimum viable setup now, richer setup as the system grows.
- Pros:
- Aligned with the Manifesto principles of observability + separate judge
- Scales to 30-50 agents
- Clear audit trail (which changes were made when, based on which data)
- Can start with the MVL now and enrich as needed
- Reuses existing infrastructure (Agent-Judge, HITL-Gateway, Observability)
- Cons:
- Requires discipline in outcome labeling (a human touches every scenario)
- Weekly Judge retros = ongoing cost
- Changes to manifests require version control / an ADR process
- Risk of over-engineering measurement
- Why chosen: the MVL approach lets us start simple and enrich incrementally. Option A does not scale, and Option B violates a foundational principle.
Consequences
Positive:
- Manifesto metrics (task success rate ≥85%, QoQ cost decline) measurable
- Confidence calibration tracked, drift corrected
- Systematic patterns vs one-off issues distinguishable
- Agent manifest changes auditable (what / when / why / outcome)
- New guardrails grounded in data, not speculation
- Scales to the target of 30-50 agents
Negative / Trade-offs:
- Outcome labeling overhead for Founder / Directors
- Weekly Judge retro batch adds ongoing cost
- Manifest version control discipline required
- Risk of measuring proxies (Goodhart’s law)
- Slow feedback (weekly retros mean 1-2 week lag to insights)
Mitigations:
- 1-click outcome labeling in HITL-Gateway (low friction)
- Batched labeling if Founder overloaded (weekly review)
- Judge prompt emphasizes “does this pattern generalize?” (prevent over-fitting)
- A pattern must appear in 3+ scenarios before triggering a playbook rule change
- Chamber review for material changes (prevents rubber-stamping)
- Success rate monitoring tracks trends, not single-point metrics
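The "3+ scenarios" mitigation could be enforced mechanically when the weekly retro aggregates its findings. A hypothetical sketch; the finding shape (`pattern`, `scenario_id`) and function name are assumptions for illustration:

```python
from collections import defaultdict

MIN_SCENARIOS = 3  # a pattern must recur before it may change a playbook rule

def rule_change_candidates(retro_findings: list[dict]) -> list[str]:
    """Group retro findings by pattern tag; return only patterns observed
    in at least MIN_SCENARIOS distinct scenarios."""
    scenarios_by_pattern: dict[str, set[str]] = defaultdict(set)
    for finding in retro_findings:
        scenarios_by_pattern[finding["pattern"]].add(finding["scenario_id"])
    return [pattern for pattern, scenarios in scenarios_by_pattern.items()
            if len(scenarios) >= MIN_SCENARIOS]

findings = [
    {"pattern": "overconfident-estimate", "scenario_id": "scn-01"},
    {"pattern": "overconfident-estimate", "scenario_id": "scn-07"},
    {"pattern": "overconfident-estimate", "scenario_id": "scn-12"},
    {"pattern": "missing-rollback", "scenario_id": "scn-07"},
]
print(rule_change_candidates(findings))  # only the 3-scenario pattern qualifies
```

Counting distinct scenarios rather than raw findings prevents one heavily annotated scenario from triggering a rule change on its own.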
Follow-ups
- Create Process-OutcomeLabeling — mechanics
- Update Observability — learning loop metrics
- Update Agent-Judge — retro batch mode spec
- HITL-Gateway 1-click labeling UI (depends on Gateway implementation)
- Weekly retro cadence owner — Agent-CEO or a dedicated Judge agent?
- Threshold tuning governance — who can change manifest values without an ADR?
- Outcome taxonomy — are success / partial / fail / killed sufficient categories?
References
- Manifesto — principles #3 (observability), #6 (separate judge), metrics targets
- ADR-0003-Three-Tier-Hierarchy — Judge role
- ADR-0020-reference-architecture-blueprint — learning loops in §8
- ADR-0021-criticality-levels — criticality impact on labeling priority
- Agent-Judge — primary retro analyzer
- Chamber-for-Strategic — escalated retros for L4
- Observability — metrics infrastructure
- HITL-Gateway — labeling entry point
- Reference-Org-Blueprint — section §8 Observability & Learning Loops
- Process-OutcomeLabeling (created alongside this ADR)
- Decision Log: ADR-0019 Cross-Model Validation (precedent pattern)
- Build-Measure-Learn