ADR-0022: Learning Loops via Outcome Labeling

Status

proposed

Context

An agent system without a feedback loop does not learn. Problems:

  • Agents' confident claims are not verified against real outcomes
  • Confidence calibration drifts over time
  • Success / failure patterns are not reused
  • Quality metrics (per Manifesto: task success rate ≥85%) can only be computed if outcomes are labeled
  • Agent-Judge performs per-task quality checks, but the post-hoc outcome loop is missing

Existing principles and infrastructure:

  • Manifesto principle #3 (observability), #6 (separate judge)
  • Agent-Judge for per-task quality
  • Decision Log: ADR-0019 (cross-model validation pattern, Judge-as-evaluator)
  • Observability — metrics pipeline

We need to formalize continuous learning: outcomes → retro → tuning → improved guardrails.

Decision

Adopt a 4-stage learning loop:

  1. Outcome labeling — every scenario / task receives an explicit outcome (success / partial / fail / killed) with reasoning
  2. Retro analysis — Agent-Judge or Chamber-for-Strategic analyzes the correlation between agent decisions and outcomes
  3. Threshold / prompt tuning — identified weak points → updates to agent manifests
  4. New playbook / guardrail rules — emerging patterns → new rules in Rules-* or event subscriptions

Cadence:

  • Outcome labeling: real-time at scenario completion
  • Retro analysis: weekly batch (Judge) + ad-hoc (Chamber for significant cases)
  • Threshold tuning: monthly review
  • Playbook rules: quarterly (via ADR if material)

Minimum Viable Loop (MVL) starter:

  • 1-click outcome input via HITL-Gateway
  • Weekly Judge batch on a subset of scenarios
  • Quarterly tuning cycle

Alternatives Considered

Option A: No formal loop, continuous manual tuning

The Founder tunes agents ad hoc based on observed issues. No systematic retros.

  • Pros:
    • Zero overhead
    • Full Founder control
  • Cons:
    • Does not scale to 30-50 agents
    • Patterns are missed (systematic issues are invisible when handled one at a time)
    • Confidence calibration drift goes undetected
    • Dependent on Founder bandwidth

Option B: Per-agent self-learning (agents update own prompts)

Each agent analyzes own outcomes, self-tunes.

  • Pros:
    • Distributed, self-organizing
    • No central bottleneck
  • Cons:
    • Violates Manifesto principle #6 (separate judge, not self-critique)
    • Risk of agent drift / Goodhart's law (optimizing what is measured, not what is desired)
    • Hard to audit (which agent changed what, and when)
    • Loss of overview of cross-agent patterns

Option C: 4-stage loop with MVL ← chosen

Outcome labeling → retro → tuning → playbook rules. Minimum viable setup now, richer setup as the system grows.

  • Pros:
    • Aligned with the Manifesto principles of observability + separate judge
    • Scales to 30-50 agents
    • Clear audit trail (which changes were made when, based on which data)
    • Can start with MVL now, enrich as needed
    • Reuses existing infrastructure (Agent-Judge, HITL-Gateway, Observability)
  • Cons:
    • Requires discipline in outcome labeling (a human touches every scenario)
    • Weekly Judge retros = ongoing cost
    • Changes to manifests require a version control / ADR process
    • Risk of over-engineering measurement
  • Why chosen: the MVL approach allows starting simple and enriching incrementally. Option A does not scale; Option B violates a foundational principle.

Consequences

Positive:

  • Manifesto metrics (task success rate ≥85%, QoQ cost decline) measurable
  • Confidence calibration tracked, drift corrected
  • Systematic patterns vs one-off issues distinguishable
  • Agent manifest changes auditable (what / when / why / outcome)
  • New guardrails grounded in data, not speculation
  • Scales to the target of 30-50 agents

Negative / Trade-offs:

  • Outcome labeling overhead for the Founder / Directors
  • Weekly Judge retro batch adds ongoing cost
  • Manifest version control discipline required
  • Risk of measuring proxies (Goodhart’s law)
  • Slow feedback (weekly retros mean a 1-2 week lag to insights)

Mitigations:

  • 1-click outcome labeling in HITL-Gateway (low friction)
  • Batched labeling if the Founder is overloaded (weekly review)
  • Judge prompt emphasizes “does this pattern generalize?” (prevents over-fitting)
  • A pattern must appear in 3+ scenarios before triggering a playbook rule change
  • Chamber review for material changes (prevents rubber-stamping)
  • Success rate monitoring tracks trends, not single-point metrics

Follow-ups

  • Create Process-OutcomeLabeling — mechanics
  • Update Observability — learning loop metrics
  • Update Agent-Judge — retro batch mode spec
  • HITL-Gateway 1-click labeling UI (depends on Gateway implementation)
  • Weekly retro cadence owner — Agent-CEO or a dedicated Judge agent?
  • Threshold tuning governance — who can change manifest values without an ADR?
  • Outcome taxonomy — are success / partial / fail / killed sufficient categories?

References