Process — Outcome Labeling

Mechanics of the continuous learning loops per ADR-0022-learning-loops: from outcome capture to manifest tuning and new playbook rules.

Overview

4 stages:

  1. Outcome capture — real-time, at scenario completion
  2. Retro analysis — weekly batch (Agent-Judge) + ad-hoc (Chamber-for-Strategic)
  3. Threshold tuning — monthly
  4. Playbook rule emergence — quarterly, via ADR if material

Stage 1: Outcome Capture

When

A scenario or significant task completes. Triggers:

  • scenario.completed event
  • task.completed with criticality ≥ L2
  • Human-initiated review (“label this”)

How

One click via HITL-Gateway, or inline in the response to the Founder.

Outcome taxonomy:

  • success — achieved intended outcome
  • partial_success — some outcomes, not complete
  • neutral — completed with neither a positive nor a negative outcome
  • failure — did not achieve the intended outcome
  • killed — aborted intentionally (e.g. hypothesis killed per Build-Measure-Learn)

Required fields:

scenario_id: uuid
trace_id: uuid
outcome: success | partial_success | neutral | failure | killed
outcome_reason: "..."                # 1-2 sentences why
labeled_by: "founder" | "director" | "agent-judge-retro"
labeled_at: iso8601
cost_actual_usd: X.XX                # pre-filled from trace
duration_actual: X hours
confidence_predicted_avg: 0.XX       # from agent outputs
confidence_actual: derived from outcome
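The "derived from outcome" field can be sketched as a simple mapping from outcome label to a realized-confidence score; the numeric values below are an assumption for illustration, not prescribed by this process:

```python
# Assumed mapping from outcome label to the realized-confidence score used in
# calibration. Values are illustrative; "killed" is excluded (an intentional
# kill says nothing about whether the agent's confidence was well placed).
CONFIDENCE_ACTUAL = {
    "success": 1.0,
    "partial_success": 0.5,
    "neutral": 0.5,
    "failure": 0.0,
    "killed": None,
}

def confidence_actual(outcome: str):
    """Derive confidence_actual from the labeled outcome (hypothetical helper)."""
    return CONFIDENCE_ACTUAL[outcome]
```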

Low-friction rule

Labeling must not become a tax on the Founder. Rules:

  • Default autocomplete — agent predicts outcome, human confirms (yes/no/modify)
  • Batch UI in the Gateway — 10 scenarios per session, not per-scenario interruptions
  • Optional — L1 tasks are auto-labeled as success if there were no errors
  • Mandatory — L3+ always gets an explicit human label
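The three rules above form a simple routing decision. A minimal sketch, assuming a numeric criticality level (1 = L1) and an error flag from the trace; the function name and return values are illustrative, not a real Gateway API:

```python
def labeling_mode(criticality: int, had_errors: bool) -> str:
    """Decide how a completed scenario gets labeled, per the low-friction rules.

    Hypothetical helper: criticality 1 maps to L1, 3 to L3, etc.
    """
    if criticality >= 3:
        # Mandatory: L3+ always gets an explicit human label.
        return "human_explicit"
    if criticality == 1 and not had_errors:
        # Optional: clean L1 tasks are auto-labeled as success.
        return "auto_success"
    # Default autocomplete: agent predicts, human confirms (yes/no/modify).
    return "agent_predicts_human_confirms"
```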

Persistence

Outcomes → Memory-Model structured partition (not vector RAG). Searchable by: agent involved, criticality, outcome type, time range.

Stage 2: Retro Analysis

Weekly batch (Agent-Judge)

Scheduled: every Monday morning.

Input: past week’s outcomes (typically 20-100, depending on scale).

Analysis:

  1. Confidence calibration — agents predicted 85% success; what was the actual rate?
  2. Failure clustering — patterns in failures (same agent? same category?)
  3. Criticality correctness — were there L2 actions retro-labeled as having L3 impact?
  4. Speed / cost trends — QoQ direction (per Manifesto metric)
  5. Handoff gap detection — scenario stages с abnormal latency
  6. New pattern flags — issues not matching existing Process-Failure-Patterns
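The calibration step (item 1) reduces to comparing mean predicted confidence against the realized success rate per agent. A sketch, assuming each outcome record carries the Stage 1 fields plus an `agent` key (that input shape is an assumption):

```python
from collections import defaultdict

def calibration_report(outcomes):
    """Per-agent calibration error: |mean predicted confidence - actual success rate|.

    `outcomes` is a list of dicts with keys "agent", "confidence_predicted_avg",
    and "outcome" — an assumed shape for illustration.
    """
    by_agent = defaultdict(list)
    for o in outcomes:
        by_agent[o["agent"]].append(o)

    report = {}
    for agent, rows in by_agent.items():
        predicted = sum(r["confidence_predicted_avg"] for r in rows) / len(rows)
        actual = sum(1 for r in rows if r["outcome"] == "success") / len(rows)
        report[agent] = round(abs(predicted - actual), 3)
    return report
```

A large gap (e.g. predicted 0.85, actual 0.50) makes the agent a tuning candidate for Stage 3.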

Output:

  • Calibration report — per-agent, per-criticality level
  • Failure cluster summary — top 3-5 patterns
  • Recommended tuning candidates (for Stage 3)
  • Escalation candidates for the Chamber (material issues)

The report is emitted as a retro.completed event and stored in Memory-Model.

Ad-hoc retros (Chamber-for-Strategic)

Triggered for:

  • Significant failure (e.g. a Manifesto SLA missed repeatedly)
  • Unexpected success worth understanding
  • Systematic pattern flagged by weekly Judge retro
  • Founder explicit request

Chamber retros go deeper — multi-model analysis, cross-referencing historical outcomes, recommending structural changes.

Output feeds Stage 4 (playbook rules).

Stage 3: Threshold Tuning

Cadence

Monthly review. Owner: Agent-CEO + Founder.

Input

  • Weekly retro reports (4 per month)
  • Calibration trends per agent
  • Observability dashboards

Tuning candidates

  • Confidence thresholds per Rules-Criticality — if an agent at 0.85 actually succeeds only 70% of the time, raise the threshold
  • Max autonomous criticality in agent manifests — expand if the agent proves reliable, tighten if it proves unreliable
  • SLA defaults — if there is a pattern of breaches, either adjust the SLA or address the root cause
  • Cost budgets — for scenarios routinely exceeding them, adjust the budget or address the cost drivers
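The first candidate (over-confident agents) can be flagged mechanically from the weekly calibration data. A sketch, assuming per-agent (predicted, actual) pairs and an illustrative 0.10 over-confidence gap (the gap value is an assumption, not a rule from this doc):

```python
def tuning_candidates(calibration, overconfidence_gap=0.10):
    """Flag agents whose actual success rate trails their predicted confidence
    by more than `overconfidence_gap` — candidates for a raised threshold.

    `calibration` maps agent -> (mean predicted confidence, actual success rate);
    shape and gap are illustrative assumptions.
    """
    flags = []
    for agent, (predicted, actual) in calibration.items():
        if predicted - actual > overconfidence_gap:
            flags.append((agent, f"raise threshold: predicted {predicted:.2f}, actual {actual:.2f}"))
    return flags
```

For the example in the bullet above — an agent at 0.85 succeeding 70% of the time — the 0.15 gap exceeds 0.10 and the agent is flagged.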

Change management

Minor changes (threshold adjustment ≤ 10%) — Director + Agent-CEO approval, logged in the Decision Log without an ADR.

Material changes (significant threshold shifts, new constraints) — ADR required.

Any change to an agent manifest requires a version bump; the orchestrator continues in-flight scenarios on the old version (Reference-Org-Blueprint §7, non-trivial choice #1).
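The version-pinning behavior above can be sketched minimally: in-flight scenarios keep the manifest version they started with, while new scenarios pick up the bumped one. Class and method names are illustrative, not the real orchestrator API:

```python
class Orchestrator:
    """Sketch of manifest version pinning on change (hypothetical names)."""

    def __init__(self):
        self.manifest_version = 1
        self.pins = {}  # scenario_id -> manifest version pinned at start

    def start_scenario(self, scenario_id: str) -> None:
        # Each scenario is pinned to the manifest version current at launch.
        self.pins[scenario_id] = self.manifest_version

    def bump_manifest(self) -> None:
        # A manifest change bumps the version; existing pins are untouched,
        # so in-flight scenarios keep running on the old version.
        self.manifest_version += 1

orch = Orchestrator()
orch.start_scenario("s1")   # runs on v1
orch.bump_manifest()        # tuning change lands
orch.start_scenario("s2")   # runs on v2; s1 still pinned to v1
```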

Stage 4: Playbook Rule Emergence

Cadence

Quarterly. Owner: Agent-CEO + Founder.

Input

  • All Stage 2 outputs for the quarter
  • Ad-hoc Chamber retros
  • Stage 3 tuning outcomes

Pattern threshold

A rule change is considered if a pattern appeared independently in 3+ scenarios. This prevents over-fitting to a single incident.
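The 3-scenario threshold is a count over distinct scenarios, not raw sightings. A minimal sketch, assuming pattern sightings arrive as (pattern_id, scenario_id) pairs (an assumed shape):

```python
from collections import defaultdict

def emergent_rules(pattern_sightings, min_scenarios=3):
    """Return patterns seen in >= min_scenarios distinct scenarios.

    Counting distinct scenarios (not raw sightings) guards against over-fitting
    to a single incident that fired a pattern several times.
    """
    scenarios_per_pattern = defaultdict(set)
    for pattern, scenario in pattern_sightings:
        scenarios_per_pattern[pattern].add(scenario)
    return [p for p, s in scenarios_per_pattern.items() if len(s) >= min_scenarios]
```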

Output types

  • New rule in Rules-* files — requires ADR
  • Updated process in Process-* files — requires ADR if material
  • New event subscription for a cross-cutting agent (policy-sensitive) — requires an Event-Bus-Pattern schema update
  • New scenario or update to an existing scenario — via Template-Scenario

Examples

Hypothetical (to show the pattern, not real decisions):

  • Quarter 1: 5 scenarios failed because research data was stale → new rule “research outputs older than 7 days are invalidated and require a refresh”
  • Quarter 2: agents consistently over-confident on Dubai market edge cases → manifest threshold raised from 0.85 to 0.9 for Dubai-specific tasks
  • Quarter 3: handoff gaps recurring between Agent-IntelDirector and Executors → explicit handoff.received confirmation pattern added to Event-Bus-Pattern

Integration with other processes

Process-HypothesisValidation

Hypothesis outcomes (killed / validated / pivoted) are a first-class outcome type. Kill-criteria effectiveness is tracked (were they well-calibrated?).

Process-Escalation

Escalation outcomes are labeled — was the escalation the right call? Under- and over-escalation are tracked per agent.

Process-Rollback

Rollback events are failures by definition. Root cause is captured.

Process-Failure-Patterns

Retros may surface new patterns → added to the doc with examples. Pattern frequency is also tracked — if Pattern 3 appears 5 times in a quarter, that is a systemic issue.

Observability

Metrics tracked in Observability:

  • Outcome label completion rate — % scenarios labeled (target >90%)
  • Calibration error — |predicted confidence - actual success rate|, per agent
  • Retro cadence compliance — weekly retro delivered on-time?
  • Tuning cycle throughput — how many changes per cycle
  • Time from retro insight to implemented change — learning speed
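The first metric above is a straightforward ratio over completed scenarios. A sketch, assuming scenario records expose an optional `outcome` key once labeled (that shape, like the 90% target, is taken from this doc's conventions; the function name is hypothetical):

```python
def label_completion_rate(scenarios) -> float:
    """Fraction of completed scenarios carrying an outcome label (target > 0.9).

    `scenarios` is a list of dicts where a labeled scenario has a non-None
    "outcome" key — an assumed record shape for illustration.
    """
    if not scenarios:
        return 0.0
    labeled = sum(1 for s in scenarios if s.get("outcome") is not None)
    return labeled / len(scenarios)
```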

Privacy / confidentiality

Outcomes may contain sensitive info (business decisions, financial data). Labeling storage respects Rules-DataAccess.

Outcomes are shared externally only in aggregated / anonymized form.

Open Questions

  • Outcome labeling UI — Gateway web vs Telegram vs Slack
  • Who labels when Founder unavailable — delegation rules
  • Handling late outcomes (scenario “completed” but the real impact only seen a week later)
  • Negative outcome disclosure — how transparent externally (investors, partners)
  • Automation level — can Agent-Judge label without a human for low-stakes L1?

Related documents