Process — Outcome Labeling
Mechanics of the continuous learning loops per ADR-0022-learning-loops: from outcome capture to manifest tuning and new playbook rules.
Overview
4 stages:
- Outcome capture — real-time, at scenario completion
- Retro analysis — weekly batch (Agent-Judge) + ad-hoc (Chamber-for-Strategic)
- Threshold tuning — monthly
- Playbook rule emergence — quarterly, via ADR if material
Stage 1: Outcome Capture
When
A scenario or significant task completes. Triggers:
- `scenario.completed` event
- `task.completed` with criticality ≥ L2
- Human-initiated review ("label this")
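The trigger list above can be sketched as a single predicate. This is an illustration only: the flat event-dict shape, the criticality ordering, and the `review.requested` event name are assumptions, not the real Event-Bus-Pattern schema.

```python
# Hypothetical sketch of the capture-trigger check. Event shape and the
# review.requested name are assumptions; only the three trigger conditions
# come from this process doc.
CRITICALITY = {"L1": 1, "L2": 2, "L3": 3, "L4": 4}

def should_capture_outcome(event: dict) -> bool:
    """Return True if this event should open an outcome-labeling task."""
    if event["type"] == "scenario.completed":
        return True
    if event["type"] == "task.completed":
        # Only tasks at criticality L2 and above warrant a label.
        return CRITICALITY[event["criticality"]] >= CRITICALITY["L2"]
    # Human-initiated "label this" reviews arrive as explicit requests.
    return event["type"] == "review.requested"
```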
How
One click via HITL-Gateway, or inline in a response to the Founder.
Outcome taxonomy:
- `success` — achieved the intended outcome
- `partial_success` — some outcomes achieved, not complete
- `neutral` — completed with neither positive nor negative outcome
- `failure` — did not achieve the intended outcome
- `killed` — aborted intentionally (e.g. hypothesis killed per Build-Measure-Learn)
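The taxonomy can be pinned down as an enum so that labels are validated at capture time — a minimal sketch; the class name is ours, only the five values come from this doc.

```python
from enum import Enum

class Outcome(Enum):
    """Outcome taxonomy from this process doc (values verbatim)."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    NEUTRAL = "neutral"
    FAILURE = "failure"
    KILLED = "killed"

# Outcome("partial_success") parses a stored label; an unknown string raises
# ValueError, which catches typos before they reach Memory-Model.
```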
Required fields:
scenario_id: uuid
trace_id: uuid
outcome: success | partial_success | neutral | failure | killed
outcome_reason: "..." # 1-2 sentences why
labeled_by: "founder" | "director" | "agent-judge-retro"
labeled_at: iso8601
cost_actual_usd: X.XX # pre-filled from trace
duration_actual: X hours
confidence_predicted_avg: 0.XX # from agent outputs
confidence_actual: derived from outcome
Low-friction rule
Labeling must not become a tax on the Founder. Rules:
- Default autocomplete — agent predicts outcome, human confirms (yes/no/modify)
- Batch UI in the Gateway — 10 scenarios per session, not a per-scenario interruption
- Optional — L1 tasks are auto-labeled as `success` if there are no errors
- Mandatory — L3+ always gets an explicit human label
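The autocomplete and tiering rules can be combined into one labeling flow. A sketch under stated assumptions: `predict_outcome` and `confirm` are hypothetical stand-ins for the agent prediction and the HITL-Gateway confirmation step, not real APIs.

```python
# Illustrative sketch of the low-friction rules; predict_outcome and confirm
# are hypothetical callables, not actual Gateway interfaces.
def label_with_autocomplete(scenario: dict, predict_outcome, confirm) -> str:
    """Agent predicts the label; the human confirms or overrides (yes/no/modify)."""
    # Optional tier: clean L1 tasks are auto-labeled without interrupting anyone.
    if scenario["criticality"] == "L1" and not scenario.get("errors"):
        return "success"
    predicted = predict_outcome(scenario)
    # Mandatory tier (L3+) and everything else: a human sees the prediction
    # and returns the final label (identical to `predicted` when confirmed).
    return confirm(predicted)
```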
Persistence
Outcomes → Memory-Model structured partition (not vector RAG). Searchable by: agent involved, criticality, outcome type, time range.
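A minimal in-memory sketch of the structured partition's query surface, covering the four indexed dimensions named above. The real Memory-Model API is not shown in this doc, so the class and method names here are assumptions.

```python
from datetime import datetime

# Hypothetical stand-in for the Memory-Model structured partition.
class OutcomeStore:
    def __init__(self):
        self.records = []

    def store(self, record: dict) -> None:
        self.records.append(record)

    def search(self, agent=None, criticality=None, outcome=None, since=None):
        """Filter on the four searchable dimensions from this doc."""
        hits = self.records
        if agent is not None:
            hits = [r for r in hits if agent in r["agents"]]
        if criticality is not None:
            hits = [r for r in hits if r["criticality"] == criticality]
        if outcome is not None:
            hits = [r for r in hits if r["outcome"] == outcome]
        if since is not None:
            hits = [r for r in hits if r["labeled_at"] >= since]
        return hits
```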
Stage 2: Retro Analysis
Weekly batch (Agent-Judge)
Scheduled: every Monday morning.
Input: past week’s outcomes (typically 20-100 depending on scale).
Analysis:
- Confidence calibration — agents predicted 85% success; what was actual?
- Failure clustering — patterns in failures (same agent? same category?)
- Criticality correctness — were any L2 actions retro-labeled as having L3 impact?
- Speed / cost trends — QoQ direction (per Manifesto metric)
- Handoff gap detection — scenario stages with abnormal latency
- New pattern flags — issues not matching existing Process-Failure-Patterns
Output:
- Calibration report — per-agent, per-criticality level
- Failure cluster summary — top 3-5 patterns
- Recommended tuning candidates (for Stage 3)
- Escalation candidates for the Chamber (material issues)
The report is emitted as a `retro.completed` event and stored in Memory-Model.
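The calibration check at the heart of the weekly retro can be sketched as a per-agent comparison of mean predicted confidence against the observed success rate. The record keys below are assumptions about how labeled outcomes are shaped.

```python
# Sketch of the per-agent calibration report; record keys are assumed.
def calibration_error(outcomes: list) -> dict:
    """Per-agent |mean predicted confidence - actual success rate|."""
    by_agent = {}
    for rec in outcomes:
        by_agent.setdefault(rec["agent"], []).append(rec)
    report = {}
    for agent, recs in by_agent.items():
        predicted = sum(r["confidence_predicted"] for r in recs) / len(recs)
        actual = sum(r["outcome"] == "success" for r in recs) / len(recs)
        report[agent] = round(abs(predicted - actual), 3)
    return report
```

An agent that predicted 0.85 on average but succeeded half the time would show a calibration error of 0.35 — a clear Stage 3 tuning candidate.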
Ad-hoc retros (Chamber-for-Strategic)
Triggered for:
- Significant failure (e.g. missed Manifesto SLA repeatedly)
- Unexpected success worth understanding
- Systematic pattern flagged by weekly Judge retro
- Founder explicit request
Chamber retros go deeper: multi-model analysis, cross-referencing historical outcomes, and recommending structural changes.
Output feeds Stage 4 (playbook rules).
Stage 3: Threshold Tuning
Cadence
Monthly review. Owner: Agent-CEO + Founder.
Input
- Weekly retro reports (4 per month)
- Calibration trends per agent
- Observability dashboards
Tuning candidates
- Confidence thresholds per Rules-Criticality — if an agent at 0.85 actually succeeds only 70% of the time, raise the threshold
- Max autonomous criticality in agent manifests — expand if the agent proves reliable, tighten if it proves unreliable
- SLA defaults — if there is a pattern of breaches, either adjust the SLA or address the root cause
- Cost budgets — for scenarios routinely exceeding them, adjust the budget or address the cost drivers
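One of these heuristics — raising a confidence threshold when observed success lags it — can be sketched as follows. The halving step and the 10% minor-change bound (which mirrors the change-management rule in this doc) are illustrative assumptions, not a prescribed formula.

```python
# Hedged sketch of one tuning heuristic: if an agent operating at threshold t
# succeeds below t, propose raising t partway toward closing the gap. The
# 10% bound deciding approval path mirrors the change-management rule; the
# halving step is an assumption.
def propose_threshold(current: float, actual_success_rate: float) -> dict:
    if actual_success_rate >= current:
        return {"change": False, "threshold": current}
    proposed = min(0.99, round(current + (current - actual_success_rate) / 2, 2))
    minor = (proposed - current) / current <= 0.10
    return {
        "change": True,
        "threshold": proposed,
        "approval": "director+ceo" if minor else "adr",
    }
```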
Change management
Minor changes (threshold adjustment ≤ 10%) — Director + Agent-CEO approval, logged in the Decision Log without an ADR.
Material changes (significant threshold shifts, new constraints) — ADR required.
Any change to an agent manifest — version bump; the orchestrator keeps in-flight scenarios on the old version (Reference-Org-Blueprint §7 non-trivial choice #1).
Stage 4: Playbook Rule Emergence
Cadence
Quarterly. Owner: Agent-CEO + Founder.
Input
- All Stage 2 outputs for the quarter
- Ad-hoc Chamber retros
- Stage 3 tuning outcomes
Pattern threshold
A rule change is considered if the pattern appeared independently in 3+ scenarios. This prevents over-fitting to a single incident.
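The 3-scenario emergence gate can be sketched as a count over deduplicated (pattern, scenario) pairs. How retro findings are actually tagged is not specified here, so the pair shape is an assumption.

```python
from collections import Counter

# Sketch of the emergence gate; the (pattern, scenario_id) tagging is assumed.
def emergent_patterns(findings, min_scenarios: int = 3):
    """Patterns seen in >= min_scenarios distinct scenarios become rule candidates."""
    distinct = Counter()
    for pattern, scenario_id in set(findings):  # dedupe repeats within a scenario
        distinct[pattern] += 1
    return sorted(p for p, n in distinct.items() if n >= min_scenarios)
```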
Output types
- New rule in Rules-* files — requires ADR
- Updated process in Process-* files — requires ADR if material
- New event subscription for a cross-cutting agent (policy-sensitive) — requires an Event-Bus-Pattern schema update
- New scenario или existing scenario update — via Template-Scenario
Examples
Hypothetical (to show the pattern, not real decisions):
- Quarter 1: 5 scenarios failed because research data stale → new rule “research outputs older than 7 days invalidate, require refresh”
- Quarter 2: Agents consistently over-confident on Dubai market edge cases → manifest threshold 0.85 → 0.9 for Dubai-specific tasks
- Quarter 3: Handoff gaps recurring between Agent-IntelDirector and Executors → explicit `handoff.received` confirmation pattern added to Event-Bus-Pattern
Integration with other processes
Process-HypothesisValidation
Hypothesis outcomes (killed / validated / pivoted) are a first-class outcome type. Kill-criteria effectiveness is tracked (were they well-calibrated?).
Process-Escalation
Escalation outcomes are labeled — was the escalation the right call? Under- and over-escalation tracked per agent.
Process-Rollback
Rollback events are failures by definition. Root cause is captured.
Process-Failure-Patterns
Retros may surface new patterns → added to the doc with examples. Frequency is also tracked per pattern — if Pattern 3 appears 5 times in a quarter, that is a systemic issue.
Observability
Metrics tracked in Observability:
- Outcome label completion rate — % scenarios labeled (target >90%)
- Calibration error — |predicted confidence - actual success rate|, per agent
- Retro cadence compliance — weekly retro delivered on-time?
- Tuning cycle throughput — how many changes per cycle
- Time from retro insight to implemented change — learning speed
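The first metric in this list is a simple ratio against the >90% target. A sketch, assuming each completed scenario record carries an optional `outcome` field:

```python
# Sketch of the label-completion-rate metric; the record shape is assumed,
# the >90% target comes from this doc.
TARGET = 0.90

def label_completion_rate(scenarios: list) -> float:
    """Fraction of completed scenarios that carry an outcome label."""
    if not scenarios:
        return 0.0
    labeled = sum(1 for s in scenarios if s.get("outcome") is not None)
    return labeled / len(scenarios)
```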
Privacy / confidentiality
Outcomes may contain sensitive info (business decisions, financial data). Labeling storage respects Rules-DataAccess.
Outcomes are shared externally only in aggregated / anonymized form.
Open Questions
- Outcome labeling UI — Gateway web vs Telegram vs Slack
- Who labels when Founder unavailable — delegation rules
- Handling late outcomes (scenario “completed” but real impact seen a week later)
- Negative outcome disclosure — how transparent externally (investors, partners)
- Automation level — can Agent-Judge label without a human for low-stakes L1?
Related documents
- ADR-0022-learning-loops — decision principle
- ADR-0019-cross-model-validation — precedent (Judge-as-evaluator)
- Manifesto — metrics targets
- Agent-Judge — primary retro executor
- Chamber-for-Strategic — escalated retros
- HITL-Gateway — labeling entry point
- Observability — metrics infrastructure
- Memory-Model — outcome persistence
- Process-HypothesisValidation
- Process-Escalation
- Process-Rollback
- Process-Failure-Patterns
- Rules-Criticality
- Rules-AgentDecisionBoundaries
- Build-Measure-Learn
- Reference-Org-Blueprint — section §8