Process — Outcome Labeling

Mechanics of the continuous learning loops per ADR-0022-learning-loops: from outcome capture to manifest tuning and new playbook rules.

Overview

4 stages:

  1. Outcome capture — real-time, at scenario completion
  2. Retro analysis — weekly batch (Agent-Judge) + ad-hoc (Chamber-for-Strategic)
  3. Threshold tuning — monthly
  4. Playbook rule emergence — quarterly, via ADR if material

Stage 1: Outcome Capture

When

A scenario or significant task completes. Triggers:

  • scenario.completed event
  • task.completed with criticality ≥ L2
  • Human-initiated review (“label this”)

How

One click via HITL-Gateway, or inline in the response to the Founder.

Outcome taxonomy:

  • success — achieved intended outcome
  • partial_success — some outcomes, not complete
  • neutral — completed with neither a positive nor a negative outcome
  • failure — did not achieve the intended outcome
  • killed — aborted intentionally (e.g. hypothesis killed per Build-Measure-Learn)

Required fields:

scenario_id: uuid
trace_id: uuid
outcome: success | partial_success | neutral | failure | killed
outcome_reason: "..."                # 1-2 sentences why
labeled_by: "founder" | "director" | "agent-judge-retro"
labeled_at: iso8601
cost_actual_usd: X.XX                # pre-filled from trace
duration_actual: X hours
confidence_predicted_avg: 0.XX       # from agent outputs
confidence_actual: derived from outcome
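The "derived from outcome" field can be sketched as a simple mapping from outcome label to a realized-confidence score; the numeric values below are an assumption for illustration, not prescribed by this process:

```python
# Assumed mapping from outcome label to the realized-confidence score used in
# calibration. Values are illustrative; "killed" is excluded (an intentional
# kill says nothing about whether the agent's confidence was well placed).
CONFIDENCE_ACTUAL = {
    "success": 1.0,
    "partial_success": 0.5,
    "neutral": 0.5,
    "failure": 0.0,
    "killed": None,
}

def confidence_actual(outcome: str):
    """Derive confidence_actual from the labeled outcome (hypothetical helper)."""
    return CONFIDENCE_ACTUAL[outcome]
```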

Low-friction rule

Labeling must not become a tax on the Founder. Rules:

  • Default autocomplete — agent predicts outcome, human confirms (yes/no/modify)
  • Batch UI in the Gateway — 10 scenarios per session, not per-scenario interruptions
  • Optional — L1 tasks are auto-labeled as success if there were no errors
  • Mandatory — L3+ always gets an explicit human label
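The three rules above form a simple routing decision. A minimal sketch, assuming a numeric criticality level (1 = L1) and an error flag from the trace; the function name and return values are illustrative, not a real Gateway API:

```python
def labeling_mode(criticality: int, had_errors: bool) -> str:
    """Decide how a completed scenario gets labeled, per the low-friction rules.

    Hypothetical helper: criticality 1 maps to L1, 3 to L3, etc.
    """
    if criticality >= 3:
        # Mandatory: L3+ always gets an explicit human label.
        return "human_explicit"
    if criticality == 1 and not had_errors:
        # Optional: clean L1 tasks are auto-labeled as success.
        return "auto_success"
    # Default autocomplete: agent predicts, human confirms (yes/no/modify).
    return "agent_predicts_human_confirms"
```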

Persistence

Outcomes → Memory-Model structured partition (not vector RAG). Searchable by: agent involved, criticality, outcome type, time range.

Stage 2: Retro Analysis

Weekly batch (Agent-Judge)

Scheduled: every Monday morning.

Input: past week’s outcomes (typically 20-100, depending on scale).

Analysis:

  1. Confidence calibration — agents predicted 85% success; what was the actual rate?
  2. Failure clustering — patterns in failures (same agent? same category?)
  3. Criticality correctness — were there L2 actions retro-labeled as having L3 impact?
  4. Speed / cost trends — QoQ direction (per Manifesto metric)
  5. Handoff gap detection — scenario stages с abnormal latency
  6. New pattern flags — issues not matching existing Process-Failure-Patterns
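The calibration step (item 1) reduces to comparing mean predicted confidence against the realized success rate per agent. A sketch, assuming each outcome record carries the Stage 1 fields plus an `agent` key (that input shape is an assumption):

```python
from collections import defaultdict

def calibration_report(outcomes):
    """Per-agent calibration error: |mean predicted confidence - actual success rate|.

    `outcomes` is a list of dicts with keys "agent", "confidence_predicted_avg",
    and "outcome" — an assumed shape for illustration.
    """
    by_agent = defaultdict(list)
    for o in outcomes:
        by_agent[o["agent"]].append(o)

    report = {}
    for agent, rows in by_agent.items():
        predicted = sum(r["confidence_predicted_avg"] for r in rows) / len(rows)
        actual = sum(1 for r in rows if r["outcome"] == "success") / len(rows)
        report[agent] = round(abs(predicted - actual), 3)
    return report
```

A large gap (e.g. predicted 0.85, actual 0.50) makes the agent a tuning candidate for Stage 3.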

Output:

  • Calibration report — per-agent, per-criticality level
  • Failure cluster summary — top 3-5 patterns
  • Recommended tuning candidates (for Stage 3)
  • Escalation candidates for the Chamber (material issues)

The report is emitted as a retro.completed event and stored in Memory-Model.

Ad-hoc retros (Chamber-for-Strategic)

Triggered for:

  • Significant failure (e.g. a Manifesto SLA missed repeatedly)
  • Unexpected success worth understanding
  • Systematic pattern flagged by weekly Judge retro
  • Founder explicit request

Chamber retros go deeper — multi-model analysis, cross-referencing historical outcomes, recommending structural changes.

Output feeds Stage 4 (playbook rules).

Stage 3: Threshold Tuning

Cadence

Monthly review. Owner: Agent-CEO + Founder.

Input

  • Weekly retro reports (4 per month)
  • Calibration trends per agent
  • Observability dashboards

Tuning candidates

  • Confidence thresholds per Rules-Criticality — if an agent at 0.85 actually succeeds only 70% of the time, raise the threshold
  • Max autonomous criticality in agent manifests — expand if the agent proves reliable, tighten if it proves unreliable
  • SLA defaults — if there is a pattern of breaches, either adjust the SLA or address the root cause
  • Cost budgets — for scenarios routinely exceeding them, adjust the budget or address the cost drivers
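The first candidate (over-confident agents) can be flagged mechanically from the weekly calibration data. A sketch, assuming per-agent (predicted, actual) pairs and an illustrative 0.10 over-confidence gap (the gap value is an assumption, not a rule from this doc):

```python
def tuning_candidates(calibration, overconfidence_gap=0.10):
    """Flag agents whose actual success rate trails their predicted confidence
    by more than `overconfidence_gap` — candidates for a raised threshold.

    `calibration` maps agent -> (mean predicted confidence, actual success rate);
    shape and gap are illustrative assumptions.
    """
    flags = []
    for agent, (predicted, actual) in calibration.items():
        if predicted - actual > overconfidence_gap:
            flags.append((agent, f"raise threshold: predicted {predicted:.2f}, actual {actual:.2f}"))
    return flags
```

For the example in the bullet above — an agent at 0.85 succeeding 70% of the time — the 0.15 gap exceeds 0.10 and the agent is flagged.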

Change management

Minor changes (threshold adjustment ≤ 10%) — Director + Agent-CEO approval, logged in the Decision Log without an ADR.

Material changes (significant threshold shifts, new constraints) — ADR required.

Any change to an agent manifest requires a version bump; the orchestrator continues in-flight scenarios on the old version (Reference-Org-Blueprint §7, non-trivial choice #1).
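The version-pinning behavior above can be sketched minimally: in-flight scenarios keep the manifest version they started with, while new scenarios pick up the bumped one. Class and method names are illustrative, not the real orchestrator API:

```python
class Orchestrator:
    """Sketch of manifest version pinning on change (hypothetical names)."""

    def __init__(self):
        self.manifest_version = 1
        self.pins = {}  # scenario_id -> manifest version pinned at start

    def start_scenario(self, scenario_id: str) -> None:
        # Each scenario is pinned to the manifest version current at launch.
        self.pins[scenario_id] = self.manifest_version

    def bump_manifest(self) -> None:
        # A manifest change bumps the version; existing pins are untouched,
        # so in-flight scenarios keep running on the old version.
        self.manifest_version += 1

orch = Orchestrator()
orch.start_scenario("s1")   # runs on v1
orch.bump_manifest()        # tuning change lands
orch.start_scenario("s2")   # runs on v2; s1 still pinned to v1
```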

Stage 4: Playbook Rule Emergence

Cadence

Quarterly. Owner: Agent-CEO + Founder.

Input

  • All Stage 2 outputs for the quarter
  • Ad-hoc Chamber retros
  • Stage 3 tuning outcomes

Pattern threshold

A rule change is considered if a pattern appeared independently in 3+ scenarios. This prevents over-fitting to a single incident.
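The 3-scenario threshold is a count over distinct scenarios, not raw sightings. A minimal sketch, assuming pattern sightings arrive as (pattern_id, scenario_id) pairs (an assumed shape):

```python
from collections import defaultdict

def emergent_rules(pattern_sightings, min_scenarios=3):
    """Return patterns seen in >= min_scenarios distinct scenarios.

    Counting distinct scenarios (not raw sightings) guards against over-fitting
    to a single incident that fired a pattern several times.
    """
    scenarios_per_pattern = defaultdict(set)
    for pattern, scenario in pattern_sightings:
        scenarios_per_pattern[pattern].add(scenario)
    return [p for p, s in scenarios_per_pattern.items() if len(s) >= min_scenarios]
```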

Output types

  • New rule in Rules-* files — requires ADR
  • Updated process in Process-* files — requires ADR if material
  • New event subscription for a cross-cutting agent (policy-sensitive) — requires an Event-Bus-Pattern schema update
  • New scenario or update to an existing scenario — via Template-Scenario

Examples

Hypothetical (to show the pattern, not real decisions):

  • Quarter 1: 5 scenarios failed because research data was stale → new rule “research outputs older than 7 days are invalidated and require a refresh”
  • Quarter 2: agents consistently over-confident on Dubai market edge cases → manifest threshold raised from 0.85 to 0.9 for Dubai-specific tasks
  • Quarter 3: handoff gaps recurring between Agent-IntelDirector and Executors → explicit handoff.received confirmation pattern added to Event-Bus-Pattern

Integration with other processes

Process-HypothesisValidation

Hypothesis outcomes (killed / validated / pivoted) are a first-class outcome type. Kill-criteria effectiveness is tracked (were they well-calibrated?).

Process-Escalation

Escalation outcomes are labeled — was the escalation the right call? Under- and over-escalation are tracked per agent.

Process-Rollback

Rollback events are failures by definition. Root cause is captured.

Process-Failure-Patterns

Retros may surface new patterns → added to the doc with examples. Pattern frequency is also tracked — if Pattern 3 appears 5 times in a quarter, that is a systemic issue.

Observability

Metrics tracked in Observability:

  • Outcome label completion rate — % scenarios labeled (target >90%)
  • Calibration error — |predicted confidence - actual success rate|, per agent
  • Retro cadence compliance — weekly retro delivered on-time?
  • Tuning cycle throughput — how many changes per cycle
  • Time from retro insight to implemented change — learning speed
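The first metric above is a straightforward ratio over completed scenarios. A sketch, assuming scenario records expose an optional `outcome` key once labeled (that shape, like the 90% target, is taken from this doc's conventions; the function name is hypothetical):

```python
def label_completion_rate(scenarios) -> float:
    """Fraction of completed scenarios carrying an outcome label (target > 0.9).

    `scenarios` is a list of dicts where a labeled scenario has a non-None
    "outcome" key — an assumed record shape for illustration.
    """
    if not scenarios:
        return 0.0
    labeled = sum(1 for s in scenarios if s.get("outcome") is not None)
    return labeled / len(scenarios)
```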

Privacy / confidentiality

Outcomes may contain sensitive info (business decisions, financial data). Labeling storage respects Rules-DataAccess.

Outcomes are shared externally only in aggregated / anonymized form.

Open Questions

  • Outcome labeling UI — Gateway web vs Telegram vs Slack
  • Who labels when Founder unavailable — delegation rules
  • Handling late outcomes (scenario “completed” but the real impact only seen a week later)
  • Negative outcome disclosure — how transparent externally (investors, partners)
  • Automation level — can Agent-Judge label without a human for low-stakes L1?

Related documents