Process — Failure Patterns
Recurring failure patterns in multi-agent flows and their mitigation strategies. This document does not describe a specific flow; it is a reference for authors of scenarios and agent manifests.
Based on Reference-Org-Blueprint §9 Cross-Scenario Failure Patterns.
Why document these
Multi-agent systems do not fail in the same places as monolithic software:
- Failures are often emergent (arising from interaction between agents), not local
- Detection is harder: each agent sees only its own scope
- Recovery requires coordination
- Postmortems are needed to teach the system, not just to fix a bug
Every scenario author should review these patterns and check that the scenario handles them. Not all patterns are relevant to every scenario; pick the applicable ones.
Pattern 1: Handoff gaps
Description: State is lost between agents. Agent A finished a task, but Agent B did not pick it up in time, or picked it up with the wrong context or with incomplete data.
Examples:
- Research result is in memory, but Director is not subscribed to research.completed
- Agent-CEO created a task, but Director did not see it in the queue
- Child task is done, but the parent scenario did not advance
Detection:
- Scenario timeout (waiting for a transition)
- Event emitted, but no subscriber reacted within the SLA
- Orphan tasks in the queue
Mitigation (design-time):
- Explicit handoff events (do not rely on implicit state)
- Confirmation pattern: B emits handoff.received to confirm receipt
- Orchestrator tracks expected transitions; timeouts trigger escalation
Recovery (runtime):
- Orchestrator replays the last event to the missed subscriber
- Escalate to Agent-CEO if the replay does not work
- Manual re-trigger via HITL-Gateway
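The confirmation pattern and timeout tracking can be sketched as follows. This is a minimal illustration, not the Orchestrator's actual implementation: `HandoffTracker`, its method names, and the SLA value are hypothetical; only the `handoff.received` event name comes from this document.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffTracker:
    """Hypothetical helper: tracks handoffs awaiting a handoff.received confirmation."""

    sla_seconds: float
    # handoff_id -> time (in seconds) at which the handoff event was emitted
    pending: dict = field(default_factory=dict)

    def emitted(self, handoff_id: str, now: float) -> None:
        # Agent A emitted an explicit handoff event.
        self.pending[handoff_id] = now

    def received(self, handoff_id: str) -> None:
        # Agent B emitted handoff.received for this handoff.
        self.pending.pop(handoff_id, None)

    def overdue(self, now: float) -> list:
        # Handoffs with no confirmation within the SLA:
        # candidates for replay first, then escalation.
        return sorted(
            h for h, t in self.pending.items() if now - t > self.sla_seconds
        )


tracker = HandoffTracker(sla_seconds=60)
tracker.emitted("research-42", now=0)
tracker.emitted("task-7", now=10)
tracker.received("task-7")  # Agent B confirmed this one
overdue = tracker.overdue(now=120)  # only the unconfirmed handoff remains
```

Passing `now` explicitly, instead of reading the clock inside the tracker, keeps the sketch deterministic and easy to test.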
Pattern 2: HITL bottlenecks
Description: The agent escalated correctly, but the human is overloaded, so the approval stays pending and the scenario is stuck.
Examples:
- 20 agents waiting on a single Founder approval
- Stage 9 Hypothesis-to-Validation (see Template-Scenario) pending > SLA
- Weekend / offline period: no approvals moving
Detection:
- HITL-Gateway SLA tracker: pending > threshold
- Queue depth monitoring per approver
- Scenario completion latency p95 growing
Mitigation (design-time):
- Batched approvals in the Gateway
- Delegation rules (Rules-AgentDecisionBoundaries)
- Pre-approval templates for predictable patterns
- Lower criticality if possible (review whether L3 is really needed vs L2)
Recovery (runtime):
- Auto-escalate up the hierarchy after an SLA breach
- Digest notifications, e.g. “5 approvals pending >24h”
- Founder “overload mode”: batch-approve safe categories with audit
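A rough sketch of per-approver SLA tracking and digest generation follows. The record fields (`approver`, `requested_at_hours`) and both function names are assumptions made for illustration; the HITL-Gateway data model is not specified in this document.

```python
from collections import Counter


def sla_breaches(pending, now_hours, sla_hours):
    """Count approvals per approver that have been pending longer than sla_hours."""
    return Counter(
        item["approver"]
        for item in pending
        if now_hours - item["requested_at_hours"] > sla_hours
    )


def digest(breaches, sla_hours):
    """One digest line per approver, most overloaded first."""
    return [
        f"{approver}: {count} approvals pending >{sla_hours}h"
        for approver, count in breaches.most_common()
    ]


# Hypothetical queue snapshot; times are hours since an arbitrary start.
pending = [
    {"approver": "founder", "requested_at_hours": 0},
    {"approver": "founder", "requested_at_hours": 10},
    {"approver": "founder", "requested_at_hours": 40},
    {"approver": "director", "requested_at_hours": 5},
]
breaches = sla_breaches(pending, now_hours=48, sla_hours=24)
lines = digest(breaches, sla_hours=24)
```

The same per-approver counts could drive auto-escalation up the hierarchy once a count or age threshold is crossed.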
Pattern 3: Over-confident action
Description: The agent executes in an edge case where it should have escalated. Confidence calibration is off, or a novel situation was missed.
Examples:
- Agent is confident in a research result on a new topic (no prior data), but is actually hallucinating
- Agent-MarketResearcher applies the Dubai playbook to another market without adjustment
- L2 action executed autonomously, but the real impact was L3
Detection:
- Post-hoc Process-OutcomeLabeling: outcomes worse than the confidence predicted
- Agent-Judge flags overconfident claims
- Founder feedback on executed actions
Mitigation (design-time):
- Confidence thresholds in Rules-Criticality
- Novel-situation detection: if the pattern is not in memory, escalate
- Default-down rule (Rules-Criticality: ambiguity → lower level)
- Required “uncertainty acknowledgment” in agent outputs
Recovery (runtime):
- Retrospective escalation: “this was L3, not L2, review outcome”
- Action rollback if possible
- Learning loop: pattern goes into guardrails
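The novel-situation check and the confidence threshold can be combined into a single gate. The sketch below is one possible shape, assuming memory can be queried as a set of known pattern keys; `decide`, the key names, and the threshold value are hypothetical and not taken from Rules-Criticality.

```python
def decide(pattern_key, confidence, known_patterns, threshold):
    """Return "execute" or "escalate" for a proposed action.

    Novelty is checked first: high confidence on a pattern with no prior
    data is exactly the over-confident case this section describes.
    """
    if pattern_key not in known_patterns:
        return "escalate"  # novel situation: nothing in memory to calibrate against
    if confidence < threshold:
        return "escalate"  # known pattern, but confidence below the threshold
    return "execute"


# Hypothetical memory contents and threshold.
known = {"dubai-market-entry"}
familiar = decide("dubai-market-entry", 0.95, known, threshold=0.8)
novel = decide("other-market-entry", 0.99, known, threshold=0.8)
unsure = decide("dubai-market-entry", 0.60, known, threshold=0.8)
```

Checking novelty before confidence means a reported confidence of 0.99 cannot bypass escalation when there is no prior data.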
Pattern 4: Under-confident escalation flood
Description: The agent escalates everything and the human drowns. The opposite of Pattern 3.
Examples:
- New agent tuned aggressively, escalates routine actions
- Threshold too conservative → every moderate confidence triggers HITL
- Agent is scared after a recent rollback and over-escalates
Detection:
- Escalation rate per agent abnormally high
- HITL-Gateway queue depth dominated by a single agent
- Human feedback “too many trivial approvals”
Mitigation (design-time):
- Gradual threshold tuning
- Training period for new agents (shadow mode)
- Explicit “low-risk auto-approve” list in the manifest
Recovery (runtime):
- Threshold adjustment in the agent manifest
- Temporarily auto-approve low-risk category
- Manifest review in the quarterly tuning cycle
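Detecting a queue dominated by one agent reduces to computing each agent's share of pending escalations. In the sketch below the function names and the `max_share` cutoff are illustrative assumptions; what counts as "abnormally high" is left to threshold tuning.

```python
from collections import Counter


def queue_share(escalations):
    """Map each agent to its fraction of all escalations in the queue."""
    counts = Counter(escalations)
    total = sum(counts.values())
    return {agent: n / total for agent, n in counts.items()}


def dominating_agents(escalations, max_share):
    """Agents whose share of the queue exceeds max_share."""
    return sorted(
        agent
        for agent, share in queue_share(escalations).items()
        if share > max_share
    )


# Hypothetical queue: one entry per pending escalation, labeled by emitting agent.
queue = ["agent-a"] * 8 + ["agent-b"] * 2
flagged = dominating_agents(queue, max_share=0.5)
```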
Pattern 5: Trace gaps
Description: The decision path is lost between agents. When an audit is needed, the full path cannot be reconstructed. Violates Law 6 (trace).
Examples:
- Agent emits an event without trace_id
- Subscriber does not propagate parent_span_id
- External tool call without trace metadata
Detection:
- Trace completeness metric: % of scenarios with a full trace
- Audit queries fail to find decision rationale
- Events with missing trace_id / span_id
Mitigation (design-time):
- Event-Bus-Pattern enforces trace_id / span_id as mandatory in the envelope
- Agent templates include trace propagation boilerplate
- External tool wrappers inject trace metadata
Recovery (runtime):
- Partial trace accepted, gap explicitly noted
- Escalate if a material decision lacks a proper trace
- Scenario marked “trace incomplete” in memory
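Envelope validation and trace propagation might look like the sketch below. The field names `trace_id`, `span_id`, and `parent_span_id` come from this document; the dict-based envelope and both helper functions are assumptions, since the Event-Bus-Pattern envelope schema is not shown here.

```python
REQUIRED_TRACE_FIELDS = ("trace_id", "span_id")


def missing_trace_fields(envelope):
    """List the mandatory trace fields that are absent or empty in an envelope."""
    return [f for f in REQUIRED_TRACE_FIELDS if not envelope.get(f)]


def child_envelope(parent, event_type, span_id):
    """Build an envelope for an event emitted in reaction to `parent`.

    The trace_id is carried over unchanged and the parent's span_id becomes
    parent_span_id, so the decision path stays reconstructable.
    """
    return {
        "type": event_type,
        "trace_id": parent["trace_id"],
        "parent_span_id": parent["span_id"],
        "span_id": span_id,
    }


# Hypothetical events and identifiers.
parent = {"type": "research.completed", "trace_id": "t-1", "span_id": "s-1"}
child = child_envelope(parent, "report.drafted", span_id="s-2")
broken = {"type": "tool.called", "span_id": "s-9"}  # emitted without trace_id
```

A bus that enforces the rule would reject any envelope for which `missing_trace_fields` returns a non-empty list.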
Pattern 6: Cross-agent contradiction
Description: Two or more agents give the human contradictory recommendations. The human is confused and nobody resolves the conflict.
Examples:
- Agent-IntelDirector recommends entry into a niche, Agent-NicheEvaluationDirector recommends skip
- Market research says X, competitor research says the opposite
- Two executors on the same task produce divergent results
Detection:
- Explicit contradictions in agent outputs
- Agent-Judge flags conflict
- Human feedback “agents telling me different things”
Mitigation (design-time):
- Orchestrator conflict resolution rules
- Chamber-for-Strategic arbitration for strategic contradictions
- Single source of truth per decision type
- Explicit “primary” vs “advisory” role distinction
Recovery (runtime):
- Orchestrator triggers Chamber review
- Both agents explain their reasoning to the human
- Escalate to Agent-CEO for routing
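Detecting explicit contradictions is straightforward when recommendations are structured. The sketch below assumes each agent output carries a decision key and a recommendation value; that record shape and the `find_contradictions` helper are hypothetical, and free-text outputs would need a different approach.

```python
from collections import defaultdict


def find_contradictions(recommendations):
    """Group recommendations by decision and keep those where agents disagree."""
    by_decision = defaultdict(dict)
    for rec in recommendations:
        by_decision[rec["decision"]][rec["agent"]] = rec["recommendation"]
    return {
        decision: per_agent
        for decision, per_agent in by_decision.items()
        if len(set(per_agent.values())) > 1
    }


# Hypothetical outputs; the decision keys are invented for illustration.
recommendations = [
    {"agent": "Agent-IntelDirector", "decision": "niche-entry", "recommendation": "enter"},
    {"agent": "Agent-NicheEvaluationDirector", "decision": "niche-entry", "recommendation": "skip"},
    {"agent": "Agent-IntelDirector", "decision": "pricing-review", "recommendation": "hold"},
]
conflicts = find_contradictions(recommendations)
```

Each entry in `conflicts` is a candidate for Chamber review or for routing by Agent-CEO.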
Pattern 7: Policy / rules blindspots
Description: An action by one agent triggers obligations for another agent (policy, security, compliance), but the second agent never finds out.
Examples:
- Agent publishes research publicly, but no one flagged the IP review obligation
- External communication was sent, but privacy-agent was not notified for audit
- Data access was expanded, but security-agent did not audit it
Detection:
- Retro review uncovers missed obligations
- External feedback (“you should have done X”)
- Audit gaps
Mitigation (design-time):
- Policy-sensitive agents [future] subscribed to cross-cutting events
- Policy-Layer inspects all actions, not just emitter-side
- Event naming convention surfaces sensitive actions (e.g. *.external, *.published)
Recovery (runtime):
- Post-hoc audit + corrective action
- HITL-Gateway escalation
- Update policy-agent subscriptions to catch similar cases in the future
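The event naming convention makes sensitive actions matchable with plain glob patterns. The sketch below uses Python's standard `fnmatch` module; the two patterns are the ones given as examples in this section, while `is_policy_sensitive` and the sample event names other than research.completed are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the naming-convention examples in this document.
SENSITIVE_PATTERNS = ("*.external", "*.published")


def is_policy_sensitive(event_name):
    """True if the event name matches any policy-sensitive pattern."""
    return any(fnmatchcase(event_name, pattern) for pattern in SENSITIVE_PATTERNS)


events = ["research.completed", "research.published", "communication.external"]
sensitive = [name for name in events if is_policy_sensitive(name)]
```

`fnmatchcase` is used instead of `fnmatch` so matching is case-sensitive on every platform.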
Usage guidance
For scenario authors
Review this doc when creating a new scenario. For each pattern:
- Is it relevant to the scenario? (yes/no + reasoning)
- If yes, is a mitigation applied? (link or description)
- Is a recovery path defined?
A scenario that has not considered the patterns is incomplete. Agent-Judge may flag it.
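The three review questions can be checked mechanically if the review is recorded as structured data. The sketch below assumes a dict keyed by pattern number with `relevant`, `reasoning`, `mitigation`, and `recovery` fields; this format and `review_gaps` are illustrative and not part of Template-Scenario.

```python
PATTERN_COUNT = 7  # number of patterns currently in this document


def review_gaps(review):
    """Return human-readable gaps in a scenario's pattern review."""
    gaps = []
    for n in range(1, PATTERN_COUNT + 1):
        entry = review.get(n)
        if entry is None or not entry.get("reasoning"):
            gaps.append(f"Pattern {n}: relevance not assessed")
        elif entry.get("relevant") and not entry.get("mitigation"):
            gaps.append(f"Pattern {n}: no mitigation")
        elif entry.get("relevant") and not entry.get("recovery"):
            gaps.append(f"Pattern {n}: no recovery path")
    return gaps


# Hypothetical review: everything marked not relevant, except Pattern 1
# (mitigation given, recovery missing) and Pattern 5 (not assessed at all).
review = {
    n: {"relevant": False, "reasoning": "single-agent scenario"}
    for n in range(1, PATTERN_COUNT + 1)
}
review[1] = {
    "relevant": True,
    "reasoning": "two agents hand off research results",
    "mitigation": "handoff.received confirmation",
}
del review[5]
gaps = review_gaps(review)
```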
For agent manifest authors
Review the pattern list when designing an agent:
- Confidence thresholds → Pattern 3/4
- Escalation rules → Pattern 2
- Event subscriptions → Pattern 1/5
- Trace propagation → Pattern 5
- Contradiction handling → Pattern 6
For retrospectives
In a post-incident review, ask:
- Which of the 7 patterns showed up?
- Was detection adequate?
- Was a mitigation in place but failed, or was it missing?
- What should be added or changed to prevent recurrence?
The outcome feeds Process-OutcomeLabeling and manifest / threshold tuning.
New pattern addition
If a pattern that is not on the list is discovered:
- Document it in shadow notes (temporarily)
- After 2+ occurrences, propose a formal addition via an ADR
- Update this doc and the relevant mitigations in the system
The patterns are not frozen; the list will grow with experience.
Related documents
- Reference-Org-Blueprint: source of section §9
- Template-Scenario: mandatory failure modes section
- Agent-Judge: runtime detection
- HITL-Gateway: bottleneck / escalation handling
- Chamber-for-Strategic: contradiction arbitration
- Event-Bus-Pattern: trace propagation
- Policy-Layer: blindspot prevention
- Observability: metrics for detection
- Process-OutcomeLabeling: feeds learning loops
- Process-Escalation
- Process-Rollback
- Rules-Criticality
- Rules-AgentDecisionBoundaries
- Manifesto: principles of observability, async, separate judge