Process — Failure Patterns

Recurring failure patterns in multi-agent flows and their mitigation strategies. This document does not describe a specific flow; it is a reference for authors of scenarios and agent manifests.

Based on Reference-Org-Blueprint §9 Cross-Scenario Failure Patterns.

Why document these

Multi-agent systems do not fail in the same places as monolithic software:

  • Failures are often emergent (arising from interaction between agents), not local
  • Detection is harder: each agent sees only its own scope
  • Recovery requires coordination
  • Postmortems are needed to teach the system, not just to fix a bug

Every scenario author should review these patterns and verify that the scenario handles them. Not every pattern is relevant to every scenario; pick the applicable ones.

Pattern 1: Handoff gaps

Description: State is lost between agents. Agent A finished a task, but Agent B did not pick it up in time, or picked it up with the wrong context or with incomplete data.

Examples:

  • Research result is in memory, but Director is not subscribed to research.completed
  • Agent-CEO created a task, but Director did not see it in the queue
  • Child task is done, but the parent scenario did not advance

Detection:

  • Scenario timeout (still waiting for a transition)
  • Event emitted, but no subscriber reacted within the SLA
  • Orphan tasks in the queue

Mitigation (design-time):

  • Explicit handoff events (do not rely on implicit state)
  • Confirmation pattern: B emits handoff.received to acknowledge receipt
  • Orchestrator tracks expected transitions; timeouts trigger escalation
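
The confirmation pattern and timeout tracking above can be sketched as a small bookkeeping structure. This is a minimal illustration, not the orchestrator's actual implementation: the class name `HandoffTracker`, its methods, and the single `sla_seconds` setting are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffTracker:
    """Tracks handoffs that still await a handoff.received confirmation.

    Hypothetical helper: names and fields are illustrative assumptions.
    """

    sla_seconds: float
    # handoff_id -> time (in seconds) at which the handoff event was emitted
    pending: dict[str, float] = field(default_factory=dict)

    def emitted(self, handoff_id: str, now: float) -> None:
        """Record that agent A emitted an explicit handoff event."""
        self.pending[handoff_id] = now

    def received(self, handoff_id: str) -> None:
        """Record that agent B confirmed with handoff.received."""
        self.pending.pop(handoff_id, None)

    def overdue(self, now: float) -> list[str]:
        """Return unconfirmed handoffs older than the SLA (escalation candidates)."""
        return sorted(
            handoff_id
            for handoff_id, emitted_at in self.pending.items()
            if now - emitted_at > self.sla_seconds
        )
```

Passing `now` explicitly instead of reading the clock inside the tracker keeps the timeout logic deterministic and easy to test.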

Recovery (runtime):

  • Orchestrator replays the last event to the subscriber that missed it
  • Escalate to Agent-CEO if the replay does not work
  • Manual re-trigger via HITL-Gateway

Pattern 2: HITL bottlenecks

Description: The agent escalated correctly, but the human is overloaded, so the approval stays pending and the scenario is stuck.

Examples:

  • 20 agents waiting on a single Founder approval
  • Stage 9 Hypothesis-to-Validation (see Template-Scenario) pending > SLA
  • Weekend / offline period: no approvals moving

Detection:

  • HITL-Gateway SLA tracker — pending > threshold
  • Queue depth monitoring per approver
  • Scenario completion latency p95 growing
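
The first two detection signals can be computed from a snapshot of pending approvals. The sketch below assumes a hypothetical `PendingApproval` record; the field names and the way HITL-Gateway actually stores pending items are not specified in this document.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingApproval:
    """Hypothetical snapshot row for one pending approval."""

    approval_id: str
    approver: str
    requested_at: float  # seconds, same clock as `now` below


def sla_breaches(pending, now, sla_seconds):
    """Return ids of approvals that have been pending longer than the SLA."""
    return [p.approval_id for p in pending if now - p.requested_at > sla_seconds]


def queue_depth_per_approver(pending):
    """Count pending approvals per approver to spot an overloaded human."""
    return Counter(p.approver for p in pending)
```

A depth counter per approver makes the "20 agents waiting on one Founder" case visible before individual items breach the SLA.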

Mitigation (design-time):

  • Batched approvals in the Gateway
  • Delegation rules (Rules-AgentDecisionBoundaries)
  • Pre-approval templates for predictable patterns
  • Lower the criticality where possible (review whether L3 is really needed vs L2)

Recovery (runtime):

  • Auto-escalate up the hierarchy after an SLA breach
  • Digest notifications — “5 approvals pending >24h”
  • Founder “overload mode” — batch-approve safe categories with audit

Pattern 3: Over-confident action

Description: The agent executes in an edge case where it should have escalated. Confidence calibration is off, or a novel situation was missed.

Examples:

  • Agent is confident in a research result on a new topic (no prior data), but is actually hallucinating
  • Agent-MarketResearcher applies the Dubai playbook to another market without adjustment
  • L2 action executed autonomously, but the real impact was L3

Detection:

Mitigation (design-time):

  • Confidence thresholds in Rules-Criticality
  • Novel-situation detection: if the pattern is not in memory, escalate
  • Default-down rule (Rules-Criticality: ambiguity → lower level)
  • Required “uncertainty acknowledgment” in agent outputs
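
The threshold and novel-situation checks above can be combined into a single pre-execution gate. The sketch below is an assumption-heavy illustration: `Proposal`, `pattern_key`, and `should_escalate` are hypothetical names, and the real thresholds live in Rules-Criticality, which is not reproduced here. The default-down rule is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical description of an action an agent wants to take."""

    action: str
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0
    pattern_key: str   # illustrative key identifying the kind of situation


def should_escalate(proposal, known_patterns, min_confidence):
    """Return (escalate, reason) for a proposed autonomous action.

    Novelty is checked first: high confidence on a situation that is
    not in memory is exactly the over-confident case.
    """
    if proposal.pattern_key not in known_patterns:
        return True, "novel situation: pattern not in memory"
    if proposal.confidence < min_confidence:
        return True, "confidence below threshold"
    return False, "within autonomous bounds"
```

For example, a proposal keyed `"market:other"` checked against a memory that only contains `"market:dubai"` escalates regardless of its confidence score.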

Recovery (runtime):

  • Retrospective escalation — “this was L3, not L2, review outcome”
  • Action rollback if possible
  • Learning loop: pattern goes into guardrails

Pattern 4: Under-confident escalation flood

Description: The agent escalates everything and the human drowns. The opposite of Pattern 3.

Examples:

  • New agent tuned aggressively, escalates routine actions
  • Threshold too conservative → every moderate confidence triggers HITL
  • Agent became overly cautious after a recent rollback and over-escalates

Detection:

  • Escalation rate per agent abnormally high
  • HITL-Gateway queue depth dominated by a single agent
  • Human feedback “too many trivial approvals”
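
The first two signals are simple ratios over an action log and the current queue. The sketch below assumes the log is available as `(agent, escalated)` pairs and the queue as a list of requesting agents; both shapes, and the 0.5 share threshold, are illustrative assumptions.

```python
from collections import Counter


def escalation_rates(actions):
    """Compute the escalation rate per agent.

    `actions` is an iterable of (agent, escalated) pairs, where
    `escalated` is True if that action went to a human.
    """
    total = Counter()
    escalated = Counter()
    for agent, was_escalated in actions:
        total[agent] += 1
        if was_escalated:
            escalated[agent] += 1
    return {agent: escalated[agent] / total[agent] for agent in total}


def dominant_agent(queue_agents, share_threshold=0.5):
    """Return the agent holding more than `share_threshold` of the queue, else None."""
    if not queue_agents:
        return None
    agent, count = Counter(queue_agents).most_common(1)[0]
    return agent if count / len(queue_agents) > share_threshold else None
```

What counts as "abnormally high" depends on the agent's role, so the rates are probably best compared against each agent's own history rather than a global constant.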

Mitigation (design-time):

  • Gradual threshold tuning
  • Training period for new agents (shadow mode)
  • Explicit “low-risk auto-approve” list in the manifest

Recovery (runtime):

  • Threshold adjustment in the agent manifest
  • Temporarily auto-approve low-risk category
  • Manifest review in the quarterly tuning cycle

Pattern 5: Trace gaps

Description: The decision path is lost between agents. When an audit is needed, the full path cannot be reconstructed. This violates Law 6 (trace).

Examples:

  • Agent emits an event without a trace_id
  • Subscriber does not propagate parent_span_id
  • External tool call without trace metadata

Detection:

  • Trace completeness metric: % of scenarios with a full trace
  • Audit queries fail to find decision rationale
  • Events with missing trace_id / span_id
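
The completeness metric can be sketched as follows. The example assumes events are plain dicts grouped by scenario id, and it defines "full trace" narrowly as "every event carries a non-empty trace_id and span_id"; a real definition may also check that parent links resolve.

```python
REQUIRED_TRACE_FIELDS = ("trace_id", "span_id")


def has_full_trace(events):
    """True if every event carries non-empty values for the required trace fields."""
    return all(
        event.get(field_name)
        for event in events
        for field_name in REQUIRED_TRACE_FIELDS
    )


def trace_completeness(scenarios):
    """Return the percentage of scenarios whose events all carry trace fields.

    `scenarios` maps a scenario id to its list of event dicts.
    An empty input is reported as 100.0 (nothing is incomplete).
    """
    if not scenarios:
        return 100.0
    complete = sum(1 for events in scenarios.values() if has_full_trace(events))
    return 100.0 * complete / len(scenarios)
```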

Mitigation (design-time):

  • Event-Bus-Pattern enforces mandatory trace_id / span_id in the envelope
  • Agent templates include trace propagation boilerplate
  • External tool wrappers inject trace metadata
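
A minimal sketch of an envelope that enforces the trace fields and propagates them to follow-up events. The `Envelope` class and its `child` helper are hypothetical; the actual envelope schema is defined by Event-Bus-Pattern and may differ. The field names trace_id, span_id, and parent_span_id come from this document.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    """Illustrative event envelope with mandatory trace fields."""

    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None

    def __post_init__(self):
        # Reject events that would create a trace gap.
        if not self.trace_id or not self.span_id:
            raise ValueError(f"event {self.name!r} is missing trace_id / span_id")

    def child(self, name: str) -> "Envelope":
        """Create a follow-up event in the same trace, linked to this span."""
        return Envelope(
            name=name,
            trace_id=self.trace_id,
            span_id=uuid.uuid4().hex,
            parent_span_id=self.span_id,
        )
```

Building follow-up events only through `child` is one way to make propagation the default path, so a subscriber has to go out of its way to drop parent_span_id.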

Recovery (runtime):

  • Partial trace accepted, gap explicitly noted
  • Escalate if a material decision lacks a proper trace
  • Scenario marked “trace incomplete” in memory

Pattern 6: Cross-agent contradiction

Description: Two or more agents give the human contradictory recommendations. The human is confused and nobody resolves the conflict.

Examples:

Detection:

  • Explicit contradictions in agent outputs
  • Agent-Judge flags conflict
  • Human feedback “agents telling me different things”

Mitigation (design-time):

  • Orchestrator conflict resolution rules
  • Chamber-for-Strategic arbitration for strategic contradictions
  • Single source of truth per decision type
  • Explicit “primary” vs “advisory” role distinction
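
One possible shape for an orchestrator conflict rule built on the primary / advisory distinction. This is a sketch under a stated assumption: the document does not say that the primary agent's recommendation prevails, so that precedence rule, along with the `Recommendation` type and `resolve` function, is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record of one agent's recommendation to the human."""

    agent: str
    role: str  # "primary" or "advisory"
    decision: str


def resolve(recommendations):
    """Return (decision, needs_arbitration).

    Assumed rule: if agents disagree, a single unambiguous primary
    decision wins; otherwise the conflict goes to arbitration.
    """
    decisions = {r.decision for r in recommendations}
    if len(decisions) <= 1:
        # No contradiction (or nothing to decide).
        return next(iter(decisions), None), False
    primary = {r.decision for r in recommendations if r.role == "primary"}
    if len(primary) == 1:
        return next(iter(primary)), False
    # Zero primaries, or primaries that disagree: hand over for review.
    return None, True
```

When `needs_arbitration` is true, the orchestrator would route the case to review rather than forwarding both recommendations to the human unresolved.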

Recovery (runtime):

  • Orchestrator triggers Chamber review
  • Both agents explain their reasoning to the human
  • Escalate to Agent-CEO for routing

Pattern 7: Policy / rules blindspots

Description: One agent's action triggers obligations for another agent (policy, security, compliance), but the second agent never finds out.

Examples:

  • Agent publishes research publicly, but no one flagged the IP review obligation
  • External communication sent, but privacy-agent was not notified for audit
  • Data access expanded, but security-agent did not audit it

Detection:

  • Retro review uncovers missed obligations
  • External feedback (“you should have done X”)
  • Audit gaps

Mitigation (design-time):

  • Policy-sensitive agents [future] subscribed to cross-cutting events
  • Policy-Layer inspects all actions, not just emitter-side
  • Event naming convention surfaces sensitive actions (e.g. *.external, *.published)
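
The naming convention makes sensitive actions routable by pattern. The sketch below uses glob matching from Python's standard `fnmatch` module; the mapping of patterns to agents is a hypothetical example, not the system's actual subscription table.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: which policy-sensitive agents should see
# which event-name patterns. The pattern shapes come from the naming
# convention above; the agent assignments are assumptions.
SENSITIVE_PATTERNS = {
    "*.external": ["privacy-agent"],
    "*.published": ["policy-agent"],
}


def policy_subscribers(event_name):
    """Return the policy-sensitive agents that should receive this event."""
    subscribers = []
    for pattern, agents in SENSITIVE_PATTERNS.items():
        if fnmatchcase(event_name, pattern):
            # Keep order stable and avoid notifying the same agent twice.
            subscribers.extend(a for a in agents if a not in subscribers)
    return subscribers
```

With this table, `policy_subscribers("research.published")` returns `["policy-agent"]`, while an ordinary internal event such as `research.completed` matches nothing.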

Recovery (runtime):

  • Post-hoc audit + corrective action
  • HITL-Gateway escalation
  • Update policy-agent subscriptions to catch similar cases in the future

Usage guidance

For scenario authors

Review this doc when creating a new scenario. For each pattern:

  1. Is it relevant to the scenario? (yes/no + reasoning)
  2. If yes, has a mitigation been applied? (link or description)
  3. Is a recovery path defined?
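
The three questions above can be captured as a small structured record per pattern, which also makes an automated completeness check possible. The sketch below is illustrative only: `PatternReview` and `review_gaps` are hypothetical names, and no such format is prescribed by this document.

```python
from dataclasses import dataclass

# Pattern names as listed in this document.
PATTERNS = (
    "Handoff gaps",
    "HITL bottlenecks",
    "Over-confident action",
    "Under-confident escalation flood",
    "Trace gaps",
    "Cross-agent contradiction",
    "Policy / rules blindspots",
)


@dataclass(frozen=True)
class PatternReview:
    pattern: str
    relevant: bool
    reasoning: str
    mitigation: str = ""  # link or description, expected when relevant
    recovery: str = ""    # recovery path, expected when relevant


def review_gaps(reviews):
    """Return human-readable gaps; an empty list means the review is complete."""
    by_pattern = {r.pattern: r for r in reviews}
    gaps = []
    for name in PATTERNS:
        review = by_pattern.get(name)
        if review is None:
            gaps.append(f"{name}: not reviewed")
            continue
        if not review.reasoning:
            gaps.append(f"{name}: relevance reasoning missing")
        if review.relevant and not review.mitigation:
            gaps.append(f"{name}: mitigation missing")
        if review.relevant and not review.recovery:
            gaps.append(f"{name}: recovery path missing")
    return gaps
```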

Scenarios that do not consider these patterns are incomplete. Agent-Judge may flag them.

For agent manifest authors

Review the pattern list when designing an agent:

  • Confidence thresholds → Pattern 3/4
  • Escalation rules → Pattern 2
  • Event subscriptions → Pattern 1/5
  • Trace propagation → Pattern 5
  • Contradiction handling → Pattern 6

For retrospectives

In the post-incident review, ask:

  • Which of the 7 patterns showed up?
  • Was detection adequate?
  • Was a mitigation in place but ineffective, or was it missing?
  • What should be added or changed to prevent recurrence?

The outcome feeds Process-OutcomeLabeling and manifest / threshold tuning.

New pattern addition

If a pattern surfaces that is not on the list:

  1. Document it in shadow notes (temporarily)
  2. After 2+ occurrences, propose a formal addition via an ADR
  3. Update this doc and the relevant mitigations in the system

The patterns are not frozen; the list will grow with experience.

Related documents