R28 Timeout Incident
Status: Open — remediation path not chosen yet Severity: Medium (M1 pipeline does complete on simpler niches per R27; this affects complex briefs) Fundraise-blocker: Yes if left unfixed — pipeline scalability is a demo risk
Summary
R28 run hit 30-minute hard timeout at 1802s. All 11 specialized agents completed successfully, but the final intel_director aggregation stage hung for 596s without completing. Pipeline exited with status completed_partial, having produced valid full.json (62KB) and report.md (31KB) but no meta.json or judgement.json.
This is the first observed timeout incident since R25 (2026-04-19). R26 and R27 completed in 1515s and 1434s respectively on the same AI-esoterics ТЗ.
Context — what R28 was supposed to validate
R28 was launched Day 3 to validate that R27’s Judge 8.0 was not niche-specific to AI-esoterics. Second niche chosen: e-commerce SMB AI-copilot for Shopify/WooCommerce merchants. ТЗ was intentionally more complex than R27 — longer word count (~800 vs ~500), more explicit deliverables (top-15 competitors vs top-10, two acquisition channels vs one, three pricing tiers), harder B2B context.
Timeline (from activity.jsonl)
| Time (s) | Event |
|---|---|
| 51 | Scout + Researcher start in parallel |
| 220 | Scout done ($0.59) |
| 466 | Researcher done ($1.08) |
| 466-663 | Phase 1 expand: FinancialModeler + RatingAgent + ProductDesigner + AudienceSegmenter complete in parallel ($0.57 combined) |
| 661 | Phase 2 expand starts: FunnelArchitect + ContentStrategist + Copywriter |
| 866 | FunnelArchitect done ($0.15) |
| 869 | ContentStrategist done ($0.16) |
| 1207 | Copywriter done ($0.34) — took 545s alone (36% of pipeline time) |
| 1208 | intel_director final aggregation starts |
| 1804 | 30-min hard timeout hits; pipeline aborts during director aggregation |
Root cause analysis
Two concurrent issues:
Issue 1 — Copywriter regression
In R27 (AI-esoterics), Copywriter completed in ~170 seconds. In R28, Copywriter took 545 seconds — 3.2x longer.
Hypothesis: Copywriter prompt size scales with prior agent outputs. R28 had larger outputs from Researcher (1.08 vs 0.37 Scout cost indicates much more source material), which inflates Copywriter’s input context, which in turn slows generation.
No investigation done yet on Copywriter prompt internals for this run.
Issue 2 — Intel Director aggregation hang
After all 11 agents complete, intel_director runs a final aggregation LLM call to synthesize executive summary. In R27 this completed in <60s. In R28 it hung for the full 596s until hard timeout cut it off.
Hypothesis: max_tokens ceiling reached, or retry loop, or rate limit backoff. Pipeline had consumed $2.90 by this point — not hitting per-run budget cap. Total tokens across agents was high.
Commit 67af174 from Day 2 raised max_tokens to 32000 to address a similar issue, but may be insufficient for R28’s larger context.
Impact
What we lost
- No
meta.jsonwith verdict + final scores - No
judgement.jsonwith Judge overall and per-agent scores - No validation of R27’s Judge 8.0 against different niche
What we kept
- Full
activity.jsonlwith complete per-stage timing and costs full.jsonwith all 11 agents’ structured outputsreport.md— pipeline wrote 31KB of narrative report before timeout- Enough data to diagnose (this report)
Financial impact
$2.90 spent on an incomplete run. Acceptable cost of investigation. Not a pattern that would bankrupt the project at current volumes.
Brand / positioning impact
Two niches attempted, one completed cleanly. Cannot yet claim “M1 produces Judge 8.0 consistently across niches.” M2 implementation gate (”≥ 3 clean runs on distinct niches”) now pushed further out.
Remediation options
Three remediation paths, non-mutually-exclusive.
Option A — Raise hard timeout
Simplest. Change 30-min limit to 45 or 60 min. Let complex pipelines complete.
Pros:
- 5-minute code change
- Immediately unblocks complex niche validation
- R28-class complexity completes
Cons:
- Doesn’t address root cause
- Cost ceiling per-run increases proportionally
- User experience degrades (45-min wait)
- Does not fix the underlying Copywriter regression or Director aggregation hang
- Next time we hit complex niche, we may still hit 60-min limit
Cost: ~15 min code change, 1 test run to validate.
Option B — Fix Director aggregation
Investigate why intel_director hung for 596s. Likely candidates:
- Retry loop on LLM rate-limit errors (cascading backoff)
- max_tokens insufficient — LLM generates partial output, triggers internal retry
- Sync API call with no outer timeout
Fix: add explicit per-call timeout on director aggregation (e.g., 120s), retry at most once, then gracefully degrade to partial synthesis.
Pros:
- Addresses actual root cause
- R27 proved the code path works when not overloaded — something specific broke in R28
- Preserves 30-min UX contract
- Deterministic fix (aggregation call either completes or doesn’t)
Cons:
- Requires investigation time (~2-4 hours Claude Code)
- May reveal Copywriter regression as upstream problem (needs further work)
Cost: ~3-4 hours Claude Code work.
Option C — Split aggregation into smaller chunks
Instead of one large synthesis call, do multiple focused passes:
- Executive summary synthesis (small scope)
- Scorecards synthesis (small scope)
- Next steps synthesis (small scope)
Each < 2000 tokens output. Then concatenate.
Pros:
- Addresses scaling issue structurally
- Each sub-call finishes faster and hits smaller max_tokens
- Parallelizable if needed
- Better error isolation
Cons:
- Larger refactor (~1 day work)
- Loses benefit of LLM’s cross-section reasoning (may reduce quality)
- Risk of regressions in R27-class runs that currently work
Cost: ~1 day Claude Code work.
Recommendation
Default path: Option B (fix Director aggregation) first, then consider Option C if B reveals deeper issues.
Reasoning:
- R27 proves the code works — something specific broke in R28
- Debugging before restructuring is cheaper
- Preserves existing quality on simple niches while fixing complex ones
- 3-4 hour investigation gives evidence whether C is needed
Option A (raise timeout) only as temporary workaround if urgent demo pending.
Next steps
- Decide remediation path (A / B / C) — Denis decision
- If Option B: create Claude Code task to investigate director aggregation
- Re-run R28 after fix to validate
- If clean, run R29 on third niche to close M1 stability gate
- Update ADR-0025-Combined-M1-M2-Primary-Product implementation gates if M1 stability needs more runs
Related
- R27-Results — R27 completed successfully, baseline for comparison
- ADR-0016-chamber-v2-vision — CriticalityPolicy defines pipeline timeout classes
- Commit
67af174— prior max_tokens fix (raised to 32000) - Commit
ce144a6— prior director aggregation change (removed LLM call from extended stage, but aggregation still uses LLM for core) - Commit
abee295— prior director aggregation change (removed dead ThreadPoolExecutor timeout)