# ADR-0018: RAG Memory Layer for M1 Pipeline
## Status
proposed
## Context
M1 pipeline agents have no memory across runs. Each run starts tabula rasa. Director doesn’t know that Scout returned 0 actors in previous BADs Russia runs. Researcher doesn’t know that TAM/SAM without methodology is a recurring P1 issue (v3 through v7). Scout doesn’t remember that ok.ru and listicle sites were penalized by Judge.
Meanwhile, reports.synth-nova.com holds a complete knowledge base: 7 versions of BADs Russia tests, judgement files with specific issues, known issues lists, gap analyses, calibration data. Agents cannot access any of it.
This causes:
- Repeated known mistakes (same P1 issues across 7 versions)
- Wasted tokens rediscovering context that already exists
- No learning loop — quality plateaus at ~6/10 despite fixes
- Higher cost per run ($1.27) than necessary
## Decision
Add a vectorized memory layer to M1 pipeline:
- Source: all .md and .json files from reports/ directory
- Processing: chunking → embeddings → vector DB (pgvector, already planned for Phase 2)
- Integration: before each agent’s main action, semantic search for relevant history (top-3 chunks, ~300-500 tokens)
- Format: inject a `## Relevant History` section before the agent prompt, containing the retrieved chunks
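The retrieval-and-injection step can be sketched in miniature. This is an illustrative, self-contained toy (an in-memory store with cosine similarity standing in for pgvector, and pre-computed vectors standing in for the embedding model); function names like `top_k` and `inject_history` are hypothetical, not part of the existing famous-media pipeline:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunks, k=3):
    """Return the k chunk texts most similar to the query.

    chunks: list of (embedding, text) pairs produced by ingesting
    the .md/.json files from reports/.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def inject_history(agent_prompt, retrieved):
    """Prepend the retrieved chunks as a '## Relevant History' section."""
    history = "\n".join(f"- {t}" for t in retrieved)
    return f"## Relevant History\n{history}\n\n{agent_prompt}"

# Toy usage: 2-d vectors in place of real embeddings
chunks = [
    ([1.0, 0.0], "Scout returned 0 actors in v6"),
    ([0.0, 1.0], "Unrelated note"),
    ([0.9, 0.1], "ok.ru penalized by Judge in v5"),
]
prompt = inject_history("Evaluate the niche.", top_k([1.0, 0.0], chunks, k=2))
```

In production the same shape holds, with `top_k` replaced by a pgvector similarity query and the vectors produced by the embedding model already used in the famous-media pipeline.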
Example flow for the current P0 bug:
- Director pre-lookup: “scout actors results BADs Russia” → gets chunk from v6: “Scout returned 0 actors, actors=’?’, Director did not escalate” → Director already knows to validate actors
- Scout pre-lookup: “source quality issues BADs Russia” → gets chunk from v5 judgement: “ok.ru, listicle sites penalized” → Scout filters bad sources
- Researcher pre-lookup: “TAM SAM SOM methodology issues” → gets chunk: “P1: TAM/SAM without methodology” → Researcher includes methodology
- Judge pre-lookup: “known scoring patterns BADs Russia” → gets score history v1-v7 → Judge calibrates against past evaluations
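The four pre-lookups above follow one pattern: each agent runs a canned memory query before its main action. A minimal sketch, assuming a `search_fn` backed by the vector store (the dictionary and function names here are hypothetical):

```python
# Canned pre-lookup queries per agent, taken from the example flow above.
PRE_LOOKUP_QUERIES = {
    "director": "scout actors results BADs Russia",
    "scout": "source quality issues BADs Russia",
    "researcher": "TAM SAM SOM methodology issues",
    "judge": "known scoring patterns BADs Russia",
}

def pre_lookup(agent, search_fn, k=3):
    """Run the agent's memory query before its main action.

    search_fn(query, k) -> list of chunk texts; in production this
    would be the semantic search over the vectorized reports/ history.
    """
    query = PRE_LOOKUP_QUERIES.get(agent)
    if query is None:
        return []  # unknown agent: proceed without injected history
    return search_fn(query, k)
```

Keeping the queries static (rather than LLM-generated) keeps the pre-lookup free of extra model calls, in line with the token-budget constraints referenced under Links.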
Infrastructure: reuse existing famous-media RAG pipeline (chunking, embeddings, semantic search). Not building from scratch.
## Expected Impact
- Cost: $0.50-0.80/run (a 30-40% reduction from the current $1.27)
- Duration: faster (fewer wasted cycles on known errors)
- Quality: learning loop; each run improves on the previous because agents remember past mistakes
## Alternatives Considered
- (a) Stuff all history into system prompt — token-expensive, hits context limits on large histories
- (b) Manual prompt updates after each test cycle — current approach, doesn’t scale, founder becomes bottleneck
- (c) RAG with semantic search ← chosen. Targeted retrieval, low token overhead, scales with history
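The token gap between (a) and (c) can be made concrete with a back-of-envelope calculation. The ~300-500 token chunk size comes from this ADR; the total history size is a hypothetical assumption for illustration only:

```python
# Assumption (hypothetical): full reports/ history, stuffed into the
# system prompt as in option (a), would cost on the order of 120k tokens.
full_history_tokens = 120_000

# Option (c): top-3 chunks at ~450 tokens each (midpoint of 300-500).
retrieved_tokens = 3 * 450  # 1350 tokens injected per pre-lookup

ratio = full_history_tokens // retrieved_tokens
print(f"retrieval injects {retrieved_tokens} tokens, "
      f"roughly {ratio}x less than prompt stuffing")
```

The exact multiplier depends on the real history size, but the key property is that retrieval cost stays flat as reports/ grows, while option (a) grows linearly until it hits the context limit.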
## Priority
After the M1 quality sprint (once prompt fixes reach 8/10). Prompt fixes come first because they are faster and higher-impact per hour of work; the RAG layer is an architectural improvement that compounds over time.
## Consequences
- Positive: learning loop, lower cost, fewer repeated mistakes, scalable to M2/M3
- Negative: new infrastructure dependency (pgvector), ingestion-pipeline maintenance, embedding costs (~$0.01/run for retrieval)
## Links
- Constitution — Law 2 (minimum cost), Law 7 (verify), Law 8 (tokens are capital)
- Niche-Evaluation-Module — M1 spec
- Sprint-Week5-7-Plan — M1 quality sprint context