ADR-0018: RAG Memory Layer for M1 Pipeline

Status

proposed

Context

M1 pipeline agents have no memory across runs. Each run starts tabula rasa. Director doesn’t know that Scout returned 0 actors in previous BADs Russia runs. Researcher doesn’t know that TAM/SAM without methodology is a recurring P1 issue (v3 through v7). Scout doesn’t remember that ok.ru and listicle sites were penalized by Judge.

Meanwhile, reports.synth-nova.com holds a complete knowledge base: 7 versions of BADs Russia tests, judgement files with specific issues, known issues lists, gap analyses, calibration data. Agents cannot access any of it.

This causes:

  • Repeated known mistakes (same P1 issues across 7 versions)
  • Wasted tokens rediscovering context that already exists
  • No learning loop — quality plateaus at ~6/10 despite fixes
  • Higher cost per run ($1.27) than necessary

Decision

Add a vectorized memory layer to the M1 pipeline:

  • Source: all .md and .json files from reports/ directory
  • Processing: chunking → embeddings → vector DB (pgvector, already planned for Phase 2)
  • Integration: before each agent’s main action, semantic search for relevant history (top-3 chunks, ~300-500 tokens)
  • Format: inject ## Relevant History section before agent prompt with retrieved chunks
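The chunk → embed → retrieve → inject loop above can be sketched end-to-end. This is a minimal illustration, not the real pipeline: the real system would use the famous-media embedding model and pgvector's distance operator, while here a toy hash-based embedding and an in-memory store stand in so the flow is self-contained. All names (`chunk_markdown`, `embed`, `top_k`, `relevant_history`, the sample `REPORTS`) are hypothetical.

```python
import hashlib
import math

# Sample stand-ins for chunks ingested from reports/ (content paraphrased from this ADR).
REPORTS = [
    "v6 judgement: Scout returned 0 actors, actors='?', Director did not escalate.",
    "v5 judgement: ok.ru and listicle sites penalized as low-quality sources.",
    "v3-v7 known issue: P1 - TAM/SAM presented without methodology.",
]

def chunk_markdown(text: str, max_chars: int = 400) -> list[str]:
    """Split a report file into roughly paragraph-sized chunks."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf = [], ""
    for p in paras:
        if buf and len(buf) + len(p) > max_chars:
            chunks.append(buf)
            buf = ""
        buf = (buf + "\n\n" + p).strip()
    if buf:
        chunks.append(buf)
    return chunks

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: normalized bag of hashed words (real pipeline: model embeddings)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query: str, store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Cosine-similarity search over the in-memory store (real pipeline: pgvector)."""
    q = embed(query)
    scored = sorted(store, key=lambda cv: -sum(a * b for a, b in zip(q, cv[1])))
    return [chunk for chunk, _ in scored[:k]]

# Ingest: every chunk goes into the store with its embedding.
store = [(c, embed(c)) for doc in REPORTS for c in chunk_markdown(doc)]

def relevant_history(query: str) -> str:
    """Build the '## Relevant History' section injected before the agent prompt."""
    lines = "\n".join(f"- {c}" for c in top_k(query, store))
    return f"## Relevant History\n{lines}"
```

In production the `top_k` call becomes a pgvector query and the store is populated by the ingestion job over `reports/`; only the `relevant_history` formatting step stays the same.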

Example flow for current P0 bug:

  • Director pre-lookup: “scout actors results BADs Russia” → gets chunk from v6: “Scout returned 0 actors, actors=’?’, Director did not escalate” → Director already knows to validate actors
  • Scout pre-lookup: “source quality issues BADs Russia” → gets chunk from v5 judgement: “ok.ru, listicle sites penalized” → Scout filters bad sources
  • Researcher pre-lookup: “TAM SAM SOM methodology issues” → gets chunk: “P1: TAM/SAM without methodology” → Researcher includes methodology
  • Judge pre-lookup: “known scoring patterns BADs Russia” → gets score history v1-v7 → calibrated by past evaluations
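The four pre-lookups above reduce to a fixed agent → query mapping that the orchestrator resolves before dispatching each prompt. A sketch, assuming a `retrieve(query)` callable that wraps the vector search; the mapping keys and `build_prompt` helper are illustrative, not the real orchestrator API:

```python
from typing import Callable

# Per-agent pre-lookup queries, taken from the example flow above.
PRE_LOOKUPS: dict[str, str] = {
    "director": "scout actors results BADs Russia",
    "scout": "source quality issues BADs Russia",
    "researcher": "TAM SAM SOM methodology issues",
    "judge": "known scoring patterns BADs Russia",
}

def build_prompt(agent: str, base_prompt: str,
                 retrieve: Callable[[str], list[str]]) -> str:
    """Prepend the retrieved history section to the agent's normal prompt."""
    chunks = retrieve(PRE_LOOKUPS[agent])
    history = "\n".join(f"- {c}" for c in chunks)
    return f"## Relevant History\n{history}\n\n{base_prompt}"
```

Keeping the queries declarative makes it cheap to tune them per test cycle without touching agent prompts.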

Infrastructure: reuse existing famous-media RAG pipeline (chunking, embeddings, semantic search). Not building from scratch.

Expected Impact

  • Cost: $0.50-0.80/run (-30-40% vs. the current $1.27)
  • Duration: faster (fewer wasted cycles on known errors)
  • Quality: learning loop — each run improves on the previous one because agents remember past mistakes

Alternatives Considered

  • (a) Stuff all history into system prompt — token-expensive, hits context limits on large histories
  • (b) Manual prompt updates after each test cycle — current approach, doesn’t scale, founder becomes bottleneck
  • (c) RAG with semantic search ← chosen. Targeted retrieval, low token overhead, scales with history

Priority

After M1 quality sprint (after achieving 8/10 on prompt fixes). Prompt fixes first because they’re faster and higher-impact per hour of work. RAG layer is architectural improvement that compounds over time.

Consequences

Positive: learning loop, lower cost, fewer repeated mistakes, scalable to M2/M3

Negative: new infrastructure dependency (pgvector), ingestion pipeline maintenance, embedding costs (~$0.01/run for retrieval)