ADR-0019: Cross-Model Validation for M1 Pipeline
Status
Proposed
Context
The entire M1 pipeline (Director, Scout, Researcher, FinancialModeler, RatingAgent, Aggregate, and Judge) runs on a single model: claude-haiku-4-5-20251001. The Judge therefore evaluates output produced by the same model family it belongs to. This is a structural weakness: single-model bias propagates through every stage, and the quality gate (the Judge) cannot catch systematic model-specific blind spots because it shares them.
This is analogous to the problem the M3 Deliberation Chamber was built to solve for strategic questions, applied here to pipeline quality validation.
Decision
Two-tier approach:
Minimum (implement first): Replace the Judge's model with GPT-4o for cross-model evaluation. Haiku produces the pipeline output; GPT-4o evaluates it. Different training data, different biases, different blind spots. Estimated additional cost: +$0.08/run.
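The minimum tier amounts to injecting the generator and the evaluator independently so they can come from different providers. A minimal sketch; `run_with_cross_model_judge`, `JudgeVerdict`, and the stub callables are illustrative names, not the pipeline's real interfaces, and the real callables would wrap the Anthropic and OpenAI SDKs:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeVerdict:
    score: float      # quality score in [0.0, 1.0]
    rationale: str

def run_with_cross_model_judge(
    task: str,
    generate: Callable[[str], str],             # producer, e.g. the Haiku pipeline
    judge: Callable[[str, str], JudgeVerdict],  # evaluator, e.g. a GPT-4o wrapper
) -> tuple[str, JudgeVerdict]:
    """Produce output with one model family, evaluate it with another."""
    output = generate(task)
    return output, judge(task, output)

# Stub usage; in production each lambda is a call to a different provider.
output, verdict = run_with_cross_model_judge(
    "rate ACME Corp",
    generate=lambda task: f"draft rating for: {task}",
    judge=lambda task, out: JudgeVerdict(0.8, "coherent, sources cited"),
)
```

Because the evaluator is a plain callable, swapping GPT-4o out (or falling back to Haiku during an outage) is a one-line change at the call site.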
Ideal (implement after the minimum proves value): Chamber-lite as a final validation step. Three models (Claude, GPT-4o, Gemini) independently score each pipeline stage, and an Arbiter synthesizes their scores. Essentially a mini-Chamber session per M1 run, focused on quality assessment. Estimated additional cost: +$0.15-0.25/run.
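A sketch of the Chamber-lite aggregation. The ADR only says the Arbiter "synthesizes"; taking the median is an assumption here, chosen because it keeps a single outlier model from dominating the verdict. All names (`chamber_lite`, `Scorer`, the stub panel) are hypothetical:

```python
from statistics import median
from typing import Callable

# (stage_name, stage_output) -> score in [0.0, 1.0]
Scorer = Callable[[str, str], float]

def chamber_lite(stage_outputs: dict[str, str], panel: list[Scorer]) -> dict[str, float]:
    """Every panel model scores every stage independently; the arbiter takes
    the median so no single model's bias dominates the verdict."""
    return {
        stage: median(score(stage, output) for score in panel)
        for stage, output in stage_outputs.items()
    }

# Stub scorers standing in for Claude, GPT-4o, and Gemini API calls.
panel = [lambda s, o: 0.9, lambda s, o: 0.7, lambda s, o: 0.8]
scores = chamber_lite({"Researcher": "...", "RatingAgent": "..."}, panel)
```

Per-stage scoring (rather than one score for the whole run) is what makes this usable for diagnosis: a low median on a single stage points directly at where the pipeline degrades.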
Expected Impact
- Structural quality improvement — Judge catches model-specific blind spots
- Better calibration — cross-model scoring reduces scoring variance
- Foundation for M4 Autonomous Dev Loop — QA agents from different providers
Alternatives Considered
- (a) Keep Haiku as Judge — cheapest, but self-evaluation bias was demonstrated in the 7 BADs Russia tests
- (b) Upgrade Judge to Sonnet — same model family; marginally better, but the same biases
- (c) GPT-4o as Judge ← minimum viable cross-model validation
- (d) Chamber-lite (3-model panel) ← ideal, builds on M3 infrastructure
Priority
After RAG layer (ADR-0018). Sequence: prompt fixes → RAG → cross-model. Each layer compounds on the previous.
Consequences
Positive: structural quality improvement, reduced single-model bias, Judge independence.
Negative: +$0.08-0.25/run cost, a new API dependency in the critical path (if GPT-4o is down, the Judge fails), potential latency increase.
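The critical-path risk suggests degrading rather than failing hard when the cross-model provider is down. One possible shape, assuming a flagged same-family fallback is acceptable (function names are illustrative):

```python
def judge_with_fallback(output, primary_judge, fallback_judge):
    """Try the cross-model Judge first; on a provider failure, degrade to the
    same-family Judge but flag the verdict so the run is not silently trusted."""
    try:
        return {"score": primary_judge(output), "cross_model": True}
    except Exception:
        # e.g. a GPT-4o outage: Haiku self-evaluation, explicitly marked as such
        return {"score": fallback_judge(output), "cross_model": False}

def _outage(output):
    raise RuntimeError("provider unavailable")

# During an outage the run still completes, but the verdict carries the flag.
degraded = judge_with_fallback("pipeline output", _outage, lambda o: 0.5)
```

Downstream consumers can then treat `cross_model: False` runs as provisional instead of blocking the pipeline entirely.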
Links
- Constitution — Law 2 (cost), Law 3 (reputation over speed), Law 7 (verify)
- Deliberation-Chamber-Module — M3 architecture reusable for Chamber-lite
- ADR-0018-m1-rag-memory-layer — prerequisite
- Sprint-Week5-7-Plan — M1 quality sprint context