ADR-0019: Cross-Model Validation for M1 Pipeline

Status

Proposed

Context

The entire M1 pipeline — Director, Scout, Researcher, FinancialModeler, RatingAgent, Aggregate, and Judge — runs on a single model: claude-haiku-4-5-20251001. The Judge therefore evaluates output produced by the same model family it belongs to. This is a structural weakness: single-model bias propagates through every stage, and the quality gate (the Judge) cannot catch systematic model-specific blind spots because it shares them.

This is analogous to the problem M3 Deliberation Chamber was built to solve for strategic questions — but applied to pipeline quality validation.

Decision

Two-tier approach:

Minimum (implement first): Replace the Judge model with GPT-4o for cross-model evaluation. Haiku produces the pipeline output; GPT-4o evaluates it. Different training data, different biases, different blind spots. Estimated additional cost: +$0.08/run.
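The minimum tier is a one-line configuration change: only the Judge stage crosses model families, every generator stage stays on Haiku. A minimal sketch, with `PIPELINE_MODELS` and `select_model` as illustrative names that are not part of the M1 codebase:

```python
# Hypothetical routing table: the quality gate is the only stage that
# crosses model families. Stage and model names mirror the ADR; the
# dict/function names are illustrative, not real M1 identifiers.
HAIKU = "claude-haiku-4-5-20251001"

PIPELINE_MODELS = {
    "director": HAIKU,
    "scout": HAIKU,
    "researcher": HAIKU,
    "financial_modeler": HAIKU,
    "rating_agent": HAIKU,
    "aggregate": HAIKU,
    "judge": "gpt-4o",  # the single cross-model swap this tier introduces
}

def select_model(stage: str) -> str:
    """Return the model id for a pipeline stage, defaulting to Haiku."""
    return PIPELINE_MODELS.get(stage, HAIKU)
```

Keeping the swap in a routing table rather than hard-coding it makes rolling back (or later widening to the Chamber-lite panel) a data change, not a code change.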

Ideal (implement after the minimum proves value): Add Chamber-lite as the final validation step. Three models (Claude + GPT-4o + Gemini) independently score each pipeline stage, and an arbiter synthesizes their verdicts. Essentially a mini-Chamber session per M1 run, focused on quality assessment. Estimated additional cost: +$0.15-0.25/run.
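One plausible shape for the Chamber-lite arbiter is a robust aggregate plus a disagreement flag: take the median of the three panel scores per stage, and flag stages where the panel spread exceeds a threshold for human review. A sketch under those assumptions (the function, field names, and 0-10 scale are illustrative, not specified by this ADR):

```python
from statistics import median

def arbitrate(stage_scores: dict, disagreement_threshold: float = 2.0) -> dict:
    """Synthesize per-stage panel scores into a verdict.

    stage_scores maps stage -> {model_name: score}. The median resists a
    single outlier model; a large max-min spread flags genuine disagreement.
    """
    verdict = {}
    for stage, scores in stage_scores.items():
        values = list(scores.values())
        verdict[stage] = {
            "score": median(values),
            "flagged": max(values) - min(values) > disagreement_threshold,
        }
    return verdict

# Illustrative panel output for two stages:
panel = {
    "researcher": {"claude": 8.0, "gpt-4o": 7.5, "gemini": 8.5},
    "financial_modeler": {"claude": 9.0, "gpt-4o": 5.0, "gemini": 8.5},
}
result = arbitrate(panel)
```

Here `financial_modeler` would be flagged (spread 4.0 > 2.0) while `researcher` passes cleanly, which is exactly the cross-model disagreement signal a single-family Judge cannot produce.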

Expected Impact

  • Structural quality improvement — Judge catches model-specific blind spots
  • Better calibration — cross-model scoring reduces scoring variance
  • Foundation for M4 Autonomous Dev Loop — QA agents from different providers

Alternatives Considered

  • (a) Keep Haiku as Judge — cheapest, but self-evaluation bias was demonstrated in the 7 BADs Russia tests
  • (b) Upgrade Judge to Sonnet — same model family; marginally better, but shares the same biases
  • (c) GPT-4o as Judge ← minimum viable cross-model validation
  • (d) Chamber-lite (3-model panel) ← ideal, builds on M3 infrastructure

Priority

After RAG layer (ADR-0018). Sequence: prompt fixes → RAG → cross-model. Each layer compounds on the previous.

Consequences

Positive: structural quality improvement, reduced single-model bias, Judge independence.

Negative: +$0.08-0.25/run cost, a new API dependency in the critical path (if GPT-4o is down, the Judge fails), potential latency increase.
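The critical-path risk can be mitigated with an explicit degradation policy: if the cross-model Judge call fails, fall back to the same-family Haiku judge and mark the verdict as degraded rather than failing the run. A hedged sketch; `judge_with_fallback` and the judge callables are illustrative names, not existing M1 functions:

```python
def judge_with_fallback(output, cross_judge, fallback_judge):
    """Try the cross-model judge first; degrade to the in-family judge on error.

    The "degraded" flag lets downstream consumers distinguish a genuine
    cross-model verdict from a same-family fallback verdict.
    """
    try:
        return {"verdict": cross_judge(output), "degraded": False}
    except Exception:
        # e.g. a GPT-4o outage: accept same-family bias for this run
        # rather than block the whole pipeline.
        return {"verdict": fallback_judge(output), "degraded": True}

# Illustrative judges: one healthy, one simulating a provider outage.
def unavailable_judge(output):
    raise RuntimeError("provider unavailable")

healthy = judge_with_fallback("report", lambda o: "pass", lambda o: "pass")
degraded = judge_with_fallback("report", unavailable_judge, lambda o: "pass")
```

Surfacing the flag (rather than silently falling back) preserves the audit trail: runs judged under degraded conditions can be re-validated once the cross-model provider recovers.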