Claude CodeExperimentsBlogPortfolioAbout me

Experiments / Bench

Back to Experiments

Multi-model Bench

Same task, multiple models. Accuracy on a golden set, p95 latency, cost per run. No marketing numbers: only what I measured myself.

Extract liability clauses from vendor contracts (n=80)

Each model receives the same 47-page contract PDF as semantic chunks and a strict JSON schema. Accuracy = (correct clauses extracted) / (clauses in golden set). Cost is per single contract run.

golden_set: n=80judge: schema-validation + exact-match on clause textdate: 2026-04
Model
Accuracy
p95 latency
Cost / run
Notes
  • Claude Opus 4.7anthropic
    94.0%
    14.8s
    $0.078
    Best on ambiguous clauses (cap references, force-majeure carveouts). Slowest.
  • Claude Sonnet 4.6anthropic
    91.0%
    8.5s
    $0.022
    Sweet spot for production. 3.5x cheaper than Opus, only 3pt accuracy drop.
  • GPT-5openai
    89.0%
    7.9s
    $0.054
    Strong baseline, but missed 2 nested liability caps that Opus caught.
  • Gemini 2.5 Progoogle
    84.0%
    6.1s
    $0.018
    Fastest and cheapest. Drop comes from over-eager extraction of non-liability clauses.
  • Claude Haiku 4.5anthropic
    78.0%
    3.7s
    $6.00m
    Reasonable for first-pass triage. Misses clauses requiring multi-paragraph reasoning.