Experiments / Bench
Back to ExperimentsMulti-model Bench
Same task, multiple models. Accuracy on a golden set, p95 latency, cost per run. No marketing numbers: only what I measured myself.
Extract liability clauses from vendor contracts (n=80)
Each model receives the same 47-page contract PDF as semantic chunks and a strict JSON schema. Accuracy = (correct clauses extracted) / (clauses in golden set). Cost is per single contract run.
golden_set: n=80judge: schema-validation + exact-match on clause textdate: 2026-04
Model
Accuracy
p95 latency
Cost / run
Notes
- Claude Opus 4.7anthropic94.0%14.8s$0.078Best on ambiguous clauses (cap references, force-majeure carveouts). Slowest.
- Claude Sonnet 4.6anthropic91.0%8.5s$0.022Sweet spot for production. 3.5x cheaper than Opus, only 3pt accuracy drop.
- GPT-5openai89.0%7.9s$0.054Strong baseline, but missed 2 nested liability caps that Opus caught.
- Gemini 2.5 Progoogle84.0%6.1s$0.018Fastest and cheapest. Drop comes from over-eager extraction of non-liability clauses.
- Claude Haiku 4.5anthropic78.0%3.7s$6.00mReasonable for first-pass triage. Misses clauses requiring multi-paragraph reasoning.