Eval Lab
Golden sets with LLM-judge scoring. Each task shows how the prompt evolved across versions and what every version got wrong on the same example.
Extract company name from email signature
60 real-world signatures, 18 of which have no clear company (expected output: null). Judge: exact match on company, with a 'reasonable normalization' rule (e.g. LLC vs L.L.C.).
schema: { company: string | null }
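A minimal sketch of how a model response could be checked against this schema before it reaches the judge. The function name `parse_company` and the choice to reject extra keys are assumptions for illustration, not something the page specifies.

```python
import json
from typing import Optional


def parse_company(raw: str) -> Optional[str]:
    """Parse a model response and return the company, or None for null.

    Raises ValueError if the response does not match { company: string | null }.
    (Hypothetical helper; strictness about extra keys is an assumption.)
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    # Require an object whose only key is "company".
    if not isinstance(data, dict) or set(data) != {"company"}:
        raise ValueError("expected an object with exactly one key: 'company'")

    company = data["company"]
    if company is not None and not isinstance(company, str):
        raise ValueError("'company' must be a string or null")
    return company
```

Rejecting malformed responses here keeps schema failures separate from wrong answers, so a parse failure is not silently scored as a hallucination.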
judge rubric: 1.0 = exact match or normalized variant. 0.5 = right entity, wrong form. 0.0 = wrong or hallucinated.
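The rubric above can be approximated in code. The sketch below is an assumption-laden stand-in, not the page's actual judge: `LEGAL_SUFFIXES`, `normalize`, and `differs_only_by_suffix` are hypothetical names, and the suffix list is invented beyond the LLC vs L.L.C. case the page mentions. Judging "right entity, wrong form" in general (for example a translated name) needs more than string rules, so the `same_entity` parameter is left as a hook where an LLM judge could be plugged in.

```python
import re
from typing import Callable, Optional

# Hypothetical suffix list; the page only names LLC vs L.L.C.
LEGAL_SUFFIXES = {"llc", "inc", "corp", "ltd", "gmbh"}


def normalize(name: str) -> str:
    """Casefold, collapse dotted initials (l.l.c. -> llc), trim stray '.' and ','."""
    tokens = []
    for token in name.casefold().split():
        token = re.sub(r"\b(\w)\.", r"\1", token)
        tokens.append(token.strip(".,"))
    return " ".join(t for t in tokens if t)


def strip_suffixes(name: str) -> str:
    """Normalize, then drop any token that is a known legal suffix."""
    return " ".join(t for t in normalize(name).split() if t not in LEGAL_SUFFIXES)


def differs_only_by_suffix(expected: str, predicted: str) -> bool:
    """Crude 'right entity, wrong form' check based on legal suffixes only."""
    core = strip_suffixes(expected)
    return bool(core) and core == strip_suffixes(predicted)


def score(
    expected: Optional[str],
    predicted: Optional[str],
    same_entity: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Return 1.0, 0.5, or 0.0 following the rubric."""
    # Null cases: only null-vs-null counts as a match.
    if expected is None or predicted is None:
        return 1.0 if expected is predicted else 0.0
    if normalize(expected) == normalize(predicted):
        return 1.0
    check = same_entity or differs_only_by_suffix
    return 0.5 if check(expected, predicted) else 0.0
```

With these rules, `score("Acme Corp", "Acme")` gives 0.5 and `score(None, "iPhone")` gives 0.0. A translated name such as "Severstal Digital" would score 0.0 under the default heuristic and only reach 0.5 if the `same_entity` hook recognizes it as the same company.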
Aggregate mean score
v1: zero-shot       62%
v2: with null rule  78%
v3: three examples  91%
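As a sketch of how such an aggregate could be computed, assuming it is the arithmetic mean of per-example rubric scores shown as a percentage: the snippet below uses only the four examples listed in the Examples section, so its results will not match the figures above, which cover all 60 signatures. The `aggregate` helper is a hypothetical name.

```python
from statistics import mean
from typing import Dict, List

# Rubric scores for the four examples listed on this page only,
# in page order (Acme Corp, iPhone, datascience.io, Северсталь Цифра).
scores: Dict[str, List[float]] = {
    "v1: zero-shot": [0.5, 0.0, 0.0, 0.5],
    "v2: with null rule": [1.0, 1.0, 1.0, 1.0],
    "v3: three examples": [1.0, 1.0, 1.0, 1.0],
}


def aggregate(per_example: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean rubric score per prompt version."""
    return {version: mean(values) for version, values in per_example.items()}


for version, value in aggregate(scores).items():
    print(f"{version}: {value:.0%}")
```

On these four examples alone, v1 averages 25% while v2 and v3 both reach 100%, which suggests the remaining gap between v2 and v3 in the full set comes from signatures not shown here.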
Examples
Input
Best, John Doe Senior PM, Acme Corp john@acme.com +1 555 0100
Expected
Acme Corp
v1: zero-shot, score 0.5
Output: Acme (dropped 'Corp' suffix)
v2: with null rule, score 1.0
Output: Acme Corp (exact match)
v3: three examples, score 1.0
Output: Acme Corp (exact match)
Input
Sent from my iPhone -- Maria
Expected
null
v1: zero-shot, score 0.0
Output: iPhone (hallucinated company from device name)
v2: with null rule, score 1.0
Output: null (correctly returns null when no company is present)
v3: three examples, score 1.0
Output: null (correct)
Input
Cheers, A. Singh Head of Data | datascience.io @asingh on x
Expected
datascience.io
v1: zero-shot, score 0.0
Output: Data (misread department as company)
v2: with null rule, score 1.0
Output: datascience.io (domain-style company)
v3: three examples, score 1.0
Output: datascience.io (correct)
Input
Привет, Иван Петров CEO @ Северсталь Цифра ivan@severstal-digital.ru
Expected
Северсталь Цифра
v1: zero-shot, score 0.5
Output: Severstal Digital (translated to English; rubric requires native form)
v2: with null rule, score 1.0
Output: Северсталь Цифра (correct native form)
v3: three examples, score 1.0
Output: Северсталь Цифра (correct)