Experiments
Small AI engineering experiments and playgrounds. Each one is self-contained, with the data, the methodology, and what failed before it worked.
01 / agent traces
Agent Trace Viewer
Step-by-step traces of real agent runs: thinking, tool calls, latency, tokens, cost.
02 / benchmark
Multi-model Bench
The same task across Opus, Sonnet, Haiku, GPT-5, and Gemini 2.5: latency, cost, and quality side by side.
03 / evals
Eval Lab
Golden sets, LLM-judge scoring, regression detection on prompt edits.
04 / writeups
Prompt Evolution
Real prompt iterations: v1 -> v4, with golden-set scores at each step and the reason for each edit.