Experiments

Small AI engineering experiments and playgrounds. Each one is self-contained, with the data, the methodology, and what failed before it worked.

01 / agent traces

Agent Trace Viewer

Step-by-step traces of real agent runs: thinking, tool calls, latency, tokens, cost.

02 / benchmark

Multi-model Bench

The same task run across Opus, Sonnet, Haiku, GPT-5, and Gemini 2.5: latency, cost, and quality side by side.

03 / evals

Eval Lab

Golden sets, LLM-judge scoring, regression detection on prompt edits.

04 / writeups

Prompt Evolution

Real prompt iterations from v1 to v4, with golden-set scores at each step and the reason for each edit.