Why this repo matters
Latest release 1d ago, 14 developer signals, 3 package/install signals

Reproducible evaluation for AI coding agents. Run multi-turn scenarios against Claude Code, Codex, Copilot, Cursor, Gemini CLI, Goose, OpenCode, or any custom agent you plug in; verify behavior with rule checks, workspace diffs, and multi-judge LLM consensus; pin down reliability with pass^k variance across trials. Supports Git worktrees and an optional Docker sandbox. (15 stars, 1 fork, Python, fresh release, 11 AI signals, 9 developer signals). Latest release: v0.1.1.
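To make the pass^k idea concrete, here is a minimal sketch. It assumes the common definition of pass^k as the probability that k independently drawn trials of the same scenario all pass, estimated from n recorded trials with c passes as C(c, k) / C(n, k). The function name `pass_hat_k` and its parameters are hypothetical and are not taken from this repo; how the project itself computes the metric is not stated in the description.

```python
from math import comb


def pass_hat_k(num_trials: int, num_passed: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials drawn without
    replacement from the recorded trials all passed.

    Hypothetical helper for illustration; not this repo's API.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_passed <= num_trials:
        raise ValueError("num_passed must satisfy 0 <= num_passed <= num_trials")
    # comb(num_passed, k) is 0 when k > num_passed, so the estimate
    # drops to 0.0 as soon as there are fewer passes than k.
    return comb(num_passed, k) / comb(num_trials, k)


if __name__ == "__main__":
    # An agent that passed 6 of 8 trials looks solid at k=1
    # but much less reliable when all k trials must pass.
    for k in (1, 2, 4):
        print(k, round(pass_hat_k(8, 6, k), 4))
```

Under this definition the score can only stay flat or fall as k grows, which is why it is a stricter reliability measure than a single-trial pass rate.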