Why this repo matters
Latest release 0d ago, 6 developer signals, 2 package/install signals

Reproducible evals for LLM reliability failures in agentic and knowledge work: an 8-mode failure taxonomy, deterministic graders, and a trajectory harness with scripted tools. The suite is orthogonal to capability and safety evals. (Python; 0 stars, 0 forks; fresh release; 5 AI signals; 3 developer signals.) Latest release: v0.2.0.
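
To make "deterministic graders" and "a trajectory harness with scripted tools" concrete, here is a minimal sketch of the general idea. It is not the repository's API: every name below (ScriptedTool, Trajectory, run_episode, grade_grounded, the two toy agents) is hypothetical, and the single "grounded answer" check is an illustrative stand-in rather than one of the project's eight modes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical illustrations, not the repository's API.
# An action is ("call", tool_name, arg) or ("answer", "", text).
Action = Tuple[str, str, str]


@dataclass
class ScriptedTool:
    """A tool that replays canned responses keyed by the exact argument."""
    name: str
    responses: Dict[str, str]

    def call(self, arg: str) -> str:
        # Unknown arguments get a fixed error string, so runs stay reproducible.
        return self.responses.get(arg, "ERROR: no scripted response")


@dataclass
class Trajectory:
    """Record of one episode: (tool, arg, result) steps plus the final answer."""
    steps: List[Tuple[str, str, str]] = field(default_factory=list)
    final_answer: str = ""


def run_episode(agent: Callable[[Trajectory], Action],
                tools: Dict[str, ScriptedTool],
                max_steps: int = 5) -> Trajectory:
    """Drive the agent against scripted tools and record every step."""
    traj = Trajectory()
    for _ in range(max_steps):
        kind, name, payload = agent(traj)
        if kind == "answer":
            traj.final_answer = payload
            break
        traj.steps.append((name, payload, tools[name].call(payload)))
    return traj


def grade_grounded(traj: Trajectory) -> bool:
    """Deterministic check: the answer must appear verbatim in a tool result."""
    return bool(traj.final_answer) and any(
        traj.final_answer in result for _, _, result in traj.steps
    )


def careful_agent(traj: Trajectory) -> Action:
    # Looks the fact up first, then answers with the tool's result.
    if not traj.steps:
        return ("call", "lookup", "capital of France")
    return ("answer", "", traj.steps[-1][2])


def guessing_agent(traj: Trajectory) -> Action:
    # Answers immediately without consulting any tool.
    return ("answer", "", "Lyon")


TOOLS = {"lookup": ScriptedTool("lookup", {"capital of France": "Paris"})}

careful_pass = grade_grounded(run_episode(careful_agent, TOOLS))
guessing_pass = grade_grounded(run_episode(guessing_agent, TOOLS))
print(careful_pass, guessing_pass)  # True False
```

Because the tools replay fixed responses and the grader is plain string logic with no model in the loop, the same agent yields the same verdict on every run, which is the property "reproducible" refers to in this sketch.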