Most agent evals measure one of three things: knowledge, task completion, or tool use. Some newer benchmarks measure social intelligence or deception in bounded settings. CrucibleBench exists to probe a different failure surface: whether an agent remains coherent, adaptive, and trustworthy when relationships persist, feedback accumulates, and its actions change the future state of the world.