Existing benchmarks for LLMs and AI agents fail to capture behavioral qualities that emerge only over sustained social interaction. We present CrucibleBench, a behavioral evaluation framework that measures how frontier language models act inside a persistent, text-based social world. A $99 experiment across thirteen models from eight providers revealed that the most expensive model performed worse than the cheapest frontier model.
Each of 650 scored runs placed a model inside a small medieval fantasy city with four non-player characters (NPCs), two hidden objectives, and a 50-turn budget. Models issued structured commands and received deterministic world feedback. We scored runs on four behavioral dimensions and report two configurations: a hard score (World Grounding + Social Adaptation, with reduced classifier dependence) and a full four-dimension score that adds two classifier-dependent dimensions (Goal Pursuit, Strategic Sophistication).
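For concreteness, here is a minimal sketch of the two aggregations, assuming equal-weight averaging over dimensions (the exact weighting is specified in the methodology document; equal weights here are an illustration, not the published rubric):

```python
# Minimal sketch of the two scoring configurations. Dimension names follow
# the text; equal-weight averaging is an assumption, not the published rubric.
HARD_DIMS = ("world_grounding", "social_adaptation")  # deterministic
FULL_DIMS = HARD_DIMS + ("goal_pursuit", "strategic_sophistication")  # adds classifier-dependent dims

def aggregate(run_scores: dict, dims: tuple) -> float:
    """Average the 1-5 Likert scores over the chosen dimensions."""
    return sum(run_scores[d] for d in dims) / len(dims)

run = {"world_grounding": 4.0, "social_adaptation": 5.0,
       "goal_pursuit": 3.0, "strategic_sophistication": 2.0}
hard_score = aggregate(run, HARD_DIMS)  # 4.5
full_score = aggregate(run, FULL_DIMS)  # 3.5
```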
The hard score produced 32 of 78 statistically distinguishable model pairs versus 23 under full scoring (scenario-cell eta-squared = 0.931 vs 0.847), demonstrating that deterministic dimensions carry stronger signal than classifier-dependent dimensions. Hierarchical clustering yielded three performance tiers under both scoring configurations, confirming that persistent social environments reveal stable behavioral stratification.
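For reference, eta-squared is the between-group share of total variance. A minimal sketch of the computation follows, with synthetic per-run scores standing in for the real data (the reported figures are computed over scenario cells, so this per-run version is a simplification):

```python
import numpy as np

def eta_squared(groups):
    """eta^2 = SS_between / SS_total for scores grouped by model."""
    scores = np.concatenate(groups)
    grand_mean = scores.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((scores - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Synthetic example: three models with clearly separated mean scores.
rng = np.random.default_rng(0)
groups = [rng.normal(mu, 0.3, size=50) for mu in (4.2, 3.4, 2.1)]
print(f"eta-squared = {eta_squared(groups):.3f}")
```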
Performance varies dramatically by task type. Mistral Large achieves 80% on identification objectives but 0% on trust-building. The cross-objective interaction is significant, and only GPT-5.4 exceeds 50% success on both objectives. Single-score model selection masks critical capability gaps.
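One conventional way to test such a model × objective interaction is a two-way ANOVA over per-run success. A minimal sketch with statsmodels (the data and column names are illustrative, and a logistic model would be a natural alternative for binary outcomes):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical per-run success records (0/1) for two models and two objectives.
df = pd.DataFrame({
    "model":     ["mistral_large"] * 10 + ["gpt_5_4"] * 10,
    "objective": (["identification"] * 5 + ["trust"] * 5) * 2,
    "success":   [1, 1, 1, 1, 0, 0, 0, 0, 0, 0,   # strong on one objective only
                  1, 1, 1, 0, 1, 1, 1, 0, 1, 1],  # balanced across both
})
fit = smf.ols("success ~ C(model) * C(objective)", data=df).fit()
print(anova_lm(fit, typ=2))  # the C(model):C(objective) row tests the interaction
```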
Dialogue looping is the dominant behavioral failure mode across all frontier models, occurring in 14-66% of runs. This pattern, where the agent repeats near-identical conversational turns, has direct implications for enterprise agent reliability in production environments.
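Readers who want to screen their own agent logs for this pattern can use a string-similarity check over a sliding window of recent turns. A minimal sketch follows; the window size and 0.9 threshold are illustrative choices, not the benchmark's classifier:

```python
from difflib import SequenceMatcher

def has_dialogue_loop(turns, window=5, threshold=0.9):
    """Flag a run whose agent repeats a near-identical turn within `window` turns."""
    for i, turn in enumerate(turns):
        for prev in turns[max(0, i - window):i]:
            if SequenceMatcher(None, prev, turn).ratio() >= threshold:
                return True
    return False

turns = ["Greetings, guard. Any news of the missing merchant?",
         "I should check the market square next.",
         "Greetings, guard. Any news of the missing merchant?"]
print(has_dialogue_loop(turns))  # True
```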
The hard-scored analysis (deterministic dimensions only) produces stronger differentiation than full scoring. Our classifier was suppressing signal rather than adding it. This is a methodological caution for any benchmark that relies on LLM judges: the measurement instrument may be the bottleneck.
Results from 650 scored runs across 13 frontier models from 8 providers. All scores on 1-5 Likert scale with 95% confidence intervals via scenario-cell bootstrap (10,000 iterations).
Hard Score: World Grounding + Social Adaptation only (deterministic, no classifier dependence). Produces stronger model differentiation (eta-squared = 0.931).
| Model | Score | 95% CI |
|---|---|---|
Note: Gemini 3.1 Pro shows the largest rank change between scoring modes (dropping six positions under hard scoring), consistent with the dialogue classifier's over-classification inflating its Goal Pursuit score.
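The confidence intervals above come from the scenario-cell bootstrap noted earlier. A minimal sketch of that procedure, with illustrative cells:

```python
import numpy as np

def scenario_cell_bootstrap(cells, iters=10_000, seed=0):
    """95% percentile CI for a model's mean score, resampling whole scenario
    cells with replacement so that within-cell correlation is preserved."""
    rng = np.random.default_rng(seed)
    means = np.empty(iters)
    for i in range(iters):
        picked = rng.integers(len(cells), size=len(cells))
        means[i] = np.concatenate([cells[j] for j in picked]).mean()
    return np.percentile(means, [2.5, 97.5])

# Illustrative cells: per-run scores grouped by scenario cell.
cells = [np.array([4.0, 4.5, 3.5]), np.array([2.5, 3.0]), np.array([4.5, 5.0, 4.0])]
print(scenario_cell_bootstrap(cells))
```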
Cost and behavioral quality are effectively decoupled. Linear regression of score on log(cost) yields p = 0.85; Spearman correlation is non-significant. The most expensive model (Grok 4 at $41.71 total) scored below median.
Figure: behavioral score versus cost per model. Triangles indicate models with reasoning token overhead; colors denote performance tiers (green = top, yellow = mid, red = bottom).
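A minimal sketch of both tests with scipy (the cost/score pairs below are illustrative; only the $41.71 Grok 4 total appears in the text):

```python
import numpy as np
from scipy.stats import linregress, spearmanr

# Illustrative per-model totals: cost in USD and mean behavioral score.
cost  = np.array([41.71, 3.20, 8.50, 1.10, 12.40, 5.75])
score = np.array([2.9,   4.1,  3.3,  3.8,  3.1,   4.0])

lin = linregress(np.log(cost), score)  # score ~ log(cost)
rho, p = spearmanr(cost, score)        # rank correlation
print(f"regression p = {lin.pvalue:.2f}, Spearman rho = {rho:.2f} (p = {p:.2f})")
```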
Example transcript: GPT-5.4 Run 2-S3-Trust (Seed 20260496). This high-scoring run demonstrates efficient navigation, context-responsive trust-building, and social diversification.
The proof-of-concept environment is deliberately compact: 12 rooms, 4 NPCs, 50 turns. While this maximizes scenario control, it may not capture behaviors that only emerge in larger, more complex persistent worlds.
With 25 runs per model per objective, pairwise effect sizes among the top 8 models are mostly below the detection threshold. Phase 2 would require approximately 175 runs per model per objective to reliably distinguish within-tier differences.
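The 175-run figure is consistent with a standard two-sample power calculation at a small effect size. A sketch of that arithmetic, assuming Cohen's d ≈ 0.3, alpha = 0.05, and 80% power (these parameter values are our assumption):

```python
from statsmodels.stats.power import TTestIndPower

# Runs per model per objective needed to detect a small within-tier
# difference (Cohen's d ~ 0.3) at alpha = 0.05 with 80% power.
n = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(round(n))  # ~175
```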
Post-hoc validation of the dialogue classifier reveals poor inter-rater reliability (kappa = 0.04 on probe detection). The classifier-dependent dimensions (Goal Pursuit, Strategic Sophistication) add noise rather than signal.
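Kappa here compares the classifier's probe labels against a human rater's, correcting raw agreement for chance. A minimal sketch with scikit-learn (labels are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative probe-detection labels: classifier vs. human annotator.
classifier = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
human      = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]
print(cohen_kappa_score(classifier, human))  # 0.0: agreement no better than chance
```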
This evaluation used NPC-only scenarios. Human-in-the-loop adversarial evaluation, which would create irreducible adversarial pressure, is planned for Phase 2 and is not included in these results.
Full methodology specification including scenario design, scoring rubrics, statistical framework, and validation studies. 39 pages.
Download PDF

Complete dataset from 650 runs: per-run scores, model outputs, scenario assignments, and billing-verified cost data.

Download ZIP

If you use CrucibleBench in your research, please cite: