Existing benchmarks for LLMs and AI agents fail to capture behavioral qualities that emerge only over sustained social interaction. We present CrucibleBench, a behavioral evaluation framework that measures how frontier language models act inside a persistent, text-based social world. A $99 experiment across thirteen models from eight providers revealed that the most expensive model performed worse than the cheapest frontier model.
Each of 650 scored runs placed a model inside a small medieval fantasy city with four non-player characters (NPCs), two hidden objectives, and a 50-turn budget. Models issued structured commands and received deterministic world feedback. We scored runs on four behavioral dimensions and report two configurations: a hard score (World Grounding + Social Adaptation, with reduced classifier dependence) and a full four-dimension score that adds two classifier-dependent dimensions (Goal Pursuit, Strategic Sophistication).
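For concreteness, here is a minimal sketch of the two aggregations, assuming equal-weight averaging over dimensions (the exact weighting is specified in the methodology document; equal weights here are an illustration, not the published rubric):

```python
# Minimal sketch of the two scoring configurations. Dimension names follow
# the text; equal-weight averaging is an assumption, not the published rubric.
HARD_DIMS = ("world_grounding", "social_adaptation")  # deterministic
FULL_DIMS = HARD_DIMS + ("goal_pursuit", "strategic_sophistication")  # adds classifier-dependent dims

def aggregate(run_scores: dict, dims: tuple) -> float:
    """Average the 1-5 Likert scores over the chosen dimensions."""
    return sum(run_scores[d] for d in dims) / len(dims)

run = {"world_grounding": 4.0, "social_adaptation": 5.0,
       "goal_pursuit": 3.0, "strategic_sophistication": 2.0}
hard_score = aggregate(run, HARD_DIMS)  # 4.5
full_score = aggregate(run, FULL_DIMS)  # 3.5
```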
The hard score produced 32 of 78 statistically distinguishable model pairs versus 23 under full scoring (scenario-cell eta-squared = 0.931 vs 0.847), demonstrating that deterministic dimensions carry stronger signal than classifier-dependent dimensions. Hierarchical clustering yielded three performance tiers under both scoring configurations, confirming that persistent social environments reveal stable behavioral stratification.
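For reference, eta-squared is the between-group share of total variance. A minimal sketch of the computation follows, with synthetic per-run scores standing in for the real data (the reported figures are computed over scenario cells, so this per-run version is a simplification):

```python
import numpy as np

def eta_squared(groups):
    """eta^2 = SS_between / SS_total for scores grouped by model."""
    scores = np.concatenate(groups)
    grand_mean = scores.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((scores - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Synthetic example: three models with clearly separated mean scores.
rng = np.random.default_rng(0)
groups = [rng.normal(mu, 0.3, size=50) for mu in (4.2, 3.4, 2.1)]
print(f"eta-squared = {eta_squared(groups):.3f}")
```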
Performance varies dramatically by task type. Mistral Large achieves 80% on identification objectives but 0% on trust-building. The cross-objective interaction is significant, and only GPT-5.4 exceeds 50% success on both objectives. Single-score model selection masks critical capability gaps.
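One conventional way to test such a model × objective interaction is a two-way ANOVA over per-run success. A minimal sketch with statsmodels (the data and column names are illustrative, and a logistic model would be a natural alternative for binary outcomes):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical per-run success records (0/1) for two models and two objectives.
df = pd.DataFrame({
    "model":     ["mistral_large"] * 10 + ["gpt_5_4"] * 10,
    "objective": (["identification"] * 5 + ["trust"] * 5) * 2,
    "success":   [1, 1, 1, 1, 0, 0, 0, 0, 0, 0,   # strong on one objective only
                  1, 1, 1, 0, 1, 1, 1, 0, 1, 1],  # balanced across both
})
fit = smf.ols("success ~ C(model) * C(objective)", data=df).fit()
print(anova_lm(fit, typ=2))  # the C(model):C(objective) row tests the interaction
```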
Dialogue looping is the dominant behavioral failure mode across all frontier models, occurring in 14-66% of runs. This pattern, where the agent repeats near-identical conversational turns, has direct implications for enterprise agent reliability in production environments.
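Readers who want to screen their own agent logs for this pattern can use a string-similarity check over a sliding window of recent turns. A minimal sketch follows; the window size and 0.9 threshold are illustrative choices, not the benchmark's classifier:

```python
from difflib import SequenceMatcher

def has_dialogue_loop(turns, window=5, threshold=0.9):
    """Flag a run whose agent repeats a near-identical turn within `window` turns."""
    for i, turn in enumerate(turns):
        for prev in turns[max(0, i - window):i]:
            if SequenceMatcher(None, prev, turn).ratio() >= threshold:
                return True
    return False

turns = ["Greetings, guard. Any news of the missing merchant?",
         "I should check the market square next.",
         "Greetings, guard. Any news of the missing merchant?"]
print(has_dialogue_loop(turns))  # True
```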
The hard-scored analysis (deterministic dimensions only) produces stronger differentiation than full scoring. Our classifier was suppressing signal rather than adding it. This is a methodological caution for any benchmark that relies on LLM judges: the measurement instrument may be the bottleneck.
Results from 650 scored runs across 13 frontier models from 8 providers. All scores on 1-5 Likert scale with 95% confidence intervals via scenario-cell bootstrap (10,000 iterations).
Hard Score: World Grounding + Social Adaptation only (deterministic, no classifier dependence). Produces stronger model differentiation (eta-squared = 0.931).
| Model | Score | 95% CI |
|---|---|---|
Note: Gemini 3.1 Pro shows the largest rank change between scoring modes (dropping six positions under hard scoring), consistent with the dialogue classifier's over-classification inflating its Goal Pursuit score.
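The confidence intervals above come from the scenario-cell bootstrap noted earlier. A minimal sketch of that procedure, with illustrative cells:

```python
import numpy as np

def scenario_cell_bootstrap(cells, iters=10_000, seed=0):
    """95% percentile CI for a model's mean score, resampling whole scenario
    cells with replacement so that within-cell correlation is preserved."""
    rng = np.random.default_rng(seed)
    means = np.empty(iters)
    for i in range(iters):
        picked = rng.integers(len(cells), size=len(cells))
        means[i] = np.concatenate([cells[j] for j in picked]).mean()
    return np.percentile(means, [2.5, 97.5])

# Illustrative cells: per-run scores grouped by scenario cell.
cells = [np.array([4.0, 4.5, 3.5]), np.array([2.5, 3.0]), np.array([4.5, 5.0, 4.0])]
print(scenario_cell_bootstrap(cells))
```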
Cost and behavioral quality are effectively decoupled. Linear regression of score on log(cost) yields p = 0.85; Spearman correlation is non-significant. The most expensive model (Grok 4 at $41.71 total) scored below median.
Figure: behavioral score versus cost per model. Triangles indicate models with reasoning token overhead; colors denote performance tiers (green = top, yellow = mid, red = bottom).
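A minimal sketch of both tests with scipy (the cost/score pairs below are illustrative; only the $41.71 Grok 4 total appears in the text):

```python
import numpy as np
from scipy.stats import linregress, spearmanr

# Illustrative per-model totals: cost in USD and mean behavioral score.
cost  = np.array([41.71, 3.20, 8.50, 1.10, 12.40, 5.75])
score = np.array([2.9,   4.1,  3.3,  3.8,  3.1,   4.0])

lin = linregress(np.log(cost), score)  # score ~ log(cost)
rho, p = spearmanr(cost, score)        # rank correlation
print(f"regression p = {lin.pvalue:.2f}, Spearman rho = {rho:.2f} (p = {p:.2f})")
```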
Example transcript: GPT-5.4 Run 2-S3-Trust (Seed 20260496). This high-scoring run demonstrates efficient navigation, context-responsive trust-building, and social diversification.
The proof-of-concept environment is deliberately compact: 12 rooms, 4 NPCs, 50 turns. While this maximizes scenario control, it may not capture behaviors that only emerge in larger, more complex persistent worlds.
With 25 runs per model per objective, pairwise effect sizes among the top 8 models are mostly below the detection threshold. Phase 2 would require approximately 175 runs per model per objective to reliably distinguish within-tier differences.
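The 175-run figure is consistent with a standard two-sample power calculation at a small effect size. A sketch of that arithmetic, assuming Cohen's d ≈ 0.3, alpha = 0.05, and 80% power (these parameter values are our assumption):

```python
from statsmodels.stats.power import TTestIndPower

# Runs per model per objective needed to detect a small within-tier
# difference (Cohen's d ~ 0.3) at alpha = 0.05 with 80% power.
n = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(round(n))  # ~175
```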
Post-hoc validation of the dialogue classifier reveals poor inter-rater reliability (kappa = 0.04 on probe detection). The classifier-dependent dimensions (Goal Pursuit, Strategic Sophistication) add noise rather than signal.
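Kappa here compares the classifier's probe labels against a human rater's, correcting raw agreement for chance. A minimal sketch with scikit-learn (labels are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative probe-detection labels: classifier vs. human annotator.
classifier = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
human      = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]
print(cohen_kappa_score(classifier, human))  # 0.0: agreement no better than chance
```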
This evaluation used NPC-only scenarios. Human-in-the-loop adversarial evaluation, which would create irreducible adversarial pressure, is planned for Phase 2 and is not included in these results.
Full methodology specification including scenario design, scoring rubrics, statistical framework, and validation studies. 39 pages.
Download PDF

Complete dataset from 650 runs: per-run scores, model outputs, scenario assignments, and billing-verified cost data.

Download ZIP

If you use CrucibleBench in your research, please cite: