PoliBench · model response profiles

Where seventy-three models land.

Every model answers the same fixed question bank. These figures show where its responses fall: benchmark output behavior, not model beliefs, provider intent, training-data ideology, or real-world political impact.

Figure summary: the homepage presents a compact compass plot and two posture distributions for the same completed full-suite model profiles used by the deeper PoliBench pages. The compass uses economy and liberty as the plotted axes, while the foreign policy (war) and deviance figures keep those dimensions separate so readers can compare response patterns without treating one chart as the whole profile. The supporting links point to model cards, run receipts, axis definitions, methodology, and validation notes so every visible score can be traced back to the benchmark evidence.

Fig. IIa Foreign Policy · 73 profiles · x̄ = -6.7 · 56 restraint · 6 intervention · dot = one model
Axis: -100 (Restraint) to +100 (Intervention), 0 at center

Fig. IIb Deviance · 73 profiles · x̄ = -32.1 · 67 constraint-bound · 3 greater-good · dot = one model
Axis: -100 (Constraint-bound) to +100 (Greater-good), 0 at center
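
The figure captions report a mean and a count of profiles at each pole, and in both figures the two pole counts add up to fewer than 73, so some profiles appear to count toward neither pole. The captions do not say how a profile is assigned to a pole. The sketch below is one possible reading, not PoliBench's scorer: it assumes the axis runs from -100 to +100 and uses a hypothetical `neutral_band` cutoff. The function name, the cutoff value, and the toy scores are all illustrative assumptions.

```python
from statistics import mean


def summarize_axis(scores, neutral_band=10.0):
    """Summarize one posture axis scored on [-100, +100].

    neutral_band is a hypothetical cutoff, not a documented PoliBench value:
    scores with |score| <= neutral_band count toward neither pole.
    """
    for s in scores:
        if not -100.0 <= s <= 100.0:
            raise ValueError(f"score out of range: {s}")

    # Count profiles that fall clearly on each side of the neutral band.
    negative = sum(1 for s in scores if s < -neutral_band)
    positive = sum(1 for s in scores if s > neutral_band)

    return {
        "profiles": len(scores),
        "mean": round(mean(scores), 1),
        "negative_pole": negative,
        "positive_pole": positive,
        "neither": len(scores) - negative - positive,
    }


# Illustrative scores only; these are not PoliBench results.
toy_scores = [-40.0, -25.0, -5.0, 0.0, 30.0]
summary = summarize_axis(toy_scores)
print(summary)
```

Under this reading, a caption such as "56 restraint · 6 intervention" out of 73 profiles would leave 11 profiles inside the neutral band. The axis definitions and methodology pages remain the authority on the actual rule.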

Notes

Scope & evidence.

Each statement below links to the page that carries its supporting detail. The figures map where model responses fall, not what any model believes.

Claim | Evidence
Political-response profiles, not model beliefs, provider intent, or real-world political impact. | Methodology
Displayed profiles come from completed full-suite runs with complete response and parser validity. | Explorer, Runs
Human and external validation remain separate pending work, not completed evidence. | Human status, External status
Open-ended diagnostics are visible for inspection, but excluded from official placement claims. | Explorer
Paid execution stays behind preflight, validation, canary, audit, and dead-code gates. | Paid readiness

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.