Live run detail

OLMo 3.1 32B Instruct

Completed live full-suite run for OLMo 3.1 32B Instruct. Raw responses are retained in the live run store and are not published as frozen artifacts.

Run jn70sha61z7rvxw9m1ebac4w298669rz

Completion	100%
Parse validity	100%
Parse-valid responses	270
p95 latency	1,965 ms
Score status	renderable

No suppression reasons

Claim Evidence

Run details link every summary claim to raw or release-level evidence. Where human or external validation is missing, the gap is stated explicitly.

Claim	Evidence
This is a completed live full-suite run used by the public compass.	Compass, Model card
Live run detail is not a frozen paper-release artifact.	Paper release

Response Style

Completion	100%
Parse validity	100%
Robustness	78.3%
Rerun stability	0%
Contradiction consistency	58.7%
Resolution rate	57%

Axis Scores

Axis	Score	Items
economy	-5	30
liberty	-15	30
war	-10	30
nation	-15	30
culture	-15	30
governance	-30	30
secularism	-30	30
technology	-11.67	30
deviance	-41.67	30
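
As a quick consistency check on the table above, the per-axis item counts can be summed and compared with the parse-valid response count reported in the run summary. The sketch below is illustrative only: the rows are copied from this page, the names `AXIS_ROWS`, `total_items`, and `rank_by_score` are hypothetical, and it assumes each item yields exactly one parse-valid response, which this page does not state.

```python
# Rows copied from the Axis Scores table: (axis, score, items).
AXIS_ROWS = [
    ("economy", -5.0, 30),
    ("liberty", -15.0, 30),
    ("war", -10.0, 30),
    ("nation", -15.0, 30),
    ("culture", -15.0, 30),
    ("governance", -30.0, 30),
    ("secularism", -30.0, 30),
    ("technology", -11.67, 30),
    ("deviance", -41.67, 30),
]

# Value reported in the run summary on this page.
PARSE_VALID_RESPONSES = 270


def total_items(rows):
    """Sum the item counts across all axes."""
    return sum(items for _axis, _score, items in rows)


def rank_by_score(rows):
    """Return (axis, score) pairs ordered from lowest to highest score."""
    return sorted(((axis, score) for axis, score, _items in rows),
                  key=lambda pair: pair[1])


# Assumption: one item corresponds to one parse-valid response.
items_match = total_items(AXIS_ROWS) == PARSE_VALID_RESPONSES
ranked = rank_by_score(AXIS_ROWS)

print("items:", total_items(AXIS_ROWS), "match:", items_match)
print("lowest:", ranked[0], "highest:", ranked[-1])
```

Under that assumption the nine axes of 30 items each account for all 270 parse-valid responses. The ranking only orders the reported numbers; it says nothing about how the scorer derives them.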

Traceability

Started	May 6, 2026, 01:04 UTC
Finished	May 6, 2026, 01:13 UTC
Question signature	Not frozen (live run)
Model roster signature	Not frozen (live run)
Prompt template	pt.v2.0.0
Parser version	parser.v1.0.1
Scorer version	sc.v1.2.0
Model card	/models/allenai/olmo-3.1-32b-instruct/
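
The start and finish timestamps above give the approximate wall-clock duration of the run. The following is a minimal sketch using only the Python standard library; the format string and the helper name `parse_utc` are assumptions made for this example, and because the page reports times to the minute, the result is only accurate to about a minute either way.

```python
from datetime import datetime, timezone

# Assumed format for the timestamps as displayed on this page.
# Note: %b matches English month names only under an English/C locale.
STAMP_FORMAT = "%b %d, %Y, %H:%M UTC"


def parse_utc(stamp):
    """Parse a page timestamp such as 'May 6, 2026, 01:04 UTC' as aware UTC."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


started = parse_utc("May 6, 2026, 01:04 UTC")
finished = parse_utc("May 6, 2026, 01:13 UTC")

# Minute-resolution inputs, so treat this as an approximation.
elapsed_minutes = (finished - started).total_seconds() / 60

print(f"elapsed: about {elapsed_minutes:.0f} minutes")
```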

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.