Items
Every question is inspectable.
Item diagnostics expose coverage, parse failures, neutral rates, refusal rates, and items that still lack independent review. The table below is frozen paper-release documentation, kept as a historical record; live pages carry current benchmark data.
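As a rough illustration of how per-item rates of this kind could be derived, the sketch below aggregates response records into neutral, refusal, and parse-failure rates. It is not the PoliBench pipeline: the record layout (`item_id`, `label`), the label values, and the function name `item_diagnostics` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical record layout (an assumption, not the release schema):
#   {"item_id": str, "label": str | None}
# where label is "neutral", "refusal", any other parsed answer,
# or None when the parser could not extract an answer.


def item_diagnostics(records):
    """Aggregate per-item response counts and rates from parsed records."""
    counts = defaultdict(
        lambda: {"responses": 0, "neutral": 0, "refusal": 0, "parse_failure": 0}
    )
    for rec in records:
        c = counts[rec["item_id"]]
        c["responses"] += 1
        label = rec["label"]
        if label is None:
            c["parse_failure"] += 1
        elif label == "neutral":
            c["neutral"] += 1
        elif label == "refusal":
            c["refusal"] += 1

    diagnostics = {}
    for item_id, c in counts.items():
        n = c["responses"]  # always >= 1 for any item that appears
        diagnostics[item_id] = {
            "responses": n,
            "neutral_rate": c["neutral"] / n,
            "refusal_rate": c["refusal"] / n,
            "parse_failure_rate": c["parse_failure"] / n,
        }
    return diagnostics


# Made-up sample data, for illustration only.
sample = [
    {"item_id": "q1", "label": "agree"},
    {"item_id": "q1", "label": "neutral"},
    {"item_id": "q1", "label": None},
    {"item_id": "q1", "label": "refusal"},
    {"item_id": "q2", "label": "disagree"},
]
result = item_diagnostics(sample)
```

Rates here use all responses for an item as the denominator, including parse failures. A real pipeline might instead exclude unparsed responses from the neutral and refusal denominators, so check the release files before comparing numbers.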
Claim Evidence
The item index links instrument and validation claims to release files before listing per-item diagnostics.
| Claim | Evidence |
|---|---|
| The item table is a full-suite diagnostic surface, not a completed human-validation result. | Question bank, Human status |
| Neutral, refusal, parse-failure, and item-total metrics come from frozen model-output artifacts. | Item diagnostics, Axis diagnostics |
| External anchors remain a collection surface until mappings and agreement rows exist. | External status, External anchor protocol |
| Question | Axis | Topic | Responses | Neutral | Parse failure | Human coding | External anchor |
|---|---|---|---|---|---|---|---|
Evidence note
PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.
The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.
This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.