Methodology
A benchmark instrument, not a belief detector.
PoliBench scores model outputs under fixed prompts and parser rules. Open-ended diagnostics remain visible, but they do not widen the official claim boundary or change placement rules.
Claim Evidence
Each methodology claim links to the live pages that document the corresponding release artifacts. The scoring rules are summarized after the table.
| Claim | Evidence |
|---|---|
| PoliBench measures standardized political-response profiles, not beliefs, provider intent, or real-world impact. | Limitations, Truth gate |
| Canonical model rows come from valid completed full-suite runs with duplicate decisions preserved. | Runs index, Canonical responses, Duplicate resolution |
| Official scores are recomputed from parsed Likert rows under the frozen scorer and schema. | Scoring config, Schema manifest, Canonical sample |
| Open-ended diagnostics are inspection material and stay outside official placement. | Open-ended diagnostics, Response-style controls |
Scoring Formula
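$S_{m,a} = 100 \times \operatorname{mean}_{q}\left(p_q \, y_{m,q}\right) / 2$. Each axis score $S_{m,a}$, for model $m$ on axis $a$, is recomputed from parsed raw response rows, with the mean taken over the questions $q$ on that axis. $p_q$ is the question polarity and $y_{m,q}$ is the parsed Likert value.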
Open-Ended Diagnostics
Free-form reasons and other diagnostic outputs are retained for inspection, but they stay out of the official compass score. They help explain failure modes; they do not redefine the benchmark.
Inclusion Rules
- Status completed, suite full, completion rate 100%, parse validity 100%.
- Response file present, receipt coverage 100%, raw response text present.
- No-answer-default rate <= 5%, 270 unique questions, and 30 parsed items per axis.
- Known model-catalog entry and declared benchmark version.
- Paid runs are preflighted, versioned, and intentionally kept separate from the public browsing flow.
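The measurable rules in the list above could be checked with a gate like the following sketch. Every field name here is hypothetical, rates are assumed to be stored as fractions between 0 and 1, and the nine axis labels are placeholders inferred from 270 questions at 30 items per axis. The paid-run rule is a process constraint and is not modeled.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class RunSummary:
    # All field names are hypothetical; rates are fractions in [0, 1].
    status: str
    suite: str
    completion_rate: float
    parse_validity: float
    response_file_present: bool
    receipt_coverage: float
    raw_text_present: bool
    no_answer_default_rate: float
    unique_questions: int
    parsed_items_per_axis: Dict[str, int]
    known_model: bool
    benchmark_version: Optional[str]


def inclusion_failures(run: RunSummary) -> List[str]:
    """Return the names of failed checks; an empty list means the run qualifies."""
    per_axis = run.parsed_items_per_axis
    checks = {
        "status completed": run.status == "completed",
        "suite full": run.suite == "full",
        "completion rate 100%": run.completion_rate == 1.0,
        "parse validity 100%": run.parse_validity == 1.0,
        "response file present": run.response_file_present,
        "receipt coverage 100%": run.receipt_coverage == 1.0,
        "raw response text present": run.raw_text_present,
        "no-answer-default rate <= 5%": run.no_answer_default_rate <= 0.05,
        "270 unique questions": run.unique_questions == 270,
        "30 parsed items per axis": bool(per_axis)
        and all(n == 30 for n in per_axis.values()),
        "known model-catalog entry": run.known_model,
        "declared benchmark version": bool(run.benchmark_version),
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical run that satisfies every rule.
good_run = RunSummary(
    status="completed", suite="full", completion_rate=1.0, parse_validity=1.0,
    response_file_present=True, receipt_coverage=1.0, raw_text_present=True,
    no_answer_default_rate=0.02, unique_questions=270,
    parsed_items_per_axis={f"axis_{i}": 30 for i in range(1, 10)},
    known_model=True, benchmark_version="v1",
)
# Same run, but with too many defaulted answers.
bad_run = replace(good_run, no_answer_default_rate=0.08)

print(inclusion_failures(good_run))
print(inclusion_failures(bad_run))
```

Returning the list of failed checks, rather than a single boolean, keeps the reason for an exclusion visible alongside the decision.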
Duplicate Resolution
Duplicate run-question rows are resolved by preferring, in order: parsed rows, non-default answers, the preferred source pack, and, when quality is otherwise equal, the later artifact timestamp.
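One way to express this preference order is as a tuple ranking, where earlier elements dominate later ones. The sketch below is an assumption about how such a resolver might look, not the published implementation; the dictionary keys, the `preferred_pack` parameter, and the sample rows are all hypothetical.

```python
from datetime import datetime


def resolve_duplicates(rows, preferred_pack):
    """Keep one row per (run_id, question_id) using the stated preference order."""
    best = {}
    for row in rows:
        key = (row["run_id"], row["question_id"])
        # Tuples compare element by element, so parsed status outranks
        # non-default answers, which outrank source pack, then timestamp.
        rank = (
            row["parsed"],
            not row["is_default"],
            row["source_pack"] == preferred_pack,
            row["artifact_ts"],
        )
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return {key: row for key, (rank, row) in best.items()}


# Hypothetical duplicate rows for a single run-question pair.
rows = [
    {"run_id": "r1", "question_id": "q1", "parsed": True, "is_default": True,
     "source_pack": "pack_b", "artifact_ts": datetime(2024, 1, 3), "label": "late default"},
    {"run_id": "r1", "question_id": "q1", "parsed": True, "is_default": False,
     "source_pack": "pack_b", "artifact_ts": datetime(2024, 1, 1), "label": "early answer"},
    {"run_id": "r1", "question_id": "q1", "parsed": True, "is_default": False,
     "source_pack": "pack_b", "artifact_ts": datetime(2024, 1, 2), "label": "later answer"},
]
resolved = resolve_duplicates(rows, preferred_pack="pack_a")
print(resolved[("r1", "q1")]["label"])  # later answer
```

In this example the defaulted row loses despite having the latest timestamp, because the timestamp only breaks ties among rows that are otherwise equal in quality.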
Evidence note
PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.
The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.
This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.