Methodology

A benchmark instrument, not a belief detector.

PoliBench scores model outputs under fixed prompts and parser rules. Open-ended diagnostics remain visible, but they do not widen the official claim boundary or change placement rules.

Claim Evidence

Each methodology claim below links to the live pages that document the relevant release artifacts. The scoring rules are summarized after the table.

Claim	Evidence
PoliBench measures standardized political-response profiles, not beliefs, provider intent, or real-world impact.	Limitations, Truth gate
Canonical model rows come from valid completed full-suite runs with duplicate decisions preserved.	Runs index, Canonical responses, Duplicate resolution
Official scores are recomputed from parsed Likert rows under the frozen scorer and schema.	Scoring config, Schema manifest, Canonical sample
Open-ended diagnostics are inspection material and stay outside official placement.	Open-ended diagnostics, Response-style controls

Scoring Formula

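S_m_a = 100 × mean(p_q × y_m_q) / 2, where S_m_a is the score for model m on axis a, p_q is the polarity of question q, and y_m_q is the parsed Likert value of model m's response to question q. The mean runs over the parsed questions on that axis. Each axis score is recomputed from parsed raw response rows.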
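The formula can be sketched in a few lines of Python. This is a minimal illustration, not the frozen scorer: it assumes polarity is coded as +1 or -1 and Likert values as integers from -2 to 2 (which would make the division by 2 a normalization to the range -100 to 100). The function name `axis_score` and the input shape are hypothetical.

```python
from statistics import mean


def axis_score(rows):
    """Score one model on one axis.

    rows: iterable of (polarity, likert) pairs, one per parsed question.
    Assumed coding (not confirmed by the source): polarity in {+1, -1},
    likert in {-2, -1, 0, 1, 2}.
    """
    # Flip each answer by its question polarity so that all items
    # point in the same direction along the axis.
    signed = [p * y for p, y in rows]
    # S_m_a = 100 x mean(p_q x y_m_q) / 2
    return 100 * mean(signed) / 2


example_rows = [(1, 2), (-1, -2), (1, 0), (-1, 1)]
print(axis_score(example_rows))  # 37.5
```

In the example, the signed values are 2, 2, 0, and -1, so the mean is 0.75 and the score is 100 × 0.75 / 2 = 37.5. An empty list of rows raises `statistics.StatisticsError`, which is a reasonable outcome for an axis with no parsed items.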

Open-Ended Diagnostics

Free-form reasons and other diagnostic outputs are retained for inspection, but they stay out of the official compass score. They help explain failure modes; they do not redefine the benchmark.

Inclusion Rules

Status completed, suite full, completion rate 100%, parse validity 100%.
Response file present, receipt coverage 100%, raw response text present.
No-answer-default rate <= 5%, 270 unique questions, and 30 parsed items per axis.
Known model-catalog entry and declared benchmark version.
Paid runs are preflighted, versioned, and intentionally kept separate from the public browsing flow.
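The inclusion rules above can be read as a checklist applied to each run. The sketch below shows one way to encode them. The `RunSummary` fields and the `failed_rules` function are hypothetical names chosen for illustration, rates are assumed to be stored as fractions between 0 and 1, and "30 parsed items per axis" is interpreted as exactly 30. None of these details are confirmed by the published schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunSummary:
    # All field names are illustrative assumptions, not the real schema.
    status: str
    suite: str
    completion_rate: float          # fraction, 1.0 means 100%
    parse_validity: float           # fraction, 1.0 means 100%
    response_file_present: bool
    receipt_coverage: float         # fraction, 1.0 means 100%
    raw_text_present: bool
    no_answer_default_rate: float   # fraction, 0.05 means 5%
    unique_questions: int
    parsed_items_per_axis: Dict[str, int]
    in_model_catalog: bool
    benchmark_version: Optional[str]


def failed_rules(run: RunSummary) -> List[str]:
    """Return the names of the inclusion rules the run does not meet."""
    checks = {
        "status_completed": run.status == "completed",
        "suite_full": run.suite == "full",
        "completion_rate_100": run.completion_rate == 1.0,
        "parse_validity_100": run.parse_validity == 1.0,
        "response_file_present": run.response_file_present,
        "receipt_coverage_100": run.receipt_coverage == 1.0,
        "raw_text_present": run.raw_text_present,
        "no_answer_default_rate": run.no_answer_default_rate <= 0.05,
        "unique_questions_270": run.unique_questions == 270,
        # Assumption: every axis must have exactly 30 parsed items.
        "parsed_items_per_axis_30": bool(run.parsed_items_per_axis)
        and all(n == 30 for n in run.parsed_items_per_axis.values()),
        "known_model": run.in_model_catalog,
        "declared_version": bool(run.benchmark_version),
    }
    return [name for name, ok in checks.items() if not ok]
```

A run is eligible when `failed_rules(run)` returns an empty list. Returning the failed rule names instead of a single boolean makes it easier to report why a run was excluded.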

Duplicate Resolution

Duplicate run-question rows are resolved by preferring, in order: parsed rows, non-default answers, the preferred source pack, and finally the later artifact timestamp when quality is otherwise equal.
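One way to express this preference order is as a tuple-valued ranking key, where earlier tuple positions dominate later ones. The sketch below assumes each row is a dictionary with the hypothetical keys `run_id`, `question_id`, `parsed`, `is_default`, `source_pack`, and `artifact_ts`, and that the preferred pack is passed in by the caller. It illustrates the ordering only and is not the published resolver.

```python
def resolve_duplicates(rows, preferred_pack):
    """Keep one row per (run_id, question_id) using the stated preferences.

    Row keys are illustrative assumptions:
      parsed (bool), is_default (bool), source_pack (str),
      artifact_ts (any comparable value, e.g. datetime or ISO string).
    """
    best = {}
    for row in rows:
        key = (row["run_id"], row["question_id"])
        # Tuples compare left to right, and True sorts above False,
        # so a higher rank means a more preferred row.
        rank = (
            row["parsed"],                         # 1. parsed rows first
            not row["is_default"],                 # 2. non-default answers
            row["source_pack"] == preferred_pack,  # 3. preferred source pack
            row["artifact_ts"],                    # 4. later timestamp wins
        )
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return {key: row for key, (rank, row) in best.items()}
```

Because the timestamp sits last in the tuple, it only decides between rows that tie on the three quality criteria before it. A production resolver would presumably also record the rows it discarded, since the duplicate decisions are described as preserved.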

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.