Validity

Validity evidence is published openly, not buried in caveat text.

This release is exploratory until expert coding, human baselines, and external anchors exist.

Validity Types

Content validity

Question-axis fit requires human expert coding before final paper claims.

Response-process validity

Structured answers are validated now, while open-ended diagnostics remain inspection-only evidence.

Internal-structure validity

Axis behavior is shown with reliability metrics, but factor evidence is not yet final.

External validity

External anchors have not been collected, so results are not externally validated.

Consequential validity

The site blocks claims about model belief, provider intent, or real-world political impact.

Claim Evidence

Validity claims link to the pages documenting the validation packet and release audit artifacts. Pending evidence remains pending.

Claim | Evidence
Human content-validity evidence is pending, not collected. | Human status, Coder protocol
External anchors are collection-ready but not completed validation evidence. | External status, External anchor protocol
Current validity support is model-output traceability and reliability diagnostics. | Truth gate, Reliability metrics
Human-subjects determination is unresolved in this release. | IRB status, Collection readiness
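
For readers who want to reuse the claim-to-evidence rows above programmatically, the following is a minimal sketch of one way they could be encoded. It is not the site's implementation: the names `Status`, `ClaimEvidence`, `CLAIMS`, `supported_claims`, and `open_claims` are hypothetical, and the status labels are an assumed reading of the wording in each row.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    """Hypothetical status labels, inferred from the wording of each row."""
    PENDING = "pending"        # evidence not yet collected or not yet completed
    UNRESOLVED = "unresolved"  # a determination is still open
    CURRENT = "current"        # evidence available in this release


@dataclass(frozen=True)
class ClaimEvidence:
    claim: str
    status: Status
    evidence_pages: Tuple[str, ...]  # page titles as listed in the table


# Rows transcribed from the table; the status values are an assumption.
CLAIMS: List[ClaimEvidence] = [
    ClaimEvidence("Human content-validity evidence", Status.PENDING,
                  ("Human status", "Coder protocol")),
    ClaimEvidence("External anchors", Status.PENDING,
                  ("External status", "External anchor protocol")),
    ClaimEvidence("Model-output traceability and reliability diagnostics",
                  Status.CURRENT, ("Truth gate", "Reliability metrics")),
    ClaimEvidence("Human-subjects determination", Status.UNRESOLVED,
                  ("IRB status", "Collection readiness")),
]


def supported_claims(claims: List[ClaimEvidence]) -> List[ClaimEvidence]:
    """Return only the claims backed by evidence in the current release."""
    return [c for c in claims if c.status is Status.CURRENT]


def open_claims(claims: List[ClaimEvidence]) -> List[ClaimEvidence]:
    """Return claims whose evidence is pending or unresolved."""
    return [c for c in claims if c.status is not Status.CURRENT]


if __name__ == "__main__":
    for c in open_claims(CLAIMS):
        print(f"{c.status.value}: {c.claim} (see {', '.join(c.evidence_pages)})")
```

Keeping the status as an explicit field, rather than implying it through prose, means a pending row cannot be silently treated as support when the table is filtered or summarized.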

Evidence Status

The current release is public about what it can support and explicit about what it cannot. The validation status is visible, external evidence is still pending, and that gap stays in the open instead of being hidden behind polished copy.

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.