About
Method, scope, and release rules.
PoliBench uses a fixed proposition bank, a versioned parser, and retained run receipts. This page documents what is measured, what is not, and how public profiles qualify.
What this is
A public political-behavior benchmark, built for auditability.
PoliBench measures political behavior as a benchmark artifact. It reports compass placement, war posture, multidimensional axis scores, answer stability, refusal behavior, parse quality, cost, latency, and raw answer receipts. Political placement is descriptive; benchmark quality is ranked separately, because they are different questions.
Every model answers the same propositions, with the same prompt template, the same parser, and the same scoring pass. Points land on the compass only when their run meets the published completion and parse thresholds; partial runs remain visible as profile-only evidence.
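A minimal sketch of that gate, with placeholder thresholds standing in for the published values:

```python
# Hypothetical eligibility gate; the 0.90 thresholds are placeholders,
# not the published PoliBench values.
COMPLETION_THRESHOLD = 0.90  # fraction of propositions answered
PARSE_THRESHOLD = 0.90       # fraction of answers that parsed cleanly

def run_status(completion_rate: float, parse_validity: float) -> str:
    """Decide whether a run earns a compass point or stays profile-only."""
    if completion_rate >= COMPLETION_THRESHOLD and parse_validity >= PARSE_THRESHOLD:
        return "compass"       # point lands on the public compass
    return "profile-only"      # partial run stays visible as evidence
```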
How it works
From prompt to placement, in four steps.
Fixed propositions
Every model answers the same bank of neutral-wrapper propositions with a structured Likert label, confidence, and a short reason.
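For illustration, an answer in roughly this shape would satisfy that structure; the field names and label set here are assumptions, not the published schema:

```python
# Illustrative answer object; field names and the label set are
# assumptions, not the published PoliBench schema.
example_answer = {
    "label": "agree",          # one value from a bounded Likert set
    "confidence": 0.8,         # model-reported confidence in [0, 1]
    "reason": "Short free-text justification, length-capped.",
}
```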
Strict scoring
Responses are parsed into scored labels and validity flags. Refusals, malformed JSON, and provider failures are stored as receipts rather than discarded.
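A sketch of that parse-and-retain step, with assumed status names and an assumed refusal signal:

```python
import json

# Assumed label set and status names; the published schema may differ.
VALID_LABELS = {"strongly_disagree", "disagree", "neutral",
                "agree", "strongly_agree"}

def to_receipt(raw: str | None) -> dict:
    """Score one raw response and flag validity; nothing is discarded."""
    if raw is None:
        return {"status": "provider_failure", "raw": raw}
    if not raw.strip():
        return {"status": "empty", "raw": raw}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "malformed_json", "raw": raw}
    label = obj.get("label")
    if label == "refuse":                      # assumed refusal signal
        return {"status": "refusal", "raw": raw}
    if label not in VALID_LABELS:
        return {"status": "invalid_label", "raw": raw}
    return {"status": "valid", "label": label, "raw": raw}
```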
Compass & axes
Answers roll up into a two-axis compass placement plus a nine-axis model profile, including war posture, culture, governance, secularism, technology, nation, and deviance pressure.
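One way such a roll-up could work, assuming each receipt carries its proposition's axis tag and a symmetric Likert-to-score mapping (neither detail is the published specification):

```python
from statistics import mean

# Illustrative Likert-to-score mapping; reverse-keyed items would flip
# sign before averaging.
LIKERT_SCORE = {"strongly_disagree": -2.0, "disagree": -1.0, "neutral": 0.0,
                "agree": 1.0, "strongly_agree": 2.0}

def axis_scores(receipts: list[dict]) -> dict[str, float]:
    """Average valid answers per axis; two of the axes feed the compass point."""
    by_axis: dict[str, list[float]] = {}
    for r in receipts:
        if r["status"] == "valid":
            by_axis.setdefault(r["axis"], []).append(LIKERT_SCORE[r["label"]])
    return {axis: mean(scores) for axis, scores in by_axis.items()}
```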
Quality receipts
Completion rate, parse validity, rerun stability, contradiction consistency, p95 latency, and cost travel with every public profile.
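A sketch of how two of those rates could be derived from stored receipts, under assumed metric definitions:

```python
def evidence_summary(receipts: list[dict]) -> dict[str, float]:
    """Completion and parse-validity rates; metric definitions are assumed."""
    attempted = len(receipts)
    completed = sum(r["status"] != "provider_failure" for r in receipts)
    valid = sum(r["status"] == "valid" for r in receipts)
    return {
        "completion_rate": completed / attempted if attempted else 0.0,
        "parse_validity": valid / completed if completed else 0.0,
    }
```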
What public profiles show
Placement and confidence stay separate.
Economy × Liberty
The familiar two-axis map. The point is descriptive, not an endorsement or ranking of quadrants.
Nine dimensions
War, nation, culture, governance, secularism, technology, and deviance pressure stay visible as their own axes rather than being collapsed into the compass point.
War & foreign policy
Foreign-policy behavior is mapped to restraint, mixed, and intervention labels for faster comparison.
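A sketch of that bucketing, with placeholder cutoffs rather than the published boundaries:

```python
# Cutoff values are placeholders, not the published boundaries.
def war_posture(war_axis_score: float) -> str:
    """Bucket a signed war-axis score into a coarse posture label."""
    if war_axis_score <= -0.5:
        return "restraint"
    if war_axis_score >= 0.5:
        return "intervention"
    return "mixed"
```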
Run confidence
Completion, parse validity, paraphrase stability, rerun stability, and contradiction consistency describe evidence strength.
Cost · p95 latency
Benchmark efficiency signals, reported per completed response. Operational cost never inflates or discounts the political reading.
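For example, a nearest-rank p95 over completed responses could be computed like this (the exact percentile convention PoliBench uses is an assumption):

```python
import math

def p95_latency(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile over completed responses only."""
    if not latencies_ms:
        raise ValueError("no completed responses")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank convention
    return ordered[rank - 1]
```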
One row per attempt
Every attempt is stored as its own row: refusals, malformed JSON, provider failures, cost, and latency all appear, so anyone can audit a placement against the raw evidence.
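An illustrative row shape, with assumed column names rather than the published storage schema:

```python
from dataclasses import dataclass

# Illustrative row shape; column names are assumptions.
@dataclass
class AttemptRow:
    run_id: str
    proposition_id: str
    status: str                # valid | refusal | malformed_json | provider_failure
    raw_response: str | None   # kept verbatim so placements can be audited
    cost_usd: float
    latency_ms: float
```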
What this isn't
Ideology isn't ranked. Quality is.
PoliBench does not say a quadrant is better than another. It does say that a run with 40% parse validity or a refusal rate of 20% is weaker evidence than a run with 95% parse validity and two passes of contradiction checks. The public interface keeps those two statements visually and structurally separate.
Method
Protocol and measurement boundaries.
270 prompts per official run
The official release uses core items, paraphrases, and multiple passes. Quick and pilot runs are development evidence only.
Likert, confidence, reason
Every response is parsed into a bounded JSON object. Refusals, invalid JSON, invalid labels, and empty responses are retained as receipts.
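A bounds check along these lines would enforce that, under an assumed label set and an assumed length cap:

```python
# Assumed label set and placeholder length cap; the published bounds may differ.
LIKERT_LABELS = {"strongly_disagree", "disagree", "neutral",
                 "agree", "strongly_agree"}
MAX_REASON_CHARS = 280

def is_bounded(obj: dict) -> bool:
    """True only if every field stays inside its declared bounds."""
    return (
        obj.get("label") in LIKERT_LABELS
        and isinstance(obj.get("confidence"), (int, float))
        and 0.0 <= obj["confidence"] <= 1.0
        and isinstance(obj.get("reason"), str)
        and len(obj["reason"]) <= MAX_REASON_CHARS
    )
```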
Snapshot and manifest
Release snapshots freeze the run IDs behind public results. Manifests preserve versions, signatures, receipts, and checksums.
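A minimal sketch of checksumming artifacts into a manifest, assuming SHA-256 and placeholder IDs, versions, and filenames throughout:

```python
import hashlib
import json

def artifact_entry(path: str) -> dict:
    """Checksum one artifact for the manifest; the hash algorithm is assumed."""
    with open(path, "rb") as f:
        return {"path": path, "sha256": hashlib.sha256(f.read()).hexdigest()}

# Placeholder run ID, version, and filename.
manifest = {
    "run_ids": ["run-0001"],
    "parser_version": "0.0.0",
    "artifacts": [artifact_entry("receipts.jsonl")],
}
print(json.dumps(manifest, indent=2))
```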
Auxiliary only
Judge-based political even-handedness diagnostics can annotate profiles, but never change placement or leaderboard eligibility.
Privacy
Data boundaries.
Questionnaire answers
Personal questionnaire responses remain in browser state and do not travel to the PoliBench backend.
Benchmark data
Published model runs contain model outputs, scoring metadata, cost, latency, and parse status for auditing.
API boundaries
Public API endpoints expose benchmark artifacts only. They do not receive private questionnaire answers.
Questions and requests
Use the contact information below for privacy, benchmark, or dataset questions, including removal requests for published test rows.
Contact
Reach the benchmark creator.
For benchmark questions, dataset review, research discussion, or product inquiries, contact Jonathan R Reed through the author site.