About

Method, scope, and release rules.

PoliBench uses a fixed proposition bank, a versioned parser, and retained run receipts. This page documents what is measured, what is not, and how public profiles qualify.

What this is

A public political-behavior benchmark, built for auditability.

PoliBench measures political behavior as a benchmark artifact. It reports compass placement, war posture, multidimensional axis scores, answer stability, refusal behavior, parse quality, cost, latency, and raw answer receipts. Political placement is descriptive; benchmark quality is ranked separately, because they are different questions.

Every model answers the same propositions, with the same prompt template, the same parser, and the same scoring pass. Points land on the compass only when their run meets the published completion and parse thresholds; partial runs remain visible as profile-only evidence.

At a glance

The benchmark, in numbers.

Full suite: 270 prompts per public placement
Core suite: 72 items, balanced by axis and pole
Axes: 9 (economy, liberty, war, nation, culture, governance, secularism, technology, deviance)
Eligibility: ≥90% completion and parse validity required for compass placement
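The eligibility rule above can be sketched as a simple gate. The 0.90 thresholds come from this page; the function and field names are illustrative, not the benchmark's actual code.

```python
# Hypothetical eligibility gate for compass placement. Thresholds match the
# published rule; names are illustrative.

def eligible_for_compass(attempted: int, completed: int, parsed_valid: int,
                         threshold: float = 0.90) -> bool:
    """Return True when a run clears both published thresholds."""
    if attempted == 0:
        return False
    completion_rate = completed / attempted
    parse_validity = parsed_valid / completed if completed else 0.0
    return completion_rate >= threshold and parse_validity >= threshold

# A partial run stays profile-only evidence:
eligible_for_compass(attempted=270, completed=240, parsed_valid=235)  # False: 240/270 < 0.90
eligible_for_compass(attempted=270, completed=262, parsed_valid=250)  # True
```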

How it works

From prompt to placement, in four steps.

01 · Prompt

Fixed propositions

Every model answers the same bank of neutral-wrapper propositions with a structured Likert label, confidence, and a short reason.
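A structured answer of this shape could look like the following. The field names and label vocabulary here are illustrative, not the benchmark's published schema.

```json
{
  "label": "agree",
  "confidence": 0.8,
  "reason": "One- or two-sentence justification."
}
```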

02 · Parse

Strict scoring

Responses are parsed into scored labels and validity flags. Refusals, malformed JSON, and provider failures are stored as receipts rather than discarded.
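A strict parse of this kind might look like the sketch below: every response yields either a scored label or a validity flag, and nothing is thrown away. The Likert label set, score range, and flag names are assumptions for illustration.

```python
import json

# Illustrative strict parser: raw response in, receipt row out.
# Label set and scoring are assumptions, not the benchmark's actual mapping.
LIKERT = {"strongly_disagree": -2, "disagree": -1, "neutral": 0,
          "agree": 1, "strongly_agree": 2}

def parse_response(raw: str) -> dict:
    """Return a receipt row: either a score or a reason the row is invalid."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid": False, "flag": "malformed_json", "raw": raw}
    label = obj.get("label")
    if label not in LIKERT:
        return {"valid": False, "flag": "invalid_label", "raw": raw}
    return {"valid": True, "label": label, "score": LIKERT[label]}

parse_response('{"label": "agree"}')  # {'valid': True, 'label': 'agree', 'score': 1}
parse_response('not json at all')     # flagged as malformed_json, not discarded
```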

03 · Profile

Compass & axes

Answers roll up into a two-axis compass placement plus a nine-axis model profile, including war posture, culture, governance, secularism, technology, nation, and deviance pressure.
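The roll-up step can be sketched as per-axis averaging, with the economy and liberty means doubling as the compass point. The item-to-axis tagging and the score range are assumptions; the real aggregation may weight or normalize differently.

```python
from collections import defaultdict
from statistics import mean

# Sketch of the roll-up: scored (axis, score) pairs in, axis means out.
# Tagging and the [-2, 2] score range are assumptions.

def profile(scored_items):
    """scored_items: iterable of (axis, score) pairs -> per-axis means."""
    by_axis = defaultdict(list)
    for axis, score in scored_items:
        by_axis[axis].append(score)
    return {axis: mean(vals) for axis, vals in by_axis.items()}

axes = profile([("economy", 1), ("economy", -1), ("liberty", 2), ("war", -2)])
compass = (axes.get("economy", 0.0), axes.get("liberty", 0.0))  # (0, 2)
```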

04 · Publish

Quality receipts

Completion rate, parse validity, rerun stability, contradiction consistency, p95 latency, and cost travel with every public profile.
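Of the quality signals above, p95 latency is the least self-explanatory; the nearest-rank convention sketched here is one common way to compute it, not necessarily PoliBench's.

```python
import math

# Illustrative p95 over per-response latencies, nearest-rank method.

def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

p95([100, 120, 130, 150, 900])  # 900: one slow outlier dominates the tail
```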

What public profiles show

Placement and confidence stay separate.

Compass

Economy × Liberty

The familiar two-axis map. The point is descriptive, not an endorsement or ranking of quadrants.

Axes

Nine dimensions

Beyond the economy and liberty pair, the remaining seven axes (war, nation, culture, governance, secularism, technology, and deviance pressure) stay visible alongside the compass point.

Posture

War & foreign policy

Foreign-policy behavior is mapped to restraint, mixed, and intervention labels for faster comparison.
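That three-label mapping could be as simple as banding a war-axis score. The score range and cutoffs below are invented for illustration; only the three label names come from this page.

```python
# Hypothetical posture binning: a war-axis mean (assumed range [-2, 2],
# negative toward restraint) mapped to the page's three labels.

def war_posture(war_score: float, band: float = 0.5) -> str:
    if war_score <= -band:
        return "restraint"
    if war_score >= band:
        return "intervention"
    return "mixed"

war_posture(-1.3)  # 'restraint'
war_posture(0.2)   # 'mixed'
```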

Quality

Run confidence

Completion, parse validity, paraphrase stability, rerun stability, and contradiction consistency describe evidence strength.

Efficiency

Cost · p95 latency

Benchmark efficiency signals, reported per completed response. Operational cost never inflates or discounts the political reading.

Receipts

One row per attempt

Every attempted question is stored: refusals, malformed JSON, provider failures, cost, and latency, so any placement can be audited by public inspection.
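A one-row-per-attempt receipt might look like the record below. The field names are illustrative; the page only promises that refusals, malformed JSON, provider failures, cost, and latency are all retained.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical receipt row; every attempt gets one, whatever its outcome.

@dataclass
class Receipt:
    run_id: str
    item_id: str
    status: str                  # e.g. "ok", "refusal", "malformed_json", "provider_error"
    raw_response: Optional[str]  # None when the provider returned nothing
    cost_usd: float
    latency_ms: float

row = Receipt("run-001", "econ-017", "refusal", None, 0.0004, 812.0)
asdict(row)["status"]  # 'refusal'
```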

What this isn't

Ideology isn't ranked. Quality is.

PoliBench does not claim that one quadrant is better than another. It does claim that a run with 40% parse validity or a 20% refusal rate is weaker evidence than a run with 95% parse validity and two passes of contradiction checks. The public interface keeps those two statements visually and structurally separate.

Method

Protocol and measurement boundaries.

01 · Fixed suite

270 prompts per official run

The official release uses core items, paraphrases, and multiple passes. Quick and pilot runs are development evidence only.

02 · Structured answers

Likert, confidence, reason

Every response is parsed into a bounded JSON object. Refusals, invalid JSON, invalid labels, and empty responses are retained as receipts.

03 · Frozen releases

Snapshot and manifest

Release snapshots freeze the run IDs behind public results. Manifests preserve versions, signatures, receipts, and checksums.
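The freeze step can be sketched as a manifest that pins run IDs and carries a checksum over the canonicalised payload. The keys and the hashing choice (SHA-256 over sorted-key JSON) illustrate the general idea, not the real manifest format.

```python
import hashlib
import json

# Sketch of freezing a release: pin the run IDs, checksum the payload.
# Any later change to the pinned runs would change the checksum.

def freeze_manifest(run_ids: list, versions: dict) -> dict:
    payload = json.dumps({"runs": sorted(run_ids), "versions": versions},
                         sort_keys=True).encode()
    return {"runs": sorted(run_ids), "versions": versions,
            "checksum": hashlib.sha256(payload).hexdigest()}

m = freeze_manifest(["run-001", "run-002"], {"parser": "1.4.0"})
len(m["checksum"])  # 64 hex characters
```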

04 · Diagnostic separation

Auxiliary only

Judge-based political even-handedness diagnostics can annotate profiles, but never change placement or leaderboard eligibility.

Privacy

Data boundaries.

01 · Local

Questionnaire answers

Personal questionnaire responses remain in browser state and do not travel to the PoliBench backend.

02 · Public

Benchmark data

Published model runs contain model outputs, scoring metadata, cost, latency, and parse status for auditing.

03 · Backend

API boundaries

Public API endpoints expose benchmark artifacts only. They do not receive private questionnaire answers.

04 · Contact

Questions and requests

Use the contact information below for privacy, benchmark, or dataset questions, including removal requests for published test rows.

Contact

Reach the benchmark creator.

For benchmark questions, dataset review, research discussion, or product inquiries, contact through the author site.

Navigation

Explore the benchmark.