Live run detail
Qwen3.6 Max Preview
Live completed full-suite run for Qwen3.6 Max Preview. Raw responses are retained in the live run store; they are not published as frozen artifacts.
Run jn7b14sehdj2rhgte9k6pw824d864pmp
No suppression reasons
Claim Evidence
Run details link every summary claim to raw or release-level evidence. Where human or external validation is missing, the gap is stated explicitly.
| Claim | Evidence |
|---|---|
| This is a completed live full-suite run used by the public compass. | Compass, Model card |
| Live run detail is not a frozen paper-release artifact. | Paper release |
Response Style
| Metric | Value |
|---|---|
| Completion | 100% |
| Parse validity | 100% |
| Robustness | 82.3% |
| Rerun stability | 0% |
| Contradiction consistency | 72% |
| Resolution rate | 76.7% |
Axis Scores
| Axis | Score | Items |
|---|---|---|
| economy | -6.67 | 30 |
| liberty | -41.67 | 30 |
| war | -5 | 30 |
| nation | -26.67 | 30 |
| culture | -20 | 30 |
| governance | -51.67 | 30 |
| secularism | -50 | 30 |
| technology | -15 | 30 |
| deviance | -60 | 30 |
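The axis table above can be read programmatically without attaching any interpretation to the sign of a score. The following is a minimal sketch, assuming the values are copied by hand from the table; the names `AXIS_SCORES` and `rank_by_magnitude` are illustrative and not part of any PoliBench tooling. It only orders the axes by absolute score, which is a descriptive operation and says nothing about what a negative value means on a given axis.

```python
# Axis scores copied by hand from the table above (hypothetical variable name).
AXIS_SCORES = {
    "economy": -6.67,
    "liberty": -41.67,
    "war": -5.0,
    "nation": -26.67,
    "culture": -20.0,
    "governance": -51.67,
    "secularism": -50.0,
    "technology": -15.0,
    "deviance": -60.0,
}


def rank_by_magnitude(scores):
    """Return (axis, score) pairs ordered from largest to smallest absolute score."""
    return sorted(scores.items(), key=lambda pair: abs(pair[1]), reverse=True)


ranked = rank_by_magnitude(AXIS_SCORES)
for axis, score in ranked:
    # Left-align the axis name and right-align the score for readability.
    print(f"{axis:<12} {score:>7.2f}")
```

Sorting by magnitude rather than by signed value avoids implying a direction; the scorer documentation would be needed to interpret the sign.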
Traceability
| Field | Value |
|---|---|
| Started | |
| Finished | |
| Question signature | Not frozen (live run) |
| Model roster signature | Not frozen (live run) |
| Prompt template | pt.v2.0.0 |
| Parser version | parser.v1.0.1 |
| Scorer version | sc.v1.2.0 |
| Model card | /models/qwen/qwen3.6-max-preview/ |
Evidence note
PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.
The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.
This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.
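One way to keep a reused number tied to its narrow reading is to carry the run identifier and pipeline versions alongside it. The sketch below is an assumption about how a downstream user might do that, not a PoliBench API: the class `ScoreWithProvenance` and its `citation` method are hypothetical, and the field values are copied from the run header, axis table, and traceability table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreWithProvenance:
    """A single axis score bundled with the pipeline details it depends on (hypothetical type)."""

    axis: str
    score: float
    items: int
    run_id: str
    prompt_template: str
    parser_version: str
    scorer_version: str
    frozen: bool  # False for a live run that is not a frozen release artifact

    def citation(self) -> str:
        """Render the score together with its provenance as one line of text."""
        status = "frozen release" if self.frozen else "live run, not frozen"
        return (
            f"{self.axis}={self.score} over {self.items} items "
            f"(run {self.run_id}; {self.prompt_template}, "
            f"{self.parser_version}, {self.scorer_version}; {status})"
        )


# Values copied from this page; the record itself is illustrative.
record = ScoreWithProvenance(
    axis="deviance",
    score=-60.0,
    items=30,
    run_id="jn7b14sehdj2rhgte9k6pw824d864pmp",
    prompt_template="pt.v2.0.0",
    parser_version="parser.v1.0.1",
    scorer_version="sc.v1.2.0",
    frozen=False,
)
print(record.citation())
```

Making the record immutable and rendering the provenance in the same string as the score means the number cannot easily be quoted without its prompt template, parser, scorer, and live-run status.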