Models
Every model card carries its evidence limits.
A model's version is treated as unknown unless the source artifacts document it independently.
Claim Evidence
The model index links evidence-level claims to the pages documenting release artifacts before showing model rows.
| Claim | Evidence |
|---|---|
| Model cards are sorted alphabetically and carry evidence levels, not leaderboard ranks. | Model catalog, Truth gate |
| Model version uncertainty is a visible limitation unless independently documented. | Limitations, Model roster preflight |
| Evidence levels are model-output evidence levels, not human or external validation. | Human status, External status |

| Model | Provider | Run | Completion | Parse | Evidence | Caveat |
|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | anthropic | jn70fpqyr7an1bca1cn7fq93ys864cx0 | 100% | 100% | Level 2 | current Anthropic low-latency paid route |
| Claude Opus 4.5 | anthropic | jn7839n5vcfsf5zsyqg0098rwd864xxv | 100% | 100% | Level 2 | legacy Anthropic Opus comparison route |
| Claude Opus 4.7 | anthropic | jn7bedafgemk6hecfqtpd6e309864xhq | 100% | 100% | Level 2 | latest Anthropic Opus paid route |
| Claude Sonnet 4.6 | anthropic | jn74qyaygktq550zw4metb3xt5864hfv | 100% | 100% | Level 2 | current Anthropic Sonnet paid route |
| DeepSeek V3.2 | deepseek | jn76nae0e9j4pqakz7zwtj1yn186abyv | 100% | 100% | Level 2 | recent DeepSeek reasoning and agentic paid route |
| DeepSeek V4 Flash | deepseek | jn77m1pvwyaed4n2v6nb6btn1s864kct | 100% | 100% | Level 2 | latest available DeepSeek V4 route with healthy provider capacity |
| DeepSeek V4 Pro | deepseek | jn7aybgpr67x8zswpmegfqytyx869wkp | 100% | 100% | Level 2 | latest DeepSeek V4 Pro paid route |
| Devstral 2512 | mistralai | jn793dym6gfm0tssrp6mgh8es986b8nq | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Gemini 2.0 Flash | | jn77f88qts7had7ywj89ncd1yd86715s | 100% | 100% | Level 2 | older Google Flash route for generational comparison |
| Gemini 2.0 Flash Lite | | jn71419q7s8pmwrg8y9095xx9n867qp2 | 100% | 100% | Level 2 | older ultra-cheap Google Flash Lite route |
| Gemini 2.5 Flash | | jn7367rpjc0ar1m1mcpkwq1ahs867sk5 | 100% | 100% | Level 2 | cheap Google Flash route for comparison against Gemini 3 Flash |
| Gemini 2.5 Flash Lite | | jn7a7eaaja7pmzfc76pq7syqc18679yc | 100% | 100% | Level 2 | cheap Google baseline route even though newer Gemini 3 routes are already covered |
| Gemini 3 Flash Preview | | jn72d06xfqwj8pds5qgdq6t2gs8623kp | 100% | 100% | Level 2 | current Google Flash paid route |
| Gemini 3.1 Flash Lite Preview | | jn7ez731knc67nfs7gfshenwhd86777p | 100% | 100% | Level 2 | current Google efficient preview paid route |
| Gemini 3.1 Pro Preview | | jn74r11x1a5denvej7jbyc4p8h8633y8 | 100% | 100% | Level 2 | current Google Pro preview paid route |
| Gemma 3 12B | | jn794e1cgp5ecmz53cx08v2vhx866t4w | 100% | 100% | Level 2 | Gemma 3 mid-size open-model comparison route |
| Gemma 3 27B | | jn79cccx49g16xgxtftyamstsx866ay9 | 100% | 100% | Level 2 | Gemma 3 large open-model comparison route |
| Gemma 3 4B | | jn74gs2nnjg5q8n7ysvsfmdhzh8662ad | 100% | 100% | Level 2 | Gemma 3 small open-model comparison route |
| Gemma 4 26B A4B | | jn770jvh5bx4s74v63yhxmnxnh865vgq | 100% | 100% | Level 2 | newer compact Google Gemma 4 route with public interest |
| Gemma 4 31B | | jn7fdzd95ngzbwn6j42yfs5kzx864qrk | 100% | 100% | Level 2 | recent Google open model route with strong public interest |
| GLM 4.7 | z-ai | jn7a14xnb15xckyfvyy41q6k4s86bhv1 | 100% | 100% | Level 2 | larger GLM 4.7 route to compare against GLM 4.7 Flash and GLM 5 |
| GLM 5 | z-ai | jn77hqg7j6vmamae2r3hwnv1t1869rww | 100% | 100% | Level 2 | current Z.ai GLM route with strong open-model benchmark interest |
| GLM 5.1 | z-ai | jn77es7pyamhprdbm0bb3dntz1869ydp | 100% | 100% | Level 2 | latest Z.ai flagship paid route |
| GPT OSS 120B | openai | jn74fwhmj5ehh9xmb5jy2rrxqx868gse | 100% | 100% | Level 2 | OpenAI open-weight route people will expect to see benchmarked |
| GPT OSS 20B | openai | jn7f62k61w8er0kyjr36fpph0n862d85 | 100% | 100% | Level 2 | small OpenAI open-weight route for efficient comparison coverage |
| GPT-4.1 Mini | openai | jn7ae2rzspfcdav901hm50bf71868yjf | 100% | 100% | Level 2 | cheap OpenAI workhorse |
| GPT-4.1 Nano | openai | jn70ff7ys17t6z347339a788kh868hka | 100% | 100% | Level 2 | ultra-cheap OpenAI baseline |
| GPT-5.1 | openai | jn7fbr2nfw808e81z8aszvp391864cr9 | 100% | 100% | Level 2 | legacy OpenAI flagship-generation comparison route |
| GPT-5.4 | openai | jn78dxdtvkys549w6ad5sfh6vh863tbm | 100% | 100% | Level 2 | latest OpenAI flagship paid route |
| GPT-5.4 Mini | openai | jn74tnhtxnqj7q7b7pqmg5a7nx863xhv | 100% | 100% | Level 2 | current OpenAI efficient paid route |
| GPT-5.5 | openai | jn7frn897xwxymwwpnbck45ejn8626rf | 100% | 100% | Level 2 | latest OpenAI flagship paid route |
| Granite 4.1 8b | ibm-granite | jn73khjf12dxsp17a9t1eg68ks863rwm | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Grok 3 Mini | x-ai | jn7ftf922rmzy7k0ad1m8e18h5866wed | 100% | 100% | Level 2 | cheap xAI baseline for compact-model compass comparison |
| Grok 4 Fast | x-ai | jn7dsp0pxw3zk1yhe7swg846kh8624vq | 100% | 100% | Level 2 | popular xAI low-cost flagship-family route |
| Grok 4.1 Fast | x-ai | jn7322gpqzvnrj0p808perdjrn867ssk | 100% | 100% | Level 2 | popular current xAI fast paid route |
| Grok 4.20 | x-ai | jn7bpp64vsa9s9n3fj3g0mkdb18677j3 | 100% | 100% | Level 2 | latest xAI paid route |
| Grok 4.3 | x-ai | jn7cfwkqn38wj9715mw02tdxxh8630yc | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Grok Code Fast 1 | x-ai | jn75acm3ttqh3n44gzgafkfqm58660ee | 100% | 100% | Level 2 | cheap xAI specialist route, useful as a weird compass comparison |
| Kimi K2.5 | moonshotai | jn7fgjneas7stn448cbt35fbcs8690g9 | 100% | 100% | Level 2 | recent Moonshot Kimi comparison route |
| Kimi K2.6 | moonshotai | jn7d4w89z1jma790phnzz9d8qh869t8a | 100% | 100% | Level 2 | latest Moonshot Kimi paid route |
| LFM2 24B A2B | liquid | jn790737dwtwgx14e1s31j6ycx867k67 | 100% | 100% | Level 2 | small efficient LiquidAI open-model comparison route |
| Ling 2.6 Flash | inclusionai | jn7ed5ge4j9xakj11zcvnsx2jd865y2k | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Llama 3.3 70b Instruct | meta-llama | jn7bh1gqd6p23gdq346rc3jd1n869bqs | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Llama 4 Maverick | meta-llama | jn7a64svgza9ah7n809x2cbqrx862g9r | 100% | 100% | Level 2 | current Meta Llama paid route |
| Llama 4 Scout | meta-llama | jn7deergpw9v49fk6rj2s0xwb1868tky | 100% | 100% | Level 2 | popular Meta Llama 4 comparison route |
| Mercury 2 | inception | jn7ejrx66rgxbgpy086rg1m665869aam | 100% | 100% | Level 2 | recent Inception comparison route |
| MiniMax M2 | minimax | jn705tsdrjcd0np9gz91ct01ks868wpr | 100% | 100% | Level 2 | cheap MiniMax route for historical small-model comparison coverage |
| MiniMax M2.1 | minimax | jn7f2kxcg3mg932p5txnza06tx8697w7 | 100% | 100% | Level 2 | cheap MiniMax route for small-model comparison coverage |
| MiniMax M2.5 | minimax | jn77fygwkbh1tcwk4sz25kmey58675ct | 100% | 100% | Level 2 | current MiniMax paid route with mandatory reasoning |
| MiniMax M2.7 | minimax | jn797b19f23bqm4ey1n6tr2z0h86980h | 100% | 100% | Level 2 | newer MiniMax route, cheap enough for broad comparison coverage |
| Ministral 3 14B 2512 | mistralai | jn78b7c7pyv9bwxfz63p58xrjs867dg5 | 100% | 100% | Level 2 | cheap Mistral small-model route with full-suite comparison value |
| Ministral 3 3B 2512 | mistralai | jn7cneszh8h4h169wp9m6ftj818669wz | 100% | 100% | Level 2 | tiny Mistral route for low-cost scale comparison |
| Ministral 3 8B 2512 | mistralai | jn72eek26kmhcfna69zsg8m0qs8667a2 | 100% | 100% | Level 2 | very cheap Mistral small route for scale and ideology stability checks |
| Mistral Large 3 2512 | mistralai | jn77znvpt5wtay1jkv1jp7y3an867fk7 | 100% | 100% | Level 2 | current Mistral large paid route |
| Mistral Medium 3.1 | mistralai | jn7egfgd1waqa15wwzyatnjk698673e7 | 100% | 100% | Level 2 | mid-size Mistral route for comparison against Ministral and Saba |
| Mistral Medium 3.5 | mistralai | jn72zhsrwq9m571zcf6mesd5rs865qss | 100% | 100% | Level 2 | current Mistral medium paid route |
| Mistral Saba | mistralai | jn7dn0ckwrdgp6nksteazb2rps866c2a | 100% | 100% | Level 2 | Mistral regional route for Middle East and South Asia comparison |
| Mistral Small 4 | mistralai | jn78dd95913j8fhzm2wpf1wxa1866pjq | 100% | 100% | Level 2 | current Mistral efficient paid route |
| Nemotron 3 Nano 30B A3B | nvidia | jn73p8tfdvn74nyaytf37zjve9867g4t | 100% | 100% | Level 2 | cheap NVIDIA Nemotron 3 route with open-model comparison value |
| Nemotron 3 Super | nvidia | jn79vwtvy4ew6phzgp37bkxncx866ngb | 100% | 100% | Level 2 | current NVIDIA reasoning-capable paid route |
| Nemotron Nano 9B V2 | nvidia | jn715rwtcrnae9trwpm6kwq74d867kba | 100% | 100% | Level 2 | very cheap NVIDIA route with small-model comparison value |
| OLMo 3.1 32B Instruct | allenai | jn70sha61z7rvxw9m1ebac4w298669rz | 100% | 100% | Level 2 | fully open Ai2 American instruct route |
| Phi 4 | microsoft | jn72y2njy87gzbvvymbanmm0b18686rt | 100% | 100% | Level 2 | popular small Microsoft model comparison route |
| Qwen3.5 397B A17B | qwen | jn745fmm7et4q7nq87x5r7yh1586ajta | 100% | 100% | Level 2 | large Qwen open-weight comparison route |
| Qwen3.5 Plus 20260420 | qwen | jn74kq8ve2jy1ms5czap7sqd4d86apss | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Qwen3.6 35B A3B | qwen | jn72r49hq5g308wv4xv0y6rf9s864bjp | 100% | 100% | Level 2 | open-weight mid-size Qwen route for size-class coverage |
| Qwen3.6 Flash | qwen | jn7ccgp27qt3ecc85dq0vaabhh8659q1 | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Qwen3.6 Max Preview | qwen | jn7b14sehdj2rhgte9k6pw824d864pmp | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Reka Edge | rekaai | jn7ca900bsv6ychdfzf2r879js864j6a | 100% | 100% | Level 2 | new low-cost Reka edge-model comparison route |
| Solar Pro 3 | upstage | jn75x7fnknr6d2xwgq8px34pg5869a6m | 100% | 100% | Level 2 | Upstage Korean model route with regional comparison value |
| Trinity Large Preview | arcee-ai | jn76fvf4gfy08hdmsnfxmxdrbx869qc3 | 100% | 100% | Level 2 | high-usage US open-weight Arcee preview route |
| Trinity Large Thinking | arcee-ai | jn7bkw0f7z6fa6z1anr0t4xf2x869xp5 | 100% | 100% | Level 2 | US open-weight Arcee reasoning route |
| Trinity Mini | arcee-ai | jn7ayysp9g6ett652tdhefmpj586550k | 100% | 100% | Level 2 | small US open-weight Arcee MoE route |
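The claim that model cards are sorted alphabetically can be checked against the table rows directly. The following is a minimal sketch, not part of PoliBench: the helper names `parse_rows` and `first_out_of_order` are hypothetical, and the case-insensitive comparison is an assumption inferred from the row order above (for example, Gemma 4 31B precedes GLM 4.7). The sample rows are copied from the table.

```python
from typing import Optional

# Excerpt copied from the model table; the full table uses the same columns.
SAMPLE = """
| Model | Provider | Run | Completion | Parse | Evidence | Caveat |
|---|---|---|---|---|---|---|
| GLM 5.1 | z-ai | jn77es7pyamhprdbm0bb3dntz1869ydp | 100% | 100% | Level 2 | latest Z.ai flagship paid route |
| GPT OSS 120B | openai | jn74fwhmj5ehh9xmb5jy2rrxqx868gse | 100% | 100% | Level 2 | OpenAI open-weight route people will expect to see benchmarked |
| GPT-4.1 Mini | openai | jn7ae2rzspfcdav901hm50bf71868yjf | 100% | 100% | Level 2 | cheap OpenAI workhorse |
"""


def parse_rows(table: str) -> list[dict[str, str]]:
    """Hypothetical helper: turn a pipe-delimited table into one dict per data row."""
    lines = [ln.strip() for ln in table.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def first_out_of_order(names: list[str]) -> Optional[tuple[str, str]]:
    """Return the first adjacent pair that breaks case-insensitive order, else None."""
    for prev, cur in zip(names, names[1:]):
        if prev.casefold() > cur.casefold():  # assumption: ordering ignores case
            return (prev, cur)
    return None


rows = parse_rows(SAMPLE)
names = [row["Model"] for row in rows]
print(first_out_of_order(names))  # None means the sample rows are in order
```

A check like this only confirms row order. It says nothing about what the evidence levels mean, which is covered by the linked pages.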
Evidence note
PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.
The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.
This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.