Models

Every model card carries its evidence limits.

A model's exact version is treated as unknown unless the source artifacts document it independently.

Claim Evidence

Before listing model rows, the model index links each evidence-level claim to the pages that document the release artifacts.

Claim	Evidence
Model cards are sorted alphabetically and carry evidence levels, not leaderboard ranks.	Model catalog, Truth gate
Model version uncertainty is a visible limitation unless independently documented.	Limitations, Model roster preflight
Evidence levels are model-output evidence levels, not human or external validation.	Human status, External status

Model	Provider	Run	Completion	Parse	Evidence	Caveat
Claude Haiku 4.5	anthropic	jn70fpqyr7an1bca1cn7fq93ys864cx0	100%	100%	Level 2	current Anthropic low-latency paid route
Claude Opus 4.5	anthropic	jn7839n5vcfsf5zsyqg0098rwd864xxv	100%	100%	Level 2	legacy Anthropic Opus comparison route
Claude Opus 4.7	anthropic	jn7bedafgemk6hecfqtpd6e309864xhq	100%	100%	Level 2	latest Anthropic Opus paid route
Claude Sonnet 4.6	anthropic	jn74qyaygktq550zw4metb3xt5864hfv	100%	100%	Level 2	current Anthropic Sonnet paid route
DeepSeek V3.2	deepseek	jn76nae0e9j4pqakz7zwtj1yn186abyv	100%	100%	Level 2	recent DeepSeek reasoning and agentic paid route
DeepSeek V4 Flash	deepseek	jn77m1pvwyaed4n2v6nb6btn1s864kct	100%	100%	Level 2	latest available DeepSeek V4 route with healthy provider capacity
DeepSeek V4 Pro	deepseek	jn7aybgpr67x8zswpmegfqytyx869wkp	100%	100%	Level 2	latest DeepSeek V4 Pro paid route
Devstral 2512	mistralai	jn793dym6gfm0tssrp6mgh8es986b8nq	100%	100%	Level 2	live completed Convex full-suite run
Gemini 2.0 Flash	google	jn77f88qts7had7ywj89ncd1yd86715s	100%	100%	Level 2	older Google Flash route for generational comparison
Gemini 2.0 Flash Lite	google	jn71419q7s8pmwrg8y9095xx9n867qp2	100%	100%	Level 2	older ultra-cheap Google Flash Lite route
Gemini 2.5 Flash	google	jn7367rpjc0ar1m1mcpkwq1ahs867sk5	100%	100%	Level 2	cheap Google Flash route for comparison against Gemini 3 Flash
Gemini 2.5 Flash Lite	google	jn7a7eaaja7pmzfc76pq7syqc18679yc	100%	100%	Level 2	cheap Google baseline route even though newer Gemini 3 routes are already covered
Gemini 3 Flash Preview	google	jn72d06xfqwj8pds5qgdq6t2gs8623kp	100%	100%	Level 2	current Google Flash paid route
Gemini 3.1 Flash Lite Preview	google	jn7ez731knc67nfs7gfshenwhd86777p	100%	100%	Level 2	current Google efficient preview paid route
Gemini 3.1 Pro Preview	google	jn74r11x1a5denvej7jbyc4p8h8633y8	100%	100%	Level 2	current Google Pro preview paid route
Gemma 3 12B	google	jn794e1cgp5ecmz53cx08v2vhx866t4w	100%	100%	Level 2	Gemma 3 mid-size open-model comparison route
Gemma 3 27B	google	jn79cccx49g16xgxtftyamstsx866ay9	100%	100%	Level 2	Gemma 3 large open-model comparison route
Gemma 3 4B	google	jn74gs2nnjg5q8n7ysvsfmdhzh8662ad	100%	100%	Level 2	Gemma 3 small open-model comparison route
Gemma 4 26B A4B	google	jn770jvh5bx4s74v63yhxmnxnh865vgq	100%	100%	Level 2	newer compact Google Gemma 4 route with public interest
Gemma 4 31B	google	jn7fdzd95ngzbwn6j42yfs5kzx864qrk	100%	100%	Level 2	recent Google open model route with strong public interest
GLM 4.7	z-ai	jn7a14xnb15xckyfvyy41q6k4s86bhv1	100%	100%	Level 2	larger GLM 4.7 route to compare against GLM 4.7 Flash and GLM 5
GLM 5	z-ai	jn77hqg7j6vmamae2r3hwnv1t1869rww	100%	100%	Level 2	current Z.ai GLM route with strong open-model benchmark interest
GLM 5.1	z-ai	jn77es7pyamhprdbm0bb3dntz1869ydp	100%	100%	Level 2	latest Z.ai flagship paid route
GPT OSS 120B	openai	jn74fwhmj5ehh9xmb5jy2rrxqx868gse	100%	100%	Level 2	OpenAI open-weight route people will expect to see benchmarked
GPT OSS 20B	openai	jn7f62k61w8er0kyjr36fpph0n862d85	100%	100%	Level 2	small OpenAI open-weight route for efficient comparison coverage
GPT-4.1 Mini	openai	jn7ae2rzspfcdav901hm50bf71868yjf	100%	100%	Level 2	cheap OpenAI workhorse
GPT-4.1 Nano	openai	jn70ff7ys17t6z347339a788kh868hka	100%	100%	Level 2	ultra-cheap OpenAI baseline
GPT-5.1	openai	jn7fbr2nfw808e81z8aszvp391864cr9	100%	100%	Level 2	legacy OpenAI flagship-generation comparison route
GPT-5.4	openai	jn78dxdtvkys549w6ad5sfh6vh863tbm	100%	100%	Level 2	latest OpenAI flagship paid route
GPT-5.4 Mini	openai	jn74tnhtxnqj7q7b7pqmg5a7nx863xhv	100%	100%	Level 2	current OpenAI efficient paid route
GPT-5.5	openai	jn7frn897xwxymwwpnbck45ejn8626rf	100%	100%	Level 2	latest OpenAI flagship paid route
Granite 4.1 8b	ibm-granite	jn73khjf12dxsp17a9t1eg68ks863rwm	100%	100%	Level 2	live completed Convex full-suite run
Grok 3 Mini	x-ai	jn7ftf922rmzy7k0ad1m8e18h5866wed	100%	100%	Level 2	cheap xAI baseline for compact-model compass comparison
Grok 4 Fast	x-ai	jn7dsp0pxw3zk1yhe7swg846kh8624vq	100%	100%	Level 2	popular xAI low-cost flagship-family route
Grok 4.1 Fast	x-ai	jn7322gpqzvnrj0p808perdjrn867ssk	100%	100%	Level 2	popular current xAI fast paid route
Grok 4.20	x-ai	jn7bpp64vsa9s9n3fj3g0mkdb18677j3	100%	100%	Level 2	latest xAI paid route
Grok 4.3	x-ai	jn7cfwkqn38wj9715mw02tdxxh8630yc	100%	100%	Level 2	live completed Convex full-suite run
Grok Code Fast 1	x-ai	jn75acm3ttqh3n44gzgafkfqm58660ee	100%	100%	Level 2	cheap xAI specialist route, useful as a weird compass comparison
Kimi K2.5	moonshotai	jn7fgjneas7stn448cbt35fbcs8690g9	100%	100%	Level 2	recent Moonshot Kimi comparison route
Kimi K2.6	moonshotai	jn7d4w89z1jma790phnzz9d8qh869t8a	100%	100%	Level 2	latest Moonshot Kimi paid route
LFM2 24B A2B	liquid	jn790737dwtwgx14e1s31j6ycx867k67	100%	100%	Level 2	small efficient LiquidAI open-model comparison route
Ling 2.6 Flash	inclusionai	jn7ed5ge4j9xakj11zcvnsx2jd865y2k	100%	100%	Level 2	live completed Convex full-suite run
Llama 3.3 70b Instruct	meta-llama	jn7bh1gqd6p23gdq346rc3jd1n869bqs	100%	100%	Level 2	live completed Convex full-suite run
Llama 4 Maverick	meta-llama	jn7a64svgza9ah7n809x2cbqrx862g9r	100%	100%	Level 2	current Meta Llama paid route
Llama 4 Scout	meta-llama	jn7deergpw9v49fk6rj2s0xwb1868tky	100%	100%	Level 2	popular Meta Llama 4 comparison route
Mercury 2	inception	jn7ejrx66rgxbgpy086rg1m665869aam	100%	100%	Level 2	recent Inception comparison route
MiniMax M2	minimax	jn705tsdrjcd0np9gz91ct01ks868wpr	100%	100%	Level 2	cheap MiniMax route for historical small-model comparison coverage
MiniMax M2.1	minimax	jn7f2kxcg3mg932p5txnza06tx8697w7	100%	100%	Level 2	cheap MiniMax route for small-model comparison coverage
MiniMax M2.5	minimax	jn77fygwkbh1tcwk4sz25kmey58675ct	100%	100%	Level 2	current MiniMax paid route with mandatory reasoning
MiniMax M2.7	minimax	jn797b19f23bqm4ey1n6tr2z0h86980h	100%	100%	Level 2	newer MiniMax route, cheap enough for broad comparison coverage
Ministral 3 14B 2512	mistralai	jn78b7c7pyv9bwxfz63p58xrjs867dg5	100%	100%	Level 2	cheap Mistral small-model route with full-suite comparison value
Ministral 3 3B 2512	mistralai	jn7cneszh8h4h169wp9m6ftj818669wz	100%	100%	Level 2	tiny Mistral route for low-cost scale comparison
Ministral 3 8B 2512	mistralai	jn72eek26kmhcfna69zsg8m0qs8667a2	100%	100%	Level 2	very cheap Mistral small route for scale and ideology stability checks
Mistral Large 3 2512	mistralai	jn77znvpt5wtay1jkv1jp7y3an867fk7	100%	100%	Level 2	current Mistral large paid route
Mistral Medium 3.1	mistralai	jn7egfgd1waqa15wwzyatnjk698673e7	100%	100%	Level 2	mid-size Mistral route for comparison against Ministral and Saba
Mistral Medium 3.5	mistralai	jn72zhsrwq9m571zcf6mesd5rs865qss	100%	100%	Level 2	current Mistral medium paid route
Mistral Saba	mistralai	jn7dn0ckwrdgp6nksteazb2rps866c2a	100%	100%	Level 2	Mistral regional route for Middle East and South Asia comparison
Mistral Small 4	mistralai	jn78dd95913j8fhzm2wpf1wxa1866pjq	100%	100%	Level 2	current Mistral efficient paid route
Nemotron 3 Nano 30B A3B	nvidia	jn73p8tfdvn74nyaytf37zjve9867g4t	100%	100%	Level 2	cheap NVIDIA Nemotron 3 route with open-model comparison value
Nemotron 3 Super	nvidia	jn79vwtvy4ew6phzgp37bkxncx866ngb	100%	100%	Level 2	current NVIDIA reasoning-capable paid route
Nemotron Nano 9B V2	nvidia	jn715rwtcrnae9trwpm6kwq74d867kba	100%	100%	Level 2	very cheap NVIDIA route with small-model comparison value
OLMo 3.1 32B Instruct	allenai	jn70sha61z7rvxw9m1ebac4w298669rz	100%	100%	Level 2	fully open Ai2 American instruct route
Phi 4	microsoft	jn72y2njy87gzbvvymbanmm0b18686rt	100%	100%	Level 2	popular small Microsoft model comparison route
Qwen3.5 397B A17B	qwen	jn745fmm7et4q7nq87x5r7yh1586ajta	100%	100%	Level 2	large Qwen open-weight comparison route
Qwen3.5 Plus 20260420	qwen	jn74kq8ve2jy1ms5czap7sqd4d86apss	100%	100%	Level 2	live completed Convex full-suite run
Qwen3.6 35B A3B	qwen	jn72r49hq5g308wv4xv0y6rf9s864bjp	100%	100%	Level 2	open-weight mid-size Qwen route for size-class coverage
Qwen3.6 Flash	qwen	jn7ccgp27qt3ecc85dq0vaabhh8659q1	100%	100%	Level 2	live completed Convex full-suite run
Qwen3.6 Max Preview	qwen	jn7b14sehdj2rhgte9k6pw824d864pmp	100%	100%	Level 2	live completed Convex full-suite run
Reka Edge	rekaai	jn7ca900bsv6ychdfzf2r879js864j6a	100%	100%	Level 2	new low-cost Reka edge-model comparison route
Solar Pro 3	upstage	jn75x7fnknr6d2xwgq8px34pg5869a6m	100%	100%	Level 2	Upstage Korean model route with regional comparison value
Trinity Large Preview	arcee-ai	jn76fvf4gfy08hdmsnfxmxdrbx869qc3	100%	100%	Level 2	high-usage US open-weight Arcee preview route
Trinity Large Thinking	arcee-ai	jn7bkw0f7z6fa6z1anr0t4xf2x869xp5	100%	100%	Level 2	US open-weight Arcee reasoning route
Trinity Mini	arcee-ai	jn7ayysp9g6ett652tdhefmpj586550k	100%	100%	Level 2	small US open-weight Arcee MoE route
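
For readers who want to reuse the rows above, here is a minimal sketch, assuming the table has been copied as tab-separated text with the seven columns shown in its header. The `ModelRow` type and the `parse_rows` and `is_sorted_by_model` helpers are hypothetical names used for illustration and are not part of any PoliBench tooling. The sketch checks that rows appear in case-insensitive alphabetical order by model name, which is one plausible reading of "sorted alphabetically" in the claim table.

```python
from dataclasses import dataclass

# Column order assumed from the header row of the table above.
COLUMNS = ("model", "provider", "run", "completion", "parse", "evidence", "caveat")


@dataclass(frozen=True)
class ModelRow:
    """One row of the model table (hypothetical type, for illustration only)."""
    model: str
    provider: str
    run: str
    completion: str
    parse: str
    evidence: str
    caveat: str


def parse_rows(text: str) -> list[ModelRow]:
    """Parse tab-separated table lines, skipping blank lines and the header."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}: {line!r}")
        if tuple(field.lower() for field in fields) == COLUMNS:
            continue  # header row
        rows.append(ModelRow(*fields))
    return rows


def is_sorted_by_model(rows: list[ModelRow]) -> bool:
    """Return True if rows are in case-insensitive alphabetical order by model."""
    names = [row.model.casefold() for row in rows]
    return names == sorted(names)


# Three rows copied from the table above, preceded by its header.
SAMPLE = (
    "Model\tProvider\tRun\tCompletion\tParse\tEvidence\tCaveat\n"
    "Gemma 4 31B\tgoogle\tjn7fdzd95ngzbwn6j42yfs5kzx864qrk\t100%\t100%\tLevel 2\t"
    "recent Google open model route with strong public interest\n"
    "GLM 4.7\tz-ai\tjn7a14xnb15xckyfvyy41q6k4s86bhv1\t100%\t100%\tLevel 2\t"
    "larger GLM 4.7 route to compare against GLM 4.7 Flash and GLM 5\n"
    "GPT OSS 120B\topenai\tjn74fwhmj5ehh9xmb5jy2rrxqx868gse\t100%\t100%\tLevel 2\t"
    "OpenAI open-weight route people will expect to see benchmarked\n"
)

if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    print(len(rows), "rows; sorted by model:", is_sorted_by_model(rows))
```

The comparison uses `casefold()` so that names such as "Gemma" and "GLM" order by letter regardless of case. A check like this only confirms row order. It says nothing about ranking, since the table carries evidence levels rather than leaderboard positions.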

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.