Models

Every model card carries its evidence limits.

A model's exact version is treated as unknown unless the source artifacts document it independently.

Claim Evidence

The model index links evidence-level claims to the pages documenting release artifacts before showing model rows.

| Claim | Evidence |
| --- | --- |
| Model cards are sorted alphabetically and carry evidence levels, not leaderboard ranks. | Model catalog, Truth gate |
| Model version uncertainty is a visible limitation unless independently documented. | Limitations, Model roster preflight |
| Evidence levels are model-output evidence levels, not human or external validation. | Human status, External status |
| Model | Provider | Run | Completion | Parse | Evidence | Caveat |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | anthropic | jn70fpqyr7an1bca1cn7fq93ys864cx0 | 100% | 100% | Level 2 | current Anthropic low-latency paid route |
| Claude Opus 4.5 | anthropic | jn7839n5vcfsf5zsyqg0098rwd864xxv | 100% | 100% | Level 2 | legacy Anthropic Opus comparison route |
| Claude Opus 4.7 | anthropic | jn7bedafgemk6hecfqtpd6e309864xhq | 100% | 100% | Level 2 | latest Anthropic Opus paid route |
| Claude Sonnet 4.6 | anthropic | jn74qyaygktq550zw4metb3xt5864hfv | 100% | 100% | Level 2 | current Anthropic Sonnet paid route |
| DeepSeek V3.2 | deepseek | jn76nae0e9j4pqakz7zwtj1yn186abyv | 100% | 100% | Level 2 | recent DeepSeek reasoning and agentic paid route |
| DeepSeek V4 Flash | deepseek | jn77m1pvwyaed4n2v6nb6btn1s864kct | 100% | 100% | Level 2 | latest available DeepSeek V4 route with healthy provider capacity |
| DeepSeek V4 Pro | deepseek | jn7aybgpr67x8zswpmegfqytyx869wkp | 100% | 100% | Level 2 | latest DeepSeek V4 Pro paid route |
| Devstral 2512 | mistralai | jn793dym6gfm0tssrp6mgh8es986b8nq | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Gemini 2.0 Flash | google | jn77f88qts7had7ywj89ncd1yd86715s | 100% | 100% | Level 2 | older Google Flash route for generational comparison |
| Gemini 2.0 Flash Lite | google | jn71419q7s8pmwrg8y9095xx9n867qp2 | 100% | 100% | Level 2 | older ultra-cheap Google Flash Lite route |
| Gemini 2.5 Flash | google | jn7367rpjc0ar1m1mcpkwq1ahs867sk5 | 100% | 100% | Level 2 | cheap Google Flash route for comparison against Gemini 3 Flash |
| Gemini 2.5 Flash Lite | google | jn7a7eaaja7pmzfc76pq7syqc18679yc | 100% | 100% | Level 2 | cheap Google baseline route even though newer Gemini 3 routes are already covered |
| Gemini 3 Flash Preview | google | jn72d06xfqwj8pds5qgdq6t2gs8623kp | 100% | 100% | Level 2 | current Google Flash paid route |
| Gemini 3.1 Flash Lite Preview | google | jn7ez731knc67nfs7gfshenwhd86777p | 100% | 100% | Level 2 | current Google efficient preview paid route |
| Gemini 3.1 Pro Preview | google | jn74r11x1a5denvej7jbyc4p8h8633y8 | 100% | 100% | Level 2 | current Google Pro preview paid route |
| Gemma 3 12B | google | jn794e1cgp5ecmz53cx08v2vhx866t4w | 100% | 100% | Level 2 | Gemma 3 mid-size open-model comparison route |
| Gemma 3 27B | google | jn79cccx49g16xgxtftyamstsx866ay9 | 100% | 100% | Level 2 | Gemma 3 large open-model comparison route |
| Gemma 3 4B | google | jn74gs2nnjg5q8n7ysvsfmdhzh8662ad | 100% | 100% | Level 2 | Gemma 3 small open-model comparison route |
| Gemma 4 26B A4B | google | jn770jvh5bx4s74v63yhxmnxnh865vgq | 100% | 100% | Level 2 | newer compact Google Gemma 4 route with public interest |
| Gemma 4 31B | google | jn7fdzd95ngzbwn6j42yfs5kzx864qrk | 100% | 100% | Level 2 | recent Google open model route with strong public interest |
| GLM 4.7 | z-ai | jn7a14xnb15xckyfvyy41q6k4s86bhv1 | 100% | 100% | Level 2 | larger GLM 4.7 route to compare against GLM 4.7 Flash and GLM 5 |
| GLM 5 | z-ai | jn77hqg7j6vmamae2r3hwnv1t1869rww | 100% | 100% | Level 2 | current Z.ai GLM route with strong open-model benchmark interest |
| GLM 5.1 | z-ai | jn77es7pyamhprdbm0bb3dntz1869ydp | 100% | 100% | Level 2 | latest Z.ai flagship paid route |
| GPT OSS 120B | openai | jn74fwhmj5ehh9xmb5jy2rrxqx868gse | 100% | 100% | Level 2 | OpenAI open-weight route people will expect to see benchmarked |
| GPT OSS 20B | openai | jn7f62k61w8er0kyjr36fpph0n862d85 | 100% | 100% | Level 2 | small OpenAI open-weight route for efficient comparison coverage |
| GPT-4.1 Mini | openai | jn7ae2rzspfcdav901hm50bf71868yjf | 100% | 100% | Level 2 | cheap OpenAI workhorse |
| GPT-4.1 Nano | openai | jn70ff7ys17t6z347339a788kh868hka | 100% | 100% | Level 2 | ultra-cheap OpenAI baseline |
| GPT-5.1 | openai | jn7fbr2nfw808e81z8aszvp391864cr9 | 100% | 100% | Level 2 | legacy OpenAI flagship-generation comparison route |
| GPT-5.4 | openai | jn78dxdtvkys549w6ad5sfh6vh863tbm | 100% | 100% | Level 2 | latest OpenAI flagship paid route |
| GPT-5.4 Mini | openai | jn74tnhtxnqj7q7b7pqmg5a7nx863xhv | 100% | 100% | Level 2 | current OpenAI efficient paid route |
| GPT-5.5 | openai | jn7frn897xwxymwwpnbck45ejn8626rf | 100% | 100% | Level 2 | latest OpenAI flagship paid route |
| Granite 4.1 8b | ibm-granite | jn73khjf12dxsp17a9t1eg68ks863rwm | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Grok 3 Mini | x-ai | jn7ftf922rmzy7k0ad1m8e18h5866wed | 100% | 100% | Level 2 | cheap xAI baseline for compact-model compass comparison |
| Grok 4 Fast | x-ai | jn7dsp0pxw3zk1yhe7swg846kh8624vq | 100% | 100% | Level 2 | popular xAI low-cost flagship-family route |
| Grok 4.1 Fast | x-ai | jn7322gpqzvnrj0p808perdjrn867ssk | 100% | 100% | Level 2 | popular current xAI fast paid route |
| Grok 4.20 | x-ai | jn7bpp64vsa9s9n3fj3g0mkdb18677j3 | 100% | 100% | Level 2 | latest xAI paid route |
| Grok 4.3 | x-ai | jn7cfwkqn38wj9715mw02tdxxh8630yc | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Grok Code Fast 1 | x-ai | jn75acm3ttqh3n44gzgafkfqm58660ee | 100% | 100% | Level 2 | cheap xAI specialist route, useful as a weird compass comparison |
| Kimi K2.5 | moonshotai | jn7fgjneas7stn448cbt35fbcs8690g9 | 100% | 100% | Level 2 | recent Moonshot Kimi comparison route |
| Kimi K2.6 | moonshotai | jn7d4w89z1jma790phnzz9d8qh869t8a | 100% | 100% | Level 2 | latest Moonshot Kimi paid route |
| LFM2 24B A2B | liquid | jn790737dwtwgx14e1s31j6ycx867k67 | 100% | 100% | Level 2 | small efficient LiquidAI open-model comparison route |
| Ling 2.6 Flash | inclusionai | jn7ed5ge4j9xakj11zcvnsx2jd865y2k | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Llama 3.3 70b Instruct | meta-llama | jn7bh1gqd6p23gdq346rc3jd1n869bqs | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Llama 4 Maverick | meta-llama | jn7a64svgza9ah7n809x2cbqrx862g9r | 100% | 100% | Level 2 | current Meta Llama paid route |
| Llama 4 Scout | meta-llama | jn7deergpw9v49fk6rj2s0xwb1868tky | 100% | 100% | Level 2 | popular Meta Llama 4 comparison route |
| Mercury 2 | inception | jn7ejrx66rgxbgpy086rg1m665869aam | 100% | 100% | Level 2 | recent Inception comparison route |
| MiniMax M2 | minimax | jn705tsdrjcd0np9gz91ct01ks868wpr | 100% | 100% | Level 2 | cheap MiniMax route for historical small-model comparison coverage |
| MiniMax M2.1 | minimax | jn7f2kxcg3mg932p5txnza06tx8697w7 | 100% | 100% | Level 2 | cheap MiniMax route for small-model comparison coverage |
| MiniMax M2.5 | minimax | jn77fygwkbh1tcwk4sz25kmey58675ct | 100% | 100% | Level 2 | current MiniMax paid route with mandatory reasoning |
| MiniMax M2.7 | minimax | jn797b19f23bqm4ey1n6tr2z0h86980h | 100% | 100% | Level 2 | newer MiniMax route, cheap enough for broad comparison coverage |
| Ministral 3 14B 2512 | mistralai | jn78b7c7pyv9bwxfz63p58xrjs867dg5 | 100% | 100% | Level 2 | cheap Mistral small-model route with full-suite comparison value |
| Ministral 3 3B 2512 | mistralai | jn7cneszh8h4h169wp9m6ftj818669wz | 100% | 100% | Level 2 | tiny Mistral route for low-cost scale comparison |
| Ministral 3 8B 2512 | mistralai | jn72eek26kmhcfna69zsg8m0qs8667a2 | 100% | 100% | Level 2 | very cheap Mistral small route for scale and ideology stability checks |
| Mistral Large 3 2512 | mistralai | jn77znvpt5wtay1jkv1jp7y3an867fk7 | 100% | 100% | Level 2 | current Mistral large paid route |
| Mistral Medium 3.1 | mistralai | jn7egfgd1waqa15wwzyatnjk698673e7 | 100% | 100% | Level 2 | mid-size Mistral route for comparison against Ministral and Saba |
| Mistral Medium 3.5 | mistralai | jn72zhsrwq9m571zcf6mesd5rs865qss | 100% | 100% | Level 2 | current Mistral medium paid route |
| Mistral Saba | mistralai | jn7dn0ckwrdgp6nksteazb2rps866c2a | 100% | 100% | Level 2 | Mistral regional route for Middle East and South Asia comparison |
| Mistral Small 4 | mistralai | jn78dd95913j8fhzm2wpf1wxa1866pjq | 100% | 100% | Level 2 | current Mistral efficient paid route |
| Nemotron 3 Nano 30B A3B | nvidia | jn73p8tfdvn74nyaytf37zjve9867g4t | 100% | 100% | Level 2 | cheap NVIDIA Nemotron 3 route with open-model comparison value |
| Nemotron 3 Super | nvidia | jn79vwtvy4ew6phzgp37bkxncx866ngb | 100% | 100% | Level 2 | current NVIDIA reasoning-capable paid route |
| Nemotron Nano 9B V2 | nvidia | jn715rwtcrnae9trwpm6kwq74d867kba | 100% | 100% | Level 2 | very cheap NVIDIA route with small-model comparison value |
| OLMo 3.1 32B Instruct | allenai | jn70sha61z7rvxw9m1ebac4w298669rz | 100% | 100% | Level 2 | fully open Ai2 American instruct route |
| Phi 4 | microsoft | jn72y2njy87gzbvvymbanmm0b18686rt | 100% | 100% | Level 2 | popular small Microsoft model comparison route |
| Qwen3.5 397B A17B | qwen | jn745fmm7et4q7nq87x5r7yh1586ajta | 100% | 100% | Level 2 | large Qwen open-weight comparison route |
| Qwen3.5 Plus 20260420 | qwen | jn74kq8ve2jy1ms5czap7sqd4d86apss | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Qwen3.6 35B A3B | qwen | jn72r49hq5g308wv4xv0y6rf9s864bjp | 100% | 100% | Level 2 | open-weight mid-size Qwen route for size-class coverage |
| Qwen3.6 Flash | qwen | jn7ccgp27qt3ecc85dq0vaabhh8659q1 | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Qwen3.6 Max Preview | qwen | jn7b14sehdj2rhgte9k6pw824d864pmp | 100% | 100% | Level 2 | live completed Convex full-suite run |
| Reka Edge | rekaai | jn7ca900bsv6ychdfzf2r879js864j6a | 100% | 100% | Level 2 | new low-cost Reka edge-model comparison route |
| Solar Pro 3 | upstage | jn75x7fnknr6d2xwgq8px34pg5869a6m | 100% | 100% | Level 2 | Upstage Korean model route with regional comparison value |
| Trinity Large Preview | arcee-ai | jn76fvf4gfy08hdmsnfxmxdrbx869qc3 | 100% | 100% | Level 2 | high-usage US open-weight Arcee preview route |
| Trinity Large Thinking | arcee-ai | jn7bkw0f7z6fa6z1anr0t4xf2x869xp5 | 100% | 100% | Level 2 | US open-weight Arcee reasoning route |
| Trinity Mini | arcee-ai | jn7ayysp9g6ett652tdhefmpj586550k | 100% | 100% | Level 2 | small US open-weight Arcee MoE route |
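
For readers who want to reuse these rows programmatically, the following is a minimal sketch of splitting one row into its seven columns (Model, Provider, Run, Completion, Parse, Evidence, Caveat). It is not PoliBench's own parser. The `ModelRow` class, the `parse_model_row` function, and the regular expression are hypothetical names and assumptions inferred only from the rows shown above: a lowercase provider slug, a 32-character lowercase alphanumeric run identifier, two percentages, and a "Level N" label. Pipe characters, if present, are treated as plain separators.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed row shape, inferred from the rows listed on this page only:
# model name, provider slug, 32-char run id, two percentages, "Level N", caveat.
ROW_PATTERN = re.compile(
    r"^(?P<model>.+?)\s+"
    r"(?P<provider>[a-z][a-z0-9-]*)\s+"
    r"(?P<run>[a-z0-9]{32})\s+"
    r"(?P<completion>\d+)%\s+"
    r"(?P<parse>\d+)%\s+"
    r"Level\s+(?P<level>\d+)\s+"
    r"(?P<caveat>.+)$"
)


@dataclass(frozen=True)
class ModelRow:
    """One row of the model index (hypothetical container, not a site API)."""
    model: str
    provider: str
    run: str
    completion_pct: int
    parse_pct: int
    evidence_level: int
    caveat: str


def parse_model_row(line: str) -> Optional[ModelRow]:
    """Return a ModelRow, or None for headers and lines that do not fit."""
    # Treat pipes as whitespace and collapse runs of spaces, so both
    # space-separated and pipe-separated rows reduce to the same flat string.
    flat = " ".join(line.replace("|", " ").split())
    match = ROW_PATTERN.match(flat)
    if match is None:
        return None
    return ModelRow(
        model=match["model"],
        provider=match["provider"],
        run=match["run"],
        completion_pct=int(match["completion"]),
        parse_pct=int(match["parse"]),
        evidence_level=int(match["level"]),
        caveat=match["caveat"],
    )


if __name__ == "__main__":
    sample = (
        "Claude Haiku 4.5 anthropic jn70fpqyr7an1bca1cn7fq93ys864cx0 "
        "100% 100% Level 2 current Anthropic low-latency paid route"
    )
    print(parse_model_row(sample))
```

Keeping the run identifier and caveat on the parsed record means a reused number still travels with the run it came from and the limitation attached to it.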

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.