Literature

The benchmark is grounded in existing evaluation practice.

Together, these sources motivate transparency, publication of raw evidence, explicit statements of validity limits, reliability checks, artifact review, model cards, and datasheets.

Reference and link:

- Political Compass or Spinning Arrow? (https://aclanthology.org/2024.acl-long.816/)
- ACM Artifact Review and Badging (https://www.acm.org/publications/policies/artifact-review-and-badging-current)
- Holistic Evaluation of Language Models (https://arxiv.org/abs/2211.09110)
- NIST AI Risk Management Framework (https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf)
- COBIAS (https://arxiv.org/abs/2402.14889)
- Whose Opinions Do Language Models Reflect? (https://arxiv.org/abs/2303.17548)
- Measuring Political Bias in Large Language Models (https://aclanthology.org/2024.acl-long.600/)
- Model Cards for Model Reporting (https://arxiv.org/abs/1810.03993)
- Datasheets for Datasets (https://arxiv.org/abs/1803.09010)