## Literature
The benchmark is grounded in existing evaluation practice. The sources below motivate its emphasis on transparency, release of raw evidence, explicit validity limits, reliability checks, artifact review, model cards, and datasheets.
| Reference | Link |
|---|---|
| Political Compass or Spinning Arrow? | https://aclanthology.org/2024.acl-long.816/ |
| ACM Artifact Review and Badging | https://www.acm.org/publications/policies/artifact-review-and-badging-current |
| Holistic Evaluation of Language Models | https://arxiv.org/abs/2211.09110 |
| NIST AI Risk Management Framework | https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf |
| COBIAS: Contextual Reliability in Bias Assessment | https://arxiv.org/abs/2402.14889 |
| Whose Opinions Do Language Models Reflect? | https://arxiv.org/abs/2303.17548 |
| Measuring Political Bias in Large Language Models | https://aclanthology.org/2024.acl-long.600/ |
| Model Cards for Model Reporting | https://arxiv.org/abs/1810.03993 |
| Datasheets for Datasets | https://arxiv.org/abs/1803.09010 |