MVP: Build evaluation harness for human-label comparison

## Goal
Measure how closely the baseline scorer's output agrees with human reviewer labels.

## Acceptance criteria
- The human-label fixture format is documented.
- Evaluation report includes agreement metrics and per-dimension errors.
- The harness can run locally without external model credentials.
- Results clearly mark the baseline as unvalidated.
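
One possible shape for the report, as a minimal sketch rather than the intended implementation. The fixture layout (`item_id`, `dimension`, `human`, `baseline`), the function names, and the choice of Cohen's kappa and mean absolute error as the metrics are all assumptions made for illustration; none of them is specified in this issue. The sketch uses only the standard library, so it runs locally without model credentials.

```python
from collections import Counter, defaultdict

# Hypothetical fixture rows: one human label and one baseline score
# per (item, dimension). The real fixture format is still to be documented.
FIXTURE = [
    {"item_id": "a", "dimension": "clarity", "human": 3, "baseline": 3},
    {"item_id": "b", "dimension": "clarity", "human": 2, "baseline": 3},
    {"item_id": "a", "dimension": "accuracy", "human": 1, "baseline": 1},
    {"item_id": "b", "dimension": "accuracy", "human": 3, "baseline": 1},
]


def cohen_kappa(human, baseline):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(human)
    observed = sum(h == b for h, b in zip(human, baseline)) / n
    h_counts, b_counts = Counter(human), Counter(baseline)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(h_counts[k] * b_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


def evaluate(rows):
    """Group rows by dimension and compute agreement and error per dimension."""
    by_dim = defaultdict(list)
    for row in rows:
        by_dim[row["dimension"]].append((row["human"], row["baseline"]))

    # The status field marks the baseline as unvalidated in every report.
    report = {"baseline_status": "unvalidated", "dimensions": {}}
    for dim, pairs in sorted(by_dim.items()):
        human = [h for h, _ in pairs]
        baseline = [b for _, b in pairs]
        report["dimensions"][dim] = {
            "n": len(pairs),
            "exact_agreement": sum(h == b for h, b in pairs) / len(pairs),
            "cohen_kappa": cohen_kappa(human, baseline),
            "mean_abs_error": sum(abs(h - b) for h, b in pairs) / len(pairs),
        }
    return report


if __name__ == "__main__":
    import json

    print(json.dumps(evaluate(FIXTURE), indent=2))
```

Mean absolute error only makes sense if labels are ordinal numbers; for purely categorical labels, a per-dimension confusion count would likely be a better fit.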