MVP: Build evaluation harness for human-label comparison #2

Description

@alp-topcu

Goal

Measure baseline scorer behavior against reviewer labels.

Acceptance criteria

  • Human label fixture format is documented.
  • Evaluation report includes agreement metrics and per-dimension errors.
  • The harness can run locally without external model credentials.
  • Results clearly mark the baseline as unvalidated.
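
To make the criteria above concrete, here is a minimal sketch of what such a harness could compute. It is not the project's implementation. The fixture shape (`item_id`, `dimension`, `human`, `baseline`), the function names, the integer score scale, and the choice of metrics (exact agreement, Cohen's kappa, mean absolute error) are all assumptions made for illustration, since the issue does not specify them. It uses only the standard library, so it runs locally without model credentials.

```python
from collections import defaultdict

# Hypothetical fixture: one record per (item, dimension), holding an integer
# score from a human reviewer and one from the baseline scorer.
# Field names and values are illustrative, not the project's real format.
FIXTURE = [
    {"item_id": "a1", "dimension": "clarity", "human": 3, "baseline": 3},
    {"item_id": "a2", "dimension": "clarity", "human": 2, "baseline": 3},
    {"item_id": "a3", "dimension": "clarity", "human": 1, "baseline": 1},
    {"item_id": "a1", "dimension": "accuracy", "human": 2, "baseline": 2},
    {"item_id": "a2", "dimension": "accuracy", "human": 3, "baseline": 1},
    {"item_id": "a3", "dimension": "accuracy", "human": 1, "baseline": 1},
]


def cohen_kappa(human, baseline):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(human)
    observed = sum(h == b for h, b in zip(human, baseline)) / n
    labels = set(human) | set(baseline)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        (human.count(label) / n) * (baseline.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined
        # here, so this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def evaluate(records):
    """Build a per-dimension agreement report from fixture records."""
    by_dimension = defaultdict(list)
    for record in records:
        by_dimension[record["dimension"]].append(
            (record["human"], record["baseline"])
        )

    # The baseline is flagged as unvalidated in the report itself.
    report = {"baseline_status": "unvalidated", "dimensions": {}}
    for dimension, pairs in sorted(by_dimension.items()):
        human = [h for h, _ in pairs]
        baseline = [b for _, b in pairs]
        n = len(pairs)
        report["dimensions"][dimension] = {
            "n": n,
            "exact_agreement": sum(h == b for h, b in pairs) / n,
            "cohen_kappa": cohen_kappa(human, baseline),
            "mean_abs_error": sum(abs(h - b) for h, b in pairs) / n,
        }
    return report


if __name__ == "__main__":
    import json

    print(json.dumps(evaluate(FIXTURE), indent=2))
```

Cohen's kappa is included alongside raw agreement because it corrects for agreement expected by chance. Whether it suits the real label scale (for example, ordinal scores might call for a weighted variant) is left open here.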
