Evaluation: Reliable scoring improvements #947

Description

@AkhileshNegi

Is your feature request related to a problem?

Evaluation runs often finish with missing cosine similarity scores. For example, a run with 215 pairs may only return 211 scores because some pairs are silently skipped due to empty inputs, embedding failures, missing embeddings, or Langfuse write failures.

The current workaround is to manually click Resync, but this only recovers scores that eventually reached Langfuse. It cannot recover scores that were never computed, leaving runs permanently incomplete.

Describe the solution you'd like

A run started with 215 pairs should automatically produce 215 scores, or clearly explain why some pairs could not be scored, without requiring manual resyncs. Scoring should be reliable and self-healing:

  • Retry embedding generation and Langfuse score writes on transient failures.
  • Track expected vs. completed pair counts for each run.
  • Automatically retry missing pairs until all valid pairs are scored.
  • Surface pairs that cannot be scored instead of silently skipping them.
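
As a rough illustration of how these four points could fit together, here is a minimal Python sketch. It is not the project's actual code and does not use the real Langfuse API: `TransientError`, `with_retry`, `RunReport`, `score_run`, and the `embed` / `write_score` callables are all hypothetical names introduced for this example. It assumes transient failures can be distinguished from permanent ones and that score writes are idempotent, so retrying a write is safe.

```python
import math
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-norm embedding")
    return dot / norm


@dataclass
class RunReport:
    expected: int                                    # pairs the run started with
    scores: dict = field(default_factory=dict)       # pair_id -> score
    unscorable: dict = field(default_factory=dict)   # pair_id -> reason

    @property
    def complete(self):
        # Every pair is either scored or explicitly explained.
        return len(self.scores) + len(self.unscorable) == self.expected


def score_run(pairs, embed, write_score, max_passes=3, sleep=time.sleep):
    """Score (pair_id, text_a, text_b) tuples, re-queuing pairs that fail transiently."""
    report = RunReport(expected=len(pairs))
    pending = list(pairs)
    for _ in range(max_passes):
        still_pending = []
        for pair_id, a, b in pending:
            if not a.strip() or not b.strip():
                report.unscorable[pair_id] = "empty input"
                continue
            try:
                va = with_retry(lambda: embed(a), sleep=sleep)
                vb = with_retry(lambda: embed(b), sleep=sleep)
                score = cosine_similarity(va, vb)
                with_retry(lambda: write_score(pair_id, score), sleep=sleep)
                report.scores[pair_id] = score
            except TransientError:
                still_pending.append((pair_id, a, b))  # try again next pass
            except ValueError as exc:
                report.unscorable[pair_id] = str(exc)  # permanent: surface it
        pending = still_pending
        if not pending:
            break
    for pair_id, _, _ in pending:
        report.unscorable[pair_id] = "retries exhausted"
    return report
```

With this shape, a run is finished only when `report.complete` is true, and any gap between `expected` and `len(report.scores)` is accounted for in `report.unscorable` with a reason rather than being dropped silently.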

Status
Closed
