Evaluation: Reliable scoring improvements

**Is your feature request related to a problem?**

Evaluation runs often finish with missing cosine similarity scores. For example, a run with 215 pairs may only return 211 scores because some pairs are silently skipped due to empty inputs, embedding failures, missing embeddings, or Langfuse write failures.

The current workaround is to manually click **Resync**, but this only recovers scores that eventually reached Langfuse. It cannot recover scores that were never computed, leaving runs permanently incomplete.

**Describe the solution you'd like**

A run started with 215 pairs should automatically produce 215 scores, or clearly explain why some pairs could not be scored, without requiring manual resyncs. To make scoring reliable and self-healing:

- Retry embedding generation and Langfuse score writes on transient failures.
- Track expected vs. completed pair counts for each run.
- Automatically retry missing pairs until all valid pairs are scored.
- Surface pairs that cannot be scored instead of silently skipping them.
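
The behaviour in the list above could look roughly like the sketch below. It is only an illustration, not the project's actual code: `TransientError`, `with_retry`, `RunReport`, `score_run`, and the `score_pair` callback (standing in for "embed both texts, compute cosine similarity, write the score to Langfuse") are all hypothetical names, and the retry counts and delays are placeholder values.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")


@dataclass
class RunReport:
    expected: int                                             # pairs the run started with
    scores: Dict[str, float] = field(default_factory=dict)    # pair id -> score
    unscorable: Dict[str, str] = field(default_factory=dict)  # pair id -> reason

    @property
    def complete(self) -> bool:
        # Every expected pair is either scored or explicitly accounted for.
        return len(self.scores) + len(self.unscorable) == self.expected


def score_run(pairs: Dict[str, Tuple[str, str]],
              score_pair: Callable[[str, str], float],
              max_passes: int = 3,
              sleep: Callable[[float], None] = time.sleep) -> RunReport:
    report = RunReport(expected=len(pairs))

    # Surface pairs that can never be scored instead of silently skipping them.
    pending = []
    for pair_id, (a, b) in pairs.items():
        if not a.strip() or not b.strip():
            report.unscorable[pair_id] = "empty input"
        else:
            pending.append(pair_id)

    # Re-run only the missing pairs until none remain or passes run out.
    for _ in range(max_passes):
        still_missing = []
        for pair_id in pending:
            a, b = pairs[pair_id]
            try:
                report.scores[pair_id] = with_retry(
                    lambda: score_pair(a, b), sleep=sleep)
            except TransientError:
                still_missing.append(pair_id)
        pending = still_missing
        if not pending:
            break

    for pair_id in pending:
        report.unscorable[pair_id] = "retries exhausted"
    return report
```

With this shape, a run of 215 pairs always ends with `len(report.scores) + len(report.unscorable) == 215`, and `report.unscorable` gives the UI a per-pair reason to display in place of a missing score.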

**Screenshot**

<img width="1638" height="336" alt="Image" src="https://github.com/user-attachments/assets/55abb1e3-0663-4346-92a5-1f54bb4257f8" />

Evaluation: Reliable scoring improvements #947
