Is your feature request related to a problem?
Evaluation runs often finish with missing cosine similarity scores. For example, a run with 215 pairs may only return 211 scores because some pairs are silently skipped due to empty inputs, embedding failures, missing embeddings, or Langfuse write failures.
The current workaround is to manually click Resync, but this only recovers scores that eventually reached Langfuse. It cannot recover scores that were never computed, leaving runs permanently incomplete.
Describe the solution you'd like
A run started with 215 pairs should automatically produce 215 scores, or clearly explain why some pairs could not be scored, without requiring manual resyncs. Scoring should be reliable and self-healing:
- Retry embedding generation and Langfuse score writes on transient failures.
- Track expected vs. completed pair counts for each run.
- Automatically retry missing pairs until all valid pairs are scored.
- Surface pairs that cannot be scored instead of silently skipping them.
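To make the request concrete, here is a minimal Python sketch of the behaviour described above. It does not reflect the project's actual code or the Langfuse SDK: `score_pair`, `write_score`, `TransientError`, and `RunReport` are all hypothetical names. The embedding/cosine step and the Langfuse write are passed in as plain callables, so the sketch covers only the retry and bookkeeping logic.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence, Tuple


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(fn: Callable[[], object], attempts: int = 3, base_delay: float = 0.0):
    """Call fn, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            time.sleep(base_delay * 2 ** attempt)


@dataclass
class RunReport:
    expected: int                                              # pairs the run started with
    scores: Dict[str, float] = field(default_factory=dict)     # pair_id -> score
    unscorable: Dict[str, str] = field(default_factory=dict)   # pair_id -> reason

    @property
    def complete(self) -> bool:
        # Every pair is either scored or explicitly explained.
        return len(self.scores) + len(self.unscorable) == self.expected


def score_run(
    pairs: Sequence[Tuple[str, str, str]],       # (pair_id, output, ground_truth)
    score_pair: Callable[[str, str], float],     # assumed: embeds both texts, returns cosine similarity
    write_score: Callable[[str, float], None],   # assumed: persists the score (e.g. to Langfuse)
    attempts: int = 3,
    base_delay: float = 0.0,
) -> RunReport:
    report = RunReport(expected=len(pairs))
    for pair_id, output, truth in pairs:
        # Surface empty inputs instead of silently skipping them.
        if not output.strip() or not truth.strip():
            report.unscorable[pair_id] = "empty input"
            continue
        try:
            score = with_retry(lambda: score_pair(output, truth), attempts, base_delay)
            with_retry(lambda: write_score(pair_id, score), attempts, base_delay)
        except TransientError as exc:
            report.unscorable[pair_id] = f"failed after {attempts} attempts: {exc}"
            continue
        report.scores[pair_id] = score
    return report
```

With something like this in place, a run could be marked finished only when `report.complete` is true, and the UI could list `report.unscorable` with reasons rather than showing 211 of 215 scores with no explanation.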
Screenshot
