[Reliability] Implement SLO-Triggered CUDA Graph Reset #92

@DsThakurRawat

Description (Medium)

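CUDA graph performance can degrade over time in a long-running server. Automatically resetting the graph (re-capturing it when the latency SLO is repeatedly violated) should improve long-term stability without manual intervention.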

Task

  • Add an on_violation callback to SLOTracker.
  • Implement a 'Self-Healing' mechanism in ReflexServer that triggers a graph re-capture if the p99 latency exceeds the threshold for N consecutive windows.
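
A minimal sketch of how the two tasks above could fit together, for discussion only. Everything beyond the names SLOTracker, ReflexServer, and on_violation is an assumption made for illustration: the constructor parameters, the record() and observe() methods, the nearest-rank p99 calculation, and the capture_graph callable (a stand-in for whatever actually captures the CUDA graph in this project). The real classes may look different.

```python
from typing import Callable, List, Optional


class SLOTracker:
    """Hypothetical tracker: fires on_violation after N consecutive bad windows."""

    def __init__(
        self,
        p99_threshold_ms: float,
        window_size: int,
        consecutive_windows: int,
        on_violation: Optional[Callable[[float], None]] = None,
    ) -> None:
        self.p99_threshold_ms = p99_threshold_ms
        self.window_size = window_size
        self.consecutive_windows = consecutive_windows
        self.on_violation = on_violation
        self._samples: List[float] = []
        self._bad_windows = 0

    @staticmethod
    def _p99(samples: List[float]) -> float:
        # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
        ordered = sorted(samples)
        rank = (99 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def record(self, latency_ms: float) -> None:
        """Add one latency sample; evaluate the SLO when a window fills up."""
        self._samples.append(latency_ms)
        if len(self._samples) < self.window_size:
            return
        p99 = self._p99(self._samples)
        self._samples = []  # start a fresh (non-overlapping) window
        if p99 > self.p99_threshold_ms:
            self._bad_windows += 1
        else:
            self._bad_windows = 0  # the streak must be consecutive
        if self._bad_windows >= self.consecutive_windows:
            self._bad_windows = 0  # avoid re-firing on every later window
            if self.on_violation is not None:
                self.on_violation(p99)


class ReflexServer:
    """Hypothetical server shell: re-captures the graph when the SLO is violated."""

    def __init__(
        self,
        capture_graph: Callable[[], object],  # assumed stand-in for real capture
        p99_threshold_ms: float,
        window_size: int,
        consecutive_windows: int,
    ) -> None:
        self._capture_graph = capture_graph
        self.graph = capture_graph()  # initial capture
        self.recapture_count = 0
        self.slo = SLOTracker(
            p99_threshold_ms,
            window_size,
            consecutive_windows,
            on_violation=self._on_slo_violation,
        )

    def _on_slo_violation(self, p99_ms: float) -> None:
        # Self-healing step: drop the old graph and capture a new one.
        self.graph = self._capture_graph()
        self.recapture_count += 1

    def observe(self, latency_ms: float) -> None:
        """Report the latency of one served request."""
        self.slo.record(latency_ms)
```

In this sketch the streak counter resets after a re-capture, so a persistently slow server re-captures at most once every N windows rather than on every window. Whether re-capture should also be rate-limited or run off the request path is left open.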

Metadata

    Labels

    enhancement (New feature or request)
