🤖 SRE Copilot — AI-Powered API Failure Detection & Debugging Agent

This project was developed for the Agentic AI Hackathon hosted by Product Space on Unstop.

The system focuses on detecting API failures and transforming raw telemetry into intelligent operational insights for both developers and end users.

📌 1. Project Overview

SRE Copilot is a production-grade AI-powered Site Reliability Engineering (SRE) observability platform that transforms raw API telemetry into intelligent operational insights — before users ever complain.

Engineering teams often discover API failures only after users report them. Traditional monitoring generates noisy alerts but does not correlate related issues, explain root causes, or suggest preventive strategies.

SRE Copilot solves this by combining:

Continuous API monitoring with async polling
Multi-layer anomaly detection using ML + statistical methods
Intelligent incident grouping to reduce alert fatigue
AI-powered Root Cause Analysis using a LangChain agent backed by Groq LLaMA 3.3
Real-time observability dashboard inspired by Datadog, Grafana, and New Relic

"Build an AI SRE Copilot capable of transforming raw API telemetry into intelligent operational insights before users complain."

🎥 Demo Video

Project_Pitch.mp4

🌐 Live Links


Service	Link
Frontend	sre-copilot.vercel.app
Backend API	sre-copilot-production.up.railway.app
Dashboard Data	/api/dashboard

⚡ 2. Key Features

✔️ Live API Monitoring — continuous async polling at configurable intervals (5s to 30min)

✔️ Real-time Telemetry Collection — latency, status codes, failures, timestamps

✔️ Statistical Anomaly Detection — Z-score + rolling average + threshold rules

✔️ Isolation Forest ML Detection — unsupervised anomaly detection on request logs

✔️ Incident Correlation Engine — groups related anomalies, deduplicates alerts, assigns severity

✔️ LangChain Agent with Tools — autonomous operational investigation

✔️ AI Root Cause Analysis — explains causes, business impact, fixes, prevention strategies

✔️ Premium Observability Dashboard — dark-mode, real-time charts, live API health table

✔️ Production Deployed — Vercel (frontend) + Railway (backend + MySQL)

🏗️ 3. System Architecture

The complete platform operates as a multi-stage observability and intelligence pipeline.

High-level flow:

User adds API
        ↓
Monitoring Scheduler polls endpoint
        ↓
Telemetry stored in MySQL
        ↓
Intelligence pipeline analyzes logs
        ↓
Incidents generated
        ↓
AI RCA Agent investigates failures
        ↓
Dashboard displays insights

⚙️ 3.1 Monitoring & Telemetry Architecture

The monitoring subsystem continuously polls user-provided APIs and converts raw operational behavior into structured telemetry.

The monitoring scheduler is built using:

AsyncIO
HTTPX
Concurrent polling loops

Each monitored API runs independently at configurable intervals:

5s · 10s · 30s · 1min · 5min · 10min · 15min · 30min

For every API request, the system collects:

Request latency
HTTP status codes
Success/failure state
Timestamps
Connection errors
Timeout behavior

Example telemetry:

{
  "api_id": 5,
  "status_code": 500,
  "latency_ms": 4201.32,
  "success": false
}

The telemetry is stored in:

request_logs

and later consumed by the intelligence pipeline.
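A minimal sketch of one polling step, using only the standard library. The `fetch` callable is a stand-in for an HTTPX request, and `poll_once`, `fake_fetch`, the `error` field, and the "status below 400 counts as success" rule are illustrative assumptions, not the project's actual code.

```python
import asyncio
import time
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional

# Hypothetical stand-in for an HTTP client call: takes a URL, returns a status code.
Fetch = Callable[[str], Awaitable[int]]


async def poll_once(api_id: int, url: str, fetch: Fetch, timeout_s: float = 5.0) -> dict:
    """Poll one endpoint once and return a telemetry record."""
    started = time.perf_counter()
    status_code: Optional[int] = None
    error: Optional[str] = None
    try:
        status_code = await asyncio.wait_for(fetch(url), timeout=timeout_s)
    except asyncio.TimeoutError:
        error = "timeout"
    except OSError as exc:  # connection refused, DNS failure, and similar
        error = f"connection_error: {exc}"
    latency_ms = (time.perf_counter() - started) * 1000
    return {
        "api_id": api_id,
        "status_code": status_code,
        "latency_ms": round(latency_ms, 2),
        # Assumption: any response below 400 counts as a success
        "success": status_code is not None and status_code < 400,
        "error": error,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


async def fake_fetch(url: str) -> int:
    """Simulated endpoint that always answers 500 after a short delay."""
    await asyncio.sleep(0.01)
    return 500


record = asyncio.run(poll_once(5, "https://example.com/health", fake_fetch))
print(record["status_code"], record["success"])
```

A scheduler could wrap `poll_once` in a `while True` loop with `asyncio.sleep(interval)` and start one task per monitored API, so that a slow endpoint does not delay the others.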

🧠 3.2 Intelligence Pipeline Architecture

The intelligence layer is composed of three independent detection stages, each focusing on different operational behavior patterns.

3.2.1 Rule-Based Detection Layer

This layer detects obvious operational failures using deterministic rules.

Example rules:

if latency > 3000ms:
    → HIGH_LATENCY anomaly

if status_code >= 500:
    → SERVER_ERROR anomaly

if connection fails:
    → CONNECTION_FAILURE anomaly

This stage provides immediate anomaly signals with extremely low computational cost.
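The rules above can be sketched as a small pure function. The function name and the convention that a failed connection is recorded with `status_code=None` are assumptions made for this example.

```python
from typing import Optional

LATENCY_THRESHOLD_MS = 3000  # threshold taken from the rules above


def rule_based_anomalies(latency_ms: float, status_code: Optional[int]) -> list[str]:
    """Return the anomaly types triggered by one telemetry record.

    Assumption: a failed connection is recorded with status_code=None.
    """
    if status_code is None:
        return ["CONNECTION_FAILURE"]
    anomalies = []
    if latency_ms > LATENCY_THRESHOLD_MS:
        anomalies.append("HIGH_LATENCY")
    if status_code >= 500:
        anomalies.append("SERVER_ERROR")
    return anomalies


print(rule_based_anomalies(4201.32, 500))  # ['HIGH_LATENCY', 'SERVER_ERROR']
print(rule_based_anomalies(120.0, 200))    # []
print(rule_based_anomalies(0.0, None))     # ['CONNECTION_FAILURE']
```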

3.2.2 Statistical Detection Layer

This layer detects behavioral drift using statistical analysis on recent telemetry.

The system computes:

Rolling averages
Standard deviation
Z-score deviation

Formula:

Z = (observed_value - rolling_mean) / std_deviation

If:

Z > threshold

the system generates anomalies.

This helps detect:

gradual latency increases
unstable APIs
fluctuating response behavior
abnormal operational drift

that static rules may miss.
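A minimal sketch of the Z-score check, using the standard library only. The window size of 20 and the threshold of 3.0 are illustrative defaults, not values taken from the project.

```python
from statistics import mean, pstdev


def zscore_anomalies(latencies: list[float], window: int = 20, threshold: float = 3.0) -> list[int]:
    """Return indices whose latency deviates from the preceding rolling window.

    Each value is compared against the mean and standard deviation of the
    `window` values before it, so a spike does not inflate its own baseline.
    """
    flagged = []
    for i in range(window, len(latencies)):
        history = latencies[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            continue  # flat history: Z is undefined, leave it to the rule layer
        z = (latencies[i] - mu) / sigma
        if z > threshold:
            flagged.append(i)
    return flagged


# 30 stable samples between 100 and 104 ms, followed by one 900 ms spike
series = [100.0 + (i % 5) for i in range(30)] + [900.0]
print(zscore_anomalies(series))  # [30]
```

The check is one-sided, matching `Z > threshold` above: only latencies that rise above the baseline are flagged.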

3.2.3 Isolation Forest ML Layer

The ML layer uses an unsupervised Isolation Forest model to detect unusual operational behavior patterns.

Input features:

latency_ms
status_code
success

The model is retrained on fresh telemetry stored in MySQL.

Unlike traditional rules, Isolation Forest can detect:

previously unseen failures
hidden operational anomalies
abnormal latency distributions
unusual API behavior

without requiring labeled training data.
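As a rough illustration, the snippet below fits scikit-learn's `IsolationForest` on synthetic rows with the three input features listed above. The data, `n_estimators`, and `random_state` are assumptions chosen for the example. The project's real training code may differ.

```python
import random

from sklearn.ensemble import IsolationForest

random.seed(0)

# Synthetic telemetry rows: [latency_ms, status_code, success]
healthy = [[random.gauss(200, 20), 200, 1] for _ in range(200)]
failing = [[4201.32, 500, 0]]

# No labels are passed: the model learns what is "easy to isolate" on its own
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(healthy + failing)

# predict() returns -1 for anomalies and 1 for inliers
labels = model.predict(failing)
# score_samples() returns lower values for more anomalous rows
scores = model.score_samples(failing)
print(labels, scores)
```

Rows that differ sharply from the bulk of the data are separated after very few random splits, which is why the failing row receives a low score.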

The deployment also includes:

anomaly deduplication
cooldown scheduling
lightweight retraining
memory optimizations

to make ML-based observability production-safe.
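One way to picture anomaly deduplication with a cooldown is a small gate keyed by API and anomaly type. The class name, the key, and the 300-second window are hypothetical and not taken from the project.

```python
class AnomalyCooldown:
    """Suppress repeat anomalies of the same (api_id, type) within a cooldown window."""

    def __init__(self, cooldown_s: float = 300.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_emitted: dict[tuple[int, str], float] = {}

    def should_emit(self, api_id: int, anomaly_type: str, now: float) -> bool:
        key = (api_id, anomaly_type)
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # duplicate inside the cooldown window
        self._last_emitted[key] = now
        return True


gate = AnomalyCooldown(cooldown_s=300)
print(gate.should_emit(5, "HIGH_LATENCY", now=0))    # True
print(gate.should_emit(5, "HIGH_LATENCY", now=120))  # False
print(gate.should_emit(5, "SERVER_ERROR", now=120))  # True
print(gate.should_emit(5, "HIGH_LATENCY", now=301))  # True
```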

🚨 3.3 Incident Correlation Engine

Raw anomalies alone create noisy monitoring systems.

To solve this, SRE Copilot introduces an Incident Correlation Engine.

Responsibilities:

Groups related anomalies
Detects repeated failures
Assigns severity
Deduplicates alerts
Summarizes operational behavior
Reduces LLM context size

Example:

500 individual request failures
        ↓
1 summarized incident

This improves:

operational clarity
RCA quality
token efficiency for the AI agent

and reduces alert fatigue.

Generated incidents are stored in:

incidents
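The grouping idea can be sketched as follows: anomalies for the same API that occur close together in time are merged into one incident. The 300-second gap, the field names, and the count-based severity ladder are assumptions for illustration only.

```python
from collections import defaultdict


def group_into_incidents(anomalies: list[dict], gap_s: float = 300.0) -> list[dict]:
    """Merge anomalies of the same API into incidents when they occur close together."""
    by_api: dict[int, list[dict]] = defaultdict(list)
    for a in sorted(anomalies, key=lambda a: a["ts"]):
        by_api[a["api_id"]].append(a)

    incidents = []
    for api_id, items in by_api.items():
        current = [items[0]]
        for a in items[1:]:
            if a["ts"] - current[-1]["ts"] <= gap_s:
                current.append(a)  # still part of the same burst
            else:
                incidents.append(_summarize(api_id, current))
                current = [a]
        incidents.append(_summarize(api_id, current))
    return incidents


def _summarize(api_id: int, items: list[dict]) -> dict:
    count = len(items)
    # Hypothetical severity ladder based only on anomaly count
    severity = "CRITICAL" if count >= 100 else "HIGH" if count >= 10 else "LOW"
    return {
        "api_id": api_id,
        "anomaly_count": count,
        "types": sorted({a["type"] for a in items}),
        "started_at": items[0]["ts"],
        "ended_at": items[-1]["ts"],
        "severity": severity,
    }


raw = [{"api_id": 5, "type": "SERVER_ERROR", "ts": float(i)} for i in range(500)]
incidents = group_into_incidents(raw)
print(len(incidents), incidents[0]["anomaly_count"], incidents[0]["severity"])  # 1 500 CRITICAL
```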

🤖 3.4 AI RCA Agent Architecture

The RCA subsystem is powered by:

LangChain + Groq LLaMA 3.3 70B

The AI agent autonomously gathers operational context using database tools.

Available tools:

get_incident_details
get_recent_anomalies
get_incident_history
get_latency_stats
get_recent_request_logs
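To give a sense of what a tool such as `get_latency_stats` might return, here is a sketch that summarizes request-log rows in plain Python. The function name `latency_stats`, the returned keys, and the nearest-rank percentile are assumptions. The real tool reads from MySQL and may compute different metrics.

```python
from statistics import mean


def latency_stats(logs: list[dict]) -> dict:
    """Summarize latency and failure rate for a list of request-log rows."""
    if not logs:
        return {"count": 0, "avg_ms": None, "max_ms": None, "p95_ms": None, "failure_rate": None}
    latencies = sorted(row["latency_ms"] for row in logs)
    failures = sum(1 for row in logs if not row["success"])
    n = len(latencies)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), using integer arithmetic
    p95_rank = (95 * n + 99) // 100
    return {
        "count": n,
        "avg_ms": round(mean(latencies), 2),
        "max_ms": latencies[-1],
        "p95_ms": latencies[p95_rank - 1],
        "failure_rate": failures / n,
    }


sample_logs = [
    {"latency_ms": 100.0, "success": True},
    {"latency_ms": 200.0, "success": True},
    {"latency_ms": 300.0, "success": True},
    {"latency_ms": 4200.0, "success": False},
]
stats = latency_stats(sample_logs)
print(stats)
```

Returning a handful of aggregates instead of raw rows keeps the agent's tool responses short.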

The agent performs:

root cause analysis
operational debugging
business impact estimation
remediation suggestions
prevention planning

Final RCA output includes:

Root Cause Analysis
Probable Technical Reasons
Business Impact
Recommended Fixes
Prevention Strategies
Severity Assessment

📡 3.5 End-to-End Operational Flow

The complete production flow is:

User adds API
        ↓
Async Monitoring Scheduler polls endpoint
        ↓
Telemetry stored in MySQL
        ↓
Intelligence Pipeline analyzes logs
        ↓
Anomalies generated
        ↓
Incident Engine groups failures
        ↓
LangChain RCA Agent investigates incidents
        ↓
Dashboard displays AI insights + alerts

This architecture enables SRE Copilot to behave like a lightweight autonomous observability engineer rather than a traditional monitoring dashboard.

🗄️ 4. Database Architecture

The backend database is designed around four core operational entities.

monitored_apis

Stores:

API endpoint
HTTP method
Polling interval
Monitoring metadata

request_logs

Stores:

latency
status codes
success/failure state
timestamps

anomalies

Stores:

anomaly type
severity
operational signals
ML/statistical detections

incidents

Stores:

grouped failures
incident summaries
severity levels
operational correlation data
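The four entities can be pictured as Python dataclasses. The field names below are guesses derived from the descriptions above. The authoritative definitions live in `server/database/schema.sql`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MonitoredAPI:
    id: int
    endpoint: str
    method: str = "GET"
    interval_s: int = 30  # polling interval in seconds


@dataclass
class RequestLog:
    api_id: int
    status_code: Optional[int]  # None when no response was received (assumption)
    latency_ms: float
    success: bool
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Anomaly:
    api_id: int
    anomaly_type: str  # e.g. HIGH_LATENCY, SERVER_ERROR
    severity: str
    detected_by: str   # "rule", "statistical", or "ml" (hypothetical labels)


@dataclass
class Incident:
    api_id: int
    summary: str
    severity: str
    anomaly_count: int


api = MonitoredAPI(id=5, endpoint="https://example.com/health")
log = RequestLog(api_id=api.id, status_code=500, latency_ms=4201.32, success=False)
print(api, log.success)
```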

🧠 5. AI Agent Architecture

The RCA Agent is implemented using LangChain’s create_agent.

Example tools:

from langchain.tools import tool

# Tool bodies (MySQL queries) are omitted; the docstrings are what the agent sees.

@tool
def get_incident_details(incident_id: int):
    """Fetches incident + API metadata."""

@tool
def get_recent_anomalies(api_id: int):
    """Fetches recent operational anomalies."""

@tool
def get_incident_history(api_id: int):
    """Fetches recurring incident patterns."""

@tool
def get_latency_stats(api_id: int):
    """Computes latency + failure metrics."""

@tool
def get_recent_request_logs(api_id: int) -> list[dict]:
    """Fetches recent request logs for operational debugging context."""

The Incident Engine acts as a context compressor for the LLM:

Raw request logs
        ↓
Incident grouping
        ↓
Structured operational summary
        ↓
Minimal LLM context
        ↓
High-quality RCA generation

This architecture prevents:

token explosion
noisy context
irrelevant telemetry

while improving RCA quality.
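A small sketch of the compression step: hundreds of raw log rows are collapsed into a few summary lines before they reach the LLM. The function name and the summary format are hypothetical.

```python
import json
from collections import Counter


def compress_for_llm(api_id: int, logs: list[dict]) -> str:
    """Collapse raw request logs into a few summary lines for an LLM prompt."""
    failures = [row for row in logs if not row["success"]]
    # Status codes are stringified so that a missing code (None) still sorts cleanly
    status_counts = Counter(str(row["status_code"]) for row in failures)
    worst = max((row["latency_ms"] for row in logs), default=0.0)
    lines = [
        f"api_id={api_id}",
        f"requests={len(logs)} failures={len(failures)}",
        "failure_status_codes=" + ", ".join(f"{code}x{n}" for code, n in sorted(status_counts.items())),
        f"max_latency_ms={worst}",
    ]
    return "\n".join(lines)


logs = [{"status_code": 500, "latency_ms": 4201.32, "success": False} for _ in range(500)]
summary = compress_for_llm(5, logs)
print(summary)
print(len(json.dumps(logs)), "chars raw vs", len(summary), "chars summarized")
```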

🛠️ 6. Tech Stack

Frontend

Technology	Purpose
React + Vite	UI framework
Tailwind CSS	Styling
Recharts	Observability charts
Framer Motion	Animations
Axios	API communication

Backend

Technology	Purpose
FastAPI	REST API server
AsyncIO	Concurrent monitoring
HTTPX	Async API polling
MySQL Connector	Database layer
python-dotenv	Environment configuration

AI / ML

Technology	Purpose
LangChain	AI agent framework
Groq LLaMA 3.3 70B	LLM inference
Scikit-learn	Isolation Forest
Pandas + NumPy	Data processing

Infrastructure

Technology	Purpose
Vercel	Frontend deployment
Railway	Backend + MySQL hosting
GitHub	CI/CD + auto deployment

📁 7. Repository Structure

SRE-Copilot/
│
├── client/                                  # React + Vite frontend dashboard
│   │
│   ├── src/
│   │   │
│   │   ├── components/
│   │   │   ├── charts/                      # Recharts telemetry visualizations
│   │   │   │   ├── LatencyChart.jsx         # API latency trend visualization
│   │   │   │   ├── ErrorRateChart.jsx       # Error rate monitoring chart
│   │   │   │   └── IncidentFrequencyChart.jsx # Incident frequency analytics
│   │   │   │
│   │   │   ├── incidents/
│   │   │   │   └── IncidentCard.jsx         # Incident preview UI component
│   │   │   │
│   │   │   ├── layout/
│   │   │   │   ├── Sidebar.jsx              # Main dashboard sidebar
│   │   │   │   ├── Navbar.jsx               # Top navigation bar
│   │   │   │   └── AppLayout.jsx            # Shared dashboard layout wrapper
│   │   │   │
│   │   │   └── ui/                          # Shared reusable UI components
│   │   │       ├── APIStatusBadge.jsx       # API health status indicator
│   │   │       ├── LoadingSpinner.jsx       # Loading state animation
│   │   │       ├── EmptyState.jsx           # Empty/fallback UI state
│   │   │       └── RCABox.jsx               # AI RCA response container
│   │   │
│   │   ├── hooks/
│   │   │   └── useDashboardData.js          # Auto-refresh polling hook
│   │   │
│   │   ├── pages/
│   │   │   ├── Dashboard.jsx                # Main observability overview
│   │   │   ├── AddAPI.jsx                   # API monitoring configuration page
│   │   │   ├── IncidentDetails.jsx          # Detailed incident RCA page
│   │   │   ├── APIDetails.jsx               # Per-API analytics page
│   │   │   ├── Landing.jsx                  # Marketing/landing page
│   │   │   ├── Settings.jsx                 # User/system settings page
│   │   │   └── NotFound.jsx                 # 404 fallback page
│   │   │
│   │   ├── services/
│   │   │   └── api.js                       # Centralized Axios API service
│   │   │
│   │   ├── App.jsx                          # Main React application
│   │   ├── main.jsx
│   │
│   ├── .env                                 # Frontend environment variables
│   ├── .gitignore
│   ├── vercel.json                          # Vercel deployment config
│   ├── vite.config.js
│   └── package.json
│
├── server/                                  # FastAPI backend + intelligence layer
│   ├── agent/
│   │   └── llm_agent.py                     # LangChain AI RCA agent
│   │
│   ├── database/
│   │   ├── db.py                            # MySQL connection manager
│   │   └── schema.sql                       # SQL table definitions
│   │
│   ├── intelligence/
│   │   ├── run_pipeline.py                  # Intelligence pipeline orchestrator
│   │   ├── statistical_detector.py          # Rule-based + statistical anomaly detection
│   │   ├── isolation_forest.py              # ML anomaly scoring engine
│   │   └── incident_engine.py               # Incident grouping/correlation engine
│   │
│   ├── scheduler/
│   │   └── monitor.py                       # Async API polling scheduler
│   │
│   ├── logs/                                # Runtime-generated monitoring logs
│   │
│   ├── .env                                 # Backend secrets/API keys
│   ├── .gitignore
│   ├── main.py                              # FastAPI routes + API server
│   ├── Procfile                             # Railway deployment start command
│   ├── requirements.txt                     # Backend dependencies
│   ├── test_detetctor.py                    # Rule-based detector test script
│   ├── test_if.py                           # ML testing script
│   └── test_incident_engine.py              # Pipeline testing script
│
│                           
├── README.md                                
└── architecture.png                         # System architecture diagram

🚀 8. Local Setup

Prerequisites

Python 3.10+
Node.js 18+
MySQL 8+

Backend Setup

cd server
pip install -r requirements.txt

Create .env:

DB_HOST=localhost
DB_USER=root
DB_PASSWORD=your_password
DB_NAME=api_monitoring
DB_PORT=3306
GROQ_API_KEY=your_groq_api_key

Run:

uvicorn main:app --reload

Frontend Setup

cd client
npm install

Create .env:

VITE_API_URL=http://127.0.0.1:8000

Run:

npm run dev

🔌 9. API Endpoints

Method	Endpoint	Description
`POST`	`/api/monitor`	Start monitoring an API
`POST`	`/api/stop/{api_id}`	Stop monitoring
`GET`	`/api/dashboard`	Dashboard KPI stats
`GET`	`/api/apis`	Live monitored APIs
`GET`	`/api/incidents`	All incidents
`GET`	`/api/anomalies`	All anomalies
`GET`	`/api/rca/{incident_id}`	Generate AI RCA
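A minimal client sketch for the GET endpoints, using only the standard library. It assumes the backend runs locally at the address from the setup section and that responses are JSON. The response shapes are not documented here, so the helpers return whatever the server sends.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local backend address from the setup section


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, timeout_s: float = 10.0):
    """GET an endpoint and decode the JSON body (assumes the API returns JSON)."""
    with urllib.request.urlopen(endpoint_url(path), timeout=timeout_s) as resp:
        return json.load(resp)


def fetch_rca(incident_id: int):
    """Request an AI-generated RCA for one incident."""
    return get_json(f"/api/rca/{incident_id}")


print(endpoint_url("/api/dashboard"))  # http://127.0.0.1:8000/api/dashboard
```

With the backend running, `get_json("/api/incidents")` lists incidents and `fetch_rca(incident_id)` triggers the RCA agent for one of them.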

🔮 10. Future Improvements

Multi-Browser API Validation using Playwright/Selenium to test APIs across Chrome, Firefox, Safari, and Edge
Multi-Region Monitoring for detecting region-specific outages, latency spikes, and DNS routing issues
Predictive Failure Forecasting using temporal ML models to predict outages before they occur
Slack / PagerDuty / Discord Integrations for AI-generated operational alerts and incident summaries
WebSocket-Based Real-Time Telemetry Streaming for instant dashboard updates and live anomaly feeds

📬 Interested in a Similar Project?

I build smart, ML-integrated applications and responsive web platforms. Let’s build something powerful together!

📧 shinjansaha00@gmail.com

🔗 LinkedIn Profile
