feat: port operational hardening (health probes, metrics, logging, batching, GPU monitoring) #3

Open: davidamacey wants to merge 8 commits into main from feat/port-operational-hardening

@davidamacey

Owner

Ports production-proven hardening back to the public main branch:

  • Health: /live (pure liveness) + /ready (real Triton is_server_live() gRPC probe + OpenSearch HTTP probe, 2s bound, 503 with per-service detail). /health stays as a back-compat alias of /ready; the container HEALTHCHECK moves to /live so a degraded dependency can't cascade through depends_on: service_healthy.
  • Metrics: http_request_duration_seconds histogram (method/route/status) + GET /metrics, with optional PROMETHEUS_MULTIPROC_DIR aggregation; Prometheus scrape job included.
  • Logging: request-id contextvar moved to src.core.logging (importable by out-of-process workers), merge_contextvars wired, and a real foreign_pre_chain formatter bug fixed (stdlib records crashed with tuple item-deletion).
  • Clients: gRPC message caps 100→512 MB (large raw detector heads); cluster_distance sort tolerates missing fields/mappings.
  • Triton tuning: 25 ms batch queue delay + 3 YOLO instances (Triton now forms batches of 8-16 under ingest instead of batch≈1).
  • Monitoring: dcgm-exporter (all GPUs) + a GPU Metrics Grafana dashboard.
  • Guardrails: max-file-size pre-commit ratchet (700 LOC, existing oversize modules grandfathered).
  • Tests: pytest scaffolding + integration tests for health, metrics, request-id, and prometheus scrape topology (14 tests, no live stack required).

New scripts/codegen/check_file_size.py enforces a per-file line ceiling
over src/ and scripts/ so module splits don't silently regress into
monoliths. Existing oversize modules are grandfathered in the hook's
exclude list until they are split.
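
A minimal sketch of how such a line-ceiling check could work, assuming pre-commit passes the staged file names as arguments and that grandfathered modules are filtered out by the hook's exclude list before the script runs. The function names and the message format are illustrative, not taken from scripts/codegen/check_file_size.py.

```python
"""Hypothetical per-file line-ceiling check (not the PR's actual script)."""
from pathlib import Path

MAX_LINES = 700  # ceiling quoted in the PR description


def count_lines(path: Path) -> int:
    """Count lines in a text file, skipping undecodable bytes."""
    with path.open("r", encoding="utf-8", errors="ignore") as fh:
        return sum(1 for _ in fh)


def find_oversize(paths, max_lines=MAX_LINES):
    """Return (path, line_count) for every file above the ceiling."""
    offenders = []
    for raw in paths:
        path = Path(raw)
        n = count_lines(path)
        if n > max_lines:
            offenders.append((path.as_posix(), n))
    return offenders


def main(argv) -> int:
    """Return 1 (hook failure) if any passed file exceeds the ceiling."""
    offenders = find_oversize(argv)
    for name, n in offenders:
        print(f"{name}: {n} lines exceeds the {MAX_LINES}-line ceiling")
    return 1 if offenders else 0
```

A hook entry point would call `main(sys.argv[1:])` and exit with its return value, so a non-zero result blocks the commit.
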
…reign_pre_chain formatter bug

- request_id_ctx / get_request_id now live in src.core.logging (with new
  bind_request_id / clear_request_id helpers) so service modules and
  out-of-process workers can import them without pulling in the FastAPI
  app; src.main re-exports for backward compatibility.
- merge_contextvars added first in the processor chain so request_id
  auto-attaches to every structlog event on the task.
- ProcessorFormatter.wrap_for_formatter removed from foreign_pre_chain:
  it wraps the event dict in a tuple and is only valid as the LAST
  structlog-native step; in the pre-chain it crashed stdlib log records
  (uvicorn, opensearch-py) with 'tuple' object does not support item
  deletion.
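
The request-id helpers can be pictured with a stdlib-only sketch. The names request_id_ctx, get_request_id, bind_request_id, and clear_request_id come from this PR; the bodies and the RequestIdFilter class are assumptions. Note that structlog's merge_contextvars only picks up values bound through structlog.contextvars.bind_contextvars, so the real helpers presumably bind there as well; that wiring is not shown here.

```python
import contextvars
import logging
import uuid

# Name mirrors the PR; default None means "no request in flight".
request_id_ctx = contextvars.ContextVar("request_id", default=None)


def bind_request_id(request_id=None) -> str:
    """Store a request id for the current context and return it."""
    rid = request_id or uuid.uuid4().hex
    request_id_ctx.set(rid)
    return rid


def get_request_id():
    """Return the current request id, or None outside a request."""
    return request_id_ctx.get()


def clear_request_id() -> None:
    """Drop the request id, e.g. when a worker finishes a job."""
    request_id_ctx.set(None)


class RequestIdFilter(logging.Filter):
    """Hypothetical filter: stamps stdlib log records with the request id."""

    def filter(self, record):
        record.request_id = get_request_id() or "-"
        return True
```

Because nothing here imports the FastAPI app, an out-of-process worker could call `bind_request_id(job_request_id)` at the start of a job and `clear_request_id()` at the end.
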
- /live: pure process liveness — never gated on dependencies. The
  Dockerfile HEALTHCHECK now targets it so a degraded downstream dep
  cannot mark the container unhealthy and cascade through
  depends_on: service_healthy.
- /ready: probes Triton via a real is_server_live() gRPC round-trip
  (the pool's active_connections count stays 0 until the first infer,
  so gating on it deadlocked dependent containers at startup) and
  OpenSearch via HTTP, each bounded to 2s; returns 503 with per-service
  detail when any dep is down.
- /health: kept as a backward-compat alias for /ready.
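
The liveness/readiness split above can be sketched without FastAPI. This is an illustration under assumptions: probes are passed in as async callables returning a bool, and the response shape is invented. A blocking call such as the Triton client's is_server_live() would need wrapping (for example with asyncio.to_thread) before being used this way.

```python
import asyncio

PROBE_TIMEOUT_S = 2.0  # per-dependency bound from the PR description


async def run_probe(probe, timeout=PROBE_TIMEOUT_S) -> dict:
    """Await one probe; map timeouts and errors to a 'down' detail."""
    try:
        ok = await asyncio.wait_for(probe(), timeout)
        return {"status": "up" if ok else "down"}
    except asyncio.TimeoutError:
        return {"status": "down", "error": f"timed out after {timeout}s"}
    except Exception as exc:  # a failing dependency must not crash /ready
        return {"status": "down", "error": repr(exc)}


async def readiness(probes: dict, timeout=PROBE_TIMEOUT_S):
    """Run all probes concurrently; return (http_status, per-service detail)."""
    names = list(probes)
    results = await asyncio.gather(
        *(run_probe(probes[name], timeout) for name in names)
    )
    detail = dict(zip(names, results))
    ready = all(r["status"] == "up" for r in results)
    return (200 if ready else 503), detail


def liveness():
    """Pure process liveness: never consults dependencies."""
    return 200, {"status": "alive"}
```

Running the probes concurrently keeps the worst-case /ready latency near the single 2s bound rather than the sum of both timeouts.
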
- src/core/metrics.py: prometheus_client Histogram labeled by method,
  route template, and status; render_metrics() supports the optional
  PROMETHEUS_MULTIPROC_DIR aggregation mode for multi-worker uvicorn.
- http_duration_middleware records every request using the matched
  route template, with a low-cardinality path fallback for 404s.
- GET /metrics exposition endpoint + prometheus scrape job for
  yolo-api.
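
To illustrate why the route template matters, here is a stdlib-only sketch of the labeling step. The fallback value, the helper names, and the `observe` callback are assumptions; in the real middleware the callback would be something like `histogram.labels(method, route, status).observe(seconds)` from prometheus_client.

```python
import time

UNMATCHED = "__unmatched__"  # hypothetical low-cardinality fallback label


def route_label(route_template):
    """Use the matched template (e.g. '/items/{item_id}'), never the raw path."""
    return route_template if route_template else UNMATCHED


def labels_for(method, route_template, status_code):
    """Build the (method, route, status) label tuple for one request."""
    return (method.upper(), route_label(route_template), str(status_code))


def timed(handler, method, route_template, observe):
    """Call handler(), then report (labels, seconds) to `observe`."""
    start = time.perf_counter()
    status = 500  # assumed status if the handler raises
    try:
        status = handler()
        return status
    finally:
        elapsed = time.perf_counter() - start
        observe(labels_for(method, route_template, status), elapsed)
```

Labeling by template keeps the number of time series bounded: every request to `/items/1`, `/items/2`, and so on lands in one series, and unmatched paths (404s) collapse into a single label instead of one series per probed URL.
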
…nce sort

- Triton gRPC send/receive caps 100MB -> 512MB: raw detector heads can
  emit hundreds of MB per max-batch response; 100MB rejected legitimate
  full-batch replies.
- cluster_distance sort gains missing:_last + unmapped_type:double so
  paging a cluster tolerates docs without the field and freshly created
  indices without the mapping (previously a 400 shard failure).
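
Both client changes amount to small pieces of configuration, sketched here as plain Python data. The gRPC option keys are the standard channel-argument names; reading 512MB as 512 MiB, the `cluster_id` field, and the `cluster_page_body` helper are assumptions for illustration.

```python
MAX_GRPC_MESSAGE_BYTES = 512 * 1024 * 1024  # assumes "512MB" means 512 MiB

# Standard gRPC channel arguments for the send/receive caps.
GRPC_CHANNEL_OPTIONS = [
    ("grpc.max_send_message_length", MAX_GRPC_MESSAGE_BYTES),
    ("grpc.max_receive_message_length", MAX_GRPC_MESSAGE_BYTES),
]


def cluster_page_body(cluster_id, size=50, search_after=None):
    """Build a search body that pages one cluster, sorted by cluster_distance."""
    body = {
        "size": size,
        "query": {"term": {"cluster_id": cluster_id}},  # hypothetical field
        "sort": [
            {
                "cluster_distance": {
                    "order": "asc",
                    "missing": "_last",         # docs without the field sort last
                    "unmapped_type": "double",  # tolerate indices lacking the mapping
                }
            }
        ],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body
```

Without `unmapped_type`, sorting on a field that has no mapping in a freshly created index fails the shard, which matches the 400 described above.
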

Under sustained ingest the per-image arrival rate is too slow to fill
preferred-size batches within the previous 5 ms queue delay, so Triton
fired near batch=1. A 25 ms delay reaches batch 8-16 within the latency
budget; a third YOLO instance (~1.5 GB) pipelines concurrent ingest
batches through more GPU streams.

dcgm-exporter publishes per-card utilization, VRAM, power, and
temperature for every host GPU (read-only, no compute reservation);
Prometheus scrapes it and the new 'GPU Metrics' Grafana dashboard
renders GPU + host panels.
…uest-id, scrape config

- conftest.py excludes standalone live-deployment scripts from pytest
  collection (they define test_*(name,...) helpers pytest miscollects).
- Health probe tests run against the router with probes monkeypatched —
  no live Triton/OpenSearch needed.
- Metrics + request-id tests exercise the real app middleware wiring.
- Scrape-config test pins the prometheus job topology and validates
  every static target against docker-compose services.
- Synthetic fixture image (generated shapes, 640x480).
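
The scrape-config check could look roughly like the sketch below, which works on already-parsed YAML (plain dicts) so it needs no live stack. The helper names are hypothetical, and it assumes each static target is written as `service:port` with the host part matching a docker-compose service name.

```python
def static_targets(prometheus_cfg: dict):
    """Yield (job_name, 'host:port') for every static scrape target."""
    for job in prometheus_cfg.get("scrape_configs", []):
        for static_cfg in job.get("static_configs", []):
            for target in static_cfg.get("targets", []):
                yield job["job_name"], target


def unknown_targets(prometheus_cfg: dict, compose_cfg: dict):
    """Return targets whose host is not a docker-compose service."""
    services = set(compose_cfg.get("services", {}))
    return [
        (job, target)
        for job, target in static_targets(prometheus_cfg)
        if target.rsplit(":", 1)[0] not in services
    ]
```

A test would load both files, call `unknown_targets(...)`, and assert the result is empty, so renaming a compose service without updating the scrape job fails fast.
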