feat: YOLO26 + YOLO11 dual serving with per-family export toolchains #5

Merged: davidamacey merged 8 commits into main from feat/yolo26-dual-serving on Jul 4, 2026

@davidamacey


Stacked on #4 (auto-retargets when it merges).

Serves YOLO11 and YOLO26 side by side in the same Triton + API instance:

  • Dual export toolchains in one image: the proven YOLO11 EfficientNMS end2end path keeps its exact production pin (ultralytics==8.3.253) in an isolated /opt/venv-y11 (CPU torch, ~5 GB saved); export_models.py transparently re-execs into it, so documented CLI commands are unchanged. The main env moves to ultralytics 8.4.x and gains export/export_yolo26.py — native NMS-free fused export (single (batch,300,6) tensor, no plugin), config.pbtxt written from the introspected ONNX output.
  • Signature-based adapter registry (src/clients/model_adapters.py): the output contract is resolved from Triton model metadata, not names — EfficientNMS 4-tensor and fused single-tensor engines both parse to the same normalized result. Hardcoded model literals removed from triton_client.py.
  • YOLO_MODEL env selects the default detector (default unchanged: yolov11_small_trt_end2end); /detect?model_name= selects per request; /models/{name}/load|unload add/remove either family at runtime.
  • /models upload+export API auto-routes end2end (YOLO26) .pt uploads to the native toolchain (detected via the model's end2end attribute).
  • Ships models/yolo26_small_trt/ repo entry; README + migration docs; 6 adapter behavior tests (20 integration tests total pass).
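
To illustrate what "config.pbtxt written from the introspected ONNX output" could look like, here is a minimal sketch. It assumes the tensor name, dtype, and shape have already been read from the ONNX graph. `TensorSpec`, `render_config`, the tensor names `images` and `output0`, the 640x640 input, and the FP32 dtype are hypothetical stand-ins, not values taken from export/export_yolo26.py.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TensorSpec:
    """One model tensor as it might be read from the ONNX graph (hypothetical type)."""
    name: str
    dtype: str            # Triton dtype string, e.g. "TYPE_FP32"
    shape: Sequence[int]  # full shape with the batch axis first; -1 marks a dynamic axis


def _block(kind: str, spec: TensorSpec) -> str:
    # With max_batch_size > 0, Triton configs list dims without the batch axis.
    dims = ", ".join(str(d) for d in spec.shape[1:])
    return (
        f"{kind} [\n"
        f"  {{\n"
        f'    name: "{spec.name}"\n'
        f"    data_type: {spec.dtype}\n"
        f"    dims: [ {dims} ]\n"
        f"  }}\n"
        f"]\n"
    )


def render_config(model_name: str, max_batch: int,
                  inp: TensorSpec, out: TensorSpec) -> str:
    """Render a Triton config.pbtxt for a single-input, single-output TensorRT engine."""
    header = (
        f'name: "{model_name}"\n'
        f'platform: "tensorrt_plan"\n'
        f"max_batch_size: {max_batch}\n"
    )
    return header + _block("input", inp) + _block("output", out)


# Assumed example values: input/output names, dtype, and resolution are illustrative.
config_text = render_config(
    "yolo26_small_trt",
    32,
    TensorSpec("images", "TYPE_FP32", (-1, 3, 640, 640)),
    TensorSpec("output0", "TYPE_FP32", (-1, 300, 6)),
)
```

Deriving the output block from the graph rather than a hand-written template keeps the config in step with whatever the exporter actually produced.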

- Serving/main env moves to ultralytics>=8.4.82,<8.5 (YOLO26 support:
  native NMS-free export + 8.4-era .pt deserialization).
- The proven YOLO11 EfficientNMS export path keeps its exact production
  pin (ultralytics==8.3.253) in an isolated /opt/venv-y11 built from
  requirements-export-y11.txt (CPU torch — engine builds use the
  tensorrt wheels, not torch, saving ~5 GB).
- export_models.py gains a toolchain guard: launched under >=8.4 it
  transparently re-execs into /opt/venv-y11, so the documented CLI keeps
  working unchanged; without the venv it fails with a clear message.
- Both toolchains compile engines against tensorrt-cu13==11.0.0.114
  (Triton 26.06 match). EfficientNMS_TRT verified present in the 26.06
  image's libnvinfer_plugin.so.11.0.0.
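
The toolchain guard described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in export_models.py: the interpreter path inside /opt/venv-y11, the function names, and the error wording are all hypothetical.

```python
import os
import sys
from typing import List, Optional, Sequence, Tuple

Y11_PYTHON = "/opt/venv-y11/bin/python"  # assumed interpreter location inside the venv


def _major_minor(version: str) -> Tuple[int, int]:
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def reexec_argv(ultralytics_version: str,
                argv: Sequence[str],
                venv_python: str = Y11_PYTHON,
                venv_exists: Optional[bool] = None) -> Optional[List[str]]:
    """Return the argv to re-exec with, or None when already on the legacy toolchain."""
    if _major_minor(ultralytics_version) < (8, 4):
        return None  # already running under the pinned 8.3.x environment
    if venv_exists is None:
        venv_exists = os.path.exists(venv_python)
    if not venv_exists:
        raise RuntimeError(
            f"YOLO11 export needs the isolated toolchain at {venv_python}, "
            "which was not found"
        )
    return [venv_python, *argv]


def guard(ultralytics_version: str) -> None:
    """Replace the current process with the venv interpreter when required."""
    target = reexec_argv(ultralytics_version, sys.argv)
    if target is not None:
        os.execv(target[0], target)  # does not return on success
```

Keeping the decision in a pure function (`reexec_argv`) and the side effect in `guard` makes the routing testable without actually replacing the process.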
…rving

- export/export_yolo26.py: fused single-tensor (batch,300,6) export via
  stock ultralytics >=8.4 (no plugin, no patch), TRT-11 engine build with
  dynamic batch profile, config.pbtxt written from the introspected ONNX
  output, labels from model.names, --custom-model support.
- src/clients/model_adapters.py: detection adapters resolved from Triton
  model METADATA (not names) — End2EndNMSAdapter (num_dets/det_boxes/
  det_scores/det_classes) and FusedDetAdapter ((...,6) single tensor) —
  so YOLO11 end2end and YOLO26 engines serve side by side through the
  same endpoints.
- triton_client: all YOLO paths go through the adapter registry; the two
  hardcoded 'yolov11_small_trt_end2end' literals are gone; batch path
  accepts model_name.
- YOLO_MODEL env var selects the default detector; /detect model_name
  param selects per request.
- /models upload+export API routes end2end (YOLO26) uploads to the
  native toolchain automatically (validated via the model's end2end
  attribute) and auto-loads the fused engine.
- Ships models/yolo26_small_trt/ repo entry (config + labels).
- README + migration guide: YOLO26 export/load/serve alongside YOLO11,
  YOLO_MODEL default switch, runtime load/unload.
- tests/integration/test_model_adapters.py: signature resolution (both
  contracts + reject), end2end truncation, fused zero-score padding drop
  (6 tests).
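
As a rough sketch of the signature-based resolution and the two parsing behaviours the tests cover (end2end truncation, fused zero-score padding drop): the function names, the plain-dict metadata shape, and the fused row layout `x1, y1, x2, y2, score, class` are assumptions for illustration; the real adapters live in src/clients/model_adapters.py and may differ.

```python
from typing import Dict, List, Sequence

END2END_OUTPUTS = {"num_dets", "det_boxes", "det_scores", "det_classes"}


def resolve_adapter(outputs: Sequence[dict]) -> str:
    """Pick an adapter from output metadata such as [{'name': ..., 'shape': [...]}]."""
    names = {o["name"] for o in outputs}
    if names == END2END_OUTPUTS:
        return "end2end_nms"
    if len(outputs) == 1 and int(list(outputs[0]["shape"])[-1]) == 6:
        return "fused_det"
    raise ValueError(f"unsupported output signature: {sorted(names)}")


def _det(box: Sequence[float], score: float, class_id: float) -> Dict:
    return {"box": [float(v) for v in box],
            "score": float(score),
            "class_id": int(class_id)}


def parse_end2end(num_dets: int, boxes, scores, classes) -> List[Dict]:
    """EfficientNMS contract: keep only the first num_dets rows of each tensor."""
    n = int(num_dets)
    return [_det(boxes[i], scores[i], classes[i]) for i in range(n)]


def parse_fused(rows) -> List[Dict]:
    """Fused contract: one (300, 6) tensor; rows with zero score are padding."""
    return [_det(r[:4], r[4], r[5]) for r in rows if r[4] > 0.0]


# Both contracts normalize to the same structure for the same detection.
e2e = parse_end2end(1, [[10, 20, 30, 40], [0, 0, 0, 0]], [0.9, 0.0], [2, 0])
fused = parse_fused([[10, 20, 30, 40, 0.9, 2], [0, 0, 0, 0, 0.0, 0]])
```

Resolving on the output signature means a renamed or newly uploaded model needs no code change, as long as its outputs match one of the known contracts.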
…guards

TRT 11 removed BuilderFlag.FP16, Builder.platform_has_fast_fp16, and
trtexec --fp16 (strongly-typed builds only; precision follows ONNX
dtypes):

- trt_utils.enable_fp16(): guarded no-op on TRT 11 (typed FP32 engines
  run with TF32 tensor cores on Ampere+); all six legacy export scripts
  route through it instead of touching the removed flag.
- YOLO26: FP16 baked into the ONNX via NVIDIA ModelOpt AutoCast
  (ultralytics helper), then built with an explicit profile that bounds
  EVERY dynamic axis — the export leaves H/W dynamic and an unbounded
  spatial axis makes TRT budget 12+ GB activation tactics that fail on
  consumer GPUs. yolo26_small capped at batch 32 for 12 GB cards.
- nvidia-modelopt[onnx] added to requirements (onnx bound raised to <1.22
  to match); venv-y11 keeps its own tighter pins.
- paddleocr_rec exporter: trtexec path fixed for 26.06 (/usr/bin),
  container name env-overridable (TRITON_CONTAINER), --fp16 dropped.
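
A guarded helper in the spirit of trt_utils.enable_fp16() might look like the sketch below. It is an assumption about the shape of that helper, not its actual code: the signature (passing the `tensorrt` module in explicitly) and the boolean return value are hypothetical, chosen so the behaviour can be shown with stand-in objects instead of a real TensorRT install.

```python
from types import SimpleNamespace


def enable_fp16(trt_module, builder, config) -> bool:
    """Set the FP16 builder flag where it exists; do nothing where it was removed.

    Returns True if the flag was set, False if the call was a no-op.
    """
    flag = getattr(getattr(trt_module, "BuilderFlag", None), "FP16", None)
    if flag is None:
        # Strongly-typed builds only: precision follows the ONNX dtypes.
        return False
    if not getattr(builder, "platform_has_fast_fp16", False):
        return False
    config.set_flag(flag)
    return True


class _FakeConfig:
    """Stand-in for a builder config that records the flags it receives."""

    def __init__(self):
        self.flags = []

    def set_flag(self, flag):
        self.flags.append(flag)


# Stand-in for a TensorRT version that still exposes BuilderFlag.FP16.
legacy_trt = SimpleNamespace(BuilderFlag=SimpleNamespace(FP16="FP16"))
legacy_cfg = _FakeConfig()
legacy_result = enable_fp16(
    legacy_trt, SimpleNamespace(platform_has_fast_fp16=True), legacy_cfg
)

# Stand-in for a TensorRT version where the flag and the capability probe are gone.
typed_trt = SimpleNamespace(BuilderFlag=SimpleNamespace())
typed_cfg = _FakeConfig()
typed_result = enable_fp16(typed_trt, SimpleNamespace(), typed_cfg)
```

Routing every exporter through one helper confines the version check to a single place instead of six scripts.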

Verified on GPU 1: all 8 engines build and save under TRT 11.0.0.114
(yolo11 end2end 41 MB, yolo26 FP16 23 MB, scrfd, arcface, mobileclip x2,
paddleocr det+rec).
Verified end-to-end on live hardware (all 8 engines READY, 20/20
endpoint tests, dual-family detect, dynamic load/unload):

- FP16 baking per model family: EfficientNMS graphs use the
  onnxconverter-common rewrite (plugin op block-listed); YOLO26 and the
  CLIP image encoder use ModelOpt AutoCast via the ultralytics wrapper
  (needs ORT-executable graphs + calibration shapes); CLIP text encoder
  stays typed FP32 (token-id input); paddleocr rec baked + built via
  trtexec (container name now env-overridable via TRITON_CONTAINER).
- apt-mark hold on TensorRT packages in Dockerfile.triton: NVIDIA's apt
  repo offers TRT 11.1 over the image's 11.0.0.114 and a silent upgrade
  invalidates every client-built engine.
- yolov11 end2end config: det_boxes/det_scores are TYPE_FP16 (the
  EfficientNMS plugin emits at baked precision); exporter template
  aligned.
- /detect confidence filter is now unconditional: NMS-free YOLO26
  engines emit all top-K candidates (near-zero scores included) and
  rely on the filter to discard them; it is a no-op for end2end YOLO11
  (0.25 threshold baked at export).
- Instance counts right-sized as a 12 GB-card baseline with scale-up
  comments (yolo 2, scrfd/arcface/clip 2, rec/bls 1).
- test_endpoints.sh accepts the /health->/ready status contract.
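
The unconditional confidence filter can be pictured with a small sketch. The function name, the detection dict layout, and the 0.25 default are assumptions for illustration (0.25 mirrors the threshold baked into the YOLO11 export; the /detect default is not stated here).

```python
from typing import Dict, List


def filter_by_confidence(detections: List[Dict], threshold: float = 0.25) -> List[Dict]:
    """Keep detections scoring at or above the threshold, whatever the model family."""
    return [d for d in detections if d["score"] >= threshold]


# A fused NMS-free engine returns every top-K slot, low scores included.
fused_raw = [{"score": 0.91}, {"score": 0.40}, {"score": 0.003}, {"score": 0.0}]
fused_kept = filter_by_confidence(fused_raw)

# An end2end engine already dropped everything below its baked threshold.
end2end_raw = [{"score": 0.91}, {"score": 0.40}]
end2end_kept = filter_by_confidence(end2end_raw)
```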
…ans clean

Trivy (fixable HIGH/CRITICAL, --ignore-unfixed):
- API image: CLEAN
- Triton image: was 13 HIGH — 11 Go-stdlib CVEs in the Nsight Systems
  profiler CLI (dev tooling, unused at inference; removed) and 2 in
  starlette 0.49.3 (upgraded >=1.3.1) — now CLEAN.

Zero fixable HIGH/CRITICAL across both shipped images.
…ep assertion

Policy change per review: engines are always re-exported from ONNX at
deployment, so the image now takes the newest TensorRT from NVIDIA's apt
repo instead of holding the NGC tag's stock version. A build-time
assertion (TRT_VERSION arg) fails the image build the moment apt brings
a different TRT than the client-side tensorrt-cu13 pip pins — the two
can only ever move together, in one commit.
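
The lockstep check amounts to comparing the TensorRT version apt installed with the pip pin. The PR implements it as a build-time assertion driven by the TRT_VERSION arg; the Python below is only a sketch of the same comparison, with hypothetical function names and an assumed apt version format in which a packaging suffix follows the first hyphen.

```python
import re


def pinned_trt_version(requirements_text: str, package: str = "tensorrt-cu13") -> str:
    """Extract the exact '==' pin for the package from requirements-style text."""
    pattern = re.compile(rf"^{re.escape(package)}==(\S+)\s*$", re.MULTILINE)
    match = pattern.search(requirements_text)
    if match is None:
        raise ValueError(f"no exact pin found for {package}")
    return match.group(1)


def assert_lockstep(apt_version: str, requirements_text: str) -> None:
    """Fail when the server-side TensorRT differs from the client-side pip pin."""
    pinned = pinned_trt_version(requirements_text)
    # Assumption: apt versions may carry a packaging suffix after the first hyphen.
    upstream = apt_version.split("-", 1)[0]
    if upstream != pinned:
        raise AssertionError(
            f"TensorRT mismatch: apt installed {upstream}, pip pin is {pinned}"
        )


requirements = "onnx<1.22\ntensorrt-cu13==11.1.0.106\n"
assert_lockstep("11.1.0.106-1+cuda13.0", requirements)  # versions agree, no error
```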

- tensorrt-cu13==11.1.0.106 in requirements.txt, requirements-export-y11
  and pyproject (onnx bound synced to <1.22).
- All 8 engines re-exported and E2E-verified on TRT 11.1: models READY,
  YOLO11 + YOLO26 detection parity, 20/20 endpoint tests.
- tests/test_endpoints.sh gains a 'dual' target: dynamically load
  yolo26_small_trt, detect with both families side by side, unload and
  assert inference is refused, reload. Skips gracefully when the yolo26
  engine hasn't been exported. Full suite: 25 passed, 0 failed, 0
  skipped against the live 26.06/TRT-11.1 stack.
- FP16 bakes now respect each exporter's precision flag; paddleocr_det
  defaults to FP32 — borderline text detection is threshold-sensitive
  to FP16 (verified: synthetic caution_sign detected at FP32, missed at
  FP16) and the engine is small.
- OCR synthetic fixtures generated via scripts/create_ocr_test_images.py
  (test_images/ is gitignored; the suite skips when absent).
@davidamacey davidamacey changed the base branch from feat/triton-2606-cve-refresh to main July 4, 2026 15:06
@davidamacey davidamacey merged commit 49111d6 into main Jul 4, 2026
@davidamacey davidamacey deleted the feat/yolo26-dual-serving branch July 4, 2026 15:07