feat: YOLO26 + YOLO11 dual serving with per-family export toolchains #5
Merged
- Serving/main env moves to ultralytics>=8.4.82,<8.5 (YOLO26 support: native NMS-free export + 8.4-era .pt deserialization).
- The proven YOLO11 EfficientNMS export path keeps its exact production pin (ultralytics==8.3.253) in an isolated /opt/venv-y11 built from requirements-export-y11.txt (CPU torch; engine builds use the tensorrt wheels, not torch, saving ~5 GB).
- export_models.py gains a toolchain guard: launched under >=8.4 it transparently re-execs into /opt/venv-y11, so the documented CLI keeps working unchanged; without the venv it fails with a clear message.
- Both toolchains compile engines against tensorrt-cu13==11.0.0.114 (Triton 26.06 match). EfficientNMS_TRT verified present in the 26.06 image's libnvinfer_plugin.so.11.0.0.
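The toolchain guard described above could look roughly like the sketch below. It is not the code in export_models.py: the function names, the interpreter path inside the venv (`/opt/venv-y11/bin/python`), and passing the version in as a string are all assumptions made for illustration.

```python
import os

# Assumed interpreter location inside the isolated venv (hypothetical path).
Y11_PYTHON = "/opt/venv-y11/bin/python"


def needs_y11_reexec(ultralytics_version: str) -> bool:
    """True when the running ultralytics is 8.4 or newer, i.e. not the 8.3.x pin."""
    major, minor = (int(part) for part in ultralytics_version.split(".")[:2])
    return (major, minor) >= (8, 4)


def guard_toolchain(ultralytics_version: str, argv: list, python: str = Y11_PYTHON) -> None:
    """Re-exec into the YOLO11 venv when launched under the wrong toolchain."""
    if not needs_y11_reexec(ultralytics_version):
        return  # already running on the pinned 8.3.x toolchain
    if not os.path.exists(python):
        # Fail with a clear message instead of exporting with the wrong version.
        raise SystemExit(
            f"YOLO11 export needs {python}; build it from requirements-export-y11.txt"
        )
    # Replaces the current process with the venv interpreter; does not return.
    os.execv(python, [python, *argv])
```

A caller would pass `ultralytics.__version__` and `sys.argv`, so the same command line is replayed unchanged inside the venv.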
…rving
- export/export_yolo26.py: fused single-tensor (batch,300,6) export via stock ultralytics >=8.4 (no plugin, no patch), TRT-11 engine build with dynamic batch profile, config.pbtxt written from the introspected ONNX output, labels from model.names, --custom-model support.
- src/clients/model_adapters.py: detection adapters are resolved from Triton model METADATA (not names). End2EndNMSAdapter (num_dets/det_boxes/det_scores/det_classes) and FusedDetAdapter ((...,6) single tensor) let YOLO11 end2end and YOLO26 engines serve side by side through the same endpoints.
- triton_client: all YOLO paths go through the adapter registry; the two hardcoded 'yolov11_small_trt_end2end' literals are gone; batch path accepts model_name.
- YOLO_MODEL env var selects the default detector; /detect model_name param selects per request.
- /models upload+export API routes end2end (YOLO26) uploads to the native toolchain automatically (validated via the model's end2end attribute) and auto-loads the fused engine.
- Ships models/yolo26_small_trt/ repo entry (config + labels).
- README + migration guide: YOLO26 export/load/serve alongside YOLO11, YOLO_MODEL default switch, runtime load/unload.
- tests/integration/test_model_adapters.py: signature resolution (both contracts + reject), end2end truncation, fused zero-score padding drop (6 tests).
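The adapter behaviour listed above (signature resolution, end2end truncation, fused zero-score padding drop) can be sketched in plain Python. This is an illustration, not the code in model_adapters.py: the function names, the dict-based detection format, and the fused column order `[x1, y1, x2, y2, score, class]` are assumptions.

```python
from typing import Dict, List, Sequence

# Normalized result shared by both contracts (hypothetical format).
Detection = Dict[str, object]

END2END_OUTPUTS = {"num_dets", "det_boxes", "det_scores", "det_classes"}


def resolve_adapter(outputs: Sequence[dict]) -> str:
    """Pick a contract from metadata outputs: [{"name": ..., "shape": [...]}, ...]."""
    names = {out["name"] for out in outputs}
    if names == END2END_OUTPUTS:
        return "end2end"
    if len(outputs) == 1 and outputs[0]["shape"][-1] == 6:
        return "fused"
    raise ValueError(f"unsupported detector outputs: {sorted(names)}")


def parse_end2end(num_dets: int, boxes, scores, classes) -> List[Detection]:
    """EfficientNMS pads to a fixed length; keep only the first num_dets rows."""
    return [
        {"box": list(boxes[i]), "score": float(scores[i]), "class_id": int(classes[i])}
        for i in range(num_dets)
    ]


def parse_fused(rows) -> List[Detection]:
    """Assumed row layout [x1, y1, x2, y2, score, class]; drop zero-score padding."""
    return [
        {"box": list(row[:4]), "score": float(row[4]), "class_id": int(row[5])}
        for row in rows
        if row[4] > 0
    ]
```

Because the choice depends only on output names and shapes, a renamed or custom model with the same outputs resolves to the same adapter.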
…guards
TRT 11 removed BuilderFlag.FP16, Builder.platform_has_fast_fp16, and trtexec --fp16 (strongly-typed builds only; precision follows ONNX dtypes):
- trt_utils.enable_fp16(): guarded no-op on TRT 11 (typed FP32 engines run with TF32 tensor cores on Ampere+); all six legacy export scripts route through it instead of touching the removed flag.
- YOLO26: FP16 baked into the ONNX via NVIDIA ModelOpt AutoCast (ultralytics helper), then built with an explicit profile that bounds EVERY dynamic axis. The export leaves H/W dynamic, and an unbounded spatial axis makes TRT budget 12+ GB activation tactics that fail on consumer GPUs. yolo26_small capped at batch 32 for 12 GB cards.
- nvidia-modelopt[onnx] added to requirements (onnx bound raised to <1.22 to match); venv-y11 keeps its own tighter pins.
- paddleocr_rec exporter: trtexec path fixed for 26.06 (/usr/bin), container name env-overridable (TRITON_CONTAINER), --fp16 dropped.
Verified on GPU 1: all 8 engines build and save under TRT 11.0.0.114 (yolo11 end2end 41 MB, yolo26 FP16 23 MB, scrfd, arcface, mobileclip x2, paddleocr det+rec).
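A guarded no-op of the kind described for trt_utils.enable_fp16() might be structured as below. The signature is an assumption (the real helper presumably imports tensorrt itself); here the module is passed in so the sketch runs without TensorRT installed, and the stand-in objects at the bottom are fakes, not real tensorrt types.

```python
from types import SimpleNamespace


def enable_fp16(trt_module, builder, config) -> bool:
    """Set the FP16 builder flag when the API still has it; otherwise do nothing."""
    flag = getattr(getattr(trt_module, "BuilderFlag", None), "FP16", None)
    if flag is None:
        return False  # flag removed: precision follows the ONNX dtypes
    if not getattr(builder, "platform_has_fast_fp16", False):
        return False  # attribute missing or hardware without fast FP16
    config.set_flag(flag)
    return True


class FakeConfig:
    """Stand-in for a builder config; records the flags it is given."""

    def __init__(self):
        self.flags = []

    def set_flag(self, flag):
        self.flags.append(flag)


# Stand-ins for the two API generations (not real tensorrt modules).
legacy_trt = SimpleNamespace(BuilderFlag=SimpleNamespace(FP16="FP16"))
typed_trt = SimpleNamespace(BuilderFlag=SimpleNamespace())
fast_builder = SimpleNamespace(platform_has_fast_fp16=True)
```

Routing every exporter through one helper like this keeps the version check in a single place, so scripts never reference the removed attribute directly.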
Verified end-to-end on live hardware (all 8 engines READY, 20/20 endpoint tests, dual-family detect, dynamic load/unload):
- FP16 baking per model family: EfficientNMS graphs use the onnxconverter-common rewrite (plugin op block-listed); YOLO26 and the CLIP image encoder use ModelOpt AutoCast via the ultralytics wrapper (needs ORT-executable graphs + calibration shapes); CLIP text encoder stays typed FP32 (token-id input); paddleocr rec baked + built via trtexec (container name now env-overridable via TRITON_CONTAINER).
- apt-mark hold on TensorRT packages in Dockerfile.triton: NVIDIA's apt repo offers TRT 11.1 over the image's 11.0.0.114 and a silent upgrade invalidates every client-built engine.
- yolov11 end2end config: det_boxes/det_scores are TYPE_FP16 (the EfficientNMS plugin emits at baked precision); exporter template aligned.
- /detect confidence filter is now unconditional: NMS-free YOLO26 engines emit all top-K candidates (near-zero scores included) and rely on it; a no-op for end2end YOLO11 (0.25 baked at export).
- Instance counts right-sized as a 12 GB-card baseline with scale-up comments (yolo 2, scrfd/arcface/clip 2, rec/bls 1).
- test_endpoints.sh accepts the /health->/ready status contract.
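The effect of the unconditional confidence filter can be shown with a minimal sketch. The function name and the dict-based detection format are hypothetical; the 0.25 used in the example is the export-time YOLO11 threshold mentioned above, applied here only for illustration.

```python
def filter_by_confidence(detections, threshold):
    """Keep detections scoring at or above the threshold, preserving order."""
    return [det for det in detections if det["score"] >= threshold]


# NMS-free output: top-K candidates, including near-zero scores.
yolo26_raw = [{"score": 0.91}, {"score": 0.40}, {"score": 0.003}, {"score": 0.0}]

# End2end output: already thresholded at export, so the filter changes nothing.
yolo11_raw = [{"score": 0.88}, {"score": 0.31}]
```

With a threshold of 0.25, `yolo26_raw` shrinks to its first two entries while `yolo11_raw` comes back unchanged.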
…ans clean
Trivy (fixable HIGH/CRITICAL, --ignore-unfixed):
- API image: CLEAN
- Triton image: was 13 HIGH, now CLEAN. 11 were Go-stdlib CVEs in the Nsight Systems profiler CLI (dev tooling, unused at inference; removed) and 2 were in starlette 0.49.3 (upgraded >=1.3.1).
Zero fixable HIGH/CRITICAL across both shipped images.
…ep assertion
Policy change per review: engines are always re-exported from ONNX at deployment, so the image now takes the newest TensorRT from NVIDIA's apt repo instead of holding the NGC tag's stock version. A build-time assertion (TRT_VERSION arg) fails the image build the moment apt brings a different TRT than the client-side tensorrt-cu13 pip pins, so the two can only ever move together, in one commit.
- tensorrt-cu13==11.1.0.106 in requirements.txt, requirements-export-y11.txt and pyproject (onnx bound synced to <1.22).
- All 8 engines re-exported and E2E-verified on TRT 11.1: models READY, YOLO11 + YOLO26 detection parity, 20/20 endpoint tests.
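The idea behind the build-time assertion can be sketched in Python. How the Dockerfile actually compares versions is not shown here, so this is an assumption-laden illustration: the function names are hypothetical, and it assumes the installed version arrives as a plain string in the same form as the pip pin.

```python
import re

# Matches a line such as "tensorrt-cu13==11.1.0.106" in a requirements file.
PIN_RE = re.compile(r"^tensorrt-cu13==([\w.]+)\s*$", re.MULTILINE)


def pinned_trt_version(requirements_text: str) -> str:
    """Return the version pinned for tensorrt-cu13, or raise if none is found."""
    match = PIN_RE.search(requirements_text)
    if match is None:
        raise ValueError("no tensorrt-cu13== pin found")
    return match.group(1)


def assert_trt_match(installed: str, requirements_text: str) -> None:
    """Abort (non-zero exit) when the image and the client pin disagree."""
    pinned = pinned_trt_version(requirements_text)
    if installed != pinned:
        raise SystemExit(
            f"TensorRT mismatch: image has {installed}, clients pin {pinned}"
        )
```

Run during the image build, a non-zero exit from a check like this is what stops the build when apt and the pip pins drift apart.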
- tests/test_endpoints.sh gains a 'dual' target: dynamically load yolo26_small_trt, detect with both families side by side, unload and assert inference is refused, reload. Skips gracefully when the yolo26 engine hasn't been exported. Full suite: 25 passed, 0 failed, 0 skipped against the live 26.06/TRT-11.1 stack.
- FP16 bakes now respect each exporter's precision flag; paddleocr_det defaults to FP32: borderline text detection is threshold-sensitive to FP16 (verified: synthetic caution_sign detected at FP32, missed at FP16) and the engine is small.
- OCR synthetic fixtures generated via scripts/create_ocr_test_images.py (test_images/ is gitignored; the suite skips when absent).
Stacked on #4 (auto-retargets when it merges).
Serves YOLO11 and YOLO26 side by side in the same Triton + API instance:
- The YOLO11 EfficientNMS export path keeps its production pin (`ultralytics==8.3.253`) in an isolated `/opt/venv-y11` (CPU torch, ~5 GB saved); `export_models.py` transparently re-execs into it, so documented CLI commands are unchanged. The main env moves to ultralytics 8.4.x and gains `export/export_yolo26.py`: native NMS-free fused export (single `(batch,300,6)` tensor, no plugin), config.pbtxt written from the introspected ONNX output.
- Detection adapters (`src/clients/model_adapters.py`): the output contract is resolved from Triton model metadata, not names; EfficientNMS 4-tensor and fused single-tensor engines both parse to the same normalized result. Hardcoded model literals removed from `triton_client.py`.
- `YOLO_MODEL` env selects the default detector (default unchanged: `yolov11_small_trt_end2end`); `/detect?model_name=` selects per request; `/models/{name}/load|unload` add/remove either family at runtime.
- `/models` upload+export API auto-routes end2end (YOLO26) `.pt` uploads to the native toolchain (detected via the model's `end2end` attribute).
- `models/yolo26_small_trt/` repo entry; README + migration docs; 6 adapter behavior tests (20 integration tests total pass).
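The selection order for the detector (per-request parameter, then `YOLO_MODEL`, then the unchanged default) can be summarized in a short sketch. The function name and the injectable `env` mapping are hypothetical and exist only to make the precedence explicit and testable.

```python
import os

# Default named in this PR; unchanged from before.
DEFAULT_DETECTOR = "yolov11_small_trt_end2end"


def resolve_model_name(request_model_name=None, env=None) -> str:
    """Per-request model_name wins, then YOLO_MODEL, then the built-in default."""
    env = os.environ if env is None else env
    return request_model_name or env.get("YOLO_MODEL") or DEFAULT_DETECTOR
```

For example, with `YOLO_MODEL=yolo26_small_trt` set, a bare `/detect` call would resolve to the YOLO26 engine, while `/detect?model_name=yolov11_small_trt_end2end` would still reach YOLO11.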