build(web_api): slim CPU/GPU extras + image size reduction by magic-vladyslav · Pull Request #1263 · ZettaAI/zetta_utils

magic-vladyslav · 2026-05-28T16:18:06Z

Summary

Carves the web_api deploy images away from the monolithic modules extra and onto purpose-built, slim web_api (CPU) / web_api-gpu extras that contain only what web_api/app and internal.alignment actually import. Also makes the CPU image ship CPU-only torch (no CUDA wheels) and the GPU image drop the unused TensorRT/CUDA-13 stack, and restores the newest torch for everyone else.

Net effect: dramatically smaller, faster web_api images, and a web_api dev on Apple Silicon no longer has to resolve the other team's tensorrt/training deps.

What web_api actually imports → covering extra

Verified by tracing every import in web_api/app/*.py and recursively through internal.alignment (sift → misalignment_detector → manual_correspondence → field → online_finetuner). Nothing reachable from web_api imports mazepa, mazepa_addons, training, lightning, wandb, torchmetrics, meshing/skeletonization/chunkedgraph/montaging/calcada, or TensorRT.

web_api import	external pkg	covering extra
`task_management.*` (tasks.py)	—	`task_management` (→ databackends, sql, tenacity, pcg_skel→caveclient)
`db_annotations.*` (annotations/collections/layers/...)	—	`databackends`+`tenacity` (via `task_management`)
`layer.volumetric` + `.cloudvol` + `.annotation` (painting/precomputed)	einops	`cloudvol` + `tensorstore` (volumetric `__init__` loads `.tensorstore`) + `tensor_ops`
`internal.alignment.sift`	scipy, cv2, numpy, torch	`tensor_ops`; scipy declared explicitly
`internal.alignment.misalignment_detector`	einops, torch	`convnet`
`internal.alignment.manual_correspondence`/field/online_finetuner	torch, einops, torchfields	`tensor_ops`
`segmentation.py`	cutie, hydra-core, omegaconf, torch, google.cloud.storage	new `cutie` sub-extra; hydra/omegaconf explicit; gcs via base `cloud-files`
`main.py`	google-auth	`google-cloud-iap` (web stack)

Hidden / transitive gaps closed

scipy (internal/alignment/sift.py, alignment.py) — only transitive via scikit-image → declared explicitly.
hydra-core / omegaconf (segmentation.py) — only transitive via the git cutie → declared explicitly.
caveclient / tenacity / google-cloud-storage — verified not gaps (caveclient via pcg_skel, tenacity via task_management, gcs via cloud-files).
cchardet — pruned (doesn't build on 3.12+); cutie needs it → CPU image keeps the faust-cchardet/stub shim.

New / changed extras (`pyproject.toml`)

New cutie sub-extra; segmentation now references zetta_utils[cutie] (modules/segmentation semantics unchanged for the other team).
New web-api-base (shared deps) + web-api / web-api-gpu leaf extras.
CPU-only torch for web-api: a pytorch-cpu index + [tool.uv.sources] bind torch to it for the web-api extra, with [tool.uv] conflicts between web-api/web-api-gpu. requirements.web_api.txt resolves torch==…+cpu with 0 nvidia-* CUDA wheels. uv-only — plain pip install '.[web-api-gpu]' ignores it, so the GPU image keeps its base cu121 torch.
web-api-gpu omits the gpu/tensorrt extra — web_api only calls convnet.load_model(tensorrt_enabled=False), so TensorRT is never imported; layering a CUDA-13 runtime on the CUDA-12.1 base was pure bloat + a version mismatch.
torch floors: restored >= 2.11 on training/alignment/montaging; kept >= 2.5 on the web_api-path extras (convnet, tensor-typing, web-api) so the cu121 base's torch 2.5.1 is honored. --resolution highest still pins 2.12 everywhere else, so non-GPU consumers get the newest torch.

Removed from the web_api images

training (lightning/wandb/torchmetrics), mazepa_addons (kubernetes/awscli/mitmproxy/gcloud SDKs), meshing, skeletonization, montaging, chunkedgraph, calcada, and (CPU) the native abiss/waterz/lsds builds + their numpy==1.26.4/cython/nanobind machinery — plus the full nvidia-* CUDA stack (CPU) and tensorrt-cu13 + CUDA-13 libs (GPU).

Image size wins

CPU image: CPU-only torch, no CUDA libs / no triton → ~4–5 GB lighter.
GPU image: no CUDA-13 TensorRT stack → ~3–5 GB lighter, and the CUDA-12.1/13 mismatch is gone.

Dockerfiles

web_api/Dockerfile (CPU): install requirements.web_api.txt --no-deps with PIP_EXTRA_INDEX_URL=…/whl/cpu; keep the cchardet shim; drop libboost/unixodbc + standalone cutie + abiss/waterz/lsd machinery; replace RUN zetta --help (pulls kubernetes) with a python -c "import app.main" smoke test.
web_api/gpu.Dockerfile: pip install '.[web-api-gpu]' on the pytorch/pytorch:2.5.1-cuda12.1-cudnn9-runtime base, keeping PIP_EXTRA_INDEX_URL=…/cu121; drop numpy/cython/lsd/nanobind.
Deleted web_api/requirements.txt — single source of truth is the extra.

Scripts & CI

update_pinned_requirements.sh: exports requirements.web_api.txt / requirements.web_api_gpu.txt (lock/prune/fork-strategy untouched).
install_zutils.py: --mode gains web_api / web_api_gpu.
build_web_api.py: builds the CPU & GPU variants in parallel by default (--no-parallel to opt out) with per-variant prefixed output.
.github/workflows/testing.yaml: new web-api-extras-build job (py 3.11/3.12/3.14: clean CPU install with the cpu torch index → smoke-import app.main → assert lightning/wandb/torchmetrics/mitmproxy/awscli/kubernetes/tensorrt absent) and web-api-gpu-build job (full GPU docker build). Both added to all-checks-test.

Verification

web_api.app.main imports on CPU (darwin) including internal.alignment + cutie + hydra/omegaconf/scipy; no hardcoded .cuda().
Pinned files confirmed: web_api → torch==…+cpu, 0 CUDA libs, no heavy pkgs; web_api_gpu → CUDA torch, no tensorrt; modules/all → CUDA torch + tensorrt unchanged (other team unaffected).
web_api-gpu resolves cleanly with torch==2.5.1; modules now correctly rejects torch==2.5.1 (requires >=2.11).
update_pinned_requirements.sh runs on Apple Silicon without choking on tensorrt (static-metadata workaround).

Note for the other team

pyproject.toml is shared, but modules/training/segmentation/all semantics are preserved: modules still resolves with tensorrt + CUDA torch, segmentation still includes cutie (via the cutie sub-extra). No undeclared internal dependency was added — scipy/hydra-core/omegaconf were already resolved transitively and are now declared explicitly in the web_api extra only.

🤖 Generated with Claude Code

codecov · 2026-05-28T16:28:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (72cd1e9) to head (06ea78c).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main     #1263   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          211       211           
  Lines        11292     11314   +22     
=========================================
+ Hits         11292     11314   +22

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

dodamih · 2026-05-29T06:07:44Z

This would break tensorrt which requires CUDA 13. What needs CUDA 12.1?

nkemnitz · 2026-05-29T08:32:45Z

I had to modify the dependency installation partially due to cutie (but not only). pip respects the requires_python settings that maintainers specify in their packages, but their upper bounds are often too limiting (often they just haven't been tested, but work fine on newer Python, and/or the package simply has been abandonded - like cutie).
For update_pinned_requirements.sh and install_zutils.sh I am now using uv, which apparently ignores the requires_python upper bounds by design. That should also resolve the need to rebuild the super-outdated cchardet: https://github.com/ZettaAI/zetta_utils/blob/main/install_zutils.py#L779-L780

Also please make sure to rerun update_pinned_requirements.sh whenever you really need to update pyproject.toml dependencies.

nkemnitz · 2026-05-29T10:15:11Z

This would break tensorrt which requires CUDA 13. What needs CUDA 12.1?

I think it's because Cloud Run driver's for L4 GPU can't be updated: https://github.com/ZettaAI/zetta_utils/blob/main/web_api/gpu.Dockerfile#L3-L6

Replace the monolithic `modules` extra in both web_api images with new `web_api` (CPU) and `web_api-gpu` extras that pull only what web_api/app and internal.alignment actually import, dropping training/mazepa_addons/meshing/ skeletonization/chunkedgraph/segmentation-native and tensorrt from the deploy images. - pyproject: add `web-api-base` + `web-api`/`web-api-gpu` leaf extras and a `cutie` sub-extra (still pulled by `segmentation`). Restore torch>=2.11 on the non-web extras (training/alignment/montaging) while the web_api path stays torch>=2.5 so the cu121 GPU base image keeps torch 2.5.1. - web_api resolves CPU-only torch via the pytorch-cpu index ([tool.uv.sources] + conflicts), so installing without a GPU pulls no nvidia-* CUDA wheels. - web_api-gpu omits the gpu/tensorrt extra: web_api only calls convnet.load_model with tensorrt_enabled=False, so the CUDA-13 stack is dead weight on the CUDA-12.1 base. - Dockerfiles install the slim extras (CPU: pinned --no-deps + cpu torch index, cchardet shim retained for cutie; GPU: resolution on the cu121 base). Drop the abiss/waterz/lsd build machinery and the `zetta --help` check (the CLI pulls kubernetes); smoke-import app.main instead. - update_pinned_requirements.sh exports requirements.web_api{,_gpu}.txt; install_zutils gains web_api/web_api_gpu modes; web_api/requirements.txt is removed (single source of truth is the extra). - CI: add web-api-extras-build (clean CPU install + smoke import + heavy-package absence assert) and web-api-gpu-build (full GPU docker build) jobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the two image variants concurrently by default (--no-parallel to opt out, and automatic when only one variant is selected), streaming per-variant prefixed, line-buffered output so the interleaved logs stay readable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- web-api-extras-build: set UV_INDEX_STRATEGY=unsafe-best-match so uv finds exact pins on PyPI even though the pytorch CPU index mirrors some packages (e.g. certifi) at older versions; default first-index strategy stopped there. - gpu.Dockerfile: restore the cchardet stub + faust-cchardet shim before the resolution install, since cutie's cchardet>=2.1.7 does not build on the base image's Python. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The bash assert step reported all packages absent but still exited 1 under bash -l {0} (login shell + conda). Replace it with an importlib.metadata check + sys.exit so the result is deterministic and self-documenting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nkemnitz

See comments, but a major one regarding the web_api/gpu build:

The R535 limitation on Cloud Run does not prevent you from using newer Pytorch / CUDA driver.

Updating to Pytorch 2.12 + CUDA 12.6 in the requirements should be straightforward, thanks to minor-version compatibility. That already resolves the issue with Pytorch 2.5 <-> 2.12
CUDA 13.0 is also possible with apt install cuda-compat-13-0, and prepending LD_LIBRARY_PATH with /usr/local/cuda-13.0/compat, which should contain the libcuda.so.
I would drop the pytorch/pytorch image in either case. and rely on the pinned requirements files to install the version zutils actually prefers. Otherwise we need to keep paying close attention to the requirements and update the base image version to stay in sync with the resolved dependencies

nkemnitz · 2026-05-30T09:12:08Z

+# backend; pip still pulls sub-package deps from the index at install time.
+[[tool.uv.dependency-metadata]]
+name = "tensorrt-cu13"
+requires-dist = []


The install scripts explicitly use --no-deps, so this change drops tensorrt libs and bindings from the pinned requirements and break the main image. Try preserving the dependencies for the metapackage, that might still bypass the macOS issue.

Suggested change

# backend; pip still pulls sub-package deps from the index at install time.

[[tool.uv.dependency-metadata]]

name = "tensorrt-cu13"

requires-dist = []

# backend

[[tool.uv.dependency-metadata]]

name = "tensorrt-cu13"

requires-dist = ["tensorrt-cu13-libs", "tensorrt-cu13-bindings"]

nkemnitz · 2026-05-30T09:33:51Z

+# cutie declares cchardet>=2.1.7, which does not build on the base image's
+# Python. Install an empty stub to satisfy the requirement (so the resolution
+# install below does not try to build the real one) plus faust-cchardet to
+# provide the actual top-level `cchardet` module.
+RUN mkdir -p /tmp/cc_stub \
+    && printf 'from setuptools import setup\nsetup(name="cchardet", version="2.1.7", py_modules=[])\n' > /tmp/cc_stub/setup.py \
+    && pip install --no-deps /tmp/cc_stub \
+    && rm -rf /tmp/cc_stub
+RUN --mount=type=cache,target=/root/.cache/pip pip install faust-cchardet


cchardet is declared by cutie, but never used. You neither need cchardet nor faust-cchardet.

nkemnitz · 2026-05-30T09:40:29Z

@@ -33,7 +32,8 @@ RUN --mount=type=cache,target=/root/.cache/pip \
    pip install faust-cchardet


cchardet is declared by cutie, but never used. You neither need cchardet nor faust-cchardet.

nkemnitz · 2026-05-30T09:46:33Z

Nothing consumes this file. In gpu.Dockerfile you are relying on pip+pyproject.toml to resolve dependencies. Also note that this requirements file here pins torch==2.12 and CUDA 13 libraries, which you wanted to avoid in the web_api/gpu.Dockerfile.

nkemnitz · 2026-05-30T10:17:18Z


 RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install "cutie @ git+https://github.com/hkchengrex/Cutie.git"
+    pip install --no-deps -r requirements.web_api.txt \


Switching to uv pip here (like the base image does via install_zutils.sh) may help with consistency/reproducibility. It's also faster than pip

supersergiy · 2026-06-11T21:20:40Z

Could you please address Nicos input?

fix(web_api): revert pytorch to 2.5 for cuda 12.1 compatibility

4ee2251

magic-vladyslav requested review from dmytroprokopenko-techmagic and volodymyryushko-coder May 28, 2026 16:18

magic-vladyslav marked this pull request as ready for review May 28, 2026 16:18

volodymyryushko-coder approved these changes May 28, 2026

View reviewed changes

supersergiy requested a review from volodymyryushko-coder May 29, 2026 14:09

magic-vladyslav and others added 5 commits May 29, 2026 17:34

fix(deps): provide static metadata for tensorrt cross-platform locking

8089902

magic-vladyslav requested review from dodamih, nkemnitz and supersergiy May 30, 2026 00:11

magic-vladyslav changed the title ~~fix(web_api): revert pytorch to 2.5 for cuda 12.1 compatibility~~ build(web_api): slim CPU/GPU extras + image size reduction May 30, 2026

nkemnitz requested changes May 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

build(web_api): slim CPU/GPU extras + image size reduction#1263

build(web_api): slim CPU/GPU extras + image size reduction#1263
magic-vladyslav wants to merge 6 commits into
mainfrom
fix/build

magic-vladyslav commented May 28, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 28, 2026 •

edited

Loading

Uh oh!

dodamih commented May 29, 2026

Uh oh!

nkemnitz commented May 29, 2026

Uh oh!

nkemnitz commented May 29, 2026

Uh oh!

nkemnitz left a comment •

edited

Loading

Uh oh!

nkemnitz May 30, 2026

Uh oh!

nkemnitz May 30, 2026

Uh oh!

nkemnitz May 30, 2026

Uh oh!

nkemnitz May 30, 2026

Uh oh!

nkemnitz May 30, 2026

Uh oh!

supersergiy commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

		@@ -33,7 +32,8 @@ RUN --mount=type=cache,target=/root/.cache/pip \
		pip install faust-cchardet

Conversation

magic-vladyslav commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What web_api actually imports → covering extra

Hidden / transitive gaps closed

New / changed extras (pyproject.toml)

Removed from the web_api images

Image size wins

Dockerfiles

Scripts & CI

Verification

Note for the other team

Uh oh!

codecov Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dodamih commented May 29, 2026

Uh oh!

nkemnitz commented May 29, 2026

Uh oh!

nkemnitz commented May 29, 2026

Uh oh!

nkemnitz left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nkemnitz May 30, 2026

Choose a reason for hiding this comment

Uh oh!

nkemnitz May 30, 2026

Choose a reason for hiding this comment

Uh oh!

nkemnitz May 30, 2026

Choose a reason for hiding this comment

Uh oh!

nkemnitz May 30, 2026

Choose a reason for hiding this comment

Uh oh!

nkemnitz May 30, 2026

Choose a reason for hiding this comment

Uh oh!

supersergiy commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

magic-vladyslav commented May 28, 2026 •

edited

Loading

New / changed extras (`pyproject.toml`)

codecov Bot commented May 28, 2026 •

edited

Loading

nkemnitz left a comment •

edited

Loading