fix: resolve ExecuTorch TRT target_device per partition (coalesced multi-engine) #4350

Open
shoumikhin wants to merge 2 commits into pytorch:main from shoumikhin:fix/executorch-per-partition-device

shoumikhin commented Jun 18, 2026

Stacked on #4349. The first commit here is from #4349 (the engine-constant lookup fix); this PR adds the per-partition target_device commit on top. Please review the second commit (fix: resolve ExecuTorch TRT target_device per partition) — once #4349 merges, this diff reduces to just that change.

What's broken

When coalescing TensorRT with other delegates into one .pte, the graph has multiple TensorRT engines. TensorRTPartitioner resolved target_device once for the whole program via _get_engine_info_from_edge_program(), which requires exactly one engine node. With more than one engine it raised, so every TensorRT partition fell back to cuda:0 with a spurious warning:

Could not derive target_device from the TensorRT engine (... expects exactly 1
engine node per partition, found 2); falling back to cuda:0.

On a single GPU this is just noise, but a multi-GPU graph cannot label each delegate with its own device.

The fix

Extract _get_engine_info_for_node() (single-node engine-info extraction) out of _get_engine_info_from_edge_program(), which keeps the one-engine contract that preprocess() relies on. TensorRTPartitioner then resolves target_device per partition from that partition's own engine node.

  • Single-GPU behavior is unchanged (still cuda:0), minus the spurious warning.
  • Multi-engine / multi-GPU graphs now get a correct per-delegate device label.
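
The split described above can be pictured with a toy model that needs neither torch nor executorch. Everything in it (`Node`, `EngineInfo`, the `device` field, `engine_info_for_node`, `engine_info_from_program`, `resolve_devices`) is a hypothetical stand-in for illustration, not the code in this PR; it only mirrors the one-engine contract and the per-partition lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EngineInfo:
    device: str  # hypothetical field: the device the engine targets


@dataclass
class Node:
    name: str
    partition: Optional[str] = None      # TRT partition tag, None if untagged
    engine: Optional[EngineInfo] = None  # set only on engine nodes


def engine_info_for_node(node: Node) -> EngineInfo:
    """Single-node extraction: no assumption about how many engines exist."""
    if node.engine is None:
        raise ValueError(f"{node.name} is not an engine node")
    return node.engine


def engine_info_from_program(nodes: List[Node]) -> EngineInfo:
    """Whole-program extraction: keeps the exactly-one-engine contract."""
    engines = [n for n in nodes if n.engine is not None]
    if len(engines) != 1:
        raise ValueError(f"expects exactly 1 engine node, found {len(engines)}")
    return engine_info_for_node(engines[0])


def resolve_devices(nodes: List[Node]) -> Dict[str, str]:
    """Map each tagged partition to the device of its own engine node."""
    devices: Dict[str, str] = {}
    for node in nodes:
        if node.partition is not None and node.engine is not None:
            devices[node.partition] = engine_info_for_node(node).device
    return devices


# Toy coalesced graph: TensorRT -> other delegate -> TensorRT.
graph = [
    Node("trt_0", partition="p0", engine=EngineInfo("cuda:0")),
    Node("other_delegate"),
    Node("trt_1", partition="p1", engine=EngineInfo("cuda:1")),
]
per_partition = resolve_devices(graph)
```

In this toy graph the whole-program helper still raises because it sees two engines, while the per-partition pass never calls it and so has nothing to fall back from.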

Test

Verified on a coalesced TensorRT -> Another -> TensorRT model (two TRT engines plus one delegate from another backend): the .pte still contains both TensorRTBackend and AnotherBackend, and the "found 2 engines" warning no longer fires.

Saving a partially-TRT-compiled program to ExecuTorch
(output_format="executorch") via the modern torch.export path (retrace=True)
aborts with:

    RuntimeError: execute_engine node 'execute_engine': placeholder engine
    'obj__run_on_acc_0_engine' not found in exp_program.constants

even though the engine is present. torch.export lifts the TRT engine
ScriptObject as a custom-object constant keyed by its graph-signature FQN
(InputSpec.target) and renames the placeholder node (an obj_ prefix), so the
existing constants[node.name] / constants[node.target] lookup misses. The
legacy exporter (retrace=False) only worked by accident: it kept the
placeholder name equal to the constants key.

Resolve the placeholder via the canonical
ExportGraphSignature.inputs_to_lifted_custom_objs mapping, falling back to the
direct lookup only for legacy programs that lack it, and unwrap a
FakeScriptObject to its real object. A shared helper in dynamo/_exporter.py is
used by both the save serializer (_compile.py) and the backend engine-info
extractor (executorch/backend.py), which carried the same latent lookup.
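
The lookup order described above might look roughly like the sketch below. It is duck-typed and uses `SimpleNamespace` stand-ins rather than real torch objects, so it runs without torch. The function name `resolve_engine_constant`, the `real_obj` attribute used for unwrapping, and the example FQN `_run_on_acc_0_engine` are assumptions made for illustration; the actual helper in dynamo/_exporter.py may differ.

```python
from types import SimpleNamespace
from typing import Any, Mapping


def resolve_engine_constant(node: Any, signature: Any, constants: Mapping[str, Any]) -> Any:
    """Return the engine object behind a lifted placeholder node (sketch)."""
    # Preferred path: the graph signature maps placeholder name -> constants key (FQN).
    lifted = getattr(signature, "inputs_to_lifted_custom_objs", None) or {}
    key = lifted.get(node.name)
    if key is None:
        # Legacy fallback: the placeholder name or target is itself the key.
        key = next((k for k in (node.name, node.target) if k in constants), None)
    if key is None or key not in constants:
        raise RuntimeError(f"placeholder engine {node.name!r} not found in constants")
    obj = constants[key]
    # Unwrap a fake wrapper to the real object when one is attached (assumed attribute).
    return getattr(obj, "real_obj", obj)


# Modern-export shape: renamed placeholder, constant keyed by FQN, fake wrapper.
engine = object()
node = SimpleNamespace(name="obj__run_on_acc_0_engine", target="obj__run_on_acc_0_engine")
signature = SimpleNamespace(
    inputs_to_lifted_custom_objs={"obj__run_on_acc_0_engine": "_run_on_acc_0_engine"}
)
constants = {"_run_on_acc_0_engine": SimpleNamespace(real_obj=engine)}
resolved = resolve_engine_constant(node, signature, constants)

# Legacy shape: no mapping on the signature, constant keyed by the placeholder name.
legacy_node = SimpleNamespace(name="engine_0", target="engine_0")
legacy_resolved = resolve_engine_constant(legacy_node, SimpleNamespace(), {"engine_0": engine})
```

Keeping the direct lookup as a fallback only when the mapping has no entry means a modern program with a stale or missing constant fails loudly instead of silently matching an unrelated key.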

Adds CPU-only unit tests for the resolver (no GPU/executorch required).

This unblocks coalescing TensorRT + CUDA delegates into one .pte via the
modern exporter.
@meta-cla meta-cla Bot added the cla signed label Jun 18, 2026
@github-actions github-actions Bot added component: tests Issues re: Tests component: core Issues re: The core compiler component: api [Python] Issues re: Python API component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Jun 18, 2026
@github-actions github-actions Bot requested a review from lanluo-nvidia June 18, 2026 04:35
@shoumikhin shoumikhin force-pushed the fix/executorch-per-partition-device branch from 4323b2f to 380e295 Compare June 18, 2026 04:41
TensorRTPartitioner resolved target_device once for the whole exported program
via _get_engine_info_from_edge_program(), which requires exactly one engine
node. A coalesced graph (TensorRT + CUDA delegates) has multiple TRT engines,
so that call raised and every TRT partition fell back to cuda:0 with a spurious
"expects exactly 1 engine node per partition, found N" warning; multi-GPU
graphs also could not be labeled per partition.

Extract _get_engine_info_for_node() (single-node engine-info extraction) from
_get_engine_info_from_edge_program() and resolve target_device per partition
from that partition's own engine node. Single-GPU behavior is unchanged (still
cuda:0) minus the warning; multi-engine/multi-GPU graphs now label each delegate
correctly.
@shoumikhin shoumikhin force-pushed the fix/executorch-per-partition-device branch from 380e295 to 6971383 Compare June 18, 2026 05:07