feat: Allow for users / kv cache to add aliased I/O for inplace operations #4251
narendasan wants to merge 2 commits into
354674d to
813e753
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py 2026-05-12 00:26:56.728308+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py 2026-05-12 00:27:18.993037+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests
from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py 2026-05-12 00:26:56.731194+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py 2026-05-12 00:27:19.634834+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
placeholders into ``get_attr`` reads and registers the buffers as
module state. The result is a plain ``fx.GraphModule`` that
serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect
import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py 2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py 2026-05-12 00:27:21.127481+00:00
@@ -17,10 +17,11 @@
is already visible on the user's input).
These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest
import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py 2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py 2026-05-12 00:27:21.138494+00:00
@@ -15,10 +15,11 @@
``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
transform replaces what used to be an external ``BufferThreadingModule``
wrapper — making the result a plain ``fx.GraphModule`` that exports
naturally without a custom wrapper class.
"""
+
import tempfile
import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py 2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py 2026-05-12 00:27:21.208126+00:00
@@ -36,10 +36,11 @@
workaround that skips ``run_decompositions`` for already-decomposed EPs.
When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest
import torch
import torch_tensorrt
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py 2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py 2026-05-12 00:27:21.275543+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py 2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py 2026-05-12 00:27:21.304553+00:00
@@ -13,10 +13,11 @@
These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py 2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py 2026-05-12 00:27:21.380261+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine
813e753 to
bcaf725
bcaf725 to
3afcfd3
There are some changes that do not conform to C++ style guidelines:
diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp b/tmp/changes.txt
index a46ad8f..45dbf63 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp
+++ b/tmp/changes.txt
@@ -335,8 +335,7 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
std::make_unique<torch::autograd::profiler::RecordProfile>(compiled_engine->input_profile_path);
}
- setup_input_tensors(
- inputs, compiled_engine, cudagraphs_enabled, need_cudagraphs_record, bound_inputs_by_name);
+ setup_input_tensors(inputs, compiled_engine, cudagraphs_enabled, need_cudagraphs_record, bound_inputs_by_name);
// Check if input shapes can be inferred.
int32_t const io_size{compiled_engine->cuda_engine->getNbIOTensors()};
std::vector<char const*> names(io_size);
@@ -494,7 +493,6 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
// (validated at engine construction). The bound-inputs map is unused here.
std::unordered_map<std::string, at::Tensor> bound_inputs_by_name;
-
{ // Input Setup
std::unique_ptr<torch::autograd::profiler::RecordProfile> input_profiler_guard;
if (compiled_engine->profile_execution) {
ERROR: Some files do not conform to style guidelines
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py 2026-06-10 20:10:37.595181+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py 2026-06-10 20:11:00.737050+00:00
@@ -41,10 +41,11 @@
Optional[SerializedTensorRTEngineFmt],
List[str],
List[str],
]
+
def user_output_count(
output_binding_names: List[str], aliased_io: Dict[str, Tuple[str, str]]
) -> int:
"""Derive the boundary between user-visible outputs and side-effect
aliased outputs.
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py 2026-06-10 20:10:37.618585+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py 2026-06-10 20:11:02.717030+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests
from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py 2026-06-10 20:10:37.622167+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py 2026-06-10 20:11:03.643555+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
placeholders into ``get_attr`` reads and registers the buffers as
module state. The result is a plain ``fx.GraphModule`` that
serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect
import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py 2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py 2026-06-10 20:11:05.387147+00:00
@@ -15,10 +15,11 @@
``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
transform replaces what used to be an external ``BufferThreadingModule``
wrapper — making the result a plain ``fx.GraphModule`` that exports
naturally without a custom wrapper class.
"""
+
import tempfile
import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py 2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py 2026-06-10 20:11:05.513263+00:00
@@ -13,10 +13,11 @@
These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py 2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py 2026-06-10 20:11:05.532849+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py 2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py 2026-06-10 20:11:05.670076+00:00
@@ -17,10 +17,11 @@
is already visible on the user's input).
These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest
import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py 2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py 2026-06-10 20:11:05.709848+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py 2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py 2026-06-10 20:11:05.721107+00:00
@@ -36,10 +36,11 @@
workaround that skips ``run_decompositions`` for already-decomposed EPs.
When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest
import torch
import torch_tensorrt
3afcfd3 to
a8fce13
Anything we could do to avoid manually resetting the KV cache before every run?
a8fce13 to
94d6ac7
There are some changes that do not conform to C++ style guidelines:
diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp b/tmp/changes.txt
index 18006f2..75a8b6d 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp
+++ b/tmp/changes.txt
@@ -354,8 +354,7 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
std::make_unique<torch::autograd::profiler::RecordProfile>(compiled_engine->input_profile_path);
}
- setup_input_tensors(
- inputs, compiled_engine, effective_cudagraphs, need_cudagraphs_record, bound_inputs_by_name);
+ setup_input_tensors(inputs, compiled_engine, effective_cudagraphs, need_cudagraphs_record, bound_inputs_by_name);
// Check if input shapes can be inferred.
int32_t const io_size{compiled_engine->cuda_engine->getNbIOTensors()};
std::vector<char const*> names(io_size);
@@ -515,7 +514,6 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
// (validated at engine construction). The bound-inputs map is unused here.
std::unordered_map<std::string, at::Tensor> bound_inputs_by_name;
-
{ // Input Setup
std::unique_ptr<torch::autograd::profiler::RecordProfile> input_profiler_guard;
if (compiled_engine->profile_execution) {
ERROR: Some files do not conform to style guidelines
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/lowering/passes/decompose_dynamic_slice_scatter.py 2026-06-17 22:23:27.911413+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/lowering/passes/decompose_dynamic_slice_scatter.py 2026-06-17 22:23:53.349374+00:00
@@ -38,13 +38,11 @@
dim = args[2]
start: Optional[Any] = args[3] if len(args) > 3 else None
end: Optional[Any] = args[4] if len(args) > 4 else None
step: Optional[Any] = args[5] if len(args) > 5 else None
- is_dynamic = any(
- isinstance(x, torch.fx.Node) for x in (start, end, step)
- )
+ is_dynamic = any(isinstance(x, torch.fx.Node) for x in (start, end, step))
if not is_dynamic:
continue
input_val = input_node.meta.get("val")
if input_val is None:
@@ -82,13 +80,11 @@
torch.ops.aten.view.default,
(arange_node, view_shape),
)
expand_size = [
- gm.graph.call_function(
- torch.ops.aten.sym_size.int, (src_node, i)
- )
+ gm.graph.call_function(torch.ops.aten.sym_size.int, (src_node, i))
for i in range(rank)
]
expand_node = gm.graph.call_function(
torch.ops.aten.expand.default,
(view_node, expand_size),
@@ -108,10 +104,8 @@
dim,
)
if changed:
gm = clean_up_graph_after_modifications(gm)
- logger.debug(
- "After decompose_dynamic_slice_scatter:\n%s", gm.graph
- )
+ logger.debug("After decompose_dynamic_slice_scatter:\n%s", gm.graph)
return gm
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py 2026-06-17 22:23:27.913545+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py 2026-06-17 22:23:54.506155+00:00
@@ -42,10 +42,11 @@
Optional[SerializedTensorRTEngineFmt],
List[str],
List[str],
]
+
def user_output_count(
output_binding_names: List[str], aliased_io: Dict[str, Tuple[str, str]]
) -> int:
"""Derive the boundary between user-visible outputs and side-effect
aliased outputs.
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py 2026-06-17 22:23:27.937613+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py 2026-06-17 22:23:56.484974+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests
from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py 2026-06-17 22:23:27.941414+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py 2026-06-17 22:23:57.432798+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
placeholders into ``get_attr`` reads and registers the buffers as
module state. The result is a plain ``fx.GraphModule`` that
serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect
import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_decompose_dynamic_slice_scatter.py 2026-06-17 22:23:27.941414+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_decompose_dynamic_slice_scatter.py 2026-06-17 22:23:57.444168+00:00
@@ -4,10 +4,11 @@
When ``slice_scatter``'s start/end/step is a SymInt (e.g. derived from a
dynamic dim), the static converter path doesn't apply. The lowering pass
rewrites the op into ``arange + view + expand + scatter`` so each piece
hits its existing dynamic-shape converter.
"""
+
import unittest
import torch
import torch_tensorrt
from torch.export import Dim, export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py 2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py 2026-06-17 22:23:59.314499+00:00
@@ -17,10 +17,11 @@
is already visible on the user's input).
These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest
import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py 2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py 2026-06-17 22:23:59.423966+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py 2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py 2026-06-17 22:23:59.464305+00:00
@@ -15,10 +15,11 @@
``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
transform replaces what used to be an external ``BufferThreadingModule``
wrapper — making the result a plain ``fx.GraphModule`` that exports
naturally without a custom wrapper class.
"""
+
import tempfile
import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py 2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py 2026-06-17 22:23:59.510863+00:00
@@ -36,10 +36,11 @@
workaround that skips ``run_decompositions`` for already-decomposed EPs.
When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest
import torch
import torch_tensorrt
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py 2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py 2026-06-17 22:23:59.592871+00:00
@@ -13,10 +13,11 @@
These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py 2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py 2026-06-17 22:23:59.971603+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine
Description
Adds support for in-place ATen operators by extending the Torch-TensorRT compile pipeline and C++ runtime with aliased input/output bindings. The motivating case is streaming inference with a key/value cache (e.g. autoregressive decoders, ZoomASR): each step writes a single timestep into the cache, and without aliasing every step pays a full cache-size copy at the engine boundary. With aliased I/O the TensorRT engine writes directly into the user's (or module-held) cache storage; no fresh allocation, no post-engine copy.
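To make the copy cost concrete, here is a back-of-the-envelope sketch. The cache shape, element size, and the helper name `kv_step_bytes` are illustrative assumptions, not values taken from this PR. It only compares the bytes in a full cache tensor with the bytes a single timestep actually writes.

```python
from math import prod


def kv_step_bytes(cache_shape, elem_bytes=2, seq_dim=2):
    """Return (full_cache_bytes, single_timestep_bytes) for one cache tensor.

    Hypothetical helper for illustration only; not part of Torch-TensorRT.
    """
    full_cache = prod(cache_shape) * elem_bytes
    # One timestep is the same tensor with the sequence dimension collapsed to 1.
    step_shape = list(cache_shape)
    step_shape[seq_dim] = 1
    one_step = prod(step_shape) * elem_bytes
    return full_cache, one_step


# Assumed example: batch=1, 8 heads, s_max=2048, head_dim=64, fp16 (2 bytes).
full_cache, one_step = kv_step_bytes((1, 8, 2048, 64))
print(f"full-cache copy per step: {full_cache} bytes")
print(f"bytes actually written:   {one_step} bytes")
print(f"ratio: {full_cache // one_step}x")
```

Under these assumed dimensions the boundary copy moves `s_max` times more data than the step itself writes, which is the overhead aliasing is meant to remove.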
Two `aliased_io` "kinds" are tracked so the runtime can reason about provenance:
- `kv_cache_update`: TensorRT-enforced via `IKVCacheUpdateLayer`; reported through `ICudaEngine::getAliasedInputTensor`.
- `user`: Torch-TensorRT-declared; reserved for future expansion if TRT exposes a public non-KV aliasing API.
What this PR does
Pipeline (Python)
- `slice_scatter` and `index_copy` converters that recognize KV-cache-update patterns (4-D static cache, `dim=2`, batch=1) and emit `IKVCacheUpdateLayer` with the output aliased to the cache input. Non-eligible cases fall back to scatter in TRT, with no graph break.
- For `index_copy`, two disjoint converters (validator-gated KV fast path at `ConverterPriority.HIGH` + scatter fallback at standard priority) cleanly split the cases.
- `aliased_io` plumbed through `TRTInterpreter` → `TRTInterpreterResult` → `SerializedInterpreterResult` → `TorchTensorRTModule`. The `output()` step automatically promotes layer outputs that need to be network outputs (KVCacheUpdate requires it) and appends them after user outputs. The user/side-effect boundary is derived at runtime, not stored.
Buffer-style support
- `lift_mutated_buffers` lowering pass detects `BUFFER_MUTATION` patterns (the trailing `aten.copy_(get_attr_buffer, ...)` that `ep.module()` emits) and lifts each mutated buffer from `get_attr` to `placeholder` so the engine sees it as an input binding.
- `inline_lifted_buffers_into_gm` post-compile transform registers the buffers as `nn.Module` state on the compiled `GraphModule` and rewrites the lifted placeholders to `get_attr` reads. The result is a plain `fx.GraphModule` (no custom wrapper class) that serializes cleanly through `torch_tensorrt.save` / `torch.export`.
- `convert_exported_program_to_serialized_trt_engine` gains `lift_mutable_buffers: bool = False` for power users who want to manage the resulting bindings themselves.
C++ runtime (ABI v9 → v10)
- Bumped `ABI_VERSION` to `"10"`; added `ALIASED_IO_IDX` to `SerializedInfoIndex`.
- `serialize_aliased_io` / `deserialize_aliased_io` helpers (wire format: `output@input@kind` records joined by `BINDING_DELIM`). Helpers live in `runtime_utils.cpp` alongside `serialize_bindings`.
- `TRTEngine` constructor reconciles the build-time map against `ICudaEngine::getAliasedInputTensor`; the engine API is the source of truth for KV-style aliasing.
- `execute_engine` records bound input tensors by binding name; for each output binding in `aliased_io`, it binds the same `data_ptr` as the source input and skips fresh allocation. Pre-allocated outputs are disabled when aliased I/O is present.
Docs + examples
- `docsrc/contributors/inplace_operations.rst`: full design doc covering motivation, primitives, pipeline, runtime, serialization format, and known limitations.
- `examples/dynamo/`:
  - `aliased_io_user_inputs.py`: caller-owned cache (simplest case)
  - `aliased_io_buffers.py`: module-owned cache via `register_buffer`
  - `aliased_io_kv_attention.py`: realistic single-layer transformer attention block with static KV cache
Partially fixes #4240 (in-place custom plugins / multiple outputs): this PR addresses the in-place-operator side; plugin-side aliased I/O is explicitly out of scope here.
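The aliased-I/O wire format listed under the C++ runtime changes (`output@input@kind` records joined by `BINDING_DELIM`) can be sketched in Python as a round-trip. This is a rough mirror of the idea, not the C++ helpers themselves: the `%` delimiter value, the dictionary layout, and the validation rules are assumptions made for illustration.

```python
from typing import Dict, Tuple

BINDING_DELIM = "%"  # assumed value; the real constant lives in the C++ runtime
FIELD_DELIM = "@"    # separator inside one record, per the PR description
KNOWN_KINDS = {"kv_cache_update", "user"}

# Assumed in-memory layout: output binding name -> (input binding name, kind).
AliasedIO = Dict[str, Tuple[str, str]]


def serialize_aliased_io(aliased_io: AliasedIO) -> str:
    """Encode the alias map as 'output@input@kind' records joined by the delimiter."""
    records = []
    for output, (source_input, kind) in aliased_io.items():
        if kind not in KNOWN_KINDS:
            raise ValueError(f"unknown aliased_io kind: {kind!r}")
        for name in (output, source_input):
            # Names containing a delimiter would make the encoding ambiguous.
            if FIELD_DELIM in name or BINDING_DELIM in name:
                raise ValueError(f"binding name contains a delimiter: {name!r}")
        records.append(FIELD_DELIM.join((output, source_input, kind)))
    return BINDING_DELIM.join(records)


def deserialize_aliased_io(serialized: str) -> AliasedIO:
    """Decode the string produced by serialize_aliased_io; empty string means no aliasing."""
    if not serialized:
        return {}
    aliased_io: AliasedIO = {}
    for record in serialized.split(BINDING_DELIM):
        output, source_input, kind = record.split(FIELD_DELIM)
        aliased_io[output] = (source_input, kind)
    return aliased_io


# Hypothetical binding names, chosen only for the example.
example = {
    "k_cache_out": ("k_cache", "kv_cache_update"),
    "v_cache_out": ("v_cache", "kv_cache_update"),
}
wire = serialize_aliased_io(example)
print(wire)
print(deserialize_aliased_io(wire) == example)
```

Rejecting names that contain a delimiter is a choice of this sketch; the PR description does not say how the C++ helpers handle that case.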
Type of change
Checklist
Test summary
38 new tests across 8 files, all passing:
- `tests/py/dynamo/conversion/test_slice_scatter_aten.py`
- `tests/py/dynamo/runtime/test_aliased_io.py`
- `tests/py/dynamo/runtime/test_index_copy_kv.py`: `aten.index_copy`
- `tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py`: `lift_mutable_buffers=True` flag round-trip (introspect engine, construct module, run)
- `tests/py/dynamo/runtime/test_aliased_io_serialization.py`: `torch_tensorrt.save` / `load` round-trip for user-input + buffer-backed + streaming buffer
- `tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py`
- `tests/py/dynamo/runtime/test_hf_static_cache_xfail.py`
- `tests/py/dynamo/lowering/test_buffer_lifting.py`: `lift_mutated_buffers` + `inline_lifted_buffers_into_gm` unit tests
Known gaps (documented)
- Models using `StaticCache` don't compile end-to-end yet: torch.export's `run_decompositions` raises internally on the EP that `convert_and_export_with_cache` produces. The xfail test asserts the known failure so a future upstream fix surfaces as a test failure. The path forward is documented in the design doc.
- `IKVCacheUpdateLayer` requires a static `s_max`. Dynamic-sequence-length cache shapes fall through to the scatter path (still correct, no aliasing).
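The fast-path conditions mentioned in this description (4-D static cache, `dim=2`, batch=1, otherwise scatter) can be summarized as a small decision function. This is a sketch of the decision only: the function names are hypothetical, dynamic dimensions are modeled as `None`, and requiring every dimension to be static is an assumption, since the description only states that `s_max` must be.

```python
from typing import Optional, Sequence

# Dynamic dimensions are modeled as None here (an assumption of this sketch).
Shape = Sequence[Optional[int]]


def is_kv_fast_path_eligible(cache_shape: Shape, dim: int) -> bool:
    """Return True when the update matches the KV-cache pattern described above."""
    if len(cache_shape) != 4:
        return False  # cache must be 4-D
    if any(d is None for d in cache_shape):
        return False  # a dynamic dimension (e.g. dynamic s_max) rules out the fast path
    if cache_shape[0] != 1:
        return False  # batch must be 1
    # Only the literal dim=2 is accepted; negative-dim normalization is not covered here.
    return dim == 2


def choose_lowering(cache_shape: Shape, dim: int) -> str:
    """Pick the aliased KV-cache path when eligible, else the scatter fallback."""
    if is_kv_fast_path_eligible(cache_shape, dim):
        return "kv_cache_update"
    return "scatter"


print(choose_lowering((1, 8, 2048, 64), dim=2))  # static cache, eligible
print(choose_lowering((1, 8, None, 64), dim=2))  # dynamic sequence length
```

Either branch yields a correct result; only the eligible one avoids the boundary copy.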