Add LP-type-aware Torch-JIT surrogate models #241

Draft
sanjaychari wants to merge 13 commits into codes-org:master from sanjaychari:digital-twin-sbir-develop-component-ml

@sanjaychari

This PR extends the Director-controlled Torch-JIT surrogate path so CODES can move beyond a single global packet-latency model and support component/LP-type-specific ML models. The main addition is a new torch_jit_mode path for LP-type-aware surrogate inference, while preserving the existing global packet_latency_model behavior for backward compatibility.

The key motivation is to make the surrogate path more faithful to the PDES structure: terminals, routers, and other LP/component types can have different timing behavior, so the ML interface should allow separate models and feature paths rather than forcing all inference through one global latency predictor.
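A minimal sketch of the selection logic described above, not the PR's actual C/C++ implementation. The `SurrogateRegistry` class, the `"legacy"` mode name, and the constant stand-in models are hypothetical; only the `lp-aware-lp-type-models` mode string comes from this PR. Falling back to the global model for an unregistered LP type is an assumption.

```python
from typing import Callable, Dict, Sequence

# A model is anything that maps a feature vector to a predicted delay.
Model = Callable[[Sequence[float]], float]


class SurrogateRegistry:
    """Hypothetical registry: one model per LP type, with a global fallback."""

    def __init__(self, global_model: Model, mode: str = "legacy") -> None:
        self.global_model = global_model
        self.mode = mode
        self.per_type: Dict[str, Model] = {}

    def register(self, lp_type: str, model: Model) -> None:
        self.per_type[lp_type] = model

    def predict(self, lp_type: str, features: Sequence[float]) -> float:
        # Only the LP-type-aware mode consults per-type models; every other
        # mode keeps using the single global model, as before.
        if self.mode == "lp-aware-lp-type-models":
            model = self.per_type.get(lp_type, self.global_model)
        else:
            model = self.global_model
        return model(features)


# Stand-in models: constants instead of loaded TorchScript modules.
aware = SurrogateRegistry(global_model=lambda f: 1.0, mode="lp-aware-lp-type-models")
aware.register("router", lambda f: 2.0)

legacy = SurrogateRegistry(global_model=lambda f: 1.0)
legacy.register("router", lambda f: 2.0)
```

In this sketch `aware.predict("router", ...)` uses the router model, while `legacy.predict("router", ...)` still goes through the global model, which is the backward-compatible behavior the description asks for.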

kevinabrown and others added 13 commits June 17, 2024 00:13
The kronos-develop-director-b branch of CODES
was using an outdated version of ROSS and also
had compilation issues because of zeromq. This
commit changes it to be compatible with the master
branch of ROSS and fixes the zeromq compilation
issues.

Compilation with torch-jit was not occurring even with torch_enable set to 1.
This commit fixes torch-jit compilation with GPU support.

This commit makes the kronos-develop-director-b branch compatible with the master branch
and introduces ML modelling code to be used with the director.

Expand the Torch-JIT packet latency predictor to use an LP-aware feature
vector with terminal, packet, LP gid, router/group, queue, VC occupancy, and
processing-delay context. Update pure PDES trace export so the generated
training CSV records the true ROSS caller LP gid, keeping training data
consistent with runtime inference.

Update the component-level training and runtime feature construction paths
to use the expanded feature order for the two model targets:
travel_end_time_delta and next_packet_delay.
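
One way to keep training data and runtime inference consistent is to derive both from a single ordered feature list. The sketch below illustrates that idea only. The feature column names are hypothetical, since the commit does not list the real ones; the two target names are taken from the commit message.

```python
import csv
import io
from typing import List, Mapping

# Hypothetical feature names; the real order lives in the CODES sources.
FEATURE_ORDER = [
    "terminal_id", "packet_size", "lp_gid", "router_id",
    "group_id", "queue_len", "vc_occupancy", "processing_delay",
]
# Target names as given in the commit message.
TARGETS = ["travel_end_time_delta", "next_packet_delay"]


def to_vector(record: Mapping[str, object]) -> List[float]:
    """Build the model input in the shared order; a missing field raises KeyError."""
    return [float(record[name]) for name in FEATURE_ORDER]


def write_trace(records, stream) -> None:
    """Write training rows using the same column order the runtime path uses."""
    writer = csv.DictWriter(stream, fieldnames=FEATURE_ORDER + TARGETS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


record = {
    "terminal_id": 3, "packet_size": 512, "lp_gid": 42, "router_id": 7,
    "group_id": 1, "queue_len": 4, "vc_occupancy": 0.5, "processing_delay": 10.0,
    "travel_end_time_delta": 120.0, "next_packet_delay": 15.0,
}

# Round trip: export one row, read it back, and rebuild the runtime vector.
buffer = io.StringIO()
write_trace([record], buffer)
buffer.seek(0)
row_from_csv = next(csv.DictReader(buffer))
```

Because both paths read `FEATURE_ORDER`, `to_vector(row_from_csv)` and `to_vector(record)` produce the same vector, so a change to the feature order cannot silently apply to only one side.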

Add a NETWORK_SURROGATE debug_prints option for enabling
diagnostic logging in the Torch-JIT packet latency predictor and
Dragonfly Dally post-switch paths. Keep the diagnostics disabled by
default so normal surrogate runs are not noisy, while preserving an
explicit config knob for debugging inference-time feature construction
and post-switch event behavior.
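
The off-by-default behavior can be sketched as a logger that does nothing unless the flag is set. This is illustrative only: `make_debug_printer` is a hypothetical helper, and encoding the enabled state as the string "1" is an assumption about the config format, not something the commit states.

```python
from typing import Callable, List, Mapping


def make_debug_printer(config: Mapping[str, str], sink: List[str]) -> Callable[[str], None]:
    """Return a logger that is a no-op unless debug_prints is "1" (assumed encoding)."""
    enabled = config.get("debug_prints", "0") == "1"

    def debug(message: str) -> None:
        if enabled:
            sink.append(message)

    return debug


# Flag absent: diagnostics stay silent.
quiet_log: List[str] = []
make_debug_printer({}, quiet_log)("features=[...]")

# Flag set: the same call is recorded.
verbose_log: List[str] = []
make_debug_printer({"debug_prints": "1"}, verbose_log)("features=[...]")
```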

Extend the Torch-JIT surrogate path so different LP/component types can use
separate ML models while preserving the existing single global packet-latency
model for backward compatibility.

Key changes:
- Keep the existing packet_latency_model path for legacy Torch-JIT modes.
- Add explicit torch_jit_mode support for lp-aware-lp-type-models.
- Add terminal/default packet-latency model loading.
- Add router timing model loading for queueing-delay prediction.
- Add router timing trace output for high-fidelity PDES training runs.
- Add configurable router timing trace stride to avoid excessive trace volume.
- Correct router timing target to measure router-local residence/queueing delay
  from this_router_arrival instead of only propagation_ts.
- Add component-level training scripts for terminal packet latency and router
  queueing delay TorchScript models.
- Support CUDA training while exporting CPU-compatible TorchScript modules.
- Wire debug_prints into the average packet-latency predictor so average debug
  output respects the config flag.

The router model is intentionally timing-only: existing dragonfly routing still
selects valid output ports, channels, and paths, while the ML model predicts only
additional router-local queueing/contention delay.
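
A small sketch of what "timing-only" means in practice: the routing decision passes through unchanged and the model output only shifts the departure time. `schedule_departure` and its parameters are hypothetical, and clamping negative predictions to zero is an assumption rather than something the commit specifies.

```python
from typing import Tuple


def schedule_departure(
    now: float, base_delay: float, predicted_extra: float, routed_port: int
) -> Tuple[int, float]:
    """Return (port, departure time); the port chosen by routing is never altered."""
    # Assumption: a negative prediction is treated as zero extra delay.
    extra = max(0.0, predicted_extra)
    return routed_port, now + base_delay + extra


port, departure = schedule_departure(
    now=100.0, base_delay=5.0, predicted_extra=2.5, routed_port=3
)
```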
sanjaychari changed the title from "Digital twin sbir develop component ml" to "Add LP-type-aware Torch-JIT surrogate models" on May 28, 2026
sanjaychari marked this pull request as draft on May 28, 2026 19:22