Add LP-type-aware Torch-JIT surrogate models by sanjaychari · Pull Request #241 · codes-org/codes

sanjaychari · 2026-05-28T19:20:54Z

This PR extends the Director-controlled Torch-JIT surrogate path so CODES can move beyond a single global packet-latency model and support component/LP-type-specific ML models. The main addition is a new torch_jit_mode path for LP-type-aware surrogate inference, while preserving the existing global packet_latency_model behavior for backward compatibility.

The key motivation is to make the surrogate path more faithful to the PDES structure: terminals, routers, and other LP/component types can have different timing behavior, so the ML interface should allow separate models and feature paths rather than forcing all inference through one global latency predictor.
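As a rough illustration of the idea above, LP-type-aware inference can be sketched as a lookup keyed by LP type, with the global model as a fallback. This is a minimal sketch, not the PR's implementation: the `SurrogateRegistry` class, the LP type strings, and the fallback rule are all assumptions, and the lambdas stand in for loaded TorchScript modules.

```python
from typing import Callable, Dict, Optional, Sequence

# A "model" here is any callable mapping a feature vector to a predicted delay.
Model = Callable[[Sequence[float]], float]


class SurrogateRegistry:
    """Route inference to an LP-type-specific model, else to the global model."""

    def __init__(self, global_model: Optional[Model] = None) -> None:
        self.global_model = global_model
        self.by_lp_type: Dict[str, Model] = {}

    def register(self, lp_type: str, model: Model) -> None:
        self.by_lp_type[lp_type] = model

    def predict(self, lp_type: str, features: Sequence[float]) -> float:
        # Assumed fallback rule: use the global model when no type-specific one exists.
        model = self.by_lp_type.get(lp_type, self.global_model)
        if model is None:
            raise LookupError(f"no surrogate model available for LP type {lp_type!r}")
        return model(features)


# Stand-in models (hypothetical); real ones would be loaded TorchScript modules.
registry = SurrogateRegistry(global_model=lambda f: 100.0)
registry.register("router", lambda f: 5.0 + 2.0 * f[0])

router_delay = registry.predict("router", [3.0])      # uses the router-specific model
terminal_delay = registry.predict("terminal", [3.0])  # falls back to the global model
```

Keeping the global model as the default is one way to leave existing single-model configurations working unchanged.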

The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq. This commit changes it to be compatible with the master branch of ROSS and fixes the zeromq compilation issues.

Compilation with torch-jit was not occurring even with torch_enable set to 1. This commit fixes torch-jit compilation with GPU support.

This commit makes the kronos-develop-director-b branch compatible with the master branch and introduces ML modelling code to be used with the director.

Expand the Torch-JIT packet latency predictor to use an LP-aware feature vector with terminal, packet, LP gid, router/group, queue, VC occupancy, and processing-delay context. Update pure PDES trace export so the generated training CSV records the true ROSS caller LP gid, keeping training data consistent with runtime inference. Update the component-level training and runtime feature construction paths to use the expanded feature order for the two model targets: travel_end_time_delta and next_packet_delay.
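The point about keeping training data consistent with runtime inference comes down to sharing one feature order between the trace export and the runtime path. A minimal sketch of that idea follows. The column names in `FEATURE_ORDER` are hypothetical, since the description lists the feature groups but not the exact fields; only the two target names are taken from the PR text.

```python
from typing import Dict, List

# Hypothetical column names, one or two per feature group named in the description.
FEATURE_ORDER: List[str] = [
    "terminal_id", "packet_size", "lp_gid", "router_id", "group_id",
    "queue_length", "vc_occupancy", "processing_delay",
]

# Target names as given in the PR description.
TARGETS: List[str] = ["travel_end_time_delta", "next_packet_delay"]


def build_feature_vector(record: Dict[str, float]) -> List[float]:
    """Build a feature vector in the shared order; fail loudly on missing fields."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]


def csv_header() -> str:
    """Header for a training CSV that uses the same order as runtime inference."""
    return ",".join(FEATURE_ORDER + TARGETS)


record = {
    "terminal_id": 12, "packet_size": 4096, "lp_gid": 345, "router_id": 7,
    "group_id": 1, "queue_length": 3, "vc_occupancy": 0.25, "processing_delay": 1.5,
}
vector = build_feature_vector(record)
header = csv_header()
```

Deriving both the CSV header and the runtime vector from the same list means a reordered or added feature cannot silently diverge between training and inference.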

Add a NETWORK_SURROGATE debug_prints option for enabling diagnostic logging in the Torch-JIT packet latency predictor and Dragonfly Dally post-switch paths. Keep the diagnostics disabled by default so normal surrogate runs are not noisy, while preserving an explicit config knob for debugging inference-time feature construction and post-switch event behavior.
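The off-by-default behavior described above can be sketched as a logger that does nothing unless the flag is set. This is an illustration only: the dict-based config, the `"1"` string convention, and the `make_debug_printer` helper are assumptions, not the CODES configuration API.

```python
import io
import sys
from typing import Callable, Dict, TextIO

# Assumed shape: section name -> {key: string value}.
Config = Dict[str, Dict[str, str]]


def make_debug_printer(config: Config, out: TextIO = sys.stderr) -> Callable[[str], None]:
    """Return a logger that is a no-op unless NETWORK_SURROGATE.debug_prints is "1"."""
    enabled = config.get("NETWORK_SURROGATE", {}).get("debug_prints", "0") == "1"

    def debug(message: str) -> None:
        if enabled:
            print(f"[surrogate-debug] {message}", file=out)

    return debug


quiet_out, verbose_out = io.StringIO(), io.StringIO()
quiet = make_debug_printer({"NETWORK_SURROGATE": {}}, quiet_out)
verbose = make_debug_printer({"NETWORK_SURROGATE": {"debug_prints": "1"}}, verbose_out)

quiet("features=[...]")    # flag absent: nothing is written
verbose("features=[...]")  # flag set: one diagnostic line is written
```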

Extend the Torch-JIT surrogate path so different LP/component types can use separate ML models while preserving the existing single global packet-latency model for backward compatibility.

Key changes:

- Keep the existing packet_latency_model path for legacy Torch-JIT modes.
- Add explicit torch_jit_mode support for lp-aware-lp-type-models.
- Add terminal/default packet-latency model loading.
- Add router timing model loading for queueing-delay prediction.
- Add router timing trace output for high-fidelity PDES training runs.
- Add configurable router timing trace stride to avoid excessive trace volume.
- Correct router timing target to measure router-local residence/queueing delay from this_router_arrival instead of only propagation_ts.
- Add component-level training scripts for terminal packet latency and router queueing delay TorchScript models.
- Support CUDA training while exporting CPU-compatible TorchScript modules.
- Wire debug_prints into the average packet-latency predictor so average debug output respects the config flag.

The router model is intentionally timing-only: existing dragonfly routing still selects valid output ports, channels, and paths, while the ML model predicts only additional router-local queueing/contention delay.
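To make the router timing target and the trace stride concrete, here is a minimal sketch under stated assumptions. The `RouterEvent` record, its `departure_ts` field, and the "keep every N-th event" rule are hypothetical; only `this_router_arrival` is a name taken from the description, and the actual trace format is not shown there.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RouterEvent:
    this_router_arrival: float  # time the packet arrived at this router
    departure_ts: float         # hypothetical: time the packet left this router


def residence_delay(event: RouterEvent) -> float:
    """Router-local residence/queueing delay, measured from arrival at this router."""
    return event.departure_ts - event.this_router_arrival


def strided_trace(events: Iterable[RouterEvent], stride: int) -> List[float]:
    """Keep the delay of every `stride`-th event to limit trace volume."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [residence_delay(ev) for i, ev in enumerate(events) if i % stride == 0]


events = [
    RouterEvent(0.0, 4.0),
    RouterEvent(10.0, 17.0),
    RouterEvent(20.0, 22.0),
    RouterEvent(30.0, 39.0),
    RouterEvent(40.0, 41.0),
]
full_trace = strided_trace(events, stride=1)    # every event
sparse_trace = strided_trace(events, stride=2)  # events 0, 2, 4
```

A stride of 1 records every event. Larger strides trade training-set size for trace volume, which matters for long high-fidelity runs.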

kevinabrown and others added 13 commits June 17, 2024 00:13

MPI Replay: remove print_surrogate_stats() to compile cleanly

976bb79

director-b: started adding director LP for mpi-replay

7f42f4a

director-b: complete initial director LP prototype for mpi-replay

5ae2e7c

zmqml: update zmq server and requester to interface with director LP

138f46a

Fix zmq and ROSS compilation issues

39fbc4f

Fix torch-jit compilation

0651b5e

Allow cpu-based PyTorch usage

01a2b16

Add ML models

51f691b

Move ML models to surrogate directory

e42e75a

Improve debug print check mechanism

3c3f78d

sanjaychari changed the title ~~Digital twin sbir develop component ml~~ Add LP-type-aware Torch-JIT surrogate models May 28, 2026

sanjaychari marked this pull request as draft May 28, 2026 19:22

sanjaychari wants to merge 13 commits into codes-org:master from sanjaychari:digital-twin-sbir-develop-component-ml
