Add LP-type-aware Torch-JIT surrogate models #241
Draft
sanjaychari wants to merge 13 commits into
The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq. This commit changes it to be compatible with the master branch of ROSS and fixes the zeromq compilation issues.
Compilation with torch-jit was not occurring even with torch_enable set to 1. This commit fixes torch-jit compilation with GPU support.
This commit makes the kronos-develop-director-b branch compatible with the master branch and introduces ML modelling code to be used with the director.
Expand the Torch-JIT packet latency predictor to use an LP-aware feature vector with terminal, packet, LP gid, router/group, queue, VC occupancy, and processing-delay context. Update pure PDES trace export so the generated training CSV records the true ROSS caller LP gid, keeping training data consistent with runtime inference. Update the component-level training and runtime feature construction paths to use the expanded feature order for the two model targets: travel_end_time_delta and next_packet_delay.
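The requirement that training data and runtime inference agree on feature order can be illustrated with a minimal sketch. It assumes a single shared feature-order definition used by both the CSV export and the runtime path. The feature names in `FEATURE_ORDER` and the helpers `build_feature_vector` and `csv_header` are hypothetical and are not taken from the PR; only the two target names come from the commit message.

```python
from typing import List, Mapping, Sequence

# Hypothetical feature names, loosely following the categories listed in the
# commit message. The real names and order are defined by the PR's code.
FEATURE_ORDER: Sequence[str] = (
    "terminal_id",
    "packet_size",
    "lp_gid",
    "router_id",
    "group_id",
    "queue_length",
    "vc_occupancy",
    "processing_delay",
)

# The two model targets named in the commit message.
TARGETS: Sequence[str] = ("travel_end_time_delta", "next_packet_delay")


def build_feature_vector(record: Mapping[str, float]) -> List[float]:
    """Return the features of `record` in the shared FEATURE_ORDER.

    Raises KeyError if any feature is missing, so a mismatch between the
    trace export and the runtime path fails loudly instead of silently
    shifting columns.
    """
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]


def csv_header() -> List[str]:
    """Column names for a training CSV: features first, then targets."""
    return list(FEATURE_ORDER) + list(TARGETS)
```

Deriving both the CSV header and the runtime vector from one definition is one way to keep the two paths from drifting apart; the PR may enforce this differently.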
Add a NETWORK_SURROGATE debug_prints option for enabling diagnostic logging in the Torch-JIT packet latency predictor and Dragonfly Dally post-switch paths. Keep the diagnostics disabled by default so normal surrogate runs are not noisy, while preserving an explicit config knob for debugging inference-time feature construction and post-switch event behavior.
Extend the Torch-JIT surrogate path so different LP/component types can use separate ML models while preserving the existing single global packet-latency model for backward compatibility.

Key changes:
- Keep the existing packet_latency_model path for legacy Torch-JIT modes.
- Add explicit torch_jit_mode support for lp-aware-lp-type-models.
- Add terminal/default packet-latency model loading.
- Add router timing model loading for queueing-delay prediction.
- Add router timing trace output for high-fidelity PDES training runs.
- Add configurable router timing trace stride to avoid excessive trace volume.
- Correct router timing target to measure router-local residence/queueing delay from this_router_arrival instead of only propagation_ts.
- Add component-level training scripts for terminal packet latency and router queueing delay TorchScript models.
- Support CUDA training while exporting CPU-compatible TorchScript modules.
- Wire debug_prints into the average packet-latency predictor so average debug output respects the config flag.

The router model is intentionally timing-only: existing dragonfly routing still selects valid output ports, channels, and paths, while the ML model predicts only additional router-local queueing/contention delay.
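The corrected router timing target and the trace stride can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `departure_ts`, `router_residence_delay`, and `should_trace` are hypothetical names, and the sketch assumes the stride keeps every N-th router event. Only `this_router_arrival` is named in the commit message.

```python
def router_residence_delay(this_router_arrival: float, departure_ts: float) -> float:
    """Router-local residence/queueing delay for one packet.

    Measured from the packet's arrival at this router (this_router_arrival)
    to its departure (departure_ts, a hypothetical name), so time spent
    queued inside the router is included in the target.
    """
    delay = departure_ts - this_router_arrival
    if delay < 0:
        raise ValueError("departure precedes arrival")
    return delay


def should_trace(event_index: int, stride: int) -> bool:
    """Keep every `stride`-th router event to bound trace volume.

    Assumes a simple modulo policy; the PR's stride semantics may differ.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return event_index % stride == 0
```

For example, with a stride of 4, events 0, 4, and 8 out of the first ten would be written to the trace.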
This PR extends the Director-controlled Torch-JIT surrogate path so CODES can move beyond a single global packet-latency model and support component/LP-type-specific ML models. The main addition is a new torch_jit_mode path for LP-type-aware surrogate inference, while preserving the existing global packet_latency_model behavior for backward compatibility.
The key motivation is to make the surrogate path more faithful to the PDES structure: terminals, routers, and other LP/component types can have different timing behavior, so the ML interface should allow separate models and feature paths rather than forcing all inference through one global latency predictor.
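The per-LP-type dispatch described above might look roughly like the following sketch. It assumes a registry keyed by LP type that falls back to a single global model when no type-specific model is registered; that fallback policy, the `SurrogateRegistry` class, and the type strings are hypothetical. Plain callables stand in for the TorchScript modules the PR loads.

```python
from typing import Callable, Dict, Optional, Sequence

# Stand-in for a loaded model: maps a feature vector to a predicted delay.
Model = Callable[[Sequence[float]], float]


class SurrogateRegistry:
    """Hypothetical registry of per-LP-type models with a global fallback."""

    def __init__(self, global_model: Optional[Model] = None) -> None:
        self._global = global_model
        self._by_type: Dict[str, Model] = {}

    def register(self, lp_type: str, model: Model) -> None:
        """Attach a model to one LP/component type, e.g. "router"."""
        self._by_type[lp_type] = model

    def predict(self, lp_type: str, features: Sequence[float]) -> float:
        """Use the type-specific model if present, else the global one."""
        model = self._by_type.get(lp_type, self._global)
        if model is None:
            raise LookupError(f"no model available for LP type {lp_type!r}")
        return model(features)


# Usage: one global latency model plus a dedicated router timing model.
registry = SurrogateRegistry(global_model=lambda f: 1.0)
registry.register("router", lambda f: 2.0)
```

Keeping the global model as the default means configurations that only set a single packet-latency model continue to work unchanged, which matches the backward-compatibility goal stated in the description.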