feat: add pretrained orbax cache for fast WAN inference loads by csgoogle · Pull Request #406 · AI-Hypercomputer/maxdiffusion

csgoogle · 2026-05-15T11:47:00Z

Pretrained Orbax cache for fast WAN inference loads

Caches WAN weights as an Orbax checkpoint so inference can skip the slow diffusers load on repeat runs (~10× faster). The first run loads from diffusers and writes the cache; subsequent runs restore directly from it. Works with both local paths and GCS buckets.
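The hit/miss behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: a pickle file stands in for the Orbax checkpoint so the example runs with the standard library only, and `load_with_cache`, `slow_loader`, and `weights.pkl` are hypothetical names.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def load_with_cache(
    cache_dir: str, slow_loader: Callable[[], Dict[str, Any]]
) -> Tuple[Dict[str, Any], str]:
    """Return (weights, source), where source is "cache" or "diffusers".

    An empty cache_dir disables caching, mirroring an option that is off by default.
    """
    if not cache_dir:
        return slow_loader(), "diffusers"

    # Hypothetical file name; the real cache is an Orbax checkpoint directory.
    cache_file = Path(cache_dir) / "weights.pkl"

    # Cache hit: restore directly and skip the slow load.
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f), "cache"

    # Cache miss: do the slow load once, then populate the cache.
    weights = slow_loader()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(weights, f)
    return weights, "diffusers"
```

Calling it twice with the same directory should invoke `slow_loader` only on the first call.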

Changes

Added a new pretrained_orbax_dir config option to all 6 WAN configs (empty by default, so the feature is off unless set). Accepts a local path or a gs:// bucket path.
Load priority: training checkpoint → pretrained cache → diffusers (the cache is populated automatically on a miss).
generate_wan and generate_wan_animate now load with use_pretrained_cache=True.
Refactored the shared Orbax logic into the WanCheckpointer base class; subclasses now only declare model_name, checkpoint_state_item_names, and pretrained_state_sources.
Added an animate checkpointer along with tests covering the cache hit / miss / save paths.
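The load priority and the declarative subclasses listed above might look something like the sketch below. Only the attribute names `model_name`, `checkpoint_state_item_names`, and `pretrained_state_sources` and the `use_pretrained_cache` flag come from this description; the class names, the injected callables, the attribute values, and their types are assumptions made for illustration and do not reflect the actual `WanCheckpointer` code.

```python
from typing import Any, Callable, Optional, Tuple

State = Any


class CheckpointerSketch:
    """Hypothetical stand-in for a shared checkpointer base class."""

    # Subclasses only declare these three attributes.
    model_name: str = ""
    checkpoint_state_item_names: Tuple[str, ...] = ()
    pretrained_state_sources: Tuple[str, ...] = ()

    def __init__(
        self,
        restore_training: Callable[[], Optional[State]],
        restore_cache: Callable[[], Optional[State]],
        load_diffusers: Callable[[], State],
        save_cache: Callable[[State], None],
    ) -> None:
        # The four callables are injected so the sketch stays self-contained.
        self.restore_training = restore_training
        self.restore_cache = restore_cache
        self.load_diffusers = load_diffusers
        self.save_cache = save_cache

    def load(self, use_pretrained_cache: bool = False) -> Tuple[State, str]:
        # 1. A training checkpoint wins if one exists.
        state = self.restore_training()
        if state is not None:
            return state, "training_checkpoint"
        # 2. Otherwise try the pretrained cache, if enabled.
        if use_pretrained_cache:
            state = self.restore_cache()
            if state is not None:
                return state, "pretrained_cache"
        # 3. Fall back to diffusers and populate the cache on a miss.
        state = self.load_diffusers()
        if use_pretrained_cache:
            self.save_cache(state)
        return state, "diffusers"


class Wan21Sketch(CheckpointerSketch):
    # All three values below are hypothetical placeholders.
    model_name = "wan2.1"
    checkpoint_state_item_names = ("wan_state", "wan_config")
    pretrained_state_sources = ("transformer",)
```

Keeping the ordering in one `load` method is what lets each variant shrink to a few class attributes.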

github-actions · 2026-05-15T11:47:09Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

Perseus14

Can you also add this for the other WAN models?

csgoogle · 2026-05-25T13:20:05Z

> Can you also add this for the other WAN models?

done

github-actions · 2026-05-26T06:57:10Z

🤖 Hi @csgoogle, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request introduces a pretrained Orbax cache for WAN models, providing a significant performance boost (~10×) for subsequent inference loads by skipping the slow diffusers load. The implementation is well-structured, with a clean refactoring of the shared Orbax logic into a base WanCheckpointer class, and includes comprehensive tests for the new caching strategy.

🔍 General Feedback

Architecture: The refactoring of WanCheckpointer into an abstract base class is a great improvement, making it easy to add support for new WAN model variants with minimal boilerplate.
Performance: The automatic population of the cache on the first run is a user-friendly feature that drastically improves the startup time for repeated tasks.
Correctness: The restructuring of multi-transformer checkpoints (e.g., WAN 2.2) is handled correctly to maintain compatibility with existing pipeline loading logic.
Robustness: I've suggested adding exception handling to the cache-saving logic to ensure that intermittent write failures (e.g., GCS permissions) do not disrupt the main execution flow.
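The robustness suggestion above could be approached along these lines. This is a sketch under assumptions rather than the reviewed code: `save_cache_best_effort` and `save_fn` are hypothetical names, and the broad `except Exception` is a deliberate choice here because the cache is an optimization whose failure should not stop inference.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("wan_cache_sketch")


def save_cache_best_effort(save_fn: Callable[[Any], None], state: Any) -> bool:
    """Try to write the cache; log and continue if the write fails.

    Returns True when the save succeeded and False otherwise.
    """
    try:
        save_fn(state)
    except Exception as exc:  # e.g. a permissions error on a bucket write
        logger.warning("Pretrained cache save failed; continuing without it: %s", exc)
        return False
    return True
```

A caller can ignore the return value; the next run would then simply see another cache miss and retry the save.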

Perseus14 · 2026-05-26T08:46:51Z

Would this change have an impact on any training runs?

csgoogle · 2026-05-26T09:25:35Z

> Would this change have an impact on any training runs?

It won't. There are unit tests covering the training load, and the change is only enabled when the flag is set to true.

csgoogle · 2026-05-26T19:47:46Z

> Would this change have an impact on any training runs?

I simplified it; the change no longer touches the training code.

Perseus14 reviewed May 16, 2026

Comment thread src/maxdiffusion/checkpointing/wan_checkpointer_2_2.py Outdated

Comment thread src/maxdiffusion/checkpointing/wan_checkpointer_i2v_2p2.py Outdated

csgoogle force-pushed the wan-orbax-checkpoint-cache branch from 1e3be2d to 4e2c7f1 Compare May 25, 2026 13:13

csgoogle marked this pull request as ready for review May 25, 2026 13:19

csgoogle requested a review from entrpn as a code owner May 25, 2026 13:19

csgoogle added the gemini-review label May 25, 2026

csgoogle self-assigned this May 25, 2026

csgoogle requested a review from mbohlool May 25, 2026 13:29

csgoogle added gemini-review and removed gemini-review labels May 26, 2026

github-actions Bot reviewed May 26, 2026

Comment thread src/maxdiffusion/checkpointing/wan_checkpointer.py Outdated

Comment thread src/maxdiffusion/checkpointing/wan_checkpointer.py Outdated

csgoogle force-pushed the wan-orbax-checkpoint-cache branch 5 times, most recently from dd62933 to c9fa891 Compare May 26, 2026 19:44

feat: add pretrained orbax cache for fast WAN inference loads

874f2c1

csgoogle force-pushed the wan-orbax-checkpoint-cache branch from c9fa891 to 874f2c1 Compare May 26, 2026 19:58

csgoogle wants to merge 1 commit into main from wan-orbax-checkpoint-cache
