
Add distillation launchers for qwen3-30b-a3b-base and gpt-oss-20b #4028

Open
gagika wants to merge 1 commit into main from gagik-distill-perf

Conversation

@gagika
Collaborator

@gagika gagika commented May 31, 2026

Description

One-command launchers for running distillation on TPU v7x. Each script sets the
right XLA flags, mounts a grain arrayrecord dataset via gcsfuse (ClimbMix by
default; configurable via XPK_DATASET_BUCKET / XPK_DATASET_SUBPATH),
configures distillation knobs, stages the HF tokenizer when needed, and submits
a workload via XPK.

Usage

# qwen3-30b-a3b-base distillation (~20% MFU)
bash scripts/distillation/distill_qwen3_30b_base.sh submit

# gpt-oss-20b distillation (~17% MFU)
bash scripts/distillation/distill_gpt_oss_20b.sh submit

# qwen3-30b at pdbs=8 with activation offload (~22% MFU)
XPK_DISTILL_CONFIG=src/maxtext/configs/post_train/distillation_qwen3_30b_base_pdbs8.yml \
XPK_YAML_GCS=gs://agagik-us/distill-configs/distillation_qwen3_30b_base_pdbs8.yml \
  bash scripts/distillation/distill_qwen3_30b_base.sh submit

Each launcher takes a mode argument (default submit):

  • submit — stage the YAML to GCS and create the xpk workload
  • monitor — stream logs for the last submitted workload
  • resume_until_done — auto-resubmit on failure until the run completes
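
A minimal sketch of how a mode argument like this could be dispatched, assuming a bash launcher. `submit_workload` appears as a function name in the diff, but its body here is a stub; `monitor_workload`, `run_mode`, and the retry cap are hypothetical and only illustrate the three modes listed above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the mode dispatch; the real launchers may differ.
# Both workload functions below are stand-in stubs.

submit_workload()  { echo "submitting workload"; }
monitor_workload() { echo "streaming logs"; }

# Resubmit until the submission succeeds or max_attempts is reached.
resume_until_done() {
  local max_attempts="${1:-3}" attempt=1
  until submit_workload; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
  done
}

# Dispatch on the mode argument, defaulting to submit.
run_mode() {
  local mode="${1:-submit}"
  case "$mode" in
    submit)            submit_workload ;;
    monitor)           monitor_workload ;;
    resume_until_done) resume_until_done ;;
    *) echo "unknown mode: ${mode}" >&2; return 2 ;;
  esac
}
```

Calling `run_mode` with no argument falls through to `submit`, which matches the stated default. The retry cap is an assumption added so the loop in the sketch always terminates.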

Tests

End-to-end tests for both the gpt-oss and qwen3-30b models.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@github-actions

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


## 📋 Review Summary

This Pull Request introduces distillation launchers and configurations for qwen3-30b-a3b-base and gpt-oss-20b models on TPU v7x. The additions are useful for standardizing distillation runs, but there are a few issues regarding redundancy and hardcoded personal paths.

🔍 General Feedback

  • Redundant Patch File: The file distillation-wrappers.patch appears to be a redundant diff of the entire PR and should be removed.
  • Hardcoded Defaults: Several scripts and configuration files contain default GCS paths and images pointing to personal buckets (agagik-us, yujiedeng-maxtext-dev). These should ideally be replaced with generic placeholders or public resources to improve maintainability and portability for other users.
  • Environment Management: The use of /dev/shm for TMPDIR and Hugging Face caches is a good performance optimization to avoid ephemeral storage limits, but setting it globally as TMPDIR should be done with caution.
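
One way to act on the TMPDIR caution is to scope the variable to a single command rather than exporting it for the whole pod command. The sketch below assumes bash; `run_with_shm_tmp` is a hypothetical helper and not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: scope the /dev/shm temp dir to one command instead of exporting it globally.
# run_with_shm_tmp is a hypothetical helper name.

run_with_shm_tmp() {
  # With no command to run there is nothing to scope, so do nothing.
  [ "$#" -gt 0 ] || return 0
  # The VAR=value prefix applies only to the environment of this one command.
  TMPDIR=/dev/shm "$@"
}
```

For example, `run_with_shm_tmp python3 -m some.module` would see `TMPDIR=/dev/shm`, while later commands in the same shell keep whatever `TMPDIR` they had before.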

Comment thread scripts/distillation/distill_gpt_oss_20b.sh Outdated
Comment thread scripts/distillation/distill_qwen3_30b_base.sh Outdated
Comment thread distillation-wrappers.patch Outdated
Comment thread src/maxtext/configs/post_train/distillation_gpt_oss_20b.yml Outdated
Comment thread src/maxtext/configs/post_train/distillation_gpt_oss_20b.yml Outdated
Comment thread scripts/distillation/distill_gpt_oss_20b.sh Outdated
@github-actions

github-actions Bot commented Jun 1, 2026

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


## 📋 Review Summary

This PR introduces comprehensive one-command distillation launchers for qwen3-30b-a3b-base and gpt-oss-20b on TPU v7x. The additions include performance-tuned XLA flags, optimized YAML configurations (including activation offload for higher batch sizes), and enhancements to the shared XPK submission script to handle tokenizer staging and HF caching efficiently.

🔍 General Feedback

  • Robustness: The shared run_distill_xpk.sh was improved to handle HF caching in /dev/shm, which is a great optimization for TPU workloads. I've suggested some minor quoting fixes to ensure these scripts handle paths with spaces or special characters reliably.
  • Documentation: The scripts and YAML files include helpful comments explaining specific model quirks (e.g., the distill_beta=0 requirement for gpt-oss).
  • Defaults: While demo defaults are provided, I recommended using more generic placeholders for buckets and images to prevent accidental use of dev resources by other users.

export XPK_ZONE="${XPK_ZONE:-us-central1}"
export XPK_DEVICE_TYPE="${XPK_DEVICE_TYPE:-tpu7x-4x4x4}"
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://agagik-us/distillation}"
export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"

🟢 Using a dev image as a demo default is acceptable, but it might be better to point to a more stable or public reference if available.
Suggested change
export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"
export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"

"$image_flag=$XPK_BASE_IMAGE" \
--command "export PYTHONPATH=/deps/src:/app/src; \
export BASE_OUTPUT_DIRECTORY=${OUTPUT_DIR}; \
export LIBTPU_INIT_ARGS='${libtpu_init_args}'; \

🟡 Wrap variables in quotes within the command string to handle paths with spaces correctly when executed in the TPU pod.
Suggested change
export LIBTPU_INIT_ARGS='${libtpu_init_args}'; \
export HF_HOME=\"${XPK_HF_CACHE_DIR}\"; export HF_DATASETS_CACHE=\"${XPK_HF_CACHE_DIR}/datasets\"; mkdir -p \"${XPK_HF_CACHE_DIR}/datasets\"; \

--xla_tpu_aggressive_opt_barrier_removal=ENABLED \
--xla_lhs_prioritize_async_depth_over_stall=ENABLED \
--xla_tpu_enable_ag_backward_pipelining=true \
--xla_should_allow_loop_variant_parameter_in_chain=ENABLED \

🟢 Use `printf --` to safely handle cases where the expansion of `${XPK_LIBTPU_INIT_ARGS:-$default_libtpu_args}` might start with a hyphen, preventing it from being interpreted as a `printf` flag.
Suggested change
--xla_should_allow_loop_variant_parameter_in_chain=ENABLED \
libtpu_init_args=$(printf -- '%s' "${XPK_LIBTPU_INIT_ARGS:-$default_libtpu_args}" | tr -s '[:space:]' ' ')
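
A small sketch of the normalization step in isolation, assuming bash and a POSIX `tr`. `normalize_flags` is a hypothetical wrapper name. Note that in bash the format string `'%s'` is already the first operand, so a data argument that starts with `-` should be printed as data either way; the `--` looks harmless rather than strictly required.

```shell
#!/usr/bin/env bash
# Sketch of the whitespace normalization; normalize_flags is a hypothetical name.
normalize_flags() {
  # '%s' is the format operand, so "$1" is treated as data even when it starts with '-'.
  # tr maps every whitespace character to a space, then squeezes runs of spaces to one.
  printf -- '%s' "$1" | tr -s '[:space:]' ' '
}
```

Passing a multi-line flag string such as `$'--a=1 \n   --b=2'` yields a single line with one space between flags, which is the shape `LIBTPU_INIT_ARGS` is given in the script.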


# Optional: stage HF tokenizer files from GCS for models whose tokenizer isn't
# baked into the image (e.g. gpt-oss).
tokenizer_prelude=""

🟡 Wrap variables in quotes to ensure the command remains valid if paths contain spaces or special characters.
Suggested change
tokenizer_prelude=""
tokenizer_prelude="mkdir -p \"${XPK_TOKENIZER_LOCAL}\" && gcloud storage rsync \"${XPK_TOKENIZER_GCS}\" \"${XPK_TOKENIZER_LOCAL}\";"

export XPK_PROJECT="${XPK_PROJECT:-cloud-tpu-multipod-dev}"
export XPK_ZONE="${XPK_ZONE:-us-central1}"
export XPK_DEVICE_TYPE="${XPK_DEVICE_TYPE:-tpu7x-4x4x4}"
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://agagik-us/distillation}"

🟢 Using a specific user bucket as a demo default is acceptable since it's explicitly labeled, but consider using a generic placeholder like `gs://YOUR-BUCKET/distillation` to encourage users to set their own environment variables.
Suggested change
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://agagik-us/distillation}"
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://YOUR-BUCKET/distillation}"

export XPK_ZONE="${XPK_ZONE:-us-central1}"
export XPK_DEVICE_TYPE="${XPK_DEVICE_TYPE:-tpu7x-4x4x4}"
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://agagik-us/distillation}"
export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"

🟢 Consider using a more generic or stable base image reference for the demo default.
Suggested change
export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"
export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"

export XPK_PROJECT="${XPK_PROJECT:-cloud-tpu-multipod-dev}"
export XPK_ZONE="${XPK_ZONE:-us-central1}"
export XPK_DEVICE_TYPE="${XPK_DEVICE_TYPE:-tpu7x-4x4x4}"
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://agagik-us/distillation}"

🟢 Consider using a generic placeholder for the demo default.
Suggested change
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://agagik-us/distillation}"
export XPK_BASE_OUTPUT_DIR="${XPK_BASE_OUTPUT_DIR:-gs://YOUR-BUCKET/distillation}"

@github-actions

github-actions Bot commented Jun 2, 2026

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@gagika gagika force-pushed the gagik-distill-perf branch from bc5cc4a to fb8d0fb Compare June 2, 2026 02:15

@github-actions github-actions Bot left a comment


## 📋 Review Summary

This PR introduces comprehensive distillation launchers and configurations for qwen3-30b-a3b-base and gpt-oss-20b models on TPU v7x. The updates to the underlying run_distill_xpk.sh script, including HF cache management and GCS-based asset staging, significantly improve the robustness and ease of use for distillation workloads.

🔍 General Feedback

  • Out-of-the-box Usability: While the scripts are well-structured, several defaults point to specific user buckets (gs://agagik-us/). Replacing these with generic placeholders or documenting them as mandatory overrides would improve the experience for the broader team.
  • Parallelism Consistency: The gpt-oss-20b configuration has a hardcoded ici_fsdp_parallelism that conflicts with the default cluster size in its launcher script. Using auto (-1) is preferred for better scalability.
  • XLA Tuning: The inclusion of tuned XLA flags for both models is a great addition, providing clear performance targets (~17-20% MFU).

@@ -290,6 +330,11 @@ submit_workload() {
"$image_flag=$XPK_BASE_IMAGE" \
--command "export PYTHONPATH=/deps/src:/app/src; \

🟡 Exporting `LIBTPU_INIT_ARGS` with single quotes inside the `--command` string can be fragile if any XLA flags contain single quotes (e.g., for values with spaces). While current flags seem safe, using double quotes or a more robust escaping method would be more future-proof.
Suggested change
--command "export PYTHONPATH=/deps/src:/app/src; \
export LIBTPU_INIT_ARGS=\"${libtpu_init_args}\"; \
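
As one possible "more robust escaping method", bash's `printf %q` can quote a value before it is spliced into the command string. This is a sketch only: `build_export` is a hypothetical helper, and it assumes the shell that re-parses the command in the pod is also bash, since `%q` output is bash-specific.

```shell
#!/usr/bin/env bash
# Sketch: shell-quote a value before splicing it into a command string.
# build_export is a hypothetical helper; it relies on bash's printf %q.
build_export() {
  local name="$1" value="$2"
  # %q escapes quotes, spaces, and other shell metacharacters in the value.
  printf 'export %s=%q;' "$name" "$value"
}
```

The resulting snippet can be concatenated into the `--command` string; a value containing single or double quotes should then survive one round of re-parsing unchanged.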

export XPK_DATASET_BUCKET="${XPK_DATASET_BUCKET:-maxtext-dataset}"
export XPK_DATASET_SUBPATH="${XPK_DATASET_SUBPATH:-array-record/climbmix/*.arrayrecord}"

# Stage HF tokenizer files (not in the image for gpt-oss).

🟡 Using a specific user's bucket as a default for `XPK_YAML_GCS` will cause the `submit` mode to fail for any other user due to lack of write permissions. Consider using a more generic placeholder or documenting this as a mandatory override.
Suggested change
# Stage HF tokenizer files (not in the image for gpt-oss).
export XPK_YAML_GCS="${XPK_YAML_GCS:-gs://YOUR-BUCKET/distill-configs/distillation_gpt_oss_20b.yml}"
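
If the variable is documented as a mandatory override instead, a fail-fast check is one option. The sketch below assumes bash; `require_yaml_gcs` and its error messages are hypothetical, while `XPK_YAML_GCS` is the variable discussed above.

```shell
#!/usr/bin/env bash
# Sketch: fail fast when a required variable is unset, rather than defaulting to a personal bucket.
# require_yaml_gcs is a hypothetical helper name.
require_yaml_gcs() {
  # ${VAR:?msg} aborts the subshell with msg if VAR is unset or empty.
  ( : "${XPK_YAML_GCS:?set XPK_YAML_GCS to a gs:// path you can write to}" ) || return 1
  case "$XPK_YAML_GCS" in
    gs://*) return 0 ;;
    *) echo "XPK_YAML_GCS must start with gs://" >&2; return 1 ;;
  esac
}
```

Calling this at the top of `submit` would surface a clear message before any staging is attempted, instead of a permissions error from the copy step.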

export XPK_BASE_IMAGE="${XPK_BASE_IMAGE:-gcr.io/cloud-tpu-multipod-dev/maxtext_base_image:agagik-distill}"
export XPK_PRIORITY="${XPK_PRIORITY:-high}"

export XPK_USE_GCSFUSE=1

🟡 Similar to the GPT-OSS script, the default `XPK_YAML_GCS` points to a specific user's bucket, which will prevent other users from using the `submit` mode out of the box.
Suggested change
export XPK_USE_GCSFUSE=1
export XPK_YAML_GCS="${XPK_YAML_GCS:-gs://YOUR-BUCKET/distill-configs/distillation_qwen3_30b_base.yml}"

# distill_beta=0: decoder feature loss is broken on gpt-oss.
export DISTILL_ALPHA="${DISTILL_ALPHA:-0.5}"
export DISTILL_TEMPERATURE="${DISTILL_TEMPERATURE:-1.0}"
export DISTILL_BETA="${DISTILL_BETA:-0}"

🟢 For consistency with the `qwen3` script and the default in `run_distill_xpk.sh`, consider using `61440` (60 MiB) unless `65536` (64 MiB) was specifically found to be necessary for `gpt-oss-20b`.
Suggested change
export DISTILL_BETA="${DISTILL_BETA:-0}"
export XPK_LIBTPU_INIT_ARGS="${XPK_LIBTPU_INIT_ARGS:---xla_tpu_scoped_vmem_limit_kib=61440 \
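
For reference, the flag name ends in `_kib`, so the two values differ by 4 MiB. A quick check of the conversion, with `kib_to_mib` as a hypothetical helper:

```shell
#!/usr/bin/env bash
# Convert a KiB limit to MiB with integer arithmetic.
kib_to_mib() { echo $(( $1 / 1024 )); }

kib_to_mib 61440   # prints 60
kib_to_mib 65536   # prints 64
```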

Comment thread src/maxtext/configs/post_train/distillation_gpt_oss_20b.yml
@github-actions

github-actions Bot commented Jun 2, 2026

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


## 📋 Review Summary

This PR introduces well-structured distillation launchers for qwen3-30b-a3b-base and gpt-oss-20b models, significantly simplifying the setup for these workloads on TPU v7x. The inclusion of tuned XLA flags and optimized configuration files demonstrates a strong focus on performance (MFU).

🔍 General Feedback

  • Robustness: The shell scripts could be made more robust by consistently quoting paths and environment variables to handle potential special characters or spaces.
  • Consistency: A few XLA flags use true instead of the more standard ENABLED value found elsewhere in the repository; aligning these improves maintainability.
  • Explicit Overrides: Explicitly passing the staged tokenizer path to the training script ensures that the workload uses the intended assets regardless of the pod's working directory.
  • Documentation: The scripts include helpful comments and usage examples, which is great for usability.

# by latency_hiding_layer_scheduler.
export XPK_LIBTPU_INIT_ARGS="${XPK_LIBTPU_INIT_ARGS:---xla_tpu_scoped_vmem_limit_kib=65536 \
--xla_tpu_impure_enable_packed_bf16_math_ops=true \
--xla_tpu_aggressive_opt_barrier_removal=true \

🟡 For consistency with other scripts in the repository (e.g., `distill_qwen3_30b_base.sh` and `run_distill_xpk.sh`) and the default XLA flags defined in `benchmarks/xla_flags_library.py`, consider using `ENABLED` instead of `true` for this flag.
Suggested change
--xla_tpu_aggressive_opt_barrier_removal=true \
--xla_tpu_aggressive_opt_barrier_removal=ENABLED \

export BASE_OUTPUT_DIRECTORY=${OUTPUT_DIR}; \
export LIBTPU_INIT_ARGS='${libtpu_init_args}'; \
export TMPDIR=/dev/shm; export JAX_COMPILATION_CACHE_DIR=/dev/shm/jax_cache; \
export HF_HOME=${XPK_HF_CACHE_DIR}; export HF_DATASETS_CACHE=${XPK_HF_CACHE_DIR}/datasets; mkdir -p ${XPK_HF_CACHE_DIR}/datasets; \

🟡 Quote the environment variable expansion and the directory path for robustness.
Suggested change
export HF_HOME=${XPK_HF_CACHE_DIR}; export HF_DATASETS_CACHE=${XPK_HF_CACHE_DIR}/datasets; mkdir -p ${XPK_HF_CACHE_DIR}/datasets; \
export HF_HOME='${XPK_HF_CACHE_DIR}'; export HF_DATASETS_CACHE='${XPK_HF_CACHE_DIR}/datasets'; mkdir -p '${XPK_HF_CACHE_DIR}/datasets'; \

export HF_HOME=${XPK_HF_CACHE_DIR}; export HF_DATASETS_CACHE=${XPK_HF_CACHE_DIR}/datasets; mkdir -p ${XPK_HF_CACHE_DIR}/datasets; \
${yaml_prelude} \
${tokenizer_prelude} \
${gcsfuse_prelude} \

🟠 When `XPK_TOKENIZER_LOCAL` is provided and staged, it should be explicitly passed to the training script as `tokenizer_path`. This ensures it's used instead of the potentially relative default value in the YAML, which might not resolve correctly depending on the working directory in the pod.
Suggested change
${gcsfuse_prelude} \
python3 -m maxtext.trainers.post_train.distillation.train_distill ${XPK_DISTILL_CONFIG} \
run_name=${XPK_RUN_NAME} \
${grain_files_override} \
${steps_override} \
${checkpoint_period_override} \
tokenizer_path=${XPK_TOKENIZER_LOCAL:-} \
distill_alpha=${DISTILL_ALPHA} \
distill_temperature=${DISTILL_TEMPERATURE} \
distill_beta=${DISTILL_BETA} \
distill_layer_indices="${DISTILL_LAYER_INDICES}"
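
One caveat with the suggestion as written: when `XPK_TOKENIZER_LOCAL` is unset, `tokenizer_path=${XPK_TOKENIZER_LOCAL:-}` expands to a bare `tokenizer_path=`, which may override the YAML default with an empty value. A sketch of a conditional override, in the style of the existing `*_override` variables, could look like this; `tokenizer_override` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: emit tokenizer_path=... only when a local tokenizer dir is configured.
# tokenizer_override is a hypothetical helper mirroring the *_override variables in the script.
tokenizer_override() {
  if [ -n "${XPK_TOKENIZER_LOCAL:-}" ]; then
    printf 'tokenizer_path=%s' "$XPK_TOKENIZER_LOCAL"
  fi
}
```

The launcher could then splice `$(tokenizer_override)` into the argument list, so the YAML default stays in effect whenever no staged tokenizer is configured.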

# Optional: stage HF tokenizer files from GCS for models whose tokenizer isn't
# baked into the image (e.g. gpt-oss).
tokenizer_prelude=""
if [ -n "${XPK_TOKENIZER_GCS:-}" ] && [ -n "${XPK_TOKENIZER_LOCAL:-}" ]; then

🟡 Quote the paths to handle potential spaces or special characters.
Suggested change
if [ -n "${XPK_TOKENIZER_GCS:-}" ] && [ -n "${XPK_TOKENIZER_LOCAL:-}" ]; then
tokenizer_prelude="mkdir -p '${XPK_TOKENIZER_LOCAL}' && gcloud storage rsync '${XPK_TOKENIZER_GCS}' '${XPK_TOKENIZER_LOCAL}';"
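
Putting the guard and the quoting together, a sketch of the prelude construction might look like the following. `build_tokenizer_prelude` is a hypothetical helper; the `mkdir` and `gcloud storage rsync` commands are taken from the suggestion above, and the `%q` quoting assumes the pod re-parses the string with bash.

```shell
#!/usr/bin/env bash
# Sketch: build the staging prelude only when both variables are set.
# build_tokenizer_prelude is a hypothetical helper name.
build_tokenizer_prelude() {
  local gcs="${XPK_TOKENIZER_GCS:-}" dest="${XPK_TOKENIZER_LOCAL:-}"
  if [ -n "$gcs" ] && [ -n "$dest" ]; then
    # %q escapes each path so spaces survive re-parsing by a bash shell in the pod.
    printf 'mkdir -p %q && gcloud storage rsync %q %q;' "$dest" "$gcs" "$dest"
  fi
}
```

With either variable unset the function prints nothing, which keeps `tokenizer_prelude` empty as in the original default.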

grain_files_override="grain_train_files=gs://${XPK_DATASET_BUCKET}/${XPK_DATASET_SUBPATH}"
fi

# Optional: stage the YAML from GCS instead of baking via upload_runner.

🟡 Quote the paths to handle potential spaces or special characters in the GCS path or local config path.
Suggested change
# Optional: stage the YAML from GCS instead of baking via upload_runner.
yaml_prelude="gcloud storage cp '${XPK_YAML_GCS}' '${XPK_DISTILL_CONFIG}';"

@gagika gagika force-pushed the gagik-distill-perf branch 2 times, most recently from c6e4983 to eba25b0 Compare June 2, 2026 03:55
@gagika gagika marked this pull request as ready for review June 2, 2026 03:55
Collaborator

@JamesDeng42 JamesDeng42 left a comment


LGTM

@gagika gagika force-pushed the gagik-distill-perf branch from eba25b0 to 03dafe7 Compare June 2, 2026 17:51

3 participants