Add MoE router similarity and expert fraction distillation metrics #4037

Open
JamesDeng42 wants to merge 1 commit into main from yujiedeng/add_moe_metrics
@JamesDeng42
Collaborator

Description

This PR introduces new metrics for evaluating the quality of Mixture-of-Experts (MoE) distillation. When distilling an MoE teacher into an MoE student, it is critical to understand not just the KL divergence of the final logits, but whether the student's routing decisions align with the teacher's, and how evenly the tokens are distributed across experts.


Key Features:

  • Expert Fraction Distribution: Automatically computes and logs distill/student_expert_{i}_fraction and distill/teacher_expert_{i}_fraction whenever a model has num_experts > 1.
  • Router Similarity: Introduces a new configuration flag, record_router_similarity_metrics. When enabled, it computes per-layer Top-1 and Top-K routing overlap between the student and teacher models (e.g., distill/layer_0_router_similarity_top1).
  • Eval Logging: Updated the distillation evaluation loop to conditionally run a stop_gradient forward pass of the teacher model so these metrics are populated during eval steps.
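
For readers who want a concrete picture of the two metric families above, here is a minimal sketch in plain Python. It is not the code in this PR: the function names expert_fractions and routing_overlap are hypothetical, the routing decisions are made-up data, and the Top-K definition used here (size of the intersection of the student's and teacher's selected expert sets, divided by k, averaged over tokens) is an assumption about what "routing overlap" means. The actual change presumably operates on JAX arrays per layer.

```python
from collections import Counter


def expert_fractions(expert_ids, num_experts):
    """Fraction of tokens routed to each expert (hypothetical helper)."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def routing_overlap(student_topk, teacher_topk):
    """Mean per-token overlap between student and teacher expert sets.

    Each argument is a list with one entry per token; each entry is the
    list of k expert ids selected for that token. The overlap definition
    (set intersection size / k) is an assumption, not taken from the PR.
    """
    k = len(teacher_topk[0])
    per_token = [
        len(set(s) & set(t)) / k for s, t in zip(student_topk, teacher_topk)
    ]
    return sum(per_token) / len(per_token)


# Made-up routing decisions for 4 tokens, 4 experts, k = 2.
student = [[0, 1], [2, 3], [1, 0], [3, 2]]
teacher = [[0, 2], [2, 3], [0, 1], [1, 0]]

# Top-1 is the first (highest-scoring) expert chosen for each token.
student_top1 = [ids[0] for ids in student]
teacher_top1 = [ids[0] for ids in teacher]

fractions = expert_fractions(student_top1, num_experts=4)
top1 = routing_overlap([[i] for i in student_top1], [[i] for i in teacher_top1])
topk = routing_overlap(student, teacher)

print(fractions)  # [0.25, 0.25, 0.25, 0.25]
print(top1)       # 0.5
print(topk)       # 0.625
```

In this toy data the student spreads its Top-1 choices evenly across experts, agrees with the teacher's Top-1 expert on half of the tokens, and shares 62.5% of the teacher's Top-2 experts on average. Top-K overlap is order-insensitive here, which is why the third token counts as a full match under Top-K but a miss under Top-1.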

Tests

Added test_router_similarity_metrics to verify that Top-1 and Top-K similarity ratios correctly ignore padded tokens (pad_id=0) and broadcast properly.
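
The padding behaviour that this test checks can be illustrated with a small standalone sketch. The helper masked_top1_similarity below is hypothetical and written in plain Python rather than JAX; only the pad_id=0 convention comes from the description above, and the data is invented for illustration.

```python
def masked_top1_similarity(student_top1, teacher_top1, token_ids, pad_id=0):
    """Share of non-padding tokens where student and teacher pick the same expert."""
    valid = [tok != pad_id for tok in token_ids]
    n_valid = sum(valid)
    if n_valid == 0:
        # Assumption: report 0.0 for an all-padding batch instead of dividing by zero.
        return 0.0
    matches = sum(
        1
        for s, t, v in zip(student_top1, teacher_top1, valid)
        if v and s == t
    )
    return matches / n_valid


# Three real tokens followed by two padding tokens (pad_id = 0).
token_ids = [5, 7, 9, 0, 0]
student_top1 = [1, 2, 3, 0, 0]
teacher_top1 = [1, 2, 0, 0, 0]

similarity = masked_top1_similarity(student_top1, teacher_top1, token_ids)
print(similarity)  # 2/3: two of the three real tokens agree
```

Without the mask, the two padded positions would count as agreements and inflate the ratio to 4/5, which is the kind of error a padding-aware test is meant to catch.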

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 56.14035% with 25 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
.../trainers/post_train/distillation/train_distill.py | 0.00% | 16 Missing ⚠️
...ners/post_train/distillation/distillation_utils.py | 84.21% | 2 Missing and 4 partials ⚠️
src/maxtext/layers/moe.py | 0.00% | 2 Missing and 1 partial ⚠️

JamesDeng42 force-pushed the yujiedeng/add_moe_metrics branch 3 times, most recently from c2c649d to 63b6123 (June 2, 2026 00:17)
JamesDeng42 force-pushed the yujiedeng/add_moe_metrics branch from 63b6123 to da39f7c (June 2, 2026 00:34)