Block-R1 is a benchmark for multi-domain reinforcement learning with block-based diffusion large language models, designed to enhance block-based reasoning generation in dLLMs. This codebase contains block-based reasoning datasets and the dynamic block-size generation method b1.
Block-R1 standardises RL training recipes, Block-R1 dataset construction, and evaluation across reasoning, code, puzzles, and knowledge domains, where different domains may prefer different block sizes for semi-autoregressive decoding in dLLMs.
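To make the role of block size concrete, the sketch below shows how a fixed generation length can be partitioned into blocks that a semi-autoregressive decoder fills from left to right. The function name `block_spans`, the divisibility requirement, and the example lengths are illustrative assumptions and are not taken from the Block-R1 codebase.

```python
def block_spans(gen_length: int, block_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) token spans, one per block, in decoding order."""
    if gen_length <= 0 or block_size <= 0:
        raise ValueError("gen_length and block_size must be positive")
    # Assumption for this sketch: the generation length is a multiple of the block size.
    if gen_length % block_size != 0:
        raise ValueError("gen_length must be divisible by block_size")
    return [(start, start + block_size) for start in range(0, gen_length, block_size)]


if __name__ == "__main__":
    # Smaller blocks mean more sequential steps; larger blocks decode more tokens in parallel.
    for bs in (8, 16, 32):
        spans = block_spans(64, bs)
        print(f"block_size={bs}: {len(spans)} blocks, first={spans[0]}, last={spans[-1]}")
```

A block size equal to the generation length yields a single block, and smaller block sizes move the decoding order closer to left-to-right generation.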
Main components:
- Multi-domain RL: Train and compare the latest RL for dLLM algorithms on multiple domains and metrics under one benchmark protocol.
- Benchmark coverage: Diverse domains covering code, maths, puzzles, general knowledge, and advanced reasoning.
- Block-R1 dataset construction: Build block-based training data by comparing a student and a teacher dLLM across different block sizes.
- Dynamic block size generation: Support b1, a dynamic-size reasoning block method for dLLMs.
- RL methods for dLLMs: Reproduce multiple RL algorithm families under a unified codebase.
- Backbone dLLMs: Support LLaDA, LLaDA 1.5, LLaDA2 mini, Dream, SDAR, and TraDo.
- Cross-vendor GPUs: Support both NVIDIA CUDA and AMD ROCm environments.
- Overview
- Key Features
- Installation and Setup
- Quick Start
- Repository Structure
- Supported dLLM Models
- Supported RL for dLLM Methods
- Benchmark Domains and Data
- Block-R1 Dataset
- Pipeline
- SFT
- Performance
- References and Related Resources
- Multi-domain RL benchmark
  - Train and compare RL algorithms on multiple domains and metrics under one benchmark protocol.
- Block-based dataset construction
  - Build block-based training data by comparing model A and model B across different block sizes.
- Dynamic-size reasoning blocks
  - Support b1, a dynamic block size generation method for diffusion large language models.
- Reproducible RL recipes
  - Reproduce d1, GRPO, WD1, GDPO, MDPO, StableDRL, and ESPO under reproduce/.
- Cross-vendor GPU support
  - Support both NVIDIA CUDA and AMD ROCm environments.
The main experiments in Block-R1 were run on four AMD MI300X GPUs, each with 192 GB of memory. Block-R1 also supports NVIDIA GPUs.
Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
Install dependencies for NVIDIA GPUs:
pip install -r requirements_h100.txt
Install dependencies for AMD GPUs:
pip install -r requirements_rocm.txt
Install only one of the two requirement files above for your machine class. Do not install both in the same environment.
Set data and Hugging Face cache paths:
export BASE_DATA=/path/to/data
export HF_HOME=/path/to/hf_cache
export HF_DATASETS_CACHE=/path/to/hf_cache
All scripts support SLURM systems. We recommend using at least 4 GPUs:
#SBATCH --gres=gpu:4
Or configure GPU IDs directly:
GPU_IDS=(0 1 2 3)
The models and datasets can be downloaded via Hugging Face using the links in the code.
Clone the repository:
git clone https://anonymous.4open.science/r/Block-R1-2026/
cd Block-R1
Install dependencies:
python -m venv .venv
source .venv/bin/activate
# NVIDIA
pip install -r requirements_h100.txt
# or AMD ROCm
pip install -r requirements_rocm.txt
Set your paths:
export BASE_DATA=/path/to/data
export HF_HOME=/path/to/hf_cache
export HF_DATASETS_CACHE=/path/to/hf_cache
Build the Block-R1 dataset:
bash block_r1_dataset.sh
Run multi-domain RL on Block-R1:
bash run_block_r1.sh
Run full RL training sweeps:
bash run_benchmark.sh
Evaluate backbone or RL checkpoints:
bash eval_backbone.sh
Evaluate GURU-style checkpoints:
bash eval_guru.sh
Block-R1/
├── Logo.png
├── block_r1_dataset.sh
├── run_block_r1.sh
├── run_benchmark.sh
├── eval_backbone.sh
├── eval_guru.sh
├── README.md
├── requirements_h100.txt
├── requirements_rocm.txt
├── data/  # Store all data and models from Hugging Face
├── rl/  # Main function entry
│   ├── block_r1.py
│   ├── eval/
│   └── trainers/
│       ├── block_r1_trainer.py  # Block-R1 / R1 wrappers (dynamic block scheduling)
│       ├── diffu_grpo_trainer.py  # d1 / diffusion-GRPO (token-level clipped objective)
│       ├── wd1_grpo_trainer.py  # WD1 (NSR+PSR reweighting)
│       ├── gdpo_trainer.py  # GDPO (sequence-level clipped ratio)
│       ├── mdpo_trainer.py  # MDPO
│       ├── espo_trainer.py  # ESPO (sequence-level clipped ratio, ELBO-based)
│       ├── stable_drl_trainer.py  # StableDRL (SPG/SNIS objective)
│       ├── stable_drl_svpo.py  # StableDRL core math (SPG bound + optional SNIS)
│       ├── eval_callback.py  # periodic eval + wandb logging
│       ├── diffu_grpo_config.py  # config dataclass (CLI/yaml fields)
│       ├── likelihood_estimators.py  # GDPO logp estimators
│       ├── dynamic_generate.py  # b1 dynamic generation helpers
│       ├── cross_domain_generate.py  # cross-domain generation utilities
│       └── train_utils.py
├── reproduce/
│   ├── d1/
│   ├── grpo/
│   ├── wd1/
│   ├── gdpo/
│   ├── mdpo/
│   ├── stable_drl/
│   └── espo/
├── logs/
├── sft/
├── dataset/
├── checkpoints/
└── results/
Block-R1 supports 10 dLLM backbone models. All training and evaluation scripts accept Hugging Face model ids.
| Family | Hugging Face model id |
|---|---|
| GSAI-ML / LLaDA v1 | GSAI-ML/LLaDA-8B-Base |
| GSAI-ML / LLaDA v1 | GSAI-ML/LLaDA-8B-Instruct |
| GSAI-ML / LLaDA 1.5 | GSAI-ML/LLaDA-1.5 |
| InclusionAI / LLaDA 2 Mini | inclusionAI/LLaDA2.0-mini |
| InclusionAI / LLaDA 2 Mini | inclusionAI/LLaDA2.1-mini |
| Dream-org / Dream v0 | Dream-org/Dream-v0-Base-7B |
| Dream-org / Dream v0 | Dream-org/Dream-v0-Instruct-7B |
| JetLM / SDAR | JetLM/SDAR-8B-Chat-b32 |
| Gen-Verse / TraDo | Gen-Verse/TraDo-8B-Instruct |
| Gen-Verse / TraDo | Gen-Verse/TraDo-8B-Thinking |
For example, eval_backbone.sh loops over a configurable MODEL_PATHS array. The default model list includes LLaDA 1.5, SDAR, TraDo, Dream-7B, and LLaDA2 mini.
Block-R1 supports seven recent RL-for-dLLM methods under reproduce/, with one folder per method:
| Directory | Paper title |
|---|---|
| reproduce/d1/ | d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning (Diffusion-GRPO + SFT) |
| reproduce/grpo/ | d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning (Diffusion-GRPO) |
| reproduce/wd1/ | WD1: Weighted Policy Optimization for Reasoning in Diffusion Language Models |
| reproduce/gdpo/ | Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization |
| reproduce/mdpo/ | MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models |
| reproduce/stable_drl/ | Stabilizing Reinforcement Learning for Diffusion Language Models |
| reproduce/espo/ | Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective |
Beyond these seven RL method families, Block-R1 supports dynamic-size reasoning blocks from b1.
Scripts prefixed with b1_, block_b1_, or r1_b1_ under each method folder implement the dynamic block-size recipe. b1 is orthogonal to the seven algorithm folders and can be composed with them as a block scheduling or reward structure.
The corresponding paper is: Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning.
Block-R1 supports 15 dataset settings. GURU follows Cheng et al., Revisiting Reinforcement Learning for LLM Reasoning from a Cross-Domain Perspective.
| Category | Dataset | Train size | Test size |
|---|---|---|---|
| Code generation | MBPP | 374 | 500 |
| Code generation | HumanEval | N/A | 164 |
| Code generation | KodCode | 9,285 | 500 |
| Mathematical reasoning | GSM8K | 7,473 | 1,319 |
| Mathematical reasoning | MATH500 | 7,500 | 500 |
| Mathematical reasoning | Countdown | 240,632 | 256 |
| Logical puzzles | Knights-and-Knaves | 6,200 | 700 |
| Logical puzzles | Sudoku | 1,000,000 | 256 |
| General capabilities | HellaSwag | 39,905 | 10,003 |
| General capabilities | MMLU | N/A | 14,042 |
| General capabilities | ARC-E | 2,251 | 2,376 |
| Advanced reasoning | MMLU-Pro | N/A | 12,032 |
| Advanced reasoning | ARC-C | 1,119 | 1,172 |
| Advanced reasoning | GPQA | N/A | 448 |
| Cross-domain RL for LLMs | GURU | 91.9K | N/A |
Eval keys in code include:
gsm8k, math, countdown, sudoku, mbpp, humaneval, kodcode,
knights_and_knaves, hellaswag, mmlu, arc_e, arc_c, mmlu_pro,
gpqa
Additionally, GURU-aware training is supported via reproduce/*/r1_*_guru.sh and eval_guru.sh.
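Since several scripts take comma-separated dataset keys, a small validator can catch typos before a long job starts. The sketch below checks a spec string against the eval keys listed above; the helper name `parse_dataset_keys` is hypothetical and not part of the repository.

```python
# Eval keys copied from the list above.
EVAL_KEYS = {
    "gsm8k", "math", "countdown", "sudoku", "mbpp", "humaneval", "kodcode",
    "knights_and_knaves", "hellaswag", "mmlu", "arc_e", "arc_c", "mmlu_pro",
    "gpqa",
}


def parse_dataset_keys(spec: str) -> list[str]:
    """Split a comma-separated dataset spec and reject keys that are not eval keys."""
    keys = [part.strip() for part in spec.split(",") if part.strip()]
    unknown = [key for key in keys if key not in EVAL_KEYS]
    if unknown:
        raise ValueError(f"unknown dataset keys: {unknown}")
    return keys


if __name__ == "__main__":
    print(parse_dataset_keys("gsm8k, math, arc_c"))
```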
The Block-R1 dataset is released on Hugging Face:
https://huggingface.co/datasets/dLLM-R1/Block-R1
The main training dataset file is:
train.jsonl
Each sample is constructed from multi-block reward signals and selected at the block size that maximizes the gap between model A and model B (A minus B). The dataset is designed for multi-domain RL training of diffusion large language models. Download it and place it in dataset/multi/block_r1_A_gt_B_multi_train.
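After downloading, a quick sanity check is to load the JSONL file and look at the record count and field names. The sketch below is a generic JSONL reader that assumes nothing about the record schema, which is not documented here. The combined path in the `__main__` block is an assumption, formed by joining the directory and file name given above.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON value per non-empty line and return them as a list."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return records


if __name__ == "__main__":
    # Assumed location: directory and file name from the text above, joined.
    path = Path("dataset/multi/block_r1_A_gt_B_multi_train/train.jsonl")
    if path.exists():
        rows = load_jsonl(path)
        print(f"{len(rows)} records")
        if rows and isinstance(rows[0], dict):
            print("fields:", sorted(rows[0].keys()))
    else:
        print(f"{path} not found; download the dataset first")
```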
Block-R1 follows a complete pipeline:
Block-R1 Dataset Construction -> Multi-Domain RL -> Benchmark Evaluation
block_r1_dataset.sh is a two-stage driver that (1) materializes multi-block reward signals on TRAIN splits, then (2) exports a train.jsonl for Block-R1 training.
Stage 1 runs multi-block evaluation (via python -m rl.block_r1 eval_multi_block ...) and writes reward shards under the script’s OUTPUT_DIR (default: ./dataset/multi under this repo).
Stage 2 exports train.jsonl (via python -m rl.block_r1 build_block_r1 ...) by selecting examples where model A beats model B at the block size that maximizes (A - B).
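The stage-2 selection rule can be sketched for a single example as follows. This is a minimal illustration, not the logic in rl/block_r1.py. It assumes that each model has one scalar reward per block size, that "beats" means a strictly positive gap, and that ties go to the smallest block size. The name `select_best_block` is hypothetical.

```python
def select_best_block(rewards_a, rewards_b):
    """Pick the block size with the largest A - B reward gap.

    rewards_a, rewards_b: dicts mapping block size -> reward for one example.
    Returns (block_size, gap), or None when A never beats B.
    """
    shared = sorted(set(rewards_a) & set(rewards_b))
    if not shared:
        return None
    # max() returns the first maximal element, so ties go to the smallest block size.
    best = max(shared, key=lambda bs: rewards_a[bs] - rewards_b[bs])
    gap = rewards_a[best] - rewards_b[best]
    return (best, gap) if gap > 0 else None


if __name__ == "__main__":
    a = {8: 1.0, 16: 1.0, 32: 0.0}
    b = {8: 0.5, 16: 0.0, 32: 1.0}
    print(select_best_block(a, b))
```

In this illustration, examples for which the function returns None would be dropped, and the remaining examples would be exported together with their selected block size.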
In block_r1_dataset.sh, the key variables you will typically edit are:
MODELS # stage-1: backbone model list to run eval_multi_block on
DATASETS # stage-1/2: comma-separated dataset keys (e.g., gsm8k,math,...)
BLOCK_SIZES # stage-1/2: comma-separated block sizes
OUTPUT_DIR # stage-1/2: output root (edit for your filesystem)
MODEL_A MODEL_B # stage-2: pair for (A-B) selection in build_block_r1
MULTI_TRAIN_SUBDIR # stage-2: where train.jsonl will be written under OUTPUT_DIR
Run:
bash block_r1_dataset.sh
run_block_r1.sh launches representative Block-R1 multi-domain jobs using method entrypoints under reproduce/.
bash run_block_r1.sh
You can also pass explicit script paths to override the default list:
bash run_block_r1.sh reproduce/wd1/block_r1_wd1.sh
run_benchmark.sh sequentially runs a large set of RL for dLLM method training scripts under reproduce/.
It covers d1, GRPO, WD1, GDPO, MDPO, StableDRL, ESPO, and b1.
bash run_benchmark.sh
You can override the default list by passing script paths:
bash run_benchmark.sh reproduce/d1/r1_d1.sh reproduce/wd1/b1_wd1_math.sh
eval_backbone.sh evaluates either (a) the raw backbone (CKPT_STEP=0) or (b) a specific RL checkpoint (CKPT_STEP>0) by launching a multi-GPU torch.distributed.run job that runs rl/eval/eval.py.
Set CKPT_STEP=0 to report base or instruct backbone metrics.
Set a nonzero checkpoint step and matching METHOD and TRAIN_DATASET to evaluate RL checkpoints under checkpoints/.
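The two modes can be summarized as a small decision rule. The sketch below mirrors the description above in Python for illustration only. The script itself is a shell script, and the function name `resolve_eval_mode` and the returned labels are hypothetical.

```python
def resolve_eval_mode(ckpt_step: int, method: str = "", train_dataset: str = "") -> str:
    """Return "backbone" for CKPT_STEP=0, or "rl_checkpoint" for a positive step.

    A positive step also needs METHOD and TRAIN_DATASET so that the matching
    checkpoint can be identified.
    """
    if ckpt_step < 0:
        raise ValueError("CKPT_STEP must be non-negative")
    if ckpt_step == 0:
        return "backbone"
    if not method or not train_dataset:
        raise ValueError("CKPT_STEP > 0 requires METHOD and TRAIN_DATASET")
    return "rl_checkpoint"


if __name__ == "__main__":
    print(resolve_eval_mode(0))
    print(resolve_eval_mode(500, method="wd1", train_dataset="gsm8k"))
```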
bash eval_backbone.sh
Configure the following variables in the script header:
MODEL_PATHS
EVAL_DATASETS
GEN_LENGTHS
GPU_IDS
CKPT_STEP
METHOD
TRAIN_DATASET
For models trained with GURU-style run names, such as r1_wd1_guru, use:
bash eval_guru.sh
Configure:
GURU_RUN_NAME
CKPT_STEPS
MODEL_PATH
Scripts prefixed with b1_* apply the b1 dynamic-size block mechanism on top of an existing RL recipe. They live under reproduce/<base_method>/b1_<base_method>_<dataset>.sh, where:
- <base_method> selects the underlying RL algorithm (e.g. wd1, stable_drl); the corresponding --trainer_type (e.g. b1_wll for wd1, b1_stable_drl for stable_drl) is set inside each script.
- <dataset> is one of countdown, gsm8k, math, sudoku, kodcode, mbpp, humaneval, knights_and_knaves.
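The naming convention above can be expressed as a short helper that builds the script path for a given method and dataset. This is a sketch for illustration. The function name `b1_script_path` is hypothetical, and the helper does not check that the script exists for every combination.

```python
from pathlib import PurePosixPath

# Dataset names accepted by the b1_* scripts, as listed above.
B1_DATASETS = (
    "countdown", "gsm8k", "math", "sudoku",
    "kodcode", "mbpp", "humaneval", "knights_and_knaves",
)


def b1_script_path(base_method: str, dataset: str) -> PurePosixPath:
    """Build reproduce/<base_method>/b1_<base_method>_<dataset>.sh."""
    if not base_method:
        raise ValueError("base_method must be non-empty")
    if dataset not in B1_DATASETS:
        raise ValueError(f"unsupported dataset: {dataset!r}")
    return PurePosixPath("reproduce") / base_method / f"b1_{base_method}_{dataset}.sh"


if __name__ == "__main__":
    print(b1_script_path("wd1", "countdown"))
```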
Run a single recipe directly:
bash reproduce/wd1/b1_wd1_countdown.sh
bash reproduce/wd1/b1_wd1_math.sh
Or dispatch a subset through run_benchmark.sh:
bash run_benchmark.sh reproduce/wd1/b1_wd1_countdown.sh reproduce/wd1/b1_wd1_gsm8k.sh
Inside a b1_* script, the variables you typically edit are:
MODEL_NAME # backbone (e.g. GSAI-ML/LLaDA-8B-Instruct)
DATASET # one of countdown, gsm8k, math, sudoku, kodcode, mbpp, humaneval, knights_and_knaves
NUM_ITER # policy-gradient inner-update iterations
RUN_NAME # auto-built as b1_<base_method>_<dataset>
Supervised fine-tuning entry points are under sft/.
For example:
bash sft/run_sft.sh
Use SFT when the corresponding recipe requires supervised fine-tuning before RL, such as d1.
Block-R1 focuses on one-shot evaluation under both backbone and RL checkpoint settings.
The benchmark supports reporting:
- Base or instruct backbone performance.
- Single-domain RL performance.
- Multi-domain RL performance.
- Block-R1 training performance.
- b1 dynamic block-size performance.
Please refer to the paper for full experimental results.
This benchmark builds on open-sourced RL algorithms, models, and datasets. We sincerely thank all the authors listed below for their awesome work, which makes this benchmark possible. The included methods are:
- Diffusion-GRPO / d1: S. Zhao et al., d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning, NeurIPS 2025.
- WD1: X. Tang et al., WD1: Weighted Policy Optimization for Reasoning in Diffusion Language Models, ICLR 2026.
- GDPO: K. Rojas et al., Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization, ICLR 2026.
- MDPO: H. He et al., MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models, arXiv:2508.13148, 2025.
- SPG: C. Wang et al., SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models, ICLR 2026.
- ESPO: J. Ou et al., Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective, ICLR 2026.
- StableDRL: J. Zhong et al., Stabilizing Reinforcement Learning for Diffusion Language Models, arXiv:2603.06743, 2026.
- GURU: Z. Cheng et al., Revisiting Reinforcement Learning for LLM Reasoning from a Cross-Domain Perspective, NeurIPS 2025.
- b1: Y. Jiang et al., Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning, ICML 2026.
If you use this benchmark, please cite b1 and Block-R1.
@article{jiang2026breakblock,
title={{Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning}},
author={Jiang, Yan and Qiu, Ruihong and Huang, Zi},
journal={arXiv preprint arXiv:2605.02263},
year={2026}
}
@article{jiang2026blockr1,
title={{Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models}},
author={Jiang, Yan and Qiu, Ruihong and Huang, Zi},
journal={arXiv preprint arXiv:2605.11726},
year={2026}
}
