fix: miscalculation of num_steps when using num_epoch and lmdb#5488
fix: miscalculation of num_steps when using num_epoch and lmdb#5488OutisLi wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adjust LMDB batch counting to reflect auto-probability expansion and align distributed/training step calculations, with added regression tests.
Changes:
- Update
total_batch/indexsemantics to track sampler-expanded batch counts. - Use DataLoader length for LR step calculations when training on
LmdbDataset. - Add tests covering
total_batchalignment and distributed sampler length with auto-prob expansion.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| source/tests/pt/test_lmdb_dataloader.py | Adds tests validating expanded batch counts and distributed sampler __len__() behavior. |
| deepmd/pt/utils/lmdb_dataset.py | Changes index/total_batch to be derived from the batch sampler length (including expansion). |
| deepmd/pt/train/training.py | Uses len(training_dataloader) to compute batch counts for LR scheduling with LMDB datasets. |
| deepmd/dpmodel/utils/lmdb_data.py | Updates total_batch calculation and adjusts distributed sampler __len__() to include expansion via SameNlocBatchSampler. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughBatch-count computation moved to sampler/dataloader-derived totals: ChangesBatch Count Estimation Alignment
Sequence Diagram(s)sequenceDiagram
participant LmdbDataReader
participant SameNlocBatchSampler
participant DistributedSameNlocBatchSampler
participant Trainer
LmdbDataReader->>SameNlocBatchSampler: instantiate (shuffle=False, block_targets) to measure expanded batches
SameNlocBatchSampler-->>LmdbDataReader: return expanded batch count
LmdbDataReader->>DistributedSameNlocBatchSampler: provide total_batch or allow sampler-based estimation
DistributedSameNlocBatchSampler->>DistributedSameNlocBatchSampler: cache _total_batches and compute ceil(_total_batches/world_size)
DistributedSameNlocBatchSampler-->>Trainer: len() returns per-rank batch count
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 1311-1320: The current __len__() returns ceil(total /
self._world_size) for every rank which mismatches the strided partitioning in
__iter__(); change __len__() to compute a rank-aware batch count by getting
total = len(SameNlocBatchSampler(self._reader, shuffle=False,
block_targets=self._block_targets)), then compute base = total //
self._world_size and remainder = total % self._world_size and return base + (1
if self._rank < remainder else 0) so that __len__() matches the actual number of
batches produced by the __iter__() strided partitioning (alternatively, adjust
_partition_batches() to pad/repeat batches so every rank emits
ceil(total/world_size) and keep current __len__()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8ecc555b-6099-4600-a3c5-6d331cce1d8d
📒 Files selected for processing (4)
deepmd/dpmodel/utils/lmdb_data.pydeepmd/pt/train/training.pydeepmd/pt/utils/lmdb_dataset.pysource/tests/pt/test_lmdb_dataloader.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5488 +/- ##
=======================================
Coverage 81.36% 81.36%
=======================================
Files 868 868
Lines 96567 96571 +4
Branches 4233 4235 +2
=======================================
+ Hits 78570 78575 +5
+ Misses 16697 16696 -1
Partials 1300 1300 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| expected_len = math.ceil(len(ds._batch_sampler) / 2) | ||
| assert len(dist_sampler) == expected_len | ||
| assert len(dist_sampler) == expected_len | ||
| assert calls == 1 |
| bs = self._reader.get_batch_size_for_nloc(nloc) | ||
| total += (len(indices) + bs - 1) // bs | ||
| return math.ceil(total / self._world_size) | ||
| return math.ceil(self._total_batches / self._world_size) |
There was a problem hiding this comment.
__len__ should match this sampler's rank-specific iterator. _partition_batches yields all_batches[self._rank :: self._world_size], whose length is ceil((total - rank) / world_size) (or 0 if rank >= total), not always ceil(total / world_size). With total_batches=3, world_size=2, rank 1 yields only one batch but reports two. This can make len(DataLoader) overestimate num_steps on nonzero ranks.
Could we compute the length from the same strided partition formula, e.g. (self._total_batches + self._world_size - 1 - self._rank) // self._world_size?
— OpenClaw 2026.5.28 (model: custom-chat-jinzhezeng-group/gpt-5.5)
Summary by CodeRabbit
Bug Fixes
Tests