GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008
Open
HuaHuaY wants to merge 5 commits into
HuaHuaY
commented
May 21, 2026
std::map</*column_id=*/int32_t, std::shared_ptr<BloomFilter>>;
struct RowGroupBloomFilters {
  RowGroupBloomFilters() = default;
  RowGroupBloomFilters(RowGroupBloomFilters&&) noexcept = default;
Contributor
Author
I need these to prevent MSVC from attempting to instantiate the copy constructor. See microsoft/STL#5552 and microsoft/STL#5084.
Contributor
Author
@wgtmac @alamb @etseidl @emkornfield Please take a look.
Contributor
Unfortunately, I am not likely to have time to review C++ code in the arrow repository.
wgtmac
reviewed
May 29, 2026
      std::to_string(bloom_filter_options.fpp));
}
if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) {
  throw ParquetException("Bloom filter number of distinct values must be >= 0, got " +
Member
What is the expected behavior of 0?
Contributor
Author
It will create the smallest possible bloom filter.
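To illustrate why an ndv of 0 ends up at the minimum size, here is a minimal sketch of a sizing function. It is not the PR's code: `SketchOptimalNumOfBits` and the two `kSketch...` bounds are hypothetical names and values chosen for illustration, and the formula is the usual approximation for a filter that sets 8 bits per inserted value.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical bounds for this sketch; the real limits live in the library.
constexpr uint64_t kSketchMinBytes = 32;                    // one 256-bit block
constexpr uint64_t kSketchMaxBytes = 128ULL * 1024 * 1024;  // assumed upper cap

// Estimate a filter size in bits for `ndv` distinct values at false-positive
// probability `fpp`, rounded up to a power of two and clamped to the bounds.
inline uint64_t SketchOptimalNumOfBits(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  const uint64_t min_bits = kSketchMinBytes * 8;
  const uint64_t max_bits = kSketchMaxBytes * 8;
  // Nothing to store: fall back to the smallest allowed filter.
  if (ndv == 0) return min_bits;
  // With 8 bits set per value: m = -8 * n / ln(1 - fpp^(1/8)).
  const double m = -8.0 * static_cast<double>(ndv) /
                   std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
  if (!(m < static_cast<double>(max_bits))) return max_bits;
  // Round up to the next power of two, starting from the minimum size.
  uint64_t bits = min_bits;
  while (static_cast<double>(bits) < m) bits <<= 1;
  return bits;
}
```

Under these assumptions, `SketchOptimalNumOfBits(0, 0.05)` returns 256 bits, that is, a single block.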
Member
wgtmac
approved these changes
May 29, 2026
Member
wgtmac
left a comment
Generally LGTM. I left some nits.
mapleFU
reviewed
May 29, 2026
struct BloomFilterEntry {
  std::unique_ptr<BlockSplitBloomFilter> filter;
  double target_fpp;
  bool should_fold;
}

const double avg_fill =
    static_cast<double>(total_set_bits) / (static_cast<double>(num_blocks) * 256.0);
Member
Use a constexpr constant instead of the magic number 256.0?
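One way to act on this suggestion is sketched below. The names `kSketchWordsPerBlock`, `kSketchBitsPerWord`, `kSketchBitsPerBlock`, and `SketchAverageFill` are hypothetical and not from the PR; the sketch assumes a filter stored as 32-bit words grouped into 256-bit blocks.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Named constants replace the literal 256.0 (assumed layout: 8 x 32-bit words).
constexpr std::size_t kSketchWordsPerBlock = 8;
constexpr std::size_t kSketchBitsPerWord = 32;
constexpr double kSketchBitsPerBlock =
    static_cast<double>(kSketchWordsPerBlock * kSketchBitsPerWord);

// Fraction of bits set across all blocks, in the range [0, 1].
inline double SketchAverageFill(const std::vector<uint32_t>& words) {
  assert(!words.empty() && words.size() % kSketchWordsPerBlock == 0);
  const std::size_t num_blocks = words.size() / kSketchWordsPerBlock;
  uint64_t total_set_bits = 0;
  for (uint32_t w : words) {
    total_set_bits += std::bitset<32>(w).count();  // population count of one word
  }
  return static_cast<double>(total_set_bits) /
         (static_cast<double>(num_blocks) * kSketchBitsPerBlock);
}
```

Deriving the block width from the word count and word size keeps the divisor in step with the block layout if either constant changes.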
Rationale for this change
This PR follows apache/arrow-rs#9628. It reduces the disk usage of the Bloom filter, so specifying an ndv value larger than the actual number of distinct values will not increase disk usage.
What changes are included in this PR?
BloomFilterBuilder will try to fold the bloom filter before writing it to the output stream.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
The type of ndv in BloomFilterOptions is changed from int32_t to std::optional<int64_t>, and the ndv argument of OptimalNumOfBytes and OptimalNumOfBits in BlockSplitBloomFilter is changed from uint32_t to uint64_t.
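For readers unfamiliar with folding, the following is a minimal sketch of one folding step, not the PR's implementation. It assumes the split-block layout from the Parquet specification (256-bit blocks of eight 32-bit words, block index computed as `((hash >> 32) * num_blocks) >> 32`); the salt values are intended to match that specification and should be checked against it. All `Sketch...` names are hypothetical. Under this indexing, halving the block count maps old blocks 2i and 2i+1 to new block i, so OR-ing adjacent pairs keeps every inserted value findable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using SketchBlock = std::array<uint32_t, 8>;  // one 256-bit block

// Salt values intended to match the Parquet SBBF specification (verify there).
constexpr uint32_t kSketchSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                     0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                     0x9efc4947U, 0x5c6bfb31U};

// Block selection by multiply-shift on the upper 32 bits of the hash.
inline std::size_t SketchBlockIndex(uint64_t hash, std::size_t num_blocks) {
  return static_cast<std::size_t>(
      ((hash >> 32) * static_cast<uint64_t>(num_blocks)) >> 32);
}

// One bit per word, chosen from the lower 32 bits of the hash.
inline SketchBlock SketchMask(uint64_t hash) {
  SketchBlock mask{};
  const uint32_t key = static_cast<uint32_t>(hash);
  for (int i = 0; i < 8; ++i) {
    mask[i] = UINT32_C(1) << ((key * kSketchSalt[i]) >> 27);
  }
  return mask;
}

inline void SketchInsert(std::vector<SketchBlock>& blocks, uint64_t hash) {
  SketchBlock& block = blocks[SketchBlockIndex(hash, blocks.size())];
  const SketchBlock mask = SketchMask(hash);
  for (int i = 0; i < 8; ++i) block[i] |= mask[i];
}

inline bool SketchMightContain(const std::vector<SketchBlock>& blocks,
                               uint64_t hash) {
  const SketchBlock& block = blocks[SketchBlockIndex(hash, blocks.size())];
  const SketchBlock mask = SketchMask(hash);
  for (int i = 0; i < 8; ++i) {
    if ((block[i] & mask[i]) != mask[i]) return false;
  }
  return true;
}

// Halve the filter: new block i is the bitwise OR of old blocks 2i and 2i+1.
inline std::vector<SketchBlock> SketchFoldOnce(
    const std::vector<SketchBlock>& blocks) {
  assert(blocks.size() >= 2 && blocks.size() % 2 == 0);
  std::vector<SketchBlock> folded(blocks.size() / 2);  // zero-initialized
  for (std::size_t i = 0; i < folded.size(); ++i) {
    for (int j = 0; j < 8; ++j) {
      folded[i][j] = blocks[2 * i][j] | blocks[2 * i + 1][j];
    }
  }
  return folded;
}
```

Each fold halves the bytes written but raises the fill ratio and therefore the false-positive probability, so a real builder would stop folding once the estimated rate would exceed the configured target. That stopping rule is not shown here.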