[SPARK-57185][SQL] Use thread-local ICU collators to fix lock contention in CollationFactory #56236
Open
dejankrak-db wants to merge 1 commit into
…ion in CollationFactory

### What changes were proposed in this pull request?

Use thread-local `Collator` instances in `CollationSpecICU.buildCollation()` to eliminate lock contention on ICU's `RuleBasedCollator`. A frozen `RuleBasedCollator` serializes all threads through a `ReentrantLock` on its internal collation buffer (used by `getCollationKey`/`compare`), which causes a significant parallelism loss when many threads compare/hash collated strings concurrently.

By creating independent per-thread instances via `Collator.getInstance()`, each thread operates on its own buffer without locking. Each instance is still frozen as a mutation guard. The `Collation.getCollator()` accessor now returns the current thread's instance (or `null` for non-ICU collations).

### Why are the changes needed?

To remove a concurrency bottleneck when comparing or hashing collated columns under parallel access.

### Does this PR introduce _any_ user-facing change?

No. This is purely a concurrency optimization; collation results are identical.

### How was this patch tested?

Added a concurrent test in `CollationFactorySuite` that verifies `comparator`, `sortKeyFunction`, and `getCollator()` produce consistent results under parallel access across `UNICODE`, `en`, `de`, `en_CI`, and `en_AI` collations. Existing `CollationFactorySuite` tests continue to pass.
02311d7 to a961053
What changes were proposed in this pull request?
Use thread-local `Collator` instances in `CollationSpecICU.buildCollation()` to eliminate lock contention on ICU's `RuleBasedCollator`. A frozen `RuleBasedCollator` serializes all threads through a `ReentrantLock` on its internal collation buffer (used by `getCollationKey`/`compare`), which causes a significant parallelism loss when many threads compare/hash collated strings concurrently.
By creating independent per-thread instances via `Collator.getInstance()`, each thread operates on its own buffer without locking. Each instance is still frozen as a mutation guard. The `Collation.getCollator()` accessor now returns the current thread's instance (or `null` for non-ICU collations).
Why are the changes needed?
To remove a concurrency bottleneck when comparing or hashing collated columns under parallel access.
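For reviewers who want the per-thread pattern in isolation, here is a minimal sketch. It is not the code from this patch: it uses the JDK's `java.text.Collator` as a stand-in for ICU4J's `Collator` so that it runs without extra dependencies, and the class and member names (`ThreadLocalCollatorSketch`, `currentCollator`, `COMPARATOR`, `collatorFromOtherThread`) are hypothetical. The idea it illustrates is only that each thread lazily obtains its own collator instance and that the shared comparator delegates to the calling thread's instance.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

public class ThreadLocalCollatorSketch {
    // Assumption: java.text.Collator stands in for ICU4J's Collator here.
    // Each thread lazily creates its own instance on first use.
    private static final ThreadLocal<Collator> EN_COLLATOR =
        ThreadLocal.withInitial(() -> Collator.getInstance(Locale.ENGLISH));

    // Hypothetical accessor: returns the calling thread's collator.
    static Collator currentCollator() {
        return EN_COLLATOR.get();
    }

    // The comparator can be shared freely because it resolves the
    // collator on whichever thread invokes it.
    static final Comparator<String> COMPARATOR =
        (a, b) -> currentCollator().compare(a, b);

    // Helper for demonstration: fetch the collator seen by another thread.
    static Collator collatorFromOtherThread() {
        final Collator[] holder = new Collator[1];
        Thread t = new Thread(() -> holder[0] = currentCollator());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder[0];
    }

    public static void main(String[] args) {
        Collator mine = currentCollator();
        Collator theirs = collatorFromOtherThread();
        System.out.println("same thread reuses instance: " + (mine == currentCollator()));
        System.out.println("other thread has own instance: " + (mine != theirs));
        System.out.println("compare(apple, banana) sign: "
            + Integer.signum(COMPARATOR.compare("apple", "banana")));
    }
}
```

The trade-off in this kind of design is memory for parallelism: one collator per thread per collation instead of one shared instance guarded by a lock.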
Does this PR introduce any user-facing change?
No. This is purely a concurrency optimization; collation results are identical.
How was this patch tested?
Added a concurrent test in `CollationFactorySuite` that verifies `comparator`, `hashFunction`, and `getCollator()` produce consistent results under parallel access across `UNICODE`, `en`, `de`, `en_CI`, and `en_AI` collations. Existing `CollationFactorySuite` tests continue to pass.
Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored using Claude Code.
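As a rough illustration of the kind of check described under "How was this patch tested?", the sketch below computes comparison results and sort keys once on the calling thread, then recomputes them from a thread pool and verifies they match. It is not the test added to `CollationFactorySuite`: it again assumes `java.text.Collator` as a stand-in for ICU4J, and the class name, word list, and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentCollationCheck {
    // Assumption: JDK collator as a stand-in; one instance per thread.
    private static final ThreadLocal<Collator> COLLATOR =
        ThreadLocal.withInitial(() -> Collator.getInstance(Locale.GERMAN));

    // Hypothetical sample data; \u00c4 is a capital A with umlaut.
    private static final String[] WORDS =
        {"Apfel", "apfel", "\u00c4pfel", "Zebra", "zebra"};

    // Records every sort key and the sign of every pairwise comparison,
    // using the collator of whichever thread calls this method.
    static List<Object> snapshot() {
        Collator c = COLLATOR.get();
        List<Object> out = new ArrayList<>();
        for (String a : WORDS) {
            out.add(Arrays.toString(c.getCollationKey(a).toByteArray()));
            for (String b : WORDS) {
                out.add(Integer.signum(c.compare(a, b)));
            }
        }
        return out;
    }

    // Returns true if every pooled task reproduces the baseline snapshot.
    static boolean consistentUnderParallelAccess(int threads, int tasks) {
        List<Object> expected = snapshot();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<List<Object>> task = ConcurrentCollationCheck::snapshot;
            List<Future<List<Object>>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(task));
            }
            for (Future<List<Object>> f : futures) {
                if (!expected.equals(f.get())) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("consistent: " + consistentUnderParallelAccess(8, 200));
    }
}
```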