Skip to content

[SPARK-57185][SQL] Use thread-local ICU collators to fix lock contention in CollationFactory#56236

Open
dejankrak-db wants to merge 1 commit into
apache:masterfrom
dejankrak-db:thread-local-collator-fix-oss
Open

[SPARK-57185][SQL] Use thread-local ICU collators to fix lock contention in CollationFactory#56236
dejankrak-db wants to merge 1 commit into
apache:masterfrom
dejankrak-db:thread-local-collator-fix-oss

Conversation

@dejankrak-db
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Use thread-local Collator instances in CollationSpecICU.buildCollation() to eliminate lock contention on ICU's RuleBasedCollator. A frozen RuleBasedCollator serializes all threads through a ReentrantLock on its internal collation buffer (used by getCollationKey/compare), which causes a significant parallelism loss when many threads compare/hash collated strings concurrently.

By creating independent per-thread instances via Collator.getInstance(), each thread operates on its own buffer without locking. Each instance is still frozen as a mutation guard. The Collation.getCollator() accessor now returns the current thread's instance (or null for non-ICU collations).

Why are the changes needed?

To remove a concurrency bottleneck when comparing or hashing collated columns under parallel access.

Does this PR introduce any user-facing change?

No. This is purely a concurrency optimization; collation results are identical.

How was this patch tested?

Added a concurrent test in CollationFactorySuite that verifies comparator, hashFunction, and getCollator() produce consistent results under parallel access across UNICODE, en, de, en_CI, and en_AI collations. Existing CollationFactorySuite tests continue to pass.

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored using Claude code.

…ion in CollationFactory

### What changes were proposed in this pull request?

Use thread-local `Collator` instances in `CollationSpecICU.buildCollation()` to
eliminate lock contention on ICU's `RuleBasedCollator`. A frozen
`RuleBasedCollator` serializes all threads through a `ReentrantLock` on its
internal collation buffer (used by `getCollationKey`/`compare`), which causes a
significant parallelism loss when many threads compare/hash collated strings
concurrently.

By creating independent per-thread instances via `Collator.getInstance()`, each
thread operates on its own buffer without locking. Each instance is still frozen
as a mutation guard. The `Collation.getCollator()` accessor now returns the
current thread's instance (or `null` for non-ICU collations).

### Why are the changes needed?

To remove a concurrency bottleneck when comparing or hashing collated columns
under parallel access.

### Does this PR introduce _any_ user-facing change?

No. This is purely a concurrency optimization; collation results are identical.

### How was this patch tested?

Added a concurrent test in `CollationFactorySuite` that verifies `comparator`,
`sortKeyFunction`, and `getCollator()` produce consistent results under parallel
access across `UNICODE`, `en`, `de`, `en_CI`, and `en_AI` collations. Existing
`CollationFactorySuite` tests continue to pass.
@dejankrak-db dejankrak-db force-pushed the thread-local-collator-fix-oss branch from 02311d7 to a961053 Compare May 31, 2026 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant