[SPARK-57183][SS] Close LRUCache on RocksDB.close() in unbounded memory mode#56234
[SPARK-57183][SS] Close LRUCache on RocksDB.close() in unbounded memory mode#56234kete1987 wants to merge 1 commit into
Conversation
…ry mode In unbounded memory mode (the default, boundedMemoryUsage=false), RocksDBMemoryManager creates a new LRUCache per RocksDB instance but RocksDB.close() never calls lruCache.close(). The Java LRUCache wrapper holds a C++ shared_ptr<Cache>, so the native object is only freed when the JVM GC finalizes the wrapper -- which rarely happens under low heap pressure. Closing explicitly ensures native memory is reclaimed deterministically when the instance is released. In bounded mode the cache is a shared singleton managed by RocksDBMemoryManager and must not be closed per instance. Add a test that verifies the native handle is released after close() in unbounded mode via LRUCache.isOwningHandle(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d4b935b to
02fd273
Compare
|
I'll ask other folks to review the change (I'm a bit away from recent improvement of RocksDB state store provider), but I'm going to give a general suggestion. Please do not remove the section of PR template and consider filling the section as one of the requirement/duty.
This doesn't replace the requirement the PR template asks about the usage of LLM model and the clarification of the model. The template guides how to describe the model - please check it out again. https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE |
|
@HeartSaVioR Thanks for the feedback! I've updated the PR description to include the AI tooling section. |
| // ensures native memory is reclaimed deterministically when the instance is released. | ||
| // In bounded mode the cache is a shared singleton managed by RocksDBMemoryManager | ||
| // and must not be closed here. | ||
| if (!conf.boundedMemoryUsage && lruCache != null) { |
There was a problem hiding this comment.
So this is just releasing memory more proactively IIUC ?
| val sqlConf = new SQLConf | ||
| sqlConf.setConfString( | ||
| RocksDBConf.ROCKSDB_SQL_CONF_NAME_PREFIX + "." + | ||
| RocksDBConf.BOUNDED_MEMORY_USAGE_CONF_KEY, "false") |
There was a problem hiding this comment.
Can we add a test to verify that the bounded case is not affected by this change ?
What changes were proposed in this pull request?
In unbounded memory mode (the default,
boundedMemoryUsage = false),RocksDBMemoryManagercreates a newLRUCacheper instance:spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBMemoryManager.scala
Line 185 in d7df192
but
RocksDB.close()never callslruCache.close():spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
Lines 2125 to 2135 in d7df192
The Java
LRUCachewrapper holds a C++shared_ptr<Cache>, so the native object is only freed when the JVM GC finalizes the wrapper — which rarely happens under low heap pressure. This causes native memory to accumulate until GC eventually runs, leading to OOM kills in long-running processes or CI runs with many RocksDB-heavy test suites.The fix adds an explicit
lruCache.close()call inRocksDB.close()for unbounded mode. In bounded mode the cache is a shared singleton managed byRocksDBMemoryManagerand must not be closed per instance.This is a separate issue from SPARK-56523 (Statistics native memory leak), which was already fixed.
Why are the changes needed?
Without explicit
close(), eachRocksDBinstance in unbounded mode leaks oneLRUCacheworth of native memory (blockCacheSizeMB, default 8 MB) for as long as GC does not run. The memory is never reclaimed deterministically.A standalone reproducer tool confirms ~8.5 MB of native memory growth per open/close cycle in leak mode vs flat memory in fixed mode:
https://github.com/kete1987/rocksdb-leak-tool
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a test in
RocksDBSuite(SPARK-57183: LRUCache is closed on RocksDB.close() in unbounded memory mode) that verifies the native handle is released afterclose()viaLRUCache.isOwningHandle().Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude (Anthropic), used for analysis, code generation and review assistance.
Generated-by: Claude Sonnet 4.6
I affirm that the contribution is my original work and that I license the work to the project under the project's open source license.