fix: cap rclone backup/restore memory to prevent OOM kills #477
Open
mrrobot47 wants to merge 3 commits into
rclone allocates one --buffer-size read-ahead buffer per --transfers and, for S3 backends, an additional --s3-chunk-size x --s3-upload-concurrency multipart buffer per transfer. The previous formula set buffer_size = available_ram / transfers, so the read-ahead buffers alone consumed ~100% of available memory (e.g. --buffer-size 1328M --transfers 2 on a 4 GB host), with the S3 multipart buffers unaccounted on top. This saturated RAM and triggered the OOM killer during backups. Introduce compute_rclone_resources() which budgets rclone's total in-memory footprint -- transfers * (buffer_size + s3_chunk * s3_concurrency) -- against a fraction of currently-available RAM (default 50%, clamped 10-90%). It scales parallelism and buffer size up on larger hosts and shrinks S3 concurrency then transfers on constrained hosts, with buffer-size clamped to [16M, 256M]. Add optional global config knobs rclone-mem-fraction and rclone-max-buffer-size, plus an EE::debug line logging the chosen values and estimated peak memory.
…memory rclone_download() ran 'rclone copy -P --multi-thread-streams min(cpu*2,32)' with no memory awareness and relied on rclone's default --transfers 4. rclone fans out up to transfers * multi-thread-streams concurrent download streams for large files, each holding a --multi-thread-write-buffer-size buffer plus one in-flight --multi-thread-chunk-size range, so a many-core/low-RAM host (e.g. the same 4 GB box) could spawn 4 * 32 = 128 streams and be OOM-killed during restore or rollback. An empty nproc also yielded the degenerate --multi-thread-streams 0. Reuse compute_rclone_resources() (is_s3 = false, since S3 multipart-upload concurrency does not apply to downloads) to derive memory-safe --transfers, --checkers and --buffer-size, then cap --multi-thread-streams against the same available-RAM budget, flooring at 1 (which also fixes the nproc->0 case). Per-stream memory is modeled from rclone's real levers --multi-thread-write-buffer-size and --multi-thread-chunk-size, exposed as the rclone-mt-write-buffer-size and rclone-mt-chunk-size config knobs, and an EE::debug line logs the chosen values and estimated peak.
…ection
Follow-up to the rclone OOM fixes, addressing issues found in review:
Download (#1): the --multi-thread-streams budget was checked independently of the read-ahead buffers, so the two pools were each sized against the full mem-fraction budget and the combined download footprint could reach ~2x the intended fraction (e.g. ~3.5 GB on an 8-core/4 GB host) and still OOM during restore/rollback. rclone_download() now splits one shared budget between the read-ahead buffer (up to half per transfer, capped at max_buffer) and the multi-thread streams, so transfers * (buffer + streams * per_stream_mem) <= budget by construction, while still scaling streams up on memory-rich hosts.
Available-RAM detection (EasyEngine#3): the 'free -m | grep Mem | awk {print $7}' probe assumed an English locale and a fixed available-column index. It is now centralized in get_available_ram_mb(), which pins LC_ALL=C and locates the available column by its header name, falling back to the free column on older free/procps builds that lack it.
multi-thread-write-buffer-size (EasyEngine#4): the rclone-mt-write-buffer-size config value fed the stream-memory model but was never passed to rclone, so the knob had no effect. rclone_download() now emits --multi-thread-write-buffer-size so the model matches the process.
Download checkers (EasyEngine#5): rclone_download() no longer forces --checkers = transfers (it set none before, so rclone used its default of 8); checkers allocate no transfer buffers, so binding them to the memory-derived transfer count needlessly throttled the compare phase.
Dedupe (EasyEngine#9): compute_rclone_resources() now returns the computed budget and max_buffer, so rclone_download() consumes them instead of recomputing the mem-fraction/budget formula, removing the risk of the two copies drifting.
Summary
EasyEngine's rclone-based backup and restore sized --buffer-size and parallelism from available RAM in a way that ignored how rclone actually allocates memory, so a single backup could consume ~100% of available RAM and get OOM-killed. This was observed in production on a 4 GB Oracle Cloud host running 6 WordPress sites (rclone killed mid-backup after consuming ~1.7 GB RSS). This PR replaces the ad-hoc sizing with a shared memory-budget helper that caps rclone's total in-memory footprint at a configurable fraction of currently-available RAM, scaling parallelism and buffers up on large hosts and down on constrained ones.
Root cause
rclone does not allocate one buffer for the whole transfer. It allocates buffers per running transfer, so memory is multiplied by --transfers:
Upload. rclone allocates one --buffer-size read-ahead buffer per --transfers, and for S3 backends an additional --s3-chunk-size × --s3-upload-concurrency multipart buffer per transfer. The old code set buffer_size = available_ram / transfers, so the read-ahead buffers alone summed back to ~100% of available RAM: transfers × (available_ram / transfers) = available_ram. On the Oracle box this produced --buffer-size 1328M --transfers 2 → 2 × 1328M = 2656 MB of read-ahead buffers, i.e. essentially all available memory, before the S3 multipart buffers were even counted. With MariaDB, Redis, PHP-FPM and nginx already resident, this saturated RAM and the kernel OOM-killed rclone. (Logged buffer sizes across the 6 sites ranged 1023M–1328M.)
Download / rollback. rclone_download() ran rclone copy -P --multi-thread-streams min(cpu*2, 32) with no memory awareness and relied on rclone's default --transfers 4. For large files rclone fans out up to transfers × multi-thread-streams concurrent download streams, each holding a --multi-thread-write-buffer-size buffer plus one in-flight --multi-thread-chunk-size range. A many-core host could therefore spawn 4 × 32 = 128 concurrent streams and be OOM-killed during restore/rollback. An empty nproc reading also produced the degenerate --multi-thread-streams 0. The initial download fix bounded the stream count, but sized the read-ahead buffers and the multi-thread streams against the full budget independently, so the two pools could together reach ~2× the intended fraction (e.g. ~3.5 GB on an 8-core/4 GB host) and still OOM; this is now corrected by splitting one shared budget between them.
Changes
- fix(backup): cap rclone upload memory to prevent OOM kills. Introduces compute_rclone_resources( $cpu_cores, $available_ram, $is_s3 ), which budgets rclone's total footprint transfers × (buffer_size + s3_chunk_size × s3_concurrency) against a fraction of available RAM (default 50%, clamped 10–90%). It scales parallelism with cores (transfers 2–8), and on a constrained host first shrinks S3 multipart concurrency (the largest lever at 64M/chunk) and then --transfers until the budget fits, with --buffer-size clamped to [16M, 256M]. rclone_upload() now calls the helper and logs an EE::debug "rclone upload tuning" line with the chosen values and estimated peak.
- fix(restore): cap rclone download streams and transfers to available memory. rclone_download() now reuses compute_rclone_resources() with is_s3 = false (S3 multipart concurrency does not apply to downloads) to derive memory-safe --transfers, --checkers and --buffer-size, then caps --multi-thread-streams so that transfers × streams × per_stream_mem fits the same available-RAM budget, flooring at 1 (which also fixes the nproc → 0 degenerate case). Per-stream cost is modeled from rclone's real levers --multi-thread-write-buffer-size and --multi-thread-chunk-size, and an EE::debug "rclone download tuning" line logs the chosen values and estimated peak.
- fix(backup): bound rclone memory within one budget and harden RAM detection. Review follow-up to the two fixes above. The download read-ahead buffers and the multi-thread streams were each sized against the full mem-fraction budget independently, so the combined footprint could reach ~2× the intended fraction (e.g. ~3.5 GB on an 8-core/4 GB host) and still OOM; rclone_download() now splits one shared budget, with transfers × (buffer + streams × per_stream_mem) ≤ budget by construction, between the read-ahead buffer (up to half per transfer, capped at max_buffer) and the streams, while still scaling streams up on memory-rich hosts. The brittle free -m | grep Mem | awk '{print $7}' probe (English-locale + fixed-column) is replaced by a get_available_ram_mb() helper that pins LC_ALL=C and locates the available column by header name (falling back to the free column on older free/procps), now used by both upload and download. The download command now actually emits --multi-thread-write-buffer-size (previously the rclone-mt-write-buffer-size knob only fed the memory model and had no effect on rclone) and no longer forces --checkers (it returns to rclone's default of 8 instead of being bound to the memory-derived transfer count, which needlessly throttled the compare phase). Finally, compute_rclone_resources() now returns budget and max_buffer so the download consumes them instead of recomputing the budget formula.
Behavior / impact
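As a rough illustration of the upload budgeting described in the first change, the following Python sketch (not the PR's PHP) fits transfers × (buffer_size + s3_chunk_size × s3_concurrency) into a fraction of available RAM. The names old_buffer_size and fit_upload_budget, the one-step-at-a-time shrink loop, and the choice to pass the starting transfers and S3 concurrency in as arguments are assumptions made for illustration; how compute_rclone_resources() actually derives those starting values from the core count is not spelled out here.

```python
S3_CHUNK_MB = 64    # per-chunk size quoted in the PR text (64M/chunk)
MIN_BUFFER_MB = 16  # rclone's default --buffer-size, used as the lower clamp


def old_buffer_size(available_mb, transfers):
    """Previous sizing: the read-ahead buffers alone add up to ~all available RAM."""
    return available_mb // transfers


def fit_upload_budget(available_mb, transfers, s3_concurrency, is_s3,
                      mem_fraction=0.5, max_buffer_mb=256):
    """Hypothetical sketch of the shared memory budget for uploads (values in MB)."""
    # Clamp the configured fraction to [0.1, 0.9] and derive the budget.
    fraction = min(max(mem_fraction, 0.1), 0.9)
    budget = int(available_mb * fraction)

    def s3_mb(concurrency):
        # Multipart buffers only exist for S3 backends.
        return S3_CHUNK_MB * concurrency if is_s3 else 0

    # Shrink S3 concurrency first, then transfers, until the smallest
    # possible footprint (minimum buffer per transfer) fits the budget.
    while transfers * (MIN_BUFFER_MB + s3_mb(s3_concurrency)) > budget:
        if is_s3 and s3_concurrency > 1:
            s3_concurrency -= 1
        elif transfers > 1:
            transfers -= 1
        else:
            break  # nothing left to shrink; the floors may exceed a tiny budget

    # Whatever remains per transfer goes to --buffer-size, clamped to [16M, max].
    per_transfer = budget // transfers
    buffer_mb = min(max(per_transfer - s3_mb(s3_concurrency), MIN_BUFFER_MB),
                    max_buffer_mb)
    peak_mb = transfers * (buffer_mb + s3_mb(s3_concurrency))
    return {
        "budget_mb": budget,
        "transfers": transfers,
        "s3_concurrency": s3_concurrency,
        "buffer_mb": buffer_mb,
        "peak_mb": peak_mb,
    }
```

With the production inputs quoted in this PR (2656 MB available, 2 transfers, S3 concurrency 2), the old formula gives a 1328M buffer, while the sketch gives a 256M buffer and an estimated peak of 768 MB.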
Peaks below are derived from the helper logic. The Oracle row uses the production case (1 core, ~2656 MB available, the same value that produced the old 1328M buffer), and the large-host row shows resources still scale up.
Upload
| Host | Before | After (S3) | After (non-S3) |
|---|---|---|---|
| Oracle host | --buffer-size 1328M --transfers 2 → read-ahead peak ~2656 MB (≈100% of available → OOM), plus S3 multipart on top | --transfers 2 --buffer-size 256M --s3-upload-concurrency 2 → peak ~768 MB (~29%) | --transfers 2 --buffer-size 256M → peak ~512 MB |
| Large host | --transfers 4 --buffer-size 4096M (capped) → read-ahead peak ~16 GB | --transfers 8 --buffer-size 256M --s3-upload-concurrency 16 → peak ~10240 MB (within 50% budget) | --transfers 8 --buffer-size 256M → peak ~2048 MB |
Download
| Case | Before | After |
|---|---|---|
| Oracle host | --transfers 4 × --multi-thread-streams 2 = up to 8 streams, memory-unaware | --transfers 2 --buffer-size 256M --multi-thread-streams 2 → peak ~768 MB |
| 8-core / 4 GB host | --transfers 4 × --multi-thread-streams 16 = up to 64 streams, memory-unaware | --transfers 8 --buffer-size 125M --multi-thread-streams 1 → peak ~1513 MB (the read-ahead/stream double-spend that previously pushed this to ~3.5 GB is now fixed) |
| Large host | --transfers 4 × --multi-thread-streams 32 = up to 128 concurrent streams | --transfers 8 --buffer-size 256M --multi-thread-streams 19 → peak ~11.8 GB (within the 12 GB budget; streams scale up to use spare RAM) |
| nproc returns empty | --multi-thread-streams 0 (degenerate) | --multi-thread-streams 1 |
The download command now also emits --multi-thread-write-buffer-size and --multi-thread-chunk-size (so the memory model matches the process), and no longer forces --checkers; it returns to rclone's default of 8 instead of being tied to the memory-derived transfer count.
Net effect: on the constrained Oracle host the backup/restore footprint drops from ~all available RAM to under a third of it, eliminating the OOM kill; on large-memory hosts parallelism and buffers scale up so spare RAM is still utilized.
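The download figures can be cross-checked with a small Python sketch of the shared-budget split: half of each transfer's share goes to the read-ahead buffer (capped at the maximum buffer size) and the remainder to multi-thread streams. This is an illustration, not the PR's PHP. The function name plan_download and its parameters are hypothetical, and it assumes the old min(cpu*2, 32) value is kept as an upper limit on streams. The budgets used in the usage note for the 8-core and large hosts (2000 MB and 12288 MB) are back-solved from the reported peaks rather than stated in the PR.

```python
def plan_download(budget_mb, transfers, cpu_cores,
                  max_buffer_mb=256, mt_write_buffer_kib=128, mt_chunk_mb=64):
    """Hypothetical sketch: split one memory budget between buffers and streams."""
    # Each stream holds one write buffer plus one in-flight chunk.
    per_stream_mb = mt_chunk_mb + mt_write_buffer_kib / 1024
    per_transfer_mb = budget_mb / transfers

    # Read-ahead buffer: up to half of the per-transfer share, capped at the
    # maximum and floored at rclone's 16M default.
    buffer_mb = max(16, int(min(max_buffer_mb, per_transfer_mb / 2)))

    # Streams: whatever the remaining share affords, never above the CPU-based
    # ceiling and never below 1 (this also covers an empty nproc reading of 0).
    stream_ceiling = max(1, min(cpu_cores * 2, 32))
    affordable = int((per_transfer_mb - buffer_mb) // per_stream_mb)
    streams = max(1, min(stream_ceiling, affordable))

    # The floors above can exceed the budget only when the budget is very small.
    peak_mb = transfers * (buffer_mb + streams * per_stream_mb)
    return {"buffer_mb": buffer_mb, "streams": streams, "peak_mb": peak_mb}
```

Under these assumptions, plan_download(1328, 2, 1) gives a 256M buffer with 2 streams (~768 MB), plan_download(2000, 8, 8) gives a 125M buffer with 1 stream (~1513 MB), and plan_download(12288, 8, 16) gives a 256M buffer with 19 streams (~11795 MB), in line with the reported download peaks.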
Configuration
All knobs are optional global config values read via get_config_value() and default to rclone's own defaults, so existing installs need no changes.
| Key | Default | Description |
|---|---|---|
| rclone-mem-fraction | 0.5 | Fraction of currently-available RAM that rclone's footprint is budgeted against. Clamped to [0.1, 0.9]. Applies to both upload and download. |
| rclone-max-buffer-size | 256 (MB) | Upper bound for --buffer-size. Lower bound is fixed at rclone's 16M default. |
| rclone-mt-write-buffer-size | 128 (KiB) | Per-stream write buffer (--multi-thread-write-buffer-size). Used to budget the stream count and now also passed through to rclone (previously it only fed the memory model and had no effect on the process). |
| rclone-mt-chunk-size | 64 (MB) | Per-stream chunk size (--multi-thread-chunk-size), used both to budget streams and passed to rclone. |
Testing
php -l src/helper/Site_Backup_Restore.php passes (no syntax errors).
Suggested maintainer validation:
- Check the EE::debug lines (ee --debug ...): "rclone upload tuning: ..." and "rclone download tuning: ..." should report a sane transfers, a buffer-size within [16M, 256M], and an estimated peak well under available RAM. The download line now also reports the shared budget and the chosen mt-write-buffer / mt-chunk-size, which should match the flags actually passed to rclone.
- Confirm that no OOM kill occurs (dmesg -T | grep -i oom, journalctl -k | grep -i oom) and that the backup/restore completes.
- On a large-memory host, confirm the helper still selects higher transfers and stream counts so throughput is preserved.
- Set rclone-mem-fraction lower/higher and confirm the chosen values move accordingly.
Risk / compatibility
- The download command now passes --multi-thread-chunk-size and --multi-thread-write-buffer-size, both available in modern rclone (the version EasyEngine installs); --multi-thread-streams was already in use. It no longer forces --checkers on download, reverting to rclone's default of 8.
- Available-RAM detection now pins LC_ALL=C and locates the available column by header name (with a free-column fallback for older free/procps), so it is no longer dependent on the system locale or column order.
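The header-based RAM detection can be sketched in Python as follows. This is only an illustration of the approach (pin LC_ALL=C, find the available column by name, fall back to free), not the PHP get_available_ram_mb() helper itself; the function names parse_available_mb and read_available_ram_mb are hypothetical.

```python
import os
import subprocess


def parse_available_mb(free_output):
    """Return the 'available' value (MB) from `free -m` output, or 'free' if absent."""
    lines = free_output.splitlines()
    # The header row is the one that lists the column names.
    header = next(line.split() for line in lines if "total" in line.split())
    mem_row = next(line.split() for line in lines if line.startswith("Mem:"))
    # Older free/procps builds have no 'available' column; use 'free' instead.
    column = "available" if "available" in header else "free"
    # The header has no label above "Mem:", so data values are offset by one.
    return int(mem_row[header.index(column) + 1])


def read_available_ram_mb():
    """Run `free -m` with a pinned C locale so the headers stay in English."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(["free", "-m"], env=env, capture_output=True,
                            text=True, check=True)
    return parse_available_mb(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the column lookup easy to exercise against captured output from both newer and older free builds.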