fix: cap rclone backup/restore memory to prevent OOM kills #477

Open
mrrobot47 wants to merge 3 commits into EasyEngine:develop from mrrobot47:fix/rclone-backup-oom-buffer-size

@mrrobot47
Member

Summary

EasyEngine's rclone-based backup and restore sized --buffer-size and parallelism from available RAM in a way that ignored how rclone actually allocates memory, so a single backup could consume ~100% of available RAM and get OOM-killed. This was observed in production on a 4 GB Oracle Cloud host running 6 WordPress sites (rclone killed mid-backup after consuming ~1.7 GB RSS). This PR replaces the ad-hoc sizing with a shared memory-budget helper that caps rclone's total in-memory footprint at a configurable fraction of currently-available RAM, scaling parallelism and buffers up on large hosts and down on constrained ones.

Root cause

rclone does not allocate one buffer for the whole transfer. It allocates buffers per running transfer, so memory is multiplied by --transfers:

  • Upload. rclone allocates one --buffer-size read-ahead buffer per --transfers, and for S3 backends an additional --s3-chunk-size × --s3-upload-concurrency multipart buffer per transfer. The old code set buffer_size = available_ram / transfers, so the read-ahead buffers alone summed back to ~100% of available RAM: transfers × (available_ram / transfers) = available_ram. On the Oracle box this produced --buffer-size 1328M --transfers 2, i.e. 2 × 1328M = 2656 MB of read-ahead buffers, essentially all available memory, before the S3 multipart buffers were even counted. With MariaDB, Redis, PHP-FPM and nginx already resident, this saturated RAM and the kernel OOM-killed rclone. (Logged buffer sizes across the 6 sites ranged 1023M–1328M.)

  • Download / rollback. rclone_download() ran rclone copy -P --multi-thread-streams min(cpu*2, 32) with no memory awareness and relied on rclone's default --transfers 4. For large files rclone fans out up to transfers × multi-thread-streams concurrent download streams, each holding a --multi-thread-write-buffer-size buffer plus one in-flight --multi-thread-chunk-size range. A many-core host could therefore spawn 4 × 32 = 128 concurrent streams and be OOM-killed during restore/rollback. An empty nproc reading also produced the degenerate --multi-thread-streams 0. The initial download fix bounded the stream count, but sized the read-ahead buffers and the multi-thread streams against the full budget independently, so the two pools could together reach ~2× the intended fraction (e.g. ~3.5 GB on an 8-core/4 GB host) and still OOM — now corrected by splitting one shared budget between them.
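
The arithmetic behind both failure modes can be reproduced in a few lines. This is an illustrative Python sketch, not code from the PR (the helper itself lives in PHP); the function names are hypothetical, and the formulas are the ones quoted above.

```python
def old_upload_read_ahead_mb(available_ram_mb: int, transfers: int) -> int:
    """Total read-ahead memory under the old rule buffer_size = available_ram / transfers."""
    buffer_size_mb = available_ram_mb // transfers  # old per-transfer --buffer-size
    return transfers * buffer_size_mb               # rclone holds one buffer per transfer


def old_download_streams(cpu_cores: int, default_transfers: int = 4) -> int:
    """Worst-case concurrent streams: default --transfers 4 times min(cpu*2, 32)."""
    return default_transfers * min(cpu_cores * 2, 32)


if __name__ == "__main__":
    print(old_upload_read_ahead_mb(2656, 2))  # 2656: all of the Oracle host's available RAM
    print(old_download_streams(16))           # 128 streams on a many-core host
    print(old_download_streams(0))            # 0: the degenerate empty-nproc case
```

Whatever value --transfers takes, the old upload rule multiplies back to the full available RAM, which is why lowering parallelism alone could not have avoided the OOM.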

Changes

  • fix(backup): cap rclone upload memory to prevent OOM kills — introduces compute_rclone_resources( $cpu_cores, $available_ram, $is_s3 ), which budgets rclone's total footprint transfers × (buffer_size + s3_chunk_size × s3_concurrency) against a fraction of available RAM (default 50%, clamped 10–90%). It scales parallelism with cores (transfers 2–8), and on a constrained host first shrinks S3 multipart concurrency (the largest lever at 64M/chunk) and then --transfers until the budget fits, with --buffer-size clamped to [16M, 256M]. rclone_upload() now calls the helper and logs an EE::debug "rclone upload tuning" line with the chosen values and estimated peak.

  • fix(restore): cap rclone download streams and transfers to available memory: rclone_download() now reuses compute_rclone_resources() with is_s3 = false (S3 multipart concurrency does not apply to downloads) to derive memory-safe --transfers, --checkers and --buffer-size, then caps --multi-thread-streams so that transfers × streams × per_stream_mem fits the same available-RAM budget, flooring at 1 (which also fixes the nproc → 0 degenerate case). Per-stream cost is modeled from rclone's real levers --multi-thread-write-buffer-size and --multi-thread-chunk-size, and an EE::debug "rclone download tuning" line logs the chosen values and estimated peak.

  • fix(backup): bound rclone memory within one budget and harden RAM detection — review follow-up to the two fixes above. The download read-ahead buffers and the multi-thread streams were each sized against the full mem-fraction budget independently, so the combined footprint could reach ~2× the intended fraction (e.g. ~3.5 GB on an 8-core/4 GB host) and still OOM; rclone_download() now splits one shared budget — transfers × (buffer + streams × per_stream_mem) ≤ budget by construction — between the read-ahead buffer (up to half per transfer, capped at max_buffer) and the streams, while still scaling streams up on memory-rich hosts. The brittle free -m | grep Mem | awk '{print $7}' probe (English-locale + fixed-column) is replaced by a get_available_ram_mb() helper that pins LC_ALL=C and locates the available column by header name (falling back to free on older free/procps), now used by both upload and download. The download command now actually emits --multi-thread-write-buffer-size (previously the rclone-mt-write-buffer-size knob only fed the memory model and had no effect on rclone) and no longer forces --checkers (it returns to rclone's default of 8 instead of being bound to the memory-derived transfer count, which needlessly throttled the compare phase). Finally compute_rclone_resources() now returns budget and max_buffer so the download consumes them instead of recomputing the budget formula.
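
The header-based RAM probe described in the last bullet could look roughly like the following. This is a Python illustration under assumptions, not the PR's PHP get_available_ram_mb(): the function names are hypothetical, and the parser only handles the two `free -m` layouts mentioned (with and without an `available` column).

```python
import os
import subprocess


def parse_available_ram_mb(free_output: str) -> int:
    """Find the 'available' column by header name, falling back to 'free'."""
    lines = [ln for ln in free_output.splitlines() if ln.strip()]
    header = lines[0].split()                      # column names, e.g. total used free ...
    mem_row = next(ln for ln in lines if ln.lstrip().startswith("Mem:"))
    values = mem_row.split()[1:]                   # drop the "Mem:" label
    column = "available" if "available" in header else "free"
    return int(values[header.index(column)])


def read_available_ram_mb() -> int:
    """Run `free -m` with LC_ALL=C so the header and row labels are not localized."""
    env = {**os.environ, "LC_ALL": "C"}
    result = subprocess.run(["free", "-m"], env=env, capture_output=True,
                            text=True, check=True)
    return parse_available_ram_mb(result.stdout)
```

Looking the column up by name means a layout without `available` degrades to the more conservative `free` figure instead of silently reading the wrong field.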

Behavior / impact

Peaks below are derived from the helper logic. The Oracle row uses the production case (1 core, ~2656 MB available — the same value that produced the old 1328M buffer), and the large-host row shows resources still scale up.

Upload

| Host | Old | New |
| --- | --- | --- |
| Oracle 1 core / ~2656 MB avail (S3) | --buffer-size 1328M --transfers 2 → read-ahead peak ~2656 MB (≈100% of available → OOM), plus S3 multipart on top | --transfers 2 --buffer-size 256M --s3-upload-concurrency 2 → peak ~768 MB (~29%) |
| Oracle 1 core / ~2656 MB avail (non-S3) | as above | --transfers 2 --buffer-size 256M → peak ~512 MB |
| 16 core / 24 GB avail (S3) | --transfers 4 --buffer-size 4096M (capped) → read-ahead peak ~16 GB | --transfers 8 --buffer-size 256M --s3-upload-concurrency 16 → peak ~10240 MB (within 50% budget) |
| 16 core / 24 GB avail (non-S3) | as above | --transfers 8 --buffer-size 256M → peak ~2048 MB |
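
The "New" upload peaks can be re-derived from the stated formula transfers × (buffer_size + s3_chunk_size × s3_concurrency). The Python sketch below is a model under assumptions, not the PR's compute_rclone_resources(): the starting --transfers and S3 concurrency are passed in as they appear in the table, because the description does not spell out how they are derived from core count, and the shrink floors (concurrency 1, transfers 1) are guesses.

```python
S3_CHUNK_MB = 64  # per-chunk size cited in the description (assumed fixed here)


def clamp(value, low, high):
    return max(low, min(value, high))


def peak_mb(transfers, buffer_mb, s3_concurrency):
    """Estimated peak: one read-ahead buffer plus the multipart buffers per transfer."""
    return transfers * (buffer_mb + S3_CHUNK_MB * s3_concurrency)


def fit_upload_to_budget(available_ram_mb, transfers, s3_concurrency=0,
                         mem_fraction=0.5, max_buffer_mb=256, min_buffer_mb=16):
    budget_mb = int(available_ram_mb * clamp(mem_fraction, 0.1, 0.9))
    # Shrink S3 concurrency first (the largest lever), then transfers.
    while peak_mb(transfers, min_buffer_mb, s3_concurrency) > budget_mb:
        if s3_concurrency > 1:
            s3_concurrency -= 1
        elif transfers > 1:
            transfers -= 1
        else:
            break  # nothing left to shrink in this sketch
    # Give the read-ahead buffer what is left of each transfer's share.
    per_transfer_mb = budget_mb // transfers
    buffer_mb = clamp(per_transfer_mb - S3_CHUNK_MB * s3_concurrency,
                      min_buffer_mb, max_buffer_mb)
    return {
        "transfers": transfers,
        "buffer_mb": buffer_mb,
        "s3_concurrency": s3_concurrency,
        "budget_mb": budget_mb,
        "peak_mb": peak_mb(transfers, buffer_mb, s3_concurrency),
    }


if __name__ == "__main__":
    print(fit_upload_to_budget(2656, 2, 2)["peak_mb"])    # 768   (Oracle, S3)
    print(fit_upload_to_budget(2656, 2)["peak_mb"])       # 512   (Oracle, non-S3)
    print(fit_upload_to_budget(24576, 8, 16)["peak_mb"])  # 10240 (16 core, S3)
    print(fit_upload_to_budget(24576, 8)["peak_mb"])      # 2048  (16 core, non-S3)
```

On a tighter host, for example 1024 MB available with 4 transfers and S3 concurrency 4, this model drops concurrency to 1 before touching --transfers and lands exactly on the 512 MB budget.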

Download

| Host | Old | New |
| --- | --- | --- |
| Oracle 1 core / ~2656 MB avail | default --transfers 4 × --multi-thread-streams 2 = up to 8 streams, memory-unaware | --transfers 2 --buffer-size 256M --multi-thread-streams 2 → peak ~768 MB |
| 8 core / ~4000 MB avail | default --transfers 4 × --multi-thread-streams 16 = up to 64 streams, memory-unaware | --transfers 8 --buffer-size 125M --multi-thread-streams 1 → peak ~1513 MB (the read-ahead/stream double-spend that previously pushed this to ~3.5 GB is now fixed) |
| 16 core / 24 GB avail | default --transfers 4 × --multi-thread-streams 32 = up to 128 concurrent streams | --transfers 8 --buffer-size 256M --multi-thread-streams 19 → peak ~11.8 GB (within the 12 GB budget; streams scale up to use spare RAM) |
| nproc returns empty | --multi-thread-streams 0 (degenerate) | floored to --multi-thread-streams 1 |
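
The download rows follow from splitting one budget per transfer between the read-ahead buffer and the multi-thread streams. The Python sketch below is an illustration, not the PR's rclone_download(): the function name is hypothetical, --transfers is passed in as shown in the table, and keeping the old min(cpu × 2, 32) ceiling on streams is an assumption inferred from the Oracle row.

```python
def plan_download(cpu_cores, budget_mb, transfers, max_buffer_mb=256,
                  min_buffer_mb=16, mt_write_buffer_kib=128, mt_chunk_mb=64):
    # Per-stream cost: one in-flight chunk plus one write buffer.
    per_stream_mb = mt_chunk_mb + mt_write_buffer_kib / 1024
    per_transfer_mb = budget_mb / transfers

    # Read-ahead buffer gets up to half of each transfer's share, capped at max_buffer.
    buffer_mb = int(min(max(per_transfer_mb / 2, min_buffer_mb), max_buffer_mb))

    # Streams use what remains, bounded by the CPU-based ceiling (assumed) and floored at 1.
    cpu_streams = min(cpu_cores * 2, 32)
    mem_streams = int((per_transfer_mb - buffer_mb) // per_stream_mb)
    streams = max(1, min(cpu_streams, mem_streams))

    peak_mb = transfers * (buffer_mb + streams * per_stream_mb)
    return {"transfers": transfers, "buffer_mb": buffer_mb,
            "streams": streams, "peak_mb": peak_mb}


if __name__ == "__main__":
    print(plan_download(1, 1328, 2))    # buffer 256, streams 2, peak 768.5
    print(plan_download(8, 2000, 8))    # buffer 125, streams 1, peak 1513.0
    print(plan_download(16, 12288, 8))  # buffer 256, streams 19, peak 11795.0
    print(plan_download(0, 1328, 2))    # empty nproc read as 0 cores: streams floored to 1
```

Because the floor of one stream is applied after the memory split, this simplified model can exceed the budget on very small hosts; it does not reproduce whatever further shrinking of --transfers the real helper performs.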

The download command now also emits --multi-thread-write-buffer-size and --multi-thread-chunk-size (so the memory model matches the process), and no longer forces --checkers — it returns to rclone's default of 8 instead of being tied to the memory-derived transfer count.

Net effect: on the constrained Oracle host the backup/restore footprint drops from ~all available RAM to under a third of it, eliminating the OOM kill; on large-memory hosts parallelism and buffers scale up so spare RAM is still utilized.

Configuration

All knobs are optional global config values read via get_config_value(). Each has a built-in default (the two multi-thread values and the 16M buffer floor match rclone's own defaults), so existing installs need no changes.

| Config key | Default | Effect |
| --- | --- | --- |
| rclone-mem-fraction | 0.5 | Fraction of currently-available RAM rclone's total footprint may use, clamped to [0.1, 0.9]. Applies to both upload and download. |
| rclone-max-buffer-size | 256 (MB) | Upper clamp on --buffer-size. Lower bound is fixed at rclone's 16M default. |
| rclone-mt-write-buffer-size | 128 (KiB) | Per-download-stream write buffer (--multi-thread-write-buffer-size). Used to budget the stream count and now also passed through to rclone (previously it only fed the memory model and had no effect on the process). |
| rclone-mt-chunk-size | 64 (MB) | Per-download-stream in-flight chunk range (--multi-thread-chunk-size), used both to budget streams and passed to rclone. |

Testing

  • php -l src/helper/Site_Backup_Restore.php passes (no syntax errors).
  • Budget math was verified against the helper logic for the Oracle (1 core / ~2656 MB) and a large (16 core / 24 GB) profile; figures in the tables above are reproduced from that logic.

Suggested maintainer validation:

  1. Run a backup and a restore/rollback on a low-RAM host (e.g. a 4 GB VM) and inspect the new EE::debug lines (ee --debug ...): "rclone upload tuning: ..." and "rclone download tuning: ..." should report a sane transfers, a buffer-size within [16M, 256M], and an estimated peak well under available RAM. The download line now also reports the shared budget and the chosen mt-write-buffer / mt-chunk-size, which should match the flags actually passed to rclone.
  2. Confirm rclone is no longer OOM-killed (dmesg -T | grep -i oom, journalctl -k | grep -i oom) and that the backup/restore completes.
  3. On a large-memory host, confirm the debug line shows higher transfers and stream counts so throughput is preserved.
  4. Optionally set rclone-mem-fraction lower/higher and confirm the chosen values move accordingly.

Risk / compatibility

  • The download command now passes --multi-thread-chunk-size and --multi-thread-write-buffer-size, both available in modern rclone (the version EasyEngine installs); --multi-thread-streams was already in use. It no longer forces --checkers on download, reverting to rclone's default of 8.
  • The available-RAM probe now pins LC_ALL=C and locates the available column by header name (with a free-column fallback for older free/procps), so it is no longer dependent on the system locale or column order.
  • Otherwise backward compatible: all four config knobs are optional and have built-in defaults, so behavior changes only in the direction of staying within a memory budget. No public command, flag, or stored configuration changes.

mrrobot47 added 3 commits June 2, 2026 20:54
rclone allocates one --buffer-size read-ahead buffer per --transfers and, for S3 backends, an additional --s3-chunk-size x --s3-upload-concurrency multipart buffer per transfer. The previous formula set buffer_size = available_ram / transfers, so the read-ahead buffers alone consumed ~100% of available memory (e.g. --buffer-size 1328M --transfers 2 on a 4 GB host), with the S3 multipart buffers unaccounted on top. This saturated RAM and triggered the OOM killer during backups.

Introduce compute_rclone_resources() which budgets rclone's total in-memory footprint -- transfers * (buffer_size + s3_chunk * s3_concurrency) -- against a fraction of currently-available RAM (default 50%, clamped 10-90%). It scales parallelism and buffer size up on larger hosts and shrinks S3 concurrency then transfers on constrained hosts, with buffer-size clamped to [16M, 256M].

Add optional global config knobs rclone-mem-fraction and rclone-max-buffer-size, plus an EE::debug line logging the chosen values and estimated peak memory.
…memory

rclone_download() ran 'rclone copy -P --multi-thread-streams min(cpu*2,32)' with no memory awareness and relied on rclone's default --transfers 4. rclone fans out up to transfers * multi-thread-streams concurrent download streams for large files, each holding a --multi-thread-write-buffer-size buffer plus one in-flight --multi-thread-chunk-size range, so a many-core/low-RAM host (e.g. the same 4 GB box) could spawn 4 * 32 = 128 streams and be OOM-killed during restore or rollback. An empty nproc also yielded the degenerate --multi-thread-streams 0.

Reuse compute_rclone_resources() (is_s3 = false, since S3 multipart-upload concurrency does not apply to downloads) to derive memory-safe --transfers, --checkers and --buffer-size, then cap --multi-thread-streams against the same available-RAM budget, flooring at 1 (which also fixes the nproc->0 case). Per-stream memory is modeled from rclone's real levers --multi-thread-write-buffer-size and --multi-thread-chunk-size, exposed as the rclone-mt-write-buffer-size and rclone-mt-chunk-size config knobs, and an EE::debug line logs the chosen values and estimated peak.
…ection

Follow-up to the rclone OOM fixes, addressing issues found in review:

Download (#1): the --multi-thread-streams budget was checked independently of the read-ahead buffers, so the two pools were each sized against the full mem-fraction budget and the combined download footprint could reach ~2x the intended fraction (e.g. ~3.5 GB on an 8-core/4 GB host) and still OOM during restore/rollback. rclone_download() now splits one shared budget between the read-ahead buffer (up to half per transfer, capped at max_buffer) and the multi-thread streams, so transfers * (buffer + streams * per_stream_mem) <= budget by construction, while still scaling streams up on memory-rich hosts.

Available-RAM detection (EasyEngine#3): the 'free -m | grep Mem | awk {print $7}' probe assumed an English locale and a fixed available-column index. It is now centralized in get_available_ram_mb(), which pins LC_ALL=C and locates the available column by its header name, falling back to free on older free/procps builds that lack it.

multi-thread-write-buffer-size (EasyEngine#4): the rclone-mt-write-buffer-size config value fed the stream-memory model but was never passed to rclone, so the knob had no effect. rclone_download() now emits --multi-thread-write-buffer-size so the model matches the process.

Download checkers (EasyEngine#5): rclone_download() no longer forces --checkers = transfers (it set none before, so rclone used its default of 8); checkers allocate no transfer buffers, so binding them to the memory-derived transfer count needlessly throttled the compare phase.

Dedupe (EasyEngine#9): compute_rclone_resources() now returns the computed budget and max_buffer, so rclone_download() consumes them instead of recomputing the mem-fraction/budget formula, removing the risk of the two copies drifting.