Skip to content

NAS backup: automated infrastructure backup (database, configs, certificates)#12900

Open
jmsperu wants to merge 2 commits into
apache:4.22from
jmsperu:feature/infrastructure-backup-to-nas
Open

NAS backup: automated infrastructure backup (database, configs, certificates)#12900
jmsperu wants to merge 2 commits into
apache:4.22from
jmsperu:feature/infrastructure-backup-to-nas

Conversation

@jmsperu
Copy link
Copy Markdown
Collaborator

@jmsperu jmsperu commented Mar 26, 2026

Summary

Adds a new InfrastructureBackupTask to the NAS backup plugin that performs daily backups of CloudStack infrastructure (database, management/agent configs, SSL certs) to NAS storage.

Problem

CloudStack's NAS backup provider only backs up VM disks. The management server database, agent configurations, SSL certificates, and global settings are not backed up. If the management server fails, all metadata is lost unless someone manually configured mysqldump cron.

Solution

A new background poll task that automatically backs up:

  1. MySQL databases (cloud + optionally cloud_usage) using mysqldump --single-transaction for InnoDB-consistent hot backups
  2. Management server configs (/etc/cloudstack/management/)
  3. Agent configs (/etc/cloudstack/agent/) if present
  4. SSL certificates (/etc/cloudstack/management/cert/) if present
  5. Automatic retention management (removes old backup sets)

Database credentials are read from /etc/cloudstack/management/db.properties at runtime (no secrets stored in global config).

Configuration

Setting Scope Default Description
nas.infra.backup.enabled Global false Master switch for infrastructure backup
nas.infra.backup.location Global (empty) NAS mount path (e.g. /mnt/nas-backup)
nas.infra.backup.retention Global 7 Number of backup sets to keep
nas.infra.backup.include.usage.db Global true Include cloud_usage database

Backup Structure

/mnt/nas-backup/infra-backup/
├── 20260327-020000/
│   ├── cloud-20260327-020000.sql.gz
│   ├── cloud_usage-20260327-020000.sql.gz
│   ├── management-config.tar.gz
│   ├── agent-config.tar.gz
│   └── ssl-certs.tar.gz
├── 20260326-020000/
│   └── ...

Changes

  • New: InfrastructureBackupTask.java - background task implementing BackgroundPollTask
  • Modified: NASBackupProvider.java - added 4 ConfigKeys, scheduled the backup task

Test plan

  • Enable backup framework and set provider to NAS
  • Configure nas.infra.backup.enabled=true and nas.infra.backup.location=/mnt/test-backup
  • Verify backup directory structure is created
  • Verify cloud-*.sql.gz database dump is created and restorable
  • Verify management-config.tar.gz contains /etc/cloudstack/management/ files
  • Verify agent-config.tar.gz is created only when agent config dir exists
  • Set retention=2, trigger multiple cycles, verify old backups are cleaned up
  • Verify task is a no-op when disabled
  • Verify task handles missing db.properties gracefully
  • Verify cloud_usage backup is included/excluded based on config

Adds automated backup of CloudStack infrastructure to NAS storage:
- MySQL databases (cloud + cloud_usage if enabled)
- Management server configuration (/etc/cloudstack/management/)
- Agent configuration (/etc/cloudstack/agent/)
- SSL certificates and keystores

Backup runs daily via BackgroundPollManager, configurable via global settings:
- nas.infra.backup.enabled (default: false)
- nas.infra.backup.location (NAS mount path)
- nas.infra.backup.retention (default: 7 backup sets)
- nas.infra.backup.include.usage.db (default: true)

Backups are stored in the NAS backup storage under infra-backup/
with automatic retention management.

Uses mysqldump --single-transaction for hot database backup
(no table locks, InnoDB consistent snapshot). Database credentials
are read from /etc/cloudstack/management/db.properties.
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 33.89831% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.68%. Comparing base (c1af36f) to head (432c049).
⚠️ Report is 140 commits behind head on 4.22.

Files with missing lines Patch % Lines
...he/cloudstack/backup/InfrastructureBackupTask.java 28.20% 107 Missing and 5 partials ⚠️
...rg/apache/cloudstack/backup/NASBackupProvider.java 76.19% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #12900      +/-   ##
============================================
+ Coverage     17.61%   17.68%   +0.06%     
- Complexity    15676    15801     +125     
============================================
  Files          5917     5923       +6     
  Lines        531537   533336    +1799     
  Branches      64985    65232     +247     
============================================
+ Hits          93610    94297     +687     
- Misses       427369   428387    +1018     
- Partials      10558    10652      +94     
Flag Coverage Δ
uitests 3.69% <ø> (-0.01%) ⬇️
unittests 18.75% <33.89%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@weizhouapache
Copy link
Copy Markdown
Member

Code LGTM. However, I wonder whether backing up the CloudStack database from within CloudStack itself is the right approach.

In most deployments, database backups are usually handled externally by administrators (e.g., via cron jobs or existing backup tooling). That approach may be simpler and more flexible operationally.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 9, 2026

@weizhouapache fair point — the intent here was never to replace external/operational DB backup tooling, but to give small and edge deployments a one-knob "backup the management plane to the same NAS that already holds my VM backups" option for disaster recovery (rebuild from a fresh management server using only what's on the NAS).

Concrete adjustments I can make:

  1. Make the DB component opt-in via an explicit infrastructure.backup.include.database global setting, defaulting to false — so by default we only ship configs + certs (where there's no real ops alternative anyway).
  2. Document explicitly that for production deployments with existing backup tooling, the DB component should stay disabled and be handled externally.
  3. Keep the unified target (same NAS as VM backups) for the configs+certs path so recovery is in one place.

Does that split land closer to where you'd want it?

@DaanHoogland
Copy link
Copy Markdown
Contributor

@weizhouapache fair point — the intent here was never to replace external/operational DB backup tooling, but to give small and edge deployments a one-knob "backup the management plane to the same NAS that already holds my VM backups" option for disaster recovery (rebuild from a fresh management server using only what's on the NAS).

Concrete adjustments I can make:

  1. Make the DB component opt-in via an explicit infrastructure.backup.include.database global setting, defaulting to false — so by default we only ship configs + certs (where there's no real ops alternative anyway).
  2. Document explicitly that for production deployments with existing backup tooling, the DB component should stay disabled and be handled externally.
  3. Keep the unified target (same NAS as VM backups) for the configs+certs path so recovery is in one place.

Does that split land closer to where you'd want it?

sound like good improvements to the PR @jmsperu (not sure if @weizhouapache agrees ;) )

@weizhouapache
Copy link
Copy Markdown
Member

@weizhouapache fair point — the intent here was never to replace external/operational DB backup tooling, but to give small and edge deployments a one-knob "backup the management plane to the same NAS that already holds my VM backups" option for disaster recovery (rebuild from a fresh management server using only what's on the NAS).

Concrete adjustments I can make:

  1. Make the DB component opt-in via an explicit infrastructure.backup.include.database global setting, defaulting to false — so by default we only ship configs + certs (where there's no real ops alternative anyway).
  2. Document explicitly that for production deployments with existing backup tooling, the DB component should stay disabled and be handled externally.
  3. Keep the unified target (same NAS as VM backups) for the configs+certs path so recovery is in one place.

Does that split land closer to where you'd want it?

@jmsperu
I personally suggest you create a cron job on the management server, it still works even when the cloudstack-management service is not running.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Thanks both — and @weizhouapache, the cron-job suggestion is fair for any deployment with existing ops tooling. The hard mode here is the first-day greenfield small deployment where the operator has CloudStack + a NAS and not yet a cron infrastructure; they should be able to opt-in to a unified backup target without standing up a separate process. But that's an opt-in, not a default.

I'll push the three-point scope reduction in the next revision:

  1. New infrastructure.backup.include.database global, default false — DB component skipped unless explicitly enabled
  2. Docs explicitly recommending the cron-job approach for production
  3. Configs + certs path (where there's no realistic alternative) stays as-is, on the same unified NAS target as VM backups

If even the opt-in DB path is undesirable (you'd rather we don't ship that code at all and just keep configs+certs), happy to drop it entirely — let me know.

Also need to fix the codecov failures — adding unit tests in the next push.

Addresses @weizhouapache's review: by default, production deployments should
manage CloudStack DB backups via existing external tooling (cron + mysqldump,
replication, etc), not via the in-process backup task. The DB component is now
gated on a new explicit ConfigKey that defaults to false.

- New ConfigKey: nas.infra.backup.include.database (Boolean, default false,
  Global scope). When false, the InfrastructureBackupTask runs only the
  configs + certs path (where there is no reasonable external alternative).

- nas.infra.backup.enabled description rewritten to make the new split clear
  and to steer production users toward the cron-job pattern for the DB.

- nas.infra.backup.include.usage.db description updated to clarify it has
  no effect unless include.database is also true.

- New InfrastructureBackupTaskTest with 9 tests covering the gating decisions:
  master switch off, empty location, DB-skipped-by-default, both DBs when
  usage enabled, usage flag ignored when DB excluded, props unreadable,
  retention always runs, daily interval. Addresses codecov coverage gap on
  this PR (143 untested lines was the original report).

- backupDatabase/backupDirectory/cleanupOldBackups + loadDbProperties +
  the ConfigKey accessors made protected so tests can override the
  side-effecting paths without standing up ProcessBuilder.
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Pushed the scope reduction discussed above.

  • New nas.infra.backup.include.database global, default false — DB component is fully gated on it. Configs + certs path runs regardless.
  • Updated nas.infra.backup.enabled description to make the new split explicit and steer production users to the cron-job approach for the DB.
  • Added InfrastructureBackupTaskTest (9 tests, all passing locally) covering the gating decisions — should materially close the codecov gap @weizhouapache flagged.
  • Refactored a few methods to protected so the side-effecting paths (mysqldump, tar) can be overridden in tests without standing up real ProcessBuilders.

Ready for another look. If you'd still rather drop the DB component entirely and ship configs+certs only, happy to do that in a follow-up — let me know.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Friendly nudge — the opt-in DB ConfigKey push + new unit tests are queued but GH Actions has it as action_required. Could one of @weizhouapache / @DaanHoogland click Approve and run when convenient so the codecov / test runs actually fire? Thanks!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the NAS backup plugin beyond VM disk backups by adding a scheduled background task that backs up CloudStack “infrastructure” artifacts (management/agent config + SSL certs, and optionally the DB) to a NAS location, with retention cleanup.

Changes:

  • Adds InfrastructureBackupTask (BackgroundPollTask) that creates timestamped backup sets under {nas.infra.backup.location}/infra-backup/.
  • Adds new global ConfigKeys to enable/locate/configure retention and optional DB + usage DB backups.
  • Adds unit tests covering enablement/location gating and DB include flag behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/InfrastructureBackupTask.java Implements the daily infra backup workflow (mysqldump + tar) and retention cleanup.
plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java Registers new ConfigKeys and schedules the new background poll task.
plugins/backup/nas/src/test/java/org/apache/cloudstack/backup/InfrastructureBackupTaskTest.java Adds tests for feature gating and DB inclusion toggles.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +193 to +197
// Use --single-transaction for InnoDB hot backup (no table locks, consistent snapshot)
String[] cmd = {"/bin/bash", "-c",
String.format("mysqldump --single-transaction --routines --triggers --events " +
"-h '%s' -u '%s' -p'%s' '%s' | gzip > '%s'",
dbHost, dbUser, dbPassword, dbName, dumpFile)};
Comment on lines +267 to +275
File[] backups = infraDir.listFiles(File::isDirectory);
if (backups == null || backups.length <= retentionCount) {
return;
}

// Sort by name (timestamp-based), oldest first
Arrays.sort(backups, Comparator.comparing(File::getName));

int toDelete = backups.length - retentionCount;
Comment on lines +282 to +288
private void deleteDirectory(File dir) {
File[] files = dir.listFiles();
if (files != null) {
for (File file : files) {
if (file.isDirectory()) {
deleteDirectory(file);
} else {
Comment on lines +190 to +194
@Override
public boolean configure(String name, Map<String, Object> params) throws ConfigurationException {
super.configure(name, params);
backgroundPollManager.submitTask(new InfrastructureBackupTask(this));
return true;
Comment on lines +60 to +63
private final NASBackupProvider provider;

public InfrastructureBackupTask(NASBackupProvider provider) {
this.provider = provider;
Comment on lines +116 to +119
if (!dir.mkdirs()) {
LOG.error("Failed to create backup directory: {}", backupDir);
return;
}
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 26, 2026

Hi @DaanHoogland @sureshanaparti @kiranchavala — this one's been green for a while now. Adds an automated NAS infrastructure backup (mgmt server db, configs, certs) on top of the existing NAS backup framework. CI is clean, no merge conflicts.

Could one of you take a look when you have a moment? Happy to clarify anything about the design choices in the PR description.

@abh1sar
Copy link
Copy Markdown
Contributor

abh1sar commented May 27, 2026

Hi @jmsperu you have lots of open PRs around improving the NAS provider.
While your contribution is very much appreciated, due diligence is needed in reviewing and testing the PRs, so that they adhere to the coding standard, the code remains easy to read and maintainable and doesn't break the existing functionality.
Since we have limited resources, we need to look at the PRs in a priority order.
Currently I am actively reviewing the incremental backup PR. IMO this PR is low priority. So will get to it later.

We will get to all your PRs, but it might take some time. So please be patient.
In the meantime, may I ask you, on all your open PRs, to

  1. Answer any pending copilot review comments
  2. Test the steps mentioned in the TestPlan (not looking for logs, but just a confirmation that they have been tested. For example, the test cases in the test plan mentioned in this PR are unchecked)

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 27, 2026

Thanks @abh1sar — completely understood on the prioritisation, and no rush from me. Happy for this one to sit until #13074 is through your queue.

In the meantime, on all my open NAS-provider PRs I'll:

  1. Sweep open Copilot threads and respond inline (resolving where addressed, explaining where the suggestion isn't quite right).
  2. Walk each TestPlan and tick the boxes I can legitimately confirm — distinguishing CI-verified items (build, unit tests) from manual cases I've reproduced in my lab; the rest stay unchecked until I genuinely run them.

I'll scope the activity to test-plan confirmations + Copilot replies (no rebases / no force-pushes that would re-trigger reviewers). If you'd prefer I close any of these as low-priority and re-open after #13074 lands, say the word and I'll do that instead.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 27, 2026

Quick clarification on my previous comment — rather than proactively sweeping the open Copilot threads / TestPlans, I'll stand by for your direction. When you (or whoever picks each PR up) want me to act on a specific thread, tick a specific TestPlan item, push a fix, or close a PR as superseded, just ping and I'll turn it around quickly.

Goal is to not add work to your queue while you're heads-down on #13074. Standing by.

@daviftorres
Copy link
Copy Markdown
Contributor

Dear @jmsperu, this is an excellent idea. Many ACS deployments lack thorough backups for real disaster recovery situations.

Here is our current approach for anyone looking for alternative methods:

  • We run backups from a read-only database replica to avoid impacting performance.
  • A script runs continuously (as a service) in a loop with a 15-minute sleep (instead of a cronjob) to prevent overlapping execution.
  • Script:
    • Dumps each database sequentially (never written to disk).
    • Pipes the dump directly into gpg for PGP encryption.
    • Uploads the file using a write-only API key for security.
    • Credentials and keys are loaded as environment variables on service startup.
  • Uploads alternate between three different S3 buckets across North America (West, North, and East) in a round-robin fashion. Each bucket has versioning and lifecycle policies enabled.

Example of the line that dumps, encrypts, and uploads backups:

mysqldump --no-tablespaces --lock-tables=false -R cloud | gpg -r "<PUB_KEY_ID>" --yes --encrypt - | mc pipe --quiet <BUCKET>/acs/cloud.sql.pgp || success=false

I hope it helps someone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants