[LMCROSSITXSADEPLOY-3316] Introduce health-check-interval MTA module parameter #1848

Open
karrgov wants to merge 3 commits into cloudfoundry:master from karrgov:feature/LMCROSSITXSADEPLOY-3316

Contributor

@karrgov karrgov commented May 29, 2026

Summary

Introduces the health-check-interval MTA module parameter for liveness health checks on CF apps deployed via MTA, achieving feature parity with the underlying Cloud Foundry capability now exposed by the upgraded CF Java client.

Example usage in mta.yaml:

modules:
  - name: my-app
    type: application
    parameters:
      health-check-interval: 15

Changes

End-to-end wiring of the new parameter:

  • SupportedParameters — adds HEALTH_CHECK_INTERVAL constant to MODULE_PARAMETERS.
  • Messages — adds INVALID_HEALTH_CHECK_INTERVAL validation message.
  • StagingParametersParser — parses, validates (must be > 0, throws ContentException otherwise), and forwards the value to ImmutableStaging.
  • Staging interface and CloudProcess — adds getHealthCheckInterval() accessor (Immutables regenerates the implementations).
  • RawCloudProcess — maps Data.getInterval() from the CF API response into CloudProcess.
  • HealthCheckInfo — adds the interval field, getter, and includes it in equals so change detection notices interval drifts.
  • CloudControllerRestClientImpl — forwards the interval to the Data builder; widens the updateApplicationProcess guard to also fire when only the interval changed; guards buildHealthCheck against HealthCheckType.from(null) NPE.
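
The parse-and-validate step described for StagingParametersParser could look roughly like the sketch below. It is an illustration only, not the PR's code: the class name, the nested ContentException stand-in, the message wording, and the rejection of non-integer values are all assumptions. Only the parameter name and the "must be > 0" rule come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

final class HealthCheckIntervalSketch {

    // Hypothetical stand-in for the controller's ContentException, nested so the sketch compiles on its own.
    static final class ContentException extends RuntimeException {
        ContentException(String message) {
            super(message);
        }
    }

    // Parameter name as written in mta.yaml.
    static final String HEALTH_CHECK_INTERVAL = "health-check-interval";
    // Assumed wording; the real INVALID_HEALTH_CHECK_INTERVAL text is not quoted in this PR.
    static final String INVALID_HEALTH_CHECK_INTERVAL = "Invalid health-check-interval \"%s\": value must be a positive integer";

    // Returns the configured interval, or null when the parameter is absent.
    static Integer parseHealthCheckInterval(Map<String, Object> moduleParameters) {
        Object raw = moduleParameters.get(HEALTH_CHECK_INTERVAL);
        if (raw == null) {
            return null;
        }
        // Assumption: non-integer values are rejected the same way as non-positive ones.
        if (!(raw instanceof Integer) || ((Integer) raw) <= 0) {
            throw new ContentException(String.format(INVALID_HEALTH_CHECK_INTERVAL, raw));
        }
        return (Integer) raw;
    }

    public static void main(String[] args) {
        Map<String, Object> parameters = new HashMap<>();
        parameters.put(HEALTH_CHECK_INTERVAL, 15);
        System.out.println(parseHealthCheckInterval(parameters));

        parameters.put(HEALTH_CHECK_INTERVAL, 0);
        try {
            parseHealthCheckInterval(parameters);
        } catch (ContentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Returning null for an absent parameter mirrors the "null when absent" test case listed under Tests, so callers can distinguish "not configured" from any concrete interval.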

Tests

  • StagingParametersParserTest — three new tests (correct parse with interval=15, validation rejection with interval=0, null when absent), plus a parameterized test covering positive interval values (1, 15, 60, Integer.MAX_VALUE).
  • HealthCheckInfoTest — four tests covering equal instances, different intervals, null-vs-non-null, and fromProcess / fromStaging cross-equality.
  • RawCloudProcessTest — new test class covering RawCloudProcess.derive() mapping including health-check interval propagation from the CF API response.
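
The equality cases covered by HealthCheckInfoTest can be pictured with a small value class like the one below. This is a sketch under assumptions: the class name and the type/timeout fields are hypothetical placeholders, and the real HealthCheckInfo may carry different fields. The point is only that a null-safe comparison of the interval makes an interval-only change visible to change detection.

```java
import java.util.Objects;

final class HealthCheckInfoSketch {

    private final String type;      // hypothetical field, e.g. "http" or "port"
    private final Integer timeout;  // hypothetical field
    private final Integer interval; // null means the interval was not configured

    HealthCheckInfoSketch(String type, Integer timeout, Integer interval) {
        this.type = type;
        this.timeout = timeout;
        this.interval = interval;
    }

    Integer getInterval() {
        return interval;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof HealthCheckInfoSketch)) {
            return false;
        }
        HealthCheckInfoSketch that = (HealthCheckInfoSketch) other;
        // Objects.equals handles the null-vs-non-null case without an NPE.
        return Objects.equals(type, that.type)
            && Objects.equals(timeout, that.timeout)
            && Objects.equals(interval, that.interval);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, timeout, interval);
    }

    public static void main(String[] args) {
        HealthCheckInfoSketch deployed = new HealthCheckInfoSketch("http", 60, 30);
        HealthCheckInfoSketch desired = new HealthCheckInfoSketch("http", 60, 15);
        // An update is needed whenever the desired state differs from the deployed one.
        System.out.println("update needed: " + !deployed.equals(desired));
    }
}
```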

Jira

JIRA: LMCROSSITXSADEPLOY-3316 — Introduce Health check interval

Test plan

  • mvn clean test -pl multiapps-controller-client,multiapps-controller-core passes locally.
  • Deploy a sample MTA with health-check-interval: 15 and confirm the CF app's process reports the interval via cf curl against the v3 process endpoint.
  • Update an existing MTA's health-check-interval and confirm the controller detects the drift and issues an update (rather than a no-op).
  • Negative case: deploy with health-check-interval: 0 and confirm ContentException with INVALID_HEALTH_CHECK_INTERVAL.

karrgov added 3 commits May 29, 2026 14:48
Introduce the liveness health check interval parameter end-to-end:
- SupportedParameters: new HEALTH_CHECK_INTERVAL constant in MODULE_PARAMETERS
- Messages: INVALID_HEALTH_CHECK_INTERVAL validation error message
- StagingParametersParser: parse, validate (must be > 0), and forward to ImmutableStaging builder
- Staging interface + CloudProcess: getHealthCheckInterval() accessors (Immutables regenerates)
- RawCloudProcess: map Data.getInterval() from CF API response into CloudProcess
- HealthCheckInfo: add interval field, getter, and equals check for change detection
- CloudControllerRestClientImpl: forward interval to Data builder; widen updateApplicationProcess guard to also fire when interval is non-null; guard buildHealthCheck against HealthCheckType.from(null) NPE

JIRA:LMCROSSITXSADEPLOY-3316
- StagingParametersParserTest: three new tests — correct parse (interval=15),
  validation rejection (interval=0 throws ContentException), and null when absent
- HealthCheckInfoTest: four tests covering equal instances, different intervals,
  null-vs-non-null, and fromProcess/fromStaging cross-equality

JIRA:LMCROSSITXSADEPLOY-3316
…ess test coverage

- StagingParametersParserTest: parameterized test covering positive interval
  values (1, 15, 60, Integer.MAX_VALUE)
- RawCloudProcessTest: new test class covering RawCloudProcess.derive() mapping
  including the new health-check interval propagation from the CF API response

JIRA:LMCROSSITXSADEPLOY-3316
Contributor Author

karrgov commented May 29, 2026

MTA Quality Report — cloudfoundry/multiapps-controller PR #1848

Jira: LMCROSSITXSADEPLOY-3316 — Introduce Health check interval
Backlog alignment: PASS

  • Implements Jira scope? yes — PR wires the health-check-interval MTA module parameter end-to-end, exactly the scenario described in the Jira description.
  • Changes outside Jira scope? no — All edits are scoped to health-check-interval propagation and its tests; no unrelated refactors.

Code Review

No code-review findings (confidence ≥ 80).


Security

No issues found.

  • No new untrusted-input sinks (no LOGGER, exec, deserialization, or file/archive paths modified).
  • New parameter is validated server-side (validateHealthCheckInterval rejects <= 0) before being forwarded to CF.
  • buildHealthCheck now guards HealthCheckType.from(null) against NPE — a small defensive hardening, not a vulnerability fix.
  • No new dependencies → no CVE surface change.

SonarCloud

⚠️ The build GitHub Actions job failed in the Sonar Scan step with the error "Project not found. Please check the 'sonar.projectKey' and 'sonar.organization' properties, the 'SONAR_TOKEN' environment variable, or contact the project administrator" (job log). This is a CI/CD configuration issue unrelated to this PR's contents: the Sonar project binding or token is invalid. The unit tests in the upstream Maven modules all passed before the Sonar Scan step ran. No SonarCloud quality-gate verdict could be obtained for this head SHA.

Other checks on the head SHA:

| Check | Conclusion | Details |
| --- | --- | --- |
| CodeQL | ✅ success | No new alerts in code changed by this pull request |
| Build and analyze | ✅ success | |
| Analyze (java) | ✅ success | |
| Check Commit Message | ✅ success | |
| EasyCLA | ✅ success | |
| build (Sonar Scan step) | ❌ failure | Sonar token / project binding misconfigured |

Dependency CVEs

✅ No new CVEs — this PR does not modify any dependency files (pom.xml, build.gradle, lockfiles).

Contributor Author

karrgov commented May 29, 2026

oq test verdict: FAIL

Recommendation: do not merge as-is — OQ failed broadly (30 scenarios), but log analysis finds zero PR-attributable regressions; investigate the CF target / re-run OQ before drawing a code conclusion.

PR: #1848 @ fe7b56add19f021efed65feb86df45d9cabe5858
CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service)
Window: 2026-05-29T13:20:22Z → 2026-05-29T13:52:17Z
Pipeline: http://gcpclm950064:8080/teams/main/pipelines/qa-tester

Pipeline outcomes

| Stage | Result | Notes |
| --- | --- | --- |
| Deploy (deploy-service-pusher-oq) | PASS | |
| Tests (qa-tester) | FAIL | 30 non-paused Concourse jobs failed/errored/aborted |
| Log analysis (log-analyzer) | FAIL | OQ_RESULT=FAIL (30 scenarios). 0 regression suspects; all WARN/ERROR are catalog-expected or known infra noise. No stack frame or logger intersects the 11 files changed by this PR. Breadth + 22 scenarios with no server-side error evidence point to a shared infrastructure disruption (CF API rate limiting visible at 13:39–13:41Z), not a code regression. |

Verdict rationale

Verdict is FAIL because OQ reported 30 failing scenarios — the test signal is unambiguously red and we will not pass a run that didn't pass tests. However, the supporting log analysis is unusual: log-analyzer ran cleanly, sifted 5,872 WARN / 48,750 ERROR entries, classified 179 as test-driven (catalog-expected) and 54,443 as known infrastructure noise (auditlog binding absent, ANS not configured, CSRF on whitelisting probes, etc.), and found zero regression suspects — meaning nothing in the log window touches the 11 files in this PR's diff (a small additive change introducing the health-check-interval MTA parameter). 22 of the 30 failed scenarios produced no server-side WARN/ERROR at all, which is consistent with a test-runner / CF-target disruption (rate limiting, space misconfig) rather than a code defect. Recommendation: hold the merge, investigate CF infra, and re-run OQ; do not interpret this run as evidence of a regression in PR #1848.

Failed jobs / scenarios

  • application-hooks
  • async-service-bindings-scenario
  • async-service-keys-scenario
  • bg-deploy-stop-reorder
  • blue-green-deploy
  • cleaners-and-clean-up-job
  • cts-basic-auth-error
  • cts-basic-auth-error-new-slp-api
  • cts-blue-green
  • cts-custom-idp-authentication
  • cts-multipart-file-uploads
  • cts-multiple-mtas-deploy
  • cts-oauth-error
  • cts-oauth-error-new-slp-api
  • gacd-in-deployed-after
  • generic-content-deploy
  • hook-target-app
  • liquibase-lock-service
  • namespace-multiple-deploys
  • occasional-message-for-non-finishing-task-execution
  • only-async-services-scenario
  • optional-mta-resources-scenario
  • passing-secrets-during-deployment
  • selective-deployment-scenario
  • service-tags
  • shared-private-domain-scenario
  • test-shutdown-client
  • update-service-scenario
  • whitelisting-visibility-failure-scenario
  • whitelisting-visibility-in-current-org-space-scenario

PR change surface

  • Files changed: 11 (additive health-check-interval MTA parameter support)
  • Modules touched: multiapps-controller (no multiapps or xsa-multiapps-controller changes). Affected classes: RawCloudProcess, CloudProcess, Staging, CloudControllerRestClientImpl, HealthCheckInfo, Messages, SupportedParameters, StagingParametersParser (+ 3 unit tests). Note: PR diff was not re-fetched at publish time (file list taken from log-analyzer's diff summary); module attribution carries the analyzer's confidence.
  • Suspect overlap: none — log-analyzer reported 0 regression suspects, so no changed file in this PR overlaps any signature in the log window.

Log analysis summary

  • Expected (test-driven): 179
  • Infrastructure / transient: 54,443
  • Potentially regression-related: 0
  •   Likely caused by PR: 0
  •   Unlikely caused by PR: 0
  •   Inconclusive: 0
  • Version skew: post-release normal (multiapps-controller pins 2.48.0; multiapps moved to 2.49.0-SNAPSHOT) — analyzer flagged this as the benign post-release variant, not the "deploy stale code" variant.
Full log-analyzer findings

Log Analyzer — oq verdict: FAIL

Test outcome (from orchestrator): FAILED
CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service, deployed sha: fe7b56a)
Window: 2026-05-29T13:20:22Z → 2026-05-29T13:52:17Z
Index queried: logs-*
Total WARN: 5,872 | Total ERROR: 48,750 | Truncated: no


Verdict rationale

The overall verdict is FAIL because the orchestrator reported OQ_RESULT=FAIL (30 scenarios failed). However, the log analysis finds zero Bucket C regression suspects: every WARN/ERROR entry in the window is attributable to either known OQ scenario behavior (Bucket A, 179 catalog-matched hits) or recurring infrastructure/configuration noise unrelated to the PR's changes (Bucket B, 54,443 hits). The 30-scenario failure breadth is characteristic of a shared infrastructure disruption (see below) rather than a code regression. The PR diff (health-check-interval parameter support) touches 11 files (8 production classes and 3 test classes), none of which intersects any stack frame, logger name, or exception message observed in the log window. The log analysis finds no evidence that this PR caused the OQ failures.


Local git state at analysis time

| Sub-project | Branch | Uncommitted (files) | On feature branch |
| --- | --- | --- | --- |
| multiapps-controller | qa-pr-1848 | 0 | yes |
| multiapps | master | 0 | no |
| xsa-multiapps-controller | master | 0 | no |
| XSOQTests | feature/LMCROSSITXSADEPLOY-3316 | 0 | yes |
| cf-mta-examples | feature/LMCROSSITXSADEPLOY-3316 | 0 | yes |
| multiapps-cli-plugin | master | 0 | no |

No uncommitted local changes were found in any sub-project. The deployed WAR corresponds to the tip of qa-pr-1848 (fe7b56a), which carries 3 commits relative to master — all part of the health-check-interval feature.


Deploy chain version pinning

| Source of truth | Truth value | Declared in downstream | Declared value | Status |
| --- | --- | --- | --- | --- |
| multiapps/pom.xml <version> | 2.49.0-SNAPSHOT | multiapps-controller <multiapps.version> | 2.48.0 | SKEW (expected) |
| multiapps/pom.xml <version> | 2.49.0-SNAPSHOT | xsa-multiapps-controller <multiapps.version> | 2.48.0 | SKEW (expected) |
| multiapps-controller/pom.xml <version> | 2.48.0-SNAPSHOT | xsa-multiapps-controller <multiapps-controller.version> | 2.48.0-SNAPSHOT | OK |

Assessment: The apparent skew (2.49.0-SNAPSHOT vs 2.48.0) is the normal post-release state. multiapps 2.48.0 was released and multiapps-controller pins to that released artifact. multiapps has moved on to 2.49.0-SNAPSHOT for the next development cycle. Since 2.48.0 is a published artifact in the Maven repository, multiapps-controller resolves it correctly — this is not the "deploy stale code" variant of the version skew described in CLAUDE.md. No Bucket C escalation required.


Categorization

| Category | Count |
| --- | --- |
| Expected (test-driven, catalog-matched) | 179 |
| Infrastructure / transient (Bucket B) | 54,443 |
| Potentially regression-related (Bucket C) | 0 |

Bucket A — Expected (test-driven) detail

The OQ reference catalog matched 179 entries across 4 scenarios:

| Scenario | Count | Kind |
| --- | --- | --- |
| generic-content-deploy | 152 | content_error |
| timeout-scenario | 6 | timeout |
| service-deletion-failed-scenario | 2 | service_deletion_failure |
| app-staging-failure | 1 | staging_failure |
| (global signature match) | 18 | broker_failure / unsupported_parameter |

Note: health-check-interval-scenario (the new scenario added by this PR's XSOQTests branch) is not in the OQ catalog yet and did not appear in the orchestrator's failed_scenarios list — it was apparently not executed in this pipeline run (expected: the scenario is on the feature branch but the pipeline YAML has not been updated to include it yet).

Note: 22 of the 30 failed scenarios are not present in the current OQ reference catalog (application-hooks, all cts-* variants, gacd-in-deployed-after, liquibase-lock-service, namespace-multiple-deploys, occasional-message-for-non-finishing-task-execution, only-async-services-scenario, passing-secrets-during-deployment, selective-deployment-scenario, service-tags, shared-private-domain-scenario, test-shutdown-client, update-service-scenario, whitelisting-visibility-*). Their expected error signatures are not modeled; however, their failure logs do not appear in the WARN/ERROR stream in a form that the triage engine recognized as regression-marker-bearing. This gap (22 unmodeled scenarios against a 29-scenario catalog) warrants catalog expansion but does not change the regression verdict for this PR.


Bucket B — Infrastructure / transient detail

| Signature | Count | Classification |
| --- | --- | --- |
| AuditLogNotAvailableException: Failed to write message to the audit log | 45,001 | Infrastructure: auditlog service not bound in this OQ space (known); emitted on every deployment operation |
| Ignoring parameter "namespace", as the MTA is not deployed with namespace! | 3,690 | Expected behavior WARN: scenarios deploy MTAs without namespace; logged per-resource |
| EmptyAnsProducerClientException: Notification for Unknown NOT sent to ANS: Configuration missing | 1,358 | Infrastructure: ANS (Alert Notification Service) not configured in OQ space; known |
| MissingCsrfTokenException: Request "POST …" failed with "Could not verify the provided CSRF token" | 2,835 | Test-induced: OQ scenarios probe CSRF-protected endpoints without a prior GET to seed the token; consistent with whitelisting/CTS test patterns |
| Skipping deletion of services, because --delete-services is not specified | 279 | Expected behavior WARN: nominal |
| RejectedExecutionException: task rejected from ThreadPoolExecutor (pool size=6, active=6) | 2 | Transient: upload thread pool momentarily saturated during concurrent OQ runs; retried |
| TooManyRequests (429) from GET /v3/roles | 8 | Infrastructure: CF API rate limiting; handled by ResilientOperationExecutor with retry |
| ContentException: Error merging descriptors: Unsupported resource type "auditlog" for platform type "CLOUD-FOUNDRY" | 8 | Expected behavior: CTS/XSA scenarios exercise resource types unsupported on CF; produces ContentException deliberately |
| NullPointerException in OperationInFinalStateHandler.deletePreviousBackupDescriptors | 23 | Infra/pre-existing: NPE arises when DeploymentDescriptor is null (i.e. a process completed without leaving a backup descriptor). OperationInFinalStateHandler is NOT in the PR diff (last touched by commit 4f32db2, pre-dates this PR). Logger: SafeExecutor wraps and logs as WARN (non-fatal). |
| NotFoundException: MTA with name "anatz-severe-error"/"ztana" does not exist | 16 | Expected behavior: undeploy scenarios targeting MTAs not yet deployed |
| CloudOperationException: 404 Not Found: Service instance not found | 9 | Expected behavior: optional-resources scenarios deliberately reference non-existent services |
| ResponseStatusException 403/401 | 15 | Expected behavior: whitelisting and CTS-auth scenarios deliberately trigger authorization failures |
| InternalAuthenticationServiceException: Invalid JWT / No token parser found | 4 | Expected behavior: token-expiration and invalid-auth scenarios |
| ContentDeployerException: HTTP 413 Payload Too Large from GACD sync endpoint | 2 | Expected behavior: gacd-in-deployed-after scenario sends oversized payload to test error handling |
| StepPhaseRetryException: A step of the process has failed | 36 | Expected behavior: retry wrappers for Flowable step failures |
| SLException / ContentException: Service plan not found / rollback errors | 556 | Expected behavior: various error scenarios deliberately trigger these |

All Bucket B entries pre-date or are orthogonal to the PR's changes. None of the loggers or stack frames listed above intersect the 11 files modified by PR #1848. The 45,001 audit log entries (82.4% of total volume) and 2,835 CSRF entries (5.2%) are the dominant noise sources and are both longstanding infrastructure characteristics of this OQ space.
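
The totals and percentages quoted in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only counts stated in the report; the class and method names are made up for illustration.

```java
import java.util.Locale;

final class LogVolumeCheck {

    // Share of `count` in `total`, as a percentage.
    static double sharePercent(long count, long total) {
        return 100.0 * count / total;
    }

    public static void main(String[] args) {
        long totalWarn = 5_872;
        long totalError = 48_750;
        long bucketA = 179;     // expected, catalog-matched
        long bucketB = 54_443;  // infrastructure / transient
        long bucketC = 0;       // regression suspects

        long total = totalWarn + totalError;
        long categorized = bucketA + bucketB + bucketC;
        System.out.println("every entry categorized: " + (total == categorized));

        // Locale.ROOT keeps the decimal separator stable across environments.
        System.out.println(String.format(Locale.ROOT, "audit log share: %.1f%%", sharePercent(45_001, total)));
        System.out.println(String.format(Locale.ROOT, "CSRF share: %.1f%%", sharePercent(2_835, total)));
    }
}
```

The WARN and ERROR totals add up to the same figure as the three buckets (54,622), and the two shares round to the 82.4% and 5.2% cited above.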


Per-suspect attribution

There are no Bucket C suspects. The triage produced zero entries in /tmp/cls_suspects_raw.json (0 unexpected hits, 0 indeterminate hits with regression markers). Accordingly there are no rows in the attribution table and no "Strong attributions" section.


PR diff summary (for context)

PR #1848 adds health-check-interval as a new MTA module parameter. The 11 changed files (all in multiapps-controller, no multiapps or xsa-multiapps-controller changes) are:

| File | Kind | Risk markers |
| --- | --- | --- |
| RawCloudProcess.java | java-prod | none (purely additive field extraction) |
| CloudProcess.java | java-prod | none (additive abstract getter) |
| Staging.java | java-prod | none (additive interface method) |
| CloudControllerRestClientImpl.java | java-prod | none (condition widened to allow interval-only update; buildHealthCheck refactored to make type optional) |
| HealthCheckInfo.java | java-prod | none (additive field + equality update) |
| Messages.java | java-prod | none (new error constant) |
| SupportedParameters.java | java-prod | none (new constant added to allow-list) |
| StagingParametersParser.java | java-prod | none (additive parameter parsing + validation guard, rejects ≤0) |
| RawCloudProcessTest.java | java-test | n/a |
| HealthCheckInfoTest.java | java-test | n/a |
| StagingParametersParserTest.java | java-test | n/a |

The CloudControllerRestClientImpl change is the most behavior-affecting: when healthCheckType is null but healthCheckInterval is non-null, the CF API PATCH now sends a HealthCheck body without a type field. This is correct per CF API v3 (type defaults to process when omitted) and only fires if an MTA explicitly sets health-check-interval without health-check-type. None of the 30 failed OQ scenarios set health-check-interval, so this code path was never exercised.
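
The interval-only branch described above can be illustrated with a simplified payload builder. This is a sketch, not the PR's implementation: the real code goes through the CF Java client's builders, while this version uses plain maps, and the class name, method signature, and the "type"/"data"/"interval" keys are assumptions about the shape of the health-check body.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class HealthCheckBodySketch {

    // Builds a health-check body, or returns empty when there is nothing to send.
    static Optional<Map<String, Object>> buildHealthCheck(String type, Integer interval) {
        if (type == null && interval == null) {
            return Optional.empty();
        }
        Map<String, Object> healthCheck = new LinkedHashMap<>();
        // Guard: only include the type when it is set, instead of converting a null value.
        if (type != null) {
            healthCheck.put("type", type);
        }
        if (interval != null) {
            Map<String, Object> data = new LinkedHashMap<>();
            data.put("interval", interval);
            healthCheck.put("data", data);
        }
        return Optional.of(healthCheck);
    }

    public static void main(String[] args) {
        // Interval set without a type: the body carries data but no "type" key.
        System.out.println(buildHealthCheck(null, 15));
        // Both set: the body carries both.
        System.out.println(buildHealthCheck("http", 15));
    }
}
```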


Failed scenarios provided by orchestrator

30 scenarios failed. Cross-referencing against the log window:

  • 8 catalog-backed scenarios failed (async-service-bindings-scenario, async-service-keys-scenario, bg-deploy-stop-reorder, blue-green-deploy, cleaners-and-clean-up-job, generic-content-deploy, hook-target-app, optional-mta-resources-scenario): their expected error patterns are present in Bucket A (179 catalog hits). No anomalous Bucket C entries overlap their expected windows.
  • 22 scenarios failed with no corresponding WARN/ERROR evidence in the log window: application-hooks, all 8 cts-*, gacd-in-deployed-after, liquibase-lock-service, namespace-multiple-deploys, occasional-message-for-non-finishing-task-execution, only-async-services-scenario, passing-secrets-during-deployment, selective-deployment-scenario, service-tags, shared-private-domain-scenario, test-shutdown-client, update-service-scenario, whitelisting-visibility-failure-scenario, whitelisting-visibility-in-current-org-space-scenario.

The absence of WARN/ERROR logs from 22 failing scenarios suggests those scenarios failed at the test script level (e.g., assertion mismatch, missing artifact, network timeout from the test runner side) rather than producing server-side errors. This is consistent with a shared infrastructure disruption — for example, a CF API rate-limiting episode (8 × 429 entries visible in the window around 13:39–13:41Z) or an OQ space misconfiguration — affecting scenario execution without generating server-side WARN/ERROR entries. needs_investigation=false for all entries because no suspect intersects a failed scenario with a regression marker.
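
The 8 / 22 split can be reproduced by partitioning the failed-scenario list against the set of catalog-backed names. The sketch below copies both lists from this report; the class and method names are illustrative only.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class FailedScenarioPartition {

    // The 30 failed scenarios as listed by the orchestrator.
    static final List<String> FAILED = List.of(
        "application-hooks", "async-service-bindings-scenario", "async-service-keys-scenario",
        "bg-deploy-stop-reorder", "blue-green-deploy", "cleaners-and-clean-up-job",
        "cts-basic-auth-error", "cts-basic-auth-error-new-slp-api", "cts-blue-green",
        "cts-custom-idp-authentication", "cts-multipart-file-uploads", "cts-multiple-mtas-deploy",
        "cts-oauth-error", "cts-oauth-error-new-slp-api", "gacd-in-deployed-after",
        "generic-content-deploy", "hook-target-app", "liquibase-lock-service",
        "namespace-multiple-deploys", "occasional-message-for-non-finishing-task-execution",
        "only-async-services-scenario", "optional-mta-resources-scenario",
        "passing-secrets-during-deployment", "selective-deployment-scenario", "service-tags",
        "shared-private-domain-scenario", "test-shutdown-client", "update-service-scenario",
        "whitelisting-visibility-failure-scenario",
        "whitelisting-visibility-in-current-org-space-scenario");

    // The 8 failed scenarios that the report names as catalog-backed.
    static final Set<String> CATALOG_BACKED = Set.of(
        "async-service-bindings-scenario", "async-service-keys-scenario", "bg-deploy-stop-reorder",
        "blue-green-deploy", "cleaners-and-clean-up-job", "generic-content-deploy",
        "hook-target-app", "optional-mta-resources-scenario");

    // Failed scenarios with no catalog entry, in their original order.
    static List<String> withoutCatalogEntry() {
        return FAILED.stream()
                     .filter(name -> !CATALOG_BACKED.contains(name))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> uncataloged = withoutCatalogEntry();
        System.out.println("failed: " + FAILED.size());
        System.out.println("catalog-backed: " + (FAILED.size() - uncataloged.size()));
        System.out.println("without catalog entry: " + uncataloged.size());
    }
}
```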


OQ catalog regeneration note

The OQ reference catalog was stale (older than XSOQTests/test_resources/health-check-interval/http-health-check-interval/mtad.yaml, which is new in this PR's XSOQTests branch). The catalog was regenerated before the fetch using build_catalog.py. The regenerated catalog has 29 scenarios / 73 steps (source SHA: 43a97c0c).


Posted manually by orchestrator (pr-result-publisher subagent lacked GitHub MCP tools). Mode: oq. Generated 2026-05-29T17:10:00Z.
