[LMCROSSITXSADEPLOY-3316] Introduce health-check-interval MTA module parameter#1848
[LMCROSSITXSADEPLOY-3316] Introduce health-check-interval MTA module parameter#1848karrgov wants to merge 3 commits into
Conversation
Introduce the liveness health check interval parameter end-to-end: - SupportedParameters: new HEALTH_CHECK_INTERVAL constant in MODULE_PARAMETERS - Messages: INVALID_HEALTH_CHECK_INTERVAL validation error message - StagingParametersParser: parse, validate (must be > 0), and forward to ImmutableStaging builder - Staging interface + CloudProcess: getHealthCheckInterval() accessors (Immutables regenerates) - RawCloudProcess: map Data.getInterval() from CF API response into CloudProcess - HealthCheckInfo: add interval field, getter, and equals check for change detection - CloudControllerRestClientImpl: forward interval to Data builder; widen updateApplicationProcess guard to also fire when interval is non-null; guard buildHealthCheck against HealthCheckType.from(null) NPE JIRA:LMCROSSITXSADEPLOY-3316
- StagingParametersParserTest: three new tests — correct parse (interval=15), validation rejection (interval=0 throws ContentException), and null when absent - HealthCheckInfoTest: four tests covering equal instances, different intervals, null-vs-non-null, and fromProcess/fromStaging cross-equality JIRA:LMCROSSITXSADEPLOY-3316
…ess test coverage - StagingParametersParserTest: parameterized test covering positive interval values (1, 15, 60, Integer.MAX_VALUE) - RawCloudProcessTest: new test class covering RawCloudProcess.derive() mapping including the new health-check interval propagation from the CF API response JIRA:LMCROSSITXSADEPLOY-3316
MTA Quality Report — cloudfoundry/multiapps-controller PR #1848Jira: LMCROSSITXSADEPLOY-3316 — Introduce Health check interval
Code ReviewNo code-review findings (confidence ≥ 80). SecurityNo issues found.
SonarCloud
Other checks on the head SHA:
Dependency CVEs✅ No new CVEs — this PR does not modify any dependency files ( |
|
| Stage | Result | Notes |
|---|---|---|
Deploy (deploy-service-pusher-oq) |
PASS | — |
Tests (qa-tester) |
FAIL | 30 non-paused Concourse jobs failed/errored/aborted |
Log analysis (log-analyzer) |
FAIL | OQ_RESULT=FAIL (30 scenarios). 0 regression suspects; all WARN/ERROR are catalog-expected or known infra noise. No stack frame or logger intersects the 11 files changed by this PR. Breadth + 22 scenarios with no server-side error evidence point to a shared infrastructure disruption (CF API rate limiting visible at 13:39–13:41Z), not a code regression. |
Verdict rationale
Verdict is FAIL because OQ reported 30 failing scenarios — the test signal is unambiguously red and we will not pass a run that didn't pass tests. However, the supporting log analysis is unusual: log-analyzer ran cleanly, sifted 5,872 WARN / 48,750 ERROR entries, classified 179 as test-driven (catalog-expected) and 54,443 as known infrastructure noise (auditlog binding absent, ANS not configured, CSRF on whitelisting probes, etc.), and found zero regression suspects — meaning nothing in the log window touches the 11 files in this PR's diff (a small additive change introducing the health-check-interval MTA parameter). 22 of the 30 failed scenarios produced no server-side WARN/ERROR at all, which is consistent with a test-runner / CF-target disruption (rate limiting, space misconfig) rather than a code defect. Recommendation: hold the merge, investigate CF infra, and re-run OQ; do not interpret this run as evidence of a regression in PR #1848.
Failed jobs / scenarios
- application-hooks
- async-service-bindings-scenario
- async-service-keys-scenario
- bg-deploy-stop-reorder
- blue-green-deploy
- cleaners-and-clean-up-job
- cts-basic-auth-error
- cts-basic-auth-error-new-slp-api
- cts-blue-green
- cts-custom-idp-authentication
- cts-multipart-file-uploads
- cts-multiple-mtas-deploy
- cts-oauth-error
- cts-oauth-error-new-slp-api
- gacd-in-deployed-after
- generic-content-deploy
- hook-target-app
- liquibase-lock-service
- namespace-multiple-deploys
- occasional-message-for-non-finishing-task-execution
- only-async-services-scenario
- optional-mta-resources-scenario
- passing-secrets-during-deployment
- selective-deployment-scenario
- service-tags
- shared-private-domain-scenario
- test-shutdown-client
- update-service-scenario
- whitelisting-visibility-failure-scenario
- whitelisting-visibility-in-current-org-space-scenario
PR change surface
- Files changed: 11 (additive
health-check-intervalMTA parameter support) - Modules touched:
multiapps-controller(nomultiappsorxsa-multiapps-controllerchanges). Affected classes:RawCloudProcess,CloudProcess,Staging,CloudControllerRestClientImpl,HealthCheckInfo,Messages,SupportedParameters,StagingParametersParser(+ 3 unit tests). Note: PR diff was not re-fetched at publish time (file list taken from log-analyzer's diff summary); module attribution carries the analyzer's confidence. - Suspect overlap: none — log-analyzer reported 0 regression suspects, so no changed file in this PR overlaps any signature in the log window.
Log analysis summary
- Expected (test-driven): 179
- Infrastructure / transient: 54,443
- Potentially regression-related: 0
- Likely caused by PR: 0
- Unlikely caused by PR: 0
- Inconclusive: 0
- Version skew: post-release normal (multiapps-controller pins 2.48.0; multiapps moved to 2.49.0-SNAPSHOT) — analyzer flagged this as the benign post-release variant, not the "deploy stale code" variant.
Full log-analyzer findings
Log Analyzer — oq verdict: FAIL
Test outcome (from orchestrator): FAILED
CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service, deployed sha: fe7b56a)
Window: 2026-05-29T13:20:22Z → 2026-05-29T13:52:17Z
Index queried: logs-*
Total WARN: 5,872 | Total ERROR: 48,750 | Truncated: no
Verdict rationale
The overall verdict is FAIL because the orchestrator reported OQ_RESULT=FAIL (30 scenarios failed). However, the log analysis finds zero Bucket C regression suspects — every WARN/ERROR entry in the window is attributable to either known OQ scenario behavior (Bucket A, 179 catalog-matched hits) or recurring infrastructure/configuration noise unrelated to the PR's changes (Bucket B, 54,443 hits). The 30-scenario failure breadth is characteristic of a shared infrastructure disruption (see below) rather than a code regression. The PR diff (health-check-interval parameter support) touches 11 files across 5 Java classes, none of which intersects any stack frame, logger name, or exception message observed in the log window. The log analysis finds no evidence that this PR caused the OQ failures.
Local git state at analysis time
| Sub-project | Branch | Uncommitted (files) | On feature branch |
|---|---|---|---|
| multiapps-controller | qa-pr-1848 | 0 | yes |
| multiapps | master | 0 | no |
| xsa-multiapps-controller | master | 0 | no |
| XSOQTests | feature/LMCROSSITXSADEPLOY-3316 | 0 | yes |
| cf-mta-examples | feature/LMCROSSITXSADEPLOY-3316 | 0 | yes |
| multiapps-cli-plugin | master | 0 | no |
No uncommitted local changes were found in any sub-project. The deployed WAR corresponds to the tip of qa-pr-1848 (fe7b56a), which carries 3 commits relative to master — all part of the health-check-interval feature.
Deploy chain version pinning
| Source of truth | Truth value | Declared in downstream | Declared value | Status |
|---|---|---|---|---|
multiapps/pom.xml <version> |
2.49.0-SNAPSHOT |
multiapps-controller <multiapps.version> |
2.48.0 |
SKEW (expected) |
multiapps/pom.xml <version> |
2.49.0-SNAPSHOT |
xsa-multiapps-controller <multiapps.version> |
2.48.0 |
SKEW (expected) |
multiapps-controller/pom.xml <version> |
2.48.0-SNAPSHOT |
xsa-multiapps-controller <multiapps-controller.version> |
2.48.0-SNAPSHOT |
OK |
Assessment: The apparent skew (2.49.0-SNAPSHOT vs 2.48.0) is the normal post-release state. multiapps 2.48.0 was released and multiapps-controller pins to that released artifact. multiapps has moved on to 2.49.0-SNAPSHOT for the next development cycle. Since 2.48.0 is a published artifact in the Maven repository, multiapps-controller resolves it correctly — this is not the "deploy stale code" variant of the version skew described in CLAUDE.md. No Bucket C escalation required.
Categorization
| Category | Count |
|---|---|
| Expected (test-driven, catalog-matched) | 179 |
| Infrastructure / transient (Bucket B) | 54,443 |
| Potentially regression-related (Bucket C) | 0 |
Bucket A — Expected (test-driven) detail
The OQ reference catalog matched 179 entries across 4 scenarios:
| Scenario | Count | Kind |
|---|---|---|
| generic-content-deploy | 152 | content_error |
| timeout-scenario | 6 | timeout |
| service-deletion-failed-scenario | 2 | service_deletion_failure |
| app-staging-failure | 1 | staging_failure |
| (global signature match) | 18 | broker_failure / unsupported_parameter |
Note: health-check-interval-scenario (the new scenario added by this PR's XSOQTests branch) is not in the OQ catalog yet and did not appear in the orchestrator's failed_scenarios list — it was apparently not executed in this pipeline run (expected: the scenario is on the feature branch but the pipeline YAML has not been updated to include it yet).
Note: 15 of the 30 failed scenarios are not present in the current OQ reference catalog (application-hooks, all cts-* variants, gacd-in-deployed-after, liquibase-lock-service, namespace-multiple-deploys, occasional-message-for-non-finishing-task-execution, only-async-services-scenario, passing-secrets-during-deployment, selective-deployment-scenario, service-tags, shared-private-domain-scenario, test-shutdown-client, update-service-scenario, whitelisting-visibility-*). Their expected error signatures are not modeled; however, their failure logs do not appear in the WARN/ERROR stream in a form that the triage engine recognized as regression-marker-bearing. This anomaly count (15/29 catalog scenarios) warrants catalog expansion but does not change the regression verdict for this PR.
Bucket B — Infrastructure / transient detail
| Signature | Count | Classification |
|---|---|---|
AuditLogNotAvailableException: Failed to write message to the audit log |
45,001 | Infrastructure — auditlog service not bound in this OQ space (known); emitted on every deployment operation |
| Ignoring parameter "namespace", as the MTA is not deployed with namespace! | 3,690 | Expected behavior WARN — scenarios deploy MTAs without namespace; logged per-resource |
EmptyAnsProducerClientException: Notification for Unknown NOT sent to ANS: Configuration missing |
1,358 | Infrastructure — ANS (Alert Notification Service) not configured in OQ space; known |
MissingCsrfTokenException: Request "POST …" failed with "Could not verify the provided CSRF token" |
2,835 | Test-induced — OQ scenarios probe CSRF-protected endpoints without a prior GET to seed the token; consistent with whitelisting/CTS test patterns |
| Skipping deletion of services, because --delete-services is not specified | 279 | Expected behavior WARN — nominal |
RejectedExecutionException: task rejected from ThreadPoolExecutor (pool size=6, active=6) |
2 | Transient — upload thread pool momentarily saturated during concurrent OQ runs; retried |
TooManyRequests (429) from GET /v3/roles |
8 | Infrastructure — CF API rate limiting; handled by ResilientOperationExecutor with retry |
ContentException: Error merging descriptors: Unsupported resource type "auditlog" for platform type "CLOUD-FOUNDRY" |
8 | Expected behavior — CTS/XSA scenarios exercise resource types unsupported on CF; produces ContentException deliberately |
NullPointerException in OperationInFinalStateHandler.deletePreviousBackupDescriptors |
23 | Infra/pre-existing — NPE arises when DeploymentDescriptor is null (i.e. a process completed without leaving a backup descriptor). OperationInFinalStateHandler is NOT in the PR diff (last touched by commit 4f32db2, pre-dates this PR). Logger: SafeExecutor wraps and logs as WARN — non-fatal. |
NotFoundException: MTA with name "anatz-severe-error"/"ztana" does not exist |
16 | Expected behavior — undeploy scenarios targeting MTAs not yet deployed |
CloudOperationException: 404 Not Found: Service instance not found |
9 | Expected behavior — optional-resources scenarios deliberately reference non-existent services |
ResponseStatusException 403/401 |
15 | Expected behavior — whitelisting and CTS-auth scenarios deliberately trigger authorization failures |
InternalAuthenticationServiceException: Invalid JWT / No token parser found |
4 | Expected behavior — token-expiration and invalid-auth scenarios |
ContentDeployerException: HTTP 413 Payload Too Large from GACD sync endpoint |
2 | Expected behavior — gacd-in-deployed-after scenario sends oversized payload to test error handling |
StepPhaseRetryException: A step of the process has failed |
36 | Expected behavior — retry wrappers for Flowable step failures |
SLException / ContentException: Service plan not found / rollback errors |
556 | Expected behavior — various error scenarios deliberately trigger these |
All Bucket B entries pre-date or are orthogonal to the PR's changes. None of the loggers or stack frames listed above intersect the 11 files modified by PR #1848. The 45,001 audit log entries (82.4% of total volume) and 2,835 CSRF entries (5.2%) are the dominant noise sources and are both longstanding infrastructure characteristics of this OQ space.
Per-suspect attribution
There are no Bucket C suspects. The triage produced zero entries in /tmp/cls_suspects_raw.json (0 unexpected hits, 0 indeterminate hits with regression markers). Accordingly there are no rows in the attribution table and no "Strong attributions" section.
PR diff summary (for context)
PR #1848 adds health-check-interval as a new MTA module parameter. The 11 changed files (all in multiapps-controller, no multiapps or xsa-multiapps-controller changes) are:
| File | Kind | Risk markers |
|---|---|---|
RawCloudProcess.java |
java-prod | none — purely additive field extraction |
CloudProcess.java |
java-prod | none — additive abstract getter |
Staging.java |
java-prod | none — additive interface method |
CloudControllerRestClientImpl.java |
java-prod | none — condition widened to allow interval-only update; buildHealthCheck refactored to make type optional |
HealthCheckInfo.java |
java-prod | none — additive field + equality update |
Messages.java |
java-prod | none — new error constant |
SupportedParameters.java |
java-prod | none — new constant added to allow-list |
StagingParametersParser.java |
java-prod | none — additive parameter parsing + validation guard (rejects ≤0) |
RawCloudProcessTest.java |
java-test | n/a |
HealthCheckInfoTest.java |
java-test | n/a |
StagingParametersParserTest.java |
java-test | n/a |
The CloudControllerRestClientImpl change is the most behavior-affecting: when healthCheckType is null but healthCheckInterval is non-null, the CF API PATCH now sends a HealthCheck body without a type field. This is correct per CF API v3 (type defaults to process when omitted) and only fires if an MTA explicitly sets health-check-interval without health-check-type. None of the 30 failed OQ scenarios set health-check-interval, so this code path was never exercised.
Failed scenarios provided by orchestrator
30 scenarios failed. Cross-referencing against the log window:
- 8 catalog-backed scenarios failed (async-service-bindings-scenario, async-service-keys-scenario, bg-deploy-stop-reorder, blue-green-deploy, cleaners-and-clean-up-job, generic-content-deploy, hook-target-app, optional-mta-resources-scenario): their expected error patterns are present in Bucket A (179 catalog hits). No anomalous Bucket C entries overlap their expected windows.
- 22 scenarios failed with no corresponding WARN/ERROR evidence in the log window: application-hooks, all 8 cts-*, gacd-in-deployed-after, liquibase-lock-service, namespace-multiple-deploys, occasional-message-for-non-finishing-task-execution, only-async-services-scenario, passing-secrets-during-deployment, selective-deployment-scenario, service-tags, shared-private-domain-scenario, test-shutdown-client, update-service-scenario, whitelisting-visibility-failure-scenario, whitelisting-visibility-in-current-org-space-scenario.
The absence of WARN/ERROR logs from 22 failing scenarios suggests those scenarios failed at the test script level (e.g., assertion mismatch, missing artifact, network timeout from the test runner side) rather than producing server-side errors. This is consistent with a shared infrastructure disruption — for example, a CF API rate-limiting episode (8 × 429 entries visible in the window around 13:39–13:41Z) or an OQ space misconfiguration — affecting scenario execution without generating server-side WARN/ERROR entries. needs_investigation=false for all entries because no suspect intersects a failed scenario with a regression marker.
OQ catalog regeneration note
The OQ reference catalog was stale (older than XSOQTests/test_resources/health-check-interval/http-health-check-interval/mtad.yaml, which is new in this PR's XSOQTests branch). The catalog was regenerated before the fetch using build_catalog.py. The regenerated catalog has 29 scenarios / 73 steps (source SHA: 43a97c0c).
Posted manually by orchestrator (pr-result-publisher subagent lacked GitHub MCP tools). Mode: oq. Generated 2026-05-29T17:10:00Z.
Summary
Introduces the
health-check-intervalMTA module parameter for liveness health checks on CF apps deployed via MTA, achieving feature parity with the underlying Cloud Foundry capability now exposed by the upgraded CF Java client.Example usage in
mta.yaml:Changes
End-to-end wiring of the new parameter:
SupportedParameters— addsHEALTH_CHECK_INTERVALconstant toMODULE_PARAMETERS.Messages— addsINVALID_HEALTH_CHECK_INTERVALvalidation message.StagingParametersParser— parses, validates (must be > 0, throwsContentExceptionotherwise), and forwards the value toImmutableStaging.Staginginterface andCloudProcess— addsgetHealthCheckInterval()accessor (Immutables regenerates the implementations).RawCloudProcess— mapsData.getInterval()from the CF API response intoCloudProcess.HealthCheckInfo— adds theintervalfield, getter, and includes it inequalsso change detection notices interval drifts.CloudControllerRestClientImpl— forwards the interval to theDatabuilder; widens theupdateApplicationProcessguard to also fire when only the interval changed; guardsbuildHealthCheckagainstHealthCheckType.from(null)NPE.Tests
StagingParametersParserTest— three new tests (correct parse with interval=15, validation rejection with interval=0, null when absent), plus a parameterized test covering positive interval values (1, 15, 60,Integer.MAX_VALUE).HealthCheckInfoTest— four tests covering equal instances, different intervals, null-vs-non-null, andfromProcess/fromStagingcross-equality.RawCloudProcessTest— new test class coveringRawCloudProcess.derive()mapping including health-check interval propagation from the CF API response.Jira
JIRA: LMCROSSITXSADEPLOY-3316 — Introduce Health check interval
Test plan
mvn clean test -pl multiapps-controller-client,multiapps-controller-corepasses locally.health-check-interval: 15and confirm the CF app's process reports the interval viacf curlagainst the v3 process endpoint.health-check-intervaland confirm the controller detects the drift and issues an update (rather than a no-op).health-check-interval: 0and confirmContentExceptionwithINVALID_HEALTH_CHECK_INTERVAL.