Show DNS Configuration step: az containerapp extension breaks DNS output #905

@raix

Summary

The Show DNS Configuration step in .github/workflows/_deploy-infrastructure.yml (and the matching block in cloud-infrastructure/cluster/deploy-cluster.sh) uses the az containerapp extension to read the Container Apps Environment, then parses the result with jq. This produces two failure modes that both prevent the operator from seeing the DNS records they need:

  1. Crash (exit 5). The step runs under bash -e. az containerapp is an extension that dynamic-installs on first use, printing a
    version-prefixed progress banner to stderr. The code captured output with 2>&1, merging that banner into the JSON variable, so jq failed:
    jq: parse error: Invalid numeric literal at line 1, column 6
    Error: Process completed with exit code 5.
    
  2. Silent false-negative. Guarding the jq call (validating the JSON first) stops the crash but exposes a second bug: when the extension command returns
    empty output or fails on the runner, the step falls through to "DNS configuration instructions will be shown after the Container Apps Environment is created"
    even when the environment already exists. The records are silently hidden.

Root cause

The az containerapp extension is the wrong tool here:

  • It dynamic-installs (banner → crash vector under 2>&1).
  • Its availability on the runner is non-deterministic (install can fail/be skipped → false-negative).

The data we need lives on the plain ARM resource and is readable with the core CLI — no extension, no dynamic-install, no banner:

  • Microsoft.App/managedEnvironments.properties.customDomainConfiguration.customDomainVerificationId, .properties.defaultDomain
  • Microsoft.App/containerApps.properties.configuration.ingress.customDomains
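
As a rough illustration of reading those paths with jq, the sketch below runs against hand-written sample documents. The sample values and the name field inside customDomains are assumptions made for the example, not output captured from a real environment.

```shell
# Hypothetical sample payloads; real `az resource show` output has many more fields.
sample_env='{"properties":{"defaultDomain":"example.azurecontainerapps.io","customDomainConfiguration":{"customDomainVerificationId":"ABC123"}}}'
sample_app='{"properties":{"configuration":{"ingress":{"customDomains":[{"name":"app.example.com"}]}}}}'

# `// empty` prints nothing instead of the literal string "null" when a field is missing.
verification_id=$(printf '%s' "$sample_env" | jq -r '.properties.customDomainConfiguration.customDomainVerificationId // empty')
default_domain=$(printf '%s' "$sample_env" | jq -r '.properties.defaultDomain // empty')

# Treat a missing customDomains array as empty so the iteration cannot fail.
custom_domains=$(printf '%s' "$sample_app" | jq -r '(.properties.configuration.ingress.customDomains // [])[] | .name')

echo "verification id: $verification_id"
echo "default domain:  $default_domain"
echo "custom domains:  $custom_domains"
```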

Suggested fix

Replace az containerapp env show / az containerapp show with core az resource show --resource-type …, keeping stdout-only capture (-o json 2>/dev/null) and jq empty validation as defense-in-depth. Affected calls:

  • _deploy-infrastructure.yml: env_details, app_gateway_details, back_office_details
  • deploy-cluster.sh: the env_details read in the InvalidCustomHostNameValidation / FailedCnameValidation branch
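
The capture-and-validate pattern described above could look roughly like the following. capture_json and the two fake_az_* functions are hypothetical names introduced for this sketch; the stubs stand in for az so the behaviour can be checked without Azure.

```shell
# capture_json CMD [ARGS...]
# Runs CMD, keeps stdout only, and prints it only when it is non-empty, valid JSON.
# Always returns 0 so a caller running under bash -e is not aborted.
capture_json() {
  local out
  out=$("$@" 2>/dev/null) || out=""
  if [ -n "$out" ] && printf '%s' "$out" | jq empty 2>/dev/null; then
    printf '%s\n' "$out"
  fi
  return 0
}

# Hypothetical stand-ins for az: one writes a banner to stderr and JSON to stdout,
# the other writes only a banner to stdout.
fake_az_ok()     { echo "banner on stderr" >&2; echo '{"properties":{"defaultDomain":"example.test"}}'; }
fake_az_banner() { echo "The command requires the extension containerapp"; }

good=$(capture_json fake_az_ok)
bad=$(capture_json fake_az_banner)

echo "good: $good"
echo "bad is empty: $([ -z "$bad" ] && echo yes || echo no)"
```

In the workflow the call would then be along the lines of env_details=$(capture_json az resource show --name "$RG" --resource-group "$RG" --resource-type Microsoft.App/managedEnvironments -o json), with an empty env_details taking the existing "not created yet" branch.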

Verified locally with no extension installed: core az resource show returns valid JSON and the step prints the correct TXT/CNAME records and exits 0.

Reproduction (no Azure required, for the crash)

Run as a real bash -e script (don't wrap it in &&/||, which disables set -e and hides the crash):

cat > /tmp/repro.sh <<'EOF'
set -e
env_details="0.2.3 The command requires the extension containerapp"  # az dynamic-install banner via 2>&1
[[ "$env_details" != "" ]] && echo "$env_details" | jq -r '.properties.defaultDomain'
EOF
bash /tmp/repro.sh; echo "exit: $?"
# -> jq: parse error: Invalid numeric literal at line 1, column 6 ; exit: 5
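
The note about &&/|| can be checked without jq. In this sketch, false stands in for the failing jq pipeline, and the script strings and variable names are made up for the demonstration.

```shell
# Standalone body: set -e aborts at the first failing command.
standalone='set -e
false
echo "reached end"'

# Same body as the left side of ||: bash ignores set -e inside it,
# so the failure is skipped and the script carries on.
wrapped='( set -e
false
echo "reached end" ) || echo "handler ran"'

standalone_status=0
standalone_out=$(bash -c "$standalone") || standalone_status=$?
wrapped_status=0
wrapped_out=$(bash -c "$wrapped") || wrapped_status=$?

echo "standalone: status=$standalone_status output='$standalone_out'"
echo "wrapped:    status=$wrapped_status output='$wrapped_out'"
# -> standalone: status=1 output=''
# -> wrapped:    status=0 output='reached end'
```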

Environment

  • GitHub-hosted runner, bash -e
  • Azure CLI with az containerapp extension (dynamic-install)

AI prompt

In .github/workflows/_deploy-infrastructure.yml ("Show DNS Configuration" step,
runs under bash -e) and cloud-infrastructure/cluster/deploy-cluster.sh, the code
reads the Container Apps Environment with the az containerapp EXTENSION and parses
it with jq. This has two bugs:

  1. Crash: az containerapp dynamic-installs on first use and prints a
    version-prefixed banner to stderr; the old 2>&1 merged it into the JSON, so jq
    died ("Invalid numeric literal ... column 6") and bash -e aborted with exit 5.
  2. False-negative: when the extension command returns empty on the runner, the step
    falls through to "instructions will be shown after the environment is created"
    even though the environment exists, hiding the DNS records.

Fix: replace the az containerapp extension calls with the CORE az resource show,
which needs no extension and no dynamic install. Same ARM property paths:

- managedEnvironments:
    az resource show --name "$RG" --resource-group "$RG" \
      --resource-type Microsoft.App/managedEnvironments -o json
    -> .properties.customDomainConfiguration.customDomainVerificationId
    -> .properties.defaultDomain
- containerApps (app-gateway / back-office):
    az resource show --name app-gateway --resource-group "$RG" \
      --resource-type Microsoft.App/containerApps -o json
    -> .properties.configuration.ingress.customDomains

Apply to these reads:
- _deploy-infrastructure.yml: env_details, app_gateway_details, back_office_details
- deploy-cluster.sh: env_details in the InvalidCustomHostNameValidation /
FailedCnameValidation branch

Keep stdout-only capture (-o json 2>/dev/null || echo "") and validate with
[[ -n "$x" ]] && echo "$x" | jq empty 2>/dev/null before parsing, falling through
to the existing "not created yet" message when missing/invalid. Do not add any
az extension add. Leave az containerapp revision list alone (it uses --query/-o
tsv, not jq). Keep printed record formats identical.

Verify: run the env/app-gateway branch as a standalone bash -e script (NOT wrapped
in &&/||) against a real environment with the containerapp extension NOT installed;
confirm it prints the TXT/CNAME records and exits 0.


Severity

Low

Is this bug security related?

  • This bug is related to security

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels: Bug (Something isn't working)
