Parallelize team-provisioner Google Admin SDK calls #130

Merged
Alexanderamiri merged 1 commit into main from fix/parallelize-team-provisioner on May 27, 2026

Conversation

@Alexanderamiri
Member

Summary

The registry's Provision Groups workflow has been failing with a
Read timeout on endpoint URL ... /javabin-team-provisioner/invocations.
The Lambda itself isn't timing out: it is configured with Timeout=300s and
real executions complete in ~90s. The failure comes from the AWS CLI's
default 60s read timeout. When it fires, the CLI retries client-side, so
each CI run ends up with three concurrent Lambda invocations before
exiting with a false-positive failure.

CloudWatch confirms: invocation 96299b07-… on 2026-05-27 ran cleanly
end-to-end with Duration: 90,978 ms — the data was synced but CI was
already red.

Root cause: per-hero and per-group Google Admin Directory API calls
were sequential. With 55 heroes and 35 groups, the ~230 round-trips at
~400 ms each sum to ~90 s.

Changes

  • Extract _sync_hero_account, _sync_hero_alias, _sync_one_group,
    and _sync_one_access_group from handle_sync_groups_and_heros.
  • Dispatch each step's per-item work through ThreadPoolExecutor with
    GROUP_SYNC_WORKERS workers (default 5, env-tunable on the Lambda
    without redeploying code).
  • Pre-fetch the Google OAuth token once before parallel work starts so
    workers don't race to refresh it.
  • Add --cli-read-timeout 0 to scripts/provision-groups.py and
    scripts/provision-teams.sh as a safety net against future regressions.
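
A minimal sketch of the dispatch pattern described in the list above, not the code from this PR. `fetch_token` and `sync_one_group` are hypothetical stand-ins for the real token fetch and for `_sync_one_group`; only the `GROUP_SYNC_WORKERS` name and its default of 5 come from the PR.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Worker count is read from the environment, defaulting to 5 as in the PR.
GROUP_SYNC_WORKERS = int(os.environ.get("GROUP_SYNC_WORKERS", "5"))


def fetch_token():
    """Hypothetical stand-in for the real Google OAuth token fetch."""
    return "token-placeholder"


def sync_one_group(group, token):
    """Hypothetical stand-in for _sync_one_group; the real one calls Google."""
    return f"synced {group}"


def sync_groups(groups, workers=GROUP_SYNC_WORKERS):
    # Fetch the token once, before any worker starts, so the threads share
    # it instead of racing to refresh it.
    token = fetch_token()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order; a worker's exception is
        # re-raised here when its result is consumed.
        return list(pool.map(lambda g: sync_one_group(g, token), groups))


results = sync_groups(["board", "heroes", "infra"])
print(results)
```

Threads suit this workload because each call spends most of its time waiting on the network, so the work is I/O-bound rather than CPU-bound.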

Sizing rationale

Google Directory API per-user quota: 1,500 queries / 100 s (≈15 QPS).
We impersonate a single admin via domain-wide delegation, so per-user
applies. 5 workers × ~3 QPS each ≈ 15 QPS — at the budget ceiling, well
below the 40 QPS project quota. If we hit 429s in practice, drop
GROUP_SYNC_WORKERS to 3 via env-var (no redeploy needed).
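
The sizing arithmetic can be checked with a few lines. This is a rough sketch that assumes each worker issues one blocking call at a time at the ~400 ms latency quoted in the summary; under that assumption a worker tops out near 2.5 QPS, slightly below the rounded ~3 QPS figure used above.

```python
PER_USER_QUERIES = 1500   # per-user quota from the PR
WINDOW_SECONDS = 100      # quota window from the PR
PROJECT_QPS = 40          # project quota from the PR
CALL_LATENCY_MS = 400     # ~400 ms per round-trip, from the PR


def per_user_qps():
    """Sustained per-user budget in queries per second."""
    return PER_USER_QUERIES / WINDOW_SECONDS


def aggregate_qps(workers, latency_ms=CALL_LATENCY_MS):
    """Upper bound on request rate when each worker blocks on one call."""
    return workers * 1000 / latency_ms


print(per_user_qps())     # 15.0
print(aggregate_qps(5))   # 12.5
print(aggregate_qps(3))   # 7.5
```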

Expected effect

| Phase | Sequential | Parallel (5 workers) |
|---|---|---|
| Step 1: hero account checks (55) | 26 s | ~6 s |
| Step 3: groups + IC sync (35) | 60 s | ~12-15 s |
| Step 6: access groups (5) | 4 s | ~1 s |
| Total | ~91 s | ~20-25 s |
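
As a sanity check on the table above, dividing each sequential time by the worker count gives an idealized lower bound. This sketch assumes perfect speedup with no scheduling overhead or stragglers, which is why the estimates in the PR sit somewhat higher.

```python
WORKERS = 5

# Sequential durations in seconds, taken from the table in the PR.
sequential_s = {"step 1": 26, "step 3": 60, "step 6": 4}


def ideal_parallel_seconds(seq_s, workers=WORKERS):
    """Best-case duration if the work splits evenly across workers."""
    return seq_s / workers


estimates = {name: ideal_parallel_seconds(s) for name, s in sequential_s.items()}
total = round(sum(estimates.values()), 1)

for name, seconds in estimates.items():
    print(f"{name}: {seconds:.1f} s")
print(f"total: {total} s")
```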

Test plan

  • Platform CI plan + apply (deploys new Lambda zip).
  • Next merge to javaBin/registry's groups/** triggers the
    Provision Groups workflow; verify it completes green and the
    Lambda Duration metric is well under 60 s.
  • Spot-check CloudWatch logs for any 429 from Google Directory
    API — if observed, drop GROUP_SYNC_WORKERS env var to 3.

Related: false-positive timeout in Provision Groups #26418541309.

The Provision Groups CI in registry has been failing with a
read-timeout on `aws lambda invoke`. The Lambda itself was running
~90s end-to-end (well under its 300s timeout) but the AWS CLI's
default 60s read-timeout was tripping and triggering retries.

Root cause: per-hero and per-group Google Admin Directory API calls
were issued sequentially. With 55 heroes and 35 groups, the
~230 round-trips at ~400ms each sum to ~90s.

Fix in this PR:
- Extract per-hero (`_sync_hero_account`, `_sync_hero_alias`),
  per-group (`_sync_one_group`) and per-access-group helpers from
  `handle_sync_groups_and_heros`.
- Dispatch each step's per-item work through a ThreadPoolExecutor
  with `GROUP_SYNC_WORKERS` workers (default 5, env-tunable).
- Pre-fetch the OAuth token once before parallel work so workers
  don't race to refresh it.
- Worker count sized to stay under Google's per-user quota
  (1,500 queries / 100s ≈ 15 QPS). 5 workers × ~3 QPS each sits
  at the budget ceiling, well below the 40 QPS project quota.

Belt-and-suspenders in the same PR:
- Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and
  `scripts/provision-teams.sh` so a future regression past 60s
  won't surface as a false CI failure.
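
For illustration, a Python wrapper might assemble the invoke command with the read timeout disabled roughly as follows. This is a sketch, not the contents of `scripts/provision-groups.py`: the function names, payload path, and output path are hypothetical. Only the function name `javabin-team-provisioner` and the `--cli-read-timeout 0` flag come from the PR.

```python
import subprocess


def build_invoke_command(function_name, payload_path, out_path):
    """Build an `aws lambda invoke` argv with the CLI read timeout disabled."""
    return [
        "aws", "lambda", "invoke",
        "--function-name", function_name,
        # 0 makes the socket read block with no timeout instead of
        # giving up after the 60 s default.
        "--cli-read-timeout", "0",
        "--payload", f"fileb://{payload_path}",
        out_path,
    ]


def invoke(function_name, payload_path, out_path):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    cmd = build_invoke_command(function_name, payload_path, out_path)
    subprocess.run(cmd, check=True)


cmd = build_invoke_command("javabin-team-provisioner", "payload.json", "out.json")
print(" ".join(cmd))
```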

Expected effect: total Lambda duration drops from ~90s to ~20-25s,
comfortably under the 60s CLI default (which we've also removed)
and the 300s Lambda timeout.

@Alexanderamiri Alexanderamiri requested a review from a team as a code owner May 27, 2026 21:23
@github-actions

Terraform Plan

🚧 Changes detected — Plan: 0 to add, 2 to change, 0 to destroy.

Plan output
Acquiring state lock. This may take a few moments...

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.lambdas.aws_lambda_function.compliance_reporter will be updated in-place
  ~ resource "aws_lambda_function" "compliance_reporter" {
        id                             = "javabin-compliance-reporter"
      ~ last_modified                  = "2026-03-26T19:54:44.000+0000" -> (known after apply)
      ~ source_code_hash               = "NkzoGnYQCnG8BbKoHzVSvDqxAipFexXz/n+v2/6ZgrU=" -> "o1N0a7gdUF6vvPsR/ehFQoGWWpQCNjUUhbm76Ba2NJc="
        tags                           = {}
        # (21 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # module.lambdas.aws_lambda_function.team_provisioner will be updated in-place
  ~ resource "aws_lambda_function" "team_provisioner" {
        id                             = "javabin-team-provisioner"
      ~ last_modified                  = "2026-04-14T20:38:43.000+0000" -> (known after apply)
      ~ source_code_hash               = "/tK4IAED6H3qJU5jH2WkSd1SstJ2tcnzCMThkQEBq/0=" -> "i1n+uCTcgMExISw4NMQHPZPfcNfF4gcnRruD2Fpod1k="
        tags                           = {}
        # (21 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "tfplan"

LLM Review

Risk: 🟢 LOW

Routine Lambda function updates with source code hash changes for compliance_reporter and team_provisioner functions.

  • [routine] Lambda function compliance_reporter being updated in-place with new source code hash (NkzoGnYQCnG8BbKoHzVSvDqxAipFexXz/n+v2/6ZgrU= → o1N0a7gdUF6vvPsR/ehFQoGWWpQCNjUUhbm76Ba2NJc=). No configuration changes, only code deployment.
  • [routine] Lambda function team_provisioner being updated in-place with new source code hash (/tK4IAED6H3qJU5jH2WkSd1SstJ2tcnzCMThkQEBq/0= → i1n+uCTcgMExISw4NMQHPZPfcNfF4gcnRruD2Fpod1k=). No configuration changes, only code deployment.
  • [routine] No resources being created or destroyed. Plan shows 0 additions and 0 deletions, only 2 in-place updates.
  • [routine] No security group, IAM policy, or permission boundary changes detected. All security configurations remain unchanged.
  • [routine] No cost implications. Updates are to existing Lambda functions with no changes to compute resources, storage, or networking infrastructure.

@Alexanderamiri Alexanderamiri merged commit b9b800d into main May 27, 2026
3 checks passed
@Alexanderamiri Alexanderamiri deleted the fix/parallelize-team-provisioner branch May 27, 2026 21:24