Parallelize team-provisioner Google Admin SDK calls by Alexanderamiri · Pull Request #130 · javaBin/platform

Alexanderamiri · 2026-05-27T21:23:34Z

Summary

The registry's Provision Groups workflow has been failing with a
Read timeout on endpoint URL ... /javabin-team-provisioner/invocations.
The Lambda itself isn't timing out — its config has Timeout=300s and
real executions complete in ~90s. What's tripping is the AWS CLI's
default 60s read-timeout, which then triggers client-side retries
and three concurrent Lambda invocations per CI run before exiting with
a false-positive failure.

CloudWatch confirms: invocation 96299b07-… on 2026-05-27 ran cleanly
end-to-end with Duration: 90,978 ms — the data was synced but CI was
already red.

Root cause: per-hero and per-group Google Admin Directory API calls
were sequential. With 55 heroes and 35 groups, the ~230 round-trips at
~400 ms each sum to ~90 s.
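
As a back-of-the-envelope check of those figures, the sketch below multiplies the round-trip count by the per-call latency and estimates the ideal wall time if the same calls were spread evenly across workers. The function name is hypothetical, and the model ignores scheduling overhead and uneven per-item work, so treat the parallel number as a lower bound rather than a prediction.

```python
import math

def estimated_wall_time_s(round_trips: int, latency_s: float, workers: int = 1) -> float:
    """Idealized wall time: calls split evenly over workers, each call taking latency_s."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # The busiest worker determines the wall time.
    calls_per_worker = math.ceil(round_trips / workers)
    return calls_per_worker * latency_s

# Figures from the summary: ~230 round-trips at ~400 ms each.
sequential = estimated_wall_time_s(230, 0.4)
parallel = estimated_wall_time_s(230, 0.4, workers=5)
print(f"sequential ~{sequential:.0f} s, 5 workers ~{parallel:.1f} s")
```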

Changes

- Extract `_sync_hero_account`, `_sync_hero_alias`, `_sync_one_group`,
  and `_sync_one_access_group` from `handle_sync_groups_and_heros`.
- Dispatch each step's per-item work through ThreadPoolExecutor with
  `GROUP_SYNC_WORKERS` workers (default 5, env-tunable on the Lambda
  without redeploying code).
- Pre-fetch the Google OAuth token once before parallel work starts so
  workers don't race to refresh it.
- Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and
  `scripts/provision-teams.sh` as a safety net against future regressions.
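
The dispatch pattern described above could look roughly like the following minimal sketch. It is not the PR's actual code: `fetch_token`, `sync_one_group`, and `sync_all` are hypothetical stand-ins, and only the `GROUP_SYNC_WORKERS` variable name and its default of 5 come from the description.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Default of 5 workers, overridable via the environment, as the PR describes.
GROUP_SYNC_WORKERS = int(os.environ.get("GROUP_SYNC_WORKERS", "5"))

def fetch_token() -> str:
    """Hypothetical stand-in for the real OAuth token fetch."""
    return "token-placeholder"

def sync_one_group(group: str, token: str) -> str:
    """Hypothetical stand-in for a per-item helper such as _sync_one_group."""
    return f"synced {group}"

def sync_all(groups: list[str]) -> list[str]:
    # Fetch the token once up front so workers share it instead of each refreshing it.
    token = fetch_token()
    with ThreadPoolExecutor(max_workers=GROUP_SYNC_WORKERS) as pool:
        # map() yields results in input order; a worker's exception is
        # re-raised here when its result is consumed.
        return list(pool.map(lambda g: sync_one_group(g, token), groups))

print(sync_all(["group-a", "group-b", "group-c"]))
```

Passing the token in as an argument keeps the workers free of shared mutable state, which is one simple way to avoid the refresh race the PR mentions.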

Sizing rationale

Google Directory API per-user quota: 1,500 queries / 100 s (≈15 QPS).
We impersonate a single admin via domain-wide delegation, so per-user
applies. 5 workers × ~3 QPS each ≈ 15 QPS — at the budget ceiling, well
below the 40 QPS project quota. If we hit 429s in practice, drop
GROUP_SYNC_WORKERS to 3 via env-var (no redeploy needed).
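
The sizing argument can be checked with a few lines of arithmetic. This sketch assumes each worker issues one blocking call at a time at the ~400 ms latency quoted in the summary, which gives about 2.5 QPS per worker; the ~3 QPS used above is a rounded-up, more conservative figure. The function names are hypothetical.

```python
QUOTA_QUERIES = 1500   # per-user quota quoted in the PR
QUOTA_WINDOW_S = 100   # quota window in seconds
LATENCY_S = 0.4        # ~400 ms per round-trip, from the PR summary

def quota_qps(queries: int, window_s: int) -> float:
    """Average queries per second allowed by the quota."""
    return queries / window_s

def aggregate_qps(workers: int, latency_s: float) -> float:
    """Upper bound on request rate when each worker makes one blocking call at a time."""
    return workers / latency_s

budget = quota_qps(QUOTA_QUERIES, QUOTA_WINDOW_S)
for workers in (3, 5):
    rate = aggregate_qps(workers, LATENCY_S)
    print(f"{workers} workers: ~{rate:.1f} QPS of a {budget:.0f} QPS budget")
```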

Expected effect

| Phase | Sequential | Parallel (5w) |
| --- | --- | --- |
| Step 1: hero account checks (55) | 26 s | ~6 s |
| Step 3: groups + IC sync (35) | 60 s | ~12-15 s |
| Step 6: access groups (5) | 4 s | ~1 s |
| Total | ~91 s | ~20-25 s |

Test plan

- Platform CI plan + apply (deploys new Lambda zip).
- Next merge to javaBin/registry's groups/** triggers the
  Provision Groups workflow; verify it completes green and the
  Lambda Duration metric is well under 60 s.
- Spot-check CloudWatch logs for any 429 from Google Directory
  API; if observed, drop GROUP_SYNC_WORKERS env var to 3.

Related: false-positive timeout in Provision Groups #26418541309.

The Provision Groups CI in registry has been failing with a read-timeout on `aws lambda invoke`. The Lambda itself was running ~90s end-to-end (well under its 300s timeout) but the AWS CLI's default 60s read-timeout was tripping and triggering retries.

Root cause: per-hero and per-group Google Admin Directory API calls were issued sequentially. With 55 heroes and 35 groups, the ~230 round-trips at ~400ms each sum to ~90s.

Fix in this PR:

- Extract per-hero (`_sync_hero_account`, `_sync_hero_alias`), per-group (`_sync_one_group`) and per-access-group helpers from `handle_sync_groups_and_heros`.
- Dispatch each step's per-item work through a ThreadPoolExecutor with `GROUP_SYNC_WORKERS` workers (default 5, env-tunable).
- Pre-fetch the OAuth token once before parallel work so workers don't race to refresh it.
- Worker count sized to stay under Google's per-user quota (1,500 queries / 100s ≈ 15 QPS). 5 workers × ~3 QPS each sits at the budget ceiling, well below the 40 QPS project quota.

Belt-and-suspenders in the same PR:

- Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and `scripts/provision-teams.sh` so a future regression past 60s won't surface as a false CI failure.

Expected effect: total Lambda duration drops from ~90s to ~20-25s, comfortably under the 60s CLI default (which we've also removed) and the 300s Lambda timeout.
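
For illustration, here is a minimal sketch of how a Python script might assemble an `aws lambda invoke` command with the read timeout disabled. It is not the contents of `scripts/provision-groups.py`: the helper name and the payload and output file names are assumptions. `--cli-read-timeout` is a global AWS CLI option whose value `0` makes the socket read block without timing out.

```python
import shlex

def build_invoke_command(function_name: str, payload_file: str, out_file: str) -> list[str]:
    """Build the argv for a synchronous Lambda invoke with no client read timeout."""
    return [
        "aws", "lambda", "invoke",
        "--function-name", function_name,
        "--payload", f"fileb://{payload_file}",
        # 0 disables the CLI's default 60 s socket read timeout.
        "--cli-read-timeout", "0",
        out_file,
    ]

# File names below are hypothetical placeholders.
cmd = build_invoke_command("javabin-team-provisioner", "payload.json", "response.json")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting problems.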

github-actions · 2026-05-27T21:24:39Z

Terraform Plan

🚧 Changes detected — Plan: 0 to add, 2 to change, 0 to destroy.

Plan output

Acquiring state lock. This may take a few moments...

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.lambdas.aws_lambda_function.compliance_reporter will be updated in-place
  ~ resource "aws_lambda_function" "compliance_reporter" {
        id                             = "javabin-compliance-reporter"
      ~ last_modified                  = "2026-03-26T19:54:44.000+0000" -> (known after apply)
      ~ source_code_hash               = "NkzoGnYQCnG8BbKoHzVSvDqxAipFexXz/n+v2/6ZgrU=" -> "o1N0a7gdUF6vvPsR/ehFQoGWWpQCNjUUhbm76Ba2NJc="
        tags                           = {}
        # (21 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # module.lambdas.aws_lambda_function.team_provisioner will be updated in-place
  ~ resource "aws_lambda_function" "team_provisioner" {
        id                             = "javabin-team-provisioner"
      ~ last_modified                  = "2026-04-14T20:38:43.000+0000" -> (known after apply)
      ~ source_code_hash               = "/tK4IAED6H3qJU5jH2WkSd1SstJ2tcnzCMThkQEBq/0=" -> "i1n+uCTcgMExISw4NMQHPZPfcNfF4gcnRruD2Fpod1k="
        tags                           = {}
        # (21 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "tfplan"

LLM Review

Risk: 🟢 LOW

Routine Lambda function updates with source code hash changes for compliance_reporter and team_provisioner functions.

✅ [routine] Lambda function compliance_reporter being updated in-place with new source code hash (NkzoGnYQCnG8BbKoHzVSvDqxAipFexXz/n+v2/6ZgrU= → o1N0a7gdUF6vvPsR/ehFQoGWWpQCNjUUhbm76Ba2NJc=). No configuration changes, only code deployment.
✅ [routine] Lambda function team_provisioner being updated in-place with new source code hash (/tK4IAED6H3qJU5jH2WkSd1SstJ2tcnzCMThkQEBq/0= → i1n+uCTcgMExISw4NMQHPZPfcNfF4gcnRruD2Fpod1k=). No configuration changes, only code deployment.
✅ [routine] No resources being created or destroyed. Plan shows 0 additions and 0 deletions, only 2 in-place updates.
✅ [routine] No security group, IAM policy, or permission boundary changes detected. All security configurations remain unchanged.
✅ [routine] No cost implications. Updates are to existing Lambda functions with no changes to compute resources, storage, or networking infrastructure.

Alexanderamiri requested a review from a team as a code owner May 27, 2026 21:23

Alexanderamiri merged commit b9b800d into main May 27, 2026
3 checks passed

Alexanderamiri deleted the fix/parallelize-team-provisioner branch May 27, 2026 21:24
