Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .github/workflows/review-pr.yml
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,11 @@ on:
required: false
type: string
default: 'moderate'
incremental:
description: "When true (default), re-reviews only the commits pushed since the last completed docker-agent review instead of the full PR diff. Falls back to a full review when no previous review exists, the history was rewritten (force-push/rebase), or the base branch was merged in. Set to false to force full reviews."
required: false
type: boolean
default: true
trigger-run-id:
description: "Workflow run ID from pr-review-trigger.yml — used to download the event context artifact. When set, resolve-context job downloads the artifact and routes to the appropriate job."
required: false
Expand Down Expand Up @@ -418,6 +423,7 @@ jobs:
exclude-paths: ${{ inputs.exclude-paths }}
max-diff-lines: ${{ inputs.max-diff-lines }}
confidence-threshold: ${{ inputs.confidence-threshold }}
incremental: ${{ inputs.incremental }}
model: ${{ inputs.model }}
github-token: ${{ env.GITHUB_APP_TOKEN || github.token }}
anthropic-api-key: ${{ env.ANTHROPIC_API_KEY_FROM_SSM || secrets.ANTHROPIC_API_KEY }}
Expand Down
9 changes: 9 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,10 +39,18 @@ Anything else here (workflows under `.github/workflows/`, scripts, tests) exists
│ │ ├── aws-credentials.ts
│ │ ├── github-app.ts # Reads docker-agent-action/github-app from Secrets Manager; exports GITHUB_APP_TOKEN (a PAT) + ORG_MEMBERSHIP_TOKEN.
│ │ └── __tests__/
│ ├── dedupe-findings/ # Drops review comments duplicating findings from previous review cycles.
│ │ ├── index.ts # CLI entry → bundled to dist/dedupe-findings.js (staged at /tmp/dedupe-findings.js for the agent).
│ │ ├── dedupe-findings.ts # Core dedupeComments() pure function (path + line proximity + heading similarity).
│ │ └── __tests__/
│ ├── filter-diff/ # Strips excluded-path sections from a unified diff.
│ │ ├── index.ts # CLI entry → bundled to dist/filter-diff.js
│ │ ├── filter-diff.ts # Core filterDiff() pure function + applyFilter() I/O wrapper.
│ │ └── __tests__/
│ ├── incremental-review/ # Narrows pr.diff to commits since the last completed review.
│ │ ├── index.ts # CLI entry → bundled to dist/incremental-review.js
│ │ ├── incremental-review.ts # Core planIncrementalReview()/findLastReviewedSha() pure functions.
│ │ └── __tests__/
│ ├── score-confidence/ # Per-finding confidence scoring for the PR review pipeline.
│ │ ├── index.ts # CLI entry → bundled to dist/score-confidence.js
│ │ ├── score-confidence.ts # Core scoreFinding()/scoreFindings() pure functions + posting policy.
Expand Down Expand Up @@ -172,6 +180,7 @@ The action runs untrusted input (PR titles, bodies, comments, diffs) through an
- `pull_request` action `review_requested` when `github.event.requested_reviewer.login == 'docker-agent'`
- `@docker-agent` mentions on PR/issue comments — these run the `.github/actions/mention-reply` handler (sets `should-reply` and builds the context prompt) and then the `review-pr/mention-reply` sub-action (referenced from a pinned SHA, not present as a local path on every commit). The `pr-review-mention-reply.yaml` agent handles the actual reply.
- Diffs over 1500 lines are **chunked at file boundaries** in `review-pr/action.yml` (see "Split diff into chunks"). Per-file **risk scoring** (security paths, line counts, error-handling patterns) prioritizes verifier attention.
- **Incremental reviews** (default on, `incremental` input): a re-review only diffs `last-reviewed-SHA..HEAD` instead of the full PR diff. The last reviewed SHA is read from the `commit_id` GitHub records on the bot's completed reviews (assessment or LGTM bodies only — timeout/failure fallbacks don't count), so state survives across runs with no extra writes. `src/incremental-review/` plans the mode and falls back to a full review on force-push/rebase (SHA missing or not an ancestor of HEAD), base-branch merge-ins, net-zero changes, or any error. In incremental mode the original full diff is preserved at `pr_full.diff` for stale-thread resolution and suggestion-anchor validation (GitHub validates anchors against the full PR diff). On re-reviews, `src/dedupe-findings/` (staged at `/tmp/dedupe-findings.js`, run by the agent per `posting-format.md` against the pre-fetched `/tmp/existing_review_comments.json`) drops findings matching already-posted bot comments by file path + line proximity (±3) + finding-heading similarity, so a full re-review after a rebase doesn't duplicate threads.
- Per-finding **confidence scoring** assigns each verified finding a precise 0–100 score (band: strong/moderate/weak/negligible) from the verifier's `verdict`, `evidence_strength`, and `context_completeness`, plus drafter↔verifier severity concordance and scope. `src/score-confidence/score-confidence.ts` is the **single source of truth** for the model (weights, bands, threshold, posting policy); the "Confidence Scoring" section of `review-pr/agents/pr-review.yaml` mirrors it as a strict lookup table so the orchestrator can apply it inline (the gitignored `dist/` is not available at agent runtime). Change one, change both — the unit tests pin every value. Security and high-severity CONFIRMED/LIKELY findings are always posted regardless of score; below-threshold findings are surfaced in a summary rather than silently dropped. The inline-posting cutoff is configurable via the `confidence-threshold` action input (a band name or a number clamped to 30–100, default `moderate` = 55); the action resolves it by invoking the bundled `dist/score-confidence.js resolve-threshold` CLI (so the resolution logic stays in TypeScript, not bash) and injects it into the agent prompt, and `scoreFinding`/`scoreFindings` accept a matching `postThreshold` option.
- Stale review threads on lines no longer in the diff are auto-resolved via GraphQL `resolveReviewThread`. Threads with no `<!-- docker-agent-review -->` marker are never touched.

Expand Down
40 changes: 40 additions & 0 deletions review-pr/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,7 @@ When using `docker/docker-agent-action/.github/workflows/review-pr.yml`:
| `model` | Model override (e.g., `anthropic/claude-haiku-4-5`) | - |
| `add-prompt-files` | Comma-separated files to append to the prompt | - |
| `confidence-threshold` | Min confidence to post a finding inline: band (`strong`/`moderate`/`medium`/`weak`) or a number (clamped to 30–100) | `moderate` |
| `incremental` | Review only the commits pushed since the last completed review (falls back to a full review when unsafe) | `true` |

### `review-pr` (Composite Action)

Expand All @@ -285,6 +286,7 @@ PR number and comment ID are auto-detected from `github.event` when not provided
| `github-token` | GitHub token | No |
| `add-prompt-files` | Comma-separated files to append to the prompt | No |
| `confidence-threshold` | Min confidence to post a finding inline: band (`strong`/`moderate`/`medium`/`weak`) or a number clamped to 30–100 (default `moderate`) | No |
| `incremental` | Review only commits since the last completed review (default `true`; falls back to full when unsafe) | No |

\*API keys are optional when using the reusable workflow (credentials are fetched via OIDC). Only required when using the composite action directly without OIDC.

Expand Down Expand Up @@ -357,6 +359,44 @@ AGENTS.md + PR Diff → Drafter (hypotheses) → Verifier (confirm + evidence si
→ Confidence score (0–100) → Post Comments
```

### Incremental Reviews

By default, a re-review only covers the **commits pushed since the last completed
review** instead of re-reviewing the full PR diff. This saves tokens, avoids
duplicate comments, and skips code that has not changed since the previous cycle.

How it works:

1. The last reviewed commit is read from the metadata GitHub records on every
posted review (`commit_id` on the Reviews API is the PR head SHA at posting
time), so the state survives across workflow runs with no extra bookkeeping.
Only completed reviews count — a timed-out or failed run does not mark
commits as reviewed.
2. The diff handed to the agent becomes `git diff <last-reviewed-sha>..HEAD`,
restricted to files that are still part of the full PR diff (so inline
comments never anchor outside what GitHub accepts).
3. The action **falls back to a full review** whenever incremental diffing
would be unsafe or meaningless:

| Situation | Behavior |
| --------- | -------- |
| No previous completed review | Full review |
| Force-push or rebase rewrote the history | Full review |
| Base branch merged into the PR since last review | Full review |
| Last reviewed SHA missing from the clone | Full review |
| Changes since last review cancel out | Full review |
| `incremental: false` input | Full review |

4. On any full re-review, findings are **deduplicated** against the comments
already posted on the PR (matched by file path, line proximity, and finding
heading similarity — see [`src/dedupe-findings/`](../src/dedupe-findings/dedupe-findings.ts)),
so a rebase does not produce duplicate threads. The plumbing that decides
between incremental and full mode is implemented and unit-tested in
[`src/incremental-review/`](../src/incremental-review/incremental-review.ts).

Set `incremental: false` (workflow or action input) to force a full review on
every trigger.

Each verified finding gets a precise **confidence score** (0–100) and a band
(strong / moderate / weak / negligible), computed deterministically from the
verifier's verdict, evidence strength, and context completeness, plus the
Expand Down
74 changes: 71 additions & 3 deletions review-pr/action.yml
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,10 @@ inputs:
description: "Minimum confidence for posting a finding as an inline comment. Accepts a band name (`strong`=80, `moderate`/`medium`=55, `weak`=30) or a number (clamped to 30-100; the weak-band floor is the lowest meaningful cutoff, so e.g. `10` is raised to `30`). Findings that score below the threshold (and are not security or high-severity) are collapsed into the lower-confidence summary instead of posted inline; they are never silently dropped. Security findings and high-severity CONFIRMED/LIKELY findings are always posted regardless. Default: moderate (55), which preserves the prior behavior."
required: false
default: "moderate"
incremental:
description: "When 'true' (default), re-reviews only the commits pushed since the last completed docker-agent review instead of the full PR diff. Falls back to a full review when no previous review exists, the history was rewritten (force-push/rebase), or the base branch was merged in. Set to 'false' to force a full review."
required: false
default: "true"

outputs:
exit-code:
Expand Down Expand Up @@ -266,6 +270,25 @@ runs:
fi
fi

# Incremental review: when a previous completed review exists, narrow pr.diff
# to the commits pushed since it. The original full diff is preserved at
# pr_full.diff for stale-thread resolution and suggestion-anchor validation.
# All decision logic (state tracking via review commit_id metadata, rebase and
# force-push detection, base-merge detection) lives in src/incremental-review.
- name: Compute incremental review range
if: hashFiles('pr.diff') != '' && steps.lock-check.outputs.skip != 'true'
id: incremental
shell: bash
env:
ACTION_PATH: ${{ github.action_path }}
GITHUB_TOKEN: ${{ steps.resolve-token.outputs.token }}
PR_NUMBER: ${{ steps.resolve-context.outputs.pr-number }}
INCREMENTAL: ${{ inputs.incremental }}
run: |
BASE_REF=$(jq -r '.baseRefName // ""' pr_metadata.json 2>/dev/null || echo "")
export BASE_REF
node "$ACTION_PATH/../dist/incremental-review.js" pr.diff

- name: Filter excluded paths from diff
if: hashFiles('pr.diff') != '' && inputs.exclude-paths != ''
shell: bash
Expand Down Expand Up @@ -398,8 +421,14 @@ runs:
# A. Parse diff → file:line set
# Tracks current file from "diff --git" headers, line numbers from @@ hunks,
# and emits "file:line" for each added (+) line.
if [ ! -f pr.diff ]; then
echo "ℹ️ No pr.diff found — skipping stale thread resolution"
# In incremental mode pr.diff only covers commits since the last review, so
# staleness is judged against the preserved full PR diff (pr_full.diff).
DIFF_FILE=pr.diff
if [ -f pr_full.diff ]; then
DIFF_FILE=pr_full.diff
fi
if [ ! -f "$DIFF_FILE" ]; then
echo "ℹ️ No $DIFF_FILE found — skipping stale thread resolution"
exit 0
fi

Expand All @@ -426,7 +455,7 @@ runs:
}
/^ / { line++ }
/^-/ { next }
' pr.diff | sort -u > $DIFF_LINES_FILE
' "$DIFF_FILE" | sort -u > $DIFF_LINES_FILE

DIFF_LINE_COUNT=$(wc -l < $DIFF_LINES_FILE | tr -d ' ')
echo "📄 Found $DIFF_LINE_COUNT unique file:line pairs in diff"
Expand Down Expand Up @@ -709,6 +738,8 @@ runs:
REPO: ${{ github.repository }}
CONFIDENCE_SCORE: ${{ steps.resolve-confidence.outputs.score }}
CONFIDENCE_LABEL: ${{ steps.resolve-confidence.outputs.label }}
INCREMENTAL_MODE: ${{ steps.incremental.outputs.mode }}
LAST_REVIEWED_SHA: ${{ steps.incremental.outputs.last-reviewed-sha }}
run: |
title=$(jq -r '.title' pr_metadata.json)
author=$(jq -r '.author.login' pr_metadata.json)
Expand Down Expand Up @@ -740,6 +771,15 @@ runs:
echo "- **Inline confidence threshold**: ${CONFIDENCE_SCORE}/100 (${CONFIDENCE_LABEL} band). When applying the Confidence Scoring posting policy, post a non-forced finding as an inline comment only when its confidence score is **${CONFIDENCE_SCORE} or higher**. A finding that is in scope and not dismissed but scores **below ${CONFIDENCE_SCORE}** must NOT be posted inline — surface it under \"Lower-confidence findings (not posted inline)\" in the review body instead, exactly as the posting policy already prescribes for below-threshold findings (the policy's negligible-band rules still govern what is summarized vs. dropped)."
echo "- This threshold does NOT override the security floor or the high-severity always-post rule: \`security\` findings and high-severity CONFIRMED/LIKELY findings are ALWAYS posted inline regardless of the threshold."
echo ""
if [ "$INCREMENTAL_MODE" = "incremental" ]; then
echo "## Incremental Review"
echo ""
echo "This is an INCREMENTAL review: \`pr.diff\` contains only the changes pushed since the last completed review (commit \`${LAST_REVIEWED_SHA}\`). Earlier commits were already reviewed — do NOT re-review them or fetch the full PR diff yourself."
echo "- The \"Changed Files\" list above still describes the whole PR; review only the files present in \`pr.diff\`."
echo "- Start the review body with a note that this review covers only the commits since \`${LAST_REVIEWED_SHA:0:12}\`, and scope the assessment label to those changes."
echo "- If \`pr.diff\` is missing, do NOT fall back to fetching the full diff — post a COMMENT review explaining the incremental diff was unavailable."
echo ""
fi
echo "---"
echo ""
echo "## Instructions"
Expand Down Expand Up @@ -786,6 +826,34 @@ runs:
else
echo "::warning::validate-suggestions.js not found in dist — suggestion validation will be skipped"
fi
# Stage the finding deduplicator the same way (see posting-format.md).
if [ -f "$ACTION_PATH/../dist/dedupe-findings.js" ]; then
cp "$ACTION_PATH/../dist/dedupe-findings.js" /tmp/dedupe-findings.js
else
echo "::warning::dedupe-findings.js not found in dist — finding deduplication will be skipped"
fi

# Pre-fetch the PR's existing inline review comments so the agent can drop
# findings already posted in a previous review cycle (the dedupe CLI is
# fail-open: a missing or empty file simply skips deduplication).
- name: Fetch existing review comments
if: steps.lock-check.outputs.skip != 'true'
continue-on-error: true
shell: bash
env:
GH_TOKEN: ${{ steps.resolve-token.outputs.token }}
PR_NUMBER: ${{ steps.resolve-context.outputs.pr-number }}
REPO: ${{ github.repository }}
run: |
set -o pipefail
if gh api --paginate "repos/$REPO/pulls/$PR_NUMBER/comments" \
| jq -s 'map(if type == "array" then . else [.] end) | add // []' \
> /tmp/existing_review_comments.json; then
echo "✅ Fetched $(jq 'length' /tmp/existing_review_comments.json) existing review comment(s)"
else
rm -f /tmp/existing_review_comments.json
echo "::warning::Failed to fetch existing review comments — deduplication will be skipped"
fi

# ========================================
# RUN REVIEW using root docker-agent-action
Expand Down
11 changes: 8 additions & 3 deletions review-pr/agents/pr-review.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -242,7 +242,10 @@ agents:
security findings (review manually)" citing the verifier's stated mitigation.
Before posting, verify each inline comment per Confidence Scoring: its body ends with a
valid confidence table, and no non-forced comment scores below the inline threshold T.
Then post the review.
Then run the deduplication step from the posting template
(`node /tmp/dedupe-findings.js /tmp/review_comments.json /tmp/existing_review_comments.json`)
so findings already posted in a previous review cycle are not posted twice, and post
the review.
10. Order inline comments forced-first: list the forced comments (high-severity and
security CONFIRMED/LIKELY) first, then the remaining (non-forced) comments by confidence,
highest first — so a forced finding is never buried beneath a higher-scoring non-forced
Expand Down Expand Up @@ -462,8 +465,10 @@ agents:
otherwise keep the suggestion as prose. See `/tmp/refs/posting-format.md` for the exact
field layout. GitHub is strict: a suggestion anchored outside the diff or on a deleted
line makes the WHOLE review fail (422). The mandatory validation step in the posting
template (`node /tmp/validate-suggestions.js /tmp/review_comments.json pr.diff`) strips
any malformed suggestion block before posting — run it; do not skip it.
template strips any malformed suggestion block before posting — run it; do not skip it.
It validates against `pr_full.diff` when that file exists (incremental mode preserves
the full PR diff there — the diff GitHub validates anchors against) and `pr.diff`
otherwise; the posting template selects the right file.

## Domain-Specific Review Output

Expand Down
Loading