Safe-output schema validation in safeoutputs MCP server: drop unknown keys before forwarding

## Feature request

The safeoutputs MCP server currently forwards every key it receives in a safe-output JSONL record to the GitHub API (via the safe_outputs job's downstream handler). When an LLM emits a key that is not in the documented schema for that safe-output type, the result is either a hard failure (e.g. HTTP 422 from `workflow_dispatch` if the receiver doesn't declare the extra input) or silent data loss.

Request: the safeoutputs MCP server should validate each emitted JSON object against the documented schema for its `type` and either reject (fail-loud) or strip (fail-graceful) unknown keys before forwarding.

## Concrete reproduction (issue #153 in our policy-driven agent POC)

We use `gpt-5-mini` for the policy-dispatcher prompt. The dispatcher is documented in the prompt to emit `dispatch_workflow` records with shape:

```json
{"type": "dispatch_workflow", "workflow_name": "<tier>", "inputs": {"issue_number": "<N>"}}
```

The model, however, generalizes from `min-integrity` (a key in the gh-aw MCP-guard env-var family that ALSO appears in its context) and adds hallucinated `integrity` / `secrecy` keys to the safe-output object:

```json
{"type": "dispatch_workflow", "workflow_name": "tier-substantial", "inputs": {...}, "integrity": "high", "secrecy": "medium"}
```

The safeoutputs handler forwards these as `workflow_dispatch` inputs. GitHub's API rejects with HTTP 422 because the receiver workflow's `on.workflow_dispatch.inputs` block doesn't declare `integrity` or `secrecy`. The chain wedges and the issue cannot be processed.

## Defense-in-depth we built downstream (workaround)

We shipped a deterministic post-emission sanitizer that injects a `Sanitize Safe Outputs` step into every `*.lock.yml` agent job, after agent emission and before the safe_outputs processor. The step reads the agent's output JSONL, parses each line, intersects its keys against a schema file generated from each receiver workflow's declared `on.workflow_dispatch.inputs`, and rewrites the file in place. Each unknown key is dropped with a structured `[sanitize-safe-outputs] dropped key=<key> from type=<type>` log line. The step fails closed if the sanitizer or the schema is missing.
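For readers who don't want to open the scripts, the key-intersection step can be sketched roughly as follows. This is a minimal illustration, not the code from `sanitize-safe-outputs.py`: the `ALLOWED_KEYS` mapping, its top-level-key granularity, and the `sanitize_line` helper are assumptions made for the example, and the real schema file format is not shown here.

```python
import json

# Hypothetical schema (assumption): allowed top-level keys per safe-output type.
# The real schema file is generated from each receiver workflow's declared inputs.
ALLOWED_KEYS = {
    "dispatch_workflow": {"type", "workflow_name", "inputs"},
}


def sanitize_line(line, allowed=ALLOWED_KEYS):
    """Return (clean_json_line, log_lines) for one JSONL record."""
    record = json.loads(line)
    rtype = record.get("type")
    if rtype not in allowed:
        # Fail closed: without a schema for this type we refuse to guess.
        raise ValueError(f"no schema for type={rtype!r}")
    clean, logs = {}, []
    for key, value in record.items():
        if key in allowed[rtype]:
            clean[key] = value
        else:
            # Same structured log format as described above.
            logs.append(
                f"[sanitize-safe-outputs] dropped key={key} from type={rtype}"
            )
    return json.dumps(clean), logs
```

Fed the hallucinated record from the reproduction, this sketch keeps `type`, `workflow_name`, and `inputs`, and emits one log line each for `integrity` and `secrecy`.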

The patcher, the sanitizer, and the schema refresh script:
- [`scripts/inject-safe-output-sanitizer.sh`](https://github.com/dfrysinger/policy-driven-agent-poc/blob/main/scripts/inject-safe-output-sanitizer.sh)
- [`.github/scripts/sanitize-safe-outputs.py`](https://github.com/dfrysinger/policy-driven-agent-poc/blob/main/.github/scripts/sanitize-safe-outputs.py)
- [`scripts/refresh-safe-output-schemas.sh`](https://github.com/dfrysinger/policy-driven-agent-poc/blob/main/scripts/refresh-safe-output-schemas.sh)

This works, but it's a per-repo patch on top of generated lock files (which then needs a sidecar hash manifest to survive re-compiles). The right place for the sanitization is upstream in the safeoutputs MCP server itself: ONE source of truth, every gh-aw user benefits.

## Why this matters beyond #153

The hallucination pattern is robust across models. We've also seen `gpt-5-mini` hallucinate extra fields on `add_comment` (security-axis names like `integrity`, `secrecy`, `min-integrity`) and on `add_labels` (label-namespace names like `scope`, `severity`). Any LLM-emitted safe-output is at risk of this class of bug as long as the MCP server forwards unknown keys.

## Proposed shape

`gh-aw`'s safeoutputs MCP server already has the schema for each safe-output type internally (it has to, in order to construct the GitHub API request). Validate emitted JSON against that schema at MCP-server time. Options:

1. **Strict**: reject the emission with a clear error back to the agent ("unknown key 'integrity' on type 'dispatch_workflow'"). The agent can then self-correct.
2. **Lenient**: strip unknown keys without failing the emission, and log each dropped key. Same observable behavior as our downstream sanitizer.
3. **Configurable**: a workflow frontmatter option `safe-outputs.validate: strict|lenient|off` lets users choose.

Option 3 is probably the cleanest: default to `strict` for new workflows, with `lenient` as a migration path for existing ones.
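To make the three modes concrete, here is a rough sketch of how the check might behave. It is only an illustration under stated assumptions: the `SCHEMA` mapping, the `UnknownKeyError` class, and the `validate_record` function are hypothetical names, and gh-aw's internal schema representation may look quite different.

```python
# Hypothetical per-type schema (assumption): allowed top-level keys.
SCHEMA = {
    "dispatch_workflow": {"type", "workflow_name", "inputs"},
}


class UnknownKeyError(ValueError):
    """Raised in strict mode so the error can be returned to the agent."""


def validate_record(record, mode="strict", schema=SCHEMA):
    """Return (record_to_forward, unknown_keys) according to `mode`."""
    if mode not in ("strict", "lenient", "off"):
        raise ValueError(f"invalid validate mode: {mode!r}")
    if mode == "off":
        # Current behavior: forward everything untouched.
        return record, []
    rtype = record.get("type")
    if rtype not in schema:
        raise ValueError(f"no schema for type {rtype!r}")
    allowed = schema[rtype]
    unknown = sorted(k for k in record if k not in allowed)
    if mode == "strict" and unknown:
        details = "; ".join(
            f"unknown key '{k}' on type '{rtype}'" for k in unknown
        )
        raise UnknownKeyError(details)
    # Lenient (or strict with nothing unknown): keep only allowed keys.
    clean = {k: v for k, v in record.items() if k in allowed}
    return clean, unknown
```

In `lenient` mode the caller would log the returned `unknown` list; in `strict` mode the exception message is what the agent would see and could use to retry with a corrected record.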

## Filed by

[@dfrysinger](https://github.com/dfrysinger) via the policy-driven-agent-poc project. Happy to review a PR and provide additional reproductions if useful.

Cross-reference: [`dfrysinger/policy-driven-agent-poc#153`](https://github.com/dfrysinger/policy-driven-agent-poc/issues/153) (root cause), [PR #162](https://github.com/dfrysinger/policy-driven-agent-poc/pull/162) (downstream sanitizer).
