Feature request
The safeoutputs MCP server currently forwards every key it receives in a safe-output JSONL record to the GitHub API (via the safe_outputs job's downstream handler). When an LLM emits a key that is not in the documented schema for that safe-output type, the result is either a hard failure (e.g. HTTP 422 from workflow_dispatch if the receiver doesn't declare the extra input) or silent data loss.
Request: the safeoutputs MCP server should validate each emitted JSON object against the documented schema for its type and either reject (fail-loud) or strip (fail-graceful) unknown keys before forwarding.
Concrete reproduction (issue #153 in our policy-driven agent POC)
We use gpt-5-mini for the policy-dispatcher prompt. The dispatcher is documented in the prompt to emit dispatch_workflow records with shape:
{"type": "dispatch_workflow", "workflow_name": "<tier>", "inputs": {"issue_number": "<N>"}}
The model, however, generalizes from min-integrity (a key in the gh-aw MCP-guard env-var family that ALSO appears in its context) and adds hallucinated integrity / secrecy keys to the safe-output object:
{"type": "dispatch_workflow", "workflow_name": "tier-substantial", "inputs": {...}, "integrity": "high", "secrecy": "medium"}
The safeoutputs handler forwards these as workflow_dispatch inputs. GitHub's API rejects with HTTP 422 because the receiver workflow's on.workflow_dispatch.inputs block doesn't declare integrity or secrecy. The chain wedges and the issue cannot be processed.
Defense-in-depth we built downstream (workaround)
We shipped a deterministic post-emission sanitizer that injects a Sanitize Safe Outputs step into every *.lock.yml agent job, after agent emission and before the safe_outputs processor. The step reads the agent's output JSONL, parses each line, intersects keys against a schema file generated from each receiver workflow's declared on.workflow_dispatch.inputs, and rewrites the file in place. Unknown keys are dropped with a structured [sanitize-safe-outputs] dropped key=<key> from type=<type> log line. The job fails closed if the sanitizer or the schema file is missing.
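The patcher and the schema file:
- scripts/inject-safe-output-sanitizer.sh
- .github/scripts/sanitize-safe-outputs.py
- scripts/refresh-safe-output-schemas.sh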
This works, but it's a per-repo patch on top of generated lock files (which then needs a sidecar hash manifest to survive re-compiles). The right place for the sanitization is upstream in the safeoutputs MCP server itself: ONE source of truth, every gh-aw user benefits.
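For reference, here is a minimal sketch of the key-intersection logic described above. It is not the shipped script: the ALLOWED_KEYS table, the function names, and the choice to treat a missing schema entry as a hard error are all assumptions made for illustration. Only the log-line format is taken from the description.

```python
import json
import sys

# Hypothetical schema: allowed top-level keys per safe-output type.
# The real schema file is generated from each receiver workflow's inputs.
ALLOWED_KEYS = {
    "dispatch_workflow": {"type", "workflow_name", "inputs"},
}


def sanitize_record(record, allowed=ALLOWED_KEYS, log=sys.stderr):
    """Return a copy of record with keys outside the schema removed."""
    rtype = record.get("type")
    if rtype not in allowed:
        # Fail closed: no schema entry means we refuse to forward anything.
        raise ValueError(f"no schema for type={rtype!r}")
    clean = {}
    for key, value in record.items():
        if key in allowed[rtype]:
            clean[key] = value
        else:
            print(
                f"[sanitize-safe-outputs] dropped key={key} from type={rtype}",
                file=log,
            )
    return clean


def sanitize_jsonl(text, allowed=ALLOWED_KEYS):
    """Sanitize every non-empty line of a JSONL document."""
    lines = [
        json.dumps(sanitize_record(json.loads(line), allowed))
        for line in text.splitlines()
        if line.strip()
    ]
    return "\n".join(lines) + ("\n" if lines else "")
```

Applied to the record from the reproduction, this would keep type, workflow_name, and inputs, and log one dropped-key line each for integrity and secrecy.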
Why this matters beyond #153
The hallucination pattern is robust across models. We've also seen GPT-5-mini hallucinate extra fields on add_comment (security-axis names like integrity, secrecy, min-integrity) and on add_labels (label-namespace names like scope, severity). Any LLM-emitted safe-output is at risk of this class of bug as long as the MCP server forwards unknown keys.
Proposed shape
gh-aw's safeoutputs MCP server already has the schema for each safe-output type internally (it has to, in order to construct the GitHub API request). Validate emitted JSON against that schema at MCP-server time. Options:
- Strict: reject the emission with a clear error back to the agent ("unknown key 'integrity' on type 'dispatch_workflow'"). Agent can self-correct.
- Lenient: strip unknown keys silently, log them. Same observable behavior as our downstream sanitizer.
- Configurable: a workflow frontmatter option safe-outputs.validate: strict|lenient|off lets users choose.
The configurable option (the third) is probably the cleanest -- it can default to strict for new workflows, with lenient as a migration path for existing ones.
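To make the three modes concrete, here is a rough sketch of what the check could look like. It is written in Python purely for illustration and says nothing about how gh-aw is actually implemented; SCHEMAS, UnknownKeyError, and validate are hypothetical names, and the schema contents are assumed from the documented dispatch_workflow shape above.

```python
# Hypothetical per-type schema of allowed top-level keys.
SCHEMAS = {
    "dispatch_workflow": frozenset({"type", "workflow_name", "inputs"}),
}

MODES = ("strict", "lenient", "off")


class UnknownKeyError(ValueError):
    """Raised in strict mode so the agent gets a clear, correctable error."""


def validate(record, mode="strict", schemas=SCHEMAS):
    """Return (record_to_forward, unknown_keys) according to mode."""
    if mode not in MODES:
        raise ValueError(f"invalid mode {mode!r}")
    if mode == "off":
        # Forward everything unchanged, without inspecting the keys.
        return dict(record), []
    rtype = record.get("type")
    allowed = schemas.get(rtype)
    if allowed is None:
        raise UnknownKeyError(f"unknown safe-output type {rtype!r}")
    unknown = sorted(k for k in record if k not in allowed)
    if unknown and mode == "strict":
        keys = ", ".join(repr(k) for k in unknown)
        raise UnknownKeyError(f"unknown key {keys} on type {rtype!r}")
    # Lenient (or strict with nothing to reject): strip and report.
    kept = {k: v for k, v in record.items() if k in allowed}
    return kept, unknown
```

In strict mode the error text can be returned to the agent as the tool result so it can retry; in lenient mode the caller would log the returned unknown_keys list before forwarding the stripped record.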
Filed by
@dfrysinger via the policy-driven-agent-poc project. Happy to review a PR and provide additional reproductions if useful.
Cross-reference: dfrysinger/policy-driven-agent-poc#153 (root cause), PR #162 (downstream sanitizer).