feat(trace): audit-trail observability — events, query, stats, OCSF export, replay, bounded memory (#175, #176, #177, #179, #182, #213)#233
Open
dgenio wants to merge 5 commits into
…replay, bounded memory

Implements the TraceStore / audit-trail observability group as one cohesive change set, all anchored on TraceStore + ActionTrace + the kernel recording path:

- #175 Record handle expansions and policy denials as first-class audit events. ActionTrace gains additive `event_type` (invoke/expand/deny) + `reason_code`; Kernel.expand() records an `expand` event and fills Provenance.principal_id; a denied grant records a `deny` event with the stable reason code.
- #177 TraceQuery + pure query_traces() + TraceStore.query() across all backends (added to TraceStoreProtocol); filter by principal/capability/event/outcome/reason/time window with deterministic ordering and pagination.
- #179 KernelStats programmatic counters (Kernel.stats / reset_stats), dependency-free and lock-guarded.
- #176 OCSF/AOS SIEM export: trace_to_ocsf() / traces_to_ocsf(), pure mapping.
- #213 Policy-replay harness: DecisionRecord, record_decision(), replay() -> DecisionDiff, with rate-limit flips surfaced separately.
- #182 Bounded memory: TraceStore oldest-first eviction (max_entries, evicted_count, loud first eviction) + revocation expiry tracking and sweep that never un-revokes a live token (RevocationStoreProtocol.track() gains expires_at; adds sweep_expired()).

Docs (architecture/security/integrations/capabilities/trace_export), CHANGELOG, two runnable offline examples, and full test coverage added. `make ci` passes (715 passed, 1 skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019446VfpRWBaPqU4WX5KgTX
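For readers who want a concrete picture of the #177 query semantics, here is a minimal sketch, not the code in `trace_query.py`. `Trace`, `Query`, and `query_traces_sketch` are hypothetical stand-ins with only a few fields; the `since`-inclusive / `until`-exclusive window and the `(invoked_at, action_id)` sort key follow the PR description, while the `offset`/`limit` pagination shape is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Trace:
    """Hypothetical stand-in for ActionTrace (only the fields used here)."""
    action_id: str
    principal_id: str
    event_type: str  # "invoke", "expand", or "deny"
    invoked_at: datetime


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for TraceQuery; None means 'do not filter'."""
    principal_id: Optional[str] = None
    event_type: Optional[str] = None
    since: Optional[datetime] = None  # inclusive lower bound
    until: Optional[datetime] = None  # exclusive upper bound
    offset: int = 0                   # assumed pagination shape
    limit: Optional[int] = None


def query_traces_sketch(traces: Iterable[Trace], q: Query) -> list[Trace]:
    """Pure filter -> sort -> paginate; the input is never mutated."""

    def keep(t: Trace) -> bool:
        if q.principal_id is not None and t.principal_id != q.principal_id:
            return False
        if q.event_type is not None and t.event_type != q.event_type:
            return False
        if q.since is not None and t.invoked_at < q.since:
            return False
        if q.until is not None and t.invoked_at >= q.until:
            return False
        return True

    # Sorting on (invoked_at, action_id) breaks timestamp ties, so the same
    # input always yields the same page boundaries.
    matched = sorted(
        (t for t in traces if keep(t)),
        key=lambda t: (t.invoked_at, t.action_id),
    )
    end = None if q.limit is None else q.offset + q.limit
    return matched[q.offset:end]
```

Because the function only reads its arguments, a backend could in principle load its rows and delegate to one shared implementation, which is how the file table describes the JSONL backend.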
Pull request overview
This PR expands the audit-trail / TraceStore observability subsystem by adding new audited event kinds (deny/expand), a shared trace query surface, bounded in-memory retention + revocation sweeping, and new operator-focused tooling (KernelStats, OCSF export, policy replay), with docs/examples/tests updated accordingly.
Changes:
- Extend `ActionTrace` with additive `event_type` and `reason_code`, and record `deny`/`expand` traces from kernel choke points.
- Add `TraceQuery` + `query_traces()` and implement `TraceStoreProtocol.query()` across in-memory/SQLite/JSONL backends.
- Add bounded memory mechanisms: `TraceStore(max_entries, evicted_count)` and revocation `expires_at` tracking + `sweep_expired()` (plus `KernelStats`, OCSF export, and replay harness).
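The revocation `expires_at` tracking and `sweep_expired()` mentioned in the last item can be pictured with the sketch below. It is an illustration under assumptions, not the in-tree store: `RevocationStoreSketch` and its method bodies are hypothetical, and treating a token as expired once `expires_at <= now` is an assumed boundary.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Optional


class RevocationStoreSketch:
    """Hypothetical in-memory revocation store (not the weaver_kernel one)."""

    def __init__(self) -> None:
        self._expires_at: dict[str, datetime] = {}
        self._revoked: set[str] = set()

    def track(self, token_id: str, expires_at: datetime) -> None:
        # Remember when the token stops being valid on its own.
        self._expires_at[token_id] = expires_at

    def revoke(self, token_id: str) -> None:
        self._revoked.add(token_id)

    def is_revoked(self, token_id: str) -> bool:
        return token_id in self._revoked

    def sweep_expired(self, now: Optional[datetime] = None) -> int:
        """Drop bookkeeping for expired tokens; return how many were swept.

        Only tokens whose tracked expiry has passed are touched, so a token
        that is still live keeps its revocation entry. A revoked id that was
        never tracked has no known expiry and is conservatively kept.
        """
        now = now or datetime.now(timezone.utc)
        expired = [tid for tid, exp in self._expires_at.items() if exp <= now]
        for tid in expired:
            del self._expires_at[tid]
            self._revoked.discard(tid)
        return len(expired)
```

Dropping the revocation entry after expiry is safe only because an expired token is already rejected by the expiry check; the sweep trades no security for bounded memory.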
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_trace.py | Adds eviction/bounding tests for TraceStore and export field assertions. |
| tests/test_trace_query.py | New unit/integration tests for trace filtering, ordering, and pagination. |
| tests/test_tokens.py | Adds tests for revocation expiry sweeping and bounded growth. |
| tests/test_stores_sqlite.py | Updates revocation tracking API and adds SQLite sweep-expired test. |
| tests/test_stats.py | New tests for KernelStats collector and kernel integration counters. |
| tests/test_replay.py | New tests for policy decision recording + deterministic replay diffs. |
| tests/test_ocsf.py | New tests for OCSF/AOS mapping shape, determinism, and no-args leakage. |
| tests/test_kernel.py | Adds kernel-level tests for deny/expand trace recording and querying. |
| tests/test_handles.py | Verifies expansion frames now include Provenance.principal_id. |
| src/weaver_kernel/trace.py | Adds bounded TraceStore, exports event_type/reason_code, and integrates query support. |
| src/weaver_kernel/trace_query.py | Introduces TraceQuery and pure deterministic query_traces() implementation. |
| src/weaver_kernel/tokens.py | Threads expires_at into revocation tracking; adds provider sweep_revocations(). |
| src/weaver_kernel/stores/sqlite.py | Implements trace query; adds token expiry table + sweep-expired revocation cleanup. |
| src/weaver_kernel/stores/memory.py | Adds expiry tracking + lazy/explicit revocation sweeping to bound memory. |
| src/weaver_kernel/stores/jsonl.py | Implements trace query using shared query_traces() semantics. |
| src/weaver_kernel/stores/_trace_codec.py | Decodes persisted event_type/reason_code with backward-compatible defaults. |
| src/weaver_kernel/stores/_protocols.py | Extends protocols: TraceStoreProtocol.query(), RevocationStoreProtocol.track(expires_at) + sweep_expired(). |
| src/weaver_kernel/stats.py | New KernelStats collector and immutable StatsSnapshot. |
| src/weaver_kernel/replay.py | New policy decision record + replay/diff harness with rate-limit separation. |
| src/weaver_kernel/ocsf.py | New pure mapping from ActionTrace → OCSF API Activity (6003), AOS-enriched. |
| src/weaver_kernel/models.py | Adds TraceEventType and new ActionTrace fields with defaults. |
| src/weaver_kernel/kernel/_stream.py | Updates streaming pipeline to bump stats for handles/invocations. |
| src/weaver_kernel/kernel/_invoke.py | Threads fallback signal + adds invocation/handle stats increments. |
| src/weaver_kernel/kernel/_audit.py | New helpers to build/store deny and expansion traces. |
| src/weaver_kernel/kernel/__init__.py | Records deny/expand traces, exposes Kernel.query_traces(), and wires Kernel.stats. |
| src/weaver_kernel/handles.py | Fills expansion Provenance.principal_id from the expanding principal. |
| src/weaver_kernel/__init__.py | Re-exports new public APIs (query, stats, OCSF export, replay types/functions). |
| Makefile | Adds new examples to make example. |
| examples/trace_replay_demo.py | Offline runnable replay demo showing allow→deny flip behavior. |
| examples/ocsf_export_demo.py | Offline runnable demo exporting traces to OCSF and printing stats snapshot. |
| docs/trace_export.md | Documents new exported event_type/reason_code and related query/OCSF sections. |
| docs/security.md | Documents new audited event types and bounded retention/sweeping behavior. |
| docs/integrations.md | Adds SIEM export section and mapping table for OCSF/AOS. |
| docs/capabilities.md | Documents replay harness usage and determinism/fidelity caveats. |
| docs/architecture.md | Documents audited event types, query API semantics, retention bounding, and kernel counters. |
| CHANGELOG.md | Adds changelog entries covering new audit/query/stats/OCSF/replay/bounding features. |
| .github/workflows/ci.yml | Runs new examples in CI. |
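The replay harness listed for `src/weaver_kernel/replay.py` compares recorded policy decisions against a re-evaluation and reports flips. A rough sketch of that idea follows. Every name here (`Decision`, `Diff`, `replay_sketch`, the `"rate_limited"` reason code) is hypothetical and chosen for illustration; the real `DecisionRecord` and `DecisionDiff` shapes are not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

RATE_LIMIT = "rate_limited"  # hypothetical reason code


@dataclass(frozen=True)
class Decision:
    """Hypothetical recorded decision: allow flag plus a denial reason."""
    request_id: str
    allowed: bool
    reason_code: Optional[str] = None


@dataclass
class Diff:
    """Hypothetical diff: request ids grouped by the kind of flip."""
    allow_to_deny: list[str] = field(default_factory=list)
    deny_to_allow: list[str] = field(default_factory=list)
    reason_changed: list[str] = field(default_factory=list)
    rate_limit_flips: list[str] = field(default_factory=list)


def replay_sketch(
    recorded: list[Decision],
    policy: Callable[[str], Decision],
) -> Diff:
    """Re-evaluate each recorded request and classify any change."""
    diff = Diff()
    for old in recorded:
        new = policy(old.request_id)
        if (old.allowed, old.reason_code) == (new.allowed, new.reason_code):
            continue  # identical outcome, nothing to report
        if RATE_LIMIT in (old.reason_code, new.reason_code):
            # Kept apart from genuine policy flips.
            diff.rate_limit_flips.append(old.request_id)
        elif old.allowed and not new.allowed:
            diff.allow_to_deny.append(old.request_id)
        elif not old.allowed and new.allowed:
            diff.deny_to_allow.append(old.request_id)
        else:
            diff.reason_changed.append(old.request_id)
    return diff
```

One plausible reason for reporting rate-limit flips separately is that such outcomes depend on runtime state rather than on the policy under test, so mixing them in would obscure real allow/deny changes.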
…atetimes, fix stream redaction count, refresh export docstring

- trace_query: treat naive TraceQuery.since/until as UTC so filtering against the always-aware ActionTrace.invoked_at never raises TypeError.
- stores/memory: normalize naive expires_at (track) and now (sweep_expired) to UTC, matching SQLiteRevocationStore, preventing naive-vs-aware comparison errors.
- kernel/_stream: count a redaction event if *any* streamed frame carried a warning (apply_stream attaches warnings per chunk), not just the final frame.
- trace: export_action_trace docstring now describes deny/expand events instead of claiming denials never produce a trace.

Regression tests added for naive datetime handling (query + revocation) and the streaming redaction-count fix. make ci passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019446VfpRWBaPqU4WX5KgTX
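The naive-datetime fix above addresses a standard Python pitfall: comparing a naive `datetime` with an aware one raises `TypeError`. A minimal sketch of the normalization, assuming a helper named `as_utc` (a hypothetical name, not necessarily what the commit uses):

```python
from datetime import datetime, timezone


def as_utc(dt: datetime) -> datetime:
    """Interpret a naive datetime as UTC; convert an aware one to UTC."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Naive value: attach UTC without shifting the wall-clock fields.
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def in_window(invoked_at: datetime, since: datetime, until: datetime) -> bool:
    """since-inclusive / until-exclusive check that tolerates naive bounds."""
    return as_utc(since) <= as_utc(invoked_at) < as_utc(until)
```

Normalizing at the boundary keeps every later comparison between aware values, so callers passing `datetime(2024, 1, 1)` without a timezone get UTC semantics instead of an exception.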
… grant-time

Addresses audit-pass findings on PR #233 (no functional/behavioral regressions):

- grant(): record the "deny" audit trace best-effort. A trace-store write failure inside the PolicyDenied handler previously could mask the denial with a storage error; the denial already fails closed (no token issued), so the write is now wrapped and logged, and PolicyDenied always propagates.
- _audit.py: tighten the module docstring so its auditability claim matches behavior: only grant-time policy denials are recorded as "deny" traces; expansion-time access failures remain exceptions/logs, not traces. Docstrings updated to match.

Deferred (documented tradeoffs): O(n) query() on durable backends and single-warning trace eviction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DcmCN68NkKQd9Ja9Awz7vb
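The best-effort pattern described for `grant()` can be sketched as follows. This is an illustration only: `grant_sketch`, its callable parameters, and the `PolicyDenied` stand-in class are hypothetical, and the kernel's real signatures are not shown in this PR page.

```python
import logging

logger = logging.getLogger("audit_sketch")


class PolicyDenied(Exception):
    """Stand-in for the kernel's PolicyDenied (hypothetical shape)."""

    def __init__(self, reason_code: str) -> None:
        super().__init__(reason_code)
        self.reason_code = reason_code


def grant_sketch(evaluate, record_deny):
    """Run a policy check; audit a denial without letting the audit mask it.

    evaluate: zero-argument callable that returns a token or raises
        PolicyDenied.
    record_deny: callable taking the reason code; may fail (storage error).
    """
    try:
        return evaluate()
    except PolicyDenied as denied:
        try:
            record_deny(denied.reason_code)
        except Exception:
            # Best-effort: log the storage failure and carry on.
            logger.exception("failed to record deny trace")
        # Bare raise re-raises the PolicyDenied being handled, so the caller
        # always sees the denial, never the storage error.
        raise
```

The denial still fails closed because no token is returned on this path; only the audit write is allowed to fail quietly.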
…rouping-k70jmg

# Conflicts:
#	CHANGELOG.md
…semantics

Audit follow-ups (no behavior change):

- Kernel.expand(): note the expansion trace is intentionally NOT best-effort (unlike the denial trace) so a served expansion is never left unaudited (I-02).
- execute_with_fallback(): document that only DriverError counts as a failed attempt; an unregistered driver is skipped and does not set fell_back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BwRZZvDVMaW5LpRJpGpBDa
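The fallback semantics documented in that commit can be illustrated with a small sketch. The function name, its parameters, the dict-based registry, and the `(result, fell_back)` return shape are all assumptions made for the example; only the two rules (a `DriverError` counts as a failed attempt, an unregistered driver is skipped) come from the commit message.

```python
class DriverError(Exception):
    """Stand-in for the kernel's DriverError (hypothetical shape)."""


def execute_with_fallback_sketch(driver_ids, registry, request):
    """Try drivers in order; return (result, fell_back).

    driver_ids: ordered driver names to try.
    registry: dict mapping a driver name to a callable (assumed shape).
    """
    fell_back = False
    last_error = None
    for driver_id in driver_ids:
        driver = registry.get(driver_id)
        if driver is None:
            # Unregistered: skipped, and NOT counted as a failed attempt.
            continue
        try:
            return driver(request), fell_back
        except DriverError as exc:
            # Only a DriverError marks the call as having fallen back.
            last_error = exc
            fell_back = True
    raise last_error or DriverError("no registered driver could run")
```

Under these rules, `fell_back` means "an earlier driver actually ran and failed", which keeps a fallback counter from being inflated by configuration gaps.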
What changed
One cohesive change set across the TraceStore / audit-trail observability area (everything anchored on `TraceStore`, `ActionTrace`, and the kernel recording path), closing the recommended triage group.

- `models.py`: `ActionTrace` gains additive `event_type` (`invoke`/`expand`/`deny`) and `reason_code` (defaults preserve the original `invoke` meaning).
- `trace.py` / new `trace_query.py`: `TraceQuery` + pure `query_traces()` (filter by principal, capability, event type, outcome, reason code, `since`-inclusive/`until`-exclusive window; deterministic `(invoked_at, action_id)` order + pagination). `TraceStore` is now bounded (`max_entries`, oldest-first eviction, loud first eviction, `evicted_count`).
- `stores/_protocols.py`, `sqlite.py`, `jsonl.py`, `memory.py`, `_trace_codec.py`: `query()` added to `TraceStoreProtocol` and all trace backends; revocation backends track each token's `expires_at` and `sweep_expired()` (never un-revoking a live token). Breaking: `RevocationStoreProtocol.track()` now takes `expires_at`.
- `tokens.py`: threads `expires_at` into `track`; adds `HMACTokenProvider.sweep_revocations()`.
- `stats.py`: `KernelStats` collector + immutable `StatsSnapshot`; wired at kernel choke points (grant, invoke, fallback, firewall warnings, downgrade, handle store/expand). Exposed as `Kernel.stats` / `Kernel.reset_stats()`.
- `ocsf.py`: `trace_to_ocsf()` / `traces_to_ocsf()` map any record to OCSF API Activity (6003) events, AOS-enriched; pure, dependency-free.
- `replay.py`: `DecisionRecord`, `record_decision()`, `replay() -> DecisionDiff` (allow→deny / deny→allow / reason-code flips; rate-limit flips surfaced separately).
- `kernel/__init__.py` + new `kernel/_audit.py`: record `deny` traces on `PolicyDenied` and `expand` traces on `Kernel.expand()`; `Kernel.query_traces()`.
- `handles.py`: expansion Frames now carry `Provenance.principal_id`.
- `architecture.md`, `security.md`, `integrations.md` (OCSF mapping table), `capabilities.md` (replay), `trace_export.md`, `CHANGELOG.md`; runnable offline `examples/ocsf_export_demo.py` and `examples/trace_replay_demo.py` (added to `make example` and CI).

Closes #175, #176, #177, #179, #182, #213.
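To make the bounded `TraceStore` behavior concrete, here is a minimal sketch under assumptions; it is not the code in `trace.py`. The class name `BoundedTraceStore`, the `record`/`entries` methods, and reading "loud first eviction" as a single warning log on the first eviction are illustrative choices.

```python
import logging
from collections import deque

logger = logging.getLogger("trace_store_sketch")


class BoundedTraceStore:
    """Hypothetical bounded store: oldest entries are evicted first."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = deque()
        self.evicted_count = 0

    def record(self, trace) -> None:
        if len(self._entries) >= self._max_entries:
            # Drop the oldest trace to make room for the new one.
            self._entries.popleft()
            if self.evicted_count == 0:
                # Warn once so an operator notices that audit history is
                # being dropped, without flooding the log afterwards.
                logger.warning(
                    "trace store reached max_entries=%d; evicting oldest",
                    self._max_entries,
                )
            self.evicted_count += 1
        self._entries.append(trace)

    def entries(self) -> list:
        """Return the retained traces, oldest first."""
        return list(self._entries)
```

Keeping `evicted_count` alongside the one-time warning lets a caller detect data loss programmatically even if the log line was missed.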
Why
These six issues share one code area, data model, and implementation path: what/how-much the audit subsystem records (#175, #182), how it is read (#177, #179), and how it is exported/replayed (#176, #213), so they are cleanest as one change. Developed in Mode B per the requester: new latitude allowed, retro-compat not required (hence the `track()` signature change), one combined PR.

How verified

`make ci` (`ruff format --check` → `ruff check` → `mypy src/` → `pytest -q --cov` → all examples):

- `ruff format --check`: 107 files already formatted
- `ruff check src/ tests/ examples/`: All checks passed
- `mypy src/`: Success: no issues found in 57 source files
- `pytest -q`: 715 passed, 1 skipped (new suites: `test_trace_query.py`, `test_stats.py`, `test_ocsf.py`, `test_replay.py`; extended `test_trace.py`, `test_tokens.py`, `test_handles.py`, `test_kernel.py`, `test_stores_sqlite.py`)
- `make ci` exit code `0`; both new examples run clean offline.

Tradeoffs / risks
- `RevocationStoreProtocol.track()` adds an `expires_at` parameter and the protocol gains `sweep_expired()`; custom revocation backends must update. In-tree backends and tests are updated.
- … `weaver_kernel.ocsf`).
- … `_revoked` entry (the token is expired and fails the verifier's expiry check regardless); normal lifecycle (revoke while live → sweep after expiry) is fully bounded.

Scope notes

Limited to the recommended triage group. Adjacent observability items intentionally left as follow-ups: wiring `KernelStats` into `otel.py` as gauges (#179 notes this is acceptable as a follow-up), and #125 (OTel/LangWatch `ActionTrace` export), which partially overlaps the already-shipped `instrument_kernel` + `export_action_traces`.

🤖 Generated with Claude Code
https://claude.ai/code/session_019446VfpRWBaPqU4WX5KgTX