rmoriedata is the R companion data package for
rmorie. AGPL-3.0-or-later.
Distributed via r-universe today; CRAN submission planned alongside
rmorie v1.0.0 alpha.
This package ships only bundled .rda / .csv fixtures drawn from
public Canadian + US open-data portals (TPS, OTIS, ARSAU, Chicago, NYC,
Vancouver, Statistics Canada CODR, Montreal, Calgary, Ottawa, Edmonton).
No private, FOI-restricted, or agreement-only data is in this package.
That property is the central security claim and the rest of this document
explains how it is enforced.
Status (2026-05-26): repository scaffold exists; package contents are being staged in
/Volumes/VSR/rootcoderfiles/rmoriedata-staging/ in parallel. The SECURITY policy is being put in place ahead of the first public r-universe push so the conformance discipline is established from day one.
Email vsruhela@proton.me with subject [SECURITY] rmoriedata —
do not open a public GitHub Issue for security reports. GitHub's
private vulnerability reporting
will be enabled at first public push. PGP preferred:
gpg --recv-keys F2A44D5982E7585E48DF861E335990B9336F7DD6
Please include:
- Description, impact, CVSS estimate.
- For data-leak reports: the dataset key, the licence the upstream publishes under, and the redistribution restriction you believe applies.
- packageVersion("rmoriedata") + R.version.string + platform.
- Whether you want CHANGELOG + NEWS.md credit.
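The version details requested above can be collected in one call. This is a minimal sketch; `report_env()` is a hypothetical helper name, not something rmoriedata exports, and the `tryCatch()` only covers the case where the package is not installed.

```r
# Hypothetical helper (not part of rmoriedata): gathers the version details
# a security report is asked to include.
report_env <- function(pkg = "rmoriedata") {
  pkg_version <- tryCatch(
    as.character(utils::packageVersion(pkg)),
    error = function(e) "not installed"
  )
  c(
    package  = pkg_version,
    r        = R.version.string,
    platform = R.version$platform
  )
}

print(report_env())
```

Pasting the printed vector into the report email should be enough to identify the affected build.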
SLA
| Severity | Acknowledge | Fix or mitigation |
|---|---|---|
| High (private / FOI / agreement-only data in a bundled fixture) | 24 hours | 24 hours (yank) |
| High (fabricated rnorm() / sample() data documented as real) | 72 hours | 14 days |
| Moderate (stale fixture vs. live portal, missing licence note) | 72 hours | 30 days |
| Low (typo, broken provenance URL) | 72 hours | 90 days |
A High data-leak report triggers same-day package retraction from r-universe and a CRAN withdrawal request if applicable. No bug bounty (yet). Valid reports get credit by default.
rmoriedata is a data-only R package. The host R session and
user are trusted; the contents of this package are the asset.
Adversaries we model:
- A maintainer (me, or an LLM-driven agent) accidentally bundling private / FOI / agreement-only data. This is the dominant threat. Mitigation: every CSV / .rda in inst/extdata/ and data/ carries a provenance note in data-raw/ recording:
  - The public open-data portal URL.
  - The licence string verbatim (Open Government Licence – Ontario; City of Toronto Open Data Licence; Open Data Commons; etc.).
  - The fetch date and the upstream version / SHA.
  - The retrieval script that produced the bundled artifact.

  The CI gate refuses to publish a fixture whose data-raw/ script doesn't exist or doesn't pass a portal-URL + licence sanity check.
- An LLM-driven agent fabricating bundled data. Bit us on the morie side before; the rule here is hard: any bundled file is either (a) a real slice from a public portal or (b) a typed-empty 0-row frame with a documented schema. Never rnorm() / sample() invented values, ever, regardless of how innocuous it looks.
- An attacker tampering with a bundled fixture in flight. Mitigated by GPG-signed Git tags + SHA-256 sidecars on release artifacts + GitHub's transparency log + the SLSA build provenance attestation.
- A stale fixture mis-represented as current. Every data/*.rda ships with a last_fetched attribute (UTC, RFC 3339) and an upstream_url attribute. R/rmoriedata_provenance.R exposes these. v1.0+: a rmoriedata_check_freshness() helper diffs bundled vs. live for the user.
- A re-identification attack on a "public-but-sensitive" fixture. Some open-data portals publish low-cell counts that, joined with other public data, may be re-identifiable. Mitigated by:
  - Refusing to bundle small-cell tables (n < threshold per upstream guidance).
  - Documenting any aggregation we apply before bundling.
  - Linking to upstream methodology for caveats.
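The small-cell refusal in the last item can be sketched as a simple predicate. This is an illustration under assumptions: `has_small_cells()` is a hypothetical name, the threshold comes from upstream guidance and is not fixed here, and the table values are toy numbers, not portal data.

```r
# Hypothetical gate (not the package's actual check): TRUE when any cell
# count is below the threshold, i.e. the table should not be bundled as-is.
has_small_cells <- function(df, count_col, threshold) {
  counts <- df[[count_col]]
  stopifnot(is.numeric(counts), length(threshold) == 1)
  any(counts < threshold, na.rm = TRUE)
}

# Toy values for illustration only.
tab <- data.frame(area = c("A", "B", "C"), n = c(120L, 3L, 48L))
has_small_cells(tab, "n", threshold = 5)  # TRUE: area B has n = 3
has_small_cells(tab, "n", threshold = 3)  # FALSE: no count is below 3
```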
Assets we protect:
- The "no private data" invariant (highest priority).
- The "no fabricated data" invariant.
- Fixture integrity (no tampering between maintainer and user).
- Provenance accuracy — the URL + licence + date on every bundled artifact.
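For the "no fabricated data" invariant, the allowed fallback is a typed-empty 0-row frame with a documented schema. A minimal sketch of what that could look like follows; the column names, the schema vector, and `is_typed_empty()` are all hypothetical.

```r
# Hypothetical schema for illustration; real column names come from the portal.
empty_incidents <- data.frame(
  incident_id = character(0),
  occurred_on = as.Date(character(0)),
  count       = integer(0),
  stringsAsFactors = FALSE
)

# Hypothetical check: zero rows, expected column names, expected classes.
is_typed_empty <- function(df, schema) {
  nrow(df) == 0 &&
    identical(names(df), names(schema)) &&
    all(mapply(function(col, cls) inherits(col, cls), df, schema))
}

schema <- c(incident_id = "character", occurred_on = "Date", count = "integer")
is_typed_empty(empty_incidents, schema)  # TRUE
```

The frame carries the schema (names and types) without containing a single invented value.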
Out of scope:
- Host-OS / R-runtime compromise. Beyond our reach.
- Analytical conclusions drawn from the data. Those live in rmorie / morie / papers, not here.
- Upstream portal availability. If data.ontario.ca goes down, the bundled fixture is the user's only copy until the portal returns; we make no SLA promise.
- Upstream portal correctness. If OTIS publishes wrong numbers, we bundle wrong numbers. We are a faithful mirror.
Trust boundaries:
| Boundary | Crossing |
|---|---|
| Public portal → maintainer machine | data-raw/*.R ingestion scripts (one-time, audited) |
| Maintainer → packaged .rda | data-raw/*.R → usethis::use_data() |
| Packaged .rda → user R session | data(<name>, package = "rmoriedata") |
| Provenance metadata → user | Attributes on every bundled object |
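The last boundary, provenance metadata reaching the user as attributes, might look like this from the user's side. It is a sketch only: the object below is a stand-in built in place rather than loaded from the package, the attribute values are placeholders, and `read_provenance()` is a hypothetical name.

```r
# Stand-in for a bundled object; with the real package it would come from
# data(<name>, package = "rmoriedata"). Values below are placeholders.
fixture <- data.frame(year = integer(0), value = numeric(0))
attr(fixture, "last_fetched") <- "2026-05-26T00:00:00Z"        # UTC, RFC 3339
attr(fixture, "upstream_url") <- "https://example.org/dataset"  # placeholder URL

# Hypothetical reader: returns the two provenance attributes named in this document.
read_provenance <- function(x) {
  list(
    last_fetched = attr(x, "last_fetched", exact = TRUE),
    upstream_url = attr(x, "upstream_url", exact = TRUE)
  )
}

str(read_provenance(fixture))
```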
rmoriedata is a passive data package. No crypto primitives are
exposed at the API surface. The crypto that matters is:
- Release-tag GPG signature with F2A44D5982E7585E48DF861E335990B9336F7DD6 (same key as the rest of the morie family).
- SHA-256 sidecar (.sha256) on each release artifact.
- SLSA L3 build provenance attestation from actions/attest-build-provenance; gh attestation verify confirms what tag built what bytes.
- (Roadmap) RFC 3161 timestamp from a Canadian TSA (timestamp.entrust.net) over each bundled-fixture manifest, so the user can prove the bundled bytes existed at a given moment even if the GitHub Release is later deleted.
For users who want to seal their own derivatives, the crypto stack in rmorie is available: X25519 + ML-KEM-768 KEX, Ed25519 + ML-DSA-65 signatures, ChaCha20-Poly1305 AEAD, Argon2id KDF — all libsodium + liboqs.
| Requirement | Where | ITSG-33 | NIST 800-53 (Mod) | OWASP ASVS L2 | Ontario MGCS IT Sec |
|---|---|---|---|---|---|
| Data provenance (URL + licence + date) | data-raw/*.R per-fixture + attributes on every data/*.rda | SI-12 | SI-12 | V14.3.3 | §4 Data class |
| "No private data" gate | CI step refuses release if any fixture lacks a data-raw/ script + portal URL + licence string | AC-21 | AC-21 | V1.8.1 | §4 Data class |
| "No fabricated data" gate | Per-fixture parity check vs. live portal at build time (best-effort) + maintainer policy | SI-7 | SI-7 | V14.3.3 | §6.3 Integrity |
| Open licence compatibility check | inst/LICENCES/ enumerates every upstream licence; CI fails on unknown licence | SA-4 | SA-4 | V1.1.1 | §3 Acceptable use |
| Reproducible build | DESCRIPTION pins R-deps; data-raw/ is fully deterministic given fetch-date | CM-2 | CM-2 | V14.2.1 | §6.2 Change ctrl |
| SHA-256 sidecars on release | release workflow | SI-7 | SI-7 | V10.3.1 | §6.3 Integrity |
| GPG-signed Git tags | key F2A44D5982E7585E48DF861E335990B9336F7DD6 | AU-10 | AU-10 | V10.3.1 | §6.3 Integrity |
| SLSA L3 build provenance | actions/attest-build-provenance | SR-4 | SR-4 | V14.2.6 | §6.2 Change ctrl |
| Vulnerability disclosure | This document + GitHub Security Advisories | IR-6 | IR-6 | V1.1.4 | §7 Incident |
| Same-day yank procedure for data leaks | Documented retraction procedure (see "Audit & non-repudiation") | IR-4 | IR-4 | V1.1.4 | §7 Incident |
ITSG-33: Communications Security Establishment (CSE) IT Security Guidance. NIST 800-53 Rev 5 moderate baseline. OWASP ASVS 4.0.3.
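The "no private data" and licence-compatibility rows describe a gate that rejects any fixture lacking a retrieval script, a portal URL, or a known licence string. A rough sketch of that logic is below; the manifest columns, the allow-list contents, and `failing_fixtures()` are assumptions for illustration, not the actual CI step.

```r
# Hypothetical allow-list; per this document the real one lives in inst/LICENCES/.
known_licences <- c(
  "City of Toronto Open Data Licence",
  "Open Data Commons"
)

# Hypothetical gate: returns the names of fixtures that fail any check.
# script_exists is injectable so the sketch runs without a data-raw/ directory.
failing_fixtures <- function(manifest, script_exists = file.exists) {
  ok <- grepl("^https://", manifest$upstream_url) &
    manifest$licence %in% known_licences &
    vapply(manifest$script, script_exists, logical(1))
  manifest$fixture[!ok]
}

# Placeholder rows, not real fixtures.
manifest <- data.frame(
  fixture      = c("a", "b", "c"),
  upstream_url = c("https://example.org/a", "ftp://example.org/b", "https://example.org/c"),
  licence      = c("Open Data Commons", "Open Data Commons", "Unknown"),
  script       = c("data-raw/a.R", "data-raw/b.R", "data-raw/c.R"),
  stringsAsFactors = FALSE
)
failing_fixtures(manifest, script_exists = function(p) TRUE)  # "b" "c"
```

A CI job built on this idea would fail the release whenever the returned vector is non-empty.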
- Reproducible fetches. Every data-raw/*.R script is deterministic given the fetch date and the upstream portal state. Scripts pin upstream SHAs / resource IDs / Socrata view-IDs where the portal exposes them.
- SBOM. A CycloneDX SBOM of the R-package surface is attached per release; the per-fixture provenance file (inst/provenance.json) acts as the data-asset BoM.
- Signed releases. GPG-signed tags + SHA-256 sidecars on every release artifact + SLSA L3 attestation.
- CI action pinning. All uses: in .github/workflows/ pinned by full commit SHA, never tag.
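The action-pinning rule lends itself to a mechanical lint. The following is a sketch rather than the project's tooling: `unpinned_uses()` is a hypothetical name, the workflow lines are made up, and the SHA is a placeholder. It would also flag local (`./path`) and `docker://` references, which a real lint might choose to exempt.

```r
# Hypothetical lint: returns `uses:` lines whose ref is not a 40-hex commit SHA.
unpinned_uses <- function(lines) {
  uses <- grep("^[[:space:]]*-?[[:space:]]*uses:", lines, value = TRUE)
  pinned <- grepl("@[0-9a-f]{40}([[:space:]]|$)", uses)
  trimws(uses[!pinned])
}

# Made-up workflow excerpt; the SHA below is a placeholder.
workflow <- c(
  "steps:",
  "  - uses: actions/checkout@v4",
  "  - uses: actions/attest-build-provenance@0123456789abcdef0123456789abcdef01234567 # placeholder SHA",
  "  - run: R CMD build ."
)
unpinned_uses(workflow)  # "- uses: actions/checkout@v4"
```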
- Per-fixture provenance: embedded as attributes on every bundled object and as a sibling inst/provenance.json for programmatic inspection.
- Hash-chained release manifest: each release publishes a signed manifest of {fixture_name, sha256, fetched_at, upstream_url, licence} rows; the chain ties to the prior release so insertions / deletions are detectable.
- (Roadmap) RFC 3161 timestamp on the manifest.
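To show why chaining makes insertions and deletions detectable, here is a sketch that checks only the links between releases. It assumes each manifest records its own digest and the digest of the prior release's manifest; the `prev_sha256` field name and `broken_links()` are hypothetical, the digests are placeholders, and computing or signature-checking the digests is left out.

```r
# Hypothetical manifest headers, oldest first. Digests are placeholders.
releases <- data.frame(
  version     = c("0.1.0", "0.1.1", "0.1.2"),
  sha256      = c("aaa", "bbb", "ccc"),
  prev_sha256 = c(NA, "aaa", "bbb"),
  stringsAsFactors = FALSE
)

# Returns the versions whose prev_sha256 does not match the preceding release.
broken_links <- function(releases) {
  n <- nrow(releases)
  if (n < 2) return(character(0))
  expected <- releases$sha256[-n]
  actual   <- releases$prev_sha256[-1]
  bad <- is.na(actual) | actual != expected
  releases$version[-1][bad]
}

broken_links(releases)         # character(0): chain is intact
broken_links(releases[-2, ])   # "0.1.2": a deleted release breaks the link
```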
Data-leak retraction procedure:
- Maintainer marks the affected release as withdrawn on r-universe + GitHub Releases.
- New patch release ships with the fixture removed + a Deprecated note in NEWS.md.
- If CRAN submission has happened, a withdrawal request goes to CRAN simultaneously.
- A post-mortem entry in docs/POSTMORTEMS.md records what leaked, how it got in, and what gate failed.
- A user re-distributing bundled fixtures under an incompatible licence. The bundled licences are documented; downstream compliance is the user's.
- A determined re-identification attack across portals. We refuse low-cell tables, but composability across portals is beyond our visibility.
- Stale fixtures. A bundled .rda is a snapshot, not a live view; the last_fetched attribute is the user's clue.
- Upstream portal misinformation. Garbage in, garbage out (we're a mirror).
- A user querying upstream portals directly via rmorie/morie. That trust boundary is rmorie's SECURITY, not ours.
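Since staleness is left to the user, the `last_fetched` attribute can be turned into an age in days. This sketch is not `rmoriedata_check_freshness()`; `fixture_age_days()` is a hypothetical name, the dates are placeholders, and it only parses the whole-second, `Z`-suffixed form of RFC 3339.

```r
# Hypothetical helper: age in days of a fixture, read from its
# RFC 3339 UTC last_fetched attribute (whole seconds, "Z" suffix only).
fixture_age_days <- function(x, now = Sys.time()) {
  stamp <- attr(x, "last_fetched", exact = TRUE)
  if (is.null(stamp)) stop("object has no last_fetched attribute")
  fetched <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  if (is.na(fetched)) stop("last_fetched is not in the expected UTC format")
  as.numeric(difftime(now, fetched, units = "days"))
}

# Stand-in object with a placeholder fetch date.
fixture <- data.frame(value = numeric(0))
attr(fixture, "last_fetched") <- "2026-05-01T00:00:00Z"
now <- as.POSIXct("2026-05-26T00:00:00Z", format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
fixture_age_days(fixture, now = now)  # 25
```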
Wave 1 — in flight (initial release)
- "No private data" CI gate.
- "No fabricated data" parity-against-live gate.
- Per-fixture provenance attributes + inst/provenance.json.
- GPG-signed release tags + SHA-256 sidecars.
- Same-day yank procedure documented.
Wave 2 — in progress / done
- DP + k-anonymity helpers: done (v0.1.1). Six exported, base-R-only primitives for analysts releasing aggregates without re-identification risk: morie_dp_laplace_count(), morie_dp_gaussian_mean(), morie_dp_laplace_histogram(), morie_k_anonymity_verify(), morie_l_diversity_verify(), morie_cell_suppress(). Round-trip / variance-scaling tests under tests/testthat/.
- SLSA L3 attestation on every release.
- Hash-chained release manifest.
- rmoriedata_check_freshness() helper.
- macOS / Windows binary-package signing once CRAN submission begins.
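For readers unfamiliar with the DP primitives listed above, the textbook Laplace mechanism for a count query can be written in base R as follows. This is a generic sketch, not the signature or implementation of `morie_dp_laplace_count()`; `laplace_count()` is a hypothetical name and the true count of 100 is an arbitrary illustration.

```r
# Generic Laplace mechanism for a count query (sensitivity 1).
# NOT the signature or implementation of morie_dp_laplace_count().
laplace_count <- function(true_count, epsilon) {
  stopifnot(epsilon > 0)
  scale <- 1 / epsilon  # sensitivity / epsilon
  # The difference of two iid exponentials is Laplace(0, scale).
  noise <- rexp(1, rate = 1 / scale) - rexp(1, rate = 1 / scale)
  true_count + noise
}

set.seed(1)
draws <- replicate(20000, laplace_count(100, epsilon = 0.5))
# Theory: mean 100, variance 2 * scale^2 = 8 for epsilon = 0.5.
c(mean = mean(draws), var = var(draws))
```

A variance-scaling test of the kind mentioned above would compare the sample variance against `2 / epsilon^2` within a tolerance.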
Future
- RFC 3161 timestamping over the manifest from a Canadian TSA.
- CRAN submission alongside rmorie v1.0.0 alpha.
Maintainer: Vansh Singh Ruhela (rootcoder007) · vsruhela@proton.me