"Biggest Zarr Store" High Score/Leaderboard #175

@jbusecke

I just thought of an addition to the website that IMO would serve well as marketing material for zarr and also be fun.

I'd love to propose a new section of the website where users can present their zarr stores (they need to be public) and where we can keep track of which zarr store is the largest (either by space on disk or by number of chunks), what technology was used (native zarr, virtual icechunk, etc.), and optionally link to docs that, e.g., go into the ingestion code.

A simple sortable/filterable table on the site, something like:

| Store Link | Total Data on Disk | Total Chunks | Tech Implementation | Submitted By | Last Verified | More Info / Docs |
|---|---|---|---|---|---|---|
| s3://example-bucket/era5-reanalysis.zarr | 2.4 PB | 1.2B | native zarr v3 | alice | 2026-05-01 | Ingestion write-up |
| gs://example/cmip6-mirror.zarr | 850 TB | 410M | icechunk | bob | 2026-04-22 | Repo |
| s3://example/sentinel2-mosaic | 312 TB | 88M | virtual icechunk | carol | 2026-05-10 | Docs |
| https://example.org/goes16.zarr | 95 TB | 22M | native zarr v3 | dave | 2026-03-15 | |
| s3://example/noaa-hrrr.zarr | 41 TB | 9.5M | native zarr v2 | erin | 2026-05-20 | Notebook |

  • Store Link — must point to a publicly readable store (anonymous S3/GCS/HTTPS access).
  • Total Data on Disk — uncompressed-on-disk size, reported by the submitter and ideally reproducible from store metadata.
  • Total Chunks — total number of chunks across all arrays in the store.
  • Tech Implementation — controlled vocabulary so the column stays sortable/filterable. Suggested values: native zarr v2, native zarr v3, icechunk, virtual icechunk, virtualizarr. (Open to additions.)
  • Submitted By — GitHub handle for attribution and credibility.
  • Last Verified — date the stats were last confirmed; helps keep numbers honest as stores grow or change.
  • More Info / Docs — optional link to a blog post, repo, notebook, or docs page explaining the ingestion / use case.
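
To make the "Total Chunks" column concrete, here is a rough sketch of how a submitter might derive it from array metadata. It assumes regular chunk grids and takes hypothetical `shape`/`chunks` tuples as plain Python data instead of reading a real store. It counts chunk-grid positions, which can overcount if a store omits empty chunks.

```python
import math

def chunk_grid_count(shape, chunks):
    """Chunk-grid positions for one array, assuming a regular chunk grid."""
    # Integer ceiling division per dimension: -(-s // c) == ceil(s / c).
    return math.prod(-(-s // c) for s, c in zip(shape, chunks))

def total_chunks(arrays):
    """Sum the chunk-grid positions across all arrays in a store."""
    return sum(chunk_grid_count(a["shape"], a["chunks"]) for a in arrays)

# Hypothetical array metadata, for illustration only.
arrays = [
    {"shape": (1000, 721, 1440), "chunks": (100, 721, 1440)},  # 10 * 1 * 1
    {"shape": (1000, 10), "chunks": (300, 10)},                # 4 * 1
]

print(total_chunks(arrays))
```

A partial chunk at the edge of a dimension still counts as one chunk, which is why the second array contributes 4 and not 3.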

Something like:

- store_link: s3://example-bucket/era5-reanalysis.zarr
  size_bytes: 2_400_000_000_000_000
  total_chunks: 1_200_000_000
  tech: native-zarr-v3
  submitted_by: alice
  last_verified: 2026-05-01
  docs_url: https://example.com/era5-blog

could probably serve as a user supplied source for the page.
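
A minimal sketch of how the page build could turn such entries into a sorted table. It assumes the entries are already parsed into Python dicts. The hyphenated `tech` slugs, the required-key set, and the function names are all hypothetical choices for illustration, not an agreed schema. Sizes use decimal units (1 PB = 10^15 bytes), matching the example above.

```python
# Hypothetical slugs: hyphenated forms of the suggested vocabulary.
ALLOWED_TECH = {
    "native-zarr-v2", "native-zarr-v3", "icechunk",
    "virtual-icechunk", "virtualizarr",
}
# docs_url is optional, so it is not listed here.
REQUIRED_KEYS = {
    "store_link", "size_bytes", "total_chunks",
    "tech", "submitted_by", "last_verified",
}
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]

def validate(entry):
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - entry.keys())]
    if "tech" in entry and entry["tech"] not in ALLOWED_TECH:
        problems.append(f"unknown tech: {entry['tech']}")
    return problems

def human_size(n_bytes):
    """Format a byte count with decimal (power-of-1000) units."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            text = f"{value:.1f}"
            if text.endswith(".0"):
                text = text[:-2]
            return f"{text} {unit}"
        value /= 1000

def leaderboard(entries, key="size_bytes"):
    """Drop invalid entries and sort the rest, largest first."""
    valid = [e for e in entries if not validate(e)]
    return sorted(valid, key=lambda e: e[key], reverse=True)

# Entries mirroring the example rows in this issue.
entries = [
    {"store_link": "s3://example/noaa-hrrr.zarr",
     "size_bytes": 41_000_000_000_000, "total_chunks": 9_500_000,
     "tech": "native-zarr-v2", "submitted_by": "erin",
     "last_verified": "2026-05-20"},
    {"store_link": "s3://example-bucket/era5-reanalysis.zarr",
     "size_bytes": 2_400_000_000_000_000, "total_chunks": 1_200_000_000,
     "tech": "native-zarr-v3", "submitted_by": "alice",
     "last_verified": "2026-05-01",
     "docs_url": "https://example.com/era5-blog"},
    {"store_link": "gs://example/cmip6-mirror.zarr",
     "size_bytes": 850_000_000_000_000, "total_chunks": 410_000_000,
     "tech": "icechunk", "submitted_by": "bob",
     "last_verified": "2026-04-22"},
]

for rank, e in enumerate(leaderboard(entries), start=1):
    print(rank, e["store_link"], human_size(e["size_bytes"]), e["submitted_by"])
```

Passing `key="total_chunks"` gives the chunk-count ranking from the same data.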

Some open questions and my thoughts about them:

  • Verification: We could verify the numbers of public stores either on submission or when rebuilding the page. I think for now this is overkill for a fun addition.
  • At some point we probably want to show only the N largest stores. For now, I'd wait on this until many submissions come in.
  • Should we include datasets that need some sort of sign-up (NASA EDL, etc.) but are free to access? Again, I'd punt this to later.
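
If the first two questions do get picked up later, both could stay small. The sketch below is only illustrative: `top_n` and `numbers_match` are hypothetical helpers, the 5% tolerance is an arbitrary assumption, and the "measured" value would have to come from actually scanning the store, which is not shown here.

```python
import heapq
import math

def top_n(entries, n, key="size_bytes"):
    """Return the n largest entries by the given key, largest first."""
    return heapq.nlargest(n, entries, key=lambda e: e[key])

def numbers_match(reported, measured, rel_tol=0.05):
    """True if a reported stat is within rel_tol of a measured one."""
    return math.isclose(reported, measured, rel_tol=rel_tol)

# Hypothetical entries, trimmed to the fields used here.
entries = [
    {"submitted_by": "erin", "size_bytes": 41_000_000_000_000},
    {"submitted_by": "alice", "size_bytes": 2_400_000_000_000_000},
    {"submitted_by": "bob", "size_bytes": 850_000_000_000_000},
]

print([e["submitted_by"] for e in top_n(entries, 2)])
print(numbers_match(2_400_000_000_000_000, 2_350_000_000_000_000))
```

A tolerance check like this would let stores grow a little between verifications without flagging every entry as stale.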

Happy to implement this, just wanted to float the idea first.
