"Biggest Zarr Store" High Score/Leaderboard #175

  I just thought of an addition to the website that IMO would serve well as marketing material for zarr and also be fun. 
  
  I'd love to propose a new section of the website where users can present their zarr stores (they need to be public) and we can keep track of which zarr store is the largest (either by space on disk or number of chunks), what technology was used (native zarr, virtual icechunk, etc.), and optionally link to docs that, e.g., go into the ingestion code.
  
A simple sortable/filterable table on the site, something like:
 
| Store Link | Total Data on Disk | Total Chunks | Tech Implementation | Submitted By | Last Verified | More Info / Docs |
|---|---|---|---|---|---|---|
| `s3://example-bucket/era5-reanalysis.zarr` | 2.4 PB | 1.2B | native zarr v3 | alice | 2026-05-01 | [Ingestion write-up](https://example.com/era5-blog) |
| `gs://example/cmip6-mirror.zarr` | 850 TB | 410M | icechunk | bob | 2026-04-22 | [Repo]() |
| `s3://example/sentinel2-mosaic` | 312 TB | 88M | virtual icechunk | carol | 2026-05-10 | [Docs]() |
| `https://example.org/goes16.zarr` | 95 TB | 22M | native zarr v3 | dave | 2026-03-15 | — |
| `s3://example/noaa-hrrr.zarr` | 41 TB | 9.5M | native zarr v2 | erin | 2026-05-20 | [Notebook]() |

- **Store Link** — must point to a publicly readable store (anonymous S3/GCS/HTTPS access).
- **Total Data on Disk** — uncompressed-on-disk size, reported by the submitter and ideally reproducible from store metadata.
- **Total Chunks** — total number of chunks across all arrays in the store.
- **Tech Implementation** — controlled vocabulary so the column stays sortable/filterable. Suggested values: `native zarr v2`, `native zarr v3`, `icechunk`, `virtual icechunk`, `virtualizarr`. (Open to additions.)
- **Submitted By** — GitHub handle for attribution and credibility.
- **Last Verified** — date the stats were last confirmed; helps keep numbers honest as stores grow or change.
- **More Info / Docs** — optional link to a blog post, repo, notebook, or docs page explaining the ingestion / use case.
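On the "reproducible from store metadata" point: the chunk count of an array follows from its shape and chunk shape alone, so it can be recomputed without listing objects. A minimal sketch in plain Python, assuming regular chunk grids and no sharding; `chunks_in_array` and `total_chunks` are hypothetical helper names and the shapes are made up. Note that this counts the full chunk grid, so a store with unwritten (empty) chunks would hold fewer objects than this number.

```python
import math

def chunks_in_array(shape, chunk_shape):
    """Number of chunks in the regular chunk grid of one array."""
    if len(shape) != len(chunk_shape):
        raise ValueError("shape and chunk_shape must have the same number of dimensions")
    # Ceiling division per dimension: partial edge chunks count as whole chunks.
    return math.prod(-(-n // c) for n, c in zip(shape, chunk_shape))

def total_chunks(arrays):
    """Sum the chunk counts over (shape, chunk_shape) pairs for all arrays in a store."""
    return sum(chunks_in_array(shape, chunk_shape) for shape, chunk_shape in arrays)

# Made-up (shape, chunk_shape) pairs standing in for a store's array metadata.
example_arrays = [
    ((1000, 721, 1440), (10, 721, 1440)),  # 100 * 1 * 1 chunks
    ((365, 100, 100), (30, 50, 50)),       # 13 * 2 * 2 chunks
]
print(total_chunks(example_arrays))
```

In practice the shapes would come from each array's metadata (e.g. `.shape` and `.chunks` in zarr-python), which I have left out here to keep the sketch dependency-free.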

Something like:
 
```yaml
- store_link: s3://example-bucket/era5-reanalysis.zarr
  size_bytes: 2_400_000_000_000_000
  total_chunks: 1_200_000_000
  tech: native-zarr-v3
  submitted_by: alice
  last_verified: 2026-05-01
  docs_url: https://example.com/era5-blog
```

could probably serve as a user-supplied source for the page.
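To make that a bit more concrete, here is a rough sketch of how a site build step might check such entries and render the table, sorted by size. It assumes the YAML has already been parsed into plain dicts (for example with PyYAML's `yaml.safe_load`); `validate`, `human_size`, and `render_table` are hypothetical names, and the hyphenated `tech` values are my assumed slug spellings of the vocabulary suggested above.

```python
REQUIRED_FIELDS = ("store_link", "size_bytes", "total_chunks", "tech",
                   "submitted_by", "last_verified")
# Assumed slug forms of the suggested vocabulary; not a settled list.
ALLOWED_TECH = {"native-zarr-v2", "native-zarr-v3", "icechunk",
                "virtual-icechunk", "virtualizarr"}

def validate(entry):
    """Return a list of problems with one entry; an empty list means it looks fine."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    if problems:
        return problems
    if entry["tech"] not in ALLOWED_TECH:
        problems.append(f"unknown tech: {entry['tech']}")
    for field in ("size_bytes", "total_chunks"):
        if not isinstance(entry[field], int) or entry[field] <= 0:
            problems.append(f"{field} must be a positive integer")
    return problems

def human_size(num_bytes):
    """Format a byte count with decimal (power-of-1000) units, e.g. '2.4 PB'."""
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{round(value, 1):g} {unit}"
        value /= 1000

def render_table(entries):
    """Render entries as a markdown table, largest store first."""
    rows = ["| Store Link | Total Data on Disk | Total Chunks | Tech Implementation "
            "| Submitted By | Last Verified | More Info / Docs |",
            "|---|---|---|---|---|---|---|"]
    for e in sorted(entries, key=lambda e: e["size_bytes"], reverse=True):
        docs = f"[Docs]({e['docs_url']})" if e.get("docs_url") else "-"
        rows.append(f"| `{e['store_link']}` | {human_size(e['size_bytes'])} "
                    f"| {e['total_chunks']:,} | {e['tech']} | {e['submitted_by']} "
                    f"| {e['last_verified']} | {docs} |")
    return "\n".join(rows)

# Two entries mirroring the example table, as they might look after YAML parsing.
entries = [
    {"store_link": "gs://example/cmip6-mirror.zarr", "size_bytes": 850_000_000_000_000,
     "total_chunks": 410_000_000, "tech": "icechunk", "submitted_by": "bob",
     "last_verified": "2026-04-22"},
    {"store_link": "s3://example-bucket/era5-reanalysis.zarr",
     "size_bytes": 2_400_000_000_000_000, "total_chunks": 1_200_000_000,
     "tech": "native-zarr-v3", "submitted_by": "alice", "last_verified": "2026-05-01",
     "docs_url": "https://example.com/era5-blog"},
]
for entry in entries:
    issues = validate(entry)
    if issues:
        raise ValueError(f"{entry.get('store_link')}: {issues}")
print(render_table(entries))
```

Storing raw `size_bytes` and formatting at render time keeps the column sortable without parsing strings like "2.4 PB".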

**Some open questions and my thoughts about them**:
- Verification: We could verify the numbers for public stores either on submission or when rebuilding the page. I think this is overkill for now, given that this is a fun addition.
- At some point we probably want to show only the N largest stores, but I'd wait on this until many submissions come in.
- Should we include datasets that require some sort of sign-up (NASA EDL etc.) but are free to access? Again, I'd punt this to later.
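If we do get to the "N largest" cutoff, it should be cheap to add later. A small sketch using the standard library's `heapq.nlargest`, assuming entries are dicts with the `size_bytes` and `total_chunks` fields from the YAML example; `largest_stores` is a hypothetical helper and the numbers are the placeholder values from the table above.

```python
import heapq

def largest_stores(entries, n, key="size_bytes"):
    """Return the n largest entries by the given numeric field, largest first."""
    return heapq.nlargest(n, entries, key=lambda e: e[key])

# Placeholder values taken from the example table.
stores = [
    {"store_link": "s3://example/noaa-hrrr.zarr", "size_bytes": 41 * 10**12, "total_chunks": 9_500_000},
    {"store_link": "s3://example-bucket/era5-reanalysis.zarr", "size_bytes": 2_400 * 10**12, "total_chunks": 1_200_000_000},
    {"store_link": "https://example.org/goes16.zarr", "size_bytes": 95 * 10**12, "total_chunks": 22_000_000},
    {"store_link": "gs://example/cmip6-mirror.zarr", "size_bytes": 850 * 10**12, "total_chunks": 410_000_000},
    {"store_link": "s3://example/sentinel2-mosaic", "size_bytes": 312 * 10**12, "total_chunks": 88_000_000},
]

for store in largest_stores(stores, 3):
    print(store["store_link"], store["size_bytes"])
```

Passing `key="total_chunks"` would give the ranking by chunk count instead.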


Happy to implement this, just wanted to float the idea first. 
