Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
125 changes: 125 additions & 0 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# Benchmarks & demos

Standalone scripts that exercise xarray-sql against real data. Each declares its
own dependencies inline (PEP 723) and points `xarray_sql` at this checkout, so
they run with no setup:

```bash
uv run benchmarks/grad_era5.py
```

## `grad_era5.py` — differentiable SQL over ARCO-ERA5

Demonstrates the autograd feature on a real climate archive
([ARCO-ERA5](https://github.com/google-research/arco-era5), read anonymously
from GCS — needs `gcsfs` and network access).

The key idea: a physical quantity is written as an **analytic SQL formula** over
ERA5 variables, and `grad(...)` differentiates that formula **symbolically**,
evaluated at every grid cell. Because each row is an independent point, this is
the relational equivalent of `jax.vmap(jax.grad(f))`. It is *not* a finite-
difference spatial gradient — `grad(f(u, v), u)` is the exact partial derivative
of `f`.

Two worked cases, each checked against an analytic reference:

| Quantity | SQL | Derivative | Check |
| --- | --- | --- | --- |
| Wind speed | `sqrt(power(u,2) + power(v,2))` | `grad(speed, u) = u/speed` | exact |
| Saturation vapour pressure | `A*exp(B*tc/(tc+C))` | `grad(e_s, T)` | closed-form Clausius-Clapeyron slope |

Each query round-trips back to an `xarray.Dataset` via `.to_dataset(...)`.

## `grad_descent.py` — gradient descent as one declarative SQL query

Fits a line `y ~= a*x + b` by minimising the mean squared error, with the
**entire training loop expressed as a single recursive CTE** — no Python
iteration. Two pieces:

- **`grad` compiles the update rule.** `xql.differentiate_sql(loss, "a", cols)`
turns the per-row loss into its symbolic derivative *as SQL text* — the
autograd engine as a calculus compiler.
- **A recursive CTE is the optimiser.** `params(step, a, b)` starts at one row
and each recursion appends the next generation, descending along the gradient
(`AVG` of the compiled rule over the data):

```sql
WITH RECURSIVE params(step, a, b) AS (
SELECT 0, 0.0, 0.0
UNION ALL
SELECT params.step + 1, params.a - lr*AVG(da), params.b - lr*AVG(db)
FROM params CROSS JOIN d WHERE params.step < STEPS
GROUP BY params.step, params.a, params.b)
SELECT * FROM params ORDER BY step
```

So gradient, update, and iteration are all declarative SQL; the trajectory is
the rows of one query. The fit matches numpy's least-squares solution.
Self-contained (no network).

(Why differentiate to text instead of `grad(...)` inside the recursion? `grad`
needs the Substrait round-trip, and Substrait has no recursion — so a `grad`
marker can't live inside a recursive CTE. Differentiating once to plain SQL
sidesteps that.)

## `mnist_mlp.py` — an MNIST MLP as relational tensor algebra

An MLP (196 -> 32 tanh -> 10 softmax on 2x2-pooled 14x14 MNIST) built on one
idea: **a neural net is a chain of tensor contractions (einsums), and an einsum
over coordinate-indexed arrays *is* relational algebra.**

```
C[i,k] = sum_j A[i,j] * B[j,k] <=> JOIN A, B ON A.j = B.j
GROUP BY i, k -> SUM(A.val * B.val)
```

Contracting a shared index is a join on it followed by a grouped `SUM` over the
indices that survive. In xarray-sql an array indexed by named dims is a table
keyed by those dims, so **the dimension names are the join keys**.

**The whole network is one relation.** Two moves get there:

- **Bias folded into the weights (an `nn.Linear`).** Each layer's bias is the
weight of a constant-`1` input, kept as the extra row `inp = width` of the same
weight array — so a layer is a single matrix.
- **A `layer` dimension.** Every layer's weight lives in one
`weight(layer, inp, out)` array, so the forward/backward filter on the `layer`
*column* instead of referencing a table per layer.

So **the architecture is data**: the whole model is one `xr.Dataset` with a
`layer` dim, registered via `from_dataset`. The dim sizes are the layer widths
and the number of layers is the depth — differing neuron counts are just
differing sizes, NaN-padded in the dense array and dropped on the way in (the
relational form is naturally ragged). Change `WIDTHS` (e.g. `196, 64, 32, 10`)
and the same code trains the deeper net.

A small `contract()` helper turns an einsum spec into one query, and a single
generic loop trains a net of any shape:

- **forward** contracts the activation with `weight WHERE layer = L`, adds the
bias row, `tanh` (softmax on the last layer).
- **backward is the *same* operator with indices transposed** — the VJP of a
contraction is a contraction — accumulated into one `gweight` relation, with
`grad(tanh(z), z)` for the only genuinely-calculus part. Even the update is one
query over the whole `weight` relation. Linear algebra is joins; the
derivatives of the nonlinearities are `grad`.

Everything stays relational: every stage is an inspectable table (`a1`, `delta2`,
`gweight`, …); the only hand-written gradient is softmax + cross-entropy's
`delta = softmax - onehot`. Even the training metrics are a table — each logged
step appends a `(step, loss, train_acc, test_acc)` row to a `metrics` relation
rather than a Python list (NN training produces a lot of such data; it belongs in
rows). Evaluation is SQL too (a forward pass + `ROW_NUMBER()` argmax), and the
trained model, predictions, and metrics all come **back out as xarray** via
`to_dataset`. Reaches ~83% test accuracy over 60 steps. Downloads MNIST on first
run.

This is not a numpy replacement — relational matmul carries join overhead a BLAS
inner product doesn't. What it buys is a fully declarative, inspectable pipeline
whose data side is chunked xarray (parallel over the batch, larger-than-memory).
The *outer* training loop stays in Python because the steps must be materialised
between iterations: a multi-layer net can't be one recursive CTE (the recursive
relation may be referenced only once, but the weights are used several times per
step), and unrolling the steps as non-recursive CTEs blows up exponentially
(DataFusion inlines CTEs). The thin loop does exactly that materialisation; all
the maths stays in SQL.
115 changes: 115 additions & 0 deletions benchmarks/grad_descent.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "xarray_sql",
# "xarray",
# "numpy",
# ]
#
# [tool.uv.sources]
# xarray_sql = { path = "..", editable = true }
# ///
"""Gradient descent as a single declarative SQL query.

Fits a line ``y ~= a*x + b`` by minimising the mean squared error — with the
**entire training loop expressed as one recursive CTE**, no Python iteration.

Two pieces:

1. **grad compiles the update rule.** ``differentiate_sql`` turns the per-row
loss into the symbolic derivative *as SQL text* — the autograd engine acting
as a calculus compiler:

da = differentiate_sql("(y-(a*x+b))^2", "a") # -> "-2*((a*x+b)-y)*x", etc.

2. **A recursive CTE is the optimiser.** ``params(step, a, b)`` starts at one
row and each recursion appends the next generation, descending along the
gradient (``AVG`` of the compiled rule over the data):

params.a - lr * AVG(da)

So the whole loop — gradient, update, and iteration — is declarative SQL;
the optimisation trajectory is the rows of one query.

Why two pieces instead of ``grad(...)`` directly inside the recursion? ``grad``
needs the Substrait round-trip, and Substrait has no recursion — so ``grad``
can't live inside a recursive CTE (tracked in #194 / a follow-up). Differentiating
once to plain SQL sidesteps that: the recursive query contains no ``grad`` marker.

Run standalone:

uv run benchmarks/grad_descent.py
"""

from __future__ import annotations

import numpy as np
import xarray as xr

import xarray_sql as xql

# Per-row loss r^2 with residual r = y - (a*x + b), over columns a, b, x, y.
RESIDUAL = "(y - (a * x + b))"
LOSS = f"{RESIDUAL} * {RESIDUAL}"
COLUMNS = ["a", "b", "x", "y"]
LR = 0.4
STEPS = 200


def main() -> None:
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, n)
a_true, b_true = 2.0, -1.0
y = a_true * x + b_true + rng.normal(0.0, 0.01, n)

ctx = xql.XarrayContext()
ctx.from_dataset(
"d",
xr.Dataset(
{"x": (("i",), x), "y": (("i",), y)}, coords={"i": np.arange(n)}
),
chunks={"i": n},
)

# grad compiles the per-row update rule to SQL, once.
da = xql.differentiate_sql(LOSS, "a", COLUMNS)
db = xql.differentiate_sql(LOSS, "b", COLUMNS)
print(f"d(loss)/da = {da}")
print(f"d(loss)/db = {db}\n")

# The entire training loop is one declarative recursive query: each step
# appends the next generation, descending along the SQL-computed gradient.
trajectory = ctx.sql(
f"""
WITH RECURSIVE params(step, a, b) AS (
SELECT 0 AS step, CAST(0.0 AS DOUBLE) AS a, CAST(0.0 AS DOUBLE) AS b
UNION ALL
SELECT params.step + 1 AS step,
params.a - {LR} * AVG({da}) AS a,
params.b - {LR} * AVG({db}) AS b
FROM params CROSS JOIN d
WHERE params.step < {STEPS}
GROUP BY params.step, params.a, params.b
)
SELECT step, a, b FROM params ORDER BY step
"""
).to_pandas()

print("trajectory (every 40th generation):")
print(trajectory.iloc[::40].to_string(index=False))

a, b = float(trajectory["a"].iloc[-1]), float(trajectory["b"].iloc[-1])
a_ols, b_ols = np.polyfit(x, y, 1)
print(
f"\nSQL gradient descent: a={a:.4f} b={b:.4f} ({len(trajectory)} generations)"
)
print(f"least-squares (numpy): a={a_ols:.4f} b={b_ols:.4f}")
assert abs(a - a_ols) < 1e-2 and abs(b - b_ols) < 1e-2
print(
"\nOK: a single recursive-CTE query fit the line to the OLS solution."
)


if __name__ == "__main__":
main()
Loading
Loading