Add generalized (relations …) CSV loading construct#259
Conversation
Adds a new `(relations …)` construct on `CSVData` alongside the legacy `(columns …)` form: a shared set of key columns (or the special `METADATA$KEY`) plus one or more output relations, each with its own (possibly empty) value columns, with optional CDC `(inserts …)`/`(deletes …)` grouping. - proto: `NamedColumn` / `OutputRelation` / `Relations` messages + an optional `Relations relations` field on `CSVData` (mutually exclusive with `columns`) - grammar: `relations` / `relation_keys` / `output_relation` / `named_column` rules; `relation_body` returns a concrete `Relations` to keep the Go parser type-stable - regenerated Python / Julia / Go parsers, pretty-printers, and protobuf bindings; `global_ids` + equality extended for the new messages (Julia SDK) - `.lqp` fixtures (binary edge, arity-4 edge, two-relation split, CDC) + regenerated bin / pretty / pretty_debug snapshots `make test` green across Python, Julia, and Go. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Align the CDC relations fixture with the rest of the edge-themed suite: use s3://bucket/edges.csv (matching relations_edge_binary / relations_edge_arity4) and rename the delta relations to :weight_ins / :weight_del, since they carry a weight column. Regenerated the .bin and pretty/pretty_debug snapshot goldens (name-derived relation hashes updated accordingly). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts: # sdks/go/src/parser.go # sdks/julia/LogicalQueryProtocol.jl/src/parser.jl # sdks/python/src/lqp/gen/parser.py
Regenerate parsers, protobuf bindings, pretty printers, and test
fixtures (.bin + pretty/pretty_debug snapshots) from the post-merge
grammar and protos. Parsers changed; protobuf bindings and printers
were already in sync. Test artifacts pick up master's new
:csv_compression default ("auto" -> ""). All Python/Go/Julia tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| // A single named CSV column with its type. Used to describe both shared key columns and | ||
| // per-relation value columns in the generalized `Relations` loading construct. | ||
| message NamedColumn { | ||
| string name = 1; // CSV column name (e.g. "src"); special name "METADATA$KEY" => derived hash | ||
| Type type = 2; // Column type | ||
| } | ||
|
|
||
| // One output relation: the shared keys plus this relation's own (possibly empty) value columns. | ||
| message OutputRelation { | ||
| RelationId target_id = 1; // Output relation path | ||
| repeated NamedColumn values = 2; // Value columns for this relation (may be empty) | ||
| } | ||
|
|
||
| // Generalized CSV loading: a shared set of key columns and one or more output relations. | ||
| // CDC vs non-CDC is implied by which group is populated: | ||
| // - `relations` populated => non-CDC outputs | ||
| // - `inserts`/`deletes` => CDC insert/delete groups | ||
| message Relations { | ||
| repeated NamedColumn keys = 1; // Shared key columns (name "METADATA$KEY" => derived hash) | ||
| repeated OutputRelation relations = 2; // Non-CDC outputs | ||
| repeated OutputRelation inserts = 3; // CDC insert group | ||
| repeated OutputRelation deletes = 4; // CDC delete group | ||
| } | ||
|
|
There was a problem hiding this comment.
I was wondering if relations was maybe too generic to burn on this use case. Seems like you arrived at a similar conclusion, given that you went with OutputRelation. But Output also has a different meaning in LQP already. How about TargetRelations and TargetRelation?
Let's also keep this generic and not tied to CSV specifically, I assume we can reuse this for other types of external data. I don't think anything you have above is specific to CSV, except for the comments.
There was a problem hiding this comment.
Sure.
Yeah, the (relations ...) construct is not specific to CSV. I was planning to use the same for Iceberg. I will update the comments.
There was a problem hiding this comment.
Not sure how I missed that, but I just noticed that the outputs keyword that I used in the examples above is missing from the grammar. 🤦♂️ So at the moment it's
(relations
(keys
(column "src" INT)
(column "dst" INT))
(relation :edge))
instead of
(relations
(keys
(column "src" INT)
(column "dst" INT))
(outputs
(relation :edge)))
and
(relations
(keys
(column "src" INT)
(column "dst" INT))
(inserts
(relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
(deletes
(relation :edge_deletions)))
instead of
(relations
(keys
(column "src" INT)
(column "dst" INT))
(outputs
(inserts
(relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
(deletes
(relation :edge_deletions))))
Do you have a preference between leaving it as it is, adding (outputs ...), or maybe using (targets ...)?
|
Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations? |
Rename the generalized CSV-loading proto messages Relations -> TargetRelations and OutputRelation -> TargetRelation, plus the matching grammar nonterminals (relations -> target_relations, output_relation -> target_relation) and all comments. The LQP s-expression syntax is unchanged: the (relations …)/(relation …)/(inserts …)/(deletes …) keywords, proto field names, and wire format are all preserved (no .bin or pretty-snapshot changes). Regenerated parsers, printers, and protobuf bindings for all three SDKs; updated hand-written Julia equality. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generalized loading construct (NamedColumn / TargetRelation / TargetRelations) is not CSV-specific — it will be used for other input types too. Remove "CSV" from those messages' comments; CSV wording stays on the CSV-specific CSVData message. Regenerated the Go binding (the only generated SDK that embeds proto comments). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generalized-loading SDK types added in this branch had no coverage in equality_tests.jl. Add testitems for NamedColumn, TargetRelation, and TargetRelations following the existing equality/inequality/hash/ reflexivity/symmetry/transitivity pattern, including the non-CDC (relations) and CDC (inserts/deletes) groupings of TargetRelations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| // - `relations` populated => non-CDC outputs | ||
| // - `inserts`/`deletes` => CDC insert/delete groups |
There was a problem hiding this comment.
Does this mean that the relations and the inserts/deletes portion of TargetRelations are mutually exclusive? e.g. either one or the other is populated? If so, maybe we can express this in a OneOf of CDC/non-CDC?
There was a problem hiding this comment.
Yeah, they should be mutually exclusive. I will look into that.
There was a problem hiding this comment.
I changed it to a OneOf.
Sure. I asked Claude to generate a few examples and added some comments: 1. Binary
|
Replace the flat relations/inserts/deletes fields on TargetRelations
with a `oneof body { PlainTargets plain; CdcTargets cdc; }`, making the
mutually-exclusive plain (non-CDC) and CDC modes explicit instead of
inferred from which repeated field happens to be populated. PlainTargets
wraps `targets`; CdcTargets wraps `inserts`/`deletes` (oneof can't hold
repeated fields directly, hence the wrapper messages).
The LQP s-expression syntax is unchanged — only the grammar's
construct/deconstruct helpers and the deconstruct guard (now switching on
the oneof case via has_proto_field) change, so pretty/pretty_debug
snapshots are untouched; only .bin wire encodings shift. Regenerated all
three SDKs; updated hand-written Julia equality (PlainTargets/CdcTargets
+ oneof-based TargetRelations), properties global_ids, and equality
tests. All Python/Go/Julia suites pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Adds a new
(relations …)construct onCSVData, alongside the existing(columns …)(GNF) form, to support more general CSV loading: a shared set of key columns (or the specialMETADATA$KEY) plus one or more output relations, each with its own (possibly empty) value columns, with optional CDC grouping into(inserts …)/(deletes …).This lets a load:
The legacy
(columns …)form is untouched and remains fully supported (the two are mutually exclusive on a givenCSVData).Some examples:
Changes
logic.proto):NamedColumn,OutputRelation,Relationsmessages; optionalRelations relationsonCSVData.grammar.y):relations/relation_keys/output_relation/named_columnrules.relation_bodyreturns a concreteRelations(not a tuple) so the Go parser stays type-stable.global_idsand==/hash/isequalextended for the new messages.relations_edge_binary,relations_edge_arity4,relations_split,relations_cdc(+ regeneratedbin/pretty/pretty_debugsnapshots).🤖 Generated with Claude Code