Speed up and harden cache fingerprinting: xxh3_128 + vectorized DataFrame hashing + collision fixes by Dev-iL · Pull Request #1619 · apache/hamilton

Dev-iL · 2026-06-01T11:10:09Z

Follow-up to: #1616

What this does

Cache fingerprinting (hamilton/caching/fingerprinting.py) maps a Python value to a data_version string used in cache keys. This PR makes it faster and more correct, without weakening the collision-prevention guarantees added in #1616.

1. Swap the hash algorithm — md5/sha224 → xxhash.xxh3_128. All hashing now routes through a single _hash_bytes helper wrapping xxhash.xxh3_128(data).digest(), reusing the existing _compact_hash base64url encoding.

2. Vectorize the DataFrame paths (the real bottleneck — see benchmark):

- pandas: hash the hash_pandas_object(obj).values uint64 buffer in one shot instead of round-tripping through .to_dict() and an ordered per-row hash_mapping. Column names + dtypes (schema) are folded in; the path stays order-sensitive (the old docstring claiming row order "doesn't matter" was incorrect and is now fixed).
- polars: hash the hash_rows().to_numpy() buffer in one shot instead of .to_list() into a per-element hash_sequence loop. The schema_hash + row_hash combine from #1616 ("Include metadata in numpy/polars cache fingerprints to prevent collisions") is preserved.
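The pandas path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `fingerprint_frame` and `hash_bytes` are hypothetical names, the schema encoding and the `\x00` separator are arbitrary choices, and `hashlib.blake2b` with a 16-byte digest stands in for `xxhash.xxh3_128` so the snippet needs only pandas.

```python
import hashlib

import pandas as pd


def hash_bytes(data: bytes) -> str:
    # Stand-in 128-bit digest; the PR uses xxhash.xxh3_128 here.
    return hashlib.blake2b(data, digest_size=16).hexdigest()


def fingerprint_frame(df: pd.DataFrame) -> str:
    # One uint64 per row, computed by pandas without a Python-level row loop.
    row_hashes = pd.util.hash_pandas_object(df).values
    # Row hashes alone ignore column names, so fold the schema in explicitly.
    schema = repr([(str(name), str(dtype)) for name, dtype in df.dtypes.items()])
    # Hash the schema plus the whole row-hash buffer in one call.
    # Row order changes the buffer, so the result is order-sensitive.
    return hash_bytes(schema.encode("utf-8") + b"\x00" + row_hashes.tobytes())


if __name__ == "__main__":
    frame = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
    print(fingerprint_frame(frame))
    print(fingerprint_frame(frame.rename(columns={"a": "x"})))
```

The two printed fingerprints differ because only the column name changed, which is the kind of collision the schema component is meant to rule out.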

3. Close confirmed collisions. Primitives and bytes now carry a type tag, so 1, "1", b"1", 1.0, and "1.0" hash distinctly; pandas frames with identical values but different column names or dtypes no longer collide.
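To see why a type tag matters, here is a small sketch comparing an untagged scheme with a tagged one. The function names, the `typename:` prefix format, and the use of `hashlib.blake2b` as a stand-in for `xxh3_128` are all assumptions made for illustration; the PR's actual tag layout may differ.

```python
import hashlib


def hash_bytes(data: bytes) -> str:
    # Stand-in 128-bit digest; the PR uses xxhash.xxh3_128 here.
    return hashlib.blake2b(data, digest_size=16).hexdigest()


def _payload(value) -> bytes:
    # bytes are hashed as-is; other primitives via their string form.
    return value if isinstance(value, bytes) else str(value).encode("utf-8")


def fingerprint_untagged(value) -> str:
    # Only the payload is hashed, so 1, "1" and b"1" all become b"1".
    return hash_bytes(_payload(value))


def fingerprint_tagged(value) -> str:
    # Prefix the payload with the type name so equal-looking values
    # of different types produce different hash inputs.
    tag = type(value).__name__.encode("ascii")
    return hash_bytes(tag + b":" + _payload(value))


if __name__ == "__main__":
    values = [1, "1", b"1", 1.0, "1.0"]
    print(len({fingerprint_untagged(v) for v in values}))
    print(len({fingerprint_tagged(v) for v in values}))
```

In this sketch the untagged scheme yields two distinct fingerprints for the five values, while the tagged scheme yields five.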

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the collision fixes alongside the algorithm swap.

Benchmark results

scripts/benchmark_fingerprinting.py fingerprints a 500,000-row, 3-column DataFrame, comparing the old per-row approach against the new vectorized path (best of 3 runs):

| Path | Time |
| --- | --- |
| Old (per-row `to_dict()` loop) | ~3,200–3,600 ms |
| New (vectorized buffer hash) | ~210–260 ms |
| Speedup | ~14–15× |

The structural "no per-row Python loop" assertion is the hard correctness gate; the benchmark is corroborating evidence with a generous ≥5× floor to avoid timing flakiness.

Why xxh3_128 is a sound replacement for the longer sha224

The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences, mappings, and sets. Replacing the wider sha224 with a 128-bit digest is safe here for three reasons:

1. Collision resistance is about digest width, not cryptographic strength. For a fingerprint, the only property that matters is the probability that two distinct inputs map to the same digest. For a well-distributed n-bit hash, that is governed by the birthday bound (~2^(n/2)). These fingerprints are never a security boundary: there is no adversary choosing inputs to force a collision, and inputs are ordinary pipeline values. So sha224's extra resistance to deliberate collision attacks (its reason for being longer) buys nothing in this use case.
2. 128 bits is astronomically sufficient for cache keys. The birthday bound for a 128-bit digest is ~2^64 (≈1.8×10¹⁹) distinct values before a roughly 50% collision chance. A cache will never hold anywhere near that many fingerprints; the realistic collision probability is effectively zero. sha224's 2^112 headroom is far past the point of any practical difference for this workload.
3. xxh3_128 is purpose-built for this. It is a fast, non-cryptographic hash with strong dispersion (passes the SMHasher quality suite), and at 128 bits it matches the width md5 was already trusted for in the same module. The swap therefore keeps the former md5 paths at the same digest width and leaves the former sha224 paths comfortably collision-safe, while removing the cryptographic-hashing overhead we were paying for no benefit.
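The birthday-bound argument can be checked numerically. The sketch below uses the standard approximation p ≈ 1 − exp(−k(k−1) / 2^(n+1)) for k uniformly distributed n-bit digests; the function name and the one-billion-entry cache size are illustrative assumptions, not figures from the PR.

```python
import math


def collision_probability(num_items: int, bits: int) -> float:
    """Approximate chance of at least one collision among num_items
    uniformly random digests of the given bit width."""
    exponent = -num_items * (num_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # A hypothetical, very large cache of one billion fingerprints.
    print(collision_probability(10**9, 128))
    # The regime where collisions become likely for a 128-bit digest.
    print(collision_probability(2**64, 128))
    # The same item count under sha224's 224-bit width.
    print(collision_probability(2**64, 224))
```

With a billion entries the 128-bit probability comes out on the order of 1e-21. At exactly 2^64 entries this approximation gives about 0.39, with the 50% mark reached near 1.18 × 2^64, which is in line with the "~2^64" rule of thumb.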

Net: we trade unused cryptographic headroom for a large throughput win, with collision safety that remains far beyond what any cache will ever exercise.

Dependency & licensing

- Adds xxhash>=0.8.0 to core runtime dependencies (xxh3_128 was introduced in 0.8.0). Fingerprinting is imported eagerly via the caching adapter, so this is a hard dependency, not an optional extra.
- xxhash (the python-xxhash package) is BSD-2-Clause; its copyright and licence text are appended to LICENSE in the same style as the existing third-party (MIT databackend) entry.

Testing

- Pinned literal-digest tests were recomputed against the new algorithm (generated by running the code, not hand-written).
- New must-differ tests (cross-type primitives; pandas frames with different column names; pandas/polars frames with different dtypes) and must-match tests (identical frames; list == tuple sequence equality).
- Full caching suite passes (115 non-polars tests). Polars-dependent tests are exercised on CI; they can't run in every local environment (the polars wheel crashes on hosts lacking certain CPU features), so they're guarded with pytest.importorskip.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

Replace the md5/sha224 hashes in the caching fingerprinting module with the non-cryptographic xxhash.xxh3_128, routed through a single shared _hash_bytes helper. xxh3_128 produces a 16-byte digest (24 base64url chars, identical width to the md5 already in use), so collision resistance is preserved while throughput on buffer-bound paths rises substantially.

Vectorize the DataFrame paths:
- pandas: hash the hash_pandas_object(obj).values uint64 buffer in one shot instead of round-tripping through .to_dict() and a per-row Python loop; fold column names + dtypes (schema) into the hash; keep the path order-sensitive and correct the misleading docstring.
- polars: hash the hash_rows().to_numpy() buffer in one shot instead of .to_list() into a per-element hash_sequence loop; keep the schema_hash + row_hash combine introduced in apache#1616.

Close confirmed fingerprint collisions by tagging primitives and bytes with their type, so 1, "1", b"1", 1.0 and "1.0" hash distinctly, and pandas frames with identical values but different column names or dtypes no longer collide.

Recompute the pinned literal-digest tests against the new algorithm and add must-differ / must-match collision tests plus a benchmark script demonstrating the pandas speedup (~14x on a 500k-row frame).

Declare xxhash>=0.8.0 as a core runtime dependency (xxh3_128 was added in 0.8.0); fingerprinting is imported eagerly via the caching adapter, so it must be a hard dependency rather than an optional extra.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

xxhash (the python-xxhash package) is a new runtime dependency licensed under BSD-2-Clause, whose terms require reproducing the copyright notice and licence text. Append it to LICENSE in the same style as the existing third-party (MIT databackend) entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jernejfrank

Looks good, the speedup is amazing! My one concern is whether invalidating existing caches counts as a breaking change for users who relied on them to cache heavy computations.

Dev-iL · 2026-06-02T03:09:55Z

> Looks good, the speedup is amazing! My one concern is whether invalidating existing caches counts as a breaking change for users who relied on them to cache heavy computations.

Valid concern! Several reasons why I think it's alright:

- This is a follow-up to @skrawcz's PR, which already invalidated a lot of caches. Since there has been no release in between, introducing this change now carries no significant extra penalty.
- Some hashes genuinely need recomputing: values that should be distinct currently share the same cache entry.
- The next release will be the first under "apache", so users might expect (and accept) such changes as "the price of progress".

jernejfrank

Good point on #1616, makes sense to add this now!

Dev-iL and others added 2 commits June 1, 2026 13:46

jernejfrank reviewed Jun 1, 2026


Dev-iL requested review from elijahbenizzy and skrawcz June 2, 2026 06:14

jernejfrank approved these changes Jun 2, 2026

Dev-iL wants to merge 2 commits into apache:main from SummitSG-LLC:2605/xxh3-fingerprinting
