Speed up and harden cache fingerprinting: xxh3_128 + vectorized DataFrame hashing + collision fixes #1619
Open
Dev-iL wants to merge 2 commits into
Replace the md5/sha224 hashes in the caching fingerprinting module with the non-cryptographic xxhash.xxh3_128, routed through a single shared _hash_bytes helper. xxh3_128 produces a 16-byte digest (24 base64url chars, identical width to the md5 already in use), so collision resistance is preserved while throughput on buffer-bound paths rises substantially.

Vectorize the DataFrame paths:
- pandas: hash the hash_pandas_object(obj).values uint64 buffer in one shot instead of round-tripping through .to_dict() and a per-row Python loop; fold column names + dtypes (schema) into the hash; keep the path order-sensitive and correct the misleading docstring.
- polars: hash the hash_rows().to_numpy() buffer in one shot instead of .to_list() into a per-element hash_sequence loop; keep the schema_hash + row_hash combine introduced in apache#1616.

Close confirmed fingerprint collisions by tagging primitives and bytes with their type, so 1, "1", b"1", 1.0 and "1.0" hash distinctly, and pandas frames with identical values but different column names or dtypes no longer collide.

Recompute the pinned literal-digest tests against the new algorithm and add must-differ / must-match collision tests plus a benchmark script demonstrating the pandas speedup (~14x on a 500k-row frame).

Declare xxhash>=0.8.0 as a core runtime dependency (xxh3_128 was added in 0.8.0); fingerprinting is imported eagerly via the caching adapter, so it must be a hard dependency rather than an optional extra.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xxhash (the python-xxhash package) is a new runtime dependency licensed under BSD-2-Clause, whose terms require reproducing the copyright notice and licence text. Append it to LICENSE in the same style as the existing third-party (MIT databackend) entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jernejfrank (Contributor) reviewed on Jun 1, 2026 and left a comment:
Looks good, the speedup is amazing! My one concern is whether invalidating existing caches is a breaking change for users who relied on them to cache heavy computations.
Dev-iL (Collaborator, Author) replied:
Valid concern! Several reasons why I think it's alright:
jernejfrank (Contributor) approved these changes on Jun 2, 2026 and left a comment:
Good point on #1616, makes sense to add this now!
Follow-up to: #1616
What this does
Cache fingerprinting (`hamilton/caching/fingerprinting.py`) maps a Python value to a `data_version` string used in cache keys. This PR makes it faster and more correct, without weakening the collision-prevention guarantees added in #1616.

1. Swap the hash algorithm: md5/sha224 → `xxhash.xxh3_128`. All hashing now routes through a single `_hash_bytes` helper wrapping `xxhash.xxh3_128(data).digest()`, reusing the existing `_compact_hash` base64url encoding.

2. Vectorize the DataFrame paths (the real bottleneck; see benchmark):
   - pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one shot instead of round-tripping through `.to_dict()` and an ordered per-row `hash_mapping`. Column names + dtypes (schema) are folded in; the path stays order-sensitive (the old docstring claiming row order "doesn't matter" was incorrect and is now fixed).
   - polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of `.to_list()` into a per-element `hash_sequence` loop. The `schema_hash + row_hash` combine from #1616 (Include metadata in numpy/polars cache fingerprints to prevent collisions) is preserved.

3. Close confirmed collisions. Primitives and bytes now carry a type tag, so `1`, `"1"`, `b"1"`, `1.0`, and `"1.0"` hash distinctly; pandas frames with identical values but different column names or dtypes no longer collide.

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the collision fixes alongside the algorithm swap.
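The type-tagging idea in item 3 can be illustrated with a minimal sketch. This is not the module's code: `hash_bytes` and `hash_primitive_tagged` are hypothetical names, the `type name + ":"` tag format is an assumption, and a `hashlib.blake2b` 128-bit digest stands in when `xxhash` is not installed.

```python
import base64
import hashlib

try:
    import xxhash  # third-party; xxh3_128 is available from python-xxhash 0.8.0

    def _digest128(data: bytes) -> bytes:
        return xxhash.xxh3_128(data).digest()

except ImportError:
    # Stand-in so the sketch runs without xxhash: a different 128-bit digest.
    def _digest128(data: bytes) -> bytes:
        return hashlib.blake2b(data, digest_size=16).digest()


def hash_bytes(data: bytes) -> str:
    """Hash raw bytes and encode the 16-byte digest as base64url text."""
    return base64.urlsafe_b64encode(_digest128(data)).decode("ascii")


def hash_primitive_tagged(value) -> str:
    """Prefix the payload with the value's type name (assumed tag format)."""
    if isinstance(value, bytes):
        payload = value
    else:
        payload = str(value).encode("utf-8")
    tag = type(value).__name__.encode("ascii")
    return hash_bytes(tag + b":" + payload)


# Values whose untagged payloads overlap ("1" vs 1 vs b"1", "1.0" vs 1.0).
values = [1, "1", b"1", 1.0, "1.0"]
fingerprints = [hash_primitive_tagged(v) for v in values]
print(len(set(fingerprints)) == len(values))
```

Without the tag, `1`, `"1"`, and `b"1"` would all reduce to the same payload bytes and therefore the same fingerprint; the tag is what separates them.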
Benchmark results
`scripts/benchmark_fingerprinting.py` fingerprints a 500,000-row, 3-column DataFrame, comparing the old per-row approach (the `to_dict()` loop) against the new vectorized path (best of 3 runs). The pandas speedup is roughly 14x on this frame.

The structural "no per-row Python loop" assertion is the hard correctness gate; the benchmark is corroborating evidence with a generous ≥5× floor to avoid timing flakiness.
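The shape of that comparison can be sketched with the standard library alone. This is not `scripts/benchmark_fingerprinting.py`: an `array` of unsigned 64-bit integers stands in for the uint64 row-hash buffer, `hashlib.blake2b` stands in for `xxh3_128`, and all function names are hypothetical. Absolute timings will differ from the real pandas benchmark.

```python
import hashlib
import timeit
from array import array


def _digest(data: bytes) -> str:
    # 128-bit stand-in digest; the PR itself uses xxhash.xxh3_128.
    return hashlib.blake2b(data, digest_size=16).hexdigest()


def fingerprint_per_element(values) -> str:
    """Old shape: hash each element in a Python loop, then hash the joined digests."""
    parts = [_digest(str(v).encode("utf-8")) for v in values]
    return _digest("".join(parts).encode("ascii"))


def fingerprint_one_shot(values: array) -> str:
    """New shape: hash the whole contiguous buffer with a single call."""
    return _digest(values.tobytes())


# Stand-in for a uint64 row-hash buffer (one entry per row).
row_hashes = array("Q", range(200_000))


def best_of(fn, runs: int = 3) -> float:
    """Return the fastest of `runs` single executions, as in a best-of-3 benchmark."""
    return min(timeit.repeat(lambda: fn(row_hashes), number=1, repeat=runs))


loop_s = best_of(fingerprint_per_element)
shot_s = best_of(fingerprint_one_shot)
print(f"per-element: {loop_s:.4f}s  one-shot: {shot_s:.4f}s  ratio: {loop_s / shot_s:.1f}x")
```

Taking the minimum over several runs reduces noise from other processes, which is why a loose floor on the ratio is less flaky than asserting an exact speedup.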
Why xxh3_128 is a sound replacement for the longer sha224
The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences, mappings, and sets. Replacing the wider sha224 with a 128-bit digest is safe here for three reasons:
- `xxh3_128` is purpose-built for this. It is a fast, non-cryptographic hash with strong dispersion (it passes the SMHasher quality suite), and at 128 bits it matches the width md5 was already trusted for in the same module. The swap therefore keeps the former md5 paths at par and leaves the former sha224 paths comfortably collision-safe, while removing the cryptographic-hashing overhead we were paying for no benefit.

Net: we trade unused cryptographic headroom for a large throughput win, with collision safety that remains far beyond what any cache will ever exercise.
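To put a rough number on "far beyond what any cache will ever exercise", the standard birthday (union) bound can be evaluated for a 128-bit digest. The sketch assumes an idealized hash with uniformly distributed outputs and only covers accidental collisions, not adversarially chosen inputs; the function name is illustrative.

```python
from fractions import Fraction


def collision_probability_bound(n_items: int, bits: int) -> Fraction:
    """Union bound: P(any collision) <= (number of pairs) / 2**bits."""
    pairs = n_items * (n_items - 1) // 2
    return Fraction(pairs, 2**bits)


# Upper bounds for a few cache sizes with a 128-bit digest.
for n in (10**6, 10**9, 10**12):
    bound = collision_probability_bound(n, 128)
    print(f"n={n:.0e}: P(collision) <= {float(bound):.3e}")
```

Even for a billion distinct fingerprints the bound stays on the order of 1e-21 under this assumption.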
Dependency & licensing
- Adds `xxhash>=0.8.0` to core runtime dependencies (`xxh3_128` was introduced in 0.8.0). Fingerprinting is imported eagerly via the caching adapter, so this is a hard dependency, not an optional extra.
- xxhash (the `python-xxhash` package) is BSD-2-Clause; its copyright and licence text are appended to `LICENSE` in the same style as the existing third-party (MIT databackend) entry.

Testing
- … (`list == tuple` sequence equality).
- … `pytest.importorskip`.

Checklist