
Speed up and harden cache fingerprinting: xxh3_128 + vectorized DataFrame hashing + collision fixes #1619

Open
Dev-iL wants to merge 2 commits into apache:main from SummitSG-LLC:2605/xxh3-fingerprinting

Collaborator

Dev-iL commented Jun 1, 2026

Follow-up to: #1616

What this does

Cache fingerprinting (hamilton/caching/fingerprinting.py) maps a Python value to a data_version string used in cache keys. This PR makes it faster and more correct, without weakening the collision-prevention guarantees added in #1616.

1. Swap the hash algorithm — md5/sha224 → xxhash.xxh3_128. All hashing now routes through a single _hash_bytes helper wrapping xxhash.xxh3_128(data).digest(), reusing the existing _compact_hash base64url encoding.

2. Vectorize the DataFrame paths (the real bottleneck — see benchmark):

  • pandas: hash the hash_pandas_object(obj).values uint64 buffer in one shot instead of round-tripping through .to_dict() and an ordered per-row hash_mapping. Column names + dtypes (schema) are folded in; the path stays order-sensitive (the old docstring claiming row order "doesn't matter" was incorrect and is now fixed).
  • polars: hash the hash_rows().to_numpy() buffer in one shot instead of .to_list() into a per-element hash_sequence loop. The schema_hash + row_hash combination from #1616 (Include metadata in numpy/polars cache fingerprints to prevent collisions) is preserved.

3. Close confirmed collisions. Primitives and bytes now carry a type tag, so 1, "1", b"1", 1.0, and "1.0" hash distinctly; pandas frames with identical values but different column names or dtypes no longer collide.

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the collision fixes alongside the algorithm swap.

Benchmark results

scripts/benchmark_fingerprinting.py fingerprints a 500,000-row, 3-column DataFrame, comparing the old per-row approach against the new vectorized path (best of 3 runs):

| Path | Time |
| --- | --- |
| Old (per-row to_dict() loop) | ~3,200–3,600 ms |
| New (vectorized buffer hash) | ~210–260 ms |
| Speedup | ~14–15× |

The structural "no per-row Python loop" assertion is the hard correctness gate; the benchmark is corroborating evidence with a generous ≥5× floor to avoid timing flakiness.
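The two shapes being compared can be sketched roughly as follows. This is an approximation for illustration, not the code in hamilton/caching/fingerprinting.py: the function names `fingerprint_per_row` and `fingerprint_vectorized` are hypothetical, the way the schema is folded in is an assumption, and `hashlib.blake2b` with a 16-byte digest stands in for `xxhash.xxh3_128` so that only pandas is required.

```python
import hashlib

import pandas as pd


def _hash_bytes(data: bytes) -> bytes:
    # Stand-in 16-byte digest; the PR uses xxhash.xxh3_128(data).digest().
    return hashlib.blake2b(data, digest_size=16).digest()


def fingerprint_per_row(df: pd.DataFrame) -> str:
    """Rough approximation of the old shape: a Python-level loop over rows."""
    hasher = hashlib.blake2b(digest_size=16)
    for row in df.to_dict(orient="records"):  # one dict per row
        hasher.update(repr(list(row.items())).encode("utf-8"))
    return hasher.hexdigest()


def fingerprint_vectorized(df: pd.DataFrame) -> str:
    """Sketch of the new shape: hash one contiguous uint64 buffer."""
    # One uint64 per row, computed inside pandas rather than in a Python loop.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    # Row hashes do not cover column names, so fold the schema in separately.
    schema = repr([(str(name), str(dtype)) for name, dtype in df.dtypes.items()])
    return _hash_bytes(_hash_bytes(schema.encode("utf-8")) + row_hashes.tobytes()).hex()


df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
same = fingerprint_vectorized(df) == fingerprint_vectorized(df.copy())
renamed = fingerprint_vectorized(df.rename(columns={"a": "x"}))
reordered = fingerprint_vectorized(df.iloc[::-1])
```

In this sketch, identical frames agree, while renaming a column, changing a dtype, or reversing the rows changes the fingerprint, matching the order-sensitive and schema-aware behavior described above.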

Why xxh3_128 is a sound replacement for the longer sha224

The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences, mappings, and sets. Replacing the wider sha224 with a 128-bit digest is safe here for three reasons:

  • Collision resistance is about digest width, not cryptographic strength. For a fingerprint, the only property that matters is the probability that two distinct inputs map to the same digest. For a well-distributed n-bit hash that's governed by the birthday bound (~2^(n/2)). These fingerprints are never a security boundary — there is no adversary choosing inputs to force a collision; inputs are ordinary pipeline values. So sha224's extra resistance to deliberate collision attacks (its reason for being longer) buys nothing in this use case.
  • 128 bits is astronomically sufficient for cache keys. The birthday bound for a 128-bit digest is ~2^64 (≈1.8×10¹⁹) distinct values before a ~50% collision chance. A cache will never hold anywhere near that many fingerprints; the realistic collision probability is effectively zero. sha224's 2^112 headroom is far past the point of any practical difference for this workload.
  • xxh3_128 is purpose-built for this. It is a fast, non-cryptographic hash with strong dispersion (passes the SMHasher quality suite), and at 128 bits it matches the width md5 was already trusted for in the same module. The swap therefore keeps the former md5 paths at the same digest width and leaves the former sha224 paths comfortably collision-safe, while removing the cryptographic-hashing overhead we were paying for no benefit.

Net: we trade unused cryptographic headroom for a large throughput win, with collision safety that remains far beyond what any cache will ever exercise.
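The birthday-bound argument above can be checked with a few lines of arithmetic. The sketch below uses the standard approximation p ≈ 1 − exp(−n(n−1) / 2^(b+1)) for n uniformly distributed b-bit digests; the function name `collision_probability` and the one-trillion-entry cache size are illustrative assumptions, not figures from the PR.

```python
import math


def collision_probability(num_items: int, bits: int) -> float:
    """Approximate chance of at least one collision among num_items b-bit digests."""
    exponent = -num_items * (num_items - 1) / 2 ** (bits + 1)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values


# A deliberately oversized cache: one trillion distinct fingerprints.
n = 10**12
p_128 = collision_probability(n, 128)  # xxh3_128 / md5 width
p_224 = collision_probability(n, 224)  # sha224 width

# At the 2^64 scale the collision chance reaches tens of percent.
p_at_bound = collision_probability(2**64, 128)
```

Even at a trillion entries, `p_128` is on the order of 1e-15, so the extra width of sha224 changes nothing in practice. The probability only reaches tens of percent once the number of fingerprints approaches 2^64.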

Dependency & licensing

  • Adds xxhash>=0.8.0 to core runtime dependencies (xxh3_128 was introduced in 0.8.0). Fingerprinting is imported eagerly via the caching adapter, so this is a hard dependency, not an optional extra.
  • xxhash (the python-xxhash package) is BSD-2-Clause; its copyright and licence text are appended to LICENSE in the same style as the existing third-party (MIT databackend) entry.

Testing

  • Pinned literal-digest tests recomputed against the new algorithm (run, not hand-written).
  • New must-differ tests (cross-type primitives; pandas different column-names; pandas/polars different dtypes) and must-match tests (identical frames; list == tuple sequence equality).
  • Full caching suite passes (115 non-polars tests). Polars-dependent tests are exercised on CI — they can't run in every local environment (the polars wheel crashes on hosts lacking certain CPU features), so they're guarded with pytest.importorskip.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Dev-iL and others added 2 commits June 1, 2026 13:46
Replace the md5/sha224 hashes in the caching fingerprinting module with
the non-cryptographic xxhash.xxh3_128, routed through a single shared
_hash_bytes helper. xxh3_128 produces a 16-byte digest (24 base64url
chars, identical width to the md5 already in use), so collision
resistance is preserved while throughput on buffer-bound paths rises
substantially.

Vectorize the DataFrame paths:
- pandas: hash the hash_pandas_object(obj).values uint64 buffer in one
  shot instead of round-tripping through .to_dict() and a per-row Python
  loop; fold column names + dtypes (schema) into the hash; keep the path
  order-sensitive and correct the misleading docstring.
- polars: hash the hash_rows().to_numpy() buffer in one shot instead of
  .to_list() into a per-element hash_sequence loop; keep the
  schema_hash + row_hash combine introduced in apache#1616.

Close confirmed fingerprint collisions by tagging primitives and bytes
with their type, so 1, "1", b"1", 1.0 and "1.0" hash distinctly, and
pandas frames with identical values but different column names or dtypes
no longer collide.

Recompute the pinned literal-digest tests against the new algorithm and
add must-differ / must-match collision tests plus a benchmark script
demonstrating the pandas speedup (~14x on a 500k-row frame).

Declare xxhash>=0.8.0 as a core runtime dependency (xxh3_128 was added
in 0.8.0); fingerprinting is imported eagerly via the caching adapter,
so it must be a hard dependency rather than an optional extra.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xxhash (the python-xxhash package) is a new runtime dependency licensed
under BSD-2-Clause, whose terms require reproducing the copyright notice
and licence text. Append it to LICENSE in the same style as the existing
third-party (MIT databackend) entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

jernejfrank left a comment

Looks good, the speedup is amazing! I just have one concern: whether invalidating existing caches is a breaking change for users who relied on them for caching heavy computations.

Comment thread hamilton/caching/fingerprinting.py
Comment thread hamilton/caching/fingerprinting.py
Collaborator Author

Dev-iL commented Jun 2, 2026

> Looks good, the speedup is amazing! I just have one concern: whether invalidating existing caches is a breaking change for users who relied on them for caching heavy computations.

Valid concern! Several reasons why I think it's alright:

  1. This is a follow-up to @skrawcz's PR, which already invalidated a lot of caches. Since there has been no release in between, there is no significant penalty for introducing this change now.
  2. Some hashes genuinely need recomputing, because they should identify distinct objects that currently share the same cache entry.
  3. The next release will be the first under "apache", so users might expect (and accept) such changes as "the price of progress".

Dev-iL requested review from elijahbenizzy and skrawcz June 2, 2026 06:14
Contributor

jernejfrank left a comment

Good point on #1616, makes sense to add this now!
