Support Snappy compression and a configurable GZIP level for make_examples output by tfenne · Pull Request #1088 · google/deepvariant

tfenne · 2026-06-23T01:23:34Z

Summary

make_examples has always written its tf.Example output as GZIP-compressed TFRecords at zlib's default level, with the codec hardcoded on both the write and read sides. On many-core machines the GZIP step is a meaningful slice of make_examples CPU time, and when output throughput is provisioned generously the cheaper Snappy codec is often the better trade. This PR makes the examples codec selectable and exposes the GZIP level. The output compression type is keyed off the file name; the compression level is set by a new option.

Going from gzip level 6 to level 1 on chr20 saved ~7% of make_examples runtime, at the cost of about doubling the output size. Going from gzip level 6 to snappy saved 10%, at the cost of about 4x more storage. The tradeoff is user-selectable based on available storage, and its storage cost is much more modest now that the small model handles a large fraction of calls.

Design: the file name is the single source of truth

The compression codec is inferred from the examples file-name suffix rather than from a separate flag:

a *.snappy examples path selects Snappy;
any other suffix keeps the historical GZIP behaviour.

The C++ writer (nucleus::ExampleWriter, via the new CompressionTypeForPath) and every examples reader (call_variants, data_providers, show_examples, and the shape probe in dv_utils, via the new dv_utils.compression_type_for_examples_path) derive the codec the same way. Detection is case-insensitive and handles the usual sharded (@N, -ddddd-of-ddddd) and comma-separated path forms.
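The suffix rule described above can be sketched in a few lines of Python. This is only an illustration, not the PR's code: `infer_examples_codec` is a hypothetical name, and the treatment of comma-separated lists (first entry decides) and of a shard spec trailing the suffix are assumptions about how the real `CompressionTypeForPath` and `dv_utils.compression_type_for_examples_path` behave.

```python
import re

# Hypothetical helper for illustration only; the PR's real functions are
# CompressionTypeForPath (C++) and dv_utils.compression_type_for_examples_path.
_TRAILING_SHARD_SPEC = re.compile(r"(@\d+|-\d{5}-of-\d{5})$")


def infer_examples_codec(path: str) -> str:
    """Returns "SNAPPY" for a *.snappy examples path, otherwise "GZIP"."""
    # Assumption: for a comma-separated list, the first entry decides.
    first = path.split(",")[0].strip().lower()
    # Assumption: a shard spec may trail the suffix (e.g. "x.snappy@16"),
    # so strip it before looking at the suffix.
    first = _TRAILING_SHARD_SPEC.sub("", first)
    return "SNAPPY" if first.endswith(".snappy") else "GZIP"


if __name__ == "__main__":
    for p in ("examples.tfrecord@16.gz",
              "examples.tfrecord-00003-of-00016.SNAPPY",
              "examples.tfrecord"):
        print(p, "->", infer_examples_codec(p))
```

Because the writer and every reader would call the same rule, there is no second flag that could disagree with the file name.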

A new --examples_compression_level flag (proto MakeExamplesOptions.examples_compression_level) exposes the GZIP level (-1 or 0..9; -1 = library default). It only affects GZIP output. The proto field is declared optional so an unset value is distinguishable from a deliberate level 0 (otherwise a non-flag caller would silently get level 0 / no deflate), and a level supplied alongside a .snappy path is ignored with a warning (Snappy has no levels).
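The level semantics in the paragraph above (unset is distinct from 0, valid range is -1 or 0..9, a level with Snappy is ignored with a warning) can be sketched as follows. The function name `effective_gzip_level` and the use of `None` for "unset" are assumptions made for this example; the PR itself expresses "unset" through an optional proto field.

```python
import warnings
from typing import Optional


def effective_gzip_level(level: Optional[int], codec: str) -> Optional[int]:
    """Resolves the GZIP level to use, or None when the codec has no levels.

    Hypothetical helper: `level` is None when the caller never set it, which
    mirrors the "optional" proto field described in the PR.
    """
    if level is not None and not -1 <= level <= 9:
        raise ValueError(f"compression level must be -1 or 0..9, got {level}")
    if codec == "SNAPPY":
        if level is not None:
            warnings.warn("compression level is ignored for Snappy output")
        return None
    # Unset falls back to the library default (-1), never to level 0.
    return -1 if level is None else level


if __name__ == "__main__":
    print(effective_gzip_level(None, "GZIP"))  # unset: library default
    print(effective_gzip_level(0, "GZIP"))     # deliberate "no deflate"
```

Keeping "unset" separate from 0 is the point of the optional field: a caller that never touches the option gets the library default rather than uncompressed output.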

Backward compatibility

Default behaviour is unchanged: with no .snappy suffix and no level flag, output is GZIP at the library default level, exactly as before. The auxiliary make_examples outputs (call_variant_outputs, small_model_examples) are written through the Python TFRecord writer, which cannot emit Snappy, so they remain GZIP — and their file names are forced to .gz even when the main examples are Snappy, so each file's name matches its actual bytes.
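A minimal sketch of the side-output renaming described above, assuming the rule is simply "a trailing .snappy, in any casing, becomes .gz". The helper name `as_gzip_path` is illustrative and the body is not taken from the PR.

```python
import re

# Assumption: only a trailing ".snappy" (any casing) is rewritten.
_SNAPPY_SUFFIX = re.compile(r"\.snappy$", re.IGNORECASE)


def as_gzip_path(file_path: str) -> str:
    """Returns file_path with a trailing '.snappy' replaced by '.gz'.

    Paths that do not end in '.snappy' are returned unchanged.
    """
    return _SNAPPY_SUFFIX.sub(".gz", file_path)


if __name__ == "__main__":
    print(as_gzip_path("out/call_variant_outputs.tfrecord.snappy"))
    print(as_gzip_path("out/small_model_examples.tfrecord.SNAPPY"))
    print(as_gzip_path("out/call_variant_outputs.tfrecord.gz"))
```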

What changed

third_party/nucleus/io/example_writer.{h,cc}: codec inferred from the path; compression_level plumbed into RecordWriterOptions.zlib_options; new public CompressionTypeForPath.
deepvariant/protos/deepvariant.proto: optional int32 examples_compression_level.
deepvariant/make_examples_options.py: --examples_compression_level flag + range validator.
deepvariant/make_examples_native.cc: passes the level (library default when unset).
deepvariant/dv_utils.py, call_variants.py, data_providers.py, show_examples.py: readers autodetect the codec from the suffix.
deepvariant/make_examples_core.py: side outputs forced to .gz.

Testing

third_party/nucleus/io:example_writer_test — codec detection, Snappy write/read round-trip, and GZIP level actually applied (level 0 file larger than level 9).
deepvariant:dv_utils_test — suffix detection across casing and sharding forms.
deepvariant:make_examples_core_test — side-output .snappy → .gz renaming (incl. uppercase).
End-to-end make_examples → call_variants round-trip on a chr20 region for both Snappy and GZIP examples (and an explicit GZIP level), confirming call_variants reads Snappy output and that an out-of-range level is rejected.
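The "level 0 file larger than level 9" check in the first test above relies on a general property of deflate that can be demonstrated with Python's standard `zlib` module. This sketch is not the PR's test: it compresses an in-memory buffer rather than writing TFRecords, and the payload is made up for the example.

```python
import zlib

# Made-up, highly repetitive payload standing in for serialized examples.
payload = b"ACGT" * 50_000 + bytes(range(256)) * 100

# Compress the same bytes at several levels and record the output sizes.
sizes = {level: len(zlib.compress(payload, level)) for level in (0, 1, 6, 9)}

for level, size in sizes.items():
    print(f"level {level}: {size} bytes")

# Level 0 only stores the data (plus framing), so it cannot be smaller than
# the input, while level 9 compresses this repetitive payload heavily.
assert sizes[0] > len(payload)
assert sizes[0] > sizes[9]
```

Comparing output sizes is a cheap way to confirm that a level setting actually reached the compressor, which is otherwise invisible to a round-trip read test.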

…amples

make_examples has always written its tf.Example output as GZIP-compressed TFRecords at the zlib default level. On many-core machines the GZIP step is a meaningful slice of make_examples CPU, and when output throughput is provisioned generously the cheaper Snappy codec is the better trade. This adds that control.

The compression codec is inferred from the examples file-name suffix, so the file name is the single source of truth for both writing and reading: a ".snappy" examples path selects Snappy, any other suffix keeps the historical GZIP behaviour. The C++ writer (nucleus::ExampleWriter, via the new CompressionTypeForPath) and every examples reader (call_variants, data_providers, show_examples, and the shape probe in dv_utils, via dv_utils.compression_type_for_examples_path) derive the codec the same way, so a file can never be written with one codec and read with another.

A new --examples_compression_level flag (MakeExamplesOptions.examples_compression_level) exposes the GZIP level (-1 or 0..9; -1 = library default). It only affects GZIP output; the proto field is declared optional so an unset value is distinguishable from a deliberate level 0, and a level supplied alongside a Snappy path is ignored with a warning.

The auxiliary make_examples outputs (call_variant_outputs, small_model_examples) are written through the Python TFRecord writer, which cannot emit Snappy, so they remain GZIP and their file names are forced to ".gz" to stay consistent with their bytes.

Tested: C++ example_writer_test (codec detection, Snappy round-trip, GZIP level applied), dv_utils_test (suffix detection), make_examples_core_test (side-output renaming), and an end-to-end make_examples -> call_variants round-trip for both Snappy and GZIP examples.

The compression work added a suffix-inferred codec and --examples_compression_level to make_examples, but the run_deepvariant wrapper still hardcoded .gz example names and did not surface the level, so Snappy was reachable only by invoking make_examples directly. This wires both controls through the orchestration wrapper. --examples_compression {GZIP,SNAPPY} selects the suffix of the single shared examples path, which feeds both make_examples (writer) and call_variants (reader) so they stay aligned; the auxiliary outputs (gVCF, call_variant_outputs, small_model) remain GZIP. --examples_compression_level forwards the GZIP level to make_examples for GZIP output only, and check_flags validates the range and warns if a level is supplied alongside SNAPPY.
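The wrapper change described above amounts to choosing one suffix for the single examples path that both stages share. A rough sketch, assuming a hypothetical helper `shared_examples_path` and an illustrative `make_examples.tfrecord@N` base name (neither is quoted from the PR):

```python
import os

# Mapping from the wrapper's --examples_compression value to a path suffix.
_SUFFIX_BY_CODEC = {"GZIP": ".gz", "SNAPPY": ".snappy"}


def shared_examples_path(intermediate_dir: str, num_shards: int,
                         examples_compression: str = "GZIP") -> str:
    """Builds the one examples path handed to both writer and reader.

    Hypothetical helper; the base file name is an assumption for illustration.
    """
    try:
        suffix = _SUFFIX_BY_CODEC[examples_compression.upper()]
    except KeyError:
        raise ValueError(
            f"unsupported examples_compression: {examples_compression!r}")
    return os.path.join(intermediate_dir,
                        f"make_examples.tfrecord@{num_shards}{suffix}")


if __name__ == "__main__":
    path = shared_examples_path("/tmp/intermediate", 16, "SNAPPY")
    # The same string would be passed to make_examples and call_variants.
    print(path)
```

Since the codec is inferred from the suffix on both sides, passing the same string to make_examples and call_variants is enough to keep writer and reader aligned.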

tfenne · 2026-06-23T11:03:55Z

Pushed an extra commit this morning that exposes the compression options in run_deepvariant.py too.

pichuan · 2026-06-24T04:32:44Z

Hi @tfenne ,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

tfenne · 2026-06-26T17:26:00Z

I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a c8a.4xlarge at AWS with 16 cores / 16 shards. In that setup:

baseline runtime: 108.6
pr runtime: 102.7
% change: ~5.2% improvement

The baseline differs from the one in my other PR, I think largely because of run-to-run variance; the figures above are averages of multiple interleaved replicates. Also, the improvement shown above appears only if a) the option to switch to Snappy intermediate compression is supplied, and b) the storage has enough throughput to absorb the extra writes (the Snappy-compressed outputs are ~3x the size of GZIP level 6 output, but with the small model on, the absolute size is not so bad).

pichuan · 2026-06-28T16:25:38Z

    new_file = os.path.join(file_dir, new_file_base)
    return new_file

+  @staticmethod


I'm changing this part to be:

    @classmethod
    def _as_gzip_path(cls, file_path):

pichuan · 2026-06-28T16:26:03Z

+    agree with the codec detection in
+    dv_utils.compression_type_for_examples_path and the C++ writer (both
+    lower-case the path before comparing).
+    """


I'm adding more to the comment here:

    Args:
      file_path: The file path to potentially rewrite.

    Returns:
      The path with '.snappy' replaced by '.gz', or the original path.

pichuan · 2026-06-28T16:27:30Z

+#include <vector>

+#include "tensorflow/core/lib/io/record_reader.h"
+#include "tensorflow/core/platform/env.h"


I'll add two more includes here, for tstring.h and types.h.

pichuan · 2026-06-28T16:28:26Z


 // High-level options that encapsulates all of the parameters needed to run
 // DeepVariant end-to-end.
 // Next ID: 102.


I'll update this ID.

pichuan

As I incorporate changes internally, I'm adding some minor edits.

Here are a few examples.

Going forward I might not mention all of them if they're minor ones like this. FYI.

pichuan · 2026-06-28T16:39:14Z

+  tensorflow::io::RecordReaderOptions opts =
+      tensorflow::io::RecordReaderOptions::CreateRecordReaderOptions("SNAPPY");
+  tensorflow::io::RecordReader reader(file.get(), opts);
+  tensorflow::uint64 offset = 0;


I'm changing this to uint64_t

pichuan · 2026-06-30T20:40:28Z

Hi @tfenne,

An update on testing. I ran your PR with --examples_compression=SNAPPY across 6 sample types (exome, wgs, pacbio, ont-r104, hybrid-pacbio-illumina, rnaseq), 5 trials each.

Good news first: On n2-standard-96, the PR is a clean no-op when the flags aren't used — VCF outputs are byte-identical to a baseline built without the PR, and runtimes showed no significant difference on 5 of 6 datasets (one dataset, ont-r104, showed a spurious 4.76% slowdown attributable to cross-day VM variance). The flag-guarded design works well.

On the runtime improvement, my results so far are mixed:

n2-standard-96: SNAPPY showed marginal make_examples speedups on 3 of 6 datasets (pacbio -6.7%, wgs -5.0%, hybrid -6.6%; all p<0.05 but none reaching p<0.01). The other 3 datasets showed no significant effect. SNAPPY also had slightly lower variance than the baseline on most datasets, which is a nice property.
c3d-standard-16: Unfortunately these results were inconclusive — the SNAPPY run suffered from severe VM-level instability (variance up to 18× higher than the noop, affecting all stages equally, including stages that don't touch example compression like vcf_stats). I can't draw any conclusions from this data. Full report below.

I'm currently re-running the experiment on n2-standard-96 to see if the marginal improvements reproduce. I'll report back when those results are in.

I noticed in your earlier comment you saw a ~5.2% improvement on chr20 WGS (c8a.4xlarge, 16 cores), and you mentioned that storage throughput matters — the benefit only appears if storage can absorb the ~3× larger intermediate writes. That's an interesting clue. In my setup (Google Batch VMs writing to local SSD), I'm not sure if storage throughput is the bottleneck. Do you have a sense of what other factors might explain why I'm not seeing a clear improvement at full-genome scale?

-pichuan

PR #1088 (SNAPPY vs Noop) Runtime Report — `c3d-standard-16`

Setup

Machine type: c3d-standard-16
Trials: 5 per sample
Noop: gh1088-head938924210-noop-c3d16 (default GZIP level 6)
SNAPPY: gh1088-head938924210-snappy-c3d16 (--examples_compression=SNAPPY)
Both runs on same day (20260629), same code, same machine type.

`make_examples` Comparison

uid	noop (mean)	snappy (mean)	diff (sec)	diff (%)	t-stat	p-value	significant?
pacbio	8915.50s (2h 28m)	9096.95s (2h 31m)	+181.45s	+2.04%	-0.47	0.6566	❌ Not significant
wgs	11245.00s (3h 7m)	11728.52s (3h 15m)	+483.52s	+4.30%	-0.97	0.3708	❌ Not significant
hybrid	13956.67s (3h 52m)	14588.14s (4h 3m)	+631.47s	+4.52%	-5.21	0.0057	⚠️ YES (p<0.01)
ont-r104	12868.34s (3h 34m)	13541.70s (3h 45m)	+673.36s	+5.23%	-1.87	0.1333	❌ Not significant
exome	481.07s (8m 1s)	499.10s (8m 19s)	+18.03s	+3.75%	-1.08	0.3242	❌ Not significant
rnaseq	1461.44s (24m 21s)	1676.24s (27m 56s)	+214.80s	+14.70%	-3.39	0.0255	⚠️ Marginal (p<0.05)

Caution

SNAPPY is slower on every single dataset. Not a single make_examples improvement. The direction is 6/6 regressions.

However, the SNAPPY run had dramatically higher runtime variance across all stages (up to 18× on ont-r104), including stages unrelated to example compression (vcf_stats, postprocess_variants), suggesting VM-level instability rather than a SNAPPY-specific regression. See the raw tables below for the std_dev values.

VCF Output Verification

All md5sums are byte-identical:

uid	md5sum (both runs)	match?
exome	`a564eb2e51d9ba927e8d2c406b045cd6`	✅
hybrid-pacbio-illumina	`eba191e9407907e660317ccbd49f420d`	✅
ont-r104	`46aeecf822164751ade92d05c64c61cd`	✅
pacbio	`e70fb870b3d4b8e82f3dab6dbc6aeee3`	✅
rnaseq	`1296173f2fd5b6f328a37130186021ba`	✅
wgs	`b0f1e53dbc9bc1b9b44dc653ae5f9cef`	✅

Note

The c3d-standard-16 md5sums differ from the n2-standard-96 md5sums — this is expected since num_shards differs (16 vs 96), and DeepVariant's sharding is not deterministic across different shard counts.

Summary

SNAPPY appears slower on every dataset on c3d-standard-16, but this result is almost certainly unreliable due to:

Catastrophic variance in the SNAPPY run (up to 18× higher than noop)
The slowdown affects all stages equally — including stages that don't touch example compression at all (vcf_stats, postprocess_variants)
The pattern is consistent with the SNAPPY VMs landing on degraded hardware or hitting severe noisy-neighbor effects

This data cannot be used to conclude that SNAPPY is harmful. The runs would need to be repeated, ideally with more trials or with the VMs pinned to specific hardware to reduce placement variance.

Click to view raw snappy-c3d16 runtime table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	499.1	33.992	5	8m 19s
exome	HG003	call_variants	188.28	34.675	5	3m 8s
exome	HG003	postprocess_variants	32.81	2.989	5	32s
exome	HG003	vcf_stats	6.19	0.589	5	6s
exome	HG003	total	720.19	71.534	5	12m 0s
hybrid-pacbio-illumina	HG003	make_examples	14588.14	267.518	5	4h 3m 8s
hybrid-pacbio-illumina	HG003	call_variants	24897.76	382.958	5	6h 54m 57s
hybrid-pacbio-illumina	HG003	postprocess_variants	287.11	11.679	5	4m 47s
hybrid-pacbio-illumina	HG003	vcf_stats	240.92	1.755	5	4m 0s
hybrid-pacbio-illumina	HG003	total	39773.02	120.736	5	11h 2m 53s
ont-r104	HG003	make_examples	13541.7	800.865	5	3h 45m 41s
ont-r104	HG003	call_variants	12762.13	604.424	5	3h 32m 42s
ont-r104	HG003	postprocess_variants	1122.57	72.831	5	18m 42s
ont-r104	HG003	vcf_stats	376.11	9.419	5	6m 16s
ont-r104	HG003	total	27426.4	1326.791	5	7h 37m 6s
pacbio	HG003	make_examples	9096.95	780.255	5	2h 31m 36s
pacbio	HG003	call_variants	6656.72	417.249	5	1h 50m 56s
pacbio	HG003	postprocess_variants	551.59	30.105	5	9m 11s
pacbio	HG003	vcf_stats	283.55	4.095	5	4m 43s
pacbio	HG003	total	16305.27	1226.954	5	4h 31m 45s
rnaseq	HG005	make_examples	1676.24	139.908	5	27m 56s
rnaseq	HG005	call_variants	116.3	11.928	5	1m 56s
rnaseq	HG005	postprocess_variants	238.74	19.31	5	3m 58s
rnaseq	HG005	vcf_stats	5.69	0.549	5	5s
rnaseq	HG005	total	2031.27	170.952	5	33m 51s
wgs	HG003	make_examples	11728.52	1008.781	5	3h 15m 28s
wgs	HG003	call_variants	6156.89	303.598	5	1h 42m 36s
wgs	HG003	postprocess_variants	414.25	15.42	5	6m 54s
wgs	HG003	vcf_stats	253.21	1.886	5	4m 13s
wgs	HG003	total	18299.66	1327.511	5	5h 4m 59s

Click to view raw noop-c3d16 runtime table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	481.07	15.308	5	8m 1s
exome	HG003	call_variants	156.24	7.811	5	2m 36s
exome	HG003	postprocess_variants	30.02	0.75	5	30s
exome	HG003	vcf_stats	5.66	0.114	5	5s
exome	HG003	total	667.34	23.065	5	11m 7s
hybrid-pacbio-illumina	HG003	make_examples	13956.67	41.849	5	3h 52m 36s
hybrid-pacbio-illumina	HG003	call_variants	23576.81	50.241	5	6h 32m 56s
hybrid-pacbio-illumina	HG003	postprocess_variants	291.82	4.216	5	4m 51s
hybrid-pacbio-illumina	HG003	vcf_stats	241.65	2.887	5	4m 1s
hybrid-pacbio-illumina	HG003	total	37825.3	73.059	5	10h 30m 25s
ont-r104	HG003	make_examples	12868.34	74.442	5	3h 34m 28s
ont-r104	HG003	call_variants	11858.91	33.492	5	3h 17m 38s
ont-r104	HG003	postprocess_variants	1012.97	6.896	5	16m 52s
ont-r104	HG003	vcf_stats	356.39	1.438	5	5m 56s
ont-r104	HG003	total	25740.22	105.546	5	7h 9m 0s
pacbio	HG003	make_examples	8915.5	395.291	5	2h 28m 35s
pacbio	HG003	call_variants	6326.91	191.306	5	1h 45m 26s
pacbio	HG003	postprocess_variants	521.59	8.425	5	8m 41s
pacbio	HG003	vcf_stats	278.0	2.267	5	4m 38s
pacbio	HG003	total	15763.99	571.737	5	4h 22m 43s
rnaseq	HG005	make_examples	1461.44	22.379	5	24m 21s
rnaseq	HG005	call_variants	96.44	1.544	5	1m 36s
rnaseq	HG005	postprocess_variants	205.45	1.356	5	3m 25s
rnaseq	HG005	vcf_stats	4.97	0.069	5	4s
rnaseq	HG005	total	1763.33	22.935	5	29m 23s
wgs	HG003	make_examples	11245.0	468.325	5	3h 7m 25s
wgs	HG003	call_variants	5924.95	156.029	5	1h 38m 44s
wgs	HG003	postprocess_variants	392.92	2.839	5	6m 32s
wgs	HG003	vcf_stats	255.12	6.976	5	4m 15s
wgs	HG003	total	17562.87	612.448	5	4h 52m 42s

pichuan · 2026-07-01T04:29:14Z

Here is my latest run on n2-standard-96.

Overall, the difference on this machine type doesn't seem to exceed the variance between runs.

@tfenne if you have some suggestions on how I can get a better signal, let me know :-/

For now, I am on the fence about this PR: I could be fine with accepting it because it is cleanly a no-op by default, but I need to discuss internally.

-pichuan

PR #1088 (SNAPPY vs Noop) Runtime Report — `n2-standard-96` (Re-run)

Setup

Machine type: n2-standard-96
Trials: 5 per sample
Both runs on same day (20260630), same code, same machine type.

`make_examples` Comparison

Since SNAPPY only affects example compression, the relevant stage is make_examples. Welch's t-test (two-tailed, n=5 per group):

uid	noop (mean)	snappy (mean)	diff (sec)	diff (%)	t-stat	p-value	significant?
hybrid	3423.9s (57m 3s)	3211.7s (53m 31s)	-212.2s	-6.2%	-8.12	0.0001	✅ YES (p<0.001)
ont-r104	3488.7s (58m 8s)	3264.4s (54m 24s)	-224.3s	-6.4%	-1.42	0.2095	❌ Not significant
wgs	2717.4s (45m 17s)	2632.5s (43m 52s)	-85.0s	-3.1%	-1.18	0.2722	❌ Not significant
pacbio	2326.5s (38m 46s)	2255.8s (37m 35s)	-70.7s	-3.0%	-1.10	0.3100	❌ Not significant
exome	190.0s (3m 9s)	190.0s (3m 10s)	+0.0s	+0.0%	+0.01	1.0000	❌ Not significant
rnaseq	432.5s (7m 12s)	457.9s (7m 37s)	+25.4s	+5.9%	+2.98	0.0304	⚠️ Marginal (p<0.05, wrong direction)
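The t-statistics in the table above can be reproduced from the means and standard deviations in the raw runtime tables of this report. The sketch below assumes std_runtime is the sample standard deviation and takes the difference as snappy minus noop, as the table does; `welch_t` is an illustrative helper, and turning t and the degrees of freedom into a p-value would additionally need a Student's t CDF (for example from SciPy), which is left out here.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Computed from summary statistics; the sign follows (mean_a - mean_b).
    """
    var_a = std_a ** 2 / n_a  # squared standard error of group a
    var_b = std_b ** 2 / n_b  # squared standard error of group b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# hybrid-pacbio-illumina make_examples on n2-standard-96, from the raw tables:
# snappy 3211.74 +/- 48.322, noop 3423.95 +/- 32.812, 5 trials each.
t, df = welch_t(3211.74, 48.322, 5, 3423.95, 32.812, 5)
print(f"t = {t:.2f}, df = {df:.1f}")  # t is about -8.12, df about 7.0
```

With only 5 trials per group the degrees of freedom stay small, which is part of why effects of 3 to 6% do not reach significance on the noisier datasets.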

Note

hybrid-pacbio-illumina shows a highly significant 3.5-minute speedup (p<0.001) with zero overlap between the trial distributions — every snappy trial is faster than every noop trial. This is a clean, reproducible result.

Warning

rnaseq is significant but in the wrong direction — snappy is 25s slower. This is unexpected and may be a confound. Notably, the non-make_examples stages for rnaseq also showed a small but significant difference (p<0.001), suggesting something VM-level rather than SNAPPY-specific.

Sanity Check: Non-`make_examples` Stages

To confirm SNAPPY only affects make_examples, I compared the combined runtime of call_variants + postprocess_variants + vcf_stats:

uid	noop other (mean)	snappy other (mean)	diff (sec)	p-value	significant?
exome	73.0s	72.1s	-0.9s	0.403	❌ No
wgs	1688.4s	1685.4s	-3.0s	0.939	❌ No
pacbio	1917.3s	1946.9s	+29.6s	0.351	❌ No
ont-r104	3478.1s	3388.3s	-89.9s	0.644	❌ No
hybrid	4977.5s	4863.3s	-114.2s	0.399	❌ No
rnaseq	94.9s	98.0s	+3.1s	0.0003	⚠️ Yes

Non-make_examples stages are statistically indistinguishable for 5/6 datasets, as expected. The rnaseq anomaly (significant in non-make_examples stages too) reinforces that the rnaseq result is likely a VM-level confound, not a SNAPPY effect.

Summary

On n2-standard-96 with 5 trials, focusing on make_examples (the only stage affected by SNAPPY):

hybrid-pacbio-illumina: ✅ Clear, significant 6.2% speedup (p<0.001). This reproduced cleanly.
wgs, pacbio, ont-r104: Snappy is nominally 3–6% faster, but p-values of 0.21–0.31 mean these are not statistically significant with n=5. Power analysis suggests ~20–33 trials per group would be needed to reliably detect effects of this size.
exome: No difference at all (0.0%).
rnaseq: ⚠️ Significant regression (+5.9%), but likely a VM-level confound.

Bottom line: The only statistically clean result is the hybrid-pacbio-illumina speedup. The other datasets trend in the right direction but the run-to-run GCP variance (CV 1–9%) is too high to reach significance with 5 trials.

Click to view raw snappy-n2s96 runtime table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	190.01	3.602	5	3m 10s
exome	HG003	call_variants	34.57	0.592	5	34s
exome	HG003	postprocess_variants	31.3	1.071	5	31s
exome	HG003	vcf_stats	6.25	0.092	5	6s
exome	HG003	total	255.88	4.521	5	4m 15s
hybrid-pacbio-illumina	HG003	make_examples	3211.74	48.322	5	53m 31s
hybrid-pacbio-illumina	HG003	call_variants	4322.17	256.14	5	1h 12m 2s
hybrid-pacbio-illumina	HG003	postprocess_variants	237.31	4.439	5	3m 57s
hybrid-pacbio-illumina	HG003	vcf_stats	303.78	3.728	5	5m 3s
hybrid-pacbio-illumina	HG003	total	7771.22	277.621	5	2h 9m 31s
ont-r104	HG003	make_examples	3264.44	148.935	5	54m 24s
ont-r104	HG003	call_variants	2039.84	80.927	5	33m 59s
ont-r104	HG003	postprocess_variants	894.76	22.239	5	14m 54s
ont-r104	HG003	vcf_stats	453.67	12.701	5	7m 33s
ont-r104	HG003	total	6199.03	239.944	5	1h 43m 19s
pacbio	HG003	make_examples	2255.76	122.206	5	37m 35s
pacbio	HG003	call_variants	1154.46	47.644	5	19m 14s
pacbio	HG003	postprocess_variants	445.32	16.256	5	7m 25s
pacbio	HG003	vcf_stats	347.15	7.399	5	5m 47s
pacbio	HG003	total	3855.53	184.267	5	1h 4m 15s
rnaseq	HG005	make_examples	457.85	17.891	5	7m 37s
rnaseq	HG005	call_variants	26.45	0.993	5	26s
rnaseq	HG005	postprocess_variants	66.0	0.47	5	1m 6s
rnaseq	HG005	vcf_stats	5.57	0.141	5	5s
rnaseq	HG005	total	550.3	18.674	5	9m 10s
wgs	HG003	make_examples	2632.45	125.363	5	43m 52s
wgs	HG003	call_variants	964.09	40.098	5	16m 4s
wgs	HG003	postprocess_variants	402.62	15.637	5	6m 42s
wgs	HG003	vcf_stats	318.71	14.375	5	5m 18s
wgs	HG003	total	3999.17	177.889	5	1h 6m 39s

Click to view raw noop-n2s96 runtime table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	189.98	7.532	5	3m 9s
exome	HG003	call_variants	35.0	1.142	5	35s
exome	HG003	postprocess_variants	31.64	1.007	5	31s
exome	HG003	vcf_stats	6.38	0.236	5	6s
exome	HG003	total	256.62	9.248	5	4m 16s
hybrid-pacbio-illumina	HG003	make_examples	3423.95	32.812	5	57m 3s
hybrid-pacbio-illumina	HG003	call_variants	4445.46	103.993	5	1h 14m 5s
hybrid-pacbio-illumina	HG003	postprocess_variants	234.49	7.536	5	3m 54s
hybrid-pacbio-illumina	HG003	vcf_stats	297.52	1.827	5	4m 57s
hybrid-pacbio-illumina	HG003	total	8103.91	112.347	5	2h 15m 3s
ont-r104	HG003	make_examples	3488.74	321.376	5	58m 8s
ont-r104	HG003	call_variants	2147.68	403.269	5	35m 47s
ont-r104	HG003	postprocess_variants	875.67	5.602	5	14m 35s
ont-r104	HG003	vcf_stats	454.76	13.149	5	7m 34s
ont-r104	HG003	total	6512.09	723.365	5	1h 48m 32s
pacbio	HG003	make_examples	2326.5	76.272	5	38m 46s
pacbio	HG003	call_variants	1132.62	11.294	5	18m 52s
pacbio	HG003	postprocess_variants	440.23	5.251	5	7m 20s
pacbio	HG003	vcf_stats	344.47	2.594	5	5m 44s
pacbio	HG003	total	3899.35	89.443	5	1h 4m 59s
rnaseq	HG005	make_examples	432.46	6.509	5	7m 12s
rnaseq	HG005	call_variants	24.8	0.193	5	24s
rnaseq	HG005	postprocess_variants	64.77	0.435	5	1m 4s
rnaseq	HG005	vcf_stats	5.37	0.008	5	5s
rnaseq	HG005	total	522.03	6.609	5	8m 42s
wgs	HG003	make_examples	2717.42	100.299	5	45m 17s
wgs	HG003	call_variants	973.81	40.183	5	16m 13s
wgs	HG003	postprocess_variants	403.38	13.776	5	6m 43s
wgs	HG003	vcf_stats	311.25	3.215	5	5m 11s
wgs	HG003	total	4094.61	137.969	5	1h 8m 14s

tfenne force-pushed the tf_examples-compression branch from d9fe074 to 65cf535 on June 23, 2026 04:26

pichuan self-assigned this Jun 24, 2026

pichuan reviewed Jun 28, 2026

tfenne wants to merge 2 commits into google:r1.10 from tfenne:tf_examples-compression
