marbl/DJCounter

🧬 DJCounter

Estimate the copy number of ribosomal-DNA distal junctions (DJs) from sequencing data.


Overview

DJCounter estimates how many copies of the ribosomal-DNA distal junction (DJ) are present in a human genome from short-read sequencing data. It supports two complementary modes:

| Mode | Input | When to use |
|------|-------|-------------|
| Mapping-based | aligned BAM/CRAM | reads already aligned to GRCh38 / GRCh37 / CHM13 |
| K-mer based (reference-free) | raw FASTQ (or BAM/CRAM) | raw reads or BAM file |

  • Mapping-based: DJ copy number is derived from sequencing coverage in the target region, normalized to autosomal background.
  • K-mer based: DJ copy number is derived from the k-mer multiplicity of a curated DJ-specific 31-mer set, normalized to the 2-copy peak in the read k-mer histogram.

Typical human samples yield ~10 DJ copies; samples carrying a Robertsonian translocation typically show ~8.

Quick start

1. Mapping-based

Suitable when your BAM/CRAM is aligned to one of the supported references (see Supported references).

scripts/calCounts.sh \
    --sample  Sample01 \
    --bam     /path/to/sample.bam \
    --ref     GRCh38 \
    --threads 10

Output: $outdir/$sample.$ref.tg.<filter>.<gap>.txt

sample      ref     roi      DJ_count
Sample01    GRCh38  DJ_filt  11.01608
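
When many samples have been processed, the per-sample result files can be stacked into one table. The helper below is a minimal sketch, not part of the repository: `merge_dj_tables` is a hypothetical name, and it assumes every result file starts with the single header row shown above.

```shell
# Hypothetical helper (not shipped with DJCounter): concatenate result
# files, keeping the header row from the first file only.
merge_dj_tables() {
    # FNR == 1 is the first line of each file; NR != 1 skips it for
    # every file except the first one.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example call (paths are placeholders):
#   merge_dj_tables out/*.txt > all_samples.txt
```

The merged file keeps the `sample`, `ref`, `roi` and `DJ_count` columns unchanged, so it can be loaded directly into R or a spreadsheet.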

📘 Details: scripts/mapping_based.md

2. K-mer based (reference-free)

# 1. Prepare the DJ target k-mer database (one-time)
cd resources
pigz -cd DJtarget.meryl.tar.gz | tar -xf -

# 2. Run on a sample
scripts/kmer_based_dj_counting.sh Sample01 /path/to/reads.fq.gz
# or paired-end:
scripts/kmer_based_dj_counting.sh Sample01 reads_1.fq.gz,reads_2.fq.gz
# or BAM/CRAM:
scripts/kmer_based_dj_counting.sh Sample01 sample.bam GRCh38

Plot the distribution across many samples:

cat DJcounts/*_DJ_count.txt > DJ_counts.txt
Rscript scripts/plot_dist.R

📘 Details: scripts/kmer_based.md

How it works

Mapping-based

DJ_count = (2 × tgCount) / (covLen × bgCov)

  tgCount : reads aligned to the DJ target regions
  covLen  : DJ length on CHM13 used to normalize tgCount
  bgCov   : background autosomal coverage
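
To make the arithmetic concrete, the formula can be evaluated by hand with `awk`. This is only a sketch: `dj_count_mapping` is a hypothetical helper rather than something `calCounts.sh` exposes, and the input numbers are invented for illustration, not taken from a real sample.

```shell
# Hypothetical helper: evaluates DJ_count = (2 * tgCount) / (covLen * bgCov).
# Arguments: $1 = tgCount, $2 = covLen, $3 = bgCov
dj_count_mapping() {
    awk -v tg="$1" -v len="$2" -v bg="$3" 'BEGIN {
        # Guard against division by zero or negative inputs.
        if (len <= 0 || bg <= 0) { print "NA"; exit 1 }
        printf "%.4f\n", (2 * tg) / (len * bg)
    }'
}

# Invented inputs: (2 * 6000000) / (40000 * 30) = 10
dj_count_mapping 6000000 40000 30   # prints 10.0000
```

The factor of 2 reflects that the autosomal background represents two copies, so a target covered at the same depth as the background yields a count of 2.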

K-mer based

  1. Count all 31-mers in the input (meryl count k=31).
  2. Intersect with the curated DJtarget.meryl set (52,227 distinct k-mers; 26,140,589 occurrences) and read the median frequency from its histogram.
  3. Use Merqury's kmerHistToPloidyDepth.jar to estimate the 2-copy peak from the read k-mer histogram.
  4. DJ_count ≈ DJ_median / (peak2 / 2).
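
Step 4 is a single division once the two values are known. The sketch below assumes `DJ_median` and `peak2` have already been obtained from steps 2 and 3; `dj_count_kmer` is a hypothetical name and the example values are invented.

```shell
# Hypothetical helper: evaluates DJ_count = DJ_median / (peak2 / 2).
# Arguments: $1 = DJ_median (median multiplicity of DJ target k-mers)
#            $2 = peak2     (2-copy peak of the read k-mer histogram)
dj_count_kmer() {
    awk -v med="$1" -v peak2="$2" 'BEGIN {
        if (peak2 <= 0) { print "NA"; exit 1 }
        # peak2 / 2 is the k-mer depth of a single copy.
        printf "%.2f\n", med / (peak2 / 2)
    }'
}

# Invented inputs: 150 / (30 / 2) = 10
dj_count_kmer 150 30   # prints 10.00
```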

Supported references

| Build | Required contigs | Notes |
|-------|------------------|-------|
| GRCh38 / hg38 | chr21, chrUn_GL000220v1, chr17_GL000205v2_random, chr22_KI270733v1_random, chrUn_GL000195v1 | Broad ver. (UK Biobank) or 1KGP NYGC ver. |
| GRCh37 / hg19 (experimental) | chr7_gl000195_random, chr17_gl000205_random | 1KGP ver. |
| T2T-CHM13 / hs1 | chr13, chr14, chr15, chr21, chr22 | |

Verify your BAM contains the required contigs:

samtools view -H sample.bam | grep chr17_GL000205v2_random
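
The `grep` above checks one contig at a time. To check a whole list in one pass, the header can be piped through a small filter such as the one below. It is a sketch only: `missing_contigs` is a hypothetical helper, and it relies on the standard SAM header layout in which each `@SQ` line carries an `SN:<name>` field.

```shell
# Hypothetical helper: reads a SAM header on stdin and prints every contig
# from the argument list that has no matching @SQ SN: entry.
missing_contigs() {
    awk -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        $1 == "@SQ" {
            # Record the sequence name from the SN: field.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) seen[substr($i, 4)] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) print names[i]
        }'
}

# Demo on a two-line dummy header (LN value is a placeholder):
printf '@HD\tVN:1.6\n@SQ\tSN:chr21\tLN:1000\n' |
    missing_contigs chr21 chr17_GL000205v2_random
# prints: chr17_GL000205v2_random
```

With a real file, the input would come from `samtools view -H sample.bam`, followed by the contig names listed for your build in the table above. Empty output means every listed contig is present.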

Repository layout

DJCounter/
├── scripts/         # Pipeline scripts and per-mode docs
│   ├── calCounts.sh
│   ├── kmer_based_dj_counting.sh
│   ├── mapping_based.md
│   └── kmer_based.md
├── resources/       # Pre-built DJ k-mer database & references
│   └── DJtarget.meryl.tar.gz
├── roi/             # Target BED files
│   ├── GRCh38/
│   ├── hg19/
│   └── CHM13/
└── paper/           # Jupyter notebook for generating plots

Dependencies

  • samtools ≥ 1.21 — mapping-based mode and BAM/CRAM input
  • meryl ≥ 1.4.1 — k-mer mode
  • merqury — only eval/kmerHistToPloidyDepth.jar; set $MERQURY to the clone path
  • Java runtime (for the Merqury jar)
  • pigz, R (for plotting)
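
Before a first run it can help to confirm that the tools are reachable. The snippet below is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and it only checks that each command is on `PATH` and that the jar exists under `$MERQURY`. It does not verify the minimum versions listed above.

```shell
# Hypothetical helper: prints every command from the argument list
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names assumed from the dependency list above.
missing_tools samtools meryl java pigz Rscript

# $MERQURY should point at a Merqury clone that contains the jar.
[ -f "${MERQURY:-}/eval/kmerHistToPloidyDepth.jar" ] ||
    echo "kmerHistToPloidyDepth.jar not found under \$MERQURY"
```

No output means all commands and the jar were found.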

Changelog

| Version | Date | Changes |
|---------|------|---------|
| v1.0 | 2026-03-08 | Finalized hg38 and k-mer modes |
| v0.2.2 | 2025-11-26 | Added BED file for ROI on hg19 |
| v0.2.1 | 2024-07-29 | Output background and fragment size; fixed background command |
| v0.2 | 2024-07-25 | samtools idxstats → samtools coverage for background; removed temp files |
| v0.1 | 2024-07-17 | First commit |

About

This repository can be used to calculate the DJ copy number of a genome aligned to GRCh38 (Broad reference).
