EVscope: A Comprehensive Bioinformatics Workflow for Accurate and Robust Analysis of Total RNA Sequencing from Extracellular Vesicles
EVscope is an open-source, modular bioinformatics pipeline for analyzing extracellular vesicle (EV)-enriched total RNA sequencing data. It is tailored to the challenges of EV RNA-seq (low RNA yield, fragmented inserts, diverse RNA biotypes, high multi-mapping rates, and contamination risk) and processes paired-end or single-end FASTQ files through an end-to-end workflow: quality control, UMI-based deduplication, two-pass STAR alignment, circular RNA detection, expression matrix generation, contamination screening, exploratory source-enrichment analysis, and comprehensive reporting. EVscope is optimized for the SMARTer Stranded Total RNA-Seq Kit v3 (Pico Input) and introduces EMapper, an expectation-maximization (EM) module whose main contribution is EM-weighted, genome-coordinate BigWig/coverage generation with RNA annotation support. Gene-level count concordance with featureCounts/RSEM serves as a sanity check, not as a claim that EMapper is superior to RSEM for conventional gene read-count quantification.
- Key Features
- Motivation
- Directory Structure
- Requirements
- Installation
- Usage
- Input Data Format
- Pipeline Steps
- Output Structure
- Troubleshooting
- FAQ
- Contributing
- Feedback
- Citation
- Credits
- License
- Contact
- Novel Read-Through Detection: Trims UMI-derived adapter sequences from Read1 using reverse-complemented Read2 UMIs (bin/Step_03_UMIAdapterTrimR1.py).
- EM-Weighted BigWig Coverage Profiling: Uses a genome-wide expectation-maximization algorithm to assign multi-mapped fragments at single-base or binned resolution and generate strand-aware, RPM-normalized BigWig tracks (bin/Step_25_EMapper.py).
- Comprehensive RNA Annotation: Supports 3,659,642 RNAs across 20 biotypes (e.g., protein-coding, lncRNAs, miRNAs, piRNAs, retrotransposons) from GENCODE v45, piRBase v3.0, and RepeatMasker.
- Dual circRNA Detection: Integrates CIRCexplorer2 and CIRI2 for robust circular RNA identification, with merged results for enhanced sensitivity (bin/Step_10_circRNA_merge.py).
- Tissue Deconvolution: Infers EV RNA cellular origins using GTEx v10 and Human Brain Cell Atlas v1.0 references (bin/Step_22_run_RNA_deconvolution_ARIC.py).
- Contamination Screening: Filters bacterial (BBSplit) and microbial (Kraken2) contamination, with optional genomic DNA correction via strand-specific subtraction.
- Extensive Quality Control: Validates raw and trimmed FASTQs (FastQC), UMI motifs, and alignment metrics (bin/Step_24_generate_QC_matrix.py).
- Expression Quantification: Produces TPM/CPM matrices using featureCounts and RSEM, with RNA distribution visualizations (bin/Step_15_plot_RNA_distribution_*.py).
- Interactive Reporting: Generates bigWig tracks, density plots, and a comprehensive HTML report via R Markdown (bin/Step_27_html_report.Rmd).
- Reproducibility: Single-command Bash script with Conda environments, containerization support, and detailed logging.
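The idea behind EM-weighted assignment of multi-mapped fragments can be illustrated with a toy expectation-maximization loop. This is a minimal sketch and not the code in bin/Step_25_EMapper.py: the function name `em_assign`, the locus labels, and the fixed iteration count are hypothetical, and the real module works on genome coordinates at base or bin resolution rather than on named loci.

```python
from collections import defaultdict

def em_assign(fragments, n_iter=50):
    """Toy EM: split each fragment across its candidate loci.

    fragments: list of lists; each inner list holds the loci one fragment aligns to.
    Returns (abundance per locus, list of per-fragment weight dicts).
    """
    loci = sorted({locus for frag in fragments for locus in frag})
    # Start from a uniform abundance estimate (hypothetical initialization).
    abundance = {locus: 1.0 / len(loci) for locus in loci}
    weights = []
    for _ in range(n_iter):
        # E-step: share each fragment in proportion to current abundances.
        weights = []
        for frag in fragments:
            total = sum(abundance[locus] for locus in frag)
            weights.append({locus: abundance[locus] / total for locus in frag})
        # M-step: new abundance is the normalized sum of fractional assignments.
        sums = defaultdict(float)
        for w in weights:
            for locus, value in w.items():
                sums[locus] += value
        abundance = {locus: sums[locus] / len(fragments) for locus in loci}
    return abundance, weights

# Three fragments unique to A, one unique to B, one shared by A and B.
frags = [["A"], ["A"], ["A"], ["B"], ["A", "B"]]
abundance, weights = em_assign(frags)
print(abundance)
print(weights[-1])
```

In this toy input the shared fragment ends up weighted about 3:1 toward locus A, mirroring the uniquely mapped evidence. Summing such fractional weights per position, instead of counting every alignment once, is what makes a coverage track "EM-weighted".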
Extracellular vesicles (EVs) are critical mediators of intercellular communication, carrying diverse RNAs that serve as potential biomarkers for diseases like cancer and neurodegeneration. However, EV RNA sequencing faces unique challenges: low RNA abundance, fragmented transcripts, contamination from genomic DNA or bacterial RNA, and the presence of non-polyadenylated RNAs (e.g., miRNAs, lncRNAs). Standard RNA-seq pipelines, designed for cellular RNA, often fail to address these issues, leading to unreliable results due to multi-mapping reads, incomplete RNA annotations, or unfiltered contaminants.
EVscope provides a specialized, end-to-end pipeline optimized for EV-enriched total RNA-seq. Its distinctive contribution is not to replace transcript quantifiers such as RSEM for ordinary gene readcount estimation, but to connect EV-oriented QC, broad RNA annotation, EM-weighted genome-coordinate BigWig tracks, RNA-biotype/meta-gene coverage profiling, and report generation in one reproducible workflow.
The EVscope repository is organized as follows:
EVscope/
├── EVscope.conf # Configuration file for tool and reference paths
├── EVscope.sh # Main pipeline script (v1.0.0)
├── README.md # This documentation
├── bin/ # Custom scripts for pipeline steps
│ ├── Step_02_calculate_ACC_motif_fraction.py # Calculates ACC motif fractions
│ ├── Step_02_plot_fastq2UMI_motif.py # Visualizes UMI motif distributions
│ ├── Step_03_plot_fastq_read_length_dist.py # Plots read length distributions
│ ├── Step_03_UMIAdapterTrimR1.py # Trims UMI-derived adapters
│ ├── Step_07_bam2strand.py # Determines library strandedness and splice/kb QC
│ ├── Step_08_convert_CIRCexplorer2CPM.py # Normalizes CIRCexplorer2 circRNA output
│ ├── Step_09_convert_CIRI2CPM.py # Normalizes CIRI2 circRNA output
│ ├── Step_10_circRNA_merge.py # Merges circRNA results
│ ├── Step_13_gDNA_corrected_featureCounts.py # Generates gDNA-corrected counts
│ ├── Step_15_combine_total_RNA_expr_matrix.py # Combines RNA expression matrices
│ ├── Step_15_featureCounts2TPM.py # Converts featureCounts to TPM
│ ├── Step_15_plot_RNA_distribution_1subplot.py # RNA distribution plots (1 subplot)
│ ├── Step_15_plot_RNA_distribution_2subplots.py # RNA distribution plots (2 subplots)
│ ├── Step_15_plot_RNA_distribution_20subplots.py # RNA distribution plots (20 subplots)
│ ├── Step_15_plot_top_expressed_genes.py # Plots top expressed genes
│ ├── Step_17_RSEM2expr_matrix.py # Converts RSEM to expression matrix
│ ├── Step_18_plot_reads_mapping_stats.py # Visualizes genomic region mapping
│ ├── Step_22_run_RNA_deconvolution_ARIC.py # Performs tissue deconvolution
│ ├── Step_24_generate_QC_matrix.py # Compiles QC metrics
│ ├── Step_25_bigWig2Expression.py # Converts bigWig to CPM/TPM
│ ├── Step_25_EMapper.py # EM-based read coverage estimation
│ ├── Step_26_density_plot_over_meta_gene.sh # Density plots for meta-gene regions
│ ├── Step_26_density_plot_over_RNA_types.sh # Density plots for RNA types
│ └── Step_27_html_report.Rmd # Generates HTML report
├── figures/ # Pipeline visualization
│ └── EVscope_pipeline.png # Pipeline overview image
├── references/ # Reference genomes, annotations, and indices
│ ├── annotations_HG38/ # Human genome annotations
│ ├── deconvolution_HG38/ # Deconvolution reference matrices
│ ├── genome/ # Reference genomes
│ └── index/ # Aligner indices
└── soft/ # Bundled external tools
    ├── bbmap # BBMap tools
    ├── CIRI_v2.0.6 # CIRI2 for circRNA detection
    ├── kraken2 # Kraken2 for taxonomic classification
    ├── KrakenTools # Kraken2 helper scripts
    └── RSEM_v1.3.3 # RSEM for quantification
- Operating System: Linux (e.g., Ubuntu 20.04+) or macOS.
- Bash: Version 4.0 or higher.
- Conda: Miniconda or Anaconda for environment management.
- Core Tools:
  - FastQC (v0.12.1), umi_tools (v1.1.5), cutadapt (v4.9)
  - STAR (v2.7.11b), samtools (v1.21), featureCounts (v2.0.6)
  - CIRCexplorer2 (v2.3.8), CIRI2 (v2.0.6), RSEM (v1.3.3)
  - BBMap (v39.15), Kraken2, ribodetector (v0.3.1)
  - seqtk (v1.4), BWA (v0.7.18), Picard (v3.3.0), deepTools (v3.5.5)
  - R (v4.3.1) with rmarkdown, DT, kableExtra, bookdown, ggplot2, dplyr
  - Python (v3.10.0) with pandas, numpy, matplotlib, biopython, numba, pyBigWig, pysam
- CPU: 20+ threads recommended for optimal performance.
- RAM: Minimum 64 GB; 250 GB recommended for Picard tools.
- Storage: 500 GB+ for input data, references, and outputs.
- Clone the Repository:

      git clone https://github.com/TheDongLab/EVscope.git
      cd EVscope

- Install Conda (if not already installed):

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh
      source ~/.bashrc

- Create Conda Environments:

      conda env create -f environments/evscope_env.yml
      conda env create -f environments/picard_env.yml
      conda env create -f environments/kraken2_env.yml

- Install CIRCexplorer2:

      conda activate evscope_env
      pip install CIRCexplorer2==2.3.8

- Download Reference Files: Reference annotation files (HG38, GENCODE v45) are available on Zenodo. Download EVscope_annotations_HG38.zip and extract it to references/annotations_HG38/.

- Test Installation:

      conda activate evscope_env
      fastqc --version
      STAR --version
      samtools --version
      python --version
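The installation check can also be scripted. The sketch below only tests whether each command name resolves on PATH in the active environment; it does not verify versions. The helper name `missing_tools` is hypothetical and not part of EVscope.

```python
import shutil

def missing_tools(tools):
    """Return the command names that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Command names taken from the installation check; extend as needed.
required = ["fastqc", "STAR", "samtools", "python"]
missing = missing_tools(required)
print("Missing tools:", ", ".join(missing) if missing else "none")
```

Run it after activating evscope_env, since PATH differs between Conda environments.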
A lightweight repository smoke test is available under tests/smoke/. It validates script syntax and selected toy-data transformations without requiring the full human reference bundle.
    bash tests/smoke/run_smoke.sh

For a full end-to-end run, download the Zenodo reference bundle and use the SRA example listed in Data Availability.

    bash EVscope.sh --sample_name <name> --input_fastqs <files> [options]

Required Arguments:
- --sample_name <name>: Unique sample identifier (used for output files).
- --input_fastqs <files>: FASTQ file paths: one file for single-end input or two files for paired-end input. Space-separated (R1.fq.gz R2.fq.gz) and comma-separated (R1.fq.gz,R2.fq.gz) forms are accepted; R1 must be listed before R2.
Optional Arguments:
| Option | Description | Default |
|---|---|---|
| --threads <int> | Number of CPU threads | 1 |
| --run_steps <list> | Steps to run (e.g., 1,3,5-8 or all) | all |
| --skip_steps <list> | Steps to skip (e.g., 2,4) | None |
| --circ_tool <tool> | circRNA detection tool (CIRCexplorer2, CIRI2, both) | both |
| --gDNA_correction <yes\|no> | Apply genomic DNA correction | no |
| --strand <strand> | Library strandedness (forward, reverse, unstrand) | unstrand |
| --config <path> | Custom configuration file | EVscope.conf |
| --resume | Resume an existing output directory only when step metadata match current inputs/config/parameters | off |
| --force | Allow re-running selected steps in an existing output directory | off |
| --dry-run | Validate inputs/config/dependencies and print execution plan without running steps | off |
| -V, --verbosity <level> | Logging level (1=DEBUG, 2=INFO, 3=WARN, 4=ERROR, 5=FATAL) | 2 |
| -h, --help | Display help message | - |
| -v, --version | Show pipeline version | - |

Resume behavior: EVscope refuses to reuse a non-empty output directory unless --resume or --force is provided. --resume only skips steps whose step.meta.json matches the current inputs/config/parameters. Legacy output directories created before this metadata file existed may require --force or cleanup before rerunning.

Report R environment: Step 27 can render through a configured Conda R environment. Set REPORT_R_ENV in EVscope.conf to a Conda R environment name (for example, r_env_4.4.3) and, if needed, CONDA_EXE to the Conda executable. The default public config leaves REPORT_R_ENV empty and uses Rscript from PATH.
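The list syntax accepted by --run_steps and --skip_steps (for example 1,3,5-8) can be read as a set of step numbers and inclusive ranges. The sketch below shows one plausible interpretation in Python; it is not taken from EVscope.sh, whose actual parsing may differ. The function names `expand_steps` and `plan` are hypothetical, and the upper bound of 27 comes from the pipeline step table.

```python
def expand_steps(spec, last_step=27):
    """Expand a step list such as '1,3,5-8' or 'all' into sorted step numbers."""
    spec = spec.strip()
    if spec.lower() == "all":
        return list(range(1, last_step + 1))
    steps = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # Inclusive range, e.g. '5-8' -> 5, 6, 7, 8.
            start, end = (int(x) for x in token.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {token}")
            steps.update(range(start, end + 1))
        else:
            steps.add(int(token))
    bad = sorted(s for s in steps if not 1 <= s <= last_step)
    if bad:
        raise ValueError(f"step(s) outside 1-{last_step}: {bad}")
    return sorted(steps)

def plan(run_spec="all", skip_spec=""):
    """Steps to execute: requested steps minus skipped steps."""
    skip = set(expand_steps(skip_spec))
    return [s for s in expand_steps(run_spec) if s not in skip]

print(expand_steps("1,3,5-8"))
print(plan("all", "2,4"))
```

The --dry-run option remains the authoritative way to see which steps the pipeline itself will execute for a given combination of arguments.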
bash EVscope.sh --sample_name Example_Data \
--input_fastqs R1.fq.gz R2.fq.gz \
--threads 20 \
--run_steps all \
--gDNA_correction yes \
--strand reverse \
--verbosity 2

- FASTQ Files: Gzipped, paired-end (R1.fastq.gz, R2.fastq.gz) or single-end.
- Sequencing Protocol: Optimized for SMARTer Stranded Total RNA-Seq Kit v3 (Pico Input) with 14-bp UMIs in Read2.
- Quality: High-quality reads suitable for EV RNA-seq.
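Because the UMI sits at the start of Read2, a short insert lets Read1 run through the insert and into the reverse complement of that UMI. The sketch below illustrates this read-through trimming idea with exact string matching. It is not the logic of bin/Step_03_UMIAdapterTrimR1.py: the function names, the exact-match search, and the toy sequences are assumptions made for illustration, and a production trimmer would typically tolerate mismatches and partial matches at the read end.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def trim_umi_readthrough(read1, read2, umi_len=14):
    """Cut Read1 at the reverse complement of the Read2 UMI, if present.

    Returns Read1 unchanged when no exact match is found.
    """
    umi_rc = reverse_complement(read2[:umi_len])
    pos = read1.find(umi_rc)
    return read1 if pos == -1 else read1[:pos]

# Toy fragment: a 16-nt insert shorter than the read length.
insert = "ACGTTGCAAGGCTTAC"
umi = "ACGTACGTTTGACC"  # hypothetical 14-nt UMI at the start of Read2
read2 = umi + reverse_complement(insert)
read1 = insert + reverse_complement(umi) + "AGATCGGAAG"  # trailing bases are arbitrary

print(trim_umi_readthrough(read1, read2))
```

For this toy pair the function returns only the 16-nt insert, dropping the UMI-derived sequence and everything after it.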
| Step | Description |
|---|---|
| 1 | Raw FASTQ quality control using FastQC |
| 2 | UMI motif analysis and ACC motif fraction calculation |
| 3 | UMI extraction, adapter trimming, and read-through UMI removal |
| 4 | Quality control of trimmed FASTQs |
| 5 | Bacterial contamination screening (E. coli, Mycoplasma) using BBSplit |
| 6 | Two-pass STAR alignment with UMI deduplication |
| 7 | Library strandedness detection; splice/kb complementary gDNA-contamination QC proxy (primary source: STAR Log.final.out from Step 6) |
| 8 | CIRCexplorer2-based circular RNA detection |
| 9 | CIRI2-based circular RNA detection using BWA alignments |
| 10 | Merging of CIRCexplorer2 and CIRI2 circRNA results |
| 11 | RNA-seq metrics collection using Picard |
| 12 | featureCounts quantification (unique-mapping mode) |
| 13 | Genomic DNA-corrected featureCounts quantification |
| 14 | RSEM quantification (multi-mapping mode) |
| 15 | featureCounts-based expression matrix and RNA distribution plots |
| 16 | gDNA-corrected expression matrix |
| 17 | RSEM-based expression matrix |
| 18 | Genomic region read mapping analysis |
| 19 | Taxonomic classification using Kraken2 |
| 20-22 | Tissue deconvolution (featureCounts, gDNA-corrected, RSEM) |
| 23 | rRNA detection using ribodetector |
| 24 | Comprehensive quality control summary generation |
| 25 | Coverage analysis and bigWig generation using EMapper |
| 26 | Coverage density plots for RNA types and meta-gene regions |
| 27 | Final interactive HTML report generation |
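One way to read the strand-specific subtraction behind the gDNA-corrected counts (Steps 13 and 16) is sketched below. It assumes that gDNA-derived fragments fall equally on both strands, so antisense-strand counts for a gene estimate the gDNA background in its sense-strand counts. This is an assumption about the approach, not a description of bin/Step_13_gDNA_corrected_featureCounts.py; the function name and gene labels are hypothetical.

```python
def gdna_corrected_counts(sense, antisense):
    """Subtract antisense counts from sense counts per gene, floored at zero."""
    genes = sense.keys() | antisense.keys()
    return {
        gene: max(sense.get(gene, 0) - antisense.get(gene, 0), 0)
        for gene in sorted(genes)
    }

# Toy per-gene counts from sense-strand and antisense-strand counting runs.
sense = {"geneA": 120, "geneB": 15, "geneC": 4}
antisense = {"geneA": 10, "geneB": 15, "geneC": 9}
print(gdna_corrected_counts(sense, antisense))
```

Flooring at zero avoids negative counts for genes where the antisense signal exceeds the sense signal.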
splice/kb interpretation. EVscope reports splice junction crossings per kilobase of uniquely mapped aligned sequence as a complementary genomic-DNA-contamination QC proxy. For paired-end STAR logs, the denominator uses both mates' uniquely mapped aligned bases per fragment/read pair, not insert size:

splice/kb = (Number of splices: Total × 1000) / (Uniquely mapped reads number × Average mapped length)

In an in-house EV RNA-seq pilot using EXODUS-M isolation, miRNeasy Advanced extraction and 400 µL plasma input (N=5), the single no-DNase control had splice/kb = 0.13; the TURBO DNase pilot libraries had splice/kb values of 1.67 (0.5 U, 10 min), 0.63 (0.25 U, 10 min), 1.64 (0.5 U, 5 min), and 0.94 (0.25 U, 5 min). Values closer to the no-DNase pilot value may indicate higher residual genomic-DNA-derived contribution, whereas higher values are directionally compatible with increased spliced/transcript-derived signal after DNase treatment. These empirical values are exploratory reference guides, not universal cutoffs and not a direct DNA quantification assay.
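The splice/kb formula can be applied directly to the three fields it names in a STAR Log.final.out file. The following is a minimal sketch, not EVscope's own implementation in Step 7: the function name `splice_per_kb` and the toy log values are made up for illustration.

```python
FIELDS = {
    "uniq_reads": "Uniquely mapped reads number",
    "avg_len": "Average mapped length",
    "splices": "Number of splices: Total",
}

def splice_per_kb(log_text):
    """Compute splice/kb from the text of a STAR Log.final.out file."""
    values = {}
    for line in log_text.splitlines():
        if "|" not in line:
            continue
        # STAR writes each metric as 'label |<tab>value'.
        key, value = (part.strip() for part in line.split("|", 1))
        for name, label in FIELDS.items():
            if key == label:
                values[name] = float(value)
    missing = sorted(set(FIELDS) - set(values))
    if missing:
        raise ValueError(f"fields not found in log: {missing}")
    aligned_bases = values["uniq_reads"] * values["avg_len"]
    if aligned_bases == 0:
        raise ValueError("no uniquely mapped bases")
    return values["splices"] * 1000 / aligned_bases

# Toy log excerpt with invented numbers.
toy_log = """
          Uniquely mapped reads number |\t1000000
                 Average mapped length |\t200.00
              Number of splices: Total |\t200000
"""
print(round(splice_per_kb(toy_log), 2))
```

With these toy numbers, 200,000 splices over 200,000,000 uniquely mapped bases gives a splice/kb of 1.0.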
Each sample generates an output directory with the following structure:
<sample_name>_EVscope_output/
├── Step_01_Raw_QC/ # FastQC reports for raw reads
├── Step_02_UMI_Analysis/ # UMI motif analysis and ACC fraction
├── Step_03_UMI_Adaptor_Trim/ # UMI extraction, adapter trimming, clean FASTQs
├── Step_04_Trimmed_QC/ # FastQC reports for trimmed reads
├── Step_05_Bacterial_Filter/ # BBSplit bacterial/mycoplasma screening
├── Step_06_Alignment_Initial/ # Initial STAR alignment and UMI-deduplicated FASTQs
├── Step_06_Alignment_Refined/ # Refined STAR alignment and UMI-deduplicated BAM
├── Step_07_Strand_Detection/ # Strandedness and splice/kb metrics
├── Step_08_CIRCexplorer2_circRNA/ # CIRCexplorer2 circRNA results
├── Step_09_CIRI2_circRNA/ # CIRI2 circRNA results
├── Step_10_circRNA_Merge/ # Merged circRNA results (CPM-normalized)
├── Step_11_RNA_Metrics/ # Picard RNA-seq metrics
├── Step_12_featureCounts_Quant/ # featureCounts gene quantification
├── Step_13_gDNA_Corrected_Quant/ # gDNA-corrected quantification
├── Step_14_RSEM_Quant/ # RSEM quantification
├── Step_15_featureCounts_Expression/ # Expression matrices (TPM/CPM)
├── Step_16_gDNA_Corrected_Expression/ # gDNA-corrected expression matrices
├── Step_17_RSEM_Expression/ # RSEM expression matrices
├── Step_18_Genomic_Regions/ # Meta-gene region mapping stats
├── Step_19_Taxonomy/ # Kraken2 taxonomic classification
├── Step_20-22_Deconvolution/ # Tissue deconvolution results
├── Step_23_rRNA_Detection/ # ribodetector rRNA detection
├── Step_24_MultiQC_Summary/ # QC summary matrix
├── Step_25_EMapper_BigWig_Quantification/ # EMapper coverage and bigWig
├── Step_26_BigWig_Density_Plot/ # RNA type and meta-gene density plots
├── Step_27_HTML_Report/ # Interactive HTML report
└── EVscope_pipeline.log # Pipeline execution log
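The TPM/CPM matrices in the expression steps follow the usual normalization definitions, sketched below for a single sample. This uses the standard formulas and is not necessarily identical to bin/Step_15_featureCounts2TPM.py; the gene names, counts, and lengths are invented.

```python
def cpm(counts):
    """Counts per million: scale raw counts by library size."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has zero counts")
    return {gene: c * 1e6 / total for gene, c in counts.items()}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("library has zero counts")
    return {gene: r * 1e6 / scale for gene, r in rates.items()}

counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}
print(cpm(counts))
print(tpm(counts, lengths_bp))
```

In the toy data, geneB has three times the reads of geneA but is also three times as long, so CPM differs by a factor of three while TPM is equal for both genes.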
- Dependency Not Found: Verify Conda environments with conda list -n evscope_env.
- Reference File Missing: Check EVscope.conf paths and file existence.
- Memory Issues: Picard requires up to 250 GB RAM. Reduce --threads or use a high-memory server.
- Step Failure: Review logs in <output_dir>/EVscope_pipeline.log.
Q: Can EVscope process non-SMARTer-seq data?
A: Yes, modify UMI parameters in bin/Step_02_*.py and bin/Step_03_UMIAdapterTrimR1.py.
Q: How do I run specific pipeline steps?
A: Use --run_steps, e.g., --run_steps 1,3,5-8.
Q: How do I view the final report?
A: Open Step_27_HTML_Report/<sample_name>_final_report.html in a web browser.
We welcome contributions! To contribute:
- Fork the repository: https://github.com/TheDongLab/EVscope.
- Create a feature branch: git checkout -b feature/YourFeature.
- Submit a pull request.
Please report bugs via GitHub Issues.
If you use EVscope in your research, please cite:
Zhao, Yiyong, et al. "EVscope: A Comprehensive Bioinformatics Pipeline for Accurate and Robust Analysis of Total RNA Sequencing from Extracellular Vesicles." bioRxiv (2025). Zenodo: https://doi.org/10.5281/zenodo.15577788
Authors:
- Yiyong Zhao: Data curation, Formal analysis, Software, Visualization
- Himanshu Chintalapudi: Visualization
- Ziqian Xu: Resources
- Weiqiang Liu: Data curation
- Yuxuan Hu: Validation
- Ewa Grassin: Resources [supporting]
- Xianjun Dong: Conceptualization, Methodology, Funding, Supervision
Affiliations:
- Stephen & Denise Adams Center for Parkinson's Disease Research of Yale School of Medicine, New Haven, CT 06510, USA
- Department of Neurology, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
- Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD 20815, USA
- Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Harvard University, Boston, MA, USA
Data Availability: Source code: https://github.com/TheDongLab/EVscope and Zenodo (https://doi.org/10.5281/zenodo.15577788), licensed under the MIT License. Raw sequencing data: NCBI SRA (accession: SRR31350808-SRR31350811).
Corresponding Author: Xianjun Dong (xianjun.dong@yale.edu)
EVscope source code is licensed under the MIT License.
- Xianjun Dong: xianjun.dong@yale.edu
- GitHub: https://github.com/TheDongLab/EVscope
