CATFISHR: inferred CNV And Transcriptome Framework for Identification of Similarity to High-confidence References
CATFISHR identifies malignant cells by integrating RNA expression similarity and inferred CNV similarity to high-confidence malignant reference clusters.
Install the development version from GitHub:
install.packages("remotes")
remotes::install_github("LevesqueLabHub/CATFISHR",
                        build_vignettes = TRUE,
                        dependencies = TRUE)
CATFISHR identifies malignant cells by comparing each cell to user-selected high-confidence malignant reference clusters in both RNA and CNV space.
The workflow has three main steps:
- Calculate Mahalanobis distances from each cell to malignant reference clusters in RNA and CNV PCA space.
- Combine RNA and CNV distance outputs.
- Run mean shift clustering in the integrated RNA-CNV distance space.
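The first step above can be illustrated with base R alone. The following is a minimal sketch on simulated data, not the internals of `calc_mahalanobis()`: the matrix, cluster labels, and the choice to pool all reference clusters into a single reference distribution are assumptions made for illustration.

```r
# Illustrative sketch only: base R, simulated data, not CATFISHR internals.
set.seed(1)

# Hypothetical cells-by-PCs matrix: 60 cells, 3 PCs
pca_matrix <- matrix(rnorm(60 * 3), ncol = 3,
                     dimnames = list(paste0("cell", 1:60), paste0("PC", 1:3)))
# Shift the last 20 cells so they look unlike the reference cells
pca_matrix[41:60, ] <- pca_matrix[41:60, ] + 5

clusters <- setNames(rep(c("0", "1", "2"), each = 20), rownames(pca_matrix))
ref_clusters <- c("0", "1")

# Assumption: centre and covariance are estimated from the pooled reference cells
ref_cells <- pca_matrix[clusters %in% ref_clusters, , drop = FALSE]
ref_center <- colMeans(ref_cells)
ref_cov <- cov(ref_cells)

# stats::mahalanobis() returns squared distances, so take the square root
mahal_dist <- sqrt(mahalanobis(pca_matrix, center = ref_center, cov = ref_cov))

# Median distance per cluster: the shifted cluster "2" sits farthest away
tapply(mahal_dist, clusters, median)
```

Cells from the reference clusters end up close to the reference distribution, while the shifted cluster receives much larger distances. The same calculation would be repeated independently in CNV PCA space.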
Users are responsible for:
- Generating PCA embeddings in RNA and CNV space
- Clustering cells in each space
- Selecting high-confidence malignant reference clusters
- Interpreting mean shift cluster output
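As a rough illustration of the user-side preparation listed above, the sketch below builds a PCA embedding and cluster labels with base R. `prcomp()` and `kmeans()` are stand-ins for whatever single-cell pipeline is actually used; the simulated matrix, the number of PCs, the value of `k`, and the chosen reference cluster are all placeholders rather than recommendations.

```r
# Illustrative sketch only: simulated data and base R stand-ins for a real
# single-cell pipeline; none of this is CATFISHR code.
set.seed(42)

# Hypothetical cells-by-features matrix: 100 cells, 20 features
expr <- matrix(rnorm(100 * 20), nrow = 100,
               dimnames = list(paste0("cell", 1:100), paste0("gene", 1:20)))

# PCA embedding: rows of pca$x are cells, columns are PCs
pca <- prcomp(expr, center = TRUE, scale. = TRUE)
pca_matrix <- pca$x[, 1:5]

# Cluster cells in PCA space; k = 4 is an arbitrary choice for illustration
km <- kmeans(pca_matrix, centers = 4, nstart = 10)
clusters <- setNames(as.character(km$cluster), rownames(pca_matrix))

# Reference clusters are chosen by the user from prior evidence;
# "1" is a placeholder here
ref_clusters <- c("1")
```

The resulting `pca_matrix`, `clusters`, and `ref_clusters` objects have the shapes described in the input requirements: a cells-by-PCs matrix, a named vector of cluster labels, and a character vector of reference cluster names.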
CATFISHR requires three inputs for each data space, RNA and CNV:
- PCA matrix: a cells-by-PCs matrix. For RNA, this can be generated from scRNA-seq expression data. For CNV, this can be generated from the residual expression produced by CNV inference methods such as inferCNV.
- Cluster membership: a vector of cluster assignments, one per cell.
- Reference clusters: one or more high-confidence malignant clusters. All query cells are compared against these reference clusters.
For example:
library(CATFISHR)
# Example: extract from Seurat objects
rna_pca <- Seurat::Embeddings(rna_seurat, "pca")
cnv_pca <- Seurat::Embeddings(cnv_seurat, "pca")
rna_clusters <- setNames(rna_seurat$seurat_clusters, colnames(rna_seurat))
cnv_clusters <- setNames(cnv_seurat$seurat_clusters, colnames(cnv_seurat))
# User-defined malignant reference clusters
rna_ref_clusters <- c("0", "2", "5")
cnv_ref_clusters <- c("1", "3")
CATFISHR includes a small toy dataset for demonstrating the workflow.
library(CATFISHR)
data("RNA_catfishr_data", package = "CATFISHR")
data("CNV_catfishr_data", package = "CATFISHR")
names(RNA_catfishr_data)
names(CNV_catfishr_data)
Each toy dataset contains the required inputs:
RNA_catfishr_data$pca_matrix
RNA_catfishr_data$clusters
RNA_catfishr_data$ref_clusters
CNV_catfishr_data$pca_matrix
CNV_catfishr_data$clusters
CNV_catfishr_data$ref_clusters
Calculate distances separately in RNA and CNV space.
mahal_RNA <- calc_mahalanobis(
  pca_matrix = RNA_catfishr_data$pca_matrix,
  clusters = RNA_catfishr_data$clusters,
  ref_clusters = RNA_catfishr_data$ref_clusters,
  n_pcs = 3
)
mahal_CNV <- calc_mahalanobis(
  pca_matrix = CNV_catfishr_data$pca_matrix,
  clusters = CNV_catfishr_data$clusters,
  ref_clusters = CNV_catfishr_data$ref_clusters,
  n_pcs = 3
)
# Combine the RNA and CNV distance outputs
cm_output <- format_mahal_output(
  mahal_RNA = mahal_RNA,
  mahal_CNV = mahal_CNV
)
# Run mean shift clustering in the integrated RNA-CNV distance space
ms_output <- run_mean_shift(
  mahal_df = cm_output,
  bandwidths = c(0.4, 0.5, 0.6, 0.7, 0.8),
  max_clusters = 7,
  iterations = 500,
  sample_col = "sample_barcode",
  rna_dist_col = "Mahal_Dist_RNA",
  cnv_dist_col = "Mahal_Dist_CNV"
)
# Extracting data for plotting
ms_plotting <- ms_output$data
# ggplot2 provides ggplot(), aes(), geom_point(), and labs()
library(ggplot2)
ggplot(ms_plotting,
       aes(x = log2(Mahal_Dist_CNV), y = log2(Mahal_Dist_RNA))) +
  geom_point(aes(color = factor(Assignment))) +
  labs(color = "Mean Shift\nAssignment")
Mean shift clusters with low Mahalanobis distance to the malignant reference clusters in both RNA and CNV space are interpreted as malignant candidates. The final malignant/non-malignant call is made by the user based on the mean shift output.
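One way to support that interpretation is to summarize the distances per mean shift assignment. The sketch below uses a small made-up data frame in place of `ms_output$data`; the column names follow those used in the plotting code, but the values and the two assignments are invented for illustration.

```r
# Illustrative sketch only: a made-up data frame standing in for ms_output$data.
ms_plotting <- data.frame(
  sample_barcode = paste0("cell", 1:6),
  Mahal_Dist_RNA = c(1.2, 0.9, 1.5, 8.0, 9.5, 7.2),
  Mahal_Dist_CNV = c(1.0, 1.4, 0.8, 6.5, 7.1, 9.0),
  Assignment     = c(1, 1, 1, 2, 2, 2)
)

# Median distance in each space, per mean shift assignment
cluster_summary <- aggregate(
  cbind(Mahal_Dist_RNA, Mahal_Dist_CNV) ~ Assignment,
  data = ms_plotting,
  FUN = median
)
cluster_summary
```

In this toy table, assignment 1 has low median distances in both spaces and would be the malignant candidate, while assignment 2 sits far from the reference in both. Any threshold for "low" remains a user decision.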
vignette("CATFISHR", package = "CATFISHR")
Shukla et al., in preparation.