Annotating unknown clusters in scRNAseq data using CIPR: A cross-species reference comparison
We recently published a Shiny (R) app called Cluster Identity Predictor (CIPR) in BMC Bioinformatics. It helps annotate unknown clusters in single-cell RNAseq (scRNAseq) datasets.
CIPR does its thing by comparing the gene expression signatures of unknown clusters against signatures from known cell types. CIPR features 7 reference datasets (2 from mouse and 5 from humans) in addition to the ability to provide a custom reference dataset (see more here). The pre-loaded reference files are as follows:
Immunological Genome Project (ImmGen) microarray data from sorted mouse immune cells. This dataset contains 296 samples from 20 different cell types (253 subtypes).
Mouse RNAseq data from sorted cells reported in Benayoun et al. (2019). This dataset contains 358 sorted immune and nonimmune samples from 18 different lineages (28 subtypes).
Human Primary Cell Atlas that contains microarray data from 713 sorted immune and nonimmune cells (37 main cell types and 157 subtypes).
DICE (Database for Immune Cell Expression/eQTLs/Epigenomics), which contains 1561 human immune samples from 5 main cell types (15 subtypes).
Human microarray data from sorted hematopoietic cells reported in Novershtern et al. (2011). This dataset contains 211 samples from 17 main cell types (38 subtypes).
Human RNAseq data from sorted cells reported in Monaco et al. (2019). This dataset contains 114 samples originating from 11 main cell types (29 subtypes).
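To make the idea of comparing signatures concrete, here is a minimal Python sketch of a logFC dot product score. It is an illustration only, not CIPR's actual code (CIPR is written in R). The function name `identity_scores`, the gene values, and the two reference profiles are all made up for this example, and I am assuming the score is simply the sum of cluster logFC times reference logFC over shared genes.

```python
from typing import Dict


def identity_scores(
    cluster_logfc: Dict[str, float],
    reference_logfc: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Score one cluster against every reference cell type.

    Assumption: the score is the dot product of the cluster's logFC values
    and the reference's logFC values, restricted to genes present in both.
    """
    scores = {}
    for cell_type, ref in reference_logfc.items():
        shared_genes = sorted(cluster_logfc.keys() & ref.keys())
        scores[cell_type] = sum(cluster_logfc[g] * ref[g] for g in shared_genes)
    return scores


# Hypothetical logFC values for one unknown cluster (not real data).
cluster = {"CD8A": 2.0, "GZMB": 1.5, "MS4A1": -1.0}

# Hypothetical reference signatures (not taken from any CIPR reference).
reference = {
    "CD8 T cell": {"CD8A": 1.8, "GZMB": 1.2, "MS4A1": -0.9},
    "B cell": {"CD8A": -1.1, "GZMB": -0.5, "MS4A1": 2.4},
}

scores = identity_scores(cluster, reference)
best = max(scores, key=scores.get)
print(best, scores)
```

Genes that move in the same direction in the cluster and in the reference push the score up, and genes that move in opposite directions pull it down, so the highest-scoring reference is the closest match under this scheme.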
I routinely use CIPR to analyze scRNAseq datasets, and I was curious to see how the predictions would change if I used different reference datasets on the same experimental data. To satisfy my curiosity, I tested CIPR on scRNAseq data from Tirosh et al., published in Science in 2016. This highly cited study characterizes the immune landscape in human melanoma tumors.
Using the Seurat R package, I found 15 single-cell clusters (numbered 0 to 14) in this dataset, and I examined the identity scores that CIPR predicted from differentially expressed genes (logFC comparison method). The table below summarizes the top predictions: the first column is the unknown cluster, and the remaining columns are the predictions from the different reference datasets (note that the ImmGen reference originates from mouse, while the others are human references):
| 0 | CD8 Eff | CD8 Tem | Tgd | CD9 Nai-act | CD8 Tem | CD8 Tem |
| 1 | CD4 Nai-Early act | Treg | Treg-CD4 Tcm | CD4 Th17-Nai act | Nai CD4 | Tfh |
| 2 | CD8 Nai-Early act | CD8 Tem | CD8 Tcm | CD4 Th1/17 | CD8 Tcm-Tem | MAIT-gDT |
| 3 | Nai CD4 | CD4 | Nai CD4 | Treg | Nai CD4 | Nai CD8 |
| 6 | B cell | B cell | B cell | B cell | B cell | B cell |
| 7 | B cell | B cell | B cell | B cell | B cell | B cell |
| 9 | Stromal-Eosino | Muscle-B cell | Gametocyte-B cell | B cell | Erythroid-B cell | B cell |
| 10 | CD8 Eff | CD8 Tcm-Tem | CD4 Tem | CD4 Th1/17 | CD8 Tem | CD8 Eff |
| 13 | B cell | B cell | B cell | B cell | B cell | B cell |
| 14 | B cell | B cell | B cell | B cell | B cell | B cell |
At a glance, we can see that some clusters are annotated consistently while others show more variance. It is pretty interesting that natural killer (NK) cell and B cell clusters were identified similarly by all the reference datasets, including the mouse ImmGen reference. This suggests that NK cells and B cells are each characterized by a distinct gene signature that is consistent between mice and humans!
On the other hand, myeloid cells (such as macrophages, monocytes, and granulocytes) show high variance across references, suggesting that these cells have overlapping gene signatures that make them difficult to tease apart. Along these lines, myeloid cell subsets may also have a higher degree of species specificity.
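One rough way to put a number on "consistent versus variable" is to count how many references agree on the most common label for each cluster. The Python sketch below does this for three rows copied from the table above. The helper name `agreement` is hypothetical, and exact string matching is a crude assumption: it treats related labels such as "Nai CD4" and "CD4" as different calls.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Top predictions per cluster, copied from three rows of the table above
# (one label per reference dataset).
predictions: Dict[int, List[str]] = {
    3: ["Nai CD4", "CD4", "Nai CD4", "Treg", "Nai CD4", "Nai CD8"],
    6: ["B cell", "B cell", "B cell", "B cell", "B cell", "B cell"],
    9: ["Stromal-Eosino", "Muscle-B cell", "Gametocyte-B cell",
        "B cell", "Erythroid-B cell", "B cell"],
}


def agreement(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of references giving it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


consensus = {cluster: agreement(labels) for cluster, labels in predictions.items()}

for cluster, (label, fraction) in consensus.items():
    print(f"cluster {cluster}: {label} ({fraction:.0%} of references)")
```

With this measure, cluster 6 gets full agreement, while cluster 9 is called a plain "B cell" by only two of the six references, even though most of the remaining labels also mention B cells. Grouping labels into broader lineages before counting would likely give a fairer picture.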
If you used CIPR in your studies, let me know how it worked for you!