This note uses R kernel.

First load the data matrix, the files used are available from here.

library(Seurat)

# Read in the expression matrix The first row is a header row, the first column is rownames
exp.mat <- read.table(file = "./nestorawa_forcellcycle_expressionMatrix.txt", header = TRUE,
    as.is = TRUE, row.names = 1)

# A list of cell cycle markers, from Tirosh et al, 2015, is loaded with Seurat.  We can
# segregate this list into markers of G2/M phase and markers of S phase
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

# Create our Seurat object and complete the initalization steps
marrow <- CreateSeuratObject(counts = exp.mat)
marrow <- NormalizeData(marrow)
marrow <- FindVariableFeatures(marrow, selection.method = "vst")
marrow <- ScaleData(marrow, features = rownames(marrow))

Registered S3 method overwritten by 'spatstat.geom':
  method     from
  print.boxx cli 

Attaching SeuratObject

Centering and scaling data matrix

If we run a PCA on our object, using the variable genes we found in FindVariableFeatures() above, we see that while most of the variance can be explained by lineage, PC8 and PC10 are split on cell-cycle genes including TOP2A and MKI67. We will attempt to regress this signal from the data, so that cell-cycle heterogeneity does not contribute to PCA or downstream analysis.

marrow <- RunPCA(marrow, features = VariableFeatures(marrow), ndims.print = 6:10, nfeatures.print = 10)

PC_ 6 
Positive:  SELL, ARL6IP1, CCL9, CD34, ADGRL4, BPIFC, NUSAP1, FAM64A, CD244, C030034L19RIK 
Negative:  LY6C2, AA467197, CYBB, MGST2, ITGB2, PF4, CD74, ATP1B1, GP1BB, TREM3 
PC_ 7 
Positive:  F13A1, LY86, CFP, IRF8, CSF1R, TIFAB, IFI209, CCR2, TNS4, MS4A6C 
Negative:  HDC, CPA3, PGLYRP1, MS4A3, NKG7, UBE2C, CCNB1, NUSAP1, PLK1, FUT8 
PC_ 8 
Positive:  NUSAP1, UBE2C, KIF23, PLK1, CENPF, FAM64A, CCNB1, H2AFX, ID2, CDC20 
Negative:  WFDC17, SLC35D3, ADGRL4, VLDLR, CD33, H2AFY, P2RY14, IFI206, CCL9, CD34 
PC_ 9 
Positive:  IGKC, JCHAIN, LY6D, MZB1, CD74, IGLC2, FCRLA, IGKV4-50, IGHM, IGHV9-1 
Negative:  SLC2A6, HBA-A1, HBA-A2, IGHV8-7, FCER1G, F13A1, HBB-BS, PLD4, HBB-BT, IGFBP4 
PC_ 10 
Positive:  H2AFX, FAM64A, ZFP383, NUSAP1, CDC25B, CENPF, GBP10, TOP2A, GBP6, GFRA1 
Negative:  CTSW, XKRX, PRR5L, RORA, MBOAT4, A630014C17RIK, ZFP105, COL9A3, CLEC2I, TRAT1

DimHeatmap(marrow, dims = c(8, 10))

Assign Cell-Cycle Scores

In the CellCycleScoring() function, which stores S and G2/M scores in object meta data, along with the predicted classification of each cell in either G2M, S or G1 phase.

marrow <- CellCycleScoring(marrow, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)

# view cell cycle scores and phase assignments
head(marrow[[]])

Warning message:
“The following features are not present in the object: MLF1IP, GMNN, not searching for symbol synonyms”

RidgePlot(marrow, features = c("PCNA", "TOP2A", "MCM6", "MKI67"), ncol = 2)

Picking joint bandwidth of 0.144

Picking joint bandwidth of 0.143

Picking joint bandwidth of 0.177

Picking joint bandwidth of 0.129

s.genes

g2m.genes

For more information about how to regress this source of heterogeneity(cell cycles) out of the data, you can check here.

Check clusters and identify cells

Number of clusters could be adjust based on the background information.

marrow <- FindNeighbors(marrow, dims = 1:10)
marrow <- FindClusters(marrow, resolution = 0.5)

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 774
Number of edges: 21265

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8447
Number of communities: 6
Elapsed time: 0 seconds

head(Idents(marrow), 5)

Run umap and visualize:

marrow <- RunUMAP(marrow, dims = 1:10, verbose = FALSE)
DimPlot(marrow, reduction = "umap", label = TRUE)

Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”

Annotation with SingleR

library(celldex, verbose = FALSE)
hpca.se <- HumanPrimaryCellAtlasData()

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars


Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following objects are masked from ‘package:base’:

    expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomeInfoDb

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘Biobase’


The following object is masked from ‘package:MatrixGenerics’:

    rowMedians


The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians


Attaching package: ‘SummarizedExperiment’


The following object is masked from ‘package:SeuratObject’:

    Assays


The following object is masked from ‘package:Seurat’:

    Assays


snapshotDate(): 2021-10-19

see ?celldex and browseVignettes('celldex') for documentation

loading from cache

see ?celldex and browseVignettes('celldex') for documentation

loading from cache

hpca.se

class: SummarizedExperiment 
dim: 19363 713 
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont

pred.hesc <- SingleR::SingleR(GetAssayData(marrow, assay = "RNA", slot = "data"), clusters = Idents(marrow),ref = hpca.se, assay.type.test=1,
    labels = hpca.se$label.main)

pred.hesc

DataFrame with 6 rows and 5 columns
                          scores first.labels     tuning.scores        labels
                        <matrix>  <character>       <DataFrame>   <character>
0 0.377020:0.586039:0.612565:...          GMP 0.358065:0.267735           GMP
1 0.420151:0.546801:0.551765:...          MEP 0.412205:0.353706  Erythroblast
2 0.401024:0.572664:0.578866:...          MEP 0.273437:0.226113           MEP
3 0.378935:0.548739:0.578182:...          CMP 0.224248:0.218487           CMP
4 0.375977:0.579766:0.627907:...          GMP 0.246549:0.214529 Pro-Myelocyte
5 0.392911:0.556911:0.582148:...          MEP 0.251734:0.186742           MEP
  pruned.labels
    <character>
0           GMP
1  Erythroblast
2           MEP
3           CMP
4 Pro-Myelocyte
5           MEP

table(pred.hesc$labels)

          CMP  Erythroblast           GMP           MEP Pro-Myelocyte 
            1             1             1             2             1

Import cluster ident information back to seurat object.

marrow[["SingleR.cluster.labels"]] <- 
        pred.hesc$labels[match(Idents(marrow), rownames(pred.hesc))]

DimPlot(marrow, reduction = "umap", label = TRUE,group.by = "SingleR.cluster.labels")

Or identify cells individually:

pred.hesc2 <- SingleR::SingleR(GetAssayData(marrow, assay = "RNA", slot = "data"),ref = hpca.se, assay.type.test=1,
    labels = hpca.se$label.main)
table(pred.hesc2$labels)

          B_cell       BM & Prog.              CMP     Erythroblast 
               1               13              168              177 
             GMP       HSC_-G-CSF              MEP Pro-B_cell_CD34+ 
             184                1              169               14 
   Pro-Myelocyte 
              47

colors = c("#097559","#757575","#C29359","#C2101E","#EDDB51","#EDDCBE","#F7A3AA","#1BBCC2","#F6E246")
marrow[["SingleR.labels"]] <- pred.hesc2$labels
DimPlot(marrow, reduction = "umap", label = TRUE,group.by = "SingleR.labels")

Re-plot with ggplot2

Here to demostrate both the cell cyclcing information and cell definistion from SingleR, shapes and color are used to represent different layers of data.

colors = c("#097559","#757575","#C29359","#C2101E","#EDDB51","#EDDCBE","#F7A3AA","#1BBCC2","#F6E246")
DimPlot(marrow, reduction = "umap", label = TRUE, group.by = "SingleR.labels", cols = colors)

Export to use ggplot2 directly:

library(tidyverse, verbose = FALSE)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::collapse()   masks IRanges::collapse()
✖ dplyr::combine()    masks Biobase::combine(), BiocGenerics::combine()
✖ dplyr::count()      masks matrixStats::count()
✖ dplyr::desc()       masks IRanges::desc()
✖ tidyr::expand()     masks S4Vectors::expand()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::first()      masks S4Vectors::first()
✖ dplyr::lag()        masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce()     masks GenomicRanges::reduce(), IRanges::reduce()
✖ dplyr::rename()     masks S4Vectors::rename()
✖ dplyr::slice()      masks IRanges::slice()

umap_tx = marrow@reductions$umap@cell.embeddings %>% 
    as.data.frame() %>% 
    cbind(cycle = marrow@meta.data$Phase) %>%
    cbind(singleRid = marrow@meta.data$SingleR.labels)

options(repr.plot.width=12, repr.plot.height=8)

ggplot(umap_tx, aes(x=UMAP_1, y=UMAP_2, color=singleRid, shape=cycle)) + 
    geom_point(aes(size=cycle), alpha = 0.6 ) +
    scale_color_manual(values = colors) +
    scale_size_manual(values = c(6,3,3)) +
    scale_shape_manual(values = c(16,17,18)) +
    theme_bw() +
    theme(panel.border = element_blank(), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), 
          axis.line = element_line(colour = "black"))

We can tall from the plot that cell cycling phases could contribute to the ambiguity of annotation and clustering.

Session information

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] forcats_0.5.1               stringr_1.4.0              
 [3] dplyr_1.0.8                 purrr_0.3.4                
 [5] readr_2.1.2                 tidyr_1.2.0                
 [7] tibble_3.1.6                ggplot2_3.3.5              
 [9] tidyverse_1.3.1             celldex_1.4.0              
[11] SummarizedExperiment_1.24.0 Biobase_2.54.0             
[13] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
[15] IRanges_2.28.0              S4Vectors_0.32.3           
[17] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[19] matrixStats_0.61.0          SeuratObject_4.0.4         
[21] Seurat_4.1.0               

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                    reticulate_1.24              
  [3] tidyselect_1.1.1              RSQLite_2.2.9                
  [5] AnnotationDbi_1.56.2          htmlwidgets_1.5.4            
  [7] grid_4.1.2                    BiocParallel_1.28.3          
  [9] Rtsne_0.15                    munsell_0.5.0                
 [11] ScaledMatrix_1.2.0            codetools_0.2-18             
 [13] ica_1.0-2                     pbdZMQ_0.3-7                 
 [15] future_1.23.0                 miniUI_0.1.1.1               
 [17] withr_2.4.3                   colorspace_2.0-2             
 [19] filelock_1.0.2                uuid_1.0-3                   
 [21] rstudioapi_0.13               ROCR_1.0-11                  
 [23] tensor_1.5                    listenv_0.8.0                
 [25] labeling_0.4.2                repr_1.1.4                   
 [27] GenomeInfoDbData_1.2.7        polyclip_1.10-0              
 [29] bit64_4.0.5                   farver_2.1.0                 
 [31] parallelly_1.30.0             vctrs_0.3.8                  
 [33] generics_0.1.2                BiocFileCache_2.2.1          
 [35] R6_2.5.1                      rsvd_1.0.5                   
 [37] bitops_1.0-7                  spatstat.utils_2.3-0         
 [39] cachem_1.0.6                  DelayedArray_0.20.0          
 [41] assertthat_0.2.1              promises_1.2.0.1             
 [43] scales_1.1.1                  gtable_0.3.0                 
 [45] beachmat_2.10.0               globals_0.14.0               
 [47] goftest_1.2-3                 rlang_1.0.1                  
 [49] splines_4.1.2                 lazyeval_0.2.2               
 [51] spatstat.geom_2.3-1           broom_0.7.12                 
 [53] BiocManager_1.30.16           yaml_2.2.2                   
 [55] reshape2_1.4.4                abind_1.4-5                  
 [57] modelr_0.1.8                  backports_1.4.1              
 [59] httpuv_1.6.5                  tools_4.1.2                  
 [61] ellipsis_0.3.2                spatstat.core_2.3-2          
 [63] RColorBrewer_1.1-2            ggridges_0.5.3               
 [65] Rcpp_1.0.8                    plyr_1.8.6                   
 [67] base64enc_0.1-3               sparseMatrixStats_1.6.0      
 [69] zlibbioc_1.40.0               RCurl_1.98-1.6               
 [71] rpart_4.1.16                  deldir_1.0-6                 
 [73] pbapply_1.5-0                 cowplot_1.1.1                
 [75] zoo_1.8-9                     haven_2.4.3                  
 [77] ggrepel_0.9.1                 cluster_2.1.2                
 [79] fs_1.5.2                      magrittr_2.0.2               
 [81] data.table_1.14.2             RSpectra_0.16-0              
 [83] scattermore_0.7               reprex_2.0.1                 
 [85] lmtest_0.9-39                 RANN_2.6.1                   
 [87] fitdistrplus_1.1-6            hms_1.1.1                    
 [89] patchwork_1.1.1               mime_0.12                    
 [91] evaluate_0.14                 xtable_1.8-4                 
 [93] readxl_1.3.1                  gridExtra_2.3                
 [95] compiler_4.1.2                KernSmooth_2.23-20           
 [97] crayon_1.4.2                  htmltools_0.5.2              
 [99] tzdb_0.2.0                    mgcv_1.8-38                  
[101] later_1.3.0                   lubridate_1.8.0              
[103] DBI_1.1.2                     ExperimentHub_2.2.1          
[105] dbplyr_2.1.1                  MASS_7.3-55                  
[107] rappdirs_0.3.3                Matrix_1.4-0                 
[109] cli_3.1.1                     parallel_4.1.2               
[111] igraph_1.2.11                 pkgconfig_2.0.3              
[113] IRdisplay_1.1                 plotly_4.10.0                
[115] spatstat.sparse_2.1-0         xml2_1.3.3                   
[117] XVector_0.34.0                rvest_1.0.2                  
[119] digest_0.6.29                 sctransform_0.3.3            
[121] RcppAnnoy_0.0.19              SingleR_1.8.1                
[123] spatstat.data_2.1-2           Biostrings_2.62.0            
[125] cellranger_1.1.0              leiden_0.3.9                 
[127] uwot_0.1.11                   DelayedMatrixStats_1.16.0    
[129] curl_4.3.2                    shiny_1.7.1                  
[131] lifecycle_1.0.1               nlme_3.1-155                 
[133] jsonlite_1.7.3                BiocNeighbors_1.12.0         
[135] viridisLite_0.4.0             fansi_1.0.2                  
[137] pillar_1.7.0                  lattice_0.20-45              
[139] KEGGREST_1.34.0               fastmap_1.1.0                
[141] httr_1.4.2                    survival_3.2-13              
[143] interactiveDisplayBase_1.32.0 glue_1.6.1                   
[145] png_0.1-7                     BiocVersion_3.14.0           
[147] bit_4.0.4                     stringi_1.7.6                
[149] blob_1.2.2                    BiocSingular_1.10.0          
[151] AnnotationHub_3.2.1           memoise_2.0.1                
[153] IRkernel_1.3.0.9000           irlba_2.3.5                  
[155] future.apply_1.8.1

Created: 02/09/2022 by vikkki

	orig.ident	nCount_RNA	nFeature_RNA	S.Score	G2M.Score	Phase	old.ident
	<fct>	<dbl>	<int>	<dbl>	<dbl>	<chr>	<fct>
Prog_013	Prog	2563089	10211	-0.14248691	-0.4680395	G1	Prog
Prog_019	Prog	3030620	9991	-0.16915786	0.5851766	G2M	Prog
Prog_031	Prog	1293487	10192	-0.34627038	-0.3971879	G1	Prog
Prog_037	Prog	1357987	9599	-0.44270212	0.6820229	G2M	Prog
Prog_008	Prog	4079891	10540	0.55854051	0.1284359	S	Prog
Prog_014	Prog	2569783	10788	0.07116218	0.3166073	G2M	Prog