This note uses R kernel.

First load the data matrix, the files used are available from here.

library(Seurat)

# Read in the expression matrix The first row is a header row, the first column is rownames
exp.mat <- read.table(file = "./nestorawa_forcellcycle_expressionMatrix.txt", header = TRUE,
    as.is = TRUE, row.names = 1)

# A list of cell cycle markers, from Tirosh et al, 2015, is loaded with Seurat.  We can
# segregate this list into markers of G2/M phase and markers of S phase
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

# Create our Seurat object and complete the initalization steps
marrow <- CreateSeuratObject(counts = exp.mat)
marrow <- NormalizeData(marrow)
marrow <- FindVariableFeatures(marrow, selection.method = "vst")
marrow <- ScaleData(marrow, features = rownames(marrow))
Registered S3 method overwritten by 'spatstat.geom':
  method     from
  print.boxx cli 

Attaching SeuratObject

Centering and scaling data matrix

If we run a PCA on our object, using the variable genes we found in FindVariableFeatures() above, we see that while most of the variance can be explained by lineage, PC8 and PC10 are split on cell-cycle genes including TOP2A and MKI67. We will attempt to regress this signal from the data, so that cell-cycle heterogeneity does not contribute to PCA or downstream analysis.

marrow <- RunPCA(marrow, features = VariableFeatures(marrow), ndims.print = 6:10, nfeatures.print = 10)
PC_ 6 
Positive:  SELL, ARL6IP1, CCL9, CD34, ADGRL4, BPIFC, NUSAP1, FAM64A, CD244, C030034L19RIK 
Negative:  LY6C2, AA467197, CYBB, MGST2, ITGB2, PF4, CD74, ATP1B1, GP1BB, TREM3 
PC_ 7 
Positive:  F13A1, LY86, CFP, IRF8, CSF1R, TIFAB, IFI209, CCR2, TNS4, MS4A6C 
Negative:  HDC, CPA3, PGLYRP1, MS4A3, NKG7, UBE2C, CCNB1, NUSAP1, PLK1, FUT8 
PC_ 8 
Positive:  NUSAP1, UBE2C, KIF23, PLK1, CENPF, FAM64A, CCNB1, H2AFX, ID2, CDC20 
Negative:  WFDC17, SLC35D3, ADGRL4, VLDLR, CD33, H2AFY, P2RY14, IFI206, CCL9, CD34 
PC_ 9 
Positive:  IGKC, JCHAIN, LY6D, MZB1, CD74, IGLC2, FCRLA, IGKV4-50, IGHM, IGHV9-1 
Negative:  SLC2A6, HBA-A1, HBA-A2, IGHV8-7, FCER1G, F13A1, HBB-BS, PLD4, HBB-BT, IGFBP4 
PC_ 10 
Positive:  H2AFX, FAM64A, ZFP383, NUSAP1, CDC25B, CENPF, GBP10, TOP2A, GBP6, GFRA1 
Negative:  CTSW, XKRX, PRR5L, RORA, MBOAT4, A630014C17RIK, ZFP105, COL9A3, CLEC2I, TRAT1 

DimHeatmap(marrow, dims = c(8, 10))

Assign Cell-Cycle Scores

In the CellCycleScoring() function, which stores S and G2/M scores in object meta data, along with the predicted classification of each cell in either G2M, S or G1 phase.

marrow <- CellCycleScoring(marrow, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)

# view cell cycle scores and phase assignments
head(marrow[[]])
Warning message:
“The following features are not present in the object: MLF1IP, GMNN, not searching for symbol synonyms”
A data.frame: 6 × 7
orig.identnCount_RNAnFeature_RNAS.ScoreG2M.ScorePhaseold.ident
<fct><dbl><int><dbl><dbl><chr><fct>
Prog_013Prog256308910211-0.14248691-0.4680395G1 Prog
Prog_019Prog3030620 9991-0.16915786 0.5851766G2MProg
Prog_031Prog129348710192-0.34627038-0.3971879G1 Prog
Prog_037Prog1357987 9599-0.44270212 0.6820229G2MProg
Prog_008Prog407989110540 0.55854051 0.1284359S Prog
Prog_014Prog256978310788 0.07116218 0.3166073G2MProg
RidgePlot(marrow, features = c("PCNA", "TOP2A", "MCM6", "MKI67"), ncol = 2)
Picking joint bandwidth of 0.144

Picking joint bandwidth of 0.143

Picking joint bandwidth of 0.177

Picking joint bandwidth of 0.129

s.genes
<ol class=list-inline>
  • 'MCM5'
  • 'PCNA'
  • 'TYMS'
  • 'FEN1'
  • 'MCM2'
  • 'MCM4'
  • 'RRM1'
  • 'UNG'
  • 'GINS2'
  • 'MCM6'
  • 'CDCA7'
  • 'DTL'
  • 'PRIM1'
  • 'UHRF1'
  • 'MLF1IP'
  • 'HELLS'
  • 'RFC2'
  • 'RPA2'
  • 'NASP'
  • 'RAD51AP1'
  • 'GMNN'
  • 'WDR76'
  • 'SLBP'
  • 'CCNE2'
  • 'UBR7'
  • 'POLD3'
  • 'MSH2'
  • 'ATAD2'
  • 'RAD51'
  • 'RRM2'
  • 'CDC45'
  • 'CDC6'
  • 'EXO1'
  • 'TIPIN'
  • 'DSCC1'
  • 'BLM'
  • 'CASP8AP2'
  • 'USP1'
  • 'CLSPN'
  • 'POLA1'
  • 'CHAF1B'
  • 'BRIP1'
  • 'E2F8'
  • </ol>
    g2m.genes
    
    <ol class=list-inline>
  • 'HMGB2'
  • 'CDK1'
  • 'NUSAP1'
  • 'UBE2C'
  • 'BIRC5'
  • 'TPX2'
  • 'TOP2A'
  • 'NDC80'
  • 'CKS2'
  • 'NUF2'
  • 'CKS1B'
  • 'MKI67'
  • 'TMPO'
  • 'CENPF'
  • 'TACC3'
  • 'FAM64A'
  • 'SMC4'
  • 'CCNB2'
  • 'CKAP2L'
  • 'CKAP2'
  • 'AURKB'
  • 'BUB1'
  • 'KIF11'
  • 'ANP32E'
  • 'TUBB4B'
  • 'GTSE1'
  • 'KIF20B'
  • 'HJURP'
  • 'CDCA3'
  • 'HN1'
  • 'CDC20'
  • 'TTK'
  • 'CDC25C'
  • 'KIF2C'
  • 'RANGAP1'
  • 'NCAPD2'
  • 'DLGAP5'
  • 'CDCA2'
  • 'CDCA8'
  • 'ECT2'
  • 'KIF23'
  • 'HMMR'
  • 'AURKA'
  • 'PSRC1'
  • 'ANLN'
  • 'LBR'
  • 'CKAP5'
  • 'CENPE'
  • 'CTCF'
  • 'NEK2'
  • 'G2E3'
  • 'GAS2L3'
  • 'CBX5'
  • 'CENPA'
  • </ol>

    For more information about how to regress this source of heterogeneity(cell cycles) out of the data, you can check here.

    Check clusters and identify cells

    Number of clusters could be adjust based on the background information.

    marrow <- FindNeighbors(marrow, dims = 1:10)
    marrow <- FindClusters(marrow, resolution = 0.5)
    
    Computing nearest neighbor graph
    
    Computing SNN
    
    
    Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
    
    Number of nodes: 774
    Number of edges: 21265
    
    Running Louvain algorithm...
    Maximum modularity in 10 random starts: 0.8447
    Number of communities: 6
    Elapsed time: 0 seconds
    
    head(Idents(marrow), 5)
    
    <dl class=dl-inline>
    Prog_013
    0
    Prog_019
    0
    Prog_031
    3
    Prog_037
    5
    Prog_008
    1
    </dl>
    <summary style=display:list-item;cursor:pointer> Levels: </summary> <ol class=list-inline>
  • '0'
  • '1'
  • '2'
  • '3'
  • '4'
  • '5'
  • </ol>

    Run umap and visualize:

    marrow <- RunUMAP(marrow, dims = 1:10, verbose = FALSE)
    DimPlot(marrow, reduction = "umap", label = TRUE)
    
    Warning message:
    “The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
    To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
    This message will be shown once per session”
    

    Annotation with SingleR

    library(celldex, verbose = FALSE)
    hpca.se <- HumanPrimaryCellAtlasData()
    
    Loading required package: SummarizedExperiment
    
    Loading required package: MatrixGenerics
    
    Loading required package: matrixStats
    
    
    Attaching package: ‘MatrixGenerics’
    
    
    The following objects are masked from ‘package:matrixStats’:
    
        colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
        colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
        colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
        colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
        colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
        colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
        colWeightedMeans, colWeightedMedians, colWeightedSds,
        colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
        rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
        rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
        rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
        rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
        rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
        rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
        rowWeightedSds, rowWeightedVars
    
    
    Loading required package: GenomicRanges
    
    Loading required package: stats4
    
    Loading required package: BiocGenerics
    
    
    Attaching package: ‘BiocGenerics’
    
    
    The following objects are masked from ‘package:stats’:
    
        IQR, mad, sd, var, xtabs
    
    
    The following objects are masked from ‘package:base’:
    
        anyDuplicated, append, as.data.frame, basename, cbind, colnames,
        dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
        grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
        order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
        rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
        union, unique, unsplit, which.max, which.min
    
    
    Loading required package: S4Vectors
    
    
    Attaching package: ‘S4Vectors’
    
    
    The following objects are masked from ‘package:base’:
    
        expand.grid, I, unname
    
    
    Loading required package: IRanges
    
    Loading required package: GenomeInfoDb
    
    Loading required package: Biobase
    
    Welcome to Bioconductor
    
        Vignettes contain introductory material; view with
        'browseVignettes()'. To cite Bioconductor, see
        'citation("Biobase")', and for packages 'citation("pkgname")'.
    
    
    
    Attaching package: ‘Biobase’
    
    
    The following object is masked from ‘package:MatrixGenerics’:
    
        rowMedians
    
    
    The following objects are masked from ‘package:matrixStats’:
    
        anyMissing, rowMedians
    
    
    
    Attaching package: ‘SummarizedExperiment’
    
    
    The following object is masked from ‘package:SeuratObject’:
    
        Assays
    
    
    The following object is masked from ‘package:Seurat’:
    
        Assays
    
    
    snapshotDate(): 2021-10-19
    
    see ?celldex and browseVignettes('celldex') for documentation
    
    loading from cache
    
    see ?celldex and browseVignettes('celldex') for documentation
    
    loading from cache
    
    
    hpca.se
    
    class: SummarizedExperiment 
    dim: 19363 713 
    metadata(0):
    assays(1): logcounts
    rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
    rowData names(0):
    colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
    colData names(3): label.main label.fine label.ont
    pred.hesc <- SingleR::SingleR(GetAssayData(marrow, assay = "RNA", slot = "data"), clusters = Idents(marrow),ref = hpca.se, assay.type.test=1,
        labels = hpca.se$label.main)
    
    pred.hesc
    
    DataFrame with 6 rows and 5 columns
                              scores first.labels     tuning.scores        labels
                            <matrix>  <character>       <DataFrame>   <character>
    0 0.377020:0.586039:0.612565:...          GMP 0.358065:0.267735           GMP
    1 0.420151:0.546801:0.551765:...          MEP 0.412205:0.353706  Erythroblast
    2 0.401024:0.572664:0.578866:...          MEP 0.273437:0.226113           MEP
    3 0.378935:0.548739:0.578182:...          CMP 0.224248:0.218487           CMP
    4 0.375977:0.579766:0.627907:...          GMP 0.246549:0.214529 Pro-Myelocyte
    5 0.392911:0.556911:0.582148:...          MEP 0.251734:0.186742           MEP
      pruned.labels
        <character>
    0           GMP
    1  Erythroblast
    2           MEP
    3           CMP
    4 Pro-Myelocyte
    5           MEP
    table(pred.hesc$labels)
    
              CMP  Erythroblast           GMP           MEP Pro-Myelocyte 
                1             1             1             2             1 

    Import cluster ident information back to seurat object.

    marrow[["SingleR.cluster.labels"]] <- 
            pred.hesc$labels[match(Idents(marrow), rownames(pred.hesc))]
    
    DimPlot(marrow, reduction = "umap", label = TRUE,group.by = "SingleR.cluster.labels")
    

    Or identify cells individually:

    pred.hesc2 <- SingleR::SingleR(GetAssayData(marrow, assay = "RNA", slot = "data"),ref = hpca.se, assay.type.test=1,
        labels = hpca.se$label.main)
    table(pred.hesc2$labels)
    
              B_cell       BM & Prog.              CMP     Erythroblast 
                   1               13              168              177 
                 GMP       HSC_-G-CSF              MEP Pro-B_cell_CD34+ 
                 184                1              169               14 
       Pro-Myelocyte 
                  47 
    colors = c("#097559","#757575","#C29359","#C2101E","#EDDB51","#EDDCBE","#F7A3AA","#1BBCC2","#F6E246")
    marrow[["SingleR.labels"]] <- pred.hesc2$labels
    DimPlot(marrow, reduction = "umap", label = TRUE,group.by = "SingleR.labels")
    

    Re-plot with ggplot2

    Here to demostrate both the cell cyclcing information and cell definistion from SingleR, shapes and color are used to represent different layers of data.

    colors = c("#097559","#757575","#C29359","#C2101E","#EDDB51","#EDDCBE","#F7A3AA","#1BBCC2","#F6E246")
    DimPlot(marrow, reduction = "umap", label = TRUE, group.by = "SingleR.labels", cols = colors)
    

    Export to use ggplot2 directly:

    library(tidyverse, verbose = FALSE)
    
    ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
    
     ggplot2 3.3.5      purrr   0.3.4
     tibble  3.1.6      dplyr   1.0.8
     tidyr   1.2.0      stringr 1.4.0
     readr   2.1.2      forcats 0.5.1
    
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
     dplyr::collapse()   masks IRanges::collapse()
     dplyr::combine()    masks Biobase::combine(), BiocGenerics::combine()
     dplyr::count()      masks matrixStats::count()
     dplyr::desc()       masks IRanges::desc()
     tidyr::expand()     masks S4Vectors::expand()
     dplyr::filter()     masks stats::filter()
     dplyr::first()      masks S4Vectors::first()
     dplyr::lag()        masks stats::lag()
     ggplot2::Position() masks BiocGenerics::Position(), base::Position()
     purrr::reduce()     masks GenomicRanges::reduce(), IRanges::reduce()
     dplyr::rename()     masks S4Vectors::rename()
     dplyr::slice()      masks IRanges::slice()
    
    
    umap_tx = marrow@reductions$umap@cell.embeddings %>% 
        as.data.frame() %>% 
        cbind(cycle = marrow@meta.data$Phase) %>%
        cbind(singleRid = marrow@meta.data$SingleR.labels)
    
    options(repr.plot.width=12, repr.plot.height=8)
    
    ggplot(umap_tx, aes(x=UMAP_1, y=UMAP_2, color=singleRid, shape=cycle)) + 
        geom_point(aes(size=cycle), alpha = 0.6 ) +
        scale_color_manual(values = colors) +
        scale_size_manual(values = c(6,3,3)) +
        scale_shape_manual(values = c(16,17,18)) +
        theme_bw() +
        theme(panel.border = element_blank(), 
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(), 
              axis.line = element_line(colour = "black"))
    

    We can tall from the plot that cell cycling phases could contribute to the ambiguity of annotation and clustering.

    Session information

    sessionInfo()
    
    R version 4.1.2 (2021-11-01)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Debian GNU/Linux 11 (bullseye)
    
    Matrix products: default
    BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
    LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats4    stats     graphics  grDevices utils     datasets  methods  
    [8] base     
    
    other attached packages:
     [1] forcats_0.5.1               stringr_1.4.0              
     [3] dplyr_1.0.8                 purrr_0.3.4                
     [5] readr_2.1.2                 tidyr_1.2.0                
     [7] tibble_3.1.6                ggplot2_3.3.5              
     [9] tidyverse_1.3.1             celldex_1.4.0              
    [11] SummarizedExperiment_1.24.0 Biobase_2.54.0             
    [13] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
    [15] IRanges_2.28.0              S4Vectors_0.32.3           
    [17] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
    [19] matrixStats_0.61.0          SeuratObject_4.0.4         
    [21] Seurat_4.1.0               
    
    loaded via a namespace (and not attached):
      [1] utf8_1.2.2                    reticulate_1.24              
      [3] tidyselect_1.1.1              RSQLite_2.2.9                
      [5] AnnotationDbi_1.56.2          htmlwidgets_1.5.4            
      [7] grid_4.1.2                    BiocParallel_1.28.3          
      [9] Rtsne_0.15                    munsell_0.5.0                
     [11] ScaledMatrix_1.2.0            codetools_0.2-18             
     [13] ica_1.0-2                     pbdZMQ_0.3-7                 
     [15] future_1.23.0                 miniUI_0.1.1.1               
     [17] withr_2.4.3                   colorspace_2.0-2             
     [19] filelock_1.0.2                uuid_1.0-3                   
     [21] rstudioapi_0.13               ROCR_1.0-11                  
     [23] tensor_1.5                    listenv_0.8.0                
     [25] labeling_0.4.2                repr_1.1.4                   
     [27] GenomeInfoDbData_1.2.7        polyclip_1.10-0              
     [29] bit64_4.0.5                   farver_2.1.0                 
     [31] parallelly_1.30.0             vctrs_0.3.8                  
     [33] generics_0.1.2                BiocFileCache_2.2.1          
     [35] R6_2.5.1                      rsvd_1.0.5                   
     [37] bitops_1.0-7                  spatstat.utils_2.3-0         
     [39] cachem_1.0.6                  DelayedArray_0.20.0          
     [41] assertthat_0.2.1              promises_1.2.0.1             
     [43] scales_1.1.1                  gtable_0.3.0                 
     [45] beachmat_2.10.0               globals_0.14.0               
     [47] goftest_1.2-3                 rlang_1.0.1                  
     [49] splines_4.1.2                 lazyeval_0.2.2               
     [51] spatstat.geom_2.3-1           broom_0.7.12                 
     [53] BiocManager_1.30.16           yaml_2.2.2                   
     [55] reshape2_1.4.4                abind_1.4-5                  
     [57] modelr_0.1.8                  backports_1.4.1              
     [59] httpuv_1.6.5                  tools_4.1.2                  
     [61] ellipsis_0.3.2                spatstat.core_2.3-2          
     [63] RColorBrewer_1.1-2            ggridges_0.5.3               
     [65] Rcpp_1.0.8                    plyr_1.8.6                   
     [67] base64enc_0.1-3               sparseMatrixStats_1.6.0      
     [69] zlibbioc_1.40.0               RCurl_1.98-1.6               
     [71] rpart_4.1.16                  deldir_1.0-6                 
     [73] pbapply_1.5-0                 cowplot_1.1.1                
     [75] zoo_1.8-9                     haven_2.4.3                  
     [77] ggrepel_0.9.1                 cluster_2.1.2                
     [79] fs_1.5.2                      magrittr_2.0.2               
     [81] data.table_1.14.2             RSpectra_0.16-0              
     [83] scattermore_0.7               reprex_2.0.1                 
     [85] lmtest_0.9-39                 RANN_2.6.1                   
     [87] fitdistrplus_1.1-6            hms_1.1.1                    
     [89] patchwork_1.1.1               mime_0.12                    
     [91] evaluate_0.14                 xtable_1.8-4                 
     [93] readxl_1.3.1                  gridExtra_2.3                
     [95] compiler_4.1.2                KernSmooth_2.23-20           
     [97] crayon_1.4.2                  htmltools_0.5.2              
     [99] tzdb_0.2.0                    mgcv_1.8-38                  
    [101] later_1.3.0                   lubridate_1.8.0              
    [103] DBI_1.1.2                     ExperimentHub_2.2.1          
    [105] dbplyr_2.1.1                  MASS_7.3-55                  
    [107] rappdirs_0.3.3                Matrix_1.4-0                 
    [109] cli_3.1.1                     parallel_4.1.2               
    [111] igraph_1.2.11                 pkgconfig_2.0.3              
    [113] IRdisplay_1.1                 plotly_4.10.0                
    [115] spatstat.sparse_2.1-0         xml2_1.3.3                   
    [117] XVector_0.34.0                rvest_1.0.2                  
    [119] digest_0.6.29                 sctransform_0.3.3            
    [121] RcppAnnoy_0.0.19              SingleR_1.8.1                
    [123] spatstat.data_2.1-2           Biostrings_2.62.0            
    [125] cellranger_1.1.0              leiden_0.3.9                 
    [127] uwot_0.1.11                   DelayedMatrixStats_1.16.0    
    [129] curl_4.3.2                    shiny_1.7.1                  
    [131] lifecycle_1.0.1               nlme_3.1-155                 
    [133] jsonlite_1.7.3                BiocNeighbors_1.12.0         
    [135] viridisLite_0.4.0             fansi_1.0.2                  
    [137] pillar_1.7.0                  lattice_0.20-45              
    [139] KEGGREST_1.34.0               fastmap_1.1.0                
    [141] httr_1.4.2                    survival_3.2-13              
    [143] interactiveDisplayBase_1.32.0 glue_1.6.1                   
    [145] png_0.1-7                     BiocVersion_3.14.0           
    [147] bit_4.0.4                     stringi_1.7.6                
    [149] blob_1.2.2                    BiocSingular_1.10.0          
    [151] AnnotationHub_3.2.1           memoise_2.0.1                
    [153] IRkernel_1.3.0.9000           irlba_2.3.5                  
    [155] future.apply_1.8.1           

    Created: 02/09/2022 by vikkki