Automated_Cell_Annotation
Using Tabula Sapiens as a reference for annotating new datasets
This notebook allows you to annotate your data with a number of annotation methods using the Tabula Sapiens dataset as the reference.
Initial setup:
- Make sure GPU is enabled (Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU)
- We also highly recommend Colab Pro for access to a high-RAM session.
Integration Methods Provided:
- scVI
- bbKNN
- scanorama
Annotation Methods:
- KNN on integrated spaces
- scANVI
- onClass
- SVM
- RandomForest
To use the notebook, simply connect to your Google Drive account, set the necessary arguments, select your methods, and run all the code blocks!
*User action is only required in Step 2 and Step 3.
%%capture
#@title Setup Colab
#@markdown Here we install the necessary packages
#@markdown This will take a few minutes (~5 min)
import sys
import os
!pip install --quiet obonet
!pip install --quiet --upgrade jsonschema
!pip install --quiet bbknn
!pip install --quiet git+https://github.com/wangshenguiuc/OnClass@21232f293a549a7ee0da8ebe3cbb22df3e885d4c
!pip install --quiet git+https://github.com/yoseflab/scvi-tools@master#egg=scvi-tools[tutorials]
!pip install --quiet imgkit
!pip install --quiet gdown
!pip install --quiet --upgrade scanorama
# Download annotation code
!wget -O annotation.py -q https://www.dropbox.com/s/id8sallwrunjc5c/annotation.py?dl=1
import anndata
import numpy as np
import scanpy as sc
import scvi
Step 2: Load your data (User Action Required)
Here we provide three options to load your data:
- Connect to Google Drive (highly recommended)
- Download your data from the cloud and save it into this session or on Google Drive.
- Upload your data manually into this session (files are not persistent and will be deleted when session is closed)
As an example, we use a subsampled version of the Lung Cell Atlas [1] for our query data.
[1] Travaglini, K. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
# This is the recommended method, especially for large datasets
from google.colab import drive
drive.mount('/content/drive')
query_adata = anndata.read('/path/to/your/anndata')
# Google Colab supports wget, curl, and gdown commands
# It is recommended to download the data into Google Drive and read from there.
# This way your data will be persistent.
!wget <YOUR URL>
query_adata = anndata.read('/path/to/your/anndata')
# Click the folder icon on the left navigation bar, and select the upload icon
# Note: Manually uploaded data is automatically deleted when the colab session ends
# This is not recommended if your dataset is very large
query_adata = anndata.read('/path/to/your/anndata')
!wget -O LCA.h5ad https://www.dropbox.com/s/mrf8y7emfupo4he/LCA.h5ad?dl=1
query_adata = anndata.read('LCA.h5ad')
query_adata
from annotation import _check_nonnegative_integers
# The annotation pipeline expects raw (non-negative integer) counts in query_adata.X
assert _check_nonnegative_integers(query_adata.X), 'Make sure query_adata.X contains raw counts'
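The assertion above relies on `_check_nonnegative_integers` from the downloaded annotation.py, whose implementation is not shown in this notebook. As a rough illustration of what a raw-count check might involve, the sketch below uses a hypothetical helper, `looks_like_raw_counts`, on a dense NumPy array. It is an assumption about the idea, not the helper's actual code.

```python
import numpy as np

def looks_like_raw_counts(x):
    """Hypothetical check: True if every entry is a non-negative whole number."""
    x = np.asarray(x)
    if x.size == 0:
        return True
    if np.any(x < 0):
        return False
    # Whole numbers are unchanged by rounding, even when stored as floats.
    return bool(np.all(x == np.round(x)))

raw = np.array([[0, 3, 1], [2, 0, 5]], dtype=np.float32)
normalized = np.log1p(raw)  # log-normalized values are no longer integers

print(looks_like_raw_counts(raw))         # True
print(looks_like_raw_counts(normalized))  # False
```

For a sparse matrix, the same test would typically be applied to the stored non-zero values rather than to a dense copy.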
Step 3: Setting Up Annotation Parameters (User Action Required)
Here is where you set the parameters for the automated annotation.
Arguments:
- tissue: Tabula Sapiens tissue to annotate your data with. Available tissues: ["Bladder", "Blood", "Bone_Marrow", "Kidney", "Large_Intestine", "Lung","Lymph_Node", "Pancreas", "Small_Intestine", "Spleen", "Thymus","Trachea", "Vasculature"]
- save_folder: location to save results to. By default, results are saved to a folder named annotation_results. It is highly recommended that you provide a Google Drive folder here.
- query_batch_key: key in query_adata.obs for batch correction. Set to None for no batch correction.
- methods: the methods to run. By default, all methods are run.
- training_mode: either online or offline. If offline, scVI and scANVI models are trained from scratch. If online, pretrained models are used.
Lesser used parameters
- query_labels_key: scANVI has the option to use labeled cells in the query dataset during training. To use some prelabeled cells from the query dataset, set query_labels_key to the corresponding key in query_adata.obs.
- unknown_celltype_label: if query_labels_key is not None, every cell whose label is not unknown_celltype_label is treated as a labeled cell.
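To make the interaction between query_labels_key and unknown_celltype_label concrete, here is a minimal sketch with made-up labels. The helper name `split_labeled` is hypothetical and only illustrates the rule stated above; how annotation.py and scANVI apply it internally is not shown in this notebook.

```python
def split_labeled(labels, unknown_celltype_label="unknown"):
    """Hypothetical helper: indices of cells treated as labeled vs. unlabeled."""
    labeled = [i for i, lab in enumerate(labels) if lab != unknown_celltype_label]
    unlabeled = [i for i, lab in enumerate(labels) if lab == unknown_celltype_label]
    return labeled, unlabeled

# Example values of query_adata.obs[query_labels_key] (made up for illustration)
query_labels = ["T cell", "unknown", "B cell", "unknown"]
labeled, unlabeled = split_labeled(query_labels)

print(labeled)    # [0, 2]
print(unlabeled)  # [1, 3]
```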
"""
tissue options:
["Bladder", "Blood", "Bone_Marrow", "Kidney", "Large_Intestine", "Lung",
"Lymph_Node", "Pancreas", "Small_Intestine", "Spleen", "Thymus",
"Trachea", "Vasculature"]
"""
tissue = 'Lung'
save_folder = './'
query_batch_key = 'method'
methods = ['bbknn','scvi', 'scanvi', 'svm', 'rf', 'onclass', 'scanorama']
training_mode='online'
# Lesser used parameters
query_labels_key=None
unknown_celltype_label='unknown'
if tissue == 'Bladder':
    refdata_url = 'https://ndownloader.figshare.com/files/27388874'
    pretrained_url = 'https://www.dropbox.com/s/rb89y577l6vs2mm/Bladder.tar.gz?dl=1'
elif tissue == 'Blood':
    refdata_url = 'https://ndownloader.figshare.com/files/27388853'
    pretrained_url = 'https://www.dropbox.com/s/kyh9nv202n0db65/Blood.tar.gz?dl=1'
elif tissue == 'Bone_Marrow':
    refdata_url = 'https://ndownloader.figshare.com/files/27388841'
    pretrained_url = 'https://www.dropbox.com/s/a3r4ddg7o7kua7z/Bone_Marrow.tar.gz?dl=1'
elif tissue == 'Kidney':
    refdata_url = 'https://ndownloader.figshare.com/files/27388838'
    pretrained_url = 'https://www.dropbox.com/s/k41r1a346z0tuip/Kidney.tar.gz?dl=1'
elif tissue == 'Large_Intestine':
    refdata_url = 'https://ndownloader.figshare.com/files/27388835'
    pretrained_url = 'https://www.dropbox.com/s/jwvpk727hd54byd/Large_Intestine.tar.gz?dl=1'
elif tissue == 'Lung':
    refdata_url = 'https://ndownloader.figshare.com/files/27388832'
    pretrained_url = 'https://www.dropbox.com/s/e4al4ia9hm9qtcg/Lung.tar.gz?dl=1'
elif tissue == 'Lymph_Node':
    refdata_url = 'https://ndownloader.figshare.com/files/27388715'
    pretrained_url = 'https://www.dropbox.com/s/mbejy9tcbx9e1yv/Lymph_Node.tar.gz?dl=1'
elif tissue == 'Pancreas':
    refdata_url = 'https://ndownloader.figshare.com/files/27388613'
    pretrained_url = 'https://www.dropbox.com/s/r3klvr22m6kq143/Pancreas.tar.gz?dl=1'
elif tissue == 'Small_Intestine':
    refdata_url = 'https://ndownloader.figshare.com/files/27388559'
    pretrained_url = 'https://www.dropbox.com/s/7eiv2mke70jinzc/Small_Intestine.tar.gz?dl=1'
elif tissue == 'Spleen':
    refdata_url = 'https://ndownloader.figshare.com/files/27388544'
    pretrained_url = 'https://www.dropbox.com/s/6j3iwahsjnb8rb3/Spleen.tar.gz?dl=1'
elif tissue == 'Thymus':
    refdata_url = 'https://ndownloader.figshare.com/files/27388505'
    pretrained_url = 'https://www.dropbox.com/s/9k0mneu2wvpiudz/Thymus.tar.gz?dl=1'
elif tissue == 'Trachea':
    refdata_url = 'https://ndownloader.figshare.com/files/27388460'
    pretrained_url = 'https://www.dropbox.com/s/57tthfgkl8jtxk6/Trachea.tar.gz?dl=1'
elif tissue == 'Vasculature':
    refdata_url = 'https://ndownloader.figshare.com/files/27388451'
    pretrained_url = 'https://www.dropbox.com/s/1wt3r871kxjas5o/Vasculature.tar.gz?dl=1'
# Download reference dataset
output_fn = 'TS_{}.h5ad'.format(tissue)
!wget -O $output_fn $refdata_url
# Download pretrained scVI and scANVI models.
output_fn = '{}.tar.gz'.format(tissue)
!wget -O $output_fn $pretrained_url
!tar -xvzf $output_fn
# Download onclass files
!wget -O cl.obo -q https://www.dropbox.com/s/hodp0etapzrd8ak/cl.obo?dl=1
!wget -O cl.ontology -q https://www.dropbox.com/s/nes0zprzfbwbgj5/cl.ontology?dl=1
!wget -O cl.ontology.nlp.emb https://www.dropbox.com/s/y9x9yt2pi7s0d1n/cl.ontology.nlp.emb?dl=1
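The tissue-to-URL if/elif chain earlier in this step could also be written as a dictionary lookup, which keeps the mapping in one place and raises a clear error for an unsupported tissue name instead of leaving the URL variables undefined. This is a sketch only: `TISSUE_URLS` and `get_tissue_urls` are hypothetical names, and just two of the thirteen tissues are reproduced here for brevity, with URLs copied from the chain.

```python
# Hypothetical lookup table: tissue -> (refdata_url, pretrained_url).
# Only two tissues are reproduced here for brevity.
TISSUE_URLS = {
    "Bladder": (
        "https://ndownloader.figshare.com/files/27388874",
        "https://www.dropbox.com/s/rb89y577l6vs2mm/Bladder.tar.gz?dl=1",
    ),
    "Lung": (
        "https://ndownloader.figshare.com/files/27388832",
        "https://www.dropbox.com/s/e4al4ia9hm9qtcg/Lung.tar.gz?dl=1",
    ),
}

def get_tissue_urls(tissue):
    """Return (refdata_url, pretrained_url), or raise ValueError for an unknown tissue."""
    try:
        return TISSUE_URLS[tissue]
    except KeyError:
        raise ValueError(
            f"Unknown tissue {tissue!r}; expected one of {sorted(TISSUE_URLS)}"
        ) from None

refdata_url, pretrained_url = get_tissue_urls("Lung")
print(refdata_url)
```

Tissue names are case-sensitive in this sketch, as they are in the original chain, so "lung" would raise a ValueError.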
ref_adata_path = 'TS_{}.h5ad'.format(tissue)
ref_adata = anndata.read(ref_adata_path)
# This way we only train on expert annotated data
ref_adata = ref_adata[ref_adata.obs["Manually Annotated"] == "True"].copy()
# We wish to correct for batch effects from donor and method
# So we make a new batch key that will be passed to the methods
ref_adata.obs['donor_method'] = ref_adata.obs['Donor'].astype(str) + ref_adata.obs['Method'].astype(str)
# The annotation pipeline expects raw counts in the X field
ref_adata.X = ref_adata.layers['raw_counts']
# Following parameters are specific to Tabula Sapiens dataset
ref_labels_key='Annotation'
ref_batch_key = 'donor_method'
from annotation import get_pretrained_model_genes, check_genes_is_subset
pretrained_scanvi_path = os.path.join(tissue, tissue + "_scanvi_model")
pretrained_scvi_path = os.path.join(tissue, tissue + "_scvi_model")
training_mode='online'
is_subset = False
if training_mode == 'online':
    # Online mode needs every gene used by the pretrained model
    pretrained_genes = get_pretrained_model_genes(pretrained_scvi_path)
    query_genes = query_adata.var_names.to_numpy().astype("str")
    is_subset = check_genes_is_subset(pretrained_genes, query_genes)

if is_subset and training_mode == 'online':
    ref_adata = ref_adata[:, pretrained_genes]
else:
    # Fall back to training from scratch
    training_mode = 'offline'
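The code above falls back to offline training when the gene check fails. The helpers `get_pretrained_model_genes` and `check_genes_is_subset` come from annotation.py and are not shown here, so the sketch below only illustrates the fallback logic, assuming the check tests that every pretrained-model gene is present in the query. The function `choose_training_mode` and the gene lists are hypothetical.

```python
def choose_training_mode(requested_mode, pretrained_genes, query_genes):
    """Hypothetical helper: keep 'online' only if the query covers all pretrained genes."""
    if requested_mode != "online":
        return "offline"
    missing = set(pretrained_genes) - set(query_genes)
    return "online" if not missing else "offline"

# Made-up gene lists for illustration
pretrained = ["SFTPC", "AGER", "PTPRC"]

print(choose_training_mode("online", pretrained, ["SFTPC", "AGER", "PTPRC", "CD3E"]))  # online
print(choose_training_mode("online", pretrained, ["SFTPC", "CD3E"]))                   # offline
```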
from annotation import process_query
adata = process_query(query_adata,
                      ref_adata,
                      tissue=tissue,
                      save_folder=save_folder,
                      query_batch_key=query_batch_key,
                      query_labels_key=query_labels_key,
                      unknown_celltype_label=unknown_celltype_label,
                      pretrained_scvi_path=pretrained_scvi_path,
                      ref_labels_key=ref_labels_key,
                      ref_batch_key=ref_batch_key,
                      training_mode=training_mode,
                      ref_adata_path=ref_adata_path)
adata
Step 5: Run Automated Cell Annotation Methods
No user action required. Takes about 1 hour for a dataset of 100k cells.
Your results will be saved to the folder you provided as save_folder.
There will be the following files:
- annotated_query.h5ad, containing annotated query cells. The consensus annotations will be in consensus_prediction. There will also be a consensus_percentage field, which is the percentage of methods that had the same prediction.
- annotated_query_plus_ref.h5ad, containing your query and the reference cells with predicted annotations.
- confusion_matrices.pdf, which contains the confusion matrices between the consensus prediction and each individual method.
- csv files containing the metrics for each confusion matrix.
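The consensus fields described above can be understood as a majority vote across methods. Below is a minimal sketch, assuming a simple per-cell vote. The function name `consensus_vote` and the example labels are illustrative, and the actual tie-breaking and scaling used in annotation.py may differ.

```python
from collections import Counter

def consensus_vote(predictions):
    """Return (majority label, fraction of methods agreeing with it) for one cell."""
    counts = Counter(predictions)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(predictions)

# Predictions for a single cell from four methods (made up for illustration)
cell_predictions = [
    "type II pneumocyte",
    "type II pneumocyte",
    "type II pneumocyte",
    "basal cell",
]
label, agreement = consensus_vote(cell_predictions)

print(label, agreement)  # type II pneumocyte 0.75
```

The sketch returns a fraction; multiplying by 100 would give a percentage.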
from annotation import annotate_data
annotate_data(adata,
              methods,
              save_folder,
              pretrained_scvi_path=pretrained_scvi_path,
              pretrained_scanvi_path=pretrained_scanvi_path)
import pandas as pd
import matplotlib.pyplot as plt
results_file = os.path.join(save_folder,'annotated_query_plus_ref.h5ad')
results = anndata.read(results_file)
from annotation import make_agreement_plots
all_prediction_keys = [
"knn_on_bbknn_pred",
"knn_on_scvi_online_pred",
"knn_on_scvi_offline_pred",
"scanvi_online_pred",
"scanvi_offline_pred",
"svm_pred",
"rf_pred",
"onclass_pred",
"knn_on_scanorama_pred",
]
obs_keys = adata.obs.keys()
pred_keys = [key for key in obs_keys if key in all_prediction_keys]
make_agreement_plots(results, methods=pred_keys, save_folder=save_folder)
is_query = results.obs._dataset == "query"
methods = [x for x in results.obs.columns if x.endswith("_pred")]
labels = results.obs.consensus_prediction.astype(str)
labels[~is_query] = results[~is_query].obs._labels_annotation.astype(str)
celltypes = np.unique(labels)
latent_methods = results.obsm.keys()
agreement_counts = pd.DataFrame(
np.unique(results[is_query].obs["consensus_percentage"], return_counts=True)
).T
agreement_counts.columns = ["Percent Agreement", "Count"]
agreement_counts.plot.bar(
x="Percent Agreement", y="Count", legend=False, figsize=(4, 3)
)
plt.ylabel("Frequency")
plt.xlabel("Percent of Algorithms Agreeing with Majority Vote")
figpath = os.path.join(save_folder, "Consensus_Percentage_barplot.pdf")
plt.savefig(figpath, bbox_inches="tight")
mean_agreement = [
np.mean(results[is_query & (labels == x)].obs["consensus_percentage"].astype(float))
for x in celltypes
]
mean_agreement = pd.DataFrame([mean_agreement], index=["agreement"]).T
mean_agreement.index = celltypes
mean_agreement = mean_agreement.sort_values("agreement", ascending=True)
mean_agreement.plot.bar(y="agreement", figsize=(15, 2), legend=False)
plt.ylabel("Mean Agreement")
plt.xticks(rotation=290, ha="left")
figpath = os.path.join(save_folder, "percelltype_agreement_barplot.pdf")
plt.savefig(figpath, bbox_inches="tight")
prop = pd.DataFrame(index=celltypes, columns=["ref", "query"])
for x in celltypes:
    # Count cells of each type in the query and in the reference
    prop.loc[x, "query"] = np.sum(labels[is_query] == x)
    prop.loc[x, "ref"] = np.sum(labels[~is_query] == x)
prop.loc[mean_agreement.index].plot(kind='bar', figsize=(len(celltypes)*0.5,4),logy=True)
plt.legend(bbox_to_anchor=(1, 0.9))
plt.ylabel('log Celltype Abundance')
plt.tight_layout()
figpath = os.path.join(save_folder, 'celltype_prop_barplot.pdf')
plt.savefig(figpath, bbox_inches="tight")
plt.show()
plt.close()