TCGA samples comparison on PCA plots
Download GDC TCGA Bile Duct Cancer data, then run and plot PCA with draw_pca()
Inspired by one of the plots in this publication about urban/rural clovers, I wondered whether a similar method could be applied to TCGA data. Here is what I tried:
First, download the count matrix and phenotype metadata from UCSC Xena:
proj <- "TCGA-CHOL"
header <- "https://gdc.xenahubs.net/download/"
download.file(url = paste0(header, proj, ".htseq_counts.tsv.gz"), destfile = paste0(proj, ".htseq_counts.tsv.gz"))
download.file(url = paste0(header, proj, ".GDC_phenotype.tsv.gz"), destfile = paste0(proj, ".GDC_phenotype.tsv.gz"))
#download.file(url = paste0(header, proj, ".survival.tsv"), destfile = paste0(proj, ".survival.tsv"))
phenotype <- read.delim(paste0(proj,".GDC_phenotype.tsv.gz"),fill = T,header = T,sep = "\t")
Take a look at phenotype data:
phenotype[1:3,]
Load the count matrix and convert it back from the log2(count + 1) scale:
data <- read.table(paste0(proj,".htseq_counts.tsv.gz"),check.names = F,row.names = 1,header = T)
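# undo the log2(count + 1) transform
data <- as.data.frame(2^data - 1)
# round before converting, so that values such as 6.999999 do not get truncated to 6
count <- apply(data, 2, function(x) as.integer(round(x)))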
rownames(count) <- rownames(data)
count[1:4,1:4]
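As a quick sanity check on the back-transform, here is a minimal sketch on a toy matrix. The values are made up for illustration (not taken from TCGA-CHOL), and it assumes the file stores log2(count + 1), which is what the 2^x - 1 step above implies.

```r
# Toy raw counts (hypothetical values, not from TCGA-CHOL)
raw <- matrix(c(0, 1, 7, 1023, 5, 250), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Forward transform, assumed to match the downloaded file: log2(count + 1)
logged <- log2(raw + 1)

# Back-transform; round() guards against floating-point values just below an integer
restored <- round(2^logged - 1)
storage.mode(restored) <- "integer"   # keeps dimnames, unlike as.integer()

# Compare the restored counts with the originals
all(restored == raw)
```

Setting storage.mode() instead of calling as.integer() on the whole matrix keeps the row and column names, so the rownames do not have to be reassigned afterwards.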
Then keep only the genes that are expressed (count > 0) in more than half of the samples:
n_sample <- ncol(count)
n_sample
count = count[apply(count, 1, function(x) sum(x > 0) > 0.5*n_sample), ]
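To make the threshold concrete, here is a small sketch of the same filter on a made-up 3 x 4 matrix (hypothetical values). It uses rowSums() rather than apply(), which should give the same result here.

```r
# Toy count matrix: 3 genes x 4 samples (hypothetical values)
toy <- matrix(c(5, 0, 3, 2,
                0, 0, 1, 0,
                4, 9, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:4)))

# A gene is kept only if it has a non-zero count in MORE than half of the samples
keep <- rowSums(toy > 0) > 0.5 * ncol(toy)

# drop = FALSE keeps the result a matrix even if a single gene survives
toy[keep, , drop = FALSE]
```

Note that g3 is expressed in exactly half of the samples and is still dropped, because the comparison is strict.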
The sample ID encodes the group in its sample-type code (characters 14-15): codes 01-09 are tumor and 10-19 are normal. For example, in TCGA-ZH-A8Y2-01A the code 01 marks a tumor sample:
library(stringr)
Group = ifelse(as.numeric(str_sub(colnames(count),14,15)) < 10,'tumor','normal')
Group = factor(Group,levels = c("normal","tumor"))
table(Group)
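The same parsing can be sketched in base R on a few barcodes. Only the first barcode comes from the text above; the other two are made up for illustration.

```r
# First barcode is from the text; the other two are hypothetical
barcodes <- c("TCGA-ZH-A8Y2-01A", "TCGA-XX-0001-11A", "TCGA-YY-0002-01B")

# Characters 14-15 hold the sample-type code
type_code <- as.integer(substr(barcodes, 14, 15))

# Codes below 10 are treated as tumor, everything else as normal
group <- factor(ifelse(type_code < 10, "tumor", "normal"),
                levels = c("normal", "tumor"))

data.frame(barcode = barcodes, type_code = type_code, group = group)
```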
Normally we would move on to differential expression analysis with DESeq2, edgeR, or limma. Here we skip that for now and do PCA first.
library(ggplot2)
library(tinyarray)
pca.plot = draw_pca(count,Group);pca.plot
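For readers who want to see roughly what such a plot involves, here is a minimal sketch using base R's prcomp() on simulated data. This is not tinyarray's implementation; the simulated counts, the log2 transform, and the colours are all assumptions made for illustration, and no confidence ellipses are drawn.

```r
set.seed(1)

# Simulated counts: 200 genes x 12 samples (hypothetical data, not TCGA-CHOL)
sim <- matrix(rpois(200 * 12, lambda = 50), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))
grp <- factor(rep(c("normal", "tumor"), each = 6), levels = c("normal", "tumor"))

# Raise the first 50 genes in the tumor samples so the two groups differ
sim[1:50, grp == "tumor"] <- sim[1:50, grp == "tumor"] + 100L

# prcomp() expects samples in rows, so transpose; log-transform to tame the variance
pca <- prcomp(t(log2(sim + 1)), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot of the first two components, coloured by group
cols <- c(normal = "steelblue", tumor = "firebrick")
plot(pca$x[, 1], pca$x[, 2], col = cols[as.character(grp)], pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(grp), col = cols[levels(grp)], pch = 19)
```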
Here the tumor and normal samples clearly form separate clusters, each surrounded by a confidence ellipse. The tumor cluster is larger. One cause could be tumor heterogeneity, but note first that the tumor group simply has more samples than the normal group. Next I'm going to look at the samples that can be paired.
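One possible way to find paired samples, sketched on made-up barcodes: treat the first 12 characters of the barcode as the patient ID and keep the patients that have both a tumor and a normal sample. The barcodes below are hypothetical and this is only an outline of the idea.

```r
# Hypothetical barcodes: patients AA and CC have both sample types, BB has tumor only
barcodes <- c("TCGA-AA-0001-01A", "TCGA-AA-0001-11A",
              "TCGA-BB-0002-01A",
              "TCGA-CC-0003-01A", "TCGA-CC-0003-11A")

# Characters 1-12 identify the patient, characters 14-15 the sample type
patient <- substr(barcodes, 1, 12)
type <- ifelse(as.integer(substr(barcodes, 14, 15)) < 10, "tumor", "normal")

# Cross-tabulate patients against sample type
tab <- table(patient, type)

# Patients with at least one tumor and at least one normal sample
paired <- rownames(tab)[tab[, "tumor"] > 0 & tab[, "normal"] > 0]

# Barcodes belonging to paired patients
barcodes[patient %in% paired]
```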