Inspired by one of the plots in this publication about urban/rural clovers, I was thinking if we can apply similar method to show some TCGA data, here is what I've tried:

First, download count and metadata from UCSC Xena&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443):

proj <- "TCGA-CHOL"
header <- "https://gdc.xenahubs.net/download/"
download.file(url = paste0(hearder ,proj, ".htseq_counts.tsv.gz"),destfile = paste0(proj,".htseq_counts.tsv.gz"))
download.file(url = paste0(hearder ,proj, ".GDC_phenotype.tsv.gz"),destfile = paste0(proj,".GDC_phenotype.tsv.gz"))
#download.file(url = paste0(hearder ,proj, ".survival.tsv"),destfile = paste0(proj,".survival.tsv"))
phenotype <- read.delim(paste0(proj,".GDC_phenotype.tsv.gz"),fill = T,header = T,sep = "\t")

Take a look at phenotype data:

phenotype[1:3,]
A data.frame: 3 × 122
submitter_id.samplesage_at_initial_pathologic_diagnosisalbumin_result_lower_limitalbumin_result_specified_valuealbumin_result_upper_limitbatch_numberbcrbcr_followup_barcodebcr_followup_uuidsubmitter_iddays_to_collection.samplesdays_to_sample_procurement.samplesinitial_weight.samplesis_ffpe.samplesoct_embedded.samplespreservation_method.samplessample_type.samplessample_type_id.samplesstate.samplestissue_type.samples
<chr><int><dbl><dbl><dbl><chr><chr><chr><chr><chr><dbl><lgl><dbl><chr><chr><lgl><chr><int><chr><chr>
1TCGA-ZH-A8Y2-01A59 NA NANA428.25.0Nationwide Children's HospitalTCGA-ZH-A8Y2897NA 400FalsefalseNAPrimary Tumor1releasedNot Reported
2TCGA-ZH-A8Y7-01A593.52.4 5448.13.0Nationwide Children's HospitalTCGA-ZH-A8Y7 77NA1100FalsefalseNAPrimary Tumor1releasedNot Reported
3TCGA-W7-A93O-01ANA NA NANA448.13.0Nationwide Children's HospitalTCGA-W7-A93O465NA 180Falsetrue NAPrimary Tumor1releasedNot Reported

Load count matrix and convert it back from log

data <- read.table(paste0(proj,".htseq_counts.tsv.gz"),check.names = F,row.names = 1,header = T)
data <- as.data.frame(2^dat - 1)
count <- apply(dat, 2, as.integer)
rownames(count) <- rownames(data)
count[1:4,1:4]
A matrix: 4 × 4 of type int
TCGA-ZD-A8I3-01ATCGA-W5-AA2U-11ATCGA-W5-AA30-01ATCGA-W5-AA38-01A
ENSG00000000003.135254247651328249
ENSG00000000005.5 1 1 0 1
ENSG00000000419.111211 65516431695
ENSG00000000457.12 752 3452652 519

Then Filter out genes of which less than half samples have expression:

n_sample <- ncol(count)
n_sample
count = count[apply(count, 1, function(x) sum(x > 0) > 0.5*n_sample), ]
45

In sample Id we can extract its group info by checking the last 3 chars, like in TCGA-ZH-A8Y2-01A, 01A is tumor:

library(stringr)
Group = ifelse(as.numeric(str_sub(colnames(count),14,15)) < 10,'tumor','normal')
Group = factor(Group,levels = c("normal","tumor"))
table(Group)
Group
normal  tumor 
     9     36 

Normally we can use DESeq2, edgeR, or limma to move on to differential expression analysis, here we would jump over it for now and do PCA first.

library(ggplot2)
library(tinyarray)
pca.plot = draw_pca(count,Group);pca.plot

Here it's clear that tumor and normal samples group in their own clusters, and we have CI as ovals surround each of them. Tumor cluster is larger and one of the cause could be heterogenesis; but before that we would noticed that tumor group has more samples than normal group. Next I'm going to look at those samples that can be paired.