Normalization steps performed on the BrainSpan RPKM gene-level RNA-seq data


I have been looking into how to properly normalize and use the gene-level RNA-seq data (RNA-Seq Gencode v10 summarized to genes) provided by BrainSpan. I have looked through the documentation and the forums, but I cannot find how these data were pooled into one CSV file. My question is: was a between-sample, across-study normalization step applied to the RPKM data in the gene_matrix_csv file? Or were the data from different study groups simply combined without further normalization? Put another way, has normalization been performed on the whole expression matrix provided for download, so that values are directly comparable across samples and age categories as is?

I am interested in studying the expression trajectories of genes across developmental stages (from 8 pcw to 40 yrs). Since this requires comparing samples from different studies, I would like to know whether further normalization steps are required and whether there is a guideline on how to do them.

This is what I found in the “Microarray data analysis” section of the Miller et al. (2014) paper:
“Data for samples passing QC were normalized in three steps: 1) “within-batch” normalization to the 75th percentile expression values; 2) “cross-batch” bias reduction using ComBat (ref. 57); and 3) “cross-brain” normalization as in step 1.”

But this refers only to the four brains included in that paper, and only to microarray data. What I want to know is whether a similar normalization was done once the RNA-seq data from all 42 donors were aggregated into the gene-level expression matrix.
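
To make concrete what I mean by step 1 of the quoted procedure, here is a minimal sketch of scaling each sample to its 75th percentile. This is my own illustration, not the BrainSpan or Miller et al. code: the function names, the choice to ignore zero values, the choice to rescale to the mean 75th percentile across samples, and the toy values are all assumptions.

```python
from statistics import quantiles

def upper_quartile(values):
    """75th percentile of the non-zero values in one sample.

    Dropping zeros is an assumption of this sketch, not a documented
    BrainSpan choice. Needs at least two non-zero values.
    """
    nonzero = [v for v in values if v > 0]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(nonzero, n=4, method="inclusive")[2]

def upper_quartile_normalize(matrix):
    """Scale each sample so its 75th percentile equals the mean 75th
    percentile across samples.

    `matrix` maps a sample name to a list of expression values, with all
    lists in the same gene order.
    """
    q3 = {sample: upper_quartile(vals) for sample, vals in matrix.items()}
    target = sum(q3.values()) / len(q3)
    return {
        sample: [x * target / q3[sample] for x in vals]
        for sample, vals in matrix.items()
    }

# Toy data (made up): sample_B is sample_A with every value doubled.
toy = {
    "sample_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sample_B": [2.0, 4.0, 6.0, 8.0, 10.0],
}
normalized = upper_quartile_normalize(toy)
```

In this toy case the two samples become identical after scaling, which is the kind of between-sample adjustment I am asking about. It does not cover the ComBat batch-correction step.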

I would really appreciate your help with this. Thank you very much.

Information about BrainSpan normalization is included (albeit somewhat buried) in the methods for the associated manuscript. Normalization scripts can be found on GitHub here.
