Normalization steps performed on the BrainSpan RPKM gene-level RNA-seq data

Hello,

I have been looking into how to properly normalize and use the gene-level RNA-seq data (RNA-Seq Gencode v10 summarized to genes) provided at https://www.brainspan.org/static/download.html. I have gone through the documentation and the forums, and I cannot find how these data were pooled into one csv file. My question is: was a between-sample and across-study normalization step applied to the RPKM data provided in the gene_matrix_csv file? Or were the data from the different study groups simply put together without further normalization? To put it another way, I am wondering whether the whole expression matrix provided for download has been normalized, so that the values are readily comparable across samples and age categories as is.

I am interested in studying the expression trajectory of genes across developmental stages (from 8 pcw to 40 yrs) and given that this would require comparison between different samples from different studies, I was wondering if other normalization steps are required and if there is a guideline on how to do this.

This is what I found in the “microarray data analysis” section of the Miller et al., 2014 paper:
“Data for samples passing QC were normalized in three steps: 1) “within-batch” normalization to the 75th percentile expression values; 2) “cross-batch” bias reduction using ComBat [57]; and 3) “cross-brain” normalization as in step 1.”

But this refers only to the four brains included in that paper, and only to microarray data. So basically I want to know whether a similar normalization was applied once the RNA-seq data from all 42 donors were aggregated into the gene-level expression matrix.
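To make step 1 of the quoted procedure concrete, here is a minimal sketch of normalizing samples to a shared 75th percentile. It is not the code used by Miller et al.: the function name, the choice to rescale multiplicatively (rather than shift log-scale values), the default target (the mean of the per-sample 75th percentiles), and the toy values are all assumptions made for illustration.

```python
from statistics import quantiles


def upper_quartile_normalize(samples, target=None):
    """Rescale each sample so that its 75th percentile equals a common target.

    samples: dict mapping a sample ID to a list of expression values
             (same gene order in every sample).
    target:  shared 75th-percentile value; defaults to the mean of the
             per-sample 75th percentiles (an assumption of this sketch).
    """
    # The third of the three quartile cut points is the 75th percentile.
    q75 = {sid: quantiles(vals, n=4, method="inclusive")[2]
           for sid, vals in samples.items()}
    if any(q <= 0 for q in q75.values()):
        raise ValueError("a sample has a non-positive 75th percentile")
    if target is None:
        target = sum(q75.values()) / len(q75)
    return {sid: [v * target / q75[sid] for v in vals]
            for sid, vals in samples.items()}


# Toy batch: s2 is s1 measured at twice the overall intensity.
batch = {"s1": [1, 2, 3, 4, 5], "s2": [2, 4, 6, 8, 10]}
normalized = upper_quartile_normalize(batch)
```

After this step the two toy samples share the same 75th percentile, which removes the global intensity difference between them but does nothing about gene-specific batch effects (that is what the ComBat step in the quote is for).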

I would really appreciate your help with this. Thank you very much.

Information about BrainSpan normalization is included (albeit somewhat buried) in the methods for the associated manuscript. Normalization scripts can be found on GitHub here.

I have the same question as you. Did you find an answer as to whether batch effects have been removed from the developmental-stage RPKM data provided in the gene_matrix_csv file?

Thanks for reopening this thread, @suzoo18. I’m realizing that I only gave a partial answer to @muhikpe’s question. The BrainSpan atlas has two separate bulk transcriptomics experiments:

  1. Developmental Transcriptome: This broad developmental survey of gene expression in specific brain regions includes 42 donors and used both RNA sequencing and exon microarray techniques. This is the study I was referring to above. These data were collected in the lab of Dr. Nenad Sestan at Yale as part of a collaborative project. I think the data are appropriately normalized with batch effects removed, but I’d encourage you to review the manuscript and/or normalization scripts, or to reach out to the manuscript’s corresponding authors for clarification.
  2. Prenatal LMD Microarray: This is a study with a broader anatomical focus (~300 distinct anatomical samples) but a narrower temporal focus (four midgestational prenatal specimens, 15-21 pcw), generated using a combination of laser microdissection-based sample isolation and DNA microarrays. This study is described in the Miller et al., 2014 paper, where the data were normalized as mentioned by @muhikpe. If there are any questions about this study, I can address them further.

These two studies are entirely independent, using different experimental and computational pipelines, including their normalization steps.

Best,
Jeremy

Hi @suzoo18, very sorry for the late response. If you look at the GitHub script that @jeremyinseattle provided, you can see that the developmental transcriptome data were quantile normalized (using conditional quantile normalization) and also batch-effect corrected using ComBat. Unfortunately, there is no further documentation for the GitHub script, but it seems that this final “normalized RPKM” data is what’s available here: https://www.brainspan.org/static/download.html.

I ended up using the normalized RPKM values to plot the trajectories without further normalization. However, I am still not sure this is appropriate, since RPKM values cannot properly be compared between samples: as far as I know, RPKM is not an absolute quantity but a relative abundance within each sample.
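One common way to address the comparability concern is to convert RPKM to TPM, which rescales each sample so that its values sum to one million. The sketch below shows that conversion on made-up numbers. It is not part of the BrainSpan pipeline as far as this thread establishes, and whether it is meaningful to apply it to values that have already been quantile normalized and ComBat-adjusted is an open question; the function name and the toy samples are assumptions for illustration.

```python
def rpkm_to_tpm(rpkm):
    """Convert one sample's RPKM values to TPM.

    TPM_i = RPKM_i / sum(RPKM) * 1e6, so every sample sums to one million.
    rpkm: list of RPKM values for all genes in a single sample.
    """
    total = sum(rpkm)
    if total <= 0:
        raise ValueError("sample has no positive expression")
    return [v / total * 1e6 for v in rpkm]


# Two hypothetical samples with identical relative abundances but
# RPKM totals that differ tenfold.
sample_a = [10.0, 30.0, 60.0]
sample_b = [1.0, 3.0, 6.0]
tpm_a = rpkm_to_tpm(sample_a)
tpm_b = rpkm_to_tpm(sample_b)
```

In this toy case the two samples end up with the same TPM values, because the only difference between them was the per-sample RPKM total.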

I have seen papers log2-transform the expression values and plot the trajectories from the transformed values, so that could be one option.
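As a rough sketch of that option, the snippet below log2-transforms one gene's values and summarizes them per age. The pseudocount of 1 (to keep zeros finite), the use of the median as the per-age summary, the function name, and the toy values and age labels are all assumptions made for illustration, not something prescribed by BrainSpan or by the papers mentioned.

```python
import math
from collections import defaultdict
from statistics import median


def log2_trajectory(values, ages, pseudocount=1.0):
    """Median log2(value + pseudocount) of one gene at each age.

    values: expression values for a single gene, one per sample.
    ages:   age label of each sample, in the same order as `values`.
    Returns a dict mapping age label to the median transformed value,
    in order of first appearance of each age.
    """
    if len(values) != len(ages):
        raise ValueError("values and ages must have the same length")
    by_age = defaultdict(list)
    for value, age in zip(values, ages):
        by_age[age].append(math.log2(value + pseudocount))
    return {age: median(vals) for age, vals in by_age.items()}


# Hypothetical RPKM values for one gene in four samples.
rpkm = [0.0, 3.0, 7.0, 15.0]
ages = ["8 pcw", "8 pcw", "40 yrs", "40 yrs"]
trajectory = log2_trajectory(rpkm, ages)
```

The log transform compresses the dynamic range so that a few very highly expressed samples do not dominate the plot; it does not by itself correct for batch or between-sample effects.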

Thank you @jeremyinseattle for providing more info on these datasets.
I would appreciate any other insights on how to properly use this data for tracking trajectories across ages.

Best,
Moohebat