How gene-level RPKM is calculated (BrainSpan)

Hi Allen Brain Map community,

I am trying to understand how exon- and gene-level RPKM were calculated for the BrainSpan project. I am after some clarification, so please correct me if any of the following is wrong.

For exon-level expression, a program counts the number of reads mapping to an exon, and the RPKM is then calculated for that exon. With this approach, a read that crosses an exon-exon boundary (e.g. between exons 1 and 2 of gene A) is counted twice: once in exon 1 and once in exon 2.

For gene-level expression, a program counts the number of reads mapping to gene A, and the RPKM is then calculated as before. With this approach, each read is counted only once.
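
To make the two counting schemes concrete, here is a minimal Python sketch. The read counts, exon lengths, and library size are made-up numbers for a hypothetical gene A, the gene length is assumed to be the sum of its exon lengths, and `rpkm` is just the standard formula (reads × 10^9 / (feature length in bp × total mapped reads)), not necessarily what the BrainSpan pipeline does.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene A with two exons (all numbers are illustrative only).
exon_lengths = {"exon1": 200, "exon2": 300}
total_mapped_reads = 10_000_000

reads_exon1_only = 40   # reads lying entirely within exon 1
reads_exon2_only = 60   # reads lying entirely within exon 2
reads_junction = 20     # reads spanning the exon 1 / exon 2 boundary

# Exon level: a junction read is counted once in EACH exon it touches.
exon_counts = {
    "exon1": reads_exon1_only + reads_junction,
    "exon2": reads_exon2_only + reads_junction,
}

# Gene level: every read is counted exactly once.
gene_count = reads_exon1_only + reads_exon2_only + reads_junction
gene_length = sum(exon_lengths.values())  # assumption: gene length = summed exon length

exon_rpkm = {
    name: rpkm(exon_counts[name], exon_lengths[name], total_mapped_reads)
    for name in exon_lengths
}
gene_rpkm = rpkm(gene_count, gene_length, total_mapped_reads)

print(exon_rpkm)  # exon1 is 30, exon2 is about 26.7
print(gene_rpkm)  # 24
```

With these numbers both exons come out higher than the gene, because the 20 junction reads contribute to two exon counts but only once to the gene count.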

If the normalisation is done as described above for both gene- and exon-level expression, exons will be biased towards higher RPKM than the genes they come from (because a single read can map to multiple exons). This is especially a concern for genes that contain many short exons. Yet when I look at the data, the bias is not there (e.g. exon 1 RPKM is approximately equal to gene A RPKM). How is this possible? To calculate the gene-level RPKM, did you simply add the counts from the list of exons in that gene?

Thank you for your time

Kind regards,

Hi @JacquelineHeighway. Sorry for the delayed response. I would encourage you to visit the Documentation link on the main BrainSpan website for “Developmental Transcriptome,” as all the information you requested should be included there, although it sounds like you may already have done so and that this is where your question comes from. If you still have questions after reviewing the documentation, I would suggest contacting the Sestan lab at Yale, as they (along with several other collaborators listed on the main BrainSpan website) collaborated on this project and generated these data files. What I can say is that in most normalization methods, reads are scaled when they cross exon boundaries, such that if half the read is in each exon, each exon gets half a read for the RPKM calculation rather than a full read. Such scaling avoids the issue you bring up.
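
As a rough illustration of that kind of scaling, here is a minimal Python sketch in which each read is split across exons in proportion to the bases it overlaps. This is only a sketch of the general idea, not the documented BrainSpan method: the function name `fractional_exon_counts`, the coordinates, and the reads are all hypothetical, and positions are given as half-open intervals in spliced transcript coordinates for simplicity.

```python
def fractional_exon_counts(reads, exons):
    """Assign each read to exons in proportion to the bases it overlaps.

    reads: list of (start, end) half-open intervals
    exons: dict mapping exon name -> (start, end) half-open interval
    """
    counts = {name: 0.0 for name in exons}
    for read_start, read_end in reads:
        read_length = read_end - read_start
        for name, (exon_start, exon_end) in exons.items():
            # Number of bases shared by the read and this exon (0 if none).
            overlap = max(0, min(read_end, exon_end) - max(read_start, exon_start))
            counts[name] += overlap / read_length
    return counts


# Hypothetical gene A: exon 1 covers bases 0-199, exon 2 covers bases 200-499.
exons = {"exon1": (0, 200), "exon2": (200, 500)}

# Three made-up 100 bp reads; the second one straddles the exon boundary.
reads = [(0, 100), (150, 250), (300, 400)]

counts = fractional_exon_counts(reads, exons)
print(counts)                # each exon gets 1.5 reads
print(sum(counts.values()))  # 3.0, the same as counting each read once per gene
```

Because every read contributes a total weight of one, the exon counts add up to the gene-level count, so summing exon counts and counting reads per gene give the same number under this scheme.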