Hi Allen Brain Map community,
I am trying to understand how exon- and gene-level RPKM were calculated for the BrainSpan project. I would like some clarification, so please correct me if any of the following is wrong.
For exon-level expression, a program counts the number of reads mapping to an exon, and the RPKM is then calculated for that exon. Done this way, a read that crosses an exon-exon boundary (e.g. between exons 1 and 2 of gene A) is counted twice: once in exon 1 and once in exon 2.
For gene-level expression, a program will count the number of reads mapping to gene A, then you do the RPKM calculation as before. In this way, a read will only be counted once.
If the normalisation is done as described above for both gene- and exon-level expression, exons will tend to have a higher RPKM than the genes they come from, because a single read can be counted in more than one exon. This is especially a concern for genes that contain many short exons. Yet when I look at the data, the bias is not there (e.g. exon 1 RPKM is approximately equal to gene A RPKM). How is this possible? To calculate the gene-level RPKM, did you simply add up the counts from the list of exons in that gene?
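To make the concern concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the exon lengths, the read assignments, the library size and the names `rpkm`, `READS`, `EXON_LENGTHS`); none of it is taken from the BrainSpan data or pipeline. It assumes the usual definition RPKM = reads × 10^9 / (feature length in bp × total mapped reads) and compares counting each read once per gene against summing the per-exon counts.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)

# Hypothetical toy numbers, not BrainSpan values.
TOTAL_MAPPED_READS = 1_000_000
EXON_LENGTHS = {"exon1": 100, "exon2": 100}  # gene A, lengths in bp

# Each read is the tuple of exons it overlaps:
# 10 reads inside exon 1, 10 inside exon 2, 10 spanning the junction.
READS = [("exon1",)] * 10 + [("exon2",)] * 10 + [("exon1", "exon2")] * 10

# Exon-level counting: a junction read is counted in both exons.
exon_counts = {exon: sum(exon in read for read in READS) for exon in EXON_LENGTHS}
exon_rpkm = {
    exon: rpkm(exon_counts[exon], EXON_LENGTHS[exon], TOTAL_MAPPED_READS)
    for exon in EXON_LENGTHS
}

gene_length = sum(EXON_LENGTHS.values())

# Gene-level option 1: every read counted once for the gene.
gene_rpkm_once = rpkm(len(READS), gene_length, TOTAL_MAPPED_READS)

# Gene-level option 2: gene count is the sum of the exon counts,
# so junction reads are counted twice here as well.
gene_rpkm_summed = rpkm(sum(exon_counts.values()), gene_length, TOTAL_MAPPED_READS)

print("exon RPKM:", exon_rpkm)
print("gene RPKM, reads counted once:", gene_rpkm_once)
print("gene RPKM, exon counts summed:", gene_rpkm_summed)
```

With these toy numbers, the exon RPKMs come out higher than the gene RPKM when each read is counted once, but equal to it when the exon counts are summed. In general, summing exon counts makes the gene RPKM the length-weighted mean of its exon RPKMs, which is why I suspect that approach.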
Thank you for your time