Is RPKM information included in the scRNA-seq dataset?

Hi everyone,

I am looking at the recent release of the Allen Brain Cell Atlas (AWS S3 Explorer), and I am wondering whether the .h5ad files, e.g. WMB-10Xv3-STR-raw.h5ad, include RPKM information. If yes, how can I load it? If not, is there a way I can calculate RPKM?

Hi @sophiechenhf,

We typically do not use RPKM (reads per kilobase per million) for analysis of single cell/nucleus RNA-seq data. RPKM normalizes counts by sequencing depth and by gene length, but because (1) many reads come from a gene’s introns and (2) 10x methods typically read from only one end of the gene, neither the transcript length nor a gene’s genomic extent precisely captures the distribution of reads in these scRNA-seq studies. Instead, we typically use CPM (counts per million), sometimes counts per 10,000, or the normalized values provided by standard scRNA-seq tools like scVI or Seurat.
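
For reference, CPM is just a per-cell rescaling of the raw counts. Here is a minimal sketch with NumPy, assuming a dense cells-by-genes count matrix; the function name `counts_per_million` and the toy values are made up for illustration and are not taken from the Allen files:

```python
import numpy as np

def counts_per_million(counts, scale=1e6):
    """Rescale each cell (row) so that its counts sum to `scale`."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Cells with zero total counts stay all-zero instead of dividing by zero.
    totals[totals == 0] = 1.0
    return counts / totals * scale

# Toy cells-by-genes matrix (2 cells, 3 genes); the values are invented.
toy_counts = np.array([[10, 30, 60],
                       [0, 5, 15]])

cpm = counts_per_million(toy_counts)
log_cpm = np.log2(cpm + 1)                           # a common follow-up transform
per_10k = counts_per_million(toy_counts, scale=1e4)  # counts per 10,000
```

If you load the .h5ad file as an AnnData object, scanpy's `sc.pp.normalize_total(adata, target_sum=1e6)` does the same rescaling in place and handles sparse matrices.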

That said, if you’d still like to use RPKM, there is a standard formula, and many tools provide one-line functions to convert your data. Here is a video and short blog post describing it: RPKM, FPKM and TPM, clearly explained | RNA-Seq Blog.
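
The standard formula is RPKM = reads for the gene × 10^9 / (total reads in the cell × gene length in bp). Below is a rough sketch of that calculation, assuming a dense cells-by-genes count matrix and a vector of gene lengths in base pairs in the same gene order. The function name `rpkm` and the toy numbers are illustrative only:

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """RPKM = counts / (total counts per cell / 1e6) / (gene length / 1e3)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1e3
    per_million = counts.sum(axis=1, keepdims=True) / 1e6
    # Leave empty cells as zeros rather than dividing by zero.
    per_million[per_million == 0] = 1.0
    return counts / per_million / lengths_kb

# Toy example: 1 cell, 3 genes, invented counts and lengths.
toy_counts = np.array([[10, 30, 60]])
toy_lengths_bp = np.array([1000, 2000, 3000])
toy_rpkm = rpkm(toy_counts, toy_lengths_bp)
```

For 10x UMI data the "reads" here are really UMI counts, so the result is RPKM-like rather than a true per-read value.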

You’ll also need gene lengths, which can be calculated from the relevant reference genome files on this page: Reference Genome Files (.gtf) - brain-map.org. I’m not sure off the top of my head exactly how you’d calculate the gene length, but it would take a bit of work. I think you could look for all the rows corresponding to unique exons for a given gene, and then add up the distances between genome start and genome end for each exon (merging any exons that overlap so that no base is counted twice), but other folks may have a better answer to that.
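
One way to sketch the exon-summing idea in plain Python is shown below. It assumes a standard 9-column, tab-separated GTF with a `gene_id "..."` attribute on each exon row, and that each gene ID sits on a single chromosome. The function name `exon_union_lengths` and the toy rows are made up for illustration, and I have not run this against the Allen reference files:

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')

def exon_union_lengths(gtf_lines):
    """Return {gene_id: number of bases covered by at least one exon}."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:   # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:                      # gap: close the current block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1  # GTF coordinates are 1-based, inclusive
        lengths[gene] = total
    return lengths

def toy_row(chrom, feature, start, end, gene, transcript):
    attrs = f'gene_id "{gene}"; transcript_id "{transcript}";'
    return "\t".join([chrom, "toy", feature, str(start), str(end), ".", "+", ".", attrs])

toy_gtf = [
    "# toy annotation, not real data",
    toy_row("chr1", "gene", 100, 450, "G1", ""),
    toy_row("chr1", "exon", 100, 200, "G1", "T1"),
    toy_row("chr1", "exon", 150, 300, "G1", "T2"),  # overlaps the exon above
    toy_row("chr1", "exon", 400, 450, "G1", "T1"),
    toy_row("chr2", "exon", 10, 19, "G2", "T3"),
]
toy_lengths = exon_union_lengths(toy_gtf)
```

On a real file you would pass the open file handle instead of the toy list, e.g. `with open(path_to_gtf) as fh: lengths = exon_union_lengths(fh)`, where `path_to_gtf` is wherever you saved the downloaded .gtf. Merging overlapping exons is one convention among several; some tools use the longest transcript instead, which gives different lengths.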

Best,
Jeremy