I would like to use the M1 10X dataset from @trygveb paper, ‘Evolution of cellular diversity in primary motor cortex of human, marmoset, monkey, and mouse’ as a reference for snRNA-seq. I obtained the data from the Allen Brain Map website: https://portal.brain-map.org/atlases-and-data/rnaseq/human-m1-10x. I see in this paper that the data were normalized and scaled. What I would like to know is whether the dataset provided on the website is the raw, unprocessed data. I am using Seurat and SingleR to annotate cell types, which require normalized and scaled expression matrices. Do I need to normalize and scale this dataset before using it as a reference, or is the provided expression matrix already processed? Thanks!
Hi @danielcgingerich. The human data at the link above represent total reads assigned to a given gene for a given nucleus (introns + exons). Typically, we normalize and scale snRNA-seq data by calculating counts per million (CPM) and then log-normalizing, e.g.
log2(CPM(introns+exons)+1). In theory the unique molecular identifier (UMI) values represent the actual number of transcripts in the cell and don’t need to be normalized, but in practice we (and many others) find that such normalization and scaling improves the results.
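For concreteness, the log2(CPM + 1) transform described above can be sketched in a few lines of NumPy. This is just an illustration of the formula on a toy genes × cells count matrix, not the pipeline used for the paper (the function name `log2_cpm` and the toy values are my own):

```python
import numpy as np

def log2_cpm(counts):
    """Normalize a genes x cells raw UMI count matrix:
    scale each cell (column) to counts per million, then take log2(CPM + 1)."""
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum(axis=0) * 1e6  # each column now sums to 1e6
    return np.log2(cpm + 1)

# toy example: 3 genes x 2 nuclei of raw UMI counts
raw = np.array([[10, 0],
                [30, 5],
                [60, 5]])
norm = log2_cpm(raw)
```

After this transform, undoing the log on any column and summing recovers one million, which is a quick sanity check that the per-cell scaling was applied correctly.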
The data in the gene expression matrix .csv file are raw UMI counts. The values are not normalized and represent the direct output matrix from Cell Ranger. For analysis in Seurat, we find that either the SCTransform normalization method or the standard NormalizeData function works well.
Code for reproducing our analyses will be provided upon peer-review publication of this manuscript.
Please let us know if you have additional questions!