CPM normalization of 10X mouse single-cell RNA-seq

RandallJEllis · August 19, 2021, 2:39pm

I am using the “WHOLE CORTEX & HIPPOCAMPUS - 10X GENOMICS (2020) WITH 10X-SMART-SEQ TAXONOMY (2020)” dataset, and I want to look at how certain genes are expressed across clusters, regions, and subclasses, as well as their co-expression.

The full dataset includes a matrix.csv file, which contains raw UMI counts for all cells and all genes, and a trimmed_means.csv file, which gives the normalized mean expression of all genes in all 378 clusters. I would like to look at normalized mean expression and variability (e.g., standard deviation, standard error of the mean) not just for clusters, but also for subclasses and regions.

From the documentation and the Transcriptomics Explorer, it seems that trimmed_means.csv is calculated by taking all cells in a cluster and, for each gene, removing the top 25% and bottom 25% of the values, then taking log2(CPM + 1) of what remains. I assume the mean of these values is then taken?

I wish there were a normalized_matrix.csv file, but if I have to reverse engineer the trimmed_means.csv process, any help is appreciated.

EDIT: I was able to get closer to the values in trimmed_means.csv by taking the log2(CPM + 1) of all cells, then excluding the top and bottom 25%, and taking the mean of the remaining values. As an example, for a single cluster and a single gene, the trimmed_means.csv value is 6.651905, and I’m getting 6.652420762158425.
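
A small difference like this could come from many places (input precision, rounding, library-size definition), and nothing in this thread pins it down. One possibility worth ruling out is the trimming convention itself: dropping a fixed number of values from each end (as I understand it, what R's mean(x, trim = 0.25) does) is not the same as keeping the values between the 25th and 75th percentiles. The sketch below uses made-up values and hypothetical helper names to show that the two conventions can disagree.

```python
import numpy as np

def trimmed_mean_by_count(values, trim=0.25):
    """Drop floor(n * trim) values from each end of the sorted data, then average."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(len(x) * trim))
    return x[k:len(x) - k].mean()

def trimmed_mean_by_quantile(values, trim=0.25):
    """Keep values within the [trim, 1 - trim] quantiles (inclusive), then average."""
    x = np.asarray(values, dtype=float)
    lo, hi = np.quantile(x, [trim, 1 - trim])
    return x[(x >= lo) & (x <= hi)].mean()

# Made-up log2(CPM + 1) values for one gene in one hypothetical cluster
values = [0.0, 0.0, 5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 12.0]

by_count = trimmed_mean_by_count(values)
by_quantile = trimmed_mean_by_quantile(values)
print(f"count-based: {by_count:.4f}  quantile-based: {by_quantile:.4f}")
```

With ten values, the count-based version drops two from each end and averages six, while the quantile-based version averages only the four values inside the interquartile range, so the results differ.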

yzizhen · August 24, 2021, 8:57pm

Yes, you are right. log2(CPM+1) normalization is applied to all the data, and the trimmed means are then computed for each cluster and each gene from the normalized data.
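
The order of operations described here (normalize every cell first, then take a trimmed mean per cluster and gene) can be sketched with pandas as follows. This is a minimal illustration on toy counts, not the Allen Institute's actual pipeline: the helper names, the cluster_label name, and the count-based 25% trim are assumptions.

```python
import numpy as np
import pandas as pd

def log2_cpm(counts):
    """log2(CPM + 1) for a cells x genes DataFrame of raw UMI counts."""
    libsize = counts.sum(axis=1)  # total counts per cell
    return np.log2(counts.div(libsize, axis=0) * 1_000_000 + 1)

def trimmed_mean(col, trim=0.25):
    """Assumed convention: drop floor(n * trim) values from each end, then average."""
    x = np.sort(col.to_numpy(dtype=float))
    k = int(np.floor(len(x) * trim))
    return x[k:len(x) - k].mean()

# Toy raw counts: 8 cells x 2 genes, each cell summing to 100 UMIs
counts = pd.DataFrame(
    {"GeneA": [10, 0, 5, 20, 1, 3, 8, 0],
     "GeneB": [90, 100, 95, 80, 99, 97, 92, 100]},
    index=[f"cell{i}" for i in range(8)],
)
# Hypothetical cluster assignment for each cell
clusters = pd.Series(["c1"] * 4 + ["c2"] * 4, index=counts.index, name="cluster_label")

# Step 1: normalize all cells. Step 2: trimmed mean per cluster and gene.
logcpm = log2_cpm(counts)
trimmed_means = logcpm.groupby(clusters).agg(trimmed_mean).T  # genes x clusters
print(trimmed_means)
```

Normalizing before trimming matters because the trim is applied to the log2(CPM + 1) values of each gene within a cluster, not to the raw counts.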

RandallJEllis · August 25, 2021, 7:25pm

Hm, ok. I’m not sure why I’m getting the mismatch of values in my edit, but I did exactly what you wrote so I will move forward. As a small recommendation, I think a lot of researchers would appreciate making available a normalized dataset for downstream applications. For anybody looking to normalize these data, this is the code I used:

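import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# matrix.csv has one row per cell: the first column is sample_name,
# the remaining columns are raw UMI counts per gene.
# Read 1000 cells at a time to keep memory use manageable.
chunks = pd.read_csv('matrix.csv', chunksize=1000)
print('iterator made')

# Note: write_to_dataset adds new files to cpm_matrix.parquet on every call,
# so remove that directory before re-running the script.
for c, chunk in enumerate(chunks):
    print(c)
    counts = chunk.iloc[:, 1:]
    libsize = counts.sum(axis=1)  # total UMI count per cell
    # log2(CPM + 1): scale each cell to one million counts, add 1, take log2
    logcpm = np.log2(counts.div(libsize, axis=0) * 1_000_000 + 1)
    logcpm = logcpm.astype('float32')  # reduce memory from float64
    logcpm.insert(0, 'sample_name', chunk.iloc[:, 0])

    pq.write_to_dataset(pa.Table.from_pandas(logcpm), root_path='cpm_matrix.parquet')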
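
Once the normalized values exist, the per-subclass or per-region mean, standard deviation, and standard error asked about in the first post could be computed along these lines. This is only a sketch on toy data: the metadata file name, the subclass_label column, and the summarize_by_group helper are assumptions, so check them against the files you actually downloaded.

```python
import pandas as pd

def summarize_by_group(logcpm, metadata, group_col, genes):
    """Mean, SD, SEM and cell count of selected genes per group (e.g. subclass or region)."""
    merged = logcpm[["sample_name", *genes]].merge(
        metadata[["sample_name", group_col]], on="sample_name", how="inner"
    )
    return merged.groupby(group_col)[genes].agg(["mean", "std", "sem", "count"])

# With the real data one might load only the needed columns, for example:
#   logcpm = pd.read_parquet("cpm_matrix.parquet", columns=["sample_name", "Gad1"])
#   metadata = pd.read_csv("metadata.csv")  # assumed file name and column names
# Toy stand-ins so the sketch runs on its own:
logcpm = pd.DataFrame(
    {"sample_name": ["a", "b", "c", "d"], "Gad1": [1.0, 3.0, 5.0, 9.0]}
)
metadata = pd.DataFrame(
    {"sample_name": ["a", "b", "c", "d"],
     "subclass_label": ["Sst", "Sst", "Pvalb", "Pvalb"]}
)

summary = summarize_by_group(logcpm, metadata, "subclass_label", ["Gad1"])
print(summary)
```

Reading only the genes of interest from the Parquet dataset keeps memory use low, since Parquet is stored by column. Note that these are plain (untrimmed) statistics, so they will not match trimmed_means.csv.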
