CPM normalization of 10X mouse single-cell RNA-seq

I am using the “WHOLE CORTEX & HIPPOCAMPUS - 10X GENOMICS (2020) WITH 10X-SMART-SEQ TAXONOMY (2020)” dataset, and I want to look at how certain genes are expressed across clusters, regions, and subclasses, as well as their co-expression.

The full dataset includes a matrix.csv file, which contains raw UMI counts for all cells and all genes, and a trimmed_means.csv file, which gives the normalized mean expression of every gene in each of the 378 clusters. I would like to look at normalized mean expression and its variability (e.g., standard deviation, standard error of the mean) not just for clusters, but also for subclasses and regions.

Looking at the documentation and the Transcriptomics Explorer, it seems that trimmed_means.csv is calculated by taking all cells in a cluster and, for each gene, removing the top 25% and bottom 25% of the values, then taking log2(CPM + 1) of what remains. I assume the mean of these values is then taken?

I wish there were a normalized_matrix.csv file, but if I have to reverse engineer the trimmed_means.csv process, any help is appreciated.

EDIT: I was able to get closer to the values in trimmed_means.csv by taking the log2(CPM + 1) of all cells, then excluding the top and bottom 25%, and taking the mean of the remaining values. As an example, for a single cluster and a single gene, the trimmed_means.csv value is 6.651905, and I’m getting 6.652420762158425.

Yes, you are right. log2(CPM + 1) normalization is applied to all the data, and then the trimmed means are computed for each cluster and each gene from the normalized data.
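
For anyone who wants to see that order of operations spelled out, here is a minimal sketch on made-up counts for a single cluster. The helper name trimmed_mean, the toy matrix, and the way the 25% cut is rounded when the cell count is not divisible by four are all assumptions for illustration, not the exact procedure used to build trimmed_means.csv.

```python
import numpy as np

def trimmed_mean(values, proportion=0.25):
    """Mean of the values left after dropping `proportion` of them from each end."""
    ordered = np.sort(np.asarray(values, dtype=np.float64))
    # Number of cells cut from each tail; rounding down is an assumption.
    k = int(np.floor(proportion * ordered.size))
    return ordered[k:ordered.size - k].mean()

# Hypothetical raw UMI counts for one cluster: 8 cells (rows) x 3 genes (columns).
counts = np.array([
    [0, 10, 90],
    [5, 15, 80],
    [1, 9, 90],
    [0, 20, 180],
    [2, 18, 80],
    [4, 6, 90],
    [0, 0, 100],
    [3, 27, 70],
], dtype=np.float64)

# Step 1: normalize every cell to log2(CPM + 1) using its own library size.
libsize = counts.sum(axis=1, keepdims=True)
logcpm = np.log2(counts / libsize * 1_000_000 + 1)

# Step 2: for each gene, trim 25% of cells from each end, then average the rest.
cluster_trimmed_means = np.array(
    [trimmed_mean(logcpm[:, g]) for g in range(logcpm.shape[1])]
)
print(cluster_trimmed_means)
```

Note that the trimming here is done on the normalized values, separately for each gene, so a cell that is dropped for one gene can still contribute to another.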

Hm, ok. I’m not sure why I’m getting the mismatch of values in my edit, but I did exactly what you wrote, so I will move forward. As a small recommendation, I think a lot of researchers would appreciate having a normalized dataset made available for downstream applications. For anybody looking to normalize these data, this is the code I used:

import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

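# Read the raw UMI count matrix in chunks of 1,000 cells to keep memory use bounded.
# Assumes the first column is the cell ID (sample_name) and the rest are genes.
reader = pd.read_csv('matrix.csv', chunksize=1000)
print('iterator made')

for chunk in reader:
    counts = chunk.iloc[:, 1:]
    libsize = counts.sum(axis=1)  # total UMIs per cell
    # log2(CPM + 1): scale each cell to one million counts, add 1, take log2
    logcpm = np.log2(counts.div(libsize, axis=0) * 1_000_000 + 1)
    logcpm = logcpm.astype('float32')  # reduce memory from float64
    logcpm.insert(0, 'sample_name', chunk.iloc[:, 0])
    table = pa.Table.from_pandas(logcpm)

    # Each chunk is written as a new file under the cpm_matrix.parquet directory
    pq.write_to_dataset(table, root_path='cpm_matrix.parquet')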
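
Once the data are normalized, the per-subclass or per-region mean, standard deviation, and standard error can be computed by joining the expression values to the per-cell annotations and grouping. The sketch below uses small made-up tables; the gene names, the helper group_stats, and the annotation column names (subclass_label, region_label) are assumptions, so check them against the metadata file that ships with the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical log2(CPM + 1) values for six cells and two genes.
logcpm = pd.DataFrame(
    {
        "sample_name": ["c1", "c2", "c3", "c4", "c5", "c6"],
        "GeneA": [1.0, 3.0, 5.0, 2.0, 2.0, 2.0],
        "GeneB": [0.0, 0.0, 6.0, 4.0, 6.0, 8.0],
    }
)

# Hypothetical per-cell annotations; the column names are assumptions.
metadata = pd.DataFrame(
    {
        "sample_name": ["c1", "c2", "c3", "c4", "c5", "c6"],
        "subclass_label": ["L2/3 IT", "L2/3 IT", "L2/3 IT", "Sst", "Sst", "Sst"],
        "region_label": ["VIS", "VIS", "HIP", "HIP", "VIS", "HIP"],
    }
)

def group_stats(expr, meta, group_col, genes):
    """Mean, sample SD (ddof=1), and SEM of each gene within each group."""
    merged = meta[["sample_name", group_col]].merge(expr, on="sample_name")
    return merged.groupby(group_col)[genes].agg(["mean", "std", "sem"])

subclass_stats = group_stats(logcpm, metadata, "subclass_label", ["GeneA", "GeneB"])
region_stats = group_stats(logcpm, metadata, "region_label", ["GeneA", "GeneB"])
print(subclass_stats)
print(region_stats)
```

With the Parquet output from the code above, something like pd.read_parquet('cpm_matrix.parquet', columns=['sample_name', ...]) should let you load only the genes of interest rather than the whole matrix. These are plain (untrimmed) statistics, so the means will not match trimmed_means.csv.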