Metadata of mouse whole cortex and hippocampus 10x data set

I recently downloaded the “mouse whole cortex and hippocampus 10x” data set. As a first step, I loaded the “gene expression matrix” in Python and created a t-SNE plot. Individual data points were colored according to the “class_label” information (e.g. Glutamatergic) in the “table of cell metadata” .csv file. To my surprise, the class labels did not highlight groups of clusters in the t-SNE plot as expected, but were spread randomly across it. Next, I checked whether the clusters in the t-SNE plot were actually clusters of similar cell types by coloring the individual data points by the transcript count of several canonical marker genes. As expected, cells expressing the same canonical marker gene (e.g. Vip) clustered together. It therefore seems that I failed to retrieve the correct metadata for each individual cell from the “table of cell metadata”. Please correct me if I am wrong, but the identifier in the “sample_name” column is what I should use to look up the metadata in the “table of cell metadata” .csv file for a particular cell/row in the “gene expression matrix”, right?
Thank you for your help!

Hi @Michael,

You are correct that the “sample_name” column is the identifier used in the matrix. Have you checked whether those identifiers match in the files you downloaded? The order of the samples in the matrix is not identical to the order of the samples in the metadata file. Could this cause the mismatch you observe?

Best,
Cindy
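
One way to run the check suggested above is to compare the two lists of “sample_name” identifiers directly. The following is a minimal sketch in plain Python, not an official Allen Institute tool: the helper name compare_ids and the toy identifiers are made up for illustration, and in practice the two lists would be read from the first column of the matrix file and the “sample_name” column of the metadata file.

```python
def compare_ids(matrix_ids, metadata_ids):
    """Summarise how two lists of sample_name identifiers relate."""
    matrix_set = set(matrix_ids)
    metadata_set = set(metadata_ids)
    return {
        # identifiers present in one file but missing from the other
        "only_in_matrix": len(matrix_set - metadata_set),
        "only_in_metadata": len(metadata_set - matrix_set),
        # repeated identifiers would make a lookup by name ambiguous
        "duplicates_in_matrix": len(matrix_ids) - len(matrix_set),
        "duplicates_in_metadata": len(metadata_ids) - len(metadata_set),
        # True only if row i of both files refers to the same cell
        "same_order": list(matrix_ids) == list(metadata_ids),
    }


# Toy identifiers (hypothetical): same cells, different row order
matrix_ids = ["cell_c", "cell_a", "cell_b"]
metadata_ids = ["cell_a", "cell_b", "cell_c"]

report = compare_ids(matrix_ids, metadata_ids)
print(report)
```

If "same_order" comes back False while both "only_in_..." counts are zero, the files describe the same cells but assigning labels by row position would scramble them.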

I’m having the same issue. Expression patterns in the Allen cell browser (10X mouse cortex & hippocampus) do not match those in the publicly downloadable data. Other data sets seem to be fine, including Smart-seq and Human Cortex.

A clear example of this is to compare Pvalb expression patterns in the UCSC cell atlas (a browser session derived from the Allen Brain Atlas public files):

https://cells.ucsc.edu/?ds=allen-celltypes+mouse-cortex+mouse-cortex-2020&gene=Pvalb

with Pvalb expression in the transcriptomics explorer:

https://celltypes.brain-map.org/rnaseq/mouse_ctx-hip_10x?selectedVisualization=Scatter+Plot&colorByFeature=Gene+Expression&colorByFeatureValue=Pvalb

Hi @dvera, yes, this is the same problem as described before. Since the order of the samples in the count matrix does not match the order of the samples in the metadata, have you checked whether the metadata is assigned to the correct cells?

To the best of my knowledge, the metadata is matched to the expression matrix based on the cell names, not by index/row#.

Also note that this problem is specific to this particular dataset (mouse 10x cortex/hippocampus). The same methods for assigning metadata to cells work in all the other data sets, suggesting there is a problem with the public dataset itself for mouse 10x cortex/hippocampus.
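
For reference, matching by cell name rather than by row number can be sketched as follows. This is only an illustration under assumptions: the function name labels_in_matrix_order and the toy rows are hypothetical, and metadata_rows stands in for what csv.DictReader would yield from the metadata .csv file.

```python
def labels_in_matrix_order(matrix_ids, metadata_rows, label_column="class_label"):
    """Return one label per matrix row, looked up by sample_name."""
    # Build a name -> label lookup once, then query it in matrix order
    label_by_id = {row["sample_name"]: row[label_column] for row in metadata_rows}
    # Cells missing from the metadata get None instead of a wrong label
    return [label_by_id.get(sample_id) for sample_id in matrix_ids]


# Toy data (hypothetical identifiers): the two files list cells in different orders
matrix_ids = ["cell_c", "cell_a", "cell_b"]
metadata_rows = [
    {"sample_name": "cell_a", "class_label": "Glutamatergic"},
    {"sample_name": "cell_b", "class_label": "GABAergic"},
    {"sample_name": "cell_c", "class_label": "Non-Neuronal"},
]

aligned = labels_in_matrix_order(matrix_ids, metadata_rows)
by_position = [row["class_label"] for row in metadata_rows]

print(aligned)      # labels in the order of the matrix rows
print(by_position)  # labels in metadata file order, wrong for this matrix
```

With pandas, a roughly equivalent step would be meta.set_index("sample_name").reindex(matrix_ids), assuming meta is the metadata loaded as a DataFrame.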

Ok, I had to download the data to verify, but I can’t reproduce the problem. I used the following code to check the expression of Pvalb in the UMAP space. Could you give this a try and see if it works?

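library(data.table)  # fread
library(dplyr)       # select
library(ggplot2)

# Expression matrix: one row per cell, first column holds the cell identifier
mat <- fread("matrix.csv")
colnames(mat)[1] <- "sample_name"

#meta <- fread("metadata.csv")
# 2D coordinates, read here from tsne.csv
umap.2d <- fread("tsne.csv")

rd.dat <- as.data.frame(umap.2d)
colnames(rd.dat)[1:3] <- c("sample_name", "Dim1", "Dim2")
sub.mat <- select(mat, all_of(c("sample_name", "Pvalb")))
# Match expression values to coordinates by sample_name, not by row position
rd.dat$expr <- sub.mat$Pvalb[match(rd.dat$sample_name, sub.mat$sample_name)]
# Sort so that high-expressing cells are drawn last (on top)
rd.dat <- rd.dat[order(rd.dat$expr), ]
p <- ggplot(rd.dat, aes(Dim1, Dim2)) +
  geom_point(aes(color = expr), size = 0.15) +
  scale_color_gradient(low = "gray80", high = "red") +
  theme_void() + theme(legend.position = "none") +
  coord_fixed(ratio = 1) +
  ggtitle("Pvalb")
print(p)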