MapMyCells cell type assignment question

Hello!

Is MapMyCells cell type assignment based wholly on the presence of marker genes of a given cell type, or is it also affected by the absence of marker genes in the same cell type? In other words, how would correct assignment be affected if we are not doing sequencing-based transcriptomics, but are instead using a pre-designed panel that lacks many of the cell type marker genes? I really appreciate the bootstrapping probability output feature. Thank you for including that!

Hi @GraceRosen

If a marker gene is missing from the query data you submit, MapMyCells will just ignore that marker gene. Nothing will be inferred from its absence.

How is performance affected by the absence of expected marker genes? It depends. Not all marker genes are equally effective. The best example is the MERFISH data that was included in the Yao et al. 2023 publication. Since it was taken with MERFISH, the data only had a 500 gene panel. However, those 500 genes were selected to be especially good markers from the MapMyCells marker set. As such, when we take the scRNA data from Yao et al., downsample it to only include those 500 genes, and do a test/train hold out with MapMyCells, we find that mapping accuracy is basically unaffected by the fact that we only have 500 of the 6500 expected markers. Granted: this is a very special case. The 500 gene panel was designed with foreknowledge of the cell type taxonomy and the marker genes. I think the general lesson is that, if you are clever about your gene selection (i.e. if you actually choose interesting genes, which I suspect most people will), MapMyCells will do fine. Here are some things you can look at to quantify what I mean by “will do fine”:

Look at the avg_correlation score each mapping has. Because MapMyCells has to assign a cell somewhere in the taxonomy, you may find cases where the mapping is bizarre but bootstrapping_probability = 1.0, simply because the cell has to be put somewhere. In these cases, you should expect low avg_correlation scores (I would be suspicious of anything under 0.4, but I haven’t really put a ton of analysis into that threshold).

The JSON output will record all the marker genes that MapMyCells did use (look in the 'marker_genes' field). You can look to see how many markers were available at each decision point in the taxonomy and whether or not they make sense.

In the most extreme case, you could download the Yao et al. scRNAseq data with abc_atlas_access, downsample the data to only include genes from your gene panel, perform a mapping, and see how accurate MapMyCells was (the Yao et al. data is annotated with “ground truth” cell type assignments, since this is the data that was used to define the cell type taxonomy). That’s a lot of work though.
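To make the three checks above a bit more concrete, here is a minimal Python sketch on toy data. It is only an illustration: the row keys (cell_id, avg_correlation), the assumption that 'marker_genes' maps each decision point to a list of genes, and all helper names are hypothetical, so check them against your actual MapMyCells output. The mapping step itself is not shown; mapped_labels stands in for whatever MapMyCells returns on the downsampled data.

```python
# All field names, keys, and helper names here are illustrative assumptions,
# not a confirmed description of the MapMyCells output schema.

def flag_low_correlation(rows, min_corr=0.4):
    """Check 1: IDs of cells whose avg_correlation is below min_corr."""
    return [r["cell_id"] for r in rows if r["avg_correlation"] < min_corr]

def markers_per_decision_point(result):
    """Check 2: number of marker genes used at each decision point.

    Assumes result['marker_genes'] maps a decision point to a list of genes.
    """
    return {node: len(genes) for node, genes in result["marker_genes"].items()}

def restrict_to_panel(expression, panel):
    """Check 3a: keep only the genes that are on the gene panel."""
    return {gene: vals for gene, vals in expression.items() if gene in panel}

def accuracy(true_labels, mapped_labels):
    """Check 3b: fraction of cells mapped to their annotated cell type."""
    hits = sum(t == m for t, m in zip(true_labels, mapped_labels))
    return hits / len(true_labels)

# Toy inputs (made up for this sketch).
rows = [
    {"cell_id": "c0", "bootstrapping_probability": 1.0, "avg_correlation": 0.82},
    {"cell_id": "c1", "bootstrapping_probability": 1.0, "avg_correlation": 0.21},
]
result = {"marker_genes": {"root": ["GeneA", "GeneB", "GeneC"], "class_1": ["GeneA"]}}
expression = {"GeneA": [1.0, 0.0], "GeneB": [2.0, 3.0], "GeneC": [0.5, 4.0]}
panel = {"GeneA", "GeneC"}

suspect_cells = flag_low_correlation(rows)
marker_counts = markers_per_decision_point(result)
panel_expression = restrict_to_panel(expression, panel)
frac_correct = accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"])

print(suspect_cells, marker_counts, sorted(panel_expression), frac_correct)
```

A decision point with very few surviving markers, or a large share of cells under the correlation threshold, would be a hint that the panel is too sparse for that part of the taxonomy.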

Let me know if you have any other questions (though, with the coming holidays, I probably won’t see this thread again until January 6).

Hi @danielsf, thank you so much for the explanation, and happy new year! That is a great idea about the MERFISH data; I will go that route. I am having some trouble wrapping my head around how avg_correlation is calculated. Could you help me understand?

I see in your notebook here the explanation that it is the correlation between the gene profile of the cell in question and the average MapMyCells atlas cell of that type, averaged over the bootstrap iterations that gave rise to that classification. Which variables are being correlated? In this case, is the average gene expression collapsed down to a single value for the atlas cell average vs. the cell in question? Or is there a value for each marker gene? If it is the former, how is that single value calculated?

Thank you very much for considering. I have been appreciating the information you share about this tool in general.

Hi @GraceRosen ,

Happy new year!

The avg_correlation statistic is a single value for each (cell, cell type) pair. It is calculated as follows:

At each bootstrapping iteration, the cell will be represented by a vector of a few hundred genes

v_cell = [cell_gene0, cell_gene1, cell_gene2, cell_gene3, ...]

where the genes in the vector are the subset of marker genes selected to be used for that bootstrapping iteration. The values making up the vector are the values of the expression of the cell in those chosen genes.

Each cell type will similarly be represented by a vector

v_cell_type_N = [type_N_gene0, type_N_gene1, type_N_gene2, type_N_gene3, ...]

The genes are the same as in v_cell, but here the values are the average expression of that cell type in each of the chosen genes.

The correlation between a cell and cell type N is just the Pearson correlation coefficient

corr = mean[(v_cell-mean[v_cell])*(v_cell_type_N-mean[v_cell_type_N])]/(std[v_cell]*std[v_cell_type_N])

where the means and standard deviations are taken over the chosen genes.
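For readers who prefer code, the formula can be written as a few lines of Python. This is just a sketch of the textbook Pearson coefficient using the standard library, not the MapMyCells source; the function name pearson and the toy vectors are made up for illustration.

```python
from statistics import fmean, pstdev

def pearson(v_cell, v_cell_type):
    """Pearson correlation between two equal-length expression vectors.

    Means and (population) standard deviations are taken over the genes,
    i.e. over the entries of the vectors.
    """
    mean_cell = fmean(v_cell)
    mean_type = fmean(v_cell_type)
    # Mean of the products of the deviations from each vector's mean.
    covariance = fmean(
        [(c - mean_cell) * (t - mean_type) for c, t in zip(v_cell, v_cell_type)]
    )
    return covariance / (pstdev(v_cell) * pstdev(v_cell_type))

# Toy vectors: the cell type profile is exactly twice the cell's profile,
# so the correlation should be very close to 1.
corr_same_shape = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
# Reversed profile: the correlation should be very close to -1.
corr_reversed = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(corr_same_shape, corr_reversed)
```

Note that the coefficient only compares the shape of the two expression profiles across genes, so it is insensitive to an overall scale factor or offset.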

The avg_correlation value reported by MapMyCells is, as you say, the average of this correlation coefficient over each of the bootstrapping iterations that actually chose the assigned cell type (i.e. if 52 of the 100 iterations chose cell_type_J, then the avg_correlation will be the average of these 52 corr values between the cell and cell_type_J).
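Putting the pieces together, the bookkeeping might look like the sketch below. It is a simplified illustration, not the actual MapMyCells implementation: it ignores the taxonomy hierarchy, it assumes that each iteration picks the cell type with the highest correlation, and the function name, the frac subsampling parameter, and the toy expression values are all hypothetical.

```python
import random
from statistics import fmean, pstdev

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (means over genes)."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = fmean([(x - mean_a) * (y - mean_b) for x, y in zip(a, b)])
    return cov / (pstdev(a) * pstdev(b))

def bootstrap_assign(cell, type_means, marker_genes, n_iter=100, frac=0.5, seed=0):
    """Return (assigned_type, bootstrapping_probability, avg_correlation).

    cell:       dict gene -> expression of the query cell
    type_means: dict cell type -> dict gene -> average expression of that type
    frac:       fraction of marker genes drawn per iteration (an assumption)
    """
    rng = random.Random(seed)
    n_pick = max(2, int(frac * len(marker_genes)))
    corr_by_winner = {}  # cell type -> corr values from iterations that chose it
    for _ in range(n_iter):
        genes = rng.sample(marker_genes, n_pick)  # marker subset for this iteration
        v_cell = [cell[g] for g in genes]
        corrs = {
            name: pearson(v_cell, [means[g] for g in genes])
            for name, means in type_means.items()
        }
        winner = max(corrs, key=corrs.get)  # assumed rule: highest correlation wins
        corr_by_winner.setdefault(winner, []).append(corrs[winner])
    # The assigned type is the one chosen by the most iterations.
    assigned = max(corr_by_winner, key=lambda name: len(corr_by_winner[name]))
    chosen_corrs = corr_by_winner[assigned]
    # Average only over the iterations that actually chose the assigned type.
    return assigned, len(chosen_corrs) / n_iter, fmean(chosen_corrs)

# Toy data: type_A resembles the cell, type_B has roughly the opposite profile.
marker_genes = ["g0", "g1", "g2", "g3", "g4", "g5"]
cell = {"g0": 5.0, "g1": 0.5, "g2": 3.0, "g3": 0.1, "g4": 4.0, "g5": 1.0}
type_means = {
    "type_A": {"g0": 5.2, "g1": 0.4, "g2": 2.9, "g3": 0.3, "g4": 4.1, "g5": 0.8},
    "type_B": {"g0": 0.9, "g1": 5.4, "g2": 3.2, "g3": 5.8, "g4": 1.8, "g5": 5.1},
}
assigned, prob, avg_corr = bootstrap_assign(cell, type_means, marker_genes)
print(assigned, prob, avg_corr)
```

The point of the sketch is the last step: correlations from iterations that chose a different cell type never enter the average, which is why avg_correlation can be low even when bootstrapping_probability is 1.0.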

Let me know if anything is still unclear.