Cell typing in limited cell types

Hi,

I’ve been working with CosMx data (1K gene panel), which is a sparse dataset with a limited number of reference genes present. I want to cell-type my cells. My first questions are about MapMyCells, which I run with hierarchical mapping against the mouse reference (my gene IDs are not ENSEMBL).

  1. Can I limit mapmycells to only map cells of a certain subtype? From dissection / H&E I know that I’ve isolated hippocampi. However, mapmycells also maps cell types that shouldn’t be present (e.g. 28 CB Gaba and 29 CB Glut).
  2. Should I convert to ENSEMBL IDs nevertheless?
  3. How well does Mapmycells handle sparse datasets? We are talking mean of 80 features and 120 transcripts per cell…

Another way of mapping would be to use the Allen data to build lists of marker genes and score UMAP clusters based on these marker genes. Are the .csv files available, or is there a way to generate these? E.g. if I select all ‘hippocampal (anatomical region)’ cells, is there a way to take marker genes from these cell types?

Also, is there a file that puts the nomenclature to understandable sentences? Things like 05 OB-IMN GABA sound like gibberish to me.

Hi @roanvanscheppingen,

Thanks for your interest in MapMyCells. Let me address your questions one at a time.

Can I limit mapmycells to only map cells of a certain subtype?

Unfortunately, no. That functionality is not available in MapMyCells at this time. You might try looking at the runner_up assignments made by MapMyCells. These are the cell type assignments that were almost but not quite chosen during hierarchical mapping. Those might be more believable in cases where you know the actual cell type assignment is wrong based on prior knowledge. The runner_up mappings are available in the JSON output file. We recently added a Jupyter notebook explaining how to access and explore that data here.
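As a rough illustration of the runner-up idea, here is a minimal sketch that pulls the chosen label and its runner-ups for one taxonomy level out of a parsed JSON result. The layout (a "results" list with one entry per cell, keyed by taxonomy level) and the field names `assignment`, `bootstrapping_probability`, `runner_up_assignment` and `runner_up_probability` are assumptions made for this example, and the values are toy numbers; check them against your own output file and the notebook mentioned above.

```python
import json

# Toy stand-in for a MapMyCells JSON result. The structure and field names
# are ASSUMPTIONS for illustration, and the probabilities are made up.
example_output = json.loads("""
{
  "results": [
    {"cell_id": "cell_1",
     "class": {"assignment": "28 CB Gaba",
               "bootstrapping_probability": 0.55,
               "runner_up_assignment": ["05 OB-IMN GABA"],
               "runner_up_probability": [0.45]}},
    {"cell_id": "cell_2",
     "class": {"assignment": "29 CB Glut",
               "bootstrapping_probability": 1.0,
               "runner_up_assignment": [],
               "runner_up_probability": []}}
  ]
}
""")


def runner_ups(output, level):
    """Return {cell_id: (assignment, probability, [(runner_up, probability), ...])}."""
    table = {}
    for cell in output["results"]:
        info = cell[level]
        # Pair each runner-up label with its probability, if any were reported.
        pairs = list(zip(info.get("runner_up_assignment", []),
                         info.get("runner_up_probability", [])))
        table[cell["cell_id"]] = (info["assignment"],
                                  info["bootstrapping_probability"],
                                  pairs)
    return table


summary = runner_ups(example_output, "class")
for cell_id, (label, prob, alternatives) in summary.items():
    print(cell_id, label, prob, alternatives)
```

Cells whose top assignment is anatomically implausible and whose probability is low are the ones where the runner-up list is worth a look.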

Should I convert to ENSEMBL IDs nevertheless?

That depends. MapMyCells will do its best to assign your genes to ENSEMBL IDs. The exact mapping performed by MapMyCells on your data is also available in the JSON output. The notebook I linked to above shows how to access that data as well.

In my opinion, if you can, you should always assign Ensembl IDs yourself so that there is no ambiguity due to, for instance, different versions of the reference genome being used by your data and MapMyCells.
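A minimal sketch of doing that assignment yourself, assuming you have a symbol-to-ID lookup table built from the annotation release that matches your data. The gene names and IDs below are placeholders, not real identifiers, and `to_ensembl` is a hypothetical helper written for this example.

```python
def to_ensembl(symbols, lookup):
    """Map gene symbols to Ensembl IDs and report anything not mapped cleanly."""
    mapped, unmapped, ambiguous = {}, [], []
    for symbol in symbols:
        ids = lookup.get(symbol, [])
        if len(ids) == 1:
            mapped[symbol] = ids[0]
        elif not ids:
            unmapped.append(symbol)      # e.g. control probes or outdated symbols
        else:
            ambiguous.append(symbol)     # one symbol, several candidate IDs
    return mapped, unmapped, ambiguous


# Hypothetical lookup table; the IDs are PLACEHOLDERS for illustration only.
lookup = {
    "GeneA": ["ENSMUSG_PLACEHOLDER_0001"],
    "GeneB": ["ENSMUSG_PLACEHOLDER_0002", "ENSMUSG_PLACEHOLDER_0003"],
}
panel = ["GeneA", "GeneB", "ControlProbe1"]

mapped, unmapped, ambiguous = to_ensembl(panel, lookup)
print("mapped:", mapped)
print("unmapped:", unmapped)
print("ambiguous:", ambiguous)
```

Keeping the unmapped and ambiguous lists explicit makes it easy to see how much of the panel is actually usable before running the mapping.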

How well does Mapmycells handle sparse datasets? We are talking mean of 80 features and 120 transcripts per cell…

I don’t have a ready-to-hand answer to this question. We have been experimenting with running MapMyCells on a test 10X dataset downsampled to have N genes (where N < the full 32,000 gene panel in the reference dataset). With only 1000 genes, it does pretty poorly. However, these tests were performed with 1000 random genes. When, for instance, we ran it with the data downsampled to the 500 genes in the MERFISH panel published in Yao et al. 2023, the performance was nearly equivalent to performance with the full 32,000 gene panel.

The key is the overlap between the genes in your gene panel and the marker genes used by MapMyCells. We are working on putting our list of marker genes in a publicly available space. I will post here when it is available.

One thing you can access, though (also in the JSON data returned by MapMyCells), is the subset of marker genes used by MapMyCells on your specific data (see the “Marker genes” subsection of the notebook I linked to above). If there are more than a few hundred marker genes at each level in the taxonomy, I would expect that the performance is pretty good, but that is just an off-the-cuff estimate.
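To make the overlap check concrete, here is a minimal sketch that counts how many marker genes at each taxonomy level are present in a panel. The `markers_by_level` dictionary is a hypothetical structure you would fill from the marker genes reported for your data; the gene names are toy values.

```python
def marker_overlap(markers_by_level, panel_genes):
    """Return {level: (n_markers_in_panel, n_markers_total)}."""
    panel = set(panel_genes)
    report = {}
    for level, markers in markers_by_level.items():
        unique = set(markers)
        report[level] = (len(unique & panel), len(unique))
    return report


# Toy inputs; in practice these come from the marker list and your panel.
markers_by_level = {
    "class": ["g1", "g2", "g3", "g4"],
    "subclass": ["g3", "g5", "g6"],
}
panel_genes = ["g2", "g3", "g7"]

overlap = marker_overlap(markers_by_level, panel_genes)
for level, (shared, total) in overlap.items():
    print(f"{level}: {shared} of {total} marker genes are in the panel")
```

A level with only a handful of shared markers is where the mapping is most likely to be unreliable.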

Are the .csv files [of marker genes] available, or is there a way to generate these?

As I said, we are working on making these available. I will let you know when they are.

Also, is there a file that puts the nomenclature to understandable sentences

One of my colleagues (@jeremyinseattle) pointed me to this table decoding the taxons for the Whole Mouse Brain taxonomy. He says there is an equivalent for the human taxonomy, if that is what you need.

Let me know if anything is unclear, or any new questions arise.

Hi @danielsf,

Thanks for the explanation. This, in tandem with your presentation, clears up a lot of questions already!

Indeed, I hope that the CosMx panel is more informative than 1000 random genes, but it is different than the MERFISH panel and we are not sure whether it holds up with MapMyCells. Taking the runner-ups might be an option, but we are worried this might introduce bias, since we would actively have to decide whether we ‘like’ the first suggestion by Allen.

What if I were to download the Allen data, subset it to the cell types we are interested in (let’s say hippocampal) and run Seurat FindAllMarkers or another marker gene identification? Could we then take the top x markers, intersect them with the Cosmx gene panel and use this as a proxy? I’ve seen this before in CosMx papers that use Allen marker genes, so I guess the marker lists were made by the respective labs.

Best,

Roan

Sorry for the late reply. I’m not sure what you are proposing when you say “take the top x markers, intersect them with the Cosmx gene panel and use this as a proxy?” What calculation are you proposing when you say “use this as a proxy”?

I am referring to how celltyping was done in this paper:
https://www.cell.com/cell-reports/fulltext/S2211-1247(24)00544-8?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2211124724005448%3Fshowall%3Dtrue#secsectitle0075

For analysis, a subset of the data was generated to only include regions annotated for ‘RSP’, ‘TEa-PERI-ECT’, ‘ACA’, ‘AI’, ‘SSs-GU-VISC-AIp’, ‘AUD’, ‘MOp’, ‘MOs_FRP’, ‘PL-ILA-ORB’, ‘PTLp’, ‘SSp’, ‘VIS’, ‘VISl’, ‘VISm’, ‘VISp’, ‘HIP’. The remaining cells were randomly sampled to retain 12.5% of the cells (131,169 total cells). Wilcoxon rank-sum test, with multiplicity correction using the Benjamini-Hochberg (BH) method, was used to obtain the top 200 markers for annotated cell types (Glutamatergic neurons, GABAergic neurons, microglia, astrocytes, endothelial, and oligodendrocytes).

These genes were then used to calculate a cell type score per Leiden cluster and to assign cluster names. That’s what I mean by “proxy”.
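A minimal sketch of that kind of scoring, under stated assumptions: each cluster is summarised by its mean normalised expression per gene, each cell type's score is the mean over its marker genes that are present in the panel, and the cluster takes the top-scoring label. This is a simplification written for illustration, not the paper's exact procedure; the gene names and values are toy data.

```python
from statistics import mean


def score_clusters(cluster_means, markers_by_type, panel_genes):
    """Label each cluster with the cell type whose in-panel markers score highest."""
    panel = set(panel_genes)
    labels = {}
    for cluster, expr in cluster_means.items():
        scores = {}
        for cell_type, markers in markers_by_type.items():
            # Only markers that are in the panel and measured can contribute.
            usable = [g for g in markers if g in panel and g in expr]
            if usable:
                scores[cell_type] = mean(expr[g] for g in usable)
        labels[cluster] = max(scores, key=scores.get) if scores else None
    return labels


# Toy per-cluster mean expression (assumed already normalised).
cluster_means = {
    "0": {"g1": 2.0, "g2": 1.5, "g3": 0.1, "g4": 0.0},
    "1": {"g1": 0.2, "g2": 0.1, "g3": 1.8, "g4": 2.2},
}
markers_by_type = {
    "Glutamatergic neurons": ["g1", "g2", "g9"],  # g9 is not in the panel
    "GABAergic neurons": ["g3", "g4"],
}
panel_genes = ["g1", "g2", "g3", "g4"]

labels = score_clusters(cluster_means, markers_by_type, panel_genes)
print(labels)
```

Because cell types can end up with different numbers of usable markers after intersecting with the panel, it is worth checking how many markers each score rests on before trusting a label.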

Best, Roan

Yes. That seems like it should work. It will just be time/compute intensive. Good luck!

Because I said I would post it here, we have made the marker genes and “precomputed statistics” files used by on-line MapMyCells available for download. Instructions for finding them are on this page.