Hi @roanvanscheppingen,
Thanks for your interest in MapMyCells. Let me address your questions one at a time.
Can I limit mapmycells to only map cells of a certain subtype?
Unfortunately, no. That functionality is not available in MapMyCells at this time. You might try looking at the runner_up assignments made by MapMyCells. These are the cell type assignments that were almost but not quite chosen during hierarchical mapping. Those might be more believable in cases where you know the actual cell type assignment is wrong based on prior knowledge. The runner_up mappings are available in the JSON output file. We recently added a Jupyter notebook explaining how to access and explore that data here.
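As a rough illustration of how the runner-up assignments could stand in for a subtype filter, here is a minimal sketch. It assumes each cell's result is a dictionary keyed by taxonomy level, with `assignment`, `bootstrapping_probability`, `runner_up_assignment`, and `runner_up_probability` fields. Those key names, the `best_allowed` helper, and the example cell are assumptions made for illustration, so please check them against the notebook and your own JSON output before relying on this.

```python
def best_allowed(cell_result, level, allowed):
    """Return the highest-ranked (type, probability) pair at `level`
    whose type is in `allowed`, or None if no candidate qualifies.

    Key names are assumed, not taken from the MapMyCells documentation.
    """
    entry = cell_result[level]
    # Chosen assignment first, then the runner-ups in the order given.
    candidates = [(entry["assignment"], entry["bootstrapping_probability"])]
    candidates += list(
        zip(
            entry.get("runner_up_assignment", []),
            entry.get("runner_up_probability", []),
        )
    )
    for cell_type, probability in candidates:
        if cell_type in allowed:
            return cell_type, probability
    return None


# Made-up example cell; the type names are placeholders.
example_cell = {
    "cell_id": "cell_0",
    "subclass": {
        "assignment": "type_A",
        "bootstrapping_probability": 0.55,
        "runner_up_assignment": ["type_B", "type_C"],
        "runner_up_probability": [0.30, 0.15],
    },
}

choice = best_allowed(example_cell, "subclass", allowed={"type_B", "type_C"})
```

Here the chosen assignment (`type_A`) is outside the allowed set, so the sketch falls back to the first runner-up that is allowed. A cell with no allowed candidate returns `None`, which you would probably want to treat as unmapped rather than force into a type.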
Should I convert to ENSEMBL IDs nevertheless?
That depends. MapMyCells will do its best to assign your genes to Ensembl IDs. The exact mapping performed by MapMyCells on your data is also available in the JSON output. The notebook I linked to above shows how to access that data, as well.
In my opinion, if you can, you should always assign Ensembl IDs yourself so that there is no ambiguity due to, for instance, different versions of the reference genome being used by your data and MapMyCells.
How well does Mapmycells handle sparse datasets? We are talking mean of 80 features and 120 transcripts per cell…
I don’t have a ready-to-hand answer to this question. We have been experimenting with running MapMyCells on a test 10X dataset downsampled to have N genes (where N < the full 32,000 gene panel in the reference dataset). With only 1000 genes, it does pretty poorly. However, these tests were performed with 1000 random genes. When, for instance, we ran it with the data downsampled to the 500 genes in the MERFISH panel published in Yao et al. 2023, the performance was nearly equivalent to performance with the full 32,000 gene panel. The key is the overlap between the genes in your gene panel and the marker genes used by MapMyCells. We are working on putting our list of marker genes in a publicly available space. I will post here when it is available. One thing you can access, though (also in the JSON data returned by MapMyCells), is the subset of marker genes used by MapMyCells on your specific data (see the “Marker genes” subsection of the notebook I linked to above). If there are more than a few hundred marker genes at each level in the taxonomy, I would expect that the performance is pretty good, but that is just an off-the-cuff estimate.
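To make the overlap idea concrete, here is a small sketch of how you might count, per taxonomy level, how many marker genes are present in your gene panel. The `marker_overlap` function, the `markers_by_level` layout (level name mapped to a list of gene identifiers), and the gene names are all hypothetical. The full marker lists are not yet published, so this only becomes useful once you have them, or once you have reshaped the per-dataset marker genes from your JSON output into this form.

```python
def marker_overlap(panel_genes, markers_by_level):
    """Summarize how many marker genes at each level are in the panel.

    `markers_by_level` is a hypothetical {level: [gene_id, ...]} mapping.
    """
    panel = set(panel_genes)
    summary = {}
    for level, markers in markers_by_level.items():
        marker_set = set(markers)
        shared = panel & marker_set
        summary[level] = {
            "n_markers": len(marker_set),
            "n_shared": len(shared),
            # Guard against an empty marker list at a level.
            "fraction": len(shared) / len(marker_set) if marker_set else 0.0,
        }
    return summary


# Placeholder identifiers, not real genes.
panel_genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
markers_by_level = {
    "class": ["gene_1", "gene_2", "gene_9"],
    "subclass": ["gene_3", "gene_7", "gene_8", "gene_9"],
}

overlap = marker_overlap(panel_genes, markers_by_level)
for level, stats in overlap.items():
    print(level, stats["n_shared"], "of", stats["n_markers"], "markers in panel")
```

Levels where `n_shared` is only a handful of genes are the ones where I would be most skeptical of the mapping. Make sure both inputs use the same identifier scheme (for example, Ensembl IDs on both sides), or the overlap will be undercounted.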
Are the .csv files [of marker genes] available or a way to generate these?
As I said, we are working on making these available. I will let you know when they are.
Also, is there a file that puts the nomenclature to understandable sentences
One of my colleagues (@jeremyinseattle) pointed me to this table decoding the taxons for the Whole Mouse Brain taxonomy. He says there is an equivalent for the human taxonomy, if that is what you need.
Let me know if anything is unclear, or any new questions arise.