I would like to retrieve the gene markers characterizing each class of the reference taxonomy used within MapMyCells.
The AWS S3 Explorer repository provides the cl.df_CCN202307220.xlsx file, which reports the markers characterizing each cluster, supertype and subclass. However, no list of markers is reported for the top level of the hierarchy, i.e., for each class.
Is this information available anywhere?
Thanks and kind regards,
Hi @vincenzo.lagani ,
Are you asking about the marker genes used by the MapMyCells online app? If so, those markers are slightly different than the markers published in that spreadsheet. An additional requirement on the app (not imposed on the markers in that spreadsheet) was that the marker genes used have unambiguous Ensembl IDs.
The marker genes used by the app are actually available in the extended JSON file returned in your results .tar file. The full contents of that JSON file are documented here.
Briefly: the JSON file represents a dict. The marker genes are listed under the key 'marker_genes'. That entry is itself a dict, the keys of which are the parent nodes of the taxonomy, so results['marker_genes']['CCN20230722_CLAS/CCN20230722_CLAS_01'] will give you the list of marker genes used to differentiate between the children of the class CCN20230722_CLAS_01. To find the list of marker genes used to differentiate between the classes, look at results['marker_genes']['None'] ('None' meaning that there is no parent, i.e. we are working at the root of the taxonomy).
Note: If your dataset did not contain all of the marker genes prescribed for MapMyCells, MapMyCells automatically downsampled the set of marker genes to include only those genes that were in your dataset. The dict of marker genes in your results.json file represents this downsampled list, i.e. the markers used to map your dataset, not the marker genes MapMyCells would have used to map a dataset that contained the same genes as our reference dataset.
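For reference, a minimal sketch of reading those lists in Python. It assumes only the layout described above (a top-level 'marker_genes' dict whose values are lists of gene identifiers, keyed by parent node, with 'None' for the root). The function names load_marker_genes and markers_for_parent are made up for this illustration and are not part of MapMyCells.

```python
import json


def load_marker_genes(json_path):
    """Return the 'marker_genes' dict from an extended JSON results file."""
    with open(json_path) as f:
        results = json.load(f)
    return results["marker_genes"]


def markers_for_parent(marker_genes, parent="None"):
    """Return the marker genes used to choose among the children of `parent`.

    The default key 'None' stands for the root of the taxonomy, i.e. the
    markers used to differentiate between the classes. (Assumption: each
    value is a list of gene identifiers, as described in the post above.)
    """
    return marker_genes[parent]
```

With the JSON file extracted from your results .tar file, markers_for_parent(load_marker_genes(path)) would return the root-level list, and passing 'CCN20230722_CLAS/CCN20230722_CLAS_01' as the second argument would return the list for that class.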
Does this answer your question?
Thanks a lot for your answer, very informative. My question was slightly different: in the cl.df_CCN202307220.xlsx file it is easy to identify the marker genes / TFs characterizing a specific subclass. For example, the genes Egln3, Dlx1as, and Galnt18 are reported as the “subclass.markers.combo” for the subclass Sncg Gaba. This information is not available for any of the top classes, for example CTX-CGE GABA or CNU-HYa Glut. Do you know if class-specific lists of markers are available anywhere?
I would like to obtain these lists mainly for confirmation purposes, i.e., to visualize the expression levels of the markers of each class in UMAP / violin plots.
On a different note, the ['marker_genes']['None'] field in my results contains 394 genes. Does this sound correct? Do you know if this is the usual number of genes used for differentiating between the top classes?
Thanks again and kind regards,
I do not know that the data you are asking for is saved anywhere.
At the root level, classes are chosen using 498 marker genes. These markers are chosen by considering all of the markers for discriminating between every possible pair of classes and doing combinatorics to choose something approximating the “smallest reasonable set of markers.” As such, I’m not sure I can say which markers are indicative of which particular classes.
I can say that, given this, the 394 genes listed under ['marker_genes']['None'] in your output package seem reasonable.
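To illustrate why markers selected this way do not map cleanly onto individual classes, here is a toy sketch of the general idea: start from the markers that discriminate each pair of classes and pick a small set that covers every pair. This is not the MapMyCells implementation. A plain greedy set cover is swapped in purely for illustration, and the function name, class names and gene names are all made up.

```python
def greedy_marker_set(pair_markers):
    """Greedily pick genes until every class pair has at least one chosen marker.

    pair_markers maps a pair of class names (a tuple) to the set of genes
    that discriminate between those two classes. Toy illustration only.
    """
    uncovered = {pair for pair, genes in pair_markers.items() if genes}
    chosen = []
    while uncovered:
        # Count how many still-uncovered pairs each candidate gene would cover.
        counts = {}
        for pair in uncovered:
            for gene in pair_markers[pair]:
                counts[gene] = counts.get(gene, 0) + 1
        # Take the gene covering the most pairs; break ties alphabetically.
        best = min(counts, key=lambda g: (-counts[g], g))
        chosen.append(best)
        uncovered = {p for p in uncovered if best not in pair_markers[p]}
    return chosen


# Toy input with made-up class and gene names.
toy = {
    ("A", "B"): {"g1", "g2"},
    ("A", "C"): {"g2", "g3"},
    ("B", "C"): {"g3", "g4"},
}
print(greedy_marker_set(toy))  # prints ['g2', 'g3']
```

In this toy case g2 separates A from both B and C, and g3 separates C from both A and B, so neither gene is a marker "of" a single class; each only helps tell some pairs apart.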
Sorry I cannot be more helpful.
Thanks anyway for your reply, greatly appreciated!