I have been working with some spatial sequencing data and have analyzed it at two different resolutions: based on the UMAP analysis, I get either 16 clusters of cells or, at the higher resolution, 22 clusters. When I run MapMyCells on either version of the data, I get a nicely diverse cell population with the 16 cluster data, but with the 22 cluster data almost every prediction is 01 IT-ET Glut, except for 2 clusters. I am curious where the error could be in my data.
Before suggesting anything more technical, may I ask when these two datasets were run through MapMyCells (I am asking for the specific dates of both runs; I just want to make sure that they didn’t hit different versions during an update, etc.)?
Edit: There’s actually an easier way to answer this question. The first ~5 lines of the CSV file you got from MapMyCells will list the version of MapMyCells that was run. The same information should be in the 'metadata' field in the JSON file.
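A minimal sketch of that check, in case it is useful. It assumes the header lines at the top of the CSV start with `#` (an assumption on my part; adjust the prefix if yours differ) and that the JSON has a top-level 'metadata' entry as described above. The function names are just for illustration.

```python
import json


def csv_header_lines(csv_path, max_lines=10):
    """Return the leading header lines of a MapMyCells CSV output.

    Assumption: header lines start with '#'. Reading stops at the first
    line that does not (normally the row of column names).
    """
    header = []
    with open(csv_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("#") or len(header) >= max_lines:
                break
            header.append(line.lstrip("#").strip())
    return header


def json_metadata(json_path):
    """Return the 'metadata' entry of a MapMyCells JSON output (None if absent)."""
    with open(json_path, "r", encoding="utf-8") as handle:
        return json.load(handle).get("metadata")
```

Calling `csv_header_lines(...)` and `json_metadata(...)` on the output files from both runs and comparing what they return should show quickly whether the two runs used the same version.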
The mapping from the 22 clusters is from this run:
metadata = counts_10xWholeMouseBrain(CCN20230722)_HierarchicalMapping_UTC_1744231865796.json
taxonomy hierarchy = [“CCN20230722_CLAS”
readable taxonomy hierarchy = [“class”
algorithm: ‘hierarchical’; codebase: GitHub - AllenInstitute/cell_type_mapper: Repository for storing prototype functionality implementations for the BKP; version: 1.5.1
The mapping from the 16 clusters is from this run:
metadata = counts_10xWholeMouseBrain(CCN20230722)_HierarchicalMapping_UTC_1744130599119.json
taxonomy hierarchy = [“CCN20230722_CLAS”
readable taxonomy hierarchy = [“class”
algorithm: ‘hierarchical’; codebase: GitHub - AllenInstitute/cell_type_mapper: Repository for storing prototype functionality implementations for the BKP; version: 1.5.1
cell_id
Let me know if you need any additional info!
Okay. So they both ran on the same version of the code (1.5.1). That’s good.
At this point, without seeing your data, I can only suggest things you might want to examine to try and diagnose why the mappings are different.
- You can look at the distributions of the quality metrics in the two runs. For each cell at each level of the taxonomy, you will find in the JSON output an avg_correlation, which is the correlation coefficient between the cell and the cell type to which it was assigned, and an aggregate_probability, which is a measure of the confidence that MapMyCells has in that cell type assignment (you can see this Jupyter notebook for a walk through of the contents of the JSON output file). I would look at the distribution of those two statistics in your two mapping results to see if either or both of them is much lower in the 22 cluster mapping. That wouldn’t tell you why, but it would tell you if MapMyCells is just less confident in the mappings it performed on that data than in the mappings it performed on the 16 cluster data.
- Full disclosure, I did my PhD in physics and don’t entirely know what you mean when you say the two datasets were run at different resolutions. Is this a question of the gene panels used in the two datasets, or how the spot images (?) were processed? If it is a difference in the gene panels, you can look at the marker_genes entry in the JSON output file. This will tell you which marker genes were actually used for your mapping run. Maybe the 22 cluster data had fewer marker genes in it for some reason.
- Absent any of that, it would be interesting to look at the differences between the two input datasets. Do the distributions in
  - counts per cell
  - non-zero genes per cell
  - variance in counts across genes per cell

  differ significantly between the two datasets? Maybe there is just a weaker signal in the 22 cluster dataset.
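Here is a rough sketch of the first two checks in the list above (quality metrics and marker gene counts), using only the standard library. It assumes the JSON has a 'results' list with one entry per cell, and that each entry holds the statistics for a taxonomy level under that level's key (for example `cell["CCN20230722_CLAS"]["avg_correlation"]`). Please check that layout against your own file; the function names are hypothetical.

```python
import json
import statistics


def load_mapping(json_path):
    """Load a MapMyCells JSON output file into a dict."""
    with open(json_path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def metric_summary(mapping, level, metric):
    """Summarize one quality metric at one taxonomy level across all cells.

    Assumption: mapping["results"] is a list of per-cell dicts and
    cell[level][metric] is a number.
    """
    values = sorted(cell[level][metric] for cell in mapping["results"])
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"n": len(values), "min": values[0], "q1": q1,
            "median": median, "q3": q3, "max": values[-1]}


def n_class_markers(mapping):
    """Number of marker genes used to choose the class (the 'None' entry)."""
    return len(mapping["marker_genes"]["None"])


def compare_runs(mapping_a, mapping_b, level,
                 metrics=("avg_correlation", "aggregate_probability")):
    """Return side-by-side summaries for two mapping runs."""
    report = {
        metric: (metric_summary(mapping_a, level, metric),
                 metric_summary(mapping_b, level, metric))
        for metric in metrics
    }
    report["n_class_markers"] = (n_class_markers(mapping_a),
                                 n_class_markers(mapping_b))
    return report
```

Something like `compare_runs(load_mapping(path_16), load_mapping(path_22), "CCN20230722_CLAS")` would then give the two summaries next to each other. If the medians for the 22 cluster run sit well below those for the 16 cluster run, that points to lower confidence rather than a different code path.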
I’m going to make this a separate post because it is a more concrete idea that deserves its own list of bullet points.
- Get the marker genes used by MapMyCells from the JSON output. This will be a dict mapping from cell types to lists of marker genes. The entry under 'None' is the list of marker genes used to select the class for any given cell (you can see the full meaning of the marker gene table on this page).
- For each of the 16 UMAP clusters in your first dataset and the 22 clusters in your second dataset, plot the average gene expression profile of the cells in each cluster in the space of the marker genes listed under marker_genes['None']. Maybe, in the 22 cluster data, each of the 22 clusters really does look similar in this marker gene space (in a way the clusters from the 16 cluster data don’t). This, again, would not tell you why the clusters look similar, but it would explain why MapMyCells assigned all of the cells to the same class.
Note: In the event that MapMyCells had to transform between your gene identifiers and ENSEMBL IDs, you will find the mapping that MapMyCells used under 'gene_identifier_mapping' in the JSON output file as discussed in cells [21] and [22] of this Jupyter notebook.
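To make the cluster comparison concrete, here is a small sketch of one way to do it. Instead of plotting, it computes the average profile of each cluster over the class-level marker genes and then the pairwise Pearson correlation between those profiles; a heatmap of the same profiles would serve the same purpose. It is written in plain Python on a list-of-rows matrix so that it is self-contained (for real data you would probably use numpy or anndata), and it assumes the gene identifiers in your matrix are already in the same namespace as the marker gene list. All names here are hypothetical.

```python
import math
from collections import defaultdict


def cluster_profiles(expression, gene_names, cluster_labels, marker_genes):
    """Average expression of each cluster over the marker genes.

    expression:     list of rows, one per cell, one value per gene
    gene_names:     column names of `expression`
    cluster_labels: one cluster label per cell
    marker_genes:   e.g. marker_genes['None'] from the JSON output
    Marker genes that are missing from gene_names are skipped.
    """
    column = {gene: i for i, gene in enumerate(gene_names)}
    used = [gene for gene in marker_genes if gene in column]
    sums = defaultdict(lambda: [0.0] * len(used))
    counts = defaultdict(int)
    for row, label in zip(expression, cluster_labels):
        counts[label] += 1
        total = sums[label]
        for k, gene in enumerate(used):
            total[k] += row[column[gene]]
    profiles = {label: [s / counts[label] for s in sums[label]]
                for label in sums}
    return used, profiles


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def profile_similarity(profiles):
    """Pairwise correlation between cluster profiles, keyed by (label_a, label_b)."""
    labels = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b])
            for i, a in enumerate(labels) for b in labels[i + 1:]}
```

If nearly every pair of clusters in the 22 cluster data comes out with a correlation close to 1 while the 16 cluster data shows a wider spread, that would be consistent with the clusters looking alike in marker gene space.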
I hope this helps.