When using MapMyCells with a snRNA-seq dataset produced from mouse brain samples, I was able to successfully map my data. However, when looking at the results, the vast majority (though not all) of the cells are being mapped to IT-ET Glut neurons. I am wondering if there are potential issues with conversion between technologies to be aware of? Or has anyone seen a similar result in analyzing their data? Thanks!
As a first pass, this post lists some things you might look at to assess what is happening. The fact that all of your cells are mapping to the alphabetically first class in the taxonomy is a little suspicious to me. Maybe your dataset doesn’t have any of the marker genes we are looking at. Or maybe the expression in the marker genes is so low that the correlation coefficients between cells and cell types are effectively random.
If you feel comfortable, I could take a peek at your data and see if there is anything peculiar about it. I would just be looking at
a) what genes are present
b) what the distribution of numerical values in the cell-by-gene matrix is
If you shared with me the name of your output file package (the .zip file you downloaded from MapMyCells), I might be able to track it down in the MapMyCells pipeline.
Again: only if you’re comfortable with that
Hello,
Thank you for your response! I am currently using a publicly available dataset from Sziraki et al., Nature Genetics 2023 (available on GEO) and comparing against a dataset I produced using the same protocol (EasySci, a plate-based method for single-cell RNA sequencing). I get the same results (IT-ET mapping only) from both, so I am wondering if there might be an issue converting between technologies?
I believe that the markers for cell types are fairly robust, since we are able to label groups of cells. For example, attached is the expression of Slc17a7, Gad1, Plp1, and Gja1.
The name of my output file was counts1_correct_formatcsv-2025-04-30-19-10-32_10xWholeMouseBrain(CCN20230722)_CorrelationMapping_UTC_1746040207790
I will also say that, taking a look at the correlation mapping JSON file, I see aggregate_probability values typically at 1 for each of the clusters and average correlations between 0.45 and 0.7.
To start: your run is unfortunately so old that your data file has already been deleted from our cloud storage, so I cannot just look at the file you submitted for mapping.
That being said: the maximum values in the heat maps above appear to be very low. Did you submit a file with raw counts data or log normalized counts? One tell would be: are the values in your count matrix all integers (or floats that end in .0)? Raw counts should be integers.
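A quick way to check this from Python would be something like the following (a rough sketch; the file name is just a placeholder for whatever you submitted):

import numpy as np
import pandas as pd

# load the cell-by-gene matrix; first column assumed to be cell labels
df = pd.read_csv("counts.csv.gz", index_col=0)
values = df.to_numpy()

# raw counts should be whole numbers; log normalized values usually are not
print("all integer-valued:", bool(np.all(values == np.round(values))))
print("max value:", values.max())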
If you submitted log normalized counts, that could be your problem. The MapMyCells on-line tool expects data in raw counts form. It then automatically performs the following normalization:
a) normalize each individual cell so that counts are in counts per million (CPM)
b) take log2(CPM+1) of the normalized counts from (a)
If your data was already log normalized, you will be taking the log of a log, which will wash out a lot of signal.
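For reference, the normalization amounts to something like this sketch (plain numpy, with a made-up example matrix):

import numpy as np

# hypothetical cells-by-genes matrix of raw integer counts
raw_counts = np.array([[10.0, 0.0, 5.0],
                       [3.0, 7.0, 0.0]])

# (a) rescale each cell so its counts sum to one million (CPM)
cpm = raw_counts / raw_counts.sum(axis=1, keepdims=True) * 1.0e6

# (b) take log2(CPM + 1)
log2_cpm = np.log2(cpm + 1.0)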
If you do not have raw counts, you can map log normalized data by running the backend code on your own local machine. The readme page in this GitHub repository will link you to Jupyter notebooks showing how to do that.
To tell the mapper that your data is log normalized, just change the configuration dict (see cell [15] of this notebook) so that it looks like
{
    ...
    "type_assignment": {
        ...
        "normalization": "log2CPM",
        ...
    },
    ...
}
If you want to map to the taxonomies supported by the MapMyCells on-line tool, you will need to download the data files linked to on this page. See also section 8 of this notebook.
Let me know if this doesn’t solve your problem (or if you know your data is raw counts).
Hello,
I am taking my counts from a Seurat object by selecting the counts layer only (not the normalized data from log normalization or SCTransform), so all of my counts are raw and in integer form. Is it possible that at this sequencing depth my mapping would not work? Please let me know if I can resubmit the mapping query to get feedback.
Sure. Resubmit your job and post here when you do. I should be able to find your job based on its start time.
(also, tell me the name of the file you submitted, just in case)
I just resubmitted it about 5 minutes ago and it looks like the mapping was successful. The name of the file uploaded is counts1_correct_format.csv.gz
Unfortunately, I did not receive an automated email telling me your job had started and I cannot find your job in our system, so I am unable to download your data for inspection. Sorry.
I’m not sure what else to say. The expression values in the heatmaps you posted look low to me, though I can’t say for certain without seeing the full distribution of expression values (i.e. what fraction are non-zero; what fraction are above 10; what fraction are above 20…).
I don’t have a good explanation why that should be the case, but I can imagine it would pose a problem for the mapping if the expression profiles for individual cells were too flat.
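In case it helps, here is roughly the kind of summary I mean (a sketch; the file name is a placeholder for your counts file):

import numpy as np
import pandas as pd

# load the full cell-by-gene matrix of raw counts
values = pd.read_csv("counts.csv.gz", index_col=0).to_numpy()

# report what fraction of all entries exceed each threshold
total = values.size
for threshold in (0, 10, 20):
    frac = (values > threshold).sum() / total
    print(f"fraction above {threshold}: {frac:.4f}")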
Ok, I appreciate your help. Is there a way I might be able to send you the counts file directly so that you could take a look at the mapping on your end? Or I could resubmit at a particular timestamp if that would be helpful. The UMAPs I posted represent the normalized data, but I am happy to plot a particular range if that would be more helpful.
(sorry for the late response; it was wall-to-wall conferences over here last week)
For better or worse, our system isn’t designed to make it easy to track down the data for mapping runs that nominally succeeded, so I don’t know whether your “submit the mapping at a specific timestamp” idea will help me.
I guess we could get cheeky and have you submit your data and try to map it to the wrong species (i.e. if your data is human, try mapping it to the mouse taxonomy, or vice versa). That will fail, and it will print the alphanumeric RunID to your browser. If I have that RunID, it is very easy to track down and download your data.
I will say that in the course of the conferences I was attending last week, someone pointed out that they were having trouble with running MapMyCells on stereo-seq data and suspected it was because their data had endemically low counts per cell. If that is what you see in your data, then maybe we should not be surprised that you are having this problem.
I would be interested in trying to figure out how/if we can get MapMyCells to give useful results for data of this type.
Maybe go ahead and submit a job that we know will fail (mapping your data to the wrong species) and give me the RunID that gets printed to your screen in the red “your job has failed” dialog box.
No worries, thank you for the help! I just resubmitted my job to the wrong species (I have mouse data but submitted mine to the human brain atlas). The run failed and I got the following error message:
Mapping Failed
Use log files for troubleshooting MapMyCells issues. Post them in the community forums for further assistance.
Mapping algorithm failed because of application errors.
Please confirm that your input data is in cell (rows) by gene (columns) format.
Run ID: 1747683744472-858f04c2-4423-4a07-992a-f69524622551
Post in the community forum for help
Also, I wanted to post some of my metrics for nCount and nFeature here if that would be helpful to get a sense of the depth of our data. We are working with nuclei, but in general I don’t think the counts are that low.
I have successfully downloaded your CSV file.
I will play with it for a bit and come back with my thoughts.
Thank you for your patience.
Can you check that the data in the CSV you submitted and the data that you plotted earlier in this thread are the same?
The CSV I downloaded has 25542 cells in it, 25042 of which have zero gene expression in all genes:
>>> src = anndata.read_csv('../archive/counts1_correct_format.csv.gz')
/Users/scott.daniel/miniconda3/envs/cell_type_mapper/lib/python3.12/site-packages/anndata/utils.py:434: FutureWarning: Importing read_csv from `anndata` is deprecated. Import anndata.io.read_csv instead.
warnings.warn(msg, FutureWarning)
>>> sum_over_rows = src.X.sum(axis=1)
>>> sum_over_rows.shape
(25542,)
>>> src.X.shape
(25542, 41222)
>>> (sum_over_rows==0).sum()
np.int64(25042)
>>>
That would definitely produce the behavior you are seeing (every cell getting assigned to the alphabetically first cell type class because it has zero correlation with any cell type).
Just to make sure we are talking about the same file, the md5 checksum of the file I downloaded is f75057afbdff003c9f230f6c0aaf8cf8
You can get this on a Linux machine by running
md5sum path/to/file.csv.gz
and on a Mac with
md5 path/to/file.csv.gz
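or, if it’s easier, from Python with just the standard library (a minimal sketch):

import hashlib

# compute the md5 checksum of the file, reading in 1 MB chunks
md5 = hashlib.md5()
with open("path/to/file.csv.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())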
Hmm, that is weird; that is the same file. I think the issue I am running into is with the compression tool that I was using. When I upload the regular CSV without zipping it, the mapping looks a lot better. I will keep that in mind for next time. Thank you so much for the help!
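For anyone who hits this later: producing the .csv.gz with Python’s standard gzip module should sidestep whatever the problematic compression tool was doing (a minimal sketch; file names are placeholders):

import gzip
import shutil

# compress the CSV with the standard library's gzip
with open("counts1_correct_format.csv", "rb") as src, \
        gzip.open("counts1_correct_format.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)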