When using MapMyCells with a snRNA-seq dataset produced from mouse brain samples, I was able to successfully map my data. However, when looking at the results, the vast majority (though not all) of the cells are being mapped to IT-ET Glut neurons. I am wondering if there are potential issues with conversion between technologies to be aware of? Or has anyone seen a similar result in analyzing their data? Thanks!
As a first pass, this post lists some things you might look at to assess what is happening. The fact that all of your cells are mapping to the alphabetically first class in the taxonomy is a little suspicious to me. Maybe your dataset doesn’t have any of the marker genes we are looking at. Or maybe the expression in the marker genes is so low that the correlation coefficients between cells and cell types are effectively random.
If you feel comfortable, I could take a peek at your data and see if there is anything peculiar about it. I would just be looking at
a) what genes are present
b) what the distribution of numerical values in the cell-by-gene matrix is
If you shared with me the name of your output file package (the .zip file you downloaded from MapMyCells), I might be able to track it down in the MapMyCells pipeline.
Again: only if you’re comfortable with that
Hello,
Thank you for your response! I am currently using a publicly available dataset from Sziraki et al., Nature Genetics 2023 (available on GEO) and comparing against a dataset I produced using the same protocol (EasySci, a plate-based method for single-cell RNA sequencing). I get the same results (IT-ET mapping only) from both, so I am wondering if there might be an issue converting between technologies?
I believe that the markers for cell types are fairly robust, since we are able to label groups of cells. For example, attached is the expression of Slc17a7, Gad1, Plp1, and Gja1.
The name of my output file was counts1_correct_formatcsv-2025-04-30-19-10-32_10xWholeMouseBrain(CCN20230722)_CorrelationMapping_UTC_1746040207790
I will also say that, taking a look at the correlation mapping JSON file, I see aggregate_probability values typically at 1 for each of the clusters and average correlations between 0.45 and 0.7.
To start: your run is unfortunately so old that your data file has already been deleted from our cloud storage, so I cannot just look at the file you submitted for mapping.
That being said: the maximum values in the heat maps above appear to be very low. Did you submit a file with raw counts data or log normalized counts? One tell would be: are the values in your count matrix all integers (or floats that end in .0)? Raw counts should be integers.
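A quick way to check this from Python would be something like the following (a rough sketch; the file name is just a placeholder for whatever you submitted):

import numpy as np
import pandas as pd

# load the cell-by-gene matrix; first column assumed to be cell labels
df = pd.read_csv("counts.csv.gz", index_col=0)
values = df.to_numpy()

# raw counts should be whole numbers; log normalized values usually are not
print("all integer-valued:", bool(np.all(values == np.round(values))))
print("max value:", values.max())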
If you submitted log normalized counts, that could be your problem. The MapMyCells on-line tool expects data in raw counts form. It then automatically performs the following normalization:
a) normalize each individual cell so that counts are in counts per million (CPM)
b) take log2(CPM+1) of the normalized counts from (a)
If your data was already log normalized, you will be taking the log of a log, which will wash out a lot of signal.
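For reference, the normalization amounts to something like this sketch (plain numpy, with a made-up example matrix):

import numpy as np

# hypothetical cells-by-genes matrix of raw integer counts
raw_counts = np.array([[10.0, 0.0, 5.0],
                       [3.0, 7.0, 0.0]])

# (a) rescale each cell so its counts sum to one million (CPM)
cpm = raw_counts / raw_counts.sum(axis=1, keepdims=True) * 1.0e6

# (b) take log2(CPM + 1)
log2_cpm = np.log2(cpm + 1.0)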
If you do not have raw counts, you can map log normalized data by running the backend code on your own local machine. The readme page in this GitHub repository will link you to Jupyter notebooks showing how to do that.
To tell the mapper that your data is log normalized, just change the configuration dict (see cell [15] of this notebook) so that it looks like
{
    ...
    "type_assignment": {
        ...
        "normalization": "log2CPM",
        ...
    },
    ...
}
If you want to map to the taxonomies supported by the MapMyCells on-line tool, you will need to download the data files linked to on this page. See also section 8 of this notebook.
Let me know if this doesn’t solve your problem (or if you know your data is raw counts).
Hello,
I am taking my counts from a Seurat object by selecting the counts layer only (not the normalized data from log normalization or SCTransform), so all of my counts are raw and in integer form. Is it possible that at this sequencing depth my mapping would not work? Please let me know if I can resubmit the mapping query to get feedback.
Sure. Resubmit your job and post here when you do. I should be able to find your job based on its start time.
(also, tell me the name of the file you submitted, just in case)
I just resubmitted it about 5 minutes ago and it looks like the mapping was successful. The name of the file uploaded is counts1_correct_format.csv.gz
Unfortunately, I did not receive an automated email telling me your job had started and I cannot find your job in our system, so I am unable to download your data for inspection. Sorry.
I’m not sure what else to say. The expression values in the heatmaps you posted look low to me, though I can’t say for certain without seeing the full distribution of expression values (i.e. what fraction are non-zero; what fraction are above 10; what fraction are above 20…).
I don’t have a good explanation why that should be the case, but I can imagine it would pose a problem for the mapping if the expression profiles for individual cells were too flat.
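In case it helps, here is roughly the kind of summary I mean (a sketch; the file name is a placeholder for your counts file):

import numpy as np
import pandas as pd

# load the full cell-by-gene matrix of raw counts
values = pd.read_csv("counts.csv.gz", index_col=0).to_numpy()

# report what fraction of all entries exceed each threshold
total = values.size
for threshold in (0, 10, 20):
    frac = (values > threshold).sum() / total
    print(f"fraction above {threshold}: {frac:.4f}")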
Ok, I appreciate your help. Is there a way I might be able to send you the counts file directly so that you could take a look at the mapping on your end? Or I could resubmit at a particular timestamp if that would be helpful. The UMAPs I posted represent the normalized data, but I am happy to plot a particular range if that would be more helpful.
(sorry for the late response; it was wall-to-wall conferences over here last week)
For better or worse, our system isn’t designed to make it easy to track down the data for mapping runs that nominally succeeded, so I don’t know whether your “submit the mapping at a specific timestamp” idea will help me.
I guess we could get cheeky and have you submit your data and try to map it to the wrong species (i.e. if your data is human, try mapping it to the mouse taxonomy, or vice versa). That will fail, and it will print the alphanumeric RunID to your browser. If I have that RunID, it is very easy to track down and download your data.
I will say that in the course of the conferences I was attending last week, someone pointed out that they were having trouble with running MapMyCells on stereo-seq data and suspected it was because their data had endemically low counts per cell. If that is what you see in your data, then maybe we should not be surprised that you are having this problem.
I would be interested in trying to figure out how/if we can get MapMyCells to give useful results for data of this type.
Maybe go ahead and submit a job that we know will fail (mapping your data to the wrong species) and give me the RunID that gets printed to your screen in the red “your job has failed” dialog box.
No worries, thank you for the help! I just resubmitted my job to the wrong species (I have mouse data but submitted mine to the human brain atlas). The run failed and I got the following error message:
Mapping Failed
Use log files for troubleshooting MapMyCells issues. Post them in the community forums for further assistance.
Mapping algorithm failed because of application errors.
Please confirm that your input data is in cell (rows) by gene (columns) format.
Run ID: 1747683744472-858f04c2-4423-4a07-992a-f69524622551
Post in the community forum for help
Also, I wanted to post some of my metrics for nCount and nFeature here if that would be helpful to get a sense of the depth of our data. We are working with nuclei, but in general I don’t think the counts are that low.
I have successfully downloaded your CSV file.
I will play with it for a bit and come back with my thoughts.
Thank you for your patience.
Can you check that the data in the CSV you submitted and the data that you plotted earlier in this thread are the same?
The CSV I downloaded has 25542 cells in it, 25042 of which have zero gene expression in all genes:
>>> src = anndata.read_csv('../archive/counts1_correct_format.csv.gz')
/Users/scott.daniel/miniconda3/envs/cell_type_mapper/lib/python3.12/site-packages/anndata/utils.py:434: FutureWarning: Importing read_csv from `anndata` is deprecated. Import anndata.io.read_csv instead.
warnings.warn(msg, FutureWarning)
>>> sum_over_rows = src.X.sum(axis=1)
>>> sum_over_rows.shape
(25542,)
>>> src.X.shape
(25542, 41222)
>>> (sum_over_rows==0).sum()
np.int64(25042)
>>>
That would definitely produce the behavior you are seeing (every cell getting assigned to the alphabetically first cell type class because it has zero correlation with any cell type).
Just to make sure we are talking about the same file, the md5 checksum of the file I downloaded is f75057afbdff003c9f230f6c0aaf8cf8
You can get this on a Linux machine by running
md5sum path/to/file.csv.gz
and on a Mac with
md5 path/to/file.csv.gz
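or, if it’s easier, from Python with just the standard library (a minimal sketch):

import hashlib

# compute the md5 checksum of the file, reading in 1 MB chunks
md5 = hashlib.md5()
with open("path/to/file.csv.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())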
Hmm, that is weird; that is the same file. I think the issue I am running into is with the compression tool that I was using. When I upload the regular CSV without zipping it, the mapping looks a lot better. I will keep that in mind for next time. Thank you so much for the help!
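For anyone who hits this later: producing the .csv.gz with Python’s standard gzip module should sidestep whatever the problematic compression tool was doing (a minimal sketch; file names are placeholders):

import gzip
import shutil

# compress the CSV with the standard library's gzip
with open("counts1_correct_format.csv", "rb") as src, \
        gzip.open("counts1_correct_format.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)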