MapMyCells Species Misidentification Issue with Unfiltered scRNA-seq Data (rat to Bacillus megaterium)

muchangqing777 · October 29, 2025, 8:22pm

Problem Description

I’m experiencing a puzzling issue with MapMyCells where the same rat brain snRNA-seq dataset produces completely different results depending on quality control (QC) status:

Successful Case (With QC)

QC criteria: nCount_RNA >= 500 & nCount_RNA <= 55000 & nFeature_RNA >= 250 & nFeature_RNA <= 8000 & percent.mt <= 5
Result: Excellent annotation results that closely match manual annotation using established markers
Species correctly identified: Rat → Mouse mapping works perfectly
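
For reference, the QC thresholds listed above can be written as a single predicate. This is a minimal Python sketch, not the code used in the original Seurat pipeline; the function name passes_qc and its argument names are hypothetical, and only the thresholds come from the post.

```python
def passes_qc(n_count_rna: float, n_feature_rna: float, percent_mt: float) -> bool:
    """Return True if a cell satisfies the QC criteria quoted in the post."""
    return (
        500 <= n_count_rna <= 55000      # nCount_RNA bounds
        and 250 <= n_feature_rna <= 8000  # nFeature_RNA bounds
        and percent_mt <= 5               # mitochondrial read percentage cap
    )
```

A cell is kept only when all three conditions hold, so a single out-of-range metric is enough to drop it.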

Failed Case (Without QC)

Same sample, no cell filtering
Same code for h5ad conversion
File uploads successfully but annotation fails with species misidentification

Error Details

Key error messages:

Based on 694 genes, your input data is from species 'Bacillus megaterium phage G:2884420'
Mapping genes from species 'Bacillus megaterium phage G:2884420' to 'Balb/c mouse:10090'
WARNING: None of your genes could be mapped to unique genes aligned to species 'Balb/c mouse:10090' and authority 'ENSEMBL'
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.

Gene ID comparison shows complete mismatch:

Query genes: ['INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1', ...]
Reference genes: ['ENSMUSG00000000028', 'ENSMUSG00000000037', ...]

What I’ve Tried

Gene alignment - No improvement
Same processing pipeline - Only difference is QC filtering
Multiple file validations - Both h5ad files pass validation

Questions for the Community

Why would unfiltered data cause species misidentification from rat to bacteriophage?
Could low-quality cells (dead cells, empty droplets, debris) interfere with MapMyCells’ species detection algorithm?
What specific aspects of low-quality cells might mimic bacteriophage gene expression patterns?
Has anyone encountered similar species misidentification with unfiltered single-cell data?
Are there minimum QC thresholds required for MapMyCells to function correctly?

Technical Context

Sample: Rat brain snRNA-seq (Sham group)
Reference: Balb/c mouse (10090) with ENSEMBL gene IDs
Tool: MapMyCells web interface
Gene format: Gene symbols (work perfectly in QC-filtered data)

The identical data works flawlessly after basic QC but completely fails species identification without filtering. Any insights would be greatly appreciated!

Complete Error Log:
2.80718e+00 seconds == DONE VALIDATING Sham_clean0.h5ad; no changes required
3.03144e+00 seconds == CLEANING UP
3.11604e+00 seconds == Validation run time: 3.0492148399353027
4.36425e-03 seconds == ENV: is_torch_available: False
4.37570e-03 seconds == ENV: is_cuda_available: False
4.37927e-03 seconds == ENV: use_torch: False
4.38309e-03 seconds == ENV: multiprocessing start method: fork
4.38595e-03 seconds == ENV: Python version: 3.10.13 (main, Sep 17 2025, 15:25:21) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]
4.38905e-03 seconds == ENV: anndata version: 0.11.4
4.39167e-03 seconds == ENV: numpy version: 2.2.6
4.63963e-03 seconds == BENCHMARK: spent 2.4033e-04 seconds validating config and copying data
4.70448e-03 seconds == using precomputed_stats_ABC_revision_230821.h5 for precomputed_stats
4.70734e-03 seconds == reading taxonomy_tree from precomputed_stats_ABC_revision_230821.h5
5.20512e-01 seconds == ***Checking to see if we need to map query genes onto reference dataset
2.42698e+01 seconds == Reference data belongs to species Balb/c mouse:10090
2.42716e+01 seconds == Reference genes are from authority ‘ENSEMBL’
2.42995e+01 seconds == Mapping input genes to ‘Balb/c mouse:10090 – ENSEMBL’ using
GitHub - AllenInstitute/mmc_gene_mapper: Gene ID mapper/ortholog finder for MapMyCells version 0.2.1
backed by database file: mmc_gene_mapper.2025-08-04.db
created on: 2025-08-04-18-10-52
hash: md5:755b0724c2ff00cc199f48e2718a09e5
2.56525e+01 seconds == Based on 694 genes, your input data is from species ‘Bacillus megaterium phage G:2884420’
2.56706e+01 seconds == Input genes are from species ‘Bacillus megaterium phage G:2884420’
2.56751e+01 seconds == Mapping 24992 input genes from ‘symbols’ to ‘NCBI’ (e.g. [‘0’ ‘1’ ‘2’ ‘3’ ‘4’])
2.59965e+01 seconds == Mapping genes from species ‘Bacillus megaterium phage G:2884420’ to ‘Balb/c mouse:10090’
2.63190e+01 seconds == Mapping input genes from ‘NCBI’ to ‘ENSEMBL’
2.66836e+01 seconds == WARNING: None of your genes could be mapped to unique genes aligned to species ‘Balb/c mouse:10090’ and authority ‘ENSEMBL’
2.66949e+01 seconds == ***Mapping of query genes to reference dataset complete
2.69166e+01 seconds == an ERROR occurred ====
Traceback (most recent call last):
File cell_type_mapper/cli/from_specified_markers.py, line 164, in run_mapping
output = _run_mapping(
File cell_type_mapper/cli/from_specified_markers.py, line 429, in _run_mapping
create_marker_cache_from_specified_markers(
File cell_type_mapper/type_assignment/marker_cache_v2.py, line 115, in create_marker_cache_from_specified_markers
marker_lookup = validate_marker_lookup(
File cell_type_mapper/type_assignment/marker_cache_v2.py, line 785, in validate_marker_lookup
raise RuntimeError(error_msg)
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
Example of genes in query set:
[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]
Example of marker genes:
[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]

2.69167e+01 seconds == CLEANING UP
e=RuntimeError(“After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.\nExample of genes in query set:\n[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]\nExample of marker genes:\n[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]”), type(e)=<class ‘RuntimeError’>, fname=‘run.py’, lineno=290
Traceback (most recent call last):
File “/apps/run.py”, line 290, in run
runner.run()
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 80, in run
self.run_mapping(write_to_disk=True)
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 164, in run_mapping
output = _run_mapping(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 429, in _run_mapping
create_marker_cache_from_specified_markers(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/type_assignment/marker_cache_v2.py”, line 115, in create_marker_cache_from_specified_markers
marker_lookup = validate_marker_lookup(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/type_assignment/marker_cache_v2.py”, line 785, in validate_marker_lookup
raise RuntimeError(error_msg)
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
Example of genes in query set:
[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]
Example of marker genes:
[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]

Mapping algorithm failed because of application errors.
Unexpected e=RuntimeError(“After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.\nExample of genes in query set:\n[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]\nExample of marker genes:\n[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]”), type(e)=<class ‘RuntimeError’>, fname=‘run.py’, lineno=375

danielsf · October 29, 2025, 8:29pm

Hi,

Do you mind running the failed job again and posting the Run ID (a big alphanumeric string that will appear in the red “your mapping failed” box in the browser)?

That will allow me to download your data and inspect the gene identifiers in the file you submitted. That will give me some insight into why the code doesn’t correctly identify your data as rat data.

muchangqing777 · October 29, 2025, 8:31pm

(post deleted by author)

danielsf · October 29, 2025, 9:23pm

Hi @muchangqing777

I downloaded the data that failed. It looks like, for some reason, the index on your var dataframe (the identifiers of your genes) is just a bunch of integers as strings.

>>> import anndata
>>> src = anndata.read_h5ad(fname, backed='r')
>>> src.var.index.values
array(['0', '1', '2', ..., '24989', '24990', '24991'],
      shape=(24992,), dtype=object)

The gene identifiers are in your var dataframe, but they are not the index. They are in their own column:

>>> src.var
            gene_id
0      LOC103693496
1      LOC102556157
2      LOC102548633
3      LOC103690914
4      LOC102556098
...             ...
24987         Usp9y
24988  LOC120099595
24989  LOC120099581
24990  LOC120099587
24991  LOC120099591

For better or worse, MapMyCells expects the gene identifiers (or valid gene symbols) to be the index of var, not a column. So, I would recommend going back into whatever code created the h5ad file and running something like

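import anndata

# X, obs, and var are the same objects you already pass to AnnData
var = var.set_index('gene_id')  # make the gene identifiers the index of var
ad = anndata.AnnData(X=X, obs=obs, var=var)
ad.write_h5ad('path/to/file.h5ad')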

and then resubmitting.
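
If the h5ad file already exists, the same fix can be applied to it directly. The following is a minimal sketch, assuming the gene identifiers live in a var column named gene_id as in the output above; the helper names promote_gene_column and fix_h5ad and the file paths are hypothetical, not part of MapMyCells or anndata.

```python
import pandas as pd


def promote_gene_column(var: pd.DataFrame, column: str = "gene_id") -> pd.DataFrame:
    """Return a copy of var whose index holds the values of `column`."""
    if column not in var.columns:
        raise KeyError(f"var has no column named {column!r}")
    fixed = var.copy()
    # Cast to plain strings in case the column was stored as a categorical.
    fixed[column] = fixed[column].astype(str)
    # set_index moves the column into the index and drops it from the columns.
    return fixed.set_index(column)


def fix_h5ad(src_path: str, dst_path: str, column: str = "gene_id") -> None:
    """Rewrite an h5ad file so that var is indexed by gene identifier."""
    import anndata  # imported here so promote_gene_column works without anndata

    ad = anndata.read_h5ad(src_path)
    ad.var = promote_gene_column(ad.var, column)
    ad.write_h5ad(dst_path)
```

Calling fix_h5ad('path/to/input.h5ad', 'path/to/fixed.h5ad') would write a copy whose var index holds the gene identifiers. Duplicate identifiers are not handled in this sketch.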

To answer the implicit question “how does MapMyCells determine what species my data is taken from?”: that determination is made exclusively using the index of the var dataframe. It does not depend on the quality of the data at all.
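
A quick way to catch this before uploading is to test whether the var index is just row numbers. The sketch below is a hypothetical pre-upload check, not part of MapMyCells; the function name is an assumption, and it only detects the exact pattern seen in this thread ('0', '1', '2', ... in order).

```python
def index_is_row_numbers(gene_ids) -> bool:
    """Return True if gene_ids is exactly '0', '1', ..., 'n-1' in order."""
    ids = [str(g) for g in gene_ids]
    # A default integer index written to disk shows up as consecutive digit strings.
    return len(ids) > 0 and ids == [str(i) for i in range(len(ids))]
```

Pass it src.var.index.values from the session above; a True result means the index needs to be replaced with real gene identifiers before submitting.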

muchangqing777 · November 4, 2025, 11:26am

Thank you for your solution.

I just tried converting all the gene names to ENSEMBL IDs and attempted the conversion of the Seurat object to an h5ad file using the following code:

ad <- AnnData(
  X = t(as(count_matrix, "CsparseMatrix")),  # Using sparse matrix format
  obs = obs,  # Sample metadata
  var = data.frame(gene_id = gene)  # Gene metadata
)
# Set the output path
output_path <- "Sham_clean1.ENSEMBL.h5ad"
# Write the h5ad file
write_h5ad(ad, output_path, compression = "gzip")

After that, I noticed that the annotation worked perfectly fine. This is quite a peculiar phenomenon.
I wonder if it has something to do with what you mentioned: “MapMyCells expects the gene identifiers (or valid gene symbols) to be the index of var, not a column”?
