MapMyCells Species Misidentification Issue with Unfiltered scRNA-seq Data (rat to Bacillus megaterium)

Problem Description

I’m experiencing a puzzling issue with MapMyCells where the same rat brain snRNA-seq dataset produces completely different results depending on quality control (QC) status:

Successful Case (With QC)

  • QC criteria: nCount_RNA >= 500 & nCount_RNA <= 55000 & nFeature_RNA >= 250 & nFeature_RNA <= 8000 & percent.mt <= 5

  • Result: Excellent annotation results that closely match manual annotation using established markers

  • Species correctly identified: Rat → Mouse mapping works perfectly

Failed Case (Without QC)

  • Same sample, no cell filtering

  • Same code for h5ad conversion

  • File uploads successfully but annotation fails with species misidentification

Error Details

Key error messages:

Based on 694 genes, your input data is from species 'Bacillus megaterium phage G:2884420'
Mapping genes from species 'Bacillus megaterium phage G:2884420' to 'Balb/c mouse:10090'
WARNING: None of your genes could be mapped to unique genes aligned to species 'Balb/c mouse:10090' and authority 'ENSEMBL'
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.

Gene ID comparison shows complete mismatch:

  • Query genes: ['INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1', ...]

  • Reference genes: ['ENSMUSG00000000028', 'ENSMUSG00000000037', ...]

What I’ve Tried

  1. Gene alignment - No improvement

  2. Same processing pipeline - Only difference is QC filtering

  3. Multiple file validations - Both h5ad files pass validation

Questions for the Community

  1. Why would unfiltered data cause species misidentification from rat to bacteriophage?

  2. Could low-quality cells (dead cells, empty droplets, debris) interfere with MapMyCells’ species detection algorithm?

  3. What specific aspects of low-quality cells might mimic bacteriophage gene expression patterns?

  4. Has anyone encountered similar species misidentification with unfiltered single-cell data?

  5. Are there minimum QC thresholds required for MapMyCells to function correctly?

Technical Context

  • Sample: Rat brain snRNA-seq (Sham group)

  • Reference: Balb/c mouse (10090) with ENSEMBL gene IDs

  • Tool: MapMyCells web interface

  • Gene format: Gene symbols (work perfectly in QC-filtered data)

The identical data works flawlessly after basic QC but completely fails species identification without filtering. Any insights would be greatly appreciated!

Complete Error Log:
2.80718e+00 seconds == DONE VALIDATING Sham_clean0.h5ad; no changes required
3.03144e+00 seconds == CLEANING UP
3.11604e+00 seconds == Validation run time: 3.0492148399353027
4.36425e-03 seconds == ENV: is_torch_available: False
4.37570e-03 seconds == ENV: is_cuda_available: False
4.37927e-03 seconds == ENV: use_torch: False
4.38309e-03 seconds == ENV: multiprocessing start method: fork
4.38595e-03 seconds == ENV: Python version: 3.10.13 (main, Sep 17 2025, 15:25:21) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]
4.38905e-03 seconds == ENV: anndata version: 0.11.4
4.39167e-03 seconds == ENV: numpy version: 2.2.6
4.63963e-03 seconds == BENCHMARK: spent 2.4033e-04 seconds validating config and copying data
4.70448e-03 seconds == using precomputed_stats_ABC_revision_230821.h5 for precomputed_stats
4.70734e-03 seconds == reading taxonomy_tree from precomputed_stats_ABC_revision_230821.h5
5.20512e-01 seconds == ***Checking to see if we need to map query genes onto reference dataset
2.42698e+01 seconds == Reference data belongs to species Balb/c mouse:10090
2.42716e+01 seconds == Reference genes are from authority ‘ENSEMBL’
2.42995e+01 seconds == Mapping input genes to ‘Balb/c mouse:10090 – ENSEMBL’ using
GitHub - AllenInstitute/mmc_gene_mapper: Gene ID mapper/ortholog finder for MapMyCells version 0.2.1
backed by database file: mmc_gene_mapper.2025-08-04.db
created on: 2025-08-04-18-10-52
hash: md5:755b0724c2ff00cc199f48e2718a09e5
2.56525e+01 seconds == Based on 694 genes, your input data is from species ‘Bacillus megaterium phage G:2884420’
2.56706e+01 seconds == Input genes are from species ‘Bacillus megaterium phage G:2884420’
2.56751e+01 seconds == Mapping 24992 input genes from ‘symbols’ to ‘NCBI’ (e.g. [‘0’ ‘1’ ‘2’ ‘3’ ‘4’])
2.59965e+01 seconds == Mapping genes from species ‘Bacillus megaterium phage G:2884420’ to ‘Balb/c mouse:10090’
2.63190e+01 seconds == Mapping input genes from ‘NCBI’ to ‘ENSEMBL’
2.66836e+01 seconds == WARNING: None of your genes could be mapped to unique genes aligned to species ‘Balb/c mouse:10090’ and authority ‘ENSEMBL’
2.66949e+01 seconds == ***Mapping of query genes to reference dataset complete
2.69166e+01 seconds == an ERROR occurred ====
Traceback (most recent call last):
File cell_type_mapper/cli/from_specified_markers.py, line 164, in run_mapping
output = _run_mapping(
File cell_type_mapper/cli/from_specified_markers.py, line 429, in _run_mapping
create_marker_cache_from_specified_markers(
File cell_type_mapper/type_assignment/marker_cache_v2.py, line 115, in create_marker_cache_from_specified_markers
marker_lookup = validate_marker_lookup(
File cell_type_mapper/type_assignment/marker_cache_v2.py, line 785, in validate_marker_lookup
raise RuntimeError(error_msg)
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
Example of genes in query set:
[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]
Example of marker genes:
[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]

2.69167e+01 seconds == CLEANING UP
e=RuntimeError(“After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.\nExample of genes in query set:\n[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]\nExample of marker genes:\n[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]”), type(e)=<class ‘RuntimeError’>, fname=‘run.py’, lineno=290
Traceback (most recent call last):
File “/apps/run.py”, line 290, in run
runner.run()
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 80, in run
self.run_mapping(write_to_disk=True)
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 164, in run_mapping
output = _run_mapping(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 429, in _run_mapping
create_marker_cache_from_specified_markers(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/type_assignment/marker_cache_v2.py”, line 115, in create_marker_cache_from_specified_markers
marker_lookup = validate_marker_lookup(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/type_assignment/marker_cache_v2.py”, line 785, in validate_marker_lookup
raise RuntimeError(error_msg)
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
Example of genes in query set:
[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]
Example of marker genes:
[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]

Mapping algorithm failed because of application errors.
Unexpected e=RuntimeError(“After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.\nExample of genes in query set:\n[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]\nExample of marker genes:\n[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]”), type(e)=<class ‘RuntimeError’>, fname=‘run.py’, lineno=375

Hi,

Do you mind running the failed mapping again and posting the Run ID (a long alphanumeric string that will appear in the red “your mapping failed” box in the browser)?

That will allow me to download your data and inspect the gene identifiers in the file you submitted. That will give me some insight into why the code doesn’t correctly identify your data as rat data.

Hi @muchangqing777

I downloaded the data that failed. It looks like, for some reason, the index on your var dataframe (the identifiers of your genes) is just a bunch of integers as strings.

>>> import anndata
>>> src = anndata.read_h5ad(fname, backed='r')
>>> src.var.index.values
array(['0', '1', '2', ..., '24989', '24990', '24991'],
      shape=(24992,), dtype=object)

The gene identifiers are in your var dataframe, but they are not the index. They are in their own column, gene_id:

>>> src.var
            gene_id
0      LOC103693496
1      LOC102556157
2      LOC102548633
3      LOC103690914
4      LOC102556098
...             ...
24987         Usp9y
24988  LOC120099595
24989  LOC120099581
24990  LOC120099587
24991  LOC120099591

For better or worse, MapMyCells expects the gene identifiers (or valid gene symbols) to be the index of var, not a column. So, I would recommend going back into whatever code created the h5ad file and running something like

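import anndata

# X, obs and var are whatever your existing conversion code already builds;
# the only change is making the gene_id column the index of var
var = var.set_index('gene_id')
ad = anndata.AnnData(X=X, obs=obs, var=var)
ad.write_h5ad('path/to/file.h5ad')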

and then resubmitting.
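
Before setting the index, it may be worth confirming that the gene_id column has no missing or repeated values, since it is about to become the set of identifiers the mapper reads. The following is a minimal sketch using pandas only. promote_gene_column is a hypothetical helper name (not part of anndata or MapMyCells), and the toy var frame stands in for the real one.

```python
import pandas as pd


def promote_gene_column(var: pd.DataFrame, column: str = "gene_id") -> pd.DataFrame:
    """Return a copy of `var` indexed by the gene identifiers in `column`.

    Hypothetical helper for illustration only; `gene_id` is the column
    name seen in the var dataframe of the failed file.
    """
    if column not in var.columns:
        raise KeyError(f"column {column!r} not found in var")
    genes = var[column]
    if genes.isna().any():
        raise ValueError("some gene identifiers are missing")
    if genes.duplicated().any():
        raise ValueError("gene identifiers are not unique")
    out = var.copy()
    out[column] = out[column].astype(str)
    # set_index drops the column and makes it the index
    return out.set_index(column)


# Toy stand-in for the real var dataframe
var = pd.DataFrame({"gene_id": ["LOC103693496", "LOC102556157", "Usp9y"]})
fixed = promote_gene_column(var)
print(list(fixed.index))
```

The returned frame could then be passed as var when building the AnnData object.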

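To answer the implicit question “how does MapMyCells determine what species my data is taken from?”: that determination is made exclusively using the index of the var dataframe. It does not depend on the quality of the data at all. In your failed file that index was the strings '0', '1', '2', ..., which is why the log shows the input genes as e.g. ['0' '1' '2' '3' '4'] and why none of them could be mapped to mouse genes.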
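
A quick check before uploading can catch this case. The sketch below is only an illustration: looks_like_placeholder_index is a hypothetical helper, not something MapMyCells provides, and it simply flags an index made up entirely of integer strings like the one in the failed file.

```python
def looks_like_placeholder_index(names):
    """Return True if every name is a bare integer string such as '0', '1', '2'.

    Hypothetical helper for illustration; it is not part of MapMyCells.
    """
    names = [str(n) for n in names]
    return len(names) > 0 and all(n.isdigit() for n in names)


# Index values like those in the failed file
bad_index = ["0", "1", "2", "24991"]
# Gene identifiers like those in the gene_id column
good_index = ["LOC103693496", "Usp9y", "LOC120099591"]

print(looks_like_placeholder_index(bad_index))   # True
print(looks_like_placeholder_index(good_index))  # False
```

On a real file, the list to pass in would be the var index values read with anndata, as in the inspection snippet earlier in this reply.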

Thank you for your solution.

I just tried converting all the gene names to ENSEMBL IDs and attempted the conversion of the Seurat object to an h5ad file using the following code:

ad <- AnnData(
  X = t(as(count_matrix, "CsparseMatrix")),  # Using sparse matrix format
  obs = obs,  # Sample metadata
  var = data.frame(gene_id = gene)  # Gene metadata
)
# Set the output path
output_path <- "Sham_clean1.ENSEMBL.h5ad"
# Write the h5ad file
write_h5ad(ad, output_path, compression = "gzip")

After that, I noticed that the annotation worked perfectly fine. This is quite a peculiar phenomenon.
I wonder if it has something to do with what you mentioned: “MapMyCells expects the gene identifiers (or valid gene symbols) to be the index of var, not a column”?