Problem Description
I’m experiencing a puzzling issue with MapMyCells: the same rat brain snRNA-seq dataset produces completely different results depending on whether the cells are quality-control (QC) filtered before upload:
Successful Case (With QC)
- QC criteria: nCount_RNA >= 500 & nCount_RNA <= 55000 & nFeature_RNA >= 250 & nFeature_RNA <= 8000 & percent.mt <= 5
- Result: Excellent annotation results that closely match manual annotation using established markers
- Species correctly identified: Rat → Mouse mapping works perfectly
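For readers who want the QC criteria in executable form, here is a minimal sketch in plain Python. It assumes per-cell metrics are available as dictionaries keyed by the same column names used above; the `QC_LIMITS` table, the `passes_qc` helper, and the example cells are illustrative and are not taken from the actual dataset or pipeline.

```python
# QC limits copied from the criteria above; None means "no bound on that side".
QC_LIMITS = {
    "nCount_RNA": (500, 55000),
    "nFeature_RNA": (250, 8000),
    "percent.mt": (None, 5),
}


def passes_qc(cell, limits=QC_LIMITS):
    """Return True if every metric of `cell` lies within its inclusive bounds."""
    for metric, (lower, upper) in limits.items():
        value = cell[metric]
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    return True


# Illustrative cells only; these numbers are NOT from the dataset in question.
example_cells = [
    {"nCount_RNA": 3200, "nFeature_RNA": 1500, "percent.mt": 1.2},   # within all bounds
    {"nCount_RNA": 120, "nFeature_RNA": 90, "percent.mt": 0.4},      # too few counts/features
    {"nCount_RNA": 9000, "nFeature_RNA": 3000, "percent.mt": 12.0},  # mitochondrial fraction too high
]
kept = [cell for cell in example_cells if passes_qc(cell)]
print(f"{len(kept)} of {len(example_cells)} cells pass QC")
```

All bounds are inclusive, matching the `>=` and `<=` operators in the criteria.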
Failed Case (Without QC)
- Same sample, no cell filtering
- Same code for h5ad conversion
- File uploads successfully but annotation fails with species misidentification
Error Details
Key error messages:
Based on 694 genes, your input data is from species 'Bacillus megaterium phage G:2884420'
Mapping genes from species 'Bacillus megaterium phage G:2884420' to 'Balb/c mouse:10090'
WARNING: None of your genes could be mapped to unique genes aligned to species 'Balb/c mouse:10090' and authority 'ENSEMBL'
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
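To put the species-detection message in context, the sketch below pulls the relevant numbers out of two lines copied from the complete log in this post: how many genes the species guess was based on, how many genes were submitted, and which example identifiers the mapper reported. The regular expressions and the `summarize_log` helper are my own and only assume that the log lines have the wording shown in this post.

```python
import re

# Two lines copied from the complete error log in this post (curly quotes as pasted).
LOG_LINES = [
    "2.56525e+01 seconds == Based on 694 genes, your input data is from species "
    "‘Bacillus megaterium phage G:2884420’",
    "2.56751e+01 seconds == Mapping 24992 input genes from ‘symbols’ to ‘NCBI’ "
    "(e.g. [‘0’ ‘1’ ‘2’ ‘3’ ‘4’])",
]

# Accept both straight and curly single quotes, since forum pasting may change them.
QUOTE_OPEN = "['‘]"
QUOTE_CLOSE = "['’]"

SPECIES_RE = re.compile(
    r"Based on (\d+) genes, your input data is from species "
    + QUOTE_OPEN + r"(.+):(\d+)" + QUOTE_CLOSE
)
MAPPING_RE = re.compile(
    r"Mapping (\d+) input genes from " + QUOTE_OPEN + r"(\w+)" + QUOTE_CLOSE
    + r" to " + QUOTE_OPEN + r"(\w+)" + QUOTE_CLOSE + r" \(e\.g\. \[(.*)\]\)"
)
EXAMPLE_ID_RE = re.compile(QUOTE_OPEN + r"([^'’]*)" + QUOTE_CLOSE)


def summarize_log(lines):
    """Collect species-detection and gene-mapping facts from log lines."""
    summary = {}
    for line in lines:
        species = SPECIES_RE.search(line)
        if species:
            summary["genes_used"] = int(species.group(1))
            summary["species"] = species.group(2)
            summary["taxon_id"] = int(species.group(3))
        mapping = MAPPING_RE.search(line)
        if mapping:
            summary["n_input_genes"] = int(mapping.group(1))
            summary["example_ids"] = EXAMPLE_ID_RE.findall(mapping.group(4))
    if "genes_used" in summary and "n_input_genes" in summary:
        # Share of submitted genes that contributed to the species guess.
        summary["fraction_used"] = summary["genes_used"] / summary["n_input_genes"]
    return summary


result = summarize_log(LOG_LINES)
for key, value in result.items():
    print(f"{key}: {value}")
```

For this log, the species guess rests on 694 of 24992 submitted genes, which is under 3% of the input.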
Gene ID comparison shows complete mismatch:
- Query genes: ['INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1', ...]
- Reference genes: ['ENSMUSG00000000028', 'ENSMUSG00000000037', ...]
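One way to make the mismatch concrete is to sort identifiers by their shape. The sketch below is a rough classifier, not part of MapMyCells: the category names and the `classify_gene_id` helper are hypothetical. It is worth noting that the full log in this post lists ['0' '1' '2' '3' '4'] as example input identifiers, which look like positional indices rather than gene symbols. Whether the uploaded file really carries such identifiers is an assumption worth checking, not a confirmed diagnosis.

```python
import re
from collections import Counter

# Mouse ENSEMBL gene IDs as they appear in this post: "ENSMUSG" + 11 digits.
ENSEMBL_MOUSE_RE = re.compile(r"^ENSMUSG\d{11}$")
# Prefix of the placeholder names shown in the error message.
PLACEHOLDER_PREFIX = "INVALID_QUERY_GENE"


def classify_gene_id(gene_id):
    """Assign a gene identifier to a coarse category based on its shape."""
    if gene_id.startswith(PLACEHOLDER_PREFIX):
        return "placeholder"
    if ENSEMBL_MOUSE_RE.match(gene_id):
        return "ensembl_mouse"
    if gene_id.isdigit():
        return "integer_like"
    return "other"  # gene symbols and anything else land here


def tally(gene_ids):
    """Count how many identifiers fall into each category."""
    return Counter(classify_gene_id(gene_id) for gene_id in gene_ids)


# Identifiers quoted in this post.
reference_ids = ["ENSMUSG00000000028", "ENSMUSG00000000037"]
query_ids_after_mapping = ["INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1"]
query_ids_in_log = ["0", "1", "2", "3", "4"]
# Hypothetical gene symbols, for contrast only; not taken from the dataset.
symbol_like_ids = ["Gfap", "Snap25"]

for label, ids in [
    ("reference", reference_ids),
    ("query after mapping", query_ids_after_mapping),
    ("query in log", query_ids_in_log),
    ("symbols (hypothetical)", symbol_like_ids),
]:
    print(label, dict(tally(ids)))
```

If most query identifiers come back as `integer_like`, the gene names may not have survived the h5ad conversion for that file.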
What I’ve Tried
- Gene alignment - No improvement
- Same processing pipeline - Only difference is QC filtering
- Multiple file validations - Both h5ad files pass validation
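Since both files pass validation, a further check that might help is comparing the gene axis of the two h5ad files directly. The sketch below is an assumption about how one could do that: `compare_gene_axes` is a hypothetical helper working on plain lists, and `load_gene_names` relies on the `anndata` package (`anndata.read_h5ad` and `AnnData.var_names`). The file names in the usage comment are placeholders, not the real files.

```python
def load_gene_names(path):
    """Read gene identifiers from an h5ad file (requires the anndata package)."""
    import anndata  # imported lazily so the comparison helper works without it

    adata = anndata.read_h5ad(path, backed="r")  # backed mode avoids loading the matrix
    return [str(name) for name in adata.var_names]


def compare_gene_axes(names_a, names_b, n_examples=5):
    """Summarize how two lists of gene identifiers differ."""
    set_a, set_b = set(names_a), set(names_b)
    return {
        "n_a": len(names_a),
        "n_b": len(names_b),
        "shared": len(set_a & set_b),
        "only_in_a": sorted(set_a - set_b)[:n_examples],
        "only_in_b": sorted(set_b - set_a)[:n_examples],
        "identical_order": list(names_a) == list(names_b),
    }


# Toy identifiers for illustration; not taken from the dataset.
filtered_names = ["Gfap", "Snap25", "Mbp"]
unfiltered_names = ["0", "1", "2"]
report = compare_gene_axes(filtered_names, unfiltered_names)
print(report)

# With real files (placeholder names), the call would look like:
# report = compare_gene_axes(load_gene_names("qc_filtered.h5ad"),
#                            load_gene_names("unfiltered.h5ad"))
```

If the pipeline is truly identical apart from cell filtering, the two gene axes would be expected to match; any difference would narrow down where the files diverge.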
Questions for the Community
- Why would unfiltered data cause species misidentification from rat to bacteriophage?
- Could low-quality cells (dead cells, empty droplets, debris) interfere with MapMyCells’ species detection algorithm?
- What specific aspects of low-quality cells might mimic bacteriophage gene expression patterns?
- Has anyone encountered similar species misidentification with unfiltered single-cell data?
- Are there minimum QC thresholds required for MapMyCells to function correctly?
Technical Context
- Sample: Rat brain snRNA-seq (Sham group)
- Reference: Balb/c mouse (10090) with ENSEMBL gene IDs
- Tool: MapMyCells web interface
- Gene format: Gene symbols (work perfectly in QC-filtered data)
The identical data works flawlessly after basic QC but completely fails species identification without filtering. Any insights would be greatly appreciated!
Complete Error Log:
2.80718e+00 seconds == DONE VALIDATING Sham_clean0.h5ad; no changes required
3.03144e+00 seconds == CLEANING UP
3.11604e+00 seconds == Validation run time: 3.0492148399353027
4.36425e-03 seconds == ENV: is_torch_available: False
4.37570e-03 seconds == ENV: is_cuda_available: False
4.37927e-03 seconds == ENV: use_torch: False
4.38309e-03 seconds == ENV: multiprocessing start method: fork
4.38595e-03 seconds == ENV: Python version: 3.10.13 (main, Sep 17 2025, 15:25:21) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]
4.38905e-03 seconds == ENV: anndata version: 0.11.4
4.39167e-03 seconds == ENV: numpy version: 2.2.6
4.63963e-03 seconds == BENCHMARK: spent 2.4033e-04 seconds validating config and copying data
4.70448e-03 seconds == using precomputed_stats_ABC_revision_230821.h5 for precomputed_stats
4.70734e-03 seconds == reading taxonomy_tree from precomputed_stats_ABC_revision_230821.h5
5.20512e-01 seconds == ***Checking to see if we need to map query genes onto reference dataset
2.42698e+01 seconds == Reference data belongs to species Balb/c mouse:10090
2.42716e+01 seconds == Reference genes are from authority ‘ENSEMBL’
2.42995e+01 seconds == Mapping input genes to ‘Balb/c mouse:10090 – ENSEMBL’ using
AllenInstitute/mmc_gene_mapper version 0.2.1
backed by database file: mmc_gene_mapper.2025-08-04.db
created on: 2025-08-04-18-10-52
hash: md5:755b0724c2ff00cc199f48e2718a09e5
2.56525e+01 seconds == Based on 694 genes, your input data is from species ‘Bacillus megaterium phage G:2884420’
2.56706e+01 seconds == Input genes are from species ‘Bacillus megaterium phage G:2884420’
2.56751e+01 seconds == Mapping 24992 input genes from ‘symbols’ to ‘NCBI’ (e.g. [‘0’ ‘1’ ‘2’ ‘3’ ‘4’])
2.59965e+01 seconds == Mapping genes from species ‘Bacillus megaterium phage G:2884420’ to ‘Balb/c mouse:10090’
2.63190e+01 seconds == Mapping input genes from ‘NCBI’ to ‘ENSEMBL’
2.66836e+01 seconds == WARNING: None of your genes could be mapped to unique genes aligned to species ‘Balb/c mouse:10090’ and authority ‘ENSEMBL’
2.66949e+01 seconds == ***Mapping of query genes to reference dataset complete
2.69166e+01 seconds == an ERROR occurred ====
Traceback (most recent call last):
File cell_type_mapper/cli/from_specified_markers.py, line 164, in run_mapping
output = _run_mapping(
File cell_type_mapper/cli/from_specified_markers.py, line 429, in _run_mapping
create_marker_cache_from_specified_markers(
File cell_type_mapper/type_assignment/marker_cache_v2.py, line 115, in create_marker_cache_from_specified_markers
marker_lookup = validate_marker_lookup(
File cell_type_mapper/type_assignment/marker_cache_v2.py, line 785, in validate_marker_lookup
raise RuntimeError(error_msg)
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
Example of genes in query set:
[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]
Example of marker genes:
[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]
2.69167e+01 seconds == CLEANING UP
e=RuntimeError(“After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.\nExample of genes in query set:\n[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]\nExample of marker genes:\n[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]”), type(e)=<class ‘RuntimeError’>, fname=‘run.py’, lineno=290
Traceback (most recent call last):
File “/apps/run.py”, line 290, in run
runner.run()
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 80, in run
self.run_mapping(write_to_disk=True)
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 164, in run_mapping
output = _run_mapping(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/cli/from_specified_markers.py”, line 429, in _run_mapping
create_marker_cache_from_specified_markers(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/type_assignment/marker_cache_v2.py”, line 115, in create_marker_cache_from_specified_markers
marker_lookup = validate_marker_lookup(
File “/usr/local/lib/python3.10/site-packages/cell_type_mapper/type_assignment/marker_cache_v2.py”, line 785, in validate_marker_lookup
raise RuntimeError(error_msg)
RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.
Example of genes in query set:
[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]
Example of marker genes:
[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]
Mapping algorithm failed because of application errors.
Unexpected e=RuntimeError(“After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.\nExample of genes in query set:\n[‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_1’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_10’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_100’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_101’, ‘INVALID_QUERY_GENE:ortholog:UNMAPPABLE_NO_MATCH_102’]\nExample of marker genes:\n[‘ENSMUSG00000000028’, ‘ENSMUSG00000000037’, ‘ENSMUSG00000000056’, ‘ENSMUSG00000000058’, ‘ENSMUSG00000000078’]”), type(e)=<class ‘RuntimeError’>, fname=‘run.py’, lineno=375