Gene ID Formatting to match Reference

Hi Everyone,

I’m getting an error when trying to map my genes to those in the Mouse whole brain reference: “RuntimeError: After comparing query data to reference data, no valid marker genes could be found at any level in the taxonomy.”

I’ve confirmed that the reference JSON uses Ensembl IDs and that those are contained in my H5AD. However, I think my issue arises from the way they are saved under gene_ids. The Ensembl IDs are present, but the mapper is only referencing the first column. I feel like it’s an easy fix, but I’m stumped. Any guidance?


Hello @mmond,

The MapMyCells code looks for gene identifiers in the index of the var dataframe. Right now, it looks like your dataframe is such that

var.index.values = ['Xkr4', 'Gm1992', 'Gm19938'...]

and you need to be in a state where

var.index.values = ['ENSMUSG00000051951', 'ENSMUSG00000089699', ...]

Off-the-cuff, the easiest way to make this transformation would be

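import anndata

# Load the original file, whose var index holds gene symbols
src = anndata.read_h5ad('/path/to/original_file.h5ad')

# Move the current index (gene symbols) into a regular column,
# then promote the 'gene_ids' column (Ensembl IDs) to the index
src_var = src.var
dst_var = src_var.reset_index().set_index('gene_ids')

# Build a new AnnData with the re-indexed var and write it out
# (only X, obs, and var are carried over into the new object)
dst = anndata.AnnData(X=src.X, obs=src.obs, var=dst_var)
dst.write_h5ad('/path/to/reformatted_file.h5ad')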
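
If you want to sanity-check the result before running the mapper, something like the sketch below reports what fraction of the identifiers look like mouse Ensembl gene IDs. The helper name fraction_ensembl_like is made up for this example and is not part of MapMyCells or anndata; the pattern assumes the usual "ENSMUSG" plus 11 digits form, with an optional version suffix.

```python
import re

# Mouse Ensembl gene IDs: "ENSMUSG" + 11 digits, optionally followed by
# a ".N" version suffix (assumption: the reference uses this format).
ENSEMBL_MOUSE_GENE = re.compile(r"^ENSMUSG\d{11}(\.\d+)?$")


def fraction_ensembl_like(identifiers):
    """Return the fraction of identifiers that look like mouse Ensembl gene IDs.

    Hypothetical helper for illustration only.
    """
    identifiers = [str(i) for i in identifiers]
    if not identifiers:
        return 0.0
    hits = sum(1 for i in identifiers if ENSEMBL_MOUSE_GENE.match(i))
    return hits / len(identifiers)


# Gene symbols do not match; Ensembl IDs do.
print(fraction_ensembl_like(["Xkr4", "Gm1992"]))
print(fraction_ensembl_like(["ENSMUSG00000051951", "ENSMUSG00000089699"]))

# With the reformatted file (requires anndata):
# import anndata
# dst = anndata.read_h5ad('/path/to/reformatted_file.h5ad', backed='r')
# print(fraction_ensembl_like(dst.var.index))
```

A value near 1.0 on var.index suggests the index now holds Ensembl IDs; a value near 0.0 suggests it is still holding gene symbols.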

I am a little confused that you are having this problem. The online MapMyCells tool has a step that infers Ensembl IDs from gene symbols, in the event that var is indexed on gene symbols. Are you using the online app, or running the code locally*? If you are running the online MapMyCells tool, would you mind posting the full run-ID when/if you encounter this error again? I would be fascinated to see why it is not inferring Ensembl IDs from your gene symbols.

*If you are running the code locally, I am not surprised. The step that transforms gene symbols to Ensembl IDs is in a separate data validator module that isn’t a default part of the pipeline when running the code locally.

That seems to have done it. Running now, thanks. I am running locally, so that must be the issue.