How can I optimize my snRNA-seq workflow—both experimentally and bioinformatically—to enable high-resolution transcriptomic cell type mapping?

Hello!

I’m writing to get some guidance on optimizing my snRNA-seq workflow to facilitate accurate mapping to the Allen Institute’s transcriptomic types at the highest granularity possible.

Specifically,

  1. Is there an optimal number/range of reads per nucleus to aim for when sequencing to ensure robust and confident mapping to the Allen Institute’s transcriptomic types using MapMyCells or similar label transfer methods?
  2. Any recommendations for sample preparation and sequencing platform selection to maximize data quality and enable unambiguous classification of cell types?

Additionally, I’d like to know whether insufficient reads or genes detected could prevent classification into finer granularity levels (e.g., class, subclass, subtype, cluster, or supertype).

  • Is there a recommended number/range of genes per nucleus that ensures accurate classification at a given level of classification?
  • Are there specific metrics or thresholds (e.g., sequencing saturation, % exonic/intronic reads, etc.) that should guide quality assessment to achieve the highest resolution possible?
  • What factors or biases (e.g., dropouts, low-abundance transcripts) could impact the ability to resolve rare or specific subtypes, and how can I address these?

Beyond sequencing depth, I’d love insights on other factors that might influence the outcomes:

  • Downstream analysis: What bioinformatics best practices (e.g., normalization, batch correction, integration) can support unambiguous clustering?
  • Validation of cell types: How can I confirm that all expected populations, especially rare or ambiguous ones, are captured and classified correctly?

Finally, are there benchmarks or datasets from the Allen Institute or other resources that I can use to validate my clustering and assess whether my data aligns well at the desired level of granularity?

My ultimate goal is to optimize my workflow so that the quality of my data—whether genes detected, reads, or other factors—is not a bottleneck for achieving high-resolution classification. If there are any other considerations or questions I should ask, I’d greatly appreciate your advice!

Thank you so much for your support!

Warmly,
Sai

Hi @sbhamidipati ,

I’m a data scientist/software engineer here at the Allen Institute. My principal qualification to answer your question is that I maintain the python code providing the backend for MapMyCells. I say this to clarify that I have precisely zero bench science experience and will let others speak to your questions regarding sample preparation and sequencing platform selection.

A lot of your questions reflect tests that we really would have liked to do on the MapMyCells algorithm but de-prioritized as our release date approached. That being said, you can download the data that the MapMyCells algorithm was “trained”* on using the Jupyter notebooks/python package provided in the abc_atlas_access repository. Specifically, you will want to start with this notebook and, for Whole Mouse Brain (I am assuming you are working with mouse data; everything I say also applies to Whole Human Brain), this notebook.
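For a rough sense of what the download looks like in code, here is a minimal sketch using the abc_atlas_access package. The class and method names below are my recollection of what the notebooks use and may lag the current release, so treat the linked notebooks as the authoritative reference.

```python
from pathlib import Path

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

# Local directory where downloaded files will be cached (example path).
download_base = Path("./abc_atlas_cache")
cache = AbcProjectCache.from_cache_dir(download_base)

# See which data directories are available (e.g. the whole mouse brain 10x release).
print(cache.list_directories)

# Pull the cell metadata table, which carries the cell type annotations.
# "WMB-10X" / "cell_metadata" follow the whole mouse brain notebook; adjust
# the directory/file names for the human data release.
cell_metadata = cache.get_metadata_dataframe(
    directory="WMB-10X",
    file_name="cell_metadata",
)
print(cell_metadata.head())
```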

Using those tools, you can get both the single cell RNA sequencing training data and the ground truth cell type annotations. You can then alter the data to simulate the various effects you are concerned about (drop-outs, varying read rates per nucleus, etc.) and see how those simulated effects alter the quality of the mapping. (Again, I would have loved to do this test myself, but haven’t made the time.)
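As a sketch of that kind of experiment, the snippet below thins a raw count matrix with binomial downsampling (mimicking fewer reads per nucleus) and scores the re-mapped labels against the reference annotations. The mapping call itself is just a placeholder for whatever tool you use.

```python
import numpy as np
import scipy.sparse as sparse

rng = np.random.default_rng(42)

def downsample_counts(counts: sparse.csr_matrix, keep_fraction: float) -> sparse.csr_matrix:
    """Keep each individual read with probability keep_fraction (binomial thinning)."""
    thinned = counts.copy()
    thinned.data = rng.binomial(n=thinned.data.astype(int), p=keep_fraction)
    thinned.eliminate_zeros()
    return thinned

def mapping_accuracy(counts, true_labels, keep_fraction, map_cells):
    """Fraction of cells whose mapped label matches the reference annotation.

    map_cells is a stand-in for whatever mapping step you use (a MapMyCells
    website export, or the cell type mapper code run locally).
    """
    thinned = downsample_counts(counts, keep_fraction)
    predicted = map_cells(thinned)
    return float(np.mean(np.asarray(predicted) == np.asarray(true_labels)))

# e.g. sweep keep_fraction over [1.0, 0.5, 0.2, 0.1] and tabulate accuracy per
# taxonomy level to see where the finer levels (supertype, cluster) degrade.
```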

If the website interface proves too clunky for your work, the python code I linked to when introducing myself can be installed and run on your own personal machine. There is an example notebook here that shows how to run the cell type mapper locally. If you want to map to the officially supported taxonomies on the website, you will need to download some additional data files as explained here.

Additionally, the abc_atlas_access repository I pointed you to above will give you the ability to download the spatial transcriptomics datasets currently being served through the ABC Atlas. These also come with cell type annotations that were derived using the MapMyCells algorithm and vetted by Allen Institute scientists. This will give you another handle on validation (though it sounds like you are doing single cell sequencing, not spatial transcriptomics).
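Whichever set of reference annotations you compare against, a quick agreement check is a contingency table plus a summary score. The labels below are toy placeholders for your own per-cell vectors.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy placeholders: one label per cell, aligned to the same cells in both vectors.
my_labels = ["c0", "c0", "c1", "c1", "c2"]
reference_labels = ["L2/3 IT", "L2/3 IT", "Sst", "Sst", "Pvalb"]

# Contingency table: which reference types each of your clusters maps onto.
confusion = pd.crosstab(
    pd.Series(my_labels, name="my_clusters"),
    pd.Series(reference_labels, name="reference_label"),
)
print(confusion)

# Single summary of agreement between the two partitions (1.0 = identical).
print("ARI:", adjusted_rand_score(reference_labels, my_labels))
```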

I know these are a lot of “physician, heal thyself” recommendations. Please reach out if anything doesn’t work, is unclear, or isn’t what you wanted.

One final note: you asked about batch correction. If you are just using MapMyCells (i.e., you are trying to map to a pre-existing cell type taxonomy, rather than deriving de novo cell type clusters from your data), I do not think you will need to worry about batch correction. The MapMyCells algorithm maps each cell independently of all other cells, so there isn’t really a batch to correct for. If you want to see what I am talking about, this link points to a webinar I gave on how to use MapMyCells. I explain the algorithm at around the 13:20 mark. The mapping algorithm is also described in text on this page.

* I put quotation marks around the word “trained” because, with the exception of one algorithm that is only supported for the SEA-AD taxonomy, the MapMyCells algorithms are not deep learning algorithms. “Training data” in the context of MapMyCells really just means “the set of cells we used to calculate the summary statistics representing the cell types in the taxonomy.” That is technically a valid use of the term “training data,” but I feel like the deep learning implications of the term are slowly swallowing our communal lexicon, so I thought I would clarify.


Hi @sbhamidipati

To briefly address a few of your other questions:

  1. Is there an optimal number/range of reads per nucleus to aim for…: The short answer is no. The better QCed your data, and the closer your method is to the method used in the reference taxonomy you are mapping against, the better your results will be. Confidence scores and post-hoc biological sanity checks can be used to determine how well you can believe the results.
  2. Any recommendations for sample preparation and sequencing platform selection…: For a variety of biological, technical, and logistical reasons (that I am also not the right person to comment on), the methods that we use are the ones we’ve decided are best in our hands, and therefore are the ones we’d recommend. Allen Institute protocols can be found on our protocols.io page, and a good starting point is here for human snRNA-seq and here for mouse scRNA-seq.
  3. Bioinformatics best practices: I think you’d get a different answer from everyone you ask. I’d encourage others to comment as well, but I’d recommend starting with any highly-published, well-established method, like the Seurat or scvi-tools suites (a minimal sketch of that kind of pipeline follows this list).
  4. Benchmarking: We’ve done benchmarking for all our reference data sets. We’re still working to get the results public (more soon on brain-map.org!), but you can see what they will look like, along with the repository for running the benchmarking code, here on GitHub.
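As a very rough illustration of item 3 (and the QC point in item 1), here is a minimal scanpy-based sketch of such a pipeline; Seurat offers equivalent steps on the R side. The thresholds and parameters are illustrative only, not Allen Institute recommendations.

```python
import scanpy as sc

adata = sc.read_h5ad("my_snrnaseq_counts.h5ad")  # example input path

# Basic QC: drop nuclei with very few detected genes and genes seen in
# almost no nuclei. Tune the cutoffs to your tissue and chemistry.
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=3)

# Keep raw counts in a layer; mapping tools generally want untransformed counts.
adata.layers["counts"] = adata.X.copy()

# Standard normalization -> log transform -> HVG selection -> PCA ->
# neighbor graph -> Leiden clustering -> UMAP.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, resolution=1.0)
sc.tl.umap(adata)
```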

Best,
Jeremy