Hi! I want to compare some cortical snRNA-seq data with a reference dataset. I’ve seen this reference dataset [Whole Cortex & Hippocampus - SMART-seq (2019) with 10x-SMART-seq taxonomy (2021)], but it is too big to work with, so I was planning on filtering it. How can I eliminate non-cortical cells? Also, I am using the Seurat object uploaded there; should I work with some other file?
Thank you!
To subset the data you will first need to load the full file somewhere (I assume in R, if you are using the Seurat object?). Any of the available files that you can get loaded (Seurat or otherwise) should work. I like the “fread” function in the “data.table” R library for reading csv files.
Separately, you’ll need to download and read in the table of metadata on the download page (direct link). To eliminate non-cortical cells you can use one or both of two strategies based on the “region” (or “region_label”) column in the metadata.
- Filter out any cells from regions you don’t care about, including any cortical areas you don’t need. You can look at the anatomic reference atlas to determine the full names and locations behind the abbreviations in this column.
- Filter out any cell types that consist almost entirely of cells from regions you don’t care about. This would require making some judgement calls (e.g., what fraction of a type’s cells coming from outside cortex is “too much”?), and you could do it computationally (recommended) or visually by looking at the Sampling Strategy heatmap (not recommended).
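As a rough illustration of both strategies, here is a minimal base R sketch on a toy metadata table. The column names “sample_name” and “cluster_label”, the region values, the “drop_regions” list, and the 0.5 cutoff are all assumptions made for the example; check them against the real metadata file and pick your own threshold.

```r
# Toy stand-in for the metadata table. In practice you would read the real
# file instead, e.g. data.table::fread("metadata.csv") (placeholder file name).
meta <- data.frame(
  sample_name   = paste0("cell", 1:8),   # assumed cell ID column
  region_label  = c("VISp", "VISp", "MOp", "HIP", "HIP", "HIP", "MOp", "HIP"),
  cluster_label = c("A", "A", "A", "B", "B", "B", "B", "A"),  # assumed column
  stringsAsFactors = FALSE
)

# Hypothetical list of regions you do not care about
drop_regions <- c("HIP")

# Strategy 1: drop cells dissected from unwanted regions
unwanted <- meta$region_label %in% drop_regions
meta_region <- meta[!unwanted, ]

# Strategy 2: for each cluster, compute the fraction of its cells that come
# from unwanted regions, then drop clusters above a chosen cutoff
frac_unwanted <- tapply(unwanted, meta$cluster_label, mean)
max_frac <- 0.5  # judgement call, not a recommended value
keep_clusters <- names(frac_unwanted)[frac_unwanted <= max_frac]

# Both strategies combined
meta_both <- meta_region[meta_region$cluster_label %in% keep_clusters, ]
print(frac_unwanted)
print(meta_both$sample_name)
```

If the kept cell IDs match the cell names in your Seurat object (an assumption worth checking), something like subset(obj, cells = meta_both$sample_name) should then shrink the object to those cells.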
Another option would be to filter cells by region (the first strategy above) and then also keep only clusters with “CTX” in the name, which indicates that cells in that cluster are primarily found in cortex.
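Flagging clusters by the “CTX” tag in their name could look roughly like the sketch below. The cluster names here are made up for illustration, so compare them with the actual cluster labels in the metadata before relying on the pattern.

```r
# Illustrative cluster names only; use the real labels from the metadata
cluster_names <- c("L2/3 IT CTX", "L5 PT CTX", "CA1-ProS", "DG")

# TRUE where the name contains the literal string "CTX"
is_ctx <- grepl("CTX", cluster_names, fixed = TRUE)

ctx_clusters   <- cluster_names[is_ctx]
other_clusters <- cluster_names[!is_ctx]
print(ctx_clusters)
```

The resulting flag can then be used to keep or drop clusters, in the same way as the region filter.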
A final option, if you want to do cluster-level analysis rather than cell-level analysis, would be to download one of the “Gene Expression grouped by Cluster…” tables on the download page. These are much smaller and may be easier to work with, but you’re limited in what you can do with these kinds of summary files.
Best,
Jeremy