Gene expression matrix .csv is too large to load

Hello, @jeremyinseattle and @tylermo!

I’m reaching out with a few questions about handling the Whole Cortex and Hippocampus 10X dataset in Seurat:

  1. Complete Dataset in Seurat: Has there been any success in loading the entire dataset into a Seurat object since its release? If so, is that object available for use? I ask because it was suggested earlier in this thread that this might be possible in Python with more memory. I’m under the impression that working with this dataset becomes significantly easier once it’s in Seurat object form, so please let me know if my understanding is incorrect.

  2. Label Transfer: For cell type annotation in our recent cortical snRNAseq data, I’ve been effectively using MapMyCells. To complement this, I’m exploring Seurat’s TransferData function to transfer labels onto our dataset AND get prediction confidence scores. This approach would mirror MapMyCells’ functionality but offer more flexibility with parameter adjustments if needed. For this, I need both our data and the whole cortex and hippocampus dataset in Seurat object form, hence my initial query above.

As per @jeremyinseattle’s suggestion, I understand that subsetting the data might be a practical solution, but I have some reservations:

  1. Best Practices for Subsetting: I’m seeking guidance on effective subsetting techniques. A key concern is the potential exclusion of rare cell types or other crucial data, and I wonder how significant an issue this is in practice. Are there established criteria or methods to ensure that a randomly chosen subset accurately mirrors the entire dataset? This may well depend on the question at hand, but I’d love to hear your views. I’m also interested in the optimal number of cells for a subset that balances practicality with comprehensiveness. Finally, is there a recommended ‘sanity check’ or verification process to confirm that findings derived from subsetted data are valid and reliable?
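For what it’s worth, one simple sanity check along these lines is to compare per-cluster proportions between the subset and the full dataset. A minimal sketch in plain Python (the function name, labels, and threshold idea are mine, purely illustrative, not from Seurat or MapMyCells):

```python
from collections import Counter

def composition_gap(full_labels, subset_labels):
    """Largest absolute difference in per-cluster proportion
    between the full dataset and a subset."""
    full = Counter(full_labels)
    sub = Counter(subset_labels)
    n_full, n_sub = len(full_labels), len(subset_labels)
    return max(abs(full[c] / n_full - sub.get(c, 0) / n_sub) for c in full)

full = ["A"] * 900 + ["B"] * 90 + ["C"] * 10   # rare type C is 1% of cells
good = ["A"] * 90 + ["B"] * 9 + ["C"] * 1      # proportional subset
bad  = ["A"] * 100                             # drops B and C entirely

print(composition_gap(full, good))  # 0.0 -> composition preserved
print(composition_gap(full, bad))   # 0.1 -> A over-represented, B and C missing
```

A gap near zero says the subset roughly preserves cluster composition; a large gap flags missing or over-represented clusters, which is exactly where rare types get lost.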

I’d really appreciate your insights!
Thank you!

Best,
Sai

Hi Sai,

For #1, our best guidance comes directly from the Seurat docs:

In Seurat v5, we introduce new infrastructure and methods to analyze, interpret, and explore exciting datasets spanning millions of cells, even if they cannot be fully loaded into memory. We introduce support for ‘sketch’-based analysis, where representative subsamples of a large dataset are stored in-memory to enable rapid and iterative analysis - while the full dataset remains accessible via on-disk storage.

Re: label transfer, thanks for sharing the details of your workflow. That’s a great way for us to improve MapMyCells!

For the subsetting question: if you can sample up to 100 cells per cluster, that should be sufficient.
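That per-cluster cap could be sketched as follows in plain Python (the function name and cell/cluster inputs are illustrative; in practice the labels would come from your clustering):

```python
import random
from collections import defaultdict

def sample_per_cluster(cell_ids, cluster_labels, cap=100, seed=0):
    """Sample up to `cap` cells from each cluster, keeping every cell
    from clusters smaller than the cap (so rare types are not lost)."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for cell, cluster in zip(cell_ids, cluster_labels):
        by_cluster[cluster].append(cell)
    keep = []
    for cells in by_cluster.values():
        keep.extend(cells if len(cells) <= cap else rng.sample(cells, cap))
    return keep

cells = [f"cell{i}" for i in range(250)]
labels = ["big"] * 200 + ["rare"] * 50
subset = sample_per_cluster(cells, labels, cap=100)
print(len(subset))  # 150: 100 sampled from "big", all 50 kept from "rare"
```

Capping per cluster rather than sampling uniformly at random is what protects rare cell types: a 1%-abundance cluster survives intact instead of being decimated by the overall sampling rate.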

Hi, I’ve followed this discussion for a while and have recently tried the sketch-based approach that you’ve quoted from the Seurat docs.

I have had no luck with this. Seurat’s sketch-based workflow relies on BPCells to load matrices from .h5 files and, as far as I can tell, expects these to be sparse matrices. As @jeremyinseattle pointed out, the full matrix cannot easily be loaded, converted to a sparse matrix, or transposed. I had originally hoped to do exactly that, so that I could perform the necessary matrix operations, write the matrix to a new .h5 file, and then proceed with Seurat’s recommended sketch-based workflow. In short, because of how the matrix is stored in the .h5 file, loading the full dataset into memory seems inevitable at some point.

Of course, subsetting as per @jeremyinseattle’s script is an option, but I share @sbhamidipati’s concerns about rare cell types, and I also wanted to try the leverage-score-based sampling, which appears to be a major advantage of the sketch-based approach offered in Seurat.
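For context on why a workaround might exist at all: the bottleneck is the dense on-disk layout, not the volume of non-zero counts, so in principle the dense matrix can be streamed in row chunks and accumulated into CSR components without ever holding the whole thing in memory. Below is a sketch of that chunked dense-to-CSR step using only NumPy; reading the chunks from the original .h5 (e.g. via h5py slicing) and writing the result in a format BPCells accepts are left out, and the function name is mine, not from Seurat or BPCells:

```python
import numpy as np

def dense_chunks_to_csr(chunks):
    """Accumulate CSR components (data, indices, indptr) from an
    iterator of dense row chunks, never materializing the full matrix."""
    data, indices = [], []
    indptr = [0]
    for chunk in chunks:                  # e.g. h5_dataset[i:i + 5000, :]
        for row in np.asarray(chunk):
            nz = np.flatnonzero(row)      # columns with non-zero counts
            data.append(row[nz])
            indices.append(nz)
            indptr.append(indptr[-1] + nz.size)
    return (np.concatenate(data) if data else np.empty(0),
            np.concatenate(indices) if indices else np.empty(0, dtype=int),
            np.asarray(indptr))

# Toy example: a 4 x 3 matrix streamed in two 2-row chunks.
m = np.array([[0, 5, 0], [1, 0, 2], [0, 0, 0], [3, 0, 0]])
data, indices, indptr = dense_chunks_to_csr([m[:2], m[2:]])
print(data.tolist())     # [5, 1, 2, 3]
print(indices.tolist())  # [1, 0, 2, 0]
print(indptr.tolist())   # [0, 1, 3, 3, 4]
```

Peak memory is then set by the chunk size plus the sparse output, which for a counts matrix is far smaller than the dense array. Whether BPCells will accept the resulting file without a further conversion step is the part I cannot vouch for.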

Does anyone know of a workaround to use the full matrix with Seurat’s sketch-based workflow?

thanks,
Tom