Gene expression matrix.csv is too large to load

Hi all,

It seems like there are a couple of unresolved issues on this thread related to the ‘Mouse Whole Cortex and Hippocampus 10x dataset (2021)’, which I will try to address (and hopefully others can improve on!); namely, 1) ‘What is the best way to read the data and metadata into R?’ and 2) ‘How can these data be loaded into Seurat?’. I’ve included an R script below that goes into a bit more detail. For reading the data into R, I’d suggest either the rhdf5 or the HDF5Array R library. Note that the count matrix in expression_matrix.hdf5 is a dense matrix; I was unable to convert it into a sparse matrix format even on an R server with 1TB of memory. I was also unable to transpose the full matrix (although this step must be possible, since @Delong.Zhou did it).
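If you're unsure of the file's internal layout, rhdf5's h5ls() lists its groups and datasets, which is a cheap sanity check before committing to a large read:

h5ls("expression_matrix.hdf5")
# should list /data/counts, /data/gene, and /data/samples (the names used in the script below)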

In terms of Seurat, I was likewise unable to read the full matrix into Seurat. I would strongly suggest applying Seurat only to a subset of the data at a time (e.g., run glia and distinct neuronal subclasses as separate analyses); a sketch of one way to subset by cell type follows. If you want to run an analysis on the full data set, there are strategies in python that use memory more efficiently, which might be a better bet (hopefully @christoph-hue or others who work with python can comment on this?). Finally, @Delong.Zhou, if you get a solution from the Seurat team, please post a link in this thread to let everyone else know.
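For example, assuming the metadata includes a subclass_label column (check colnames(metadata) for the exact names in your download), one way to pull out a single subclass for a separate analysis would be:

sst_cells <- metadata$sample_name[metadata$subclass_label == "Sst"]
read_sst  <- sort(match(intersect(sst_cells, samples), samples))
# then read only these cells via h5read(..., index = list(read_sst, NULL)), as in Strategy #2 below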

Best,
Jeremy

## SCRIPT FOR READING LARGE hdf5 DATA SET INTO R

## Load necessary libraries
library(rhdf5)      # For reading hdf5; https://www.bioconductor.org/packages/devel/bioc/vignettes/rhdf5/inst/doc/rhdf5.html
library(HDF5Array)  # Alternative option for reading the data matrix with less memory: https://rdrr.io/github/Bioconductor/HDF5Array/
library(data.table) # For fast reading of csv files
library(Seurat)


## Read in sample and gene names for use later
genes   <- h5read("expression_matrix.hdf5","/data/gene")
samples <- h5read("expression_matrix.hdf5","/data/samples")


## Read in the metadata using fread (fast!)
metadata <- fread("metadata.csv")
metadata <- as.data.frame(metadata)
rownames(metadata) <- metadata$sample_name
# Note that the ORDER of the metadata and the counts differ, as does the number of cells
#   (there are 107 cells in the matrix with no metadata). This will be critical later!
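# A quick way to quantify the mismatch yourself:
sum(!(samples %in% metadata$sample_name))  # cells in the matrix with no metadata row
sum(!(metadata$sample_name %in% samples))  # metadata rows absent from the matrix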


## Subsample your data
## -- If you want to analyze these data in R, and especially using Seurat, you will
## --   have a better chance of success if you only work on part of the data at a
## --   time. We suggest using information in the metadata (e.g., the cell type or
## --   brain region columns) to subset.
## -- For this example code, I'm selecting 1000 random cells and keeping those that
## --   overlap with the sample names in the matrix.
set.seed(42)
use_samples  <- intersect(sample(rownames(metadata),1000),samples)
read_samples <- sort(match(use_samples,samples))


## Read in the count matrix in one of three ways

## Strategy #1: Read in the entire data set using h5read. The data set is ~145GB total,
##   and as far as I can tell there is no way to convert it into a sparse matrix, transpose
##   it, or read it into Seurat without subsetting (at least with the 1TB of RAM my
##   computer has). It may be possible with python or if you have more memory.

system.time({
  counts <- h5read("expression_matrix.hdf5", "/data/counts")
}) 
#    user  system elapsed 
# 398.307 146.285 545.006 

## Strategy #2: Read in only a relevant subset of data using h5read. This is the 
##   method that probably works best in most situations.
system.time({
  counts1 <- h5read("expression_matrix.hdf5", "/data/counts", index = list(read_samples,NULL))
  counts1 <- t(counts1)
  subcounts1 <- as(counts1, "dgCMatrix")
})  
#    user  system elapsed 
# 270.724  36.517 307.828    
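## Note: on newer versions of the Matrix package, the direct coercion to "dgCMatrix"
##   above may raise a deprecation warning; if it does, this alternative (untested
##   here) should produce the same sparse matrix:
# subcounts1 <- Matrix::Matrix(counts1, sparse = TRUE)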

## Strategy #3: Work with the DelayedArray format before reading in (via the HDF5Array library).
##   This has the advantage of being able to manipulate the data on disk, and can read
##   the file directly into sparse matrix format and therefore requires less memory.
##   However, this method is much slower as the number of non-sequential cells increases
##   and will crash if you have too many cells (I'm not sure what the cutoff is, but I 
##   couldn't read the whole data set this way.)

system.time({
  counts2 <- HDF5Array("expression_matrix.hdf5", "/data/counts")  # Note that COLUMNS are genes
  counts2 <- t(counts2)
  subcounts2 <- as(counts2[,read_samples], "dgCMatrix")
})
#    user  system elapsed 
# 310.024  63.176 375.114 


## Assess variable sizes (optional)
sapply(ls(),function(x){object.size(get(x))})[c("counts","counts1","subcounts1","counts2","subcounts2")]
#        counts      counts1   subcounts1      counts2   subcounts2
#  145243576056    124212216     49413768         3368     49413768 


## Add gene and sample names to the data matrix
## -- Note: We'll work with the recommended strategy (#2) for the rest of the script
rownames(subcounts1) <- as.character(genes)
colnames(subcounts1) <- as.character(samples)[read_samples]
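# Optional sanity check: every retained sample should have a metadata row,
#   since use_samples was drawn from rownames(metadata)
all(colnames(subcounts1) %in% rownames(metadata))  # expect TRUE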


## Read the subsetted data and metadata into Seurat
## -- Note that this **WILL NOT WORK** for the full 1 million+ cell data set!
## -- I would strongly encourage other methods when analyzing more than ~100,000 cells at a time.
## -- Also recall that the order of the data and metadata does not match, so we reorder here.
seu <- CreateSeuratObject(counts=subcounts1)          # Put in a Seurat object
met <- as.data.frame(metadata[colnames(subcounts1),]) # Format the metadata
seu <- AddMetaData(seu,met)                           # Add the metadata
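## From here the standard Seurat workflow applies. A minimal sketch with default-style
##   parameters (untested on these data; adjust dims and resolution as needed):
seu <- NormalizeData(seu)                   # log-normalize counts
seu <- FindVariableFeatures(seu)            # select variable genes
seu <- ScaleData(seu)                       # scale and center variable genes
seu <- RunPCA(seu)                          # PCA on variable genes
seu <- FindNeighbors(seu, dims = 1:30)      # kNN graph on PCs
seu <- FindClusters(seu, resolution = 0.5)  # graph-based clustering
seu <- RunUMAP(seu, dims = 1:30)            # 2D embedding for visualization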
