Gene expression matrix.csv is too large to load

I just downloaded the human/mouse brain matrices from https://portal.brain-map.org/atlases-and-data/rnaseq and found that these two matrix.csv files are too large to load with read.csv(filepath, header=T, row.names=1). Is there any advice for loading them successfully? Actually, I only want to analyze a few cell types, such as oligodendrocytes. Is there any advice for obtaining the expression matrix for just those cells?


Hi @pommy: I would suggest trying the fread function in the data.table R library. This is typically about 100x faster than read.csv. I don’t know of a way of only loading a subset of the data matrix from a csv file.
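
For example, something like this should work (a minimal sketch, assuming the downloaded file is named matrix.csv and that its first column holds the cell/sample IDs):

library(data.table)

# fread is multi-threaded and much faster than read.csv, but the full
# matrix still has to fit in memory as a dense table
counts <- fread("matrix.csv", header = TRUE)
counts <- as.data.frame(counts)
rownames(counts) <- counts[[1]]  # fread has no row.names argument, so set them manually
counts[[1]] <- NULL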


Hi @pommy,

We recognize there are issues loading the full csv. CSV is not an ideal format for distributing data of this size, and we intend to provide more efficient options in the near future.

It’s helpful to hear your use case of accessing specific cell types.

Here are a few options we’re considering; please feel free to share which of them would work best for your needs:

  • A more efficient file format, such as HDF5 or Loom
  • Direct API access
  • Separate downloads by cell type

Hi,

Is there any update on this? Providing the data in a compressed/sparse matrix format would be useful. I can load the mouse brain matrix.csv file into R with data.table’s fread using multiple threads, but I haven’t been able to create a Seurat object after loading the file.

Thanks

Hi Bruno,

We haven’t fixed this issue yet, but we are considering a few options for addressing it in the coming months. We were hoping for exactly the kind of input you provided: a sparse/compressed matrix format for the full dataset would be really valuable (as opposed to direct API access for subsets, or separate downloads for individual cell types or subclasses).

I’ll post back here when we have a planned update on our roadmap. (The team is busy working on some exciting features for patch-seq data at the moment!)

I had the same issue, and I was able to read the 7 GB RNA-seq dataset “aibs_human_m1_10x” quite quickly while maintaining a low memory profile using the Dask library.

Here is some example code:

from dask import dataframe as dd

# install like this (according to https://docs.dask.org/en/latest/install.html#pip):
# pip install "dask[complete]"

# use dask to circumvent memory-issues, which occur according to 
# https://community.brain-map.org/t/reading-rna-seq-data-into-python/658
def read():
    return dd.read_csv(
        urlpath='data-rnaseq/aibs_human_m1_10x/matrix.csv',
        sample=256000 * 100)

FYI, here is why I provided a larger sample parameter: https://stackoverflow.com/questions/61647974/valueerror-sample-is-not-large-enough-to-include-at-least-one-row-of-data-plea

Hi Tyler,

I also believe that the matrix.csv found at Mouse Whole Cortex and Hippocampus 10x - brain-map.org is transposed. All of the other files I have used (including my own from the Cell Ranger workflow) have cells as columns and genes as rows. I understand that putting cells as rows makes the file deeper rather than wider, thus conserving space, but even using a cluster I am unable to transpose the file into the standard format that Seurat uses. Would there be any way to either upload a transposed matrix or provide instructions on how to transpose it in an ideal manner?
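
For reference, here is a rough, untested sketch of one way the transpose might be attempted without holding the full dense matrix in memory (it assumes the first column of matrix.csv holds the cell IDs and the remaining columns are genes):

library(data.table)
library(Matrix)

## Read the header once to get the gene names (the first column holds cell IDs)
genes <- colnames(fread("matrix.csv", nrows = 0))[-1]

## Count the cells cheaply by reading only the ID column
n_cells <- nrow(fread("matrix.csv", select = 1L))

## Read the file in blocks of cells, converting each block to a sparse
## genes x cells matrix, so the full dense matrix is never in memory at once
block_size <- 10000   # cells per block; adjust to available memory
starts <- seq(1, n_cells, by = block_size)
blocks <- lapply(starts, function(s) {
  chunk <- fread("matrix.csv", skip = s, nrows = min(block_size, n_cells - s + 1),
                 header = FALSE)
  m <- as(t(as.matrix(chunk[, -1, with = FALSE])), "dgCMatrix")
  dimnames(m) <- list(genes, chunk[[1]])
  m
})
counts <- do.call(cbind, blocks)   # sparse genes x cells matrix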

Thank you,

-Damien

Hello,

I would like to follow up on this question to see if there have been any developments with regard to producing a version of the data set that would be compatible with downstream analyses, for example using the Seurat package in R.

Transposing the matrix is necessary for importing the data as a Seurat object, but it results in a file size in excess of 130 GB. This means that the file becomes prohibitively large to work with, even when trying to import the data on an HPC cluster with ~1500 GiB of memory available on the node. I have tried a number of ways to work around this but without success, as e.g. reading in the data sequentially is not an option (everything must stay in memory). Converting to e.g. HDF5 is not possible either, as that file type has an inherent limit on column header metadata and in practice anything exceeding 2K columns is ruled out.

So far it has been possible to load in the transposed table as a data frame, but attempting to convert the data to a Seurat object seems to invariably result in a segfault.

Many thanks for any updates / suggestions concerning this issue.

Best wishes,
Jesse Harrison

Hi Jesse,

Sorry for the late reply.

We have been focused on other areas and have not yet made progress on this topic. However, we’ll be updating the data for both the Mouse SMART-seq and the Mouse 10x datasets to match the recent paper in Cell by Yao et al. As part of that update, I’m hoping we can share a native Seurat object to help you and others in this regard.

This challenge you’re facing is important to us, and clearly impedes your work and our mission of open science. Thank you for raising attention to this again, and I hope we can find a good solution for you and others in the near term.

Hi, since the Cell paper is out, there is a link to an updated dataset:
NeMO Data Archive Assets (nemoarchive.org)
It would be great to work directly with the updated dataset, but several colleagues and I were unable to download the data. Could you provide some advice on how to access it?

Thanks

PS: BDBag responds with the following error:
ERROR - Host data.nemoarchive.org responded:
The requested URL /biccn/lab/zeng/transcriptome/scell/SMARTer/processed/YaoHippo2020/smrt.h5 was not found on this server

Hi All,

For the Mouse Whole Cortex & Hipp 10x dataset it looks like there has been some progress, as there is now an HDF5 expression matrix. It isn’t mentioned in the readme, and it doesn’t appear to be a Loom or AnnData (.h5ad) file. Is there some guidance on the easiest way to convert it to a Seurat object?

Thanks

Hi!

To use the Mouse Whole Cortex and Hippocampus 10x dataset (2021), I downloaded the HDF5 file.

I tried to read the HDF5 file into Seurat with the command below, but it keeps failing.
allen <- Read10X_h5("…/expression_matrix.hdf5")

Can anyone help me figure out what could be wrong? I wish there were a Seurat object file available for download for this specific dataset; I have seen some other datasets with a Seurat object available for download.

Thank you,

Jiseok


I’m experiencing the same problem:

> Read10X_h5("expression_matrix.hdf5")
Error in `[[.H5File`(infile, paste0(genome, "/data")) : 
  An object with name data/data does not exist in this group

The problem persists when trying to read in the file using the hdf5r library:

> allen <- H5File$new("expression_matrix.hdf5", mode = "r")
> allen$data
NULL
> allen$file_info()
  super_ext_size sohm.hdr_size sohm.msgs_info.index_size
1              0             0                         0
  sohm.msgs_info.heap_size
1                        0
> 

I’ve managed to load the CSV using the data.table package function fread(), but attempting to coerce the object to a sparseMatrix to take up less space produces a segmentation fault.

I would be grateful for any ideas for what to try next!
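
One thing that may help with diagnosis: Read10X_h5 expects the Cell Ranger HDF5 layout (a group per genome containing datasets such as data and indices), which this file does not use. A first step is to list the file’s structure with rhdf5 and then read the named datasets directly; a minimal sketch, assuming the same file as above (the /data/... paths are what h5ls reports for this particular file):

library(rhdf5)

## List every group and dataset in the file to see how the matrix is laid out
h5ls("expression_matrix.hdf5")

## Read the gene and sample names directly once the dataset paths are known
genes   <- h5read("expression_matrix.hdf5", "/data/gene")
samples <- h5read("expression_matrix.hdf5", "/data/samples")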

Same thing here… With the Matrix package I managed to load the matrix.csv and metadata and perform the transpose, but Seurat will not load the count matrix because the sparse matrix contains too many non-zero elements.

I’ve opened an issue with the Seurat team as well.

library(data.table)
library(Seurat)

counts <- fread("mouse/matrix.csv")
setDF(counts)
rownames(counts) <- counts[[1]]
counts[[1]] <- NULL
counts <- t(as.matrix(counts))

metadata <- fread("mouse/meta.data.csv")
setDF(metadata)
rownames(metadata) <- metadata[[1]]
metadata[[1]] <- NULL

mouse <- CreateSeuratObject(counts = counts, meta.data = metadata, min.cells = 3, min.features = 200)
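
In case it helps, the “too many non zero elements” error is likely because a dgCMatrix cannot hold more than roughly 2^31 - 1 non-zero entries, which the full matrix appears to exceed. A possible workaround is to subset to the cells of interest before calling CreateSeuratObject; a rough sketch (the column name subclass_label and the value "Oligo" are guesses, so check names(metadata) for the actual cell type annotations):

## Keep only cells from one subclass before building the Seurat object
keep  <- rownames(metadata)[metadata$subclass_label == "Oligo"]
keep  <- intersect(keep, colnames(counts))
mouse <- CreateSeuratObject(counts = counts[, keep], meta.data = metadata[keep, ],
                            min.cells = 3, min.features = 200)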

Hi all,

It seems like there are a couple of unresolved issues on this thread related to the ‘Mouse Whole Cortex and Hippocampus 10x dataset (2021)’, which I will try to address (and hopefully others can improve on!); namely, 1) ‘What is the best way to read the data and metadata into R?’ and 2) ‘How can these data be loaded into Seurat?’. I’ve included an R script below which goes into a bit more detail about this. For reading the data into R, I’d suggest either the rhdf5 or the HDF5Array R libraries. Note that the count matrix in expression_matrix.hdf5 is a dense matrix, and that I was unable to convert it into a sparse matrix format using an R server with 1 TB of memory. I was also unable to transpose the full matrix (although this step is possible, since @Delong.Zhou did it).

In terms of Seurat, I was also unable to read the full matrix into Seurat. I would strongly suggest applying Seurat only to a subset of the data at a time (e.g., run glia and distinct neuronal subclasses as separate analyses). If you want to run an analysis on the full data set, there are strategies in Python that use memory more efficiently and might be a better bet (hopefully @christoph-hue or others who work with Python can comment on this?). Finally, @Delong.Zhou, if you get a solution from the Seurat team, please post the link in this thread to let everyone else know.

Best,
Jeremy

## SCRIPT FOR READING LARGE hdf5 DATA SET INTO R

## Load necessary libraries
library(rhdf5)      # For reading hdf5; https://www.bioconductor.org/packages/devel/bioc/vignettes/rhdf5/inst/doc/rhdf5.html
library(HDF5Array)  # Alternative option for reading the data matrix with less memory: https://rdrr.io/github/Bioconductor/HDF5Array/
library(data.table) # For fast reading of csv files
library(Seurat)


## Read in sample and gene names for use later
genes   <- h5read("expression_matrix.hdf5","/data/gene")
samples <- h5read("expression_matrix.hdf5","/data/samples")


## Read in the metadata using fread (fast!)
metadata <- fread("metadata.csv")
metadata <- as.data.frame(metadata)
rownames(metadata) <- metadata$sample_name
# Note that the order of metadata and counts and the number of cells are DIFFERENT 
#   (there are 107 cells with no metadata). This will be critical later!


## Subsample your data
## -- If you want to analyze these data in R and especially using Seurat, you will
## --   have a better chance of success if you only work on parts of the data at a
## --   time. We suggest using information in the metadata (e.g., cell type or
## --   brain region columns) to subset.
## -- For this example code, I'm selecting 1000 random cells and checking overlap with samples
set.seed(42)
use_samples  <- intersect(sample(rownames(metadata),1000),samples)
read_samples <- sort(match(use_samples,samples))


## Read in the count matrix in one of three ways

## Strategy #1: Read in the entire data set using h5read. The data set is ~145GB total,
##   and as far as I can tell there is no way to convert it into a sparse matrix, transpose
##   it, or read it into Seurat without subsetting (at least with the 1TB of RAM my
##   computer has). It may be possible with Python or if you have more memory.

system.time({
  counts <- h5read("expression_matrix.hdf5", "/data/counts")
}) 
#    user  system elapsed 
# 398.307 146.285 545.006 

## Strategy #2: Read in only a relevant subset of data using h5read. This is the 
##   method that probably works best in most situations.
system.time({
  counts1 <- h5read("expression_matrix.hdf5", "/data/counts", index = list(read_samples,NULL))
  counts1 <- t(counts1)
  subcounts1 <- as(counts1, "dgCMatrix")
})  
#    user  system elapsed 
# 270.724  36.517 307.828    

## Strategy #3: Work with DelayedArray format before reading in (via HDF5Array library).  
##   This has the advantage of being able to manipulate the data on disk, and can read
##   the file directly into sparse matrix format and therefore requires less memory.
##   However, this method is much slower as the number of non-sequential cells increases
##   and will crash if you have too many cells (I'm not sure what the cutoff is, but I 
##   couldn't read the whole data set this way.)

system.time({
  counts2 <- HDF5Array("expression_matrix.hdf5", "/data/counts")  # Note that COLUMNS are genes
  counts2 <- t(counts2)
  subcounts2 <- as(counts2[,read_samples], "dgCMatrix")
})
#    user  system elapsed 
# 310.024  63.176 375.114 


## Assess variable sizes (optional)
sapply(ls(),function(x){object.size(get(x))})[c("counts","counts1","subcounts1","counts2","subcounts2")]
#        counts      counts1   subcounts1      counts2   subcounts2
#  145243576056    124212216     49413768         3368     49413768 


## Add gene and sample names to the data matrix
## -- Note: We'll work with the recommended strategy (#2) for the rest of the script
rownames(subcounts1) <- as.character(genes)
colnames(subcounts1) <- as.character(samples)[read_samples]


## Read the subsetted data and metadata into Seurat
## -- Note that this **WILL NOT WORK** for the full 1 Million+ cell data set!
## -- I would strongly encourage other methods for analysis when dealing with more than ~100,000 cells at a time
## -- Also recall that the order of data and meta do not match, so reorder here
seu <- CreateSeuratObject(counts=subcounts1)          # Put in a Seurat object
met <- as.data.frame(metadata[colnames(subcounts1),]) # Format the metadata
seu <- AddMetaData(seu,met)                           # Add the metadata


In my code I included the following lines:

library(unix)
rlimit_as(1e13)

This is supposed to increase the memory limit allocated to R, although I’ve read somewhere that the latest versions of R allocate memory dynamically, which would render these two lines obsolete.

I was using a supercomputer provided by Compute Canada, more specifically Beluga. The peak memory use for me was 450 GB, but I fear that memory usage would increase by a large margin for the dimensionality reduction steps.

So indeed I would also recommend subsetting the data by region or cell type.

I was only working with ABA data during an internship, and reading the RNA-seq data was simply a proof of concept I did as a side task (see: aba-analysis/HumanRnaSeq.py at master · christoph-hue/aba-analysis · GitHub). I did no downstream analysis of the RNA-seq data, so unfortunately I can’t comment beyond that.

I would refer to discussions such as this one, where the emphasis is on picking a proper library regardless of the language: https://www.quora.com/Is-R-or-Python-a-better-programming-language-for-big-data

@jeremyinseattle Oh my goodness, thank you SO MUCH for the script.
Your script ran without any issue for me (using strategy #2, it took about 5 minutes) and I successfully got a small Seurat object (31053 x 1000) in the end. Now I can use this small subset as a reference to annotate my single-cell data. Maybe for that purpose I might have to sample more than 1000 cells? This was VERY VERY helpful!!! :star_struck: :+1:
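
If you do need more cells, one option is to extend strategy #2 from the script above by sampling per cell type rather than purely at random; a sketch (subclass_label is a guess for the cell type column, so check names(metadata)):

## Sample up to 200 cells from each subclass instead of 1000 cells overall
set.seed(42)
per_type     <- 200
by_type      <- split(rownames(metadata), metadata$subclass_label)
use_samples  <- unlist(lapply(by_type, function(x) sample(x, min(per_type, length(x)))))
use_samples  <- intersect(use_samples, samples)
read_samples <- sort(match(use_samples, samples))
## ...then continue with strategy #2: h5read(..., index = list(read_samples, NULL))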


Thanks for the discussion and scripts here; we eventually figured out how to extract scRNA-seq data for specific cell types from the large h5 file for the data from the Cell paper.

Now, I’m wondering how to do that for the new Allen Brain Cell Atlas?

The metadata (https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/metadata/WMB-10X/20230830/cell_metadata.csv) seem to lack information that connects the cell types and cell IDs/labels, or did I miss something?

As shown here (ABC Atlas: August Update Notes), it is possible to manually extract some cell IDs together with cell type information, but the cell IDs seem to be in a different format than in the metadata file. Or are they related in some way?

If possible, it would be great to have a list of cell IDs together with the corresponding cell type (cluster, supertype, subclass) as for the previous data.

@bob Thanks for your question. Unfortunately it is not possible to match the IDs of cells extracted from the ABC Atlas Cell Selection download using the metadata files. We agree this would be very useful and are working on getting a file together to help you connect the cell types and cell IDs. I will follow up here when this is available. Thank you for providing this feedback! Please don’t hesitate to request other features or data that you need to help us improve the resources we provide.