Gene expression matrix.csv is too large to load

I just downloaded the human and mouse brain matrices from https://portal.brain-map.org/atlases-and-data/rnaseq and found that the two matrix.csv files are too large to load with read.csv(filepath, header=T, row.names=1). Is there any advice for loading them successfully? Actually, I only want to analyze certain cell types, such as oligodendrocytes. Is there a way to get the expression matrix for just those cells?

Hi @pommy: I would suggest trying the fread function in the data.table R library. This is typically about 100x faster than read.csv. I don’t know of a way of only loading a subset of the data matrix from a csv file.
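In Python, at least, pandas can restrict a CSV read to selected columns via `usecols`, which helps when the genes or cells of interest map to specific columns. A minimal sketch using a small in-memory synthetic file (the real matrix.csv's layout, gene names, and id column are assumptions here and may differ):

```python
import io
import pandas as pd

# Tiny synthetic stand-in for matrix.csv; the real file's layout
# (genes as rows vs. columns) may differ, so adjust accordingly.
csv_text = (
    "sample_name,GeneA,GeneB,GeneC\n"
    "cell1,0,3,1\n"
    "cell2,2,0,5\n"
)

# Read only the header row to discover the column names cheaply.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()

# Then load just the columns of interest.
wanted = ["sample_name", "GeneB"]
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset.shape)  # (2, 2)
```

For the real file you would pass the path instead of the `StringIO` buffer; combining `usecols` with `chunksize` keeps peak memory low.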

Hi @pommy,

We recognize there are some issues loading the full CSV. CSV is not an ideal format for distributing data of this size, and we intend to offer more efficient options in the near future.

It’s helpful to hear your use case of accessing specific cell types.

Here are a few options we’re considering, please feel free to share which of them would work best for your needs:

  • An efficient file format, such as HDF5 or Loom
  • Direct API access
  • Separate downloads by cell type
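For context on the first option: HDF5 makes it cheap to read arbitrary slices of a matrix without loading the whole thing. A minimal round-trip sketch with h5py (the file name and the dataset name "expr" are made up for illustration):

```python
import numpy as np
import h5py

# Write a small dense matrix to HDF5 (gzip-compressed), then read
# back a single column; HDF5 reads only the requested slice from disk.
matrix = np.arange(12, dtype=np.float32).reshape(3, 4)

with h5py.File("demo_expression.h5", "w") as f:
    f.create_dataset("expr", data=matrix, compression="gzip")

with h5py.File("demo_expression.h5", "r") as f:
    col = f["expr"][:, 1]  # one "gene" column, without loading the rest

print(col)  # [1. 5. 9.]
```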

Hi,

Is there any update on this? Providing the data in a compressed/sparse matrix format would be useful. I can load the mouse brain matrix.csv file into R with data.table’s fread using multiple threads, but I haven’t been able to create a Seurat object after loading the file.

Thanks

Hi Bruno,

We haven’t fixed this issue yet, but we are considering a few options for addressing it in the coming months. Your message is exactly the kind of input we were hoping for: a sparse/compressed matrix format with the full dataset would be really valuable (as opposed to direct API access for subsets, or separate downloads for individual cell types or subclasses).
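For anyone hitting this in the meantime, a sparse round trip in Python might look like the following sketch, using SciPy's Matrix Market (.mtx) support (the same on-disk format Cell Ranger exports); the toy matrix here is made up:

```python
import io
import numpy as np
from scipy import sparse
from scipy.io import mmwrite, mmread

# Expression matrices are mostly zeros, so a sparse (CSR) layout plus a
# sparse on-disk format stores only the non-zero entries.
dense = np.array([[0, 3, 0],
                  [2, 0, 0],
                  [0, 0, 5]])
csr = sparse.csr_matrix(dense)

buf = io.BytesIO()          # stands in for a .mtx file on disk
mmwrite(buf, csr)           # write Matrix Market text format
buf.seek(0)
roundtrip = mmread(buf).tocsr()

print(roundtrip.nnz)        # 3 stored values instead of 9
```

An .mtx file produced this way can be read into R with Matrix::readMM and handed to Seurat's CreateSeuratObject.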

I’ll post back here when we have a planned update on our roadmap. (The team is busy working on some exciting features for patch-seq data at the moment!)

I had the same issue, and I was able to read the 7 GB RNA-seq dataset “aibs_human_m1_10x” quite quickly while maintaining a low memory profile using the Dask library.

Here is some example code:

from dask import dataframe as dd

# install like this (according to https://docs.dask.org/en/latest/install.html#pip):
# pip install "dask[complete]"

# use dask to circumvent memory issues, which occur according to
# https://community.brain-map.org/t/reading-rna-seq-data-into-python/658
def read():
    return dd.read_csv(
        urlpath='data-rnaseq/aibs_human_m1_10x/matrix.csv',
        sample=256000 * 100)

FYI, here is why I provided a larger sample parameter: https://stackoverflow.com/questions/61647974/valueerror-sample-is-not-large-enough-to-include-at-least-one-row-of-data-plea

Hi Tyler,

I also believe that the matrix.csv found at Mouse Whole Cortex and Hippocampus 10x - brain-map.org is transposed. All of the other files I have used (including my own from the Cell Ranger workflow) have patients as columns and genes as rows. I understand that putting patients as rows makes the file longer rather than wider, thus conserving space, but even using a cluster I am unable to transpose the file into the standard format that Seurat uses. Would there be any way to either upload a transposed matrix or provide instructions on how to transpose it efficiently?
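One possible workaround until then: stream the CSV in chunks, accumulate a sparse matrix, and transpose it in memory; transposing a sparse matrix is essentially free because it only reinterprets indices rather than copying data. A hedged Python sketch with a tiny synthetic cells-by-genes file (the real matrix.csv is much larger, and its id column name is an assumption):

```python
import io
import pandas as pd
from scipy import sparse

# Synthetic cells-by-genes CSV standing in for matrix.csv.
csv_text = (
    "sample_name,GeneA,GeneB\n"
    "cell1,0,3\n"
    "cell2,2,0\n"
    "cell3,0,5\n"
)

# Read in chunks so only one chunk is dense in memory at a time,
# converting each to sparse before reading the next.
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2):
    chunks.append(sparse.csr_matrix(chunk.values))

cells_by_genes = sparse.vstack(chunks)

# Transpose to the genes-by-cells orientation Seurat expects.
genes_by_cells = cells_by_genes.T.tocsr()

print(genes_by_cells.shape)  # (2, 3)
```

From there the transposed matrix could be written out with scipy.io.mmwrite and loaded into R via Matrix::readMM.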

Thank you,

-Damien