When I try to import the get_gene_data function as instructed in the tutorial: >> from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data
I receive the following error:
ModuleNotFoundError: No module named 'abc_atlas_access.abc_atlas_cache.anndata_utils'
It seems this function has been moved or is no longer accessible at this path in the latest version of the library.
Are you running this code in a notebook, or from a terminal window?
When I run the commands you are running from a terminal window, everything works. I am able to import get_gene_data from the prescribed location.
Is it possible you need to restart your notebook kernel after running the !pip install… command so that the notebook can detect the newly installed software? Are you able to successfully run
import abc_atlas_access
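If that works, one quick way to confirm which installation the kernel is actually using (just a generic diagnostic, not something from the tutorial) is:
import abc_atlas_access
# Show where the imported package lives; if this points at a stale install,
# restart the kernel after the !pip install command and try again.
print(abc_atlas_access.__file__)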
Do you mind providing a link to the notebook you are using, so that I can see exactly what you are seeing?
Hi @danielsf, thanks so much for your reply! You were right: it seems to have been a kernel issue. After restarting the notebook kernel, the ModuleNotFoundError was resolved, and I was able to import the get_gene_data function correctly.
I have a follow-up question regarding the memory (RAM) usage of the get_gene_data function when running in a Google Colab notebook.
I’m observing that it consumes a very large amount of RAM, which scales with the number of genes in my list.
With a list of ~150 genes, the process requires nearly 50GB of RAM to complete.
When I increase the list to ~500 genes, the notebook runs out of memory and crashes.
I tried to fix this by reducing the chunk_size, hoping it would process the data in smaller, more manageable pieces:
expression_df_background = get_gene_data(
    abc_atlas_cache=abc_cache,
    all_cells=cell_metadata,
    all_genes=gene_metadata,
    selected_genes=background_symbols,  # My list of ~500 genes
    data_type='log2',
    chunk_size=1024
)
However, even with this change, the process became extremely slow (running for ~1 hour) and then crashed again.
My question is: What is the recommended way to load expression data for a larger list of genes (e.g., 500+) in a memory-constrained environment like Google Colab? Is there a parameter I’m missing or a different approach I should be taking to process this data in chunks without exhausting all my RAM?
The reason this is happening is that the get_gene_data function reads the data into an empty pandas DataFrame. However, it creates that empty DataFrame without giving pandas any indication of what dtype to expect (ints, floats, strings, something else…). This causes pandas to handle memory very inefficiently as it thrashes about, trying to determine what it is being given.
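To illustrate why the missing dtype matters, here is a minimal sketch (not the actual get_gene_data internals):
import numpy as np
import pandas as pd

n_cells, n_genes = 100_000, 50
genes = [f"gene_{i}" for i in range(n_genes)]
values = np.random.rand(n_cells, n_genes).astype(np.float32)

# With no dtype hint, every column of the empty frame starts out as `object`,
# so each value filled in later is stored as a boxed Python float.
df_untyped = pd.DataFrame(index=np.arange(n_cells), columns=genes)
df_untyped.loc[:, genes] = values

# Declaring the dtype up front keeps the columns as compact float32.
df_typed = pd.DataFrame(index=np.arange(n_cells), columns=genes, dtype=np.float32)
df_typed.loc[:, genes] = values

print(df_untyped.memory_usage(deep=True).sum() / 1e6, "MB")  # object columns
print(df_typed.memory_usage(deep=True).sum() / 1e6, "MB")    # float32 columns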
We have a bugfix that addresses this. It “learns” the dtype from the data you are loading and tells pandas to expect that. I ran a test with this on our Whole Human dataset: a process that had previously consumed >= 50 GB before I killed it without finishing (I was running on a workstation with only 64 GB of memory) ran to completion in 20 minutes, only ever using ~13 GB at most.
That bugfix should be merged later this week. I will post here to let you know when it is merged.
awesome!
Thank you so much for the quick reply and the clear explanation, Scott. Having a faster version of this function is a huge help!
The current workaround I found is splitting my gene list into several smaller batches (e.g., for ~500 genes, I split them into 4-5 batches of ~100 genes each).
I then loop through these batches, calling get_gene_data for each one, saving the large (3.3M x ~100) DataFrame to Parquet with df_batch.to_parquet(file_path), and then manually clearing the memory with del df_batch and gc.collect() before processing the next batch (roughly sketched below). haha!
This process works for managing memory, but as you can imagine, it is very time-consuming.
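For reference, the loop looks roughly like this (a rough sketch; the batch size, file names, and variables like background_symbols are from my own setup):
import gc
import numpy as np
from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data

batch_size = 100
n_batches = int(np.ceil(len(background_symbols) / batch_size))

for i, batch in enumerate(np.array_split(np.array(background_symbols), n_batches)):
    df_batch = get_gene_data(
        abc_atlas_cache=abc_cache,
        all_cells=cell_metadata,
        all_genes=gene_metadata,
        selected_genes=list(batch),
        data_type='log2'
    )
    # Persist the (3.3M x ~100) batch to disk, then free the memory
    # before loading the next batch.
    df_batch.to_parquet(f"expression_batch_{i}.parquet")
    del df_batch
    gc.collect()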
The reason time efficiency is so important for me is that I need to run a large-scale null permutation analysis (10,000 iterations). To do this, I first need to build a “master” enrichment matrix for my entire background gene pool. With the current function, processing this many genes in batches will take many hours.
The bugfix you described sounds like it will be a perfect solution for this case.
Thanks again, and I’ll look forward to the update!
Well, I have a background set of ~1300 genes. From this set, I’ve selected 120 genes for cell enrichment analysis. To assess the statistical significance of my results, I plan to perform a null permutation test: randomly selecting, for example, 10,000 sets of genes (each set containing 120 genes) from the initial ~1300-gene pool, performing the same enrichment analysis on each randomized set, and deriving a p-value from the resulting null distribution.
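In code, generating the null gene sets would look something like this (a sketch; background_genes stands for my ~1300-gene background pool):
import numpy as np

rng = np.random.default_rng(0)
n_permutations = 10_000
set_size = 120

# Draw 10,000 random 120-gene sets (without replacement within each set)
# from the ~1300-gene background pool.
null_sets = [
    rng.choice(background_genes, size=set_size, replace=False)
    for _ in range(n_permutations)
]
# Each null set then goes through the same enrichment analysis as the real
# 120-gene set; the p-value is the fraction of null scores at least as
# extreme as the observed one.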
The bugfix I mentioned is live now, so if you update your installation of abc_atlas_access you should get it.
I’m not 100% sure that get_gene_data is what you want in this case. get_gene_data is designed to read in the gene data for all of the cells and produce a DataFrame that is held in memory. 3.3 million cells by 1300 genes represented as 64-bit floats would come out to roughly 32 GB without any overhead. Obviously, this would be halved if the data were represented as 32-bit floats.
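For reference, the back-of-the-envelope arithmetic behind that estimate:
n_cells = 3_300_000
n_genes = 1300
print(n_cells * n_genes * 8 / 2**30)  # ~32 GiB as 64-bit floats
print(n_cells * n_genes * 4 / 2**30)  # ~16 GiB as 32-bit floats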
If all you really want to do is write the data to a parquet file, you might be better off looping over the h5ad files and using anndata’s own API to grab the data you want and write it to disk one chunk at a time. The code below loops over all of the Whole Mouse Brain 10Xv3 expression matrices (about 2.3 million cells) and writes 1300 genes to a parquet file. It ran in 18 minutes and only ever needed 18 GB of memory. You should be able to control the memory footprint by changing cell_chunk_size to read fewer cells into memory at a time (the costly part is when you have to read a (cell_chunk_size, all_genes) array into memory before slicing it down to just the genes you want).
I’m just posting it here in case it would be helpful. If get_gene_data works for you, go ahead and run with that.
import anndata
import numpy as np
import pathlib
import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
import time
import abc_atlas_access.abc_atlas_cache.abc_project_cache as cache_module
t0 = time.time()
cache_dir = pathlib.Path(
    "path/to/abc/atlas/cache/directory"
)
abc_cache = cache_module.AbcProjectCache.from_cache_dir(
    cache_dir
)
metadata_dir = "WMB-10X"
gene_df = abc_cache.get_metadata_dataframe(
    directory=metadata_dir,
    file_name="gene"
)
# randomly choose 1300 genes to sample
rng = np.random.default_rng(22131)
chosen_gene_idx = np.sort(
    rng.choice(
        np.arange(len(gene_df)),
        1300,
        replace=False
    )
)
chosen_genes = gene_df['gene_identifier'].values[chosen_gene_idx]
# get the paths to all h5ad files with raw (as opposed to log2
# normalized) expression values
data_dir = "WMB-10Xv3"
files_to_read = [
    pth for pth in abc_cache.list_data_files(directory=data_dir)
    if pth.endswith('raw')
]
file_path_list = [
    abc_cache.get_data_path(
        directory=data_dir,
        file_name=name
    )
    for name in files_to_read
]
# number of cells to read in at once
cell_chunk_size = 100000
# define schema for parquet writer
expression_dtype_np = np.float32
expression_dtype_pa = pa.float32()
schema = pa.schema(
    [pa.field(gene, expression_dtype_pa) for gene in chosen_genes]
    + [pa.field('cell_label', pa.string())]
)
dst_path = "/local1/scott_daniel/scratch/enrichment.parquet"
total_cells = 0
with pq.ParquetWriter(dst_path, schema=schema) as writer:
    for file_path in file_path_list:
        print(f'READING {file_path.name}')
        src = anndata.read_h5ad(file_path, backed='r')
        obs = src.obs
        chunk_iterator = src.chunked_X(cell_chunk_size)
        local_cells = 0
        for chunk, r0, r1 in chunk_iterator:
            cell_id_chunk = obs.index.values[r0:r1]
            chunk = chunk.toarray()[:, chosen_gene_idx]
            total_cells += chunk.shape[0]
            local_cells += chunk.shape[0]
            print(f'    {local_cells} cells of {len(obs)}')
            chunk = pd.DataFrame(
                chunk,
                columns=chosen_genes,
                index=cell_id_chunk,
                dtype=expression_dtype_np
            )
            chunk['cell_label'] = cell_id_chunk
            writer.write_table(
                pa.Table.from_pandas(
                    chunk,
                    preserve_index=False,
                    schema=writer.schema
                )
            )
        src.file.close()
        del src
print(f'WROTE {dst_path}')
print(f'{total_cells} total cells')
dur = (time.time()-t0)/60.0
print(f'that took {dur:.2e} minutes')
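One nice property of the parquet output is that it is columnar, so for a downstream permutation you could read back only the genes you need for a given null set rather than the whole matrix, e.g. something like:
import pyarrow.parquet as pq

# Load just one 120-gene subset (plus the cell labels) from the parquet file.
subset_genes = list(chosen_genes[:120])
subset_df = pq.read_table(
    dst_path,
    columns=subset_genes + ['cell_label']
).to_pandas()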