Thank you for your detailed tutorial and support. I was able to retrieve my genes of interest from the WHB-10Xv3 20241115 dataset.
Since the dataset contains over 3 million cells, I’m looking to filter for subsets based on QC metrics and homogeneity within each cell type. Is there mapping data available for each cell, such as metrics equivalent to Seurat’s nCount and nFeature parameters?
As a follow-up, the nCount and nFeature parameters can be calculated in a straightforward way from the data itself.
I believe nCount is just the total counts per cell and nFeature is the number of genes detected, both of which could be calculated in a single line of code in R (and, I assume, Python).
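Just to make that concrete, here is a minimal sketch in Python; the toy matrix X is only a stand-in for a real cells-by-genes raw-count matrix (e.g. adata.X loaded from one of the raw h5ad files):

import numpy as np
import scipy.sparse as sp

# toy stand-in for a real cells-by-genes raw-count matrix
X = sp.random(5, 10, density=0.3, format='csr')

# nCount: total counts per cell
n_count = np.asarray(X.sum(axis=1)).ravel()

# nFeature: number of genes with non-zero counts per cell
n_feature = np.asarray((X > 0).sum(axis=1)).ravel()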
I started learning your notebook tutorial last week. It looks like get_gene_data() points to log2(CPM+1) data by default when creating an expression data frame. We need to use raw data for nCount and nFeature, right? Do I need to create an all-genes-by-all-cells data matrix first in order to run rowSums? Is there a more straightforward or lightweight approach?
I’m splitting the 3M+ cells into chunks of 50K and looping through get_gene_data() to calculate nCount and nFeature for each chunk. After a full day of processing, I’m only about halfway done.
You must have generated these metrics during your initial data processing. It would be really helpful if they could be included in cell_metadata in future releases, so others don’t have to go through this expensive step again.
The get_gene_data() function is meant to select specific genes from the cell-by-gene array. This is admittedly quite slow, since the data is stored as a sparse matrix in CSR format, which is optimized for slicing by rows (cells) rather than by columns (genes). If, as Jeremy suggested, you really just want to sum up the number of counts per cell and the number of genes with non-zero counts per cell, you can do that pretty quickly using anndata’s own API:
import anndata
import numpy as np
import pathlib
import time

from abc_atlas_access.abc_atlas_cache.abc_project_cache import (
    AbcProjectCache
)

cache_dir = pathlib.Path(
    "/path/to/abc/cache/dir"
)
assert cache_dir.is_dir()
cache = AbcProjectCache.from_cache_dir(cache_dir)

# get path to the h5ad file containing neurons
neuron_path = cache.get_data_path(
    directory='WHB-10Xv3',
    file_name='WHB-10Xv3-Neurons/raw'
)

src = anndata.read_h5ad(neuron_path, backed='r')

t0 = time.time()

# instantiate an iterator that will return the
# cell-by-gene data 50,000 cells at a time
row_iterator = src.chunked_X(50000)

# create empty arrays to contain the n_features
# and n_counts results
n_features = np.zeros(src.shape[0], dtype=int)
n_counts = np.zeros(src.shape[0], dtype=int)

# iterate over chunks
for chunk in row_iterator:
    # 'rehydrate' sparse array as a dense numpy array
    chunk_arr = chunk[0].toarray()

    # transcribe n_features and n_counts for this chunk to the
    # appropriate slots in the empty arrays
    n_features[chunk[1]:chunk[2]] = (chunk_arr > 0).sum(axis=1)
    n_counts[chunk[1]:chunk[2]] = chunk_arr.sum(axis=1)

    print(
        f'processed cells {chunk[1]}:{chunk[2]} after {time.time()-t0} seconds'
    )
The script above loops over all the rows in the Neurons-raw.h5ad file (you can get the same result for the non-neuronal cells by changing the file_name kwarg in the invocation of cache.get_data_path()). It processes 50,000 cells in about 13 seconds, so you should be able to get through all 3 million cells in about 10 minutes (again: assuming Jeremy was correct about what you want to calculate).
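For example, something along these lines should point at the non-neuronal file; the exact file_name string is my assumption, and cache.list_data_files('WHB-10Xv3') will show the names that actually exist:

# NOTE: the file_name below is an assumption; confirm the available
# names with cache.list_data_files('WHB-10Xv3')
nonneuron_path = cache.get_data_path(
    directory='WHB-10Xv3',
    file_name='WHB-10Xv3-Nonneurons/raw'
)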
Note: the timing estimate above does not include the amount of time it will take to download the data (though that should not be more than an hour or two the first time you run this code).
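If it helps with the QC filtering, here is a minimal sketch (my own addition, not part of the script above) of one way to attach the two arrays to the cell labels and save them, assuming src, n_counts, n_features, and cache_dir from the script above:

import pandas as pd

# build a per-cell QC table keyed by the cell labels in the h5ad file
qc = pd.DataFrame(
    {'n_counts': n_counts, 'n_features': n_features},
    index=src.obs.index
)

# write it out so this step only has to be run once
qc.to_csv(cache_dir / 'WHB-10Xv3-Neurons-qc.csv')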