Thank you for your detailed tutorial and support. I was able to retrieve my genes of interest from the WHB-10Xv3 20241115 dataset.
Since the dataset contains over 3 million cells, I’m looking to filter for subsets based on QC metrics and homogeneity within each cell type. Is there mapping data available for each cell, such as metrics equivalent to Seurat’s nCount and nFeature parameters?
As a follow-up, the nCount and nFeature parameters can be calculated in a straightforward way from the data itself.
I believe nCount is just the total counts per cell and nFeature is the number of genes detected, both of which could be calculated in a single line of code in R (and, I assume, Python).
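Just to make that concrete, here is a minimal sketch in Python; the toy matrix X is only a stand-in for a real cells-by-genes raw-count matrix (e.g. adata.X loaded from one of the raw h5ad files):

import numpy as np
import scipy.sparse as sp

# toy stand-in for a real cells-by-genes raw-count matrix
X = sp.random(5, 10, density=0.3, format='csr')

# nCount: total counts per cell
n_count = np.asarray(X.sum(axis=1)).ravel()

# nFeature: number of genes with non-zero counts per cell
n_feature = np.asarray((X > 0).sum(axis=1)).ravel()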
I started learning your notebook tutorial last week. It looks like get_gene_data() points to log2(CPM+1) data by default when creating an expression data frame. We need to use raw data for nCount and nFeature, right? Do I need to create an all-genes-by-all-cells data matrix first in order to run rowSums? Is there a more straightforward or lightweight approach?
I’m splitting the 3M+ cells into chunks of 50K and looping through get_gene_data() to calculate nCount and nFeature for each chunk. After a full day of processing, I’m only about halfway done.
You must have generated these metrics during your initial data processing. It would be really helpful if they could be included in cell_metadata in future releases, so others don’t have to go through this expensive step again.
The get_gene_data() function is meant to select specific genes from the cell-by-gene array. This is admittedly quite slow, since the data is stored as a sparse matrix in CSR format, which is optimized for slicing by rows (cells) rather than by columns (genes). If, as Jeremy suggested, you really just want to sum up the number of counts per cell and the number of genes with non-zero counts per cell, you can do that pretty quickly using anndata’s own API:
import anndata
import numpy as np
import pathlib
import time

from abc_atlas_access.abc_atlas_cache.abc_project_cache import (
    AbcProjectCache
)

cache_dir = pathlib.Path(
    "/path/to/abc/cache/dir"
)
assert cache_dir.is_dir()
cache = AbcProjectCache.from_cache_dir(cache_dir)

# get path to the h5ad file containing neurons
neuron_path = cache.get_data_path(
    directory='WHB-10Xv3',
    file_name='WHB-10Xv3-Neurons/raw'
)

src = anndata.read_h5ad(neuron_path, backed='r')

t0 = time.time()

# instantiate an iterator that will return the
# cell-by-gene data 50,000 cells at a time
row_iterator = src.chunked_X(50000)

# create empty arrays to contain the n_features
# and n_counts results
n_features = np.zeros(src.shape[0], dtype=int)
n_counts = np.zeros(src.shape[0], dtype=int)

# iterate over chunks
for chunk in row_iterator:
    # 'rehydrate' sparse array as a dense numpy array
    chunk_arr = chunk[0].toarray()

    # transcribe n_features and n_counts for this chunk to the
    # appropriate slots in the empty arrays
    n_features[chunk[1]:chunk[2]] = (chunk_arr > 0).sum(axis=1)
    n_counts[chunk[1]:chunk[2]] = chunk_arr.sum(axis=1)

    print(
        f'processed cells {chunk[1]}:{chunk[2]} after {time.time()-t0} seconds'
    )
The script above loops over all the rows in the Neurons-raw.h5ad file (you can get the same result for the non-neuronal cells by changing the file_name kwarg in the invocation of cache.get_data_path()). It processes 50,000 cells in about 13 seconds, so you should be able to get through all 3 million cells in about 10 minutes (again: assuming Jeremy was correct about what you want to calculate).
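For example, something along these lines should point at the non-neuronal file; the exact file_name string is my assumption, and cache.list_data_files('WHB-10Xv3') will show the names that actually exist:

# NOTE: the file_name below is an assumption; confirm the available
# names with cache.list_data_files('WHB-10Xv3')
nonneuron_path = cache.get_data_path(
    directory='WHB-10Xv3',
    file_name='WHB-10Xv3-Nonneurons/raw'
)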
Note: the timing estimate above does not include the amount of time it will take to download the data (though that should not be more than an hour or two the first time you run this code).
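If it helps with the QC filtering, here is a minimal sketch (my own addition, not part of the script above) of one way to attach the two arrays to the cell labels and save them, assuming src, n_counts, n_features, and cache_dir from the script above:

import pandas as pd

# build a per-cell QC table keyed by the cell labels in the h5ad file
qc = pd.DataFrame(
    {'n_counts': n_counts, 'n_features': n_features},
    index=src.obs.index
)

# write it out so this step only has to be run once
qc.to_csv(cache_dir / 'WHB-10Xv3-Neurons-qc.csv')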