SEA-AD Cell types transcriptomic comparative viewer

I would like to access the underlying data used to make the dotplot in Allen Brain Map.

It’s not practical to explore all possible genes through the web. Is there a way to download this data?


The raw data used to generate the dotplots can be downloaded from AWS here: AWS S3 Explorer. After reading in the AnnData object with scanpy, we used sc.get.obs_df() to pull each gene of interest (normalized expression values are stored in .X), its subclass or supertype, and the metadata variable of interest (e.g. Cognitive Status) into a single data frame. We then grouped by the combination of metadata and subclass/supertype using pandas and calculated, within each group, the fraction of non-zero values and the mean of the non-zero values. Hope this helps!

Thanks so much Kyle.

This is very helpful.

Would it be possible to share a sample python script using scanpy? It will be deeply appreciated.

Sure, something like this should work for a single gene/metadata combination. You’ll have to iterate with for loops over all genes/metadata you’re interested in, so some parallelization may be beneficial (I’d recommend something like joblib.Parallel).

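import scanpy as sc
import pandas as pd

# Path to the downloaded .h5ad file (elided in the original post)
adata = sc.read_h5ad(...)
i = "Cognitive Status"  # metadata variable of interest
j = "APOE"  # gene of interest
splitby = "subclass"  # cell type column; check adata.obs.columns for the exact name
# Pull the metadata, the gene's expression and the cell type into one data frame
df = sc.get.obs_df(adata, [i, j, splitby])
# Fraction of cells in each group with non-zero expression
df["fraction_expressed"] = df[j] > 0
fraction = df.loc[:, ["fraction_expressed", i, splitby]].groupby([i, splitby]).mean()
# Mean expression among expressing cells only; groups with none are set to 0
expression = df.loc[df[j] != 0, [j, i, splitby]].groupby([i, splitby]).mean().fillna(0)
# Combine both statistics into a single table
df = pd.concat([fraction, expression], axis=1).reset_index()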
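
To extend the single-gene script above to several genes, the grouping step can be wrapped in a function and called once per gene. This is only a minimal sketch, not the code used for the website: `dotplot_stats` is a hypothetical helper name, and the toy data frame (made-up values) stands in for the output of `sc.get.obs_df(adata, [metadata, *genes, splitby])`.

```python
import pandas as pd


def dotplot_stats(df, gene, metadata, splitby):
    """Per-group fraction of expressing cells and mean expression among them."""
    keys = [metadata, splitby]
    tmp = df[keys].copy()
    expressing = df[gene] > 0
    # 1.0 for expressing cells, 0.0 otherwise; the group mean is the fraction.
    tmp["fraction_expressed"] = expressing.astype(float)
    # NaN for non-expressing cells, so the group mean skips them.
    tmp["mean_expression"] = df[gene].where(expressing)
    out = tmp.groupby(keys, observed=True)[
        ["fraction_expressed", "mean_expression"]
    ].mean()
    # Groups with no expressing cells have no mean; report 0 as in the script.
    out["mean_expression"] = out["mean_expression"].fillna(0)
    out.insert(0, "gene", gene)
    return out.reset_index()


# Toy stand-in for sc.get.obs_df(...) output; all values are invented.
toy = pd.DataFrame({
    "Cognitive Status": ["Dementia", "Dementia", "No dementia", "No dementia"],
    "Subclass": ["Astrocyte", "Astrocyte", "Astrocyte", "Astrocyte"],
    "APOE": [2.0, 0.0, 1.0, 3.0],
    "CCND1": [0.0, 0.0, 0.0, 1.5],
})

genes = ["APOE", "CCND1"]
result = pd.concat(
    [dotplot_stats(toy, g, "Cognitive Status", "Subclass") for g in genes],
    ignore_index=True,
)
print(result)
```

The list comprehension is the part that could be handed to joblib.Parallel if the gene list is long; the sketch keeps it serial to stay self-contained.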

Hi Kyle,

Thanks for the sample script.

I tried using it and ran into an error.

My code:

file = "SEAAD_MTG_RNAseq_final-nuclei.2022-08-18.h5ad"
adata = sc.read_h5ad(file)
cog_status = "Cognitive Status"
gene = "APOE"
splitby = "subclass"
df = sc.get.obs_df(adata, [cog_status, gene, splitby])


KeyError: "Could not find keys '['subclass']' in columns of adata.obs or in adata.var_names."

I think we saved subclass as "subclass_label". You can check with: adata.obs.columns[adata.obs.columns.str.contains("subclass")]

Hi Kyle,

The label was “Subclass” (upper case S).

I am able to run the script.

One last question: what is the unit for gene expression? The website mentions ln(UP10K+1). Could you please explain what this means and how this may relate to TPM or any other regular metric?

The expression values are the natural log of [(the number of unique molecular identifiers (UMIs) for each gene in a given cell, divided by the total number of UMIs in the same cell, multiplied by 10,000) plus 1], i.e. UMIs per 10,000 (UP10K) plus 1. Transcripts per million and the related counts per million apply to RNAseq experiments without UMIs. 10x Genomics has more:
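
As a worked example of that transformation, the sketch below computes ln(UP10K+1) for one gene in one cell, reading UP10K as UMIs per 10,000 total UMIs in the cell. The counts are made up, and `ln_up10k_plus_1` and `up10k_from_ln` are hypothetical helper names, not part of scanpy or the SEA-AD code.

```python
import math


def ln_up10k_plus_1(gene_umis, total_umis):
    """Natural log of (UMIs per 10,000 total UMIs in the cell, plus 1)."""
    up10k = gene_umis / total_umis * 10_000
    return math.log1p(up10k)


def up10k_from_ln(value):
    """Invert the transformation to get back UMIs per 10,000."""
    return math.expm1(value)


# Hypothetical cell: 3 UMIs for the gene out of 20,000 UMIs in total.
# 3 / 20,000 * 10,000 = 1.5 UMIs per 10K, so the stored value is ln(2.5).
value = ln_up10k_plus_1(gene_umis=3, total_umis=20_000)
print(round(value, 3))          # 0.916
print(round(up10k_from_ln(value), 3))  # 1.5
```

A cell with zero UMIs for the gene gets ln(0 + 1) = 0, which is why non-expressing cells show up as exact zeros in .X.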

Thanks Kyle.

Hi Kyle,

One more question:

In the cognitive status in the dot plot, there are three categories: Reference, No dementia, and Dementia.

I was unable to find clear explanation on the web site for this.

What are the differences between these?

Hi again,

Apologies for missing this! Reference is applied to the young neurotypical reference donors. No dementia/Dementia is applied to the aged SEA-AD cohort and indicates whether each donor had a clinical diagnosis of dementia at the time of their death.


Hi Kyle,

This is Chen from Dana-Farber Cancer Institute.
I have a question regarding understanding the SEA-AD dot plot.
The color of the dots represents the expression level. My question is whether that is from ‘all cells in a certain cluster’ or ‘all POSITIVE cells in a certain cluster’, where ‘POSITIVE’ means the cells expressing a certain gene.

My understanding is that it is from POSITIVE cells, because I see this in many cases, e.g. CCND1. In the Reference sample, only 10% of cells express CCND1 (judging by the size of the circle), but the level (color) is very high (1.5 according to the color scale). If the expression level were from all cells, then the CCND1 level in POSITIVE cells would have to be extremely high for the signal from 10% of cells to remain that strong after being diluted by 90% negative cells.

Can you please let me know if I understand correctly?

Many thanks!
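
For reference, the sample script earlier in this thread filters with `df[j] != 0` before averaging, so it computes the mean over expressing cells only. Whether the website applies exactly the same calculation is an assumption here. The two readings are related by: mean over all cells = fraction expressing × mean over expressing cells, because non-expressing cells contribute ln(0+1) = 0. A small sketch with made-up values matching the CCND1 example:

```python
# Ten hypothetical cells in one group: one expresses the gene at 1.5, nine do not.
values = [1.5] + [0.0] * 9

positive = [v for v in values if v > 0]
fraction_expressed = len(positive) / len(values)   # dot size: 0.1
mean_positive = sum(positive) / len(positive)      # mean over expressing cells: 1.5
mean_all = sum(values) / len(values)               # mean over all cells: 0.15

print(fraction_expressed, mean_positive, round(mean_all, 2))
```

So a color of 1.5 at 10% expressing cells would correspond to a mean of only about 0.15 if all cells were averaged.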