SEA-AD Datasets Query

I am the Scientific Project Manager for the SEA-AD Consortium. I am posting a community question received via email related to working with SEA-AD data:

I have used some of the SEA-AD Allen Institute datasets for an analysis of power in differential expression in scRNA-seq data (with the hope of publishing it down the line).

I have a query about the datasets (as provided on https://cellxgene.cziscience.com/collections/1ca90a2d-2943-483d-b678-b809bf464c30). In your paper (https://doi.org/10.1101/2023.05.08.539485), you state that the data comes from 84 donors, so I expected each cell type to have 84 samples. However, when I download the files from the link above, I get quite a wide range of sample numbers (usually 89, but often 88 or 87, with only one cell type having 84 samples).

Would you have any idea why this could be? I downloaded each dataset from the link above (in .rds format), loaded it with “readRDS()”, and used the Seurat “as.SingleCellExperiment()” function to convert it to an SCE object. I then looked at the unique Donor IDs for each dataset, which is how I produced the “numSamples” column below (please disregard the other columns). Could you please guide me on this if possible?

The short answer to your question is that not every donor contains every cell type. This is largely for one of two reasons:

  1. Some cell types (e.g., Sst Chodl and L5 ET) are quite rare and may not be found in every donor
  2. A few donors (~2) have relatively low quality data and are missing several cell types

Finally, the maximum value is 89 rather than 84 because this data set also includes the five younger adult donors from this reference data set.
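To check which donors are absent from a given cell type, a per-cell-type tally of unique donors can help. The following is only a minimal sketch on a toy table: the column names `donor_id` and `subclass` and the helper `missing_donors()` are placeholders for illustration and may not match the names in the SEA-AD files.

```r
# Toy per-cell metadata. `donor_id` and `subclass` are hypothetical column
# names; substitute whatever the downloaded object actually uses.
cell_meta <- data.frame(
  donor_id = c("D1", "D1", "D2", "D2", "D3", "D3", "D3"),
  subclass = c("L5 ET", "Sst", "Sst", "Sst", "L5 ET", "Sst", "Sst Chodl"),
  stringsAsFactors = FALSE
)

# Number of distinct donors contributing at least one cell to each cell type.
donors_per_type <- tapply(
  cell_meta$donor_id,
  cell_meta$subclass,
  function(ids) length(unique(ids))
)

# Hypothetical helper: donors with no cells of the given cell type.
missing_donors <- function(meta, cell_type) {
  setdiff(unique(meta$donor_id), meta$donor_id[meta$subclass == cell_type])
}

print(donors_per_type)
print(missing_donors(cell_meta, "Sst Chodl"))
```

In this toy table the rare type Sst Chodl is found in only one of the three donors, which is the same pattern that produces counts below 89 in the real data.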

We would also encourage downloading the data from AWS here. The AWS files include more metadata, and the gene matrix is slightly different (a few genes are lost when converting to the required cellxgene format). That said, we realize that the h5ad files on AWS are too large to open in R, and we are working on a workaround for a future release. In the meantime, you should still be able to access the metadata by loading a file in backed mode (e.g., data <- read_h5ad([FILENAME], backed="r")).
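As a rough sketch of the backed-mode approach, assuming the `anndata` R package (and its Python backend) is installed, the per-cell metadata can be pulled out without reading the expression matrix into memory. The wrapper name `read_obs_backed()` is made up for illustration, and the file path is left for you to supply.

```r
# Hypothetical wrapper: open an h5ad file in backed (read-only) mode and
# return only the per-cell metadata table.
# Assumes the `anndata` R package is installed; it is only needed when the
# function is actually called.
read_obs_backed <- function(path) {
  # backed = "r" keeps the expression matrix on disk instead of loading it.
  ad <- anndata::read_h5ad(path, backed = "r")
  # `obs` holds one row of metadata per cell.
  ad$obs
}
```

The returned data frame can then be summarized in the same way as any other metadata table, for example by counting unique donors per cell type.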