I am the Scientific Project Manager for the SEA-AD Consortium. I am posting a community question received via email related to working with SEA-AD data:
We have been trying to work with SEA-AD data for some time, including via CELLxGENE (https://cellxgene.cziscience.com/e/c2876b1b-06d8-4d96-a56b-5304f815b99a.cxg/). The data file we have been using is SEAAD_MTG_RNAseq_final-nuclei.2023-05-05.h5ad . While exploring the data via the tool linked earlier and also in our analysis so far (maybe we are making a mistake), we are not finding any differentially expressed genes. We have been using the pseudobulked UMI counts to find DE genes in all cell-types, the ones categorized as ‘Subclass’ in metadata and are trying to use deseq2. We are using the samples labelled as ‘Reference’ as controls and the various Braak stages are our conditions. Do you have any advice/suggestions, if we want to continue working with this data? We are considering switching to scVI tools as mentioned in the paper.
Also, the ‘X’ data attribute within the h5ad data object seems to be a processed and normalized version of the UMI counts under the ‘layers’ attribute. How exactly was ‘X’ generated? I ask because it also seems to be batch corrected. Thanks in advance!
When picking a model for identifying differentially expressed genes it’s important to consider the following:
Pseudobulk versus Mixed effects models. The former will be more conservative and more prone to false negatives. When dealing with small effect sizes, it may not be the most appropriate model.
Categorical versus continuous variables of disease. The former will require averaging data within several groups (the 6 Braak stages) and comparing them to reference in a case-control context. This effectively reduces your number of observations and effect sizes, leading to a greater chance of false negatives.
Cell type resolution. Testing at the subclass level will also reduce effect sizes through averaging changes that are not common across all of the cell types within a subclass.
For these reasons we chose to use a mixed effects model (nebula) with a continuous covariate (CPS) at supertype resolution in our manuscript, where we identified several differentially expressed genes.
In our SEA-AD objects .X is just natural log (UMIs per 10,000 plus 1) from sc.pp.normalize_total() with a target of 1e5 and sc.pp.log1p(). No batch correction.
Thank you so much for the response. About batch effects: Since the Alzheimer’s samples come from two different studies: ACT & ADRC, and all the References/Controls from a separate one, shouldn’t we account for batch effects for any DE analysis? Although while exploring the data in the CZI’s Cell-by-Gene browser it doesn’t seem to be a concern. About cell-type annotations: Within the ‘obs’ attribute of the anndata object, we have the various annotations that the manuscript talks about, like, supertype, class, subclass, but we don’t seem to have the cell-type annotation, even though there is a way to subset with cell types in the CZI browser.
The UW Biorepository and Integrated Neuropathology Laboratory (BRaIN Lab) handles the rapid autopsy and tissue processing for all of the brains in our dataset using the same cryopreservation protocol, regardless of study origin. Similarly, all of the downstream molecular biology to isolate nuclei and construct sequencing libraries are handled by the same Molecular Biology core at Allen. You could consider adding it as a covariate in a model, but I don’t think it’s needed and I’d be careful about co-linearity because the ADRC cases tend to be more advanced.
The representation on CELLxGENE comes from an scVI model trained with Donor ID used as the batch covariate. This does not affect the data in either .layers[“UMIs”] or .X.
As noted in the manuscript, we defined supertypes as mappable cell types that we could confidently identify in a hold-one-out procedure in data from reference donors. We use it as our stand in for cell type.
CELLxGENE’s schema requires we list a cell_ontology_id, but these labels are much closer to subclass level and are not part of our official data release. The CELLxGENE version of the data actually has other issues, such as reduced metadata (to facilitate display in their browser) and slightly fewer genes (due to a mismatches between the genome annotation used in 10x’s current official human reference and that used by CELLxGENE). We recommend using the official data release available on our AWS Open Data Registry.
So, we do want to treat disease as a categorical variable (at least for now). So far we have divided the samples/people (we know there are more than one samples from some, but here I am using samples to mean people) into 4 groups:
1. Controls/References: Average group age:44.8 years: Total 5 people
2. Early AD: Avg group age:79.8 years: Total 5 people
3. Medium AD: Avg group age:89.2 years: Total 5 people
4. Late AD: Avg group age:74.8 years: Total 5 people
All the groups are of course mutually exclusive; and gender matched, with three males and two females. That’s because the control group has only five samples with three males and two females. We do pass age as a continuous covariate to the scVI model.
The problem is we don’t think this is enough and that we need age matched controls and 5 samples are likely not enough. The early, medium and late AD stages were chosen based off of:
Early AD: (Braak: 0, II or III) and (Thal : 0, 1 or 2) and (CERAD: Absent or Sparse) and (ADNC: Not AD or Low)
Medium AD: (Braak: IV or V) and (Thal: 3 or 4) and (CERAD: Moderate) and (ADNC: Intermediate)
Late AD: (Braak: VI) and (Thal: 5) and (CERAD: Frequent) and (ADNC: High)
If there were more than 5 samples in any group, we chose the 5 that had lowest age at death, to match the average age of references/controls as much as we could.
We could give up on AD staging for better matched controls. The idea is to compare differential gene expression signatures to a different disease. Could we use samples from the ACT study as controls? We need feedback on this set up and perhaps help designing a better one.
I can send over sample IDs for each group too if need be.
I’m not sure how to best advise, it’d be helpful to get a better sense of the analysis you’re hoping to run. Generally, I’d avoid downsampling each group to 5 donors because disease effects on gene expression are relatively small. I’d also avoid creating new bins of donors based on Braak, Thal, CERAD and ADNC. They each describe/scale with different kinds of pathology and, in some cases, are composites of the others. If you want to use a categorical variable, I’d recommend using ADNC. Our cohort has aged individuals with no or low AD pathology that you could use a “reference” (individuals with values of “Not AD” and “Low”) and compare to individuals that have “Intermediate” or “High” pathology. The donors labeled as reference that are included are much younger and used only for cell type mapping in our manuscript.
Yes, reducing patients in each group to just 5 is something we’ve been hoping to circumvent. I was hoping to use some individuals from ACT (which we earlier thought was a study reserved only for AD patients) as Controls.
But what you say about compositeness of different scores of AD pathology seems a little counter-intuitive: one would assume choosing AD patients based on 4 pathological scores that are in agreement (when I say agreement I mean all the scores indicating advanced AD pathology) would be better than just 1 (ADNC) especially in light of articles like: https://alz-journals.onlinelibrary.wiley.com/doi/full/10.1002/alz.054305
About samples labelled References: I see, thank you. I think we can drop the really young individuals from this group in lieu of better matched ones from ACT.
Hi Kyle,
That is super helpful. I suppose I got a little astray because there is/are ‘Braak V’ sample(s) that have ‘Not AD’ as the ADNC score, but never mind.
My new setup based on this feedback seems better for statistical power and age matching. The AD samples now are a result of: Braak==(Braak 5 or 6) and (Thal== 5 or 4) and (CERAD==Frequent or Moderate) and (ADNC== High)
which gives me 42 AD samples with an average age of 85.7.
I get 10 Controls with a mean age of 80.1 with the following filtering: (Braak:0,2,3,4 or Reference) and (Thal: 0,1,2 or Reference) and (CERAD: Absent or Reference) and (ADNC: Not AD or Refrence) and (Age at Death >=50) Only 2 ‘Reference’ samples are a part of this control group.
If I ignore everything and just do ADNC == ‘Not AD’ for controls, I get 9 samples with 8 of them same as above. So I am inclined towards using the prior filtering for Controls.