Hi guys,
I was performing DEG analysis on the SEA-AD data and got confused about two things:
- When we access the AWS S3 Explorer, we can get the donor data; there are around 84 .h5ad files. I checked, and many of them have the ‘Overall AD Neuropathological Change’ column in .obs marked as ‘Reference’. Due to RAM limitations, I was reading each .h5ad file individually, filtering it, removing “non-essential” variables, and keeping only the ‘Intermediate’ and ‘High’ ADNC cases (as AD) and ‘Not AD’ (as No AD), roughly as in the sketch below. Can anyone clarify whether I should skip these ‘Reference’ ADNC cases? If so, I’d be “losing” around half of the donor data.
- Later on, when I was performing DEG analysis with scVI, I noticed there’s an .obs column named ‘sample_name’ with around 130 unique values. My question is: should I normalize based on these 130 samples, or only by the 84 donors? I know this is a beginner-level question; I’m just getting started with omics analysis.
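For reference, this is roughly what my per-file filtering looks like. It is only a sketch: the file paths and the AD/No-AD grouping dictionary are placeholders, and the .obs column name is the one mentioned above.

```python
import anndata as ad

# Placeholder file list; in practice there are ~84 donor .h5ad files.
h5ad_files = ["donor_001.h5ad", "donor_002.h5ad"]
adnc_col = "Overall AD Neuropathological Change"
group_map = {"Intermediate": "AD", "High": "AD", "Not AD": "No AD"}

pieces = []
for path in h5ad_files:
    adata = ad.read_h5ad(path, backed="r")            # keep X on disk to spare RAM
    mask = adata.obs[adnc_col].isin(list(group_map))  # drops 'Reference' (and 'Low')
    sub = adata[mask.values].to_memory()              # load only the kept cells
    if sub.n_obs == 0:
        continue
    sub.obs["AD_status"] = sub.obs[adnc_col].map(group_map).astype("category")
    # drop any unneeded .obs / .obsm / .layers entries here before concatenating
    pieces.append(sub)

combined = ad.concat(pieces, join="inner")
combined.write_h5ad("seaad_ad_vs_noad.h5ad")
```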
Thanks a lot!
Hello,
There should only be 5 donors with ADNC == “Reference”. These are samples from young donors from the BRAIN Initiative that we mapped the SEA-AD data to in order to determine cell class, subclass, and supertype. They can be safely removed; they are not considered part of the SEA-AD cohort. The ADNC levels to keep are: “Not AD”, “Low”, “Intermediate”, and “High”.
Some donors have multiple libraries from the same technology (e.g. 2 RNA singleome) that come from the same nuclei suspension, and some have multiple modalities (e.g. RNA singleome and multiome) that come from distinct nuclei suspensions generated from nearby tissue blocks. The most conservative batch variable to pass to scVI would be “library_prep”. It is worth noting that if the goal is to preserve variance associated with disease, scVI may not be an appropriate tool: when you add either “library_prep” or “Donor ID” to the latent space (and to what you pass to the decoder), nothing enforces that variance associated with disease is preserved.
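If you do go ahead with scVI, “library_prep” is just the batch_key in setup_anndata. A minimal sketch, assuming raw counts are kept in a “counts” layer (the layer name and the “AD_status” label are placeholders):

```python
import scvi

# Assumes `adata` holds raw counts in a "counts" layer (placeholder name).
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="library_prep")
model = scvi.model.SCVI(adata)
model.train()

adata.obsm["X_scVI"] = model.get_latent_representation()
# model.differential_expression(groupby="AD_status") would then run scVI's DE test,
# but note the caveat above: nothing enforces that disease variance is preserved.
```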
Kyle
Thanks, Kyle. You definitely helped a lot. I’ll remove the ‘Reference’ cases and, due to my RAM limitations, keep only the ‘Not AD’ and ‘High’ ADNC levels. Unfortunately, if I want to include ‘Intermediate’ or ‘Low’ in a comparison, I’ll need to create separate .h5ad objects for them.
Regarding your second point: got it. Yes, my goal is to preserve the variance associated with disease so that I can focus on the meaningful part of the DEG analysis. Given that, which tool would you recommend? I could also approach this in a pseudobulk way (with pyDESeq2), or maybe use diffxpy.
Paulo
To handle pseudoreplication and variance associated with other covariates, we used a generalized linear mixed-effects model called nebula. Pseudobulk would also be valid (and more conservative, though it would help with your memory limitations). To increase power we tested along our pseudo-progression variable, which is a continuous measure of pathological burden. But if you wanted to stick with ADNC, you could also encode it as an ordinal variable (mapping the values to 0, 1, 2, and 3 and min-max scaling); a rough sketch of that encoding plus a pseudobulk run is below. -Kyle
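For anyone following along, a minimal sketch of the pseudobulk route with that ordinal ADNC encoding might look like the following. It assumes an AnnData `adata` with raw counts in a “counts” layer and the obs columns discussed in this thread; the “AD_status” label is hypothetical, and the pyDESeq2 argument names have shifted between versions, so treat those calls as approximate.

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Ordinal encoding of ADNC (0-3), min-max scaled to [0, 1] as described above.
adnc_order = {"Not AD": 0, "Low": 1, "Intermediate": 2, "High": 3}
adnc = adata.obs["Overall AD Neuropathological Change"]
adata.obs["ADNC_ordinal"] = adnc.map(adnc_order).astype(float) / 3.0
# Simple two-group label (hypothetical) if you prefer a categorical contrast.
adata.obs["AD_status"] = adnc.map({"Not AD": "NoAD", "Intermediate": "AD", "High": "AD"})

# Pseudobulk: sum raw counts per donor (rows = donors, columns = genes).
counts = pd.DataFrame(
    adata.layers["counts"].toarray(), index=adata.obs_names, columns=adata.var_names
)
donors = adata.obs["Donor ID"].astype(str).values
pseudobulk = counts.groupby(donors).sum().astype(int)
meta = adata.obs.groupby(donors)[["AD_status", "ADNC_ordinal"]].first()
meta = meta.loc[pseudobulk.index].dropna(subset=["AD_status"])
pseudobulk = pseudobulk.loc[meta.index]

# Donor-level test with pyDESeq2; ADNC_ordinal could replace AD_status as a
# continuous covariate in versions that support continuous design factors.
dds = DeseqDataSet(counts=pseudobulk, metadata=meta, design_factors="AD_status")
dds.deseq2()
stats = DeseqStats(dds, contrast=["AD_status", "AD", "NoAD"])
stats.summary()
results = stats.results_df
```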