Hi guys,
I was performing DEG analysis on the SEA-AD data and got confused about two things:
- When we access the AWS S3 Explorer, we can get the donor data; there are around 84 .h5ad files. I checked, and many of them have the ‘Overall AD Neuropathological Change’ column in .obs marked as ‘Reference’. Due to RAM limitations, I was reading each .h5ad file individually, filtering it, removing “non-essential” variables, and keeping only the ‘Intermediate’ and ‘High’ ADNC cases (as AD) and ‘Not AD’ (as No AD), roughly as in the sketch below. Can anyone clarify whether I should skip these ‘Reference’ ADNC cases? If so, I’d be “losing” around half of the donor data.
- Later on, when I was performing DEG analysis with scVI, I noticed there’s an .obs column named ‘sample_name’ with around 130 unique values. My question is: should I normalize based on these 130 samples, or only by the 84 donors? I know this is a beginner-level question; I’m just getting started with omics analysis.
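For reference, this is roughly what my per-file filtering looks like. It is only a sketch: the file paths and the AD/No-AD grouping dictionary are placeholders, and the .obs column name is the one mentioned above.

```python
import anndata as ad

# Placeholder file list; in practice there are ~84 donor .h5ad files.
h5ad_files = ["donor_001.h5ad", "donor_002.h5ad"]
adnc_col = "Overall AD Neuropathological Change"
group_map = {"Intermediate": "AD", "High": "AD", "Not AD": "No AD"}

pieces = []
for path in h5ad_files:
    adata = ad.read_h5ad(path, backed="r")            # keep X on disk to spare RAM
    mask = adata.obs[adnc_col].isin(list(group_map))  # drops 'Reference' (and 'Low')
    sub = adata[mask.values].to_memory()              # load only the kept cells
    if sub.n_obs == 0:
        continue
    sub.obs["AD_status"] = sub.obs[adnc_col].map(group_map).astype("category")
    # drop any unneeded .obs / .obsm / .layers entries here before concatenating
    pieces.append(sub)

combined = ad.concat(pieces, join="inner")
combined.write_h5ad("seaad_ad_vs_noad.h5ad")
```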
Thanks a lot!
Hello,
There should only be 5 donors with ADNC == “Reference”. These are samples from young donors from the BRAIN Initiative that we mapped the SEA-AD data to in order to determine cell class, subclass, and supertype. They can be safely removed; they are not considered part of the SEA-AD cohort. The ADNC levels to keep are: “Not AD”, “Low”, “Intermediate”, and “High”.
Some donors have multiple libraries from the same technology (e.g. 2 RNA singleome) that come from the same nuclei suspension, and some have multiple modalities (e.g. RNA singleome and multiome) that come from distinct nuclei suspensions generated from nearby tissue blocks. The most conservative batch variable to pass to scVI would be “library_prep”. It is worth noting that if the goal is to preserve variance associated with disease, scVI may not be an appropriate tool: when you add either “library_prep” or “Donor ID” to the latent space (and to what you pass to the decoder), nothing enforces that variance associated with disease is preserved.
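If you do go ahead with scVI, “library_prep” is just the batch_key in setup_anndata. A minimal sketch, assuming raw counts are kept in a “counts” layer (the layer name and the “AD_status” label are placeholders):

```python
import scvi

# Assumes `adata` holds raw counts in a "counts" layer (placeholder name).
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="library_prep")
model = scvi.model.SCVI(adata)
model.train()

adata.obsm["X_scVI"] = model.get_latent_representation()
# model.differential_expression(groupby="AD_status") would then run scVI's DE test,
# but note the caveat above: nothing enforces that disease variance is preserved.
```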
Kyle
Thanks, Kyle. You definitely helped a lot. I’ll remove the ‘Reference’ cases and, due to my RAM limitations, keep only the ‘Not AD’ and ‘High’ ADNC levels. Unfortunately, if I want to include ‘Intermediate’ or ‘Low’ in a comparison, I’ll need to create separate .h5ad objects for them.
Regarding your second point: got it. Yes, my goal is to preserve the variance associated with disease so that I can focus on the meaningful part of the DEG analysis. Given that, which tool would you recommend? I could also approach this in a pseudobulk way (with pyDESeq2), or maybe use diffxpy.
Paulo
To handle pseudoreplication and variance associated with other covariates, we used a generalized linear mixed-effects model called nebula. Pseudobulk would also be valid (and more conservative, though it would help with your memory limitations). To increase power we tested along our pseudo-progression variable, which is a continuous measure of pathological burden. But if you wanted to stick with ADNC, you could also encode it as an ordinal variable (mapping the values to 0, 1, 2, and 3 and min-max scaling); a rough sketch of that encoding plus a pseudobulk run is below. -Kyle
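For anyone following along, a minimal sketch of the pseudobulk route with that ordinal ADNC encoding might look like the following. It assumes an AnnData `adata` with raw counts in a “counts” layer and the obs columns discussed in this thread; the “AD_status” label is hypothetical, and the pyDESeq2 argument names have shifted between versions, so treat those calls as approximate.

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Ordinal encoding of ADNC (0-3), min-max scaled to [0, 1] as described above.
adnc_order = {"Not AD": 0, "Low": 1, "Intermediate": 2, "High": 3}
adnc = adata.obs["Overall AD Neuropathological Change"]
adata.obs["ADNC_ordinal"] = adnc.map(adnc_order).astype(float) / 3.0
# Simple two-group label (hypothetical) if you prefer a categorical contrast.
adata.obs["AD_status"] = adnc.map({"Not AD": "NoAD", "Intermediate": "AD", "High": "AD"})

# Pseudobulk: sum raw counts per donor (rows = donors, columns = genes).
counts = pd.DataFrame(
    adata.layers["counts"].toarray(), index=adata.obs_names, columns=adata.var_names
)
donors = adata.obs["Donor ID"].astype(str).values
pseudobulk = counts.groupby(donors).sum().astype(int)
meta = adata.obs.groupby(donors)[["AD_status", "ADNC_ordinal"]].first()
meta = meta.loc[pseudobulk.index].dropna(subset=["AD_status"])
pseudobulk = pseudobulk.loc[meta.index]

# Donor-level test with pyDESeq2; ADNC_ordinal could replace AD_status as a
# continuous covariate in versions that support continuous design factors.
dds = DeseqDataSet(counts=pseudobulk, metadata=meta, design_factors="AD_status")
dds.deseq2()
stats = DeseqStats(dds, contrast=["AD_status", "AD", "NoAD"])
stats.summary()
results = stats.results_df
```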