H5ad File Characteristics

Hello, I’m currently working with the SEAAD_MTG_RNAseq_final-nuclei.2024-02-13.h5ad dataset, and it’s my first time working with RNAseq data, so please excuse the simplicity of the questions. I was wondering about a few things in the dataset:

  1. Which column name refers to the gene counts?
  2. Which column name refers to the mitochondrial count (or is it just the UMI fractional mitochondria column that is available)?
  3. Is this data already filtered based on counts and fraction of genes that are mitochondrial? If so, what are the thresholds set for those?
  4. Has the data already been normalized?

Thank you very much!

Hello @karte,

  1. “Genes detected” is the number of genes detected per nucleus.
  2. “Fraction mitochondrial UMIs” is the fraction of total UMIs that are mitochondrial. You can recover the total mitochondrial UMIs by multiplying this by “Number of UMIs”, which is the number of UMIs detected per nucleus.
  3. Yes, the data are already filtered to nuclei with a fraction of mitochondrial UMIs less than 0.05. We also apply a filter at the cell cluster level and flag clusters that are outliers for fraction of mitochondrial UMIs, so some nuclei below our threshold will also have been removed. You can see the code for this here in the save_anndata() and clean_taxonomies() functions.
  4. .layers["UMIs"] holds the unnormalized counts, while .X has been processed with sc.pp.normalize_total() and sc.pp.log1p().
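
For anyone following along, here is a minimal sketch of how the points above might look in Python. It is not the code used to build the file. It assumes the three column names quoted above live in adata.obs, and the helper names mito_umi_count() and summarize() are made up for illustration.

```python
def mito_umi_count(fraction_mito, n_umis):
    """Recover the mitochondrial UMI count for one nucleus.

    Multiplies "Fraction mitochondrial UMIs" by "Number of UMIs" and rounds,
    since the product should be a whole number up to floating-point error.
    """
    return int(round(fraction_mito * n_umis))


def summarize(path):
    """Load the .h5ad file and collect the per-nucleus values discussed above."""
    # Imported here so mito_umi_count() can be used without anndata installed.
    import anndata as ad

    # backed="r" keeps .X on disk instead of reading it into memory.
    adata = ad.read_h5ad(path, backed="r")

    # Assumption: the per-nucleus metadata columns are stored in .obs.
    obs = adata.obs
    mito_umis = (obs["Fraction mitochondrial UMIs"] * obs["Number of UMIs"]).round()

    return {
        "n_nuclei": adata.n_obs,
        "genes_detected": obs["Genes detected"],
        "mito_umis": mito_umis,
        # Given the filter described above, this is expected to be below 0.05.
        "max_mito_fraction": float(obs["Fraction mitochondrial UMIs"].max()),
        # "UMIs" should be listed here if the raw counts layer is present.
        "layers": list(adata.layers.keys()),
    }
```

Calling summarize("SEAAD_MTG_RNAseq_final-nuclei.2024-02-13.h5ad") would then return these values in one dictionary. For analyses that need raw counts, use adata.layers["UMIs"] rather than adata.X, since .X is already normalized and log-transformed.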

Best,
Kyle