H5ad File Characteristics

Hello, I’m currently working with the SEAAD_MTG_RNAseq_final-nuclei.2024-02-13.h5ad dataset. It’s my first time working with RNA-seq data, so please excuse the simplicity of the questions. I was wondering a few things about the dataset:

  1. Which column name refers to the gene counts?
  2. Which column name refers to the mitochondrial count (or is it just the UMI fractional mitochondria column that is available)?
  3. Is this data already filtered based on counts and fraction of genes that are mitochondrial? If so, what are the thresholds set for those?
  4. Has the data already been normalized?

Thank you very much!

Hello @karte,

  1. “Genes detected” is the number of genes detected per nucleus.
  2. “Fraction mitochondrial UMIs” is the fraction of total UMIs that are mitochondrial. You can recover the total mitochondrial UMIs by multiplying this by “Number of UMIs”, which is the number of UMIs detected per nucleus.
  3. Yes, the data are already filtered to keep only nuclei with a fraction of mitochondrial UMIs less than 0.05. We also apply a filter at the cell cluster level and flag clusters that are outliers for fraction of mitochondrial UMIs, so some nuclei below our threshold would also have been removed. You can see the code for this here in the save_anndata() and clean_taxonomies() functions.
  4. .layers["UMIs"] is unnormalized counts, while .X has been processed with sc.pp.normalize_total() and sc.pp.log1p().
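
As a rough sketch of the arithmetic in point 2: the small table below is only a stand-in for adata.obs, its values are invented for illustration, and the name of the new column is hypothetical. Only the three quoted column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for adata.obs; the numbers are made up for illustration.
obs = pd.DataFrame(
    {
        "Number of UMIs": [12000, 8500, 20000],
        "Genes detected": [4100, 3200, 5600],
        "Fraction mitochondrial UMIs": [0.01, 0.02, 0.045],
    },
    index=["nucleus_1", "nucleus_2", "nucleus_3"],
)


def mitochondrial_umis(obs):
    """Recover the per-nucleus mitochondrial UMI count from the fraction."""
    product = obs["Fraction mitochondrial UMIs"] * obs["Number of UMIs"]
    # Round before casting so floating-point error does not truncate a count.
    return product.round().astype("int64")


# Hypothetical column name, not part of the released file.
obs["Mitochondrial UMIs (recovered)"] = mitochondrial_umis(obs)

# Every nucleus in the released data should sit below the 0.05 threshold.
all_below_threshold = bool((obs["Fraction mitochondrial UMIs"] < 0.05).all())

print(obs)
print(all_below_threshold)
```

With the real file, the same function should work on adata.obs directly, assuming the column names match the ones quoted above.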

Best,
Kyle

Hello, Kyle!

If .layers["UMIs"] is unnormalized counts, why does inspecting it in Python show a sparse matrix of type '<class 'numpy.float32'>'? Why not int64?

Thank you very much!

Hello,

Somewhere along the way the dtype must have been set to float. If you look at the actual values there will only be whole numbers.
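
One way to check this, sketched on a toy matrix rather than the real layer: the small float32 sparse matrix below stands in for adata.layers["UMIs"], and the helper name is hypothetical.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for adata.layers["UMIs"]: counts stored with a float32 dtype.
umis = sparse.csr_matrix(np.array([[0, 3, 0], [1, 0, 7]], dtype=np.float32))


def is_whole_numbers(matrix):
    """Return True if every stored value of a sparse matrix is a whole number."""
    # Only the explicitly stored entries need checking; implicit zeros are whole.
    return bool(np.all(np.mod(matrix.data, 1) == 0))


print(umis.dtype)
print(is_whole_numbers(umis))

# If the check passes, casting to an integer dtype loses no information.
umis_int = umis.astype(np.int64)
print(umis_int.dtype)
```

The cast is optional. Whether downstream tools need an integer dtype depends on the tool, so it may be fine to leave the layer as float32.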

Best,

Kyle

Hello Kyle,

I looked at the actual values and indeed there are only integers.

Thanks for your reply.