Mouse Whole Cortex and Hippocampus SMART-seq: intron counts missing from HDF5 expression matrix file?

Hello,

I am working with the Allen SMART-seq data posted here (Mouse Whole Cortex and Hippocampus SMART-seq - brain-map.org). I’m hoping to use Seurat to integrate this SMART-seq dataset with the 10x dataset here (Mouse Whole Cortex and Hippocampus 10x - brain-map.org) and use the result as a reference dataset. The 10x HDF5 expression matrix file does seem to have intron and exon counts combined. However, while the SMART-seq HDF5 expression matrix file is described as having both introns and exons combined, the group and dataset names in the file suggest that only exon counts are included. Is this a mistake, or does the HDF5 file really contain only the exon counts?
What I see in R for the SMART-seq dataset:

library(rhdf5)
h5ls("expression_matrix.hdf5")
   group         name               otype        dclass   dim
0  /             data               H5I_GROUP
1  /data         exon               H5I_GROUP
2  /data/exon    dims               H5I_DATASET  INTEGER  2
3  /data/exon    i                  H5I_DATASET  FLOAT    703397415
4  /data/exon    p                  H5I_DATASET  INTEGER  45769
5  /data/exon    x                  H5I_DATASET  FLOAT    703397415
6  /data         t_exon             H5I_GROUP
7  /data/t_exon  dims               H5I_DATASET  INTEGER  2
8  /data/t_exon  i                  H5I_DATASET  FLOAT    703397415
9  /data/t_exon  p                  H5I_DATASET  INTEGER  73364
10 /data/t_exon  x                  H5I_DATASET  FLOAT    703397415
11 /data         total_exon_counts  H5I_DATASET  FLOAT    73363
12 /             gene_names         H5I_DATASET  STRING   45768
13 /             sample_names       H5I_DATASET  STRING   73363
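
A note on reading this layout: the i, p, x, and dims datasets under each group look like the pieces of a compressed sparse column matrix. In both groups p has one more entry than a plausible column count (45769 = 45768 genes + 1 under /data/exon; 73364 = 73363 samples + 1 under /data/t_exon), which would make /data/t_exon the genes-by-samples orientation that Seurat expects. The sketch below assembles such a matrix with the Matrix package. It assumes zero-based indices and that dims holds (rows, columns); neither is confirmed by the file description, and build_sparse is a hypothetical helper name.

```r
library(Matrix)

# Hypothetical helper: assemble a sparse matrix from CSC-style pieces.
# ASSUMPTION: `i` (row indices) and `p` (column pointers) are zero-based,
# and `dims` is c(n_rows, n_cols).
build_sparse <- function(i, p, x, dims) {
  sparseMatrix(
    i = as.integer(i),       # stored as FLOAT in the file, so coerce
    p = as.integer(p),
    x = as.numeric(x),
    dims = as.integer(dims),
    index1 = FALSE           # interpret i and p as zero-based
  )
}

# Toy check: a 3 x 2 matrix with [1,1] = 5, [3,1] = 2, [2,2] = 7.
toy <- build_sparse(i = c(0, 2, 1), p = c(0, 2, 3), x = c(5, 2, 7), dims = c(3, 2))

# With the real file (not run here; needs rhdf5 and several GB of memory):
# library(rhdf5)
# f <- "expression_matrix.hdf5"
# g <- "/data/t_exon"
# counts <- build_sparse(h5read(f, paste0(g, "/i")), h5read(f, paste0(g, "/p")),
#                        h5read(f, paste0(g, "/x")), h5read(f, paste0(g, "/dims")))
# rownames(counts) <- as.character(h5read(f, "/gene_names"))
# colnames(counts) <- as.character(h5read(f, "/sample_names"))
```

If the indices turn out to be one-based, sparseMatrix should fail with an out-of-range error rather than silently shifting rows, so the assumption is cheap to test on the real file.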

Thank you for any help!

Hi,
I asked one of our bioinformatics staff members to address your question and here is his response:

The Mouse SMARTerV4 reference is an RSEM reference, which means it contains only exonic reads (we haven’t used RSEM alignments since 2017).

None of the genome reference files are current, and I don’t think any contain intron/exon counts.
• Mouse rsem_GRCm38.p3.gtf.zip is an RSEM reference, which means alignment only to transcripts (no genomic or intronic alignments). Again, we haven’t used RSEM since Q4 2017.
• Human rsem_GRCh38.p2.gtf.zip is the same kind of reference as the mouse one. This would have to be an incomplete release of Human SMARTerV4 MTG data.
• Mouse mouse_10x_gtf.zip is likely our 10X Cell Ranger v3 reference. This reference has all introns reclassified as exons (a pre-mRNA reference). This was done so that both intron and exon alignments are included in the count matrix (Cell Ranger v3 did not have an option for this).

For SMARTerV4 data, anything aligned with STAR will have separate intron and exon count matrices. For 10X, only data aligned with Cell Ranger 6 will have intron/exon counts.
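
When a release does provide separate intron and exon matrices, one way to get counts closer in spirit to a pre-mRNA 10X matrix is to add the two after checking that genes and samples line up. This is a minimal sketch, assuming both inputs are sparse matrices with identical dimnames (genes as rows, samples as columns). combine_exon_intron is a hypothetical helper, and the toy values are made up for illustration only. It does not address differences in the reference annotation itself.

```r
library(Matrix)

# Hypothetical helper: sum exon and intron counts per gene and sample.
# Refuses to add matrices whose genes or samples do not match exactly.
combine_exon_intron <- function(exon, intron) {
  stopifnot(
    identical(dim(exon), dim(intron)),
    identical(rownames(exon), rownames(intron)),
    identical(colnames(exon), colnames(intron))
  )
  exon + intron
}

# Toy data (illustrative values only), filled column by column.
genes <- c("GeneA", "GeneB")
cells <- c("cell1", "cell2", "cell3")
exon <- Matrix(c(1, 0, 0, 2, 3, 0), nrow = 2,
               dimnames = list(genes, cells), sparse = TRUE)
intron <- Matrix(c(0, 4, 0, 0, 1, 0), nrow = 2,
                 dimnames = list(genes, cells), sparse = TRUE)

total <- combine_exon_intron(exon, intron)
```

The strict identical() checks are deliberate: if the two matrices were ordered differently, adding them would give wrong totals with no error.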

Additionally, since the data sets use different references, integrating them will be difficult.

Thank you
