Loading data from transcrip.tome file in Python with h5py

Hello everyone,

I am trying to do some work on the mouse transcriptomics data from the Allen Brain Institute. I have downloaded the transcrip.tome file from here. From the .readme I found out that the recommended way to read this file is an R package. However, I don’t know R, and since the file format is supposedly HDF5-based, I decided to try h5py first.
This works in principle. I can run the following code:

import h5py

filename = 'transcrip.tome'

f = h5py.File(filename, 'r')
print("Keys: %s" % f.keys())
a_group_key = list(f.keys())[0]

data = list(f[a_group_key])
print(f['data']['exon'])
print(type(f['data']['exon']))
print(f['data']['exon'].keys())
print(type(f['data']['exon']['i']))

This prints:
Keys: <KeysViewHDF5 ['data', 'dend', 'gene_names', 'projection', 'sample_meta', 'sample_names', 'stats']>
<HDF5 group "/data/exon" (4 members)>
<class 'h5py._hl.group.Group'>
<KeysViewHDF5 ['dims', 'i', 'p', 'x']>
<class 'h5py._hl.dataset.Dataset'>

So I generally seem to have file access. However, from the .readme I would expect a sparse matrix in f['data']['exon'], i.e. an h5py Dataset object. Instead I find a Group object there that contains four more keys, none of which seems to be documented further.

Does anyone know how to reconstruct the sparse matrix? Or is this a limitation of h5py, which might not account for some specifics of the .tome format? Thank you in advance for your help and the great community effort. I greatly appreciate it. Let me know if you require any more information.

Best,
Daniel

Hi Daniel,

The h5dump utility is handy for seeing into the structure of the .tome file. Use the -H switch to show just the structure of the archive, as in h5dump -H mouse/transcrip.tome.
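
If h5dump is not installed, roughly the same overview can be had from Python. This is only a minimal sketch built on h5py's visititems; the helper name describe_structure is made up for illustration and is not part of h5py:

```python
import h5py


def describe_structure(h5_group):
    """Return one text line per group or dataset found below h5_group."""
    lines = []

    def visit(name, obj):
        # visititems calls this once for every object under h5_group
        if isinstance(obj, h5py.Dataset):
            lines.append(f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}")
        else:
            lines.append(f"{name}: group")

    h5_group.visititems(visit)
    return lines


# Usage (the path is the one used elsewhere in this thread):
# with h5py.File("mouse/transcrip.tome", "r") as h5f:
#     print("\n".join(describe_structure(h5f)))
```

Unlike h5dump -H this does not list attributes, but it is enough to see which entries are groups and which are datasets.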

That gives you some more information, but it’s still not entirely obvious how to construct a sparse matrix from the data. Here’s how I do it:

import scipy.sparse as ss
import h5py


# "h5f" is the handle that you get from "h5py.File('mouse/transcrip.tome', 'r')"
#
# "data_path" is the path within the archive.  In this case probably
# either "/data/exon/" or "/data/intron/".
def extract_sparse_matrix(h5f, data_path):
    data = h5f[data_path]
    # "[:]" reads each HDF5 dataset fully into a NumPy array
    x = data['x'][:]        # non-zero values
    i = data['i'][:]        # row index of each value
    p = data['p'][:]        # column pointers into x and i
    dims = data['dims'][:]  # matrix shape: (rows, columns)

    # (values, row indices, column pointers) is the CSC constructor form
    sparse_matrix = ss.csc_matrix((x, i, p),
                                  shape=(dims[0], dims[1]))
    return sparse_matrix

# The call looks like this
h5f = h5py.File('mouse/transcrip.tome', 'r')  # open read-only explicitly
exons = extract_sparse_matrix(h5f, '/data/exon/')
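
In case it helps to see why that works: the (x, i, p) triple passed to csc_matrix is scipy's compressed sparse column form, where x holds the non-zero values, i the row index of each value, and p the offsets at which each column starts. That reading comes from how the code above uses the datasets, not from any format documentation I have seen. A toy sketch with made-up numbers (not data from the .tome file):

```python
import numpy as np
import scipy.sparse as ss

# Made-up 3x3 example, NOT values from the .tome file
x = np.array([10, 20, 30, 40])  # non-zero values, stored column by column
i = np.array([0, 2, 1, 0])      # row index of each value in x
p = np.array([0, 2, 3, 4])      # column j owns the slice x[p[j]:p[j + 1]]
dims = (3, 3)                   # (rows, columns)

m = ss.csc_matrix((x, i, p), shape=dims)
dense = m.toarray()
print(dense)
# [[10  0 40]
#  [ 0 30  0]
#  [20  0  0]]
```

Column 0 owns x[0:2], so 10 lands in row 0 and 20 in row 2; column 1 owns x[2:3], and column 2 owns x[3:4].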

Once you have the sparse matrix you can do sparse things. If you have the RAM and inclination, you can turn that into a dense dataframe like so:

import numpy as np
import pandas as pd

# this uses the hdf5 file to extract the cell & gene names
# and uses them as row & column indices in the returned dataframe
def sparse_to_labeled_frame(h5f, sparse_matrix):
    cell_names = h5f['sample_names']
    gene_names = h5f['gene_names']

    # toarray() returns a plain ndarray (todense() would return np.matrix)
    dense_matrix = sparse_matrix.toarray()

    # make it a dataframe so we can add row and column labels
    df = pd.DataFrame(dense_matrix)
    # add the header; note that 'U23' truncates names longer than 23 characters
    df.columns = gene_names[:].astype('U23')
    # add the cell labels as a column, and make it the row index
    # ('U20' likewise truncates labels longer than 20 characters)
    df.insert(0, 'cell_label', cell_names[:].astype('U20'))
    df.set_index('cell_label', inplace=True)

    return df
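
If RAM is the constraint, one option is to slice the sparse matrix down to the genes of interest before densifying. A rough sketch with synthetic data, assuming rows are cells and columns are genes as in the function above; genes_to_frame and all the names in the example are hypothetical:

```python
import numpy as np
import pandas as pd
import scipy.sparse as ss


def genes_to_frame(sparse_matrix, cell_names, gene_names, wanted_genes):
    """Densify only the columns that belong to wanted_genes."""
    gene_names = list(gene_names)
    # look up the column position of each requested gene
    col_idx = [gene_names.index(g) for g in wanted_genes]
    subset = sparse_matrix[:, col_idx]  # still sparse at this point
    return pd.DataFrame(subset.toarray(),
                        index=list(cell_names),
                        columns=list(wanted_genes))


# Synthetic stand-in for the real matrix and labels
counts = ss.csc_matrix(np.array([[0, 5, 0],
                                 [3, 0, 0]]))
cells = ["cell_a", "cell_b"]
genes = ["gene_x", "gene_y", "gene_z"]

df = genes_to_frame(counts, cells, genes, ["gene_y", "gene_x"])
print(df)
```

Column slicing is cheap on a CSC matrix, so only the selected genes are ever expanded to dense form.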

Hope that helps,

Tim

Hey Tim,

Thank you very much for your quick response. I just ran your code; it works perfectly and does exactly what I needed. This gets me off to a great start, thank you again!

Best,
Daniel