I am trying to do some work on the mouse transcriptomics data from the Allen Brain Institute. I have downloaded the transcrip.tome file from here. From the .readme I found out that the recommended way to read this file is an R package. However, I don’t know R, and since the file format is supposedly HDF5-based, I decided to try h5py first.
This works in principle. I can run the following code:
So I generally seem to have file access. However, from the .readme I would expect a sparse matrix in f['data']['exon'], i.e. an h5py dataset.Dataset object. Instead I find a group.Group object there that contains 4 more keys. None of them seems to be documented any further.
Does anyone know how to reconstruct the sparse matrix? Or is this maybe a limitation of h5py, which cannot account for some specifics of the .tome format? Thank you in advance for your help and the great community effort. I greatly appreciate it. Let me know if you require any more information.
The h5dump utility is handy for seeing into the structure of the .tome file. Use the -H switch to show just the structure of the archive, as in h5dump -H mouse/transcrip.tome.
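If h5dump isn't installed, a similar overview can be had from Python. This is only a rough sketch: describe is a hypothetical helper name, and it assumes nothing beyond group-like objects exposing .items() and dataset-like objects exposing .shape and .dtype, which is how h5py groups and datasets behave.

```python
def describe(node, path="/"):
    """Return one line per dataset found below `node`.

    Anything with an .items() method is treated as a group and
    walked recursively; everything else is treated as a dataset.
    """
    lines = []
    for name, child in node.items():
        child_path = path + name
        if hasattr(child, "items"):
            # group-like: descend one level
            lines.extend(describe(child, child_path + "/"))
        else:
            # dataset-like: report its shape and element type
            lines.append(f"{child_path}  shape={child.shape}  dtype={child.dtype}")
    return lines

# Usage with h5py (path taken from the question, adjust as needed):
#   import h5py
#   with h5py.File('mouse/transcrip.tome', 'r') as h5f:
#       print('\n'.join(describe(h5f)))
```

h5py also ships its own traversal method, visititems, if you would rather not write the recursion yourself.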
That gives you some more information, but it’s still not entirely obvious how to construct a sparse matrix from the data. Here’s how I do it:
import scipy.sparse as ss
import h5py

# "h5f" is the handle that you get from "h5py.File('mouse/transcrip.tome', 'r')"
#
# "data_path" is the path within the archive. In this case probably
# either "/data/exon/" or "/data/intron/".
def extract_sparse_matrix(h5f, data_path):
    data = h5f[data_path]
    x = data['x']        # the nonzero values
    i = data['i']        # row index of each nonzero value
    p = data['p']        # column pointers into x and i
    dims = data['dims']  # (number of rows, number of columns)

    # slicing with [:] reads each HDF5 dataset fully into a numpy array
    sparse_matrix = ss.csc_matrix((x[:], i[:], p[:]),
                                  shape=(dims[0], dims[1]))
    return sparse_matrix

# The call looks like this (open read-only)
h5f = h5py.File('mouse/transcrip.tome', 'r')
exons = extract_sparse_matrix(h5f, '/data/exon/')
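In case the roles of x, i and p are unclear: the csc_matrix call above treats them as the standard compressed sparse column (CSC) triplet, so x holds the nonzero values, i the row index of each value, and p the offsets at which each column starts. Here is a toy pure-Python sketch of that layout. The 3x3 matrix and the csc_to_dense helper are made up for illustration and have nothing to do with the actual .tome contents.

```python
def csc_to_dense(x, i, p, n_rows, n_cols):
    """Expand a CSC triplet into a dense list-of-lists matrix."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # entries p[col] .. p[col + 1] - 1 belong to column `col`
        for k in range(p[col], p[col + 1]):
            dense[i[k]][col] = x[k]
    return dense

# Toy matrix (illustrative only):
#   [[1, 0, 2],
#    [0, 0, 3],
#    [4, 0, 0]]
x = [1, 4, 2, 3]   # nonzero values, read column by column
i = [0, 2, 0, 1]   # row index of each value
p = [0, 2, 2, 4]   # column 0 -> x[0:2], column 1 -> empty, column 2 -> x[2:4]

dense = csc_to_dense(x, i, p, 3, 3)
for row in dense:
    print(row)
```

Note that p has one more entry than there are columns, and that an empty column shows up as two equal consecutive pointers.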
Once you have the sparse matrix you can do sparse things. If you have the RAM and inclination, you can turn that into a dense dataframe like so:
import numpy as np
import pandas as pd

# this uses the hdf5 file to extract the cell & gene names
# and uses them as row & column indices in the returned dataframe
def sparse_to_labeled_frame(h5f, sparse_matrix):
    cell_names = h5f['sample_names']
    gene_names = h5f['gene_names']

    # toarray() returns a plain numpy array; this is the step that needs the RAM
    dense_matrix = sparse_matrix.toarray()

    # make it a dataframe so we can add row and column labels
    df = pd.DataFrame(dense_matrix)

    # add the header; 'U23' is a fixed string width, so widen it
    # if your gene names come out truncated
    df.columns = gene_names[:].astype('U23')

    # add the cell labels as a column, and make it the row index
    # ('U20' is likewise a fixed width)
    df.insert(0, 'cell_label', cell_names[:].astype('U20'))
    df.set_index('cell_label', inplace=True)
    return df
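Before going dense it is worth checking whether the result will fit in memory. A back-of-the-envelope sketch: a dense array needs rows x columns x bytes per value. The dense_size_gib helper and the example shape below are hypothetical; take the real shape from sparse_matrix.shape and the real item size from sparse_matrix.dtype.itemsize.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 2**30

# Hypothetical shape, for illustration only
n_cells, n_genes = 25_000, 45_000
print(f"about {dense_size_gib(n_cells, n_genes):.1f} GiB as 8-byte values")
```

The DataFrame adds some overhead for the row and column labels on top of this, so treat the number as a lower bound.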
Thank you very much for your quick response. I just ran your code and it works perfectly and does exactly what I needed. This is off to a great start, thank you again!