Batch download: Developing mouse brain

I want to download the data for all available genes in the developing mouse brain dataset, across all time points, mapped to the highest/coarsest Structure Unionization level (i.e., neuroanatomical region).

The examples in the API documentation for, e.g., Structure Unionization don’t work, and I’m unsure how to do this in batch. Given that I want all genes, I assume batch downloading makes more sense than individual API calls (assuming I can get the data after it has been harmonized to the level of regions; i.e., I want to work with tabular data giving expression per region, gene, and time point).

Has anyone done this or can help? Thanks!

Hi,

This is probably going to be a longer answer than you expected. I’ve put together some Python code (end of this post) that will, I think, do what you are asking. That being said, there’s a lot of freedom in this system, so I’m going to try to describe my process so that you can tweak the code should it not be giving you exactly what you want.

General resources

The solution, such as it is, depends on the Allen Institute’s “Restful Model Access (RMA)” API. Some generally useful resources when interacting with this API are:

  • This post, which describes the RMA API more generally
  • This post, which attempts to describe how RMA queries are put together
  • This post, which attempts to catalog the kinds of filters and queries the RMA API can support in the most general sense
  • And this post, which links to the data model backing the API

Specific solution

The Python code below depends on the libraries requests (which handles sending the query URLs to the API and receiving the results) and pandas (mostly for shaping the data and writing it out to a CSV). These can be installed with

pip install requests pandas

The solution performs two queries. First, it queries the model to find all of the NCBI genes that are a part of the “Developing Mouse” product. It then assembles a list of the internal IDs of these genes and uses it to query the StructureUnionize model as described in this post. This code ran for me in about 4 minutes and created a CSV file with ~49,000 rows.

import json
import pandas as pd
import requests
import time


def main():

    # file that will be written out
    dst_path = "structure_unionization_result.csv"

    mouse_gene_id_list = get_list_of_mouse_genes()

    print("====START GETTING ACTUAL EXPRESSION DATA====")

    data = []

    # chunk through mouse gene_id_list
    gene_chunk_size = 500
    for i0 in range(0, len(mouse_gene_id_list), gene_chunk_size):
        gene_chunk = mouse_gene_id_list[i0:i0+gene_chunk_size]
        data += get_expression_by_structure(
            gene_id_list=gene_chunk
        )

    pd.DataFrame(data).to_csv(dst_path, index=False)


def get_list_of_mouse_genes():
    """
    Query the API for all of the known NCBI mouse genes.

    Return a sorted list of their 'id' values, which can be
    used in subsequent queries.

    The ID values being returned are IDs internal to the
    Allen Institute API. When we use these IDs to query
    for the actual gene expression, we will convert them
    to Entrez IDs, acronyms, and names.
    """
    t0 = time.time()
    print("====QUERYING API FOR LIST OF ALL KNOWN MOUSE GENES====")
    start_row = 0
    n_total = 0
    chunk_size = 10000

    id_set = set()

    while True:

        query_url = (
            "http://api.brain-map.org/api/v2/data/query.json?criteria="
            "model::NcbiGene,"
            "rma::criteria,organism[name$il'Mus musculus'],"
            "rma::include,organism,products,"
            "rma::options"
            f"[num_rows$eq{chunk_size}]"
            f"[start_row$eq{start_row}]"
        )

        raw_response = requests.get(query_url)

        try:
            response = json.loads(raw_response.text)
            all_genes = response['msg']

        except Exception:
            print("=======RAW RESULT=======")
            print(raw_response.text)
            print("=======")
            raise

        if response['num_rows'] == 0:
            break

        # only keep genes that are associated with
        # 'Developing Mouse' products
        dev_mouse_genes = []
        for gene in all_genes:
            keep_it = False
            for prod in gene['products']:
                if 'Developing Mouse' in prod['name']:
                    keep_it = True
                    break
            if keep_it:
                dev_mouse_genes.append(gene)

        id_set = id_set.union(
            set([int(gene['id']) for gene in dev_mouse_genes])
        )
        start_row += len(all_genes)
        print(f'    FOUND {len(id_set)} GENES SO FAR')

    dur = time.time()-t0
    print(f'FOUND {len(id_set)} GENES IN {dur:.2e} SECONDS')

    return sorted(id_set)


def get_expression_by_structure(
        gene_id_list):
    """
    Query the API for gene expression in structure unionization
    for all genes whose ids are in gene_id_list

    Return a list of dicts characterizing the results.
    """
    t0 = time.time()
    print(f"QUERYING GENE IDs {min(gene_id_list)} through {max(gene_id_list)}")
    chunk_size = 10000
    start_row = 0

    final_result = []
    while True:

        gene_filter = "genes[id$in" + ",".join([str(g) for g in gene_id_list]) + "]"

        query_url = (
            "http://api.brain-map.org/api/v2/data/"
            "query.json?criteria=model::StructureUnionize,"
            "rma::criteria,"
            "section_data_set[delegate$eqfalse] ("
                f"{gene_filter}"
            "),"
            "structure("
                "structure_sets_structures("
                    "structure_set[name$eq'Developing Mouse - Coarse']"
                ")"
            "),"
            "rma::indclude,section_data_set(specimen(donor(age))),"
            f"&start_row={start_row}&num_rows={chunk_size}"
            "&order=ages.embryonic$desc,ages.days,structures.graph_order"
            "&tabular=structure_unionizes.section_data_set_id,"
                "genes.acronym as gene_acronym,"
                "genes.name as gene_name,"
                "genes.entrez_id as gene_entrez_id,"
                "ages.name as age,"
                "ages.embryonic,"
                "ages.days as age_in_days,"
                "donors.strain as donor_strain,"
                "donors.sex as donor_sex,"
                "structures.acronym as structure_acronym,"
                "structures.name as structure_name,"
                "structures.graph_order as structure_graph_order,"
                "structure_unionizes.expression_energy"
        )

        raw_result = requests.get(query_url)
        try:
            result = json.loads(raw_result.text)
            if start_row == 0:
                print(f"    EXPECT TO FIND {result['total_rows']} DATA ENTRIES")
        except Exception:
            print("=======RAW RESULT=======")
            print(raw_result.text)
            print("=======")
            raise

        if not result['success']:
            print(raw_result.text)
        if result['num_rows'] == 0:
            break
        final_result += result['msg']
        start_row += len(result['msg'])
        print(f'    got {len(final_result)}')
    dur = time.time() - t0
    print(f"    FOUND {len(final_result)} DATA ENTRIES IN {dur:.2e} SECONDS")

    return final_result


if __name__ == "__main__":
    main()
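Once you have the CSV, reshaping it is straightforward with pandas. As a minimal sketch (assuming the column names from the tabular clause above), this pivots the output into a genes-by-(age, structure) matrix:

import pandas as pd

# load the CSV written by the script above
df = pd.read_csv("structure_unionization_result.csv")

# one row per gene, one column per (age, structure) pair;
# aggfunc="mean" averages over donors/section data sets
pivot = df.pivot_table(
    index="gene_acronym",
    columns=["age", "structure_acronym"],
    values="expression_energy",
    aggfunc="mean",
)

print(pivot.shape)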


Amazing! Thank you so much!

It gets 1221 genes, not ~2,000. Is this an error, or does the developing atlas really have only ~1.2k genes?

I tried changing the code to get gene expression for all the genes (± region), but it now seems to be looking at millions of rows and is very slow. Is there a better structural organization I can use, or even a way to get expression across the whole brain? I want as many genes as possible plus temporal resolution; structural resolution is a secondary thing.

def get_expression_by_structure(gene_id_list):
    """
    Query the API for gene expression across all structures and time points
    for all genes whose ids are in gene_id_list.

    Return a list of dicts characterizing the results.
    """
    t0 = time.time()
    print(f"QUERYING GENE IDs {min(gene_id_list)} through {max(gene_id_list)}")
    chunk_size = 10000
    start_row = 0
    final_result = []

    while True:
        gene_filter = "genes[id$in" + ",".join(str(g) for g in gene_id_list) + "]"

        query_url = (
            "http://api.brain-map.org/api/v2/data/"
            "query.json?criteria=model::StructureUnionize,"
            "rma::criteria,"
            # restrict to non-delegate section data sets and genes of interest
            "section_data_set[delegate$eqfalse] ("
                f"{gene_filter}"
            "),"
            # join the Structure model but do not restrict to any structure set
            "structure,"
            # include specimen and donor information for age/time
            "rma::include,section_data_set(specimen(donor(age)))"
            f"&start_row={start_row}&num_rows={chunk_size}"
            # sort by age and structure order (optional)
            "&order=ages.embryonic$desc,ages.days,structures.graph_order"
            # specify the tabular columns you want back
            "&tabular=structure_unionizes.section_data_set_id,"
                "genes.acronym as gene_acronym,"
                "genes.name as gene_name,"
                "genes.entrez_id as gene_entrez_id,"
                "ages.name as age,"
                "ages.embryonic,"
                "ages.days as age_in_days,"
                "donors.strain as donor_strain,"
                "donors.sex as donor_sex,"
                "structures.acronym as structure_acronym,"
                "structures.name as structure_name,"
                "structures.graph_order as structure_graph_order,"
                "structure_unionizes.expression_energy"
        )

        raw_result = requests.get(query_url)
        result = json.loads(raw_result.text)
        if start_row == 0:
            print(f"    EXPECT TO FIND {result['total_rows']} DATA ENTRIES")

        if not result['success'] or result['num_rows'] == 0:
            break

        final_result.extend(result['msg'])
        start_row += result['num_rows']
        print(f"    got {len(final_result)}")

    dur = time.time() - t0
    print(f"    FOUND {len(final_result)} DATA ENTRIES IN {dur:.2e} SECONDS")
    return final_result


Output:
    FOUND 2106 GENES SO FAR
FOUND 2106 GENES IN 1.81e+02 SECONDS
====START GETTING ACTUAL EXPRESSION DATA====
QUERYING GENE IDs 11134 through 14582
    EXPECT TO FIND 3277514 DATA ENTRIES
(then runs for aaages)

I would like to take a step back and ask what you want this data to look like. If you do not want the structure unionization data, do you want something like the data described in this section of the documentation (gene expression quantified over a 3D grid of the brain)?
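If the grid data does turn out to be what you want, a minimal sketch of pulling one dataset’s energy volume might look like the following. The SectionDataSet ID here is a placeholder, and I am assuming the documented grid_data download service, which returns a zip archive containing an energy.raw volume of float32 voxels:

import io
import zipfile

import numpy as np
import requests

# placeholder SectionDataSet ID -- substitute a real one from your queries
dataset_id = 100047257

# the grid-data download service returns a zip archive
url = f"http://api.brain-map.org/grid_data/download/{dataset_id}?include=energy"
response = requests.get(url)
response.raise_for_status()

# we assume the archive contains an 'energy.raw' volume
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    raw = archive.read("energy.raw")

# grid dimensions vary by age/dataset, so read it as a flat array for now
energy = np.frombuffer(raw, dtype=np.float32)
print(energy.shape)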

Ideally, I’d have the data in a csv, with the shape:

gene_acronym,gene_name,gene_entrez_id,age,age_in_days,structure_acronym,structure_name,expression_energy

And possibly adding another column, identifying the specific mouse source, in case the data was gathered from multiple mice/donors.

My goal is to get a dataset with information about as many genes as possible, with information on how they are expressed differently at different developmental steps/time points. Optionally, it would also include information about anatomically/spatially different development (i.e., the mouse brain regions) at a coarse level (e.g., I’d be happy with anatomic resolution at the level of “hindbrain, midbrain, cortex, cerebellum”, but I’ll work with whatever structures already exist and are harmonized).

I’m also assuming expression_energy is a “normalized” metric of gene expression that I can compare and track across regions/time points/donors.

Thanks again very much for the help! I can adapt to any solution you can offer, so long as I get lots of genes, over time, for the developing mouse :).

In that case, I really think you should leave in the

            "structure("
                "structure_sets_structures("
                    "structure_set[name$eq'Developing Mouse - Coarse']"
                ")"
            "),"

from my example code. The way you have edited the API call, the query has no way of knowing that you are only interested in the developing mouse atlas, so it is calling for all instances of those genes across all of our datasets. I assume this is why the number of rows ballooned to more than 3 million.
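Concretely, the criteria section of your edited query would go back to this (only the bare "structure," clause changes):

            "rma::criteria,"
            "section_data_set[delegate$eqfalse] ("
                f"{gene_filter}"
            "),"
            "structure("
                "structure_sets_structures("
                    "structure_set[name$eq'Developing Mouse - Coarse']"
                ")"
            "),"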

If you decide you still want to eliminate the structure set limitation, try reducing the number of genes the code queries for at a time by setting gene_chunk_size to something smaller in main(); you could also set chunk_size to something smaller in the actual query function. If a single API call contains too many rows, that can slow the process down.
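For instance (the exact values are just starting points to experiment with):

    # in main(): query fewer genes per API call
    gene_chunk_size = 100

    # in get_expression_by_structure(): fetch fewer rows per API call
    chunk_size = 5000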

“expression energy” is defined in this section of the “Developing Mouse” post.
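As for identifying the specific mouse/donor: since the tabular clause already pulls donors.strain and donors.sex, I would expect adding donors.id to the same list to work (this is an assumption on my part; check the data model linked above if the query complains):

            "&tabular=structure_unionizes.section_data_set_id,"
                "donors.id as donor_id,"
                # ...then the rest of the columns as before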

And to be clear, now that I think a bit more about your output, this

 FOUND 2106 GENES IN 1.81e+02 SECONDS
====START GETTING ACTUAL EXPRESSION DATA====
QUERYING GENE IDs 11134 through 14582
    EXPECT TO FIND 3277514 DATA ENTRIES

means that there were ~3.3 million rows associated with the first 500 of the 2106 genes you are ultimately querying for, so, scaling naively (3,277,514 × 2106/500), you will be looking at a final CSV with roughly 13–14 million rows.