How to: Use BKP file manifest to access BICAN data at NeMO archive

SvenOtto · July 18, 2024, 10:52pm

Manifest download and formatting

Scientists can download a project’s file manifest from its specimens viewer in the BKP’s Data Catalog.

Example: Download the BICAN rapid release file manifest

Access the Project’s specimen viewer at https://knowledge.brain-map.org/data/BUQ7G50XHDCFCJCQ03A/specimens.
Filter down to a relevant subset of specimens using the UI’s filter capabilities.
Click the download button.
Select “File Manifest” from the modal.

Note: If you experience issues with the filtered file manifest download (e.g. file download not triggering or output not matching expectations), we recommend reloading the specimen viewer and trying again. Another workaround is to download the full metadata & file manifest, combine them, and then subset as desired in a data editor of your choice.
Unzip the downloaded .zip archive to access the readme and manifest.csv. The manifest output will match the filters & results of the user interface.
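The workaround from the note above (download the full metadata and file manifest, combine them, then subset) can also be scripted. The following is only a sketch: it assumes both CSV files share a "Specimen ID" column, and the "Species" column used in the usage note is a hypothetical example of a metadata field, so check the headers in your own download first.

```python
import csv


def read_rows(path):
    """Load a CSV file as a list of dicts (utf-8-sig tolerates a leading BOM)."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def combine(metadata_rows, manifest_rows, key="Specimen ID"):
    """Attach each specimen's metadata to every manifest row sharing its key.

    Assumption: both tables carry the same key column. Manifest rows with no
    matching specimen are dropped.
    """
    by_specimen = {row[key]: row for row in metadata_rows}
    combined = []
    for file_row in manifest_rows:
        meta = by_specimen.get(file_row[key])
        if meta is not None:
            combined.append({**meta, **file_row})
    return combined


def subset(rows, column, wanted):
    """Keep only the rows whose value in `column` equals `wanted`."""
    return [row for row in rows if row.get(column) == wanted]
```

For example, `subset(combine(read_rows("metadata.csv"), read_rows("manifest.csv")), "Species", "chimpanzee")` would keep only the matching file rows; the file names and the filter value here are placeholders.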

Archive tools may require adjusting the manifest’s column names and order to access data.

Example: Adjust the column names & order to match the designations at NeMO archive

BKP Manifest	BKP order	NeMO	NeMO order	Field Meaning
Project ID	1	N/A	—
Specimen ID	2	sample_id	5	library aliquot name
File Name	3	file_id	1	filename
Checksum	4	md5	2	md5 checksum of the file
N/A	—	size	3	file size in bytes
File Type	5	N/A	—
Archive	6	N/A	—
Archive URI	7	urls	4	downloadable file path, either HTTPS path or restricted GCP bucket path

Note: You’ll need to manually add the size column. If there are no known values to fill, populate entries with a hyphen (“-”). Cells must not be empty.
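The renaming and reordering in the table above can be done in a spreadsheet, or scripted. Below is a minimal Python sketch of one way to do it. It assumes the BKP manifest.csv uses exactly the column headers shown in the table, that the NeMO manifest starts with a header row, and that any blank cell (not just size) should become a hyphen; these are assumptions to verify against your own files.

```python
import csv

# Mapping taken from the table above: NeMO column -> BKP manifest column.
# "size" has no BKP counterpart, so it is always filled with a hyphen.
NEMO_TO_BKP = {
    "file_id": "File Name",
    "md5": "Checksum",
    "size": None,
    "urls": "Archive URI",
    "sample_id": "Specimen ID",
}


def bkp_row_to_nemo(row):
    """Convert one BKP manifest row (a dict) to a list in NeMO column order."""
    values = []
    for bkp_col in NEMO_TO_BKP.values():
        value = (row.get(bkp_col) or "").strip() if bkp_col else ""
        values.append(value or "-")  # assumption: blank cells become "-"
    return values


def convert_manifest(bkp_csv_path, nemo_tsv_path):
    """Read the BKP manifest.csv and write a tab-delimited NeMO manifest."""
    with open(bkp_csv_path, newline="", encoding="utf-8-sig") as src, \
         open(nemo_tsv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(NEMO_TO_BKP.keys())  # assumption: header row expected
        for row in csv.DictReader(src):
            writer.writerow(bkp_row_to_nemo(row))
```

Calling `convert_manifest("manifest.csv", "nemo_manifest.tsv")` would write the reformatted file; both paths are placeholders.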

You can then use the manifest in NeMO’s Portal-Client tool. Two alternatives are also listed below.

NeMO Portal-Client tool for downloading files using file manifest

The Portal-Client is a Python-based client for downloading data files hosted by an instance of the portal software developed by the GDC and further modified by the Institute for Genome Sciences (IGS). Users must install the tool and supply a Portal-Client-compatible NeMO manifest file as input. The manifest can be generated and downloaded from the Brain Knowledge Platform’s (BKP) Data Catalog hosted by the Allen Institute for Brain Science. Instructions on how to access and reformat it can be found here in the Allen Brain Map Community Forum. The manifest contains the URLs of the files to be downloaded; these can be HTTPS file paths, restricted GCP bucket file paths, or both. Before downloading controlled access files via GCP bucket paths, the user must be granted access by NeMO to the restricted GCP bucket that contains them (refer to this document for getting approval from NDA to access controlled data at NeMO). Refer to this documentation for installing and using the Portal-Client tool.

Portal-Client compatible manifest template

The manifest file is a five-column, tab-delimited file. The columns, in the order they appear in the manifest, are:
file_id (filename)
md5 (md5 checksum of the file)
size (file size in bytes)
urls (downloadable file path, either HTTPS path or restricted GCP bucket path)
sample_id (library aliquot name)
If the file size is not known, enter a hyphen (“-”) as the value in that column.
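Before handing a manifest to the client, it can help to check it against the template described above. The sketch below is one possible check, not part of the Portal-Client itself; it assumes the manifest begins with a header row naming the five columns.

```python
import csv

EXPECTED_COLUMNS = ["file_id", "md5", "size", "urls", "sample_id"]


def check_manifest_lines(lines):
    """Return a list of problems found in tab-delimited manifest lines."""
    problems = []
    rows = list(csv.reader(lines, delimiter="\t"))
    # Assumption: the first line is a header naming the five columns in order.
    if not rows or rows[0] != EXPECTED_COLUMNS:
        problems.append("header does not match " + ", ".join(EXPECTED_COLUMNS))
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(f"line {number}: expected 5 columns, found {len(row)}")
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"line {number}: empty cell (use '-' for an unknown size)")
    return problems
```

Passing an open file works because a file object iterates over its lines, e.g. `check_manifest_lines(open("nemo_manifest.tsv", newline=""))`, where the file name is a placeholder. An empty list means no problems were found.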

Example of a manifest file:

Basic invocation

The following command is the most basic way of invoking the client by using the manifest file downloaded from the Data Catalog query interface.

portal_client --manifest /path/to/data_catalog_manifest.tsv
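Since the manifest carries an md5 checksum for each file, the downloaded files can be compared against it afterwards. The following sketch is an independent check written for illustration; it assumes the files were saved in a single directory under their file_id names and that the manifest has a header row, neither of which is confirmed by the Portal-Client documentation.

```python
import csv
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest_path, download_dir):
    """Return file_ids whose local copy is missing or has a different MD5."""
    mismatches = []
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumption: each file sits directly in download_dir as <file_id>.
            local = Path(download_dir) / row["file_id"]
            if not local.is_file() or md5_of(local) != row["md5"].lower():
                mismatches.append(row["file_id"])
    return mismatches
```

An empty list from `find_mismatches` means every file listed in the manifest was found and matched its checksum.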

NeMO data collection and landing pages

Data accessible: restricted and open data

A collection is a defined dataset that is generated by bundling data submitted under a specific grant, lab, technique, species, subspecimen type, file format and data use limitation (for controlled access data). Each collection is assigned a unique NeMO identifier with a prefix “nemo:col:”. A meta-collection is a “collection of collections” which is generated by bundling a variety of data (collections). Example: A multiome fastq files meta-collection is generated by bundling ATAC and RNA collections together. The fastq files included in an ATAC collection (first child collection with a unique NeMO identifier) and RNA collection (second child collection with a unique NeMO identifier) are produced by sequencing libraries using two different techniques - 10X Genomics Multiome ATAC and RNA sequencing. Each meta-collection is also assigned a unique NeMO identifier.

The (meta-)collection landing page provides basic metadata about the (meta-)collection and contains a link to a BDBag (an archive file containing downloadable file paths) for downloading the files. These pages are made available at assets.nemoarchive.org. To bring up the landing page for a particular (meta-)collection in a web browser, append the NeMO (meta-)collection identifier (the ‘col’ identifier) to the end of the URL (e.g. NeMO Data Archive Assets). A landing page can be identified as a meta-collection landing page if the “Technique”, “Access”, or “Species” section has more than one value.

The landing page contains an HTTPS link in the “HTTPS URL” section. Clicking it opens an HTTPS location where the open access files associated with the (meta-)collection are released to the public for download. Restricted access files (e.g. restricted human data) and embargoed files are not accessible at the HTTPS location. The page also contains a link to a downloadable BDBag in the “BDBag URL” section. Users will have to install the BDBag software to download the files from the bag. Please refer to this documentation on installing the BDBag tool and downloading files from a BDBag. More information is available here. A meta-collection landing page has a master BDBag (“bag of bags”) linked in the “BDBag URL” section, meaning the master BDBag contains one BDBag for each child collection. Each child BDBag contains a manifest file listing the files available for download from the bag and their associated metadata. One important metadata element in the manifest is the “library_aliquot_nhash_id”, a unique identifier for a library aliquot generated by the NIMP. It can be used to acquire donor and specimen metadata from the NIMP and from the Specimen table of the Brain Knowledge Platform’s (BKP) Data Catalog.

Example of a collection landing page: NeMO Data Archive Assets

Accessing Data

Open data

The HTTPS link provided in the “HTTPS URL” section of the collection landing pages takes the user to an HTTPS server-based browser where open access data is available for download. Data can be downloaded from that location using any tool that supports HTTPS downloads.

The data in the collection can also be downloaded from the BDBag linked in the “BDBag URL” section of the landing page. Users will have to install the BDBag software to download the files from the bag. Please refer to this documentation on installing the BDBag tool and downloading files from a BDBag. More information is available here.

Restricted data

If a collection contains restricted access files, the files within the BDBag linked on the landing page can be downloaded only if the user has approval from the NIMH Data Archive (NDA) to access the data.

If a meta-collection contains a combination of restricted (raw and alignment) and open access (counts, peaks) collections, the meta-collection landing page will have an HTTPS link in the “HTTPS URL” section from which users can download the open access data. This open access data will also be available for download from the child BDBags within the master BDBag. Restricted data will not be available at the HTTPS location; it can be downloaded from a child BDBag within the master bag only if the user has approval from NDA to access the data.

To obtain approval, the user will have to log into NDA and open a request for access to the data at NeMO. After the Data Access Committee (DAC) reviews the request, the user will receive an email notification of the decision. The user will then have to forward that email to NeMO (nemo@som.umaryland.edu) to get access.

Data collections available in Rapid Releases

All data collections made available in Rapid Releases are publicly accessible. There are no controlled access human datasets. Please click the links below to get more information on the collections included in Rapid Releases.

Accessing NeMO Collections from Brain Knowledge Platform (BKP)

There are two ways to find NeMO collection data in the Brain Knowledge Platform’s Data Catalog:

NeMO Collection landing pages linked in Project page of BKP’s Data Catalog
Searching and filtering metadata and downloading a file manifest along with associated specimen and donor metadata from Specimen table in BKP’s Data Catalog
NeMO Collection landing pages linked in Project pages of BKP’s Data Catalog

The NeMO collection landing page URLs are linked under each collection listed in the “DATA COLLECTIONS” section of the master BICAN program page, “BICAN Rapid Release Inventory: Single cell transcriptomics and epigenomics”, in the BKP Data Catalog.

a. Select BICAN Program in Data Catalog search filters:

b. Click on the “DATA COLLECTIONS” option in the Program page, then click on the “NEMO” link to navigate to the corresponding collection landing page, where you will find links to the collection BDBag and the HTTPS path for file download. Details on downloading the files from a BDBag are in the section “NeMO data collection and landing pages” of this document. Details on accessing files from HTTPS links are in the section “HTTPS public access browser” of this document. Allen’s documentation on finding collections is here.

Searching and filtering metadata and downloading a file manifest along with associated specimen and donor metadata from Specimen table of BKP’s Data Catalog

A tutorial on searching the metadata and downloading a file manifest from the Specimen table of BKP’s Data Catalog is posted for users’ reference here: “Download a file manifest for all female chimpanzees from Ed Lein’s - UM1MH130981 BICAN grant”. After reformatting, the file manifest downloaded from the Specimen table, which contains the HTTPS file paths, can be used as input to the Portal-Client tool to download the files. Reformatting instructions can be found here in the Allen Brain Map Community Forum. Details on downloading the files in a manifest using the Portal-Client are in the “NeMO Portal-Client tool for downloading files using file manifest” section of this document.

HTTPS public access browser

The open access BICAN data is released at https://data.nemoarchive.org/. Grant-specific data can be accessed by navigating through the data directory structure. The top (root) level of the HTTPS browser is organized by program. Within each program, data is organized by grant, lab, modality, subspecimen type, technique, species, and data type. Individual files can be downloaded from the browser by right-clicking the file and copying or saving the link. To download from the command line, use any tool that supports HTTP(S) downloads, such as Wget or cURL.
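For scripted downloads, Python’s standard library can do the same job as Wget or cURL. This is a minimal sketch under the assumption that the URL points at an open access file reachable without authentication; the function names are illustrative, and any URL passed in must come from your own manifest or from browsing the directory listing.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Take the last path component of a URL as the local file name."""
    return unquote(PurePosixPath(urlparse(url).path).name)


def download(url, dest_dir="."):
    """Stream one file over HTTPS into dest_dir and return the local path."""
    target = Path(dest_dir) / filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return target
```

Streaming with `shutil.copyfileobj` avoids holding a whole sequencing file in memory, which matters for large fastq files.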

Eg: Index of /bican/grant
