Opening Raw Data via Jupyter notebook made in AWS sagemaker

maciej-123 · July 21, 2021, 11:40am

I have recently connected to the Allen Brain Observatory bucket via AWS and can now download raw data (e.g. spike_band.dat).

However, the raw data is quite large and I would like to access it without downloading anything to my computer in order to choose which data is most suitable for me.

I have attempted to open it via Jupyter notebooks, but I am getting a permission error. This is probably because the files are private and need to be accessed through a dedicated method. However, I do not know of any function of the form get_raw_data(session_id, probe_id) or get_spike_band(session_id, probe_id), as there is with get_ophys_experiments() for the Brain Observatory cache. Does anyone know of a way to open and process the data in Jupyter notebooks?

njmei · July 23, 2021, 5:08pm

Hi @maciej-123,

Thanks for the question! There were some public permissions issues for the S3 bucket for those spike_band.dat files that needed to be addressed. I’ve modified the bucket settings so that those files should now be publicly accessible. Could you try again and let me know how it goes?

Best,

Nick

maciej-123 · July 23, 2021, 7:10pm

Thank you for helping me out. I have restarted Jupyter and attempted to reconnect, but I am still getting the same permission error (I assume no extra steps are needed apart from this).

Did you give public access to all the spike_band files? If not, please let me know which files to access. Also, am I using the correct method to access the data, or should I try some other way?

njmei · July 23, 2021, 7:32pm

Hi @maciej-123 ,

I’ve confirmed that all spike_band.dat files in the S3 bucket are now public. Because of the large size of the raw data, there is unfortunately not a set of AllenSDK classes and methods to allow easy access and download of raw data.
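Since the files are public, one way to look at part of a file without downloading all of it is an HTTP range request against the object's S3 URL. The following is only a sketch under assumptions: the helper names are made up for illustration, the object key is whatever you see when browsing the bucket (it is not spelled out in this thread), and it assumes the bucket allows anonymous HTTPS reads from the us-west-2 region.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

BUCKET = "allen-brain-observatory"
REGION = "us-west-2"


def build_range_request(key, start, length, bucket=BUCKET, region=REGION):
    """Build an unsigned GET request for `length` bytes of an object, starting at byte `start`.

    `key` is the object's path inside the bucket (hypothetical here; take it
    from your own bucket listing).
    """
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"
    last = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{last}"})


def fetch_bytes(key, start, length):
    """Download only the requested byte range (requires network access)."""
    with urlopen(build_range_request(key, start, length)) as resp:
        return resp.read()
```

Calling fetch_bytes(key, 0, 1024) would then return at most the first 1024 bytes of that object, which is enough to check that access works before committing to anything larger.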

If you want to download and look at downsampled data you can take a look at:
https://allensdk.readthedocs.io/en/latest/_static/examples/nb/ecephys_data_access.html

Additional tutorials for working with and analyzing downsampled data can be found at: Visual Coding – Neuropixels — Allen SDK dev documentation

In terms of accessing the raw spike_band.dat files, it looks like you have mounted the S3 bucket as a local file-system. Are you using s3fs? If so, can you share how you are mounting that ecephys raw data S3 bucket?

Best,

Nick

maciej-123 · July 27, 2021, 10:54am

Thank you for your help. I do not quite understand what you mean by mounting the bucket. I am using a Jupyter notebook created via Amazon AWS to try to access the data.

From this sentence on the GitHub page: ‘s3fs allows Linux and macOS to mount an S3 bucket via FUSE’, it appears that I am not using s3fs, as my operating system is Windows 10.

njmei · July 27, 2021, 5:11pm

Hi @maciej-123,

Thanks for the additional information. Can you share details of how you’re setting up that AWS notebook?

Another thing I’m curious about, in your notebook can you try running the following?

from pathlib import Path

data_path = Path("/data/allen-brain-observatory")
print(data_path.is_dir())
print(data_path.exists())

If you’re getting False for both of the above print statements, then it means that the Linux machine that AWS is running your Jupyter notebook on (a.k.a. Sagemaker or similar instance) does not have the S3 bucket mounted on the instance’s file system.
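A slightly fuller version of the same check, as a sketch: it also asks whether the path is a mount point and peeks at a few entries. The function name describe_mount and the max_entries parameter are made up for illustration.

```python
import os
from itertools import islice
from pathlib import Path


def describe_mount(path, max_entries=5):
    """Report whether `path` exists, is a directory, and is a mount point,
    plus the names of up to `max_entries` entries inside it."""
    p = Path(path)
    info = {
        "exists": p.exists(),
        "is_dir": p.is_dir(),
        "is_mount": os.path.ismount(str(p)),
        "sample_entries": [],
    }
    if info["is_dir"]:
        # Only peek at the first few names; listing a whole bucket can be slow.
        with os.scandir(p) as entries:
            info["sample_entries"] = [e.name for e in islice(entries, max_entries)]
    return info


print(describe_mount("/data/allen-brain-observatory"))
```

If is_dir is True but is_mount is False, the directory was created but nothing is mounted on it.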

To the best of my knowledge, standard AWS Jupyter notebook instances cannot access S3 buckets via directory paths (e.g. /data/allen-brain-observatory) out of the box; that capability requires a custom Sagemaker notebook lifecycle configuration.

Here is an example script that our Informatics team uses as the lifecycle_config_on_start.sh to allow Sagemaker instances that we set up to access the public Allen datasets S3 bucket as a file path in notebooks.

#!/usr/bin/env bash

set -xe

mount_s3() {
    mkdir -p /data/allen-brain-observatory
    
    # Install s3fs in our Sagemaker instance's Linux machine
    yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    yum install -y yum-utils
    yum-config-manager -y --enable epel
    yum install -y s3fs-fuse

    # Use s3fs to mount the allen-brain-observatory S3 bucket at the "/data/allen-brain-observatory" file path
    s3fs allen-brain-observatory /data/allen-brain-observatory -o default_acl="public-read" -o complement_stat,uid=0,gid=$GROUP,umask=0222,allow_other,public_bucket="1",endpoint="us-west-2"
}

# ===== Run setup steps =====
echo "Started"

export GROUP=$(id -g ec2-user)

mount_s3

echo "Restarting the Jupyter server.."
restart jupyter-server

echo "Finished"
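
Once the bucket is mounted, a short window of a spike_band.dat file can be read by seeking to the right byte offset rather than loading the whole file. This is a minimal sketch under assumptions that are not stated in this thread: the file is taken to be a flat stream of little-endian int16 values, interleaved as one 384-channel frame per time sample at a nominal 30 kHz. The function name read_window is made up for illustration; please check the layout against the dataset documentation before relying on it.

```python
import sys
from array import array

N_CHANNELS = 384         # assumption: channels per Neuropixels probe
SAMPLE_RATE_HZ = 30000   # assumption: nominal spike-band sampling rate
BYTES_PER_VALUE = 2      # assumption: samples stored as int16


def read_window(path, start_s, duration_s, n_channels=N_CHANNELS, rate=SAMPLE_RATE_HZ):
    """Return one array of int16 values (length n_channels) per time sample
    for the window [start_s, start_s + duration_s)."""
    first_frame = round(start_s * rate)
    n_frames = round(duration_s * rate)
    frame_bytes = n_channels * BYTES_PER_VALUE
    with open(path, "rb") as f:
        f.seek(first_frame * frame_bytes)      # skip straight to the window
        raw = f.read(n_frames * frame_bytes)   # read only what was asked for
    # Drop a trailing partial frame if the window ran past the end of the file.
    raw = raw[: len(raw) - len(raw) % frame_bytes]
    values = array("h")
    values.frombytes(raw)
    if sys.byteorder == "big":
        values.byteswap()  # the data is assumed to be little-endian
    return [values[i:i + n_channels] for i in range(0, len(values), n_channels)]
```

For example, read_window(path, 10.0, 0.1) would return 3000 frames starting 10 seconds into the recording, where path is the location of a spike_band.dat file under /data/allen-brain-observatory as shown in your own file listing.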
