Opening raw data via a Jupyter notebook created in AWS SageMaker

I have recently connected to the Allen Brain Observatory bucket via AWS and now have access to download raw data (i.e. spike_band.dat).

However, the raw data is quite large and I would like to access it without downloading anything to my computer in order to choose which data is most suitable for me.

I have attempted to open it in a Jupyter notebook, but I am getting a permission error. This is probably because the files are private and need to be accessed through a dedicated method. However, I do not know of any function of the form get_raw_data(session_id, probe_id) or get_spike_band(session_id, probe_id), analogous to get_ophys_experiments() for the BrainObservatoryCache. Does anyone know of a way to open and process the data in Jupyter notebooks?

Hi @maciej-123,

Thanks for the question! There were some public permissions issues for the S3 bucket for those spike_band.dat files that needed to be addressed. I’ve modified the bucket settings so that those files should now be publicly accessible. Could you try again and let me know how it goes?

Best,

Nick

Thank you for helping me out. I have restarted Jupyter and attempted to reconnect, but I am still getting the same permission error (I assume no extra steps are needed apart from this).

Did you give public access to all the spike_band files? If not, please let me know which files I can access. Also, am I using the correct method to access the data, or should I try some other way?

Hi @maciej-123 ,

I’ve confirmed that all spike_band.dat files in the S3 bucket are now public. Because of the large size of the raw data, there is unfortunately not a set of AllenSDK classes and methods to allow easy access and download of raw data.

If you want to download and look at downsampled data you can take a look at:
https://allensdk.readthedocs.io/en/latest/_static/examples/nb/ecephys_data_access.html

Additional tutorials for working with and analyzing downsampled data can be found at: Visual Coding – Neuropixels — Allen SDK dev documentation

In terms of accessing the raw spike_band.dat files, it looks like you have mounted the S3 bucket as a local file system. Are you using s3fs? If so, can you share how you are mounting that ecephys raw data S3 bucket?

Best,

Nick

Thank you for your help. I do not quite understand what you mean by mounting the bucket. I am using a Jupyter notebook created via Amazon AWS to try to access the data:

From this sentence on the GitHub page, ‘s3fs allows Linux and macOS to mount an S3 bucket via FUSE’, it appears that I am not using s3fs, as my operating system is Windows 10.

Hi @maciej-123,

Thanks for the additional information. Can you share details of how you’re setting up that AWS notebook?

Another thing I’m curious about: in your notebook, can you try running the following?

from pathlib import Path

data_path = Path("/data/allen-brain-observatory")
print(data_path.is_dir())
print(data_path.exists())

If you’re getting False for both of the above print statements, it means that the Linux machine AWS is running your Jupyter notebook on (i.e. a SageMaker or similar instance) does not have the S3 bucket mounted on the instance’s file system.
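The check above can be extended slightly to also report whether something is actually mounted at that path, since an empty directory with the same name would pass the first two tests. This is only a minimal sketch: describe_path is a hypothetical helper name, not part of the AllenSDK.

```python
import os
from pathlib import Path


def describe_path(path):
    """Report whether `path` exists, is a directory, and is a mount point.

    `describe_path` is a hypothetical helper for illustration only.
    """
    p = Path(path)
    return {
        "exists": p.exists(),
        "is_dir": p.is_dir(),
        # True only if a file system is mounted exactly at this path
        "is_mount": os.path.ismount(str(p)),
    }


print(describe_path("/data/allen-brain-observatory"))
```

If "is_dir" is True but "is_mount" is False, the directory was probably created without the bucket ever being mounted on it.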

To the best of my knowledge, standard AWS Jupyter notebook instances cannot automatically access S3 buckets via directory paths (e.g. /data/allen-brain-observatory). That capability requires a custom SageMaker notebook lifecycle configuration.
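Without a mount, one possible way to look at part of a file without downloading all of it is an HTTP range request against the public bucket. The following is a rough sketch under several assumptions, not a confirmed recipe: it assumes the bucket answers anonymous HTTPS GET requests at the standard S3 URL for the us-west-2 region, and that the raw data are stored as 16-bit integers with 384 channels per time point. The functions range_header and fetch_samples are hypothetical helpers, and the object key has to be supplied by the caller.

```python
import urllib.parse
import urllib.request

# Assumption: standard virtual-hosted S3 URL for the public bucket in us-west-2
BUCKET_URL = "https://allen-brain-observatory.s3.us-west-2.amazonaws.com"
N_CHANNELS = 384       # assumption: channels stored per time point
BYTES_PER_VALUE = 2    # assumption: int16 samples


def range_header(start_sample, n_samples, n_channels=N_CHANNELS):
    """Return an HTTP Range header value covering `n_samples` time points."""
    frame_bytes = n_channels * BYTES_PER_VALUE
    first = start_sample * frame_bytes
    last = first + n_samples * frame_bytes - 1  # Range end is inclusive
    return f"bytes={first}-{last}"


def fetch_samples(key, start_sample, n_samples):
    """Fetch the raw bytes for a block of time points from a public object.

    `key` is the object's path inside the bucket, supplied by the caller.
    """
    request = urllib.request.Request(
        f"{BUCKET_URL}/{urllib.parse.quote(key)}",
        headers={"Range": range_header(start_sample, n_samples)},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling fetch_samples(key, 0, 30000) with the key of a spike_band.dat file would request only the bytes for the first 30000 time points, so a short stretch of each recording can be inspected before deciding what to download in full.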

Here is an example script that our Informatics team uses as the lifecycle_config_on_start.sh, so that the SageMaker instances we set up can access the public Allen datasets S3 bucket as a file path in notebooks.

#!/usr/bin/env bash

set -xe

mount_s3() {
    mkdir -p /data/allen-brain-observatory
    
    # Install s3fs on the SageMaker instance's Linux machine
    yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    yum install -y yum-utils
    yum-config-manager -y --enable epel
    yum install -y s3fs-fuse

    # Use s3fs to mount the allen-brain-observatory S3 bucket at the "/data/allen-brain-observatory" file path
    s3fs allen-brain-observatory /data/allen-brain-observatory -o default_acl="public-read" -o complement_stat,uid=0,gid=$GROUP,umask=0222,allow_other,public_bucket="1",endpoint="us-west-2"
}

# ===== Run setup steps =====
echo "Started"

export GROUP=$(id -g ec2-user)

mount_s3

echo "Restarting the Jupyter server.."
restart jupyter-server

echo "Finished"
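
Once the bucket is mounted with a script like the one above, a notebook can open a spike_band.dat file by path and read only a small block of it. The sketch below is illustrative only: read_samples is a hypothetical helper, the layout (int16 values, 384 channels per time point, native byte order) is an assumption, and the file path has to be filled in by the caller.

```python
from array import array


def read_samples(path, start_sample, n_samples, n_channels=384):
    """Read `n_samples` time points of int16 data starting at `start_sample`.

    Assumes the file is a flat sequence of int16 values, `n_channels` per
    time point, in the machine's native byte order.
    """
    frame_bytes = n_channels * 2   # 2 bytes per int16 value
    values = array("h")            # "h" = signed 16-bit integer
    with open(path, "rb") as f:
        f.seek(start_sample * frame_bytes)             # jump to the first frame
        values.frombytes(f.read(n_samples * frame_bytes))
    # Split the flat array into one list of channel values per time point
    return [
        values[i:i + n_channels].tolist()
        for i in range(0, len(values), n_channels)
    ]
```

Because the function seeks to an offset and reads a fixed number of bytes, it never loads the whole file into memory. Whether the mount transfers only those bytes from S3 depends on how s3fs handles the read.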