Using the raw data


I am trying to use the raw data to do CSD and spike sorting, and I'm not sure about the data format or the time alignment with stimulus presentation.

For the lfp_band.dat and spike_band.dat files, I assume they are binary data that can be memory-mapped as a Matrix{Int16}(nch, nsample), where nch = 384 and nsample = filebytes/2/nch. The nsample values I got were integers, so I assume I got that right.
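In Python, the equivalent check would look something like this (with a small synthetic file standing in for lfp_band.dat; the layout assumption is mine, i.e. all channel values for one sample stored together):

```python
import numpy as np

nch = 384
# Synthetic stand-in for lfp_band.dat: 1000 samples, with all 384 channel
# values for sample 0 written first, then sample 1, and so on
np.zeros(nch * 1000, dtype=np.int16).tofile("demo_band.dat")

raw = np.memmap("demo_band.dat", dtype=np.int16, mode="r")
nsample = raw.size // nch                  # filebytes / 2 / nch
assert raw.size == nch * nsample           # whole number of samples
data = raw.reshape(nsample, nch).T         # shape (384, nsample), like Matrix{Int16}(nch, nsample)
```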

Then, to extract the raw data for each stimulus presentation from these binaries, I use the start_time and stop_time in the dataframe returned by session.get_stimulus_table(['flashes']) for each presentation, along with the 2.5 kHz and 30 kHz sampling rates, to cut out the raw data for lfp_band and spike_band respectively.

However, the CSD I got was quite different from the one calculated by csd = session.get_current_source_density(probe_id), so I am wondering whether the stimulus presentation timeline is the same for the raw data. I checked the overall duration of all stimuli in session.get_stimulus_epochs() against each of the binaries (nsample/sampling_rate), and they are almost the same length.

I am not sure where it goes wrong. Would someone who knows the details clarify what I might be missing?


Hi Alex – you’re definitely on the right track in terms of loading the data. We just use np.memmap to read in the raw .dat files as 16-bit integers. A scaling factor of 0.195 is then used to convert the values to microvolts.

To get the correct sample index for a particular stimulus presentation, you first need to find the index in the LFP data returned by the AllenSDK. This is an xarray DataArray with dimensions of time and channel. If you have a presentation start time (in seconds), you can find the corresponding index for the various arrays using the following method:

index = np.searchsorted(lfp.time, start_time)
raw_lfp_index = index * 2  # convert from 1250 Hz to 2500 Hz
raw_spike_index = index * 24 # convert from 1250 Hz to 30000 Hz
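Concretely, putting the memmap and the index lookup together might look like the sketch below (the file name, time axis, and window length are synthetic stand-ins for the real lfp_band.dat and lfp.time):

```python
import numpy as np

nch = 384
fs_nwb, fs_lfp_raw, fs_spike_raw = 1250, 2500, 30000

# Synthetic raw lfp_band file: 5000 all-ones samples at 2500 Hz
np.ones(nch * 5000, dtype=np.int16).tofile("demo_lfp_band.dat")
raw_lfp = np.memmap("demo_lfp_band.dat", dtype=np.int16, mode="r").reshape(-1, nch).T

# Stand-in for lfp.time: 1250 Hz NWB LFP timestamps starting at t0 ~ 4 s
lfp_time = 4.0 + np.arange(2500) / fs_nwb

start_time = 4.5                               # presentation onset (s)
index = np.searchsorted(lfp_time, start_time)  # index into the 1250 Hz LFP
raw_lfp_index = index * 2                      # 1250 Hz -> 2500 Hz
raw_spike_index = index * 24                   # 1250 Hz -> 30000 Hz

# 100 ms of raw LFP after onset, scaled to microvolts
window = int(0.1 * fs_lfp_raw)
snippet_uv = raw_lfp[:, raw_lfp_index : raw_lfp_index + window] * 0.195
```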

Once you have the correct indices, there are a few additional transformations that need to happen to match the LFP data that is returned by the AllenSDK:

  • The channels are reordered according to this mapping
  • The data is high-pass filtered to remove the DC offset
  • The median value of out-of-brain channels is subtracted to remove noise
  • The data is downsampled in space (4x) and time (2x)
  • Reference channels and channels far outside the brain are removed
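A rough sketch of those transformations on a synthetic array is below. Note the specifics here are illustrative assumptions: the identity channel_mapping stands in for the real probe-specific remapping, per-channel mean subtraction stands in for the high-pass filter, and the out-of-brain channel set is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(384, 2000))            # stand-in for raw LFP (channels, samples)

# 1. Reorder channels (identity here; the real mapping is probe-specific)
channel_mapping = np.arange(384)
data = data[channel_mapping, :]

# 2. Remove the DC offset (the real pipeline high-pass filters;
#    per-channel mean subtraction is a crude stand-in)
data = data - data.mean(axis=1, keepdims=True)

# 3. Subtract the median of out-of-brain channels (assume, for
#    illustration, the last 50 channels sit outside the brain)
data = data - np.median(data[-50:, :], axis=0)

# 4. Downsample 4x in space and 2x in time
lfp_like = data[::4, ::2]
```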

Let me know if that helps!

Hi Joshs,

I found a mistake in my estimate of the total duration. I checked that the lfp_band.dat and spike_band.dat files are almost the same duration, but when I compare them with the start_time and stop_time of the first and last stimuli, it seems the raw recording continues for ~1-2 hours after all stimuli have been presented.

Anyway, when I checked the lfp.time returned by the AllenSDK, the starting time is around 4 s instead of 0. That explains why my CSD looks abnormal: the recording start time t0 is not zero on the stimulus timeline.

For simplicity, I assume I can just subtract the t0 of the raw data from the stimulus start_time, and then map to indices using the corresponding sampling rates, where t0 = lfp.time[0]?

I also checked the t0 for several probes in the same session, and they seem to be the same. Can I assume they are all equal, so I can use a single t0 instead of a separate t0 for each probe?

The LFP the AllenSDK returns seems out of order with respect to the channel layout on the probe, since as you mentioned the channels have been remapped. But if I assume column order when memory-mapping the binary data, the CSD looks similar to the one returned by the SDK.

(I assume all channel values for one sample are saved together, then the next sample, so the 1D binary stream should be reshaped column-wise into a matrix(nch, nsample); here the channel order is probe left column → right column, then tip → bottom.)



Hi Joshs,

It seems the system blocked my previous reply, so I am writing this one.

After checking the start time of lfp.time returned by the AllenSDK, it seems the Neuropixels recording started at ~4 s on the master clock, while the first stimulus presentation was at ~8 s. So for simplicity, I just use lfp.time[0] as the t0 of the raw data. The stimulus times can be shifted to raw-data times by subtracting t0 from the stimulus start_time; then I use the sampling rate to get the indices of each presentation.
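In other words, the simplification I'm proposing is (all numbers here are synthetic):

```python
import numpy as np

fs_lfp_raw = 2500.0
t0 = 3.912            # lfp.time[0]: recording start on the master clock (synthetic)
stim_start = 8.35     # presentation start_time from the stimulus table (synthetic)

# Shift the stimulus time onto the raw-data timeline, then convert to a sample index
raw_index = int(round((stim_start - t0) * fs_lfp_raw))
```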

I only checked the t0 of a few probes in the same session, and they are almost the same. Can I assume the t0 of every probe in the same session is equal?

For clarity: the memory-mapped 1D binary file should be reshaped column-wise into a matrix with nrow = 384 and ncol = nsample? The resulting CSD now looks similar to the one returned by the AllenSDK.


Hi Li – each probe may actually have a slightly different start time, as well as a different sample rate. So you should use the method I described to find the correct index for each probe individually.

You’re right about the format of the binary files – there are always 384 rows.

I assume the scaling factor of 0.195 you mentioned was for the raw lfp_band; does it also apply to the raw spike_band?

That’s right.