Questions on Timescale of Cell Types EPHYS data

Hello, my name is Tyler Brassel with the QUON-titative biology lab. Our team is interested in applying deep learning approaches to the Allen Cell Types Database, and we have a few questions regarding the EPHYS data. When we began examining long square datasets, we noticed that two different sampling rates were used, 50,000 Hz and 200,000 Hz, most likely reflecting an upgrade of the experimental apparatus. To use a neural network to better understand the features of this data, we need to downsample it so that all points fall on a common timescale. We initially believed that we could simply find the greatest common factor of the two sampling rates and downsample accordingly.
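For concreteness, our initial plan amounts to something like the minimal sketch below. The helper names are our own and hypothetical (they are not part of the AllenSDK), and the downsampling is plain decimation with no anti-aliasing filter, so it only illustrates the idea.

```python
from math import gcd

# Sampling rates we observed in the long square sweeps (Hz).
RATES_HZ = (50_000, 200_000)


def decimation_factors(rates_hz):
    """Return the common rate and, per rate, the integer step that reaches it."""
    common = 0
    for rate in rates_hz:
        common = gcd(common, rate)  # gcd(0, x) == x, so this folds over all rates
    return common, {rate: rate // common for rate in rates_hz}


def naive_downsample(samples, factor):
    """Keep every `factor`-th sample (no anti-aliasing filter; illustration only)."""
    return samples[::factor]


common_rate_hz, factors = decimation_factors(RATES_HZ)
print(common_rate_hz, factors)
```

With these rates the common rate is 50,000 Hz, and the 200,000 Hz sweeps are reduced by a factor of 4. Applied to the sample counts we describe next, this still leaves sweeps of 401,000, 401,001, and 401,050 points, which is why the approach alone did not work for us.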

Upon closer examination, however, we found that the sweeps do not share a consistent number of data points, even after accounting for the different sampling rates. For example, among sweeps with a long square stimulus of -110 pA, 323 sweeps have 401,000 data points, 6 sweeps have 401,050 data points, 8 sweeps have 1,604,001 data points, and 1 sweep has 1,604,000 data points. Similar differences appear across the dataset. It makes sense that sweeps recorded at 4 times the sampling rate would have about 4 times as many data points, but this does not explain the additional 1 or 50 data points.
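To put these counts on a common footing, the short sketch below converts each one to a duration. Pairing each count with a sampling rate is our assumption, based on the roughly 4x ratio between the two groups; we have not confirmed it for every sweep.

```python
from fractions import Fraction

# (sample count, assumed sampling rate in Hz, number of sweeps) for the -110 pA example.
# The rate assigned to each count is our assumption, not something read from the files.
OBSERVED = [
    (401_000, 50_000, 323),
    (401_050, 50_000, 6),
    (1_604_001, 200_000, 8),
    (1_604_000, 200_000, 1),
]


def duration_s(n_samples, rate_hz):
    """Exact sweep duration in seconds, taking duration = n_samples / rate."""
    return Fraction(n_samples, rate_hz)


# Use the most common sweep length as the reference duration.
reference = duration_s(401_000, 50_000)

# Extra duration of each group relative to the reference, in microseconds.
excess_us = {
    (n, rate): float((duration_s(n, rate) - reference) * 1_000_000)
    for n, rate, _ in OBSERVED
}

for (n, rate), extra in excess_us.items():
    print(f"{n:>9} samples @ {rate:>6} Hz: {float(duration_s(n, rate)):.6f} s, +{extra:g} us")
```

Under that assumption, the extra 50 points correspond to 1 ms of additional recording at 50,000 Hz, and the extra 1 point corresponds to 5 microseconds at 200,000 Hz, so the sweeps differ slightly in duration and not only in sampling rate.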

A look at the documentation suggests that some sweeps include a test pulse. We wonder whether the discrepancy in the number of data points reflects the presence of a test pulse. In addition, a team member found that one of the sweeps had an index_range of (150000, 1604000), meaning the index_range does not start at 0. Even when we tried to account for this range, we still got several different output lengths corresponding to the data point counts mentioned above. Our questions are as follows:

  1. Why do some sweeps have 1 or even 50 more data points than others?
  2. What are the characteristics of a test pulse and how can we best detect a test pulse at the beginning of a sweep?
  3. Does the index_range attribute include or exclude the test pulse?
  4. Why does the index_range start at a number other than 0?
  5. Given this information, what do you feel is the best way to derive a common timescale for this data? We have unsuccessfully tried using functions like allensdk.ephys.extract_cell_features.get_square_stim_characteristics(), which relies on knowing beforehand whether a test pulse is present.
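To make question 5 more concrete, here is a minimal sketch of one alignment strategy we are considering: locate the long square in the stimulus trace, crop a fixed time window around its onset, and then decimate to the common rate. The function names are hypothetical and this is not the AllenSDK's method. It assumes that the stimulus sits exactly at a baseline of 0 outside of pulses and that the long square is the longest non-baseline epoch, so a shorter test pulse would be skipped without our knowing in advance whether it is there. We would appreciate hearing whether those assumptions hold.

```python
import numpy as np


def longest_step_bounds(stimulus, baseline=0.0):
    """Return (start, end) sample indices of the longest run where stimulus != baseline.

    Assumption: the long square is the longest non-baseline epoch, so a shorter
    test pulse (if any) is ignored. `end` is exclusive.
    """
    active = np.asarray(stimulus) != baseline
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], active, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    if starts.size == 0:
        raise ValueError("stimulus never leaves baseline")
    longest = np.argmax(ends - starts)
    return int(starts[longest]), int(ends[longest])


def align_to_onset(response, stimulus, rate_hz, target_rate_hz, pre_s, post_s):
    """Crop a fixed time window around stimulus onset, then keep every k-th sample."""
    if rate_hz % target_rate_hz:
        raise ValueError("rate_hz must be an integer multiple of target_rate_hz")
    step = rate_hz // target_rate_hz
    onset, _ = longest_step_bounds(stimulus)
    start = onset - round(pre_s * rate_hz)
    stop = onset + round(post_s * rate_hz)
    if start < 0 or stop > len(response):
        raise ValueError("requested window extends beyond the sweep")
    # Plain decimation: no anti-aliasing filter is applied here.
    return np.asarray(response)[start:stop:step]
```

Because the window is defined in seconds relative to onset, sweeps with different total lengths or sampling rates would come out with the same number of points, provided the window fits inside every sweep.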

Thank you for your time. Any help answering these questions is appreciated.