How are the different electrophysiological features (used in the clustering analysis) extracted and arranged for a given cell?

Hi everyone,

I am trying to reproduce the electrophysiological clustering from Gouwens et al. 2019 on our own database of electrophysiological recordings, starting from the raw stimulus and membrane voltage traces, and I have some questions about how the features are extracted and organized before being used in the clustering.
Here are my questions:

  • About the Action potential waveform (and its derivative): for a given cell, does the waveform come from the first sweep with at least 5 APs, or is it averaged over all sweeps with APs?
  • About the Instantaneous Firing Rate (and all the other features that are ‘binned, using steps from rheobase to rheobase + 100 pA’): the feature corresponds to an array of binned values, but how are these arrays arranged across the different sweeps? Are they concatenated together or averaged across all sweeps?
  • Also, am I right to suppose that each feature is stored as an array, and that all these arrays are stored in a JSON file for a specific cell, which is then used in the clustering analysis?

Thank you for any help you can give me,

Best wishes,

Julien

In Gouwens et al. 2019, we use three action potential waveforms - one from the short square stimulus (3 ms current step), one from the long square stimulus (1 s current step), and one from the ramp stimulus. For the long square action potential, we use the first action potential elicited by the rheobase sweep (so there is no minimum number of APs required), and we don’t average it.

Those binned long-square features from sweeps with different amplitudes are concatenated together, not averaged. You can see an example of the concatenated, binned AP threshold values in Figure 2c in Gouwens et al. 2019.
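As a rough illustration of that arrangement (the bin count, step size, and values below are made up):

import numpy as np

# Hypothetical binned AP threshold values for six long-square sweeps,
# from rheobase to rheobase + 100 pA in 20 pA steps; each sweep is
# reduced to n_bins time bins.
n_bins = 20
binned_threshold_per_sweep = [
    np.random.rand(n_bins),  # rheobase
    np.random.rand(n_bins),  # rheobase + 20 pA
    np.random.rand(n_bins),  # rheobase + 40 pA
    np.random.rand(n_bins),  # rheobase + 60 pA
    np.random.rand(n_bins),  # rheobase + 80 pA
    np.random.rand(n_bins),  # rheobase + 100 pA
]

# The sweeps are concatenated end-to-end (not averaged), giving one long
# vector per cell, as in Figure 2c.
feature_vector = np.concatenate(binned_threshold_per_sweep)
print(feature_vector.shape)  # (120,)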

The features are stored as cell-by-feature 2D arrays and saved by the IPFX package as either a set of NumPy files (one for each array) or a single HDF5 file (now the default, with the different arrays stored as different datasets in the file).
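For example, a minimal sketch of reading the HDF5 output back (the file name and dataset names here are illustrative, not necessarily the exact ones IPFX writes):

import h5py

# Each dataset is a 2D cell-by-feature array; row i of every array
# corresponds to the same cell.
with h5py.File("feature_vectors.h5", "r") as f:
    ids = f["ids"][:]                # per-cell identifiers (assumed dataset name)
    first_ap_v = f["first_ap_v"][:]  # shape: (n_cells, n_waveform_points)
    inst_freq = f["inst_freq"][:]    # shape: (n_cells, n_binned_points)

print(first_ap_v.shape, inst_freq.shape)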

Hello,

Sorry for my delayed answer. Thank you for these clarifications.
Regarding the action potential waveform (and its derivative), my question came from the fact that in the feature extraction code I saw:

def first_ap_vectors(sweeps_list, spike_info_list,
        target_sampling_rate=50000, window_length=0.003,
        skip_clipped=False):
    """Average waveforms of first APs from sweeps

    Parameters
    ----------
    sweeps_list: list
        List of Sweep objects
    spike_info_list: list
        List of spike info DataFrames
    target_sampling_rate: float (optional, default 50000)
        Desired sampling rate of output (Hz)
    window_length: float (optional, default 0.003)
        Length of AP waveform (seconds)

    Returns
    -------
    ap_v: array of shape (target_sampling_rate * window_length)
        Waveform of average AP
    ap_dv: array of shape (target_sampling_rate * window_length - 1)
        Waveform of first derivative of ap_v
    """

    if skip_clipped:
        nonclipped_sweeps_list = []
        nonclipped_spike_info_list = []
        for swp, si in zip(sweeps_list, spike_info_list):
            if not si["clipped"].values[0]:
                nonclipped_sweeps_list.append(swp)
                nonclipped_spike_info_list.append(si)
        sweeps_list = nonclipped_sweeps_list
        spike_info_list = nonclipped_spike_info_list

    if len(sweeps_list) == 0:
        length_in_points = int(target_sampling_rate * window_length)
        zero_v = np.zeros(length_in_points)
        return zero_v, np.diff(zero_v)

    swp = sweeps_list[0]
    sampling_rate = int(np.rint(1. / (swp.t[1] - swp.t[0])))
    length_in_points = int(sampling_rate * window_length)

    ap_list = []
    for swp, si in zip(sweeps_list, spike_info_list):

        ap = first_ap_waveform(swp, si, length_in_points)
        ap_list.append(ap)

    avg_ap = np.vstack(ap_list).mean(axis=0)  # Do the average of all first APs?

    if sampling_rate > target_sampling_rate:
        sampling_factor = int(sampling_rate // target_sampling_rate)
        avg_ap = _subsample_average(avg_ap, sampling_factor)

    return avg_ap, np.diff(avg_ap)

where the output “avg_ap” is the result of an average made by avg_ap = np.vstack(ap_list).mean(axis=0), and ap_list is a list of first-AP waveforms from multiple sweeps. I wondered why this averaging is present; can you enlighten me, please?

I also have another question about the spike trough (fast and slow) and how they are defined, because the Materials & Methods of Gouwens et al. 2019 says:

For each AP, several features were calculated as follows: threshold; peak; fast trough (defined as where the dV/dt was 1% of the peak downstroke)

whereas the electrophysiology overview technical whitepaper, available on the Allen Cell Types web page, describes it as:

Action potential fast trough: Minimum value of the membrane potential in the interval lasting 5 ms after the peak.

Can you tell me which of these measurements is used as the feature in the clustering analysis (“AP fast trough across long steps”), please?
Also, I suppose that for the computation of the “width at half-height”, the trough value used is the “Action potential trough: Minimum value of the membrane potential in the interval between the peak and the time of the next action potential” (found on the Allen Cell Types web page). Am I right?

Thank you for your help!
Best regards,
Julien

We do average the AP waveforms from the short squares if there are multiple sweeps run at the same minimum short square amplitude (although often there is only one). For the long squares, there is only ever one rheobase sweep used, so the APs don’t get averaged.

We did change the definition of the fast trough after the initial release of the Cell Types Database (prior to any publications using that data) to the one based on the derivative of the voltage. The new one is what we have used for our clustering analyses.
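As a rough sketch of that derivative-based definition (illustrative only, not the exact IPFX implementation; names and boundary handling are simplified):

import numpy as np

def fast_trough_index(v, t, peak_index, end_index):
    """Point after the AP peak where dV/dt first recovers to 1% of the
    peak downstroke (a sketch of the definition quoted above)."""
    dvdt = np.diff(v) / np.diff(t)
    downstroke_region = dvdt[peak_index:end_index]
    peak_downstroke = downstroke_region.min()  # most negative dV/dt
    downstroke_index = peak_index + int(np.argmin(downstroke_region))
    # First sample after the peak downstroke where dV/dt is back within
    # 1% of the peak downstroke magnitude
    recovered = np.flatnonzero(dvdt[downstroke_index:end_index] >= 0.01 * peak_downstroke)
    if recovered.size == 0:
        return end_index  # fallback: no recovery found in the window
    return downstroke_index + int(recovered[0])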

You are also correct that the trough (reported and used for spike width definition) is just the minimum value between spikes.
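So a sketch of the width-at-half-height calculation would look something like this (illustrative; the real code presumably handles sub-sample interpolation and edge cases more carefully):

import numpy as np

def width_at_half_height(v, t, peak_index, trough_index):
    """Spike width at the voltage halfway between the AP peak and the
    between-spike trough, per the reply above (assumes a well-formed spike)."""
    half_level = v[trough_index] + (v[peak_index] - v[trough_index]) / 2.0
    # Last sample below the half-height level on the rising phase
    rise_index = np.flatnonzero(v[:peak_index] < half_level)[-1]
    # First sample back below the half-height level on the falling phase
    fall_index = peak_index + np.flatnonzero(v[peak_index:trough_index + 1] < half_level)[0]
    return t[fall_index] - t[rise_index]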


Thank you very much for these clarifications!

Hi again,
Regarding the first_ap_vectors extraction, according to the code,


def first_ap_vectors(sweeps_list, spike_info_list,
        target_sampling_rate=50000, window_length=0.003,
        skip_clipped=False):
    """Average waveforms of first APs from sweeps

    Parameters
    ----------
    sweeps_list: list
        List of Sweep objects
    spike_info_list: list
        List of spike info DataFrames
    target_sampling_rate: float (optional, default 50000)
        Desired sampling rate of output (Hz)
    window_length: float (optional, default 0.003)
        Length of AP waveform (seconds)

    Returns
    -------
    ap_v: array of shape (target_sampling_rate * window_length)
        Waveform of average AP
    ap_dv: array of shape (target_sampling_rate * window_length - 1)
        Waveform of first derivative of ap_v
    """

    if skip_clipped:
        nonclipped_sweeps_list = []
        nonclipped_spike_info_list = []
        for swp, si in zip(sweeps_list, spike_info_list):
            if not si["clipped"].values[0]:
                nonclipped_sweeps_list.append(swp)
                nonclipped_spike_info_list.append(si)
        sweeps_list = nonclipped_sweeps_list
        spike_info_list = nonclipped_spike_info_list

    if len(sweeps_list) == 0:
        length_in_points = int(target_sampling_rate * window_length)
        zero_v = np.zeros(length_in_points)
        return zero_v, np.diff(zero_v)

if there is no sweep with at least one spike (for example, for the short square stimuli), then the value is replaced by an array of zeros, as seen in this plot, where for this cell there were no short square or ramp protocols.
[plot: First_AP_waveform]

Similarly, I have a question regarding the binned features (peak, threshold, etc.). I saw in your documentation that when a bin has no data, you fill it in from the neighboring bins that contain data. Then,

  • if the first data-containing bin is not bin 0, the earlier bins take the value of the first data-containing bin (and similarly for the last bins);
  • if there are bins without data between two data-containing bins, you do a linear interpolation between those two bins.

That can be seen in this plot from sweep 29 of cell 517330781, where the first spike occurs in the 5th bin (triangle), so the first four bins, in which there is no spike (circles), take the same value as the 5th. Similarly, between two data-containing bins (triangles), the bin values are linearly interpolated (circles).
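In NumPy terms, I understand this scheme to be equivalent to something like the following (made-up bin values; np.interp linearly interpolates between data-containing bins and holds the first/last data-containing value constant at the edges):

import numpy as np

n_bins = 12
bin_values = np.full(n_bins, np.nan)
bin_values[[4, 7, 11]] = [0.8, 1.4, 1.1]  # bins that actually contain spikes

has_data = ~np.isnan(bin_values)
filled = np.interp(np.arange(n_bins),
                   np.flatnonzero(has_data),
                   bin_values[has_data])
print(filled)  # bins 0-3 copy bin 4; bins 5-6 and 8-10 are linear in between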


I also understood that if a sweep was missing, you interpolated in the same way, bin by bin, from the two neighboring data-containing sweeps.

My question is: what is the motivation for filling in these values (by assigning arrays of zeros, or by linear interpolation from neighboring bins/sweeps) rather than treating the array/bin/sweep as empty (with NaN values, as I think you do when there is no sub-threshold trace for a cell)?

I realize it is a lot of points, so I thank you very much for all your help!
Best,

Julien

If I’m remembering correctly, the zero-valued sections for absent spikes are somewhat of a historical artifact; some piece of the early processing pipeline wasn’t handling NaN values properly (or something along those lines). But you are correct that it’s a sign of missing data - the drcme package, for example, has an option to check for an all-zero ramp spike and will drop those cells if a ramp spike is required; in other cases, we don’t check for that spike but also don’t use the ramp spikes from any of the cells (so we only analyze the first two-thirds of that feature vector).
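For illustration, those two behaviors could look something like this (assuming, purely for the sketch, that the first-AP vector concatenates equal-length short square, long square, and ramp segments):

import numpy as np

def has_ramp_spike(first_ap_v):
    # The last third of the vector is the ramp segment under this assumption
    # (np.split requires the length to be divisible by 3 here)
    ramp_segment = np.split(first_ap_v, 3)[-1]
    return bool(np.any(ramp_segment != 0))

def drop_ramp_segment(first_ap_matrix):
    # Keep only the first two-thirds (short square + long square) for all cells
    n_points = first_ap_matrix.shape[1]
    return first_ap_matrix[:, : 2 * n_points // 3]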

The interpolation for the features of the spikes during long square steps is for a somewhat different reason - on long square stimuli, the timing of spikes is not that precise and can vary from trial to trial with the same stimulus amplitude. Since we are using the feature vector outputs for clustering and classification, we didn’t want the specific timing of spikes on a given trace to have an outsized effect on the type definitions. We therefore decided to interpolate the spike features to allow comparisons across all cells while reducing the influence of particular spike times. One way to think about this is that these feature vectors represent our best estimate of what features a spike would have had if it had occurred at that particular time during the sweep. But the main point is that the decision was driven by our specific use of these feature vectors for clustering across the entire data set; it could certainly be represented in other ways, as you say.