Running the spca fit with the drcme package

Hello,

I am trying to run the spca fit from the drcme package with the h5 file obtained after extracting the vectors with the ipfx package.

When I run this line :

run_spca_fit.main(params_file="default_spca_params.json",output_dir="E:/MARGAUX/drcme_1/drcme-master/output_1",output_code="output_1",datasets=[{"fv_h5_file":"fv_test.h5","metadata_file": None,"dendrite_type": "all","limit_to_cortical_layers": []}])

I get an error telling me that I didn’t fill in the ‘allow_missing_structure’ argument despite it being set as required=False. When I add ‘allow_missing_structure’=False (the default value), I get the same type of error for “need_ramp_spike”,“allow_missing_dendrite” and “id_file”. I set them all to their default value (respectively True, False, None).

run_spca_fit.main(params_file="default_spca_params.json",output_dir="E:/MARGAUX/drcme_1/drcme-master/output_1",output_code="output_1",datasets=[{"fv_h5_file":"fv_test.h5","metadata_file": None,"dendrite_type": "all","limit_to_cortical_layers": [],"allow_missing_structure":False,"need_ramp_spike":True,"allow_missing_dendrite":False,"id_file":None}])

But then I have this error :

Traceback (most recent call last):
  File "spca_1.py", line 6, in <module>
    run_spca_fit.main(params_file="default_spca_params.json",output_dir="E:/MARGAUX/drcme_1/drcme-master/output_1",output_code="output_1",datasets=[{"fv_h5_file":"fv
_test.h5","metadata_file": None,"dendrite_type": "all","limit_to_cortical_layers": [],"allow_missing_structure":False,"need_ramp_spike":True,"allow_missing_dendrite"
:False,"id_file":None}])
  File "E:\MARGAUX\drcme_1\drcme-master\drcme\bin\run_spca_fit.py", line 132, in main
    subset_for_spca = spca.select_data_subset(data_for_spca, spca_zht_params)
  File "E:\MARGAUX\drcme_1\drcme-master\drcme\spca.py", line 74, in select_data_subset
    subset_for_spca[k] = data[:, indices]
IndexError: index 450 is out of bounds for axis 1 with size 447

What seems to be the problem ?

Best,
Margaux

You’re getting that error because the default_spca_params.json file is not matching up with the feature vector file you’ve produced with IPFX. The default_spca_params.json file in the repository is the original one that worked with IPFX outputs from original versions of the code. You could try using the file spca_inh_patchseq_cell_2020_config.json as a starting point instead, since that should line up better with the current output from IPFX. You may need to adjust it further, too, because the electrophysiology protocols changed somewhat from the 2019 study to the 2020 Patch-seq. But it should be a closer starting point.

Also, the documentation about the sPCA configuration on this page:
https://drcme.readthedocs.io/en/latest/getting_started.html#sparse-principal-component-analysis
which explains what the different values in that file mean. That could help you determine which values need to be adjusted to match up with your output file.

I understand how to adjust the number of components, but not the number of non-zero values for each kept component.

For example in the spca_inh_patchseq_cell_2020_config.json the ‘first_ap_v’ data set initially keeps 7 components with a list of non-zero components equal to [267, 233, 267, 233, 233, 250, 233].

In my output file after running the spca fit, the spca_components_used_output.json returns 5 indices, meaning that I have to keep 5 components because they are the only ones with an adjusted explained variance greater than 1%. Thus I modify the spca_inh_patchseq_cell_2020_config.json so that n_components equals 5, which means that the non-zero component list must take a list of 5 values. But do these 5 values correspond to the first 5 values of the non-zero component list (i.e. [267, 233, 267, 233, 233]) originally implemented in the spca_inh_patchseq_cell_2020_config.json ? If so, how were these values initially chosen? If not, how can I determine them ?

The values for the non-zero loadings per component were chosen by working with random subsets of the data and seeing how the adjusted explained variance for that component changed as its number of non-zero loadings was varied - generally, we tried to pick values that were as low as possible before the explained variance started to fall off rapidly.

The spca_components_used JSON file has the indices for each component, so that’s how you’d match them up with the number of non-zero loadings. They are also usually consecutive. So if you ask it to calculate the sPCA with 7 components (and give it the non-zero loading for components list with 7 elements), and it only keeps 5 components that exceed the adjusted explained variance threshold, the spca_components_used JSON will say which 5 of the 7 were kept in the “indices” entry - if it was the first 5, the “indices” array will be [0, 1, 2, 3, 4].

You don’t need to necessarily re-run anything in that case - when we were doing the analysis, we usually wanted to have enough components so that at least one of them would be below the cut-off to ensure we weren’t unintentionally excluding components that would have exceeded the explained variance threshold.

You can also see more information about what values are being kept by running the script with the --log_level INFO or --log_level DEBUG (for even more information) as command line arguments to the run_spca_fit script, in case that’s helpful.

But in the case of some features present in the h5 output coming from ipfx but not present in the spca_inh_patchseq_cell_2020_config.json (i.e. psth), I still have to redo these steps to determine their n_components and nonzero_component_list keys and thus add these new features to the config file.

And on the contrary I want to have the inst_freq_norm feature (which is present in the spca_inh_patchseq_cell_2020_config.json) but it was not taken into account in my h5 output coming from the ipfx. Would there be a way during the feature extraction to specify the features I want to be extracted ?

I was also wondering, why didn’t you take all the electrophysiological features used in the 2019 paper and also used it the 2020 paper (i.e. missing the psth feature) ?

That’s true, you would have to do those steps for a newly added feature. You could probably start with a similar feature to get a reasonable starting point, but it would be an iterative process. There isn’t any specific code written for that process, so it’d probably be faster overall to load specific data sets and call individual functions rather than use the whole run_spca_fit script.

We stopped using the PSTH feature because it seemed mostly redundant with the inst_freq feature (the PCs from each were highly correlated). So it seemed simpler to use just one of them.

The inst_freq_norm feature is actually calculated from the inst_freq feature vector by the DRCME code, so it isn’t in the HDF5 file on its own. So if inst_freq_norm and inst_freq are in the configuration file, it will be used in the sPCA analysis.

Hi,

Thank you for this discussion and the answers you already provided. However, I still have some questions about the spca_inh_patchseq_cell_2020_config.json file:

From what I understood, for each feature:

  • “n_components” is the number of components explaining more than a certain percentage of variance
  • “nonzero_component_list” indicates for each of the components how many variables influence it (?)
  • “use_corr” specifies if the data needs to be scaled to its standard deviation
  • “range” indicates for any cell for that particular feature what section of the vector should be used

I am not sure to understand the meaning of the values contained in the non-zero_component_list for a particular feature, and how you determine them.
For example, if I have a new feature I want to add in the analysis, I would

  1. Perform several PCA with increasing number of components and stop when the last component explain less than 1% of the variance (as an example). That would results in X components
  2. Then for each of the PC determined, count how many variables varies with it?

Is that right or am I completely wrong?

Also, if we take the example of spca_inh_patchseq_cell_2020_config.json:

"step_subthresh": {
        "n_components": 3,
        "nonzero_component_list": [],
        "use_corr": false,
        "range": null
    },

why does step_subthreshold indicates 3 components, and the list for non-zero_component_list is empty?

Sorry for the quantity of question, and thank you for all the help you can give me,
Best whites,
Julien