Hi @GraceRosen ,
Happy new year!
The avg_correlation statistic is a single value for each (cell, cell type) pair. It is calculated as follows:
At each bootstrapping iteration, the cell is represented by a vector of expression values for a few hundred genes:
v_cell = [cell_gene0, cell_gene1, cell_gene2, cell_gene3, ...]
where the genes in the vector are the subset of marker genes selected for that bootstrapping iteration. The values in the vector are the cell's expression values for those chosen genes.
Each cell type is similarly represented by a vector:
v_cell_type_N = [type_N_gene0, type_N_gene1, type_N_gene2, type_N_gene3, ...]
The genes are the same as in v_cell, but here the values are the average expression of that cell type in each of the chosen genes.
The correlation between a cell and a cell type is just the Pearson correlation coefficient:
corr = mean[(v_cell-mean[v_cell])*(v_cell_type_N-mean[v_cell_type_N])]/(std[v_cell]*std[v_cell_type_N])
where the means and standard deviations are taken over the chosen genes.
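In case it helps, here is a minimal Python sketch of that formula. It is only an illustration, not the MapMyCells code: the function name `pearson_corr` and the toy expression values are made up, and it assumes the standard deviation in the formula is the population standard deviation (dividing by the number of genes), which is what makes the expression equal to the usual Pearson coefficient.

```python
from statistics import fmean, pstdev


def pearson_corr(v_cell, v_cell_type):
    """Pearson correlation between two equal-length vectors.

    Means and (population) standard deviations are taken over the
    chosen genes, i.e. over the elements of the vectors.
    """
    if len(v_cell) != len(v_cell_type):
        raise ValueError("both vectors must cover the same genes")
    mean_cell = fmean(v_cell)
    mean_type = fmean(v_cell_type)
    # mean[(v_cell - mean[v_cell]) * (v_cell_type - mean[v_cell_type])]
    covariance = fmean(
        [(c - mean_cell) * (t - mean_type) for c, t in zip(v_cell, v_cell_type)]
    )
    return covariance / (pstdev(v_cell) * pstdev(v_cell_type))


# Toy values (hypothetical): the cell type profile is exactly 2x the cell,
# so the two vectors are perfectly correlated.
v_cell = [1.0, 2.0, 3.0, 4.0]
v_cell_type_N = [2.0, 4.0, 6.0, 8.0]
corr = pearson_corr(v_cell, v_cell_type_N)

# Reversing one vector makes them perfectly anti-correlated.
corr_reversed = pearson_corr(v_cell, list(reversed(v_cell_type_N)))
```

Note that this sketch raises a ZeroDivisionError if either vector is constant across the chosen genes, since its standard deviation is then zero.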
The avg_correlation value reported by MapMyCells is, as you say, the average of this correlation coefficient over the bootstrapping iterations that actually chose the assigned cell type. For example, if 52 of the 100 iterations chose cell_type_J, then avg_correlation is the average of those 52 corr values between the cell and cell_type_J.
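The averaging step can be sketched in the same spirit. Again, this is an illustration under assumptions rather than the actual implementation: the function name `avg_correlation_for`, the `(chosen_type, corr)` record layout, and the numbers are all hypothetical, and the assigned cell type is passed in rather than derived, since how it is selected is outside what this sketch covers.

```python
from statistics import fmean


def avg_correlation_for(assigned_type, iterations):
    """Average corr over the iterations that chose `assigned_type`.

    `iterations` is a list of (chosen_type, corr) pairs, one per
    bootstrapping iteration, where corr is the correlation between the
    cell and the type chosen in that iteration (hypothetical layout).
    """
    corrs = [corr for chosen, corr in iterations if chosen == assigned_type]
    if not corrs:
        raise ValueError("no iteration chose the assigned cell type")
    return fmean(corrs)


# Toy example (hypothetical values): 3 of 4 iterations chose cell_type_J,
# so only those 3 corr values enter the average.
iterations = [
    ("cell_type_J", 0.8),
    ("cell_type_K", 0.5),
    ("cell_type_J", 0.6),
    ("cell_type_J", 0.7),
]
avg_corr_J = avg_correlation_for("cell_type_J", iterations)
```

The iteration that chose cell_type_K contributes nothing to the value for cell_type_J, which is the point of the "iterations that actually chose the assigned cell type" wording.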
Let me know if anything is still unclear.