Reproducing r-score correlations in Allen Human Brain Atlas

Hello, I am a high school student working on a research project regarding co-expression. I want to reproduce the r-scores automatically generated when you find correlates on the Allen Human Brain Atlas for a certain probe, but can’t figure out how. When I download the data myself, I put it into Excel and use the =CORREL function, but the r-score is a bit off. For example, when comparing the A_23_P216779 probe for NTRK2 and the CUST_1359_PI416573500 probe for NTRK2, the atlas says the r-score is 0.947, but in Excel, I got 0.967605151.

Could anyone help me with this? How do I reproduce these r-scores? It’d be great if the help was a little bit simplified since I am in high school and new to data analysis.

Thanks!

Hi @jenniferhu04,

This question has come up before, prior to the start of the community forum, and I’ve copied the answer here:

The issue is that the defaults for viewing microarray data do not match the data used for calculating correlations. The correlations are calculated using the log2 intensity values at sample-by-sample resolution. To gain access to these data, you will need to do the following:

  1. Go to your link: Microarray Data :: Allen Brain Atlas: Human Brain
  2. Turn Filter Heatmap: On (located just below the heatmap)
  3. Set Resolution = Samples
  4. Change the Color Map to log2 intensity
  5. THEN click “download the data” and save these data
  6. When you calculate the correlations on these data, they should match the value listed on the website.
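
For anyone who prefers to check the number outside Excel, here is a minimal sketch of the same calculation in Python. It computes the Pearson correlation, which is the statistic Excel's CORREL returns. The intensity values are made up for illustration and are not taken from the atlas; in practice you would paste in the per-sample log2 intensities from your download. The second calculation only illustrates that the same samples on a different scale (here, un-logged values) give a different r.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, the same statistic Excel's CORREL returns."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    return cov / math.sqrt(var_x * var_y)

# Made-up log2 intensities for two probes across six samples (illustration only).
probe_a_log2 = [8.1, 9.4, 7.6, 10.2, 8.8, 9.9]
probe_b_log2 = [7.9, 9.0, 7.7, 10.5, 8.5, 9.6]

# Correlation on the log2 scale, one value per sample.
r_log2 = pearson_r(probe_a_log2, probe_b_log2)

# The same samples on the un-logged scale give a different r,
# which is why the scale of the downloaded data matters.
r_linear = pearson_r([2 ** v for v in probe_a_log2],
                     [2 ** v for v in probe_b_log2])

print(round(r_log2, 3), round(r_linear, 3))
```

If the two lists come from the same rows of the downloaded file, `pearson_r` should agree with =CORREL on the same cells.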

If this doesn’t solve your problem or you still have questions, please reply to this thread. Good luck in your research!

Best,
Jeremy

@jeremyinseattle Thank you so much for your help! I did download the data and got the correct r-score. I also have a question about potentially combining correlations between probes. For example, with the find correlates button, I can see that an F3 probe is correlated with an NTRK2 probe, but I want to calculate a gene to gene correlation by combining the r-scores of all F3 probes with all NTRK2 probes.

I was originally planning on taking averages, but I was wondering if this was scientifically accurate, and if not, what other way would I be able to accomplish this?

I look forward to your response!

Hi, @jenniferhu04. Deciding which probe(s) best represent each gene is something that we have spent a lot of time considering. You have a few choices.

  1. Use the probe with the highest average expression level. Usually this probe best represents the underlying gene expression
  2. We wrote a paper comparing gene expression from the Allen Human Brain Atlas using microarray and RNA-seq. Additional File 8 shows statistics for each probe, and probes with the lowest q-value best represent true gene expression values as measured with RNA-seq.
  3. If you want to aggregate probes using the average or another metric (or take the most highly expressed probe per gene, as mentioned in #1 above), you can do that easily with the collapseRows R function.

After calculating your gene expression matrix using one of the above three options, I would suggest then calculating your r-score between genes.
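
As a rough illustration of that order of operations (collapse probes to one value per gene first, then correlate), here is a small Python sketch. It is a simplified stand-in and not the collapseRows function itself. The probe names and intensities are hypothetical, and `best_probe_per_gene`, `mean_probe_per_gene`, and `pearson_r` are helper names invented for this example.

```python
import math
from statistics import mean

# Hypothetical log2 intensities: probe name -> (gene symbol, one value per sample).
probes = {
    "NTRK2_probe_1": ("NTRK2", [9.8, 10.4, 9.1, 11.0, 10.2]),
    "NTRK2_probe_2": ("NTRK2", [6.1, 6.0, 6.3, 6.2, 5.9]),
    "F3_probe_1": ("F3", [7.5, 8.3, 7.0, 8.9, 7.9]),
    "F3_probe_2": ("F3", [4.2, 4.4, 4.1, 4.3, 4.5]),
}

def best_probe_per_gene(probes):
    """Option 1: for each gene, keep the probe with the highest mean log2 intensity."""
    best = {}
    for name, (gene, values) in probes.items():
        if gene not in best or mean(values) > mean(probes[best[gene]][1]):
            best[gene] = name
    return best

def mean_probe_per_gene(probes):
    """Option 3 (averaging): average all probes of a gene, sample by sample."""
    grouped = {}
    for gene, values in probes.values():
        grouped.setdefault(gene, []).append(values)
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Collapse to one expression profile per gene, then correlate the genes.
chosen = best_probe_per_gene(probes)
best_expr = {gene: probes[name][1] for gene, name in chosen.items()}
r_best = pearson_r(best_expr["NTRK2"], best_expr["F3"])

averaged = mean_probe_per_gene(probes)
r_avg = pearson_r(averaged["NTRK2"], averaged["F3"])

print(chosen, round(r_best, 3), round(r_avg, 3))
```

Note that both variants produce a single gene-to-gene r from collapsed expression values; neither one averages probe-to-probe r-scores.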

Hi @jeremyinseattle , thank you for your response!

I am not quite sure what you mean by the highest average expression level. Is this taking the average of all log2 intensity values for a probe and choosing the greatest one? How would I find which probe that is, for example, in the 7 trk-B probes?

Also, for the collapseRows R function, would I use the log2 intensities or z-scores? And would this essentially be taking the average of each column, or is something more complex occurring?

I am not very familiar with these methods, so I apologize for having to ask repeatedly. Thank you for all your help!

Hi @jenniferhu04. Yes, this is what I meant: “taking the average of all log2 intensity values for a probe and choosing the greatest one”. If you are only interested in a single gene (or very few) like trk-B (see this link), you can also visualize it in the heatmap by changing the “Color Map” at the bottom of the heatmap to “log2 intensity” and seeing which probe has warmer colors. In this case, I think A_23, CUST_1359, or CUST_486 should all be fine. For the collapseRows function, there are several options, but they all start from the intensities and not the z-scores.

@jeremyinseattle I took a look at the table you linked in #2, but I couldn’t find any of the genes or probes I was interested in, which is very confusing. I tried searching for NTRK2 (trk-B), BDNF, as well as probe names, but nothing came up.

The information is in the table. I think by default something is selected in the spreadsheet, so you need to click anywhere to deselect it and then click the “Enable editing” button at the top of the screen. In any case, these are the results:
image

Hi @jeremyinseattle ,

It has been a while, but I had another question regarding my project. Is it possible to find gene correlates to genes in only a specific category, for example, Nervous System Development Category? Right now, when I try to find correlates for a gene in that category, I end up seeing all 58000 probes instead of only the 3000 probes in the category I need.

Thanks!

Hi @jenniferhu04, to the best of my knowledge there is no way to reorganize the genes selected in a category or to perform any calculation (e.g., finding correlates) on only a subset of the genes.

I would suggest downloading the data and working with it in R, Excel, or a similar tool to address this question, but if you don’t want to do this, you could do something like the following:

  1. Click on the Nervous System Development category. When you get to the microarray page, click “Download the data”. You will need more than one download for categories with >2000 probes. Once you download the data, open the CSV file with the probe/gene names to see which genes are in your category (you can ignore the expression matrices).
  2. Perform your desired correlation analysis separately. Download these data to find the top however many genes you want (again, if it’s more than 2000, you’ll need multiple downloads).
  3. Filter the data from step 2 to only include genes/probes from step 1 in Excel or R or something.
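
If you do the filtering in code instead of Excel, step 3 could look roughly like this Python sketch. The two small in-memory tables stand in for the files from steps 1 and 2, and the column names (`probe_name`, `gene_symbol`, `r`) and probe names are assumptions made for the example; check the headers in your own downloads and adjust them.

```python
import csv
import io

# Stand-in for the probe list downloaded for the category (step 1).
# Column names and contents are hypothetical.
category_csv = io.StringIO(
    "probe_name,gene_symbol\n"
    "probe_A,GENE1\n"
    "probe_B,GENE2\n"
    "probe_C,GENE3\n"
)

# Stand-in for the "find correlates" download (step 2), also hypothetical.
correlates_csv = io.StringIO(
    "probe_name,gene_symbol,r\n"
    "probe_X,GENE9,0.91\n"
    "probe_B,GENE2,0.88\n"
    "probe_Y,GENE8,0.80\n"
    "probe_A,GENE1,0.93\n"
)

# Collect the probe names that belong to the category.
category_probes = {row["probe_name"] for row in csv.DictReader(category_csv)}

# Keep only the correlates whose probe is in the category.
filtered = [
    row for row in csv.DictReader(correlates_csv)
    if row["probe_name"] in category_probes
]

# Sort the remaining correlates from highest to lowest r.
filtered.sort(key=lambda row: float(row["r"]), reverse=True)

for row in filtered:
    print(row["probe_name"], row["gene_symbol"], row["r"])
```

With real files you would replace each `io.StringIO(...)` with `open("your_file.csv", newline="")`, and combine the rows from multiple downloads before filtering.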

Maybe someone else on the forum has a better solution, but in the meantime hopefully this helps.