Transcriptomics (RNA-seq/microarray) data normalization - FAQ

We get many questions about how we normalized data for a given atlas, and why we chose the method we chose. For any atlas, the best place to look to find out what normalization we used is in the “Documentation” tab. For example in the Human Brain Atlas, there is an entire white paper about Microarray Data Normalization". In other Atlases, there are sections about normalization in a more general white paper. If you’d like more information about why we chose that normalization method, that is often also found either in the same white paper or in the relevant "Primary Publication Citation"for the atlas. In many cases, we tried several normalization strategies, and the one presented was the one that provided the most reliable biological insights in the data.

As a general rule, we have different goals of normalization for different data modalities:

  • Microarray data - For most microarray data sets, normalization is designed to remove sources of unwanted variability (e.g., batch effects, RNA quality) without removing sources of desired biological variability (e.g., brain structure).
  • RNA-seq (bulk) - In this case the goal of normalization is to account for differences in sequencing depth as well as gene length, which is typically done using TPM or FPKM explained clearly in this blog. If needed additional normalization for unwanted variability is also applied.
  • Single cell/nucleus RNA-seq - We don’t see many of the quality issues in single cell/nucleus data that we find with bulk tissue, and therefore in most cases we only control for sequencing depth (e.g., by taking CPM, or counts per million). Much of the information is found in reads mapping to introns, and so we report these along with reads mapping to exons as well.

Please post any related questions or comments as responses and we can address them!

2 Likes

Hello! Being new to the site and while using the Allen Human Brain Atlas or more specifically the Differential Search of the Human Microarrays section (performing a search between two different brain areas ), I noticed that there is one singular fold change value, which appears near the heat map for every single gene and I was wondering how this specific fold change value has been estimated, especially in reference with the various expression values that are generated for the same search (and gene) once we decide to download the data in a .csv format. The same thing happens with the p-value as well. When it comes to the expression values of the .csv files on the other hand, what do they stand for and how are they connected to the heat map or the expression z-score and log2 intensity localized in the gene info section of the map in question? Looking forward to your response.

Hi @fanikonst,
This is a great question, and one which has caused some confusion for many people. Broadly, in the Allen Human Brain Atlas, the fold change value represents the average log2(intensity) of all samples in the Target Structure minus the average values in the Contrast Structure. The reason this is confusing is because the two bolded values are not the defaults. By default, z-score values are shown and samples are wrapped up by structure. To make these adjustments, you’ll need to toggle the “Resolution” and “Color Map” options to the bottom right of the heatmap. It is worth noting that when you click on “Download this data”, you will download whatever values are currently being shown in the heatmap (so if you are looking at z scores, you will download z scores). You can download all the data on the download page.

I’m not sure what statistic is used to calculate the P-value, but likely it is calculated on raw data rather than Z scores (Does anyone else know?). Finally, the z-scores shown in the app are calcluated as z-score(log2(fpkm_normalized)) where the population for the z-score is all samples for the given gene (as above). For atlases with RNA-seq data, we use log2(fpkm_normalized +1) instead.

Here is an example differential search where the downloaded data would produce the expected fold changes. If you have any more questions, please post again!

1 Like

Thank you very much for your kind reply! Very helpful indeed! Actually, I would still have another question regarding specifically the log2(intensity) values. Is it an absolute or a relative value? Meaning, if we want to study the expression profile of a specific gene in a certain bran area, can this value be used as it is after having downloaded the .csv data? Has it been calculated compared to a housekeeping gene for example? Or is it relative to something else? Thank you again for your time.

Microarray data should always be considered relative, regardless of normalization. For example, if a gene’s log2(intensity) is 6 in cerebellum and 5 in cortex, you can believe that it is twice as highly expressed in cerebellum, but you should not use the value of 6 in comparison with any other gene. Another example is that multiple probes for the same gene often have different log2(intensities). That said, we have published a paper on how to make the microarray data from the HBA more accurate on an absolute scale if this is important.

1 Like

I’m sorry for the delay in my response and I thank you again for all the new info, which I am still trying to put in order. I tried to calculate the fold change according to your instructions (contained in your very first reply) and by opening the example of differential search that you suggested but still, I didn’t manage to obtain the exact fold change values reported in the table next to the heatmap, as well as in the downloaded probes–.csv file. I did manage to extract similar values or close to the ones expected, but they were clearly different although i did download the data, after having changed the color map setting of the heatmap, and by doing so, in other words, the scale of my results from the default z-score values to the log2 intensity ones. I made sure to apply the formula suggested by you (considering that my .csv values values at that point were all expressed in log2 intensity), meaning: fc = average log2(intensity)of all samples (with the heatmap resolution set to “Samples”) in the Target Structure minus the average log2(intensity) of all samples in the Contrast Structure. Any idea what may have gone wrong? Thank you again in advance

Here is an example:

  1. Go to this page. It has genes in DG vs. CA1 of hippocampus for one donor. Set samples and color bar as above.
  2. Download the data for the top gene (TNNT2; fold change=473.99)
  3. Open the “Expression.csv” in the zip file. The first five samples are DG, the last 5 are DG.
  4. log2(FC) = average(DG)-average(CA1) = average(10.1275, 12.9943, 12.6905, 11.4518, 12.391)-average(2.5524, 1.5237, 3.9465, 3.8081, 3.3809) = 8.8887
  5. FC = 2^(log2(FC)) = 2^8.8887 = 473.99

Thanks again! I did manage to resolve the issue in the end. It turns out that there was a constant but slight modification, not easily detectable, of some of my data in the downloaded .csv file every time I downloaded this specific type of file, which altered, as a consequence, the final result. I would advise everyone to pay very close attention to their excel file, because the way the data are saved depends up to some level on the settings and your personal pc or laptop OS.

Hello. I have been following this specific topic and I would also have a question/observation, that is slightly different, but it concerns nevertheless the microarray section of the AHBA. It’s been a while, but if I remember correctly, amongst the log2 (intensity) values of the downloaded excel data files (and no matter the type of differential search) there are absolutely no negative values for the down-regulated genes in specific brain areas. As a consequence, the fold change, considering that it has already been mentioned in this section of the forum, never ends up being inferior to 1. Is it possible that we don’t have the down-expression values? Because it seems highly unlikely. In fact, on the other hand, the z-score expression data do change from negative (when down-expressed) to positive (when up-expressed). Just some food for thought.

Hi @irisoliver, this is a useful comment because it gets at the difference between absolute and differential expression. In all of the microarray and RNA-seq data sets, the downloadable files have some measure of absolute expression (e.g., the amount of a particular transcript measured in a sample). When we say “down-regulation” we are more precisely stating that there is less measured absolute expression in sample 1 vs. sample 2. In other words, sample 1 will have a lower positive value that sample 2. Z-scores show the number of standard deviations a particular value is above or below the mean, and therefore roughly half of the values will be negative. My personal opinion is that z-scores can be misleading (e.g., 0 transcripts can have different negative values for different genes); however, they have a major advantage of showing genes with similar patterning across the brain but highly different absolute expression levels at roughly the same scale.

Sure, the fact that the downloadable log2 intensity expression values included in the Allen database were absolute values, and therefore always positive, was what i hypothesized as well in the end. But having gone back to the site since my first post, the question still remains regarding why, although positive for the above reason, when it comes to the lowest absolute values amongst them, meaning the ones that are considered to be down-expressed or less expressed if we will, in a specific brain area compared to the other one, they are never inferior to 1.4 or 1.3 or 1.2 for example. Why can’t we see values that vary between 0 and 1? They still would be positive. And more importantly, as a consequence, the fold change, which represents a relative ratio of gene expression between two different areas , never ends up being inferior to 1 as well, while we know that it could vary between 0 and 1 for down-expressed genes and between 1 and infinity for up-regulated ones. If it is equal to 1, structure A (in terms of log2 expression) and structure B will be equal with each other. If fc >1, then A>B. But what we may be missing is option number 3 and therefore what happens when B>A.

If I am understanding the question, I think it as a matter of website design. The website only shows genes the are overexpressed in the target vs the contrast structure. To see genes higher in the contrast structure, you need to click the double circle arrow icon to the right of the target structure, which switches the target and contrast and is equivalent to finding downregulated genes.

2 Likes

Hi. I’m using the Human Brain atlas and was wondering what is the best way to find an average expression value of a certain tissue. Each probe can give varying expression levels for a single structure, and I wanted to know what is the best practice to find an average value among them. Thanks!

I’m not sure about best practice, but this is what I would do. For a given probe, you can get roll-ups by structure and by coarse brain region selecting different resolutions at the bottom of the screen in the heatmap (after which you can download that data):
image
If you are looking for a different resolution, or if you want to do it from the downloaded files, you could download the data and then take the average expression across all samples within your structure of interest.

If you want to choose the best probe for a given gene, I have a paper on different ways to do this (and an associated R function). If you don’t want to read the paper, usually you can just take the probe with the highest average expression.

1 Like

It is not clear to me if log2 expression for a given probe AND anatomical region does refer to ALL the donors (when we select all the donors) or it is calculated within each donor’s brain. Same question for Z-score. I am referring to the gene search function.

If I’m understanding correctly, the log2 expression is an absolute metric so the answer would be the same regardless of which samples are selected. As for the z-score, these are calculated separately for each donor. You can see this by looking at gene expression for XIST, which is on the X-chromsome and dramatically higher in females than males. You can see that all donors have both negative and positive Z score values, even though the log2 expression in the female donor is much higher.

1 Like

I see. So, in gene search mode, the Z-score of probe A in region B of donor C is calculated using the mean and the standard deviation for probe A in the whole brain of donor C. The other brains are not considered. Is it correct?

That is correct. There is a bit more detail about it in this post.

1 Like