Strange pattern in copy number plots

alwaysnow · September 12, 2024, 4:24am

Hello!
I have been working with the transcriptomic data in the ABC atlas and noticed something interesting: When using the log-transformed data and plotting the copy number of one transcript against another in a handful of cells, strict linear relationships seem to appear in the midst of the larger data, even when the broader correlation is a negative one. What explains these small trends?

tmchartrand · October 2, 2024, 4:54pm

This is very odd indeed! Which dataset is this in specifically?

alwaysnow · October 5, 2024, 4:30am

The data are the Chromium 10x v2 log-transformed counts. I filtered cells by subclass “MB Dopa” and plotted the counts of each transcript per cell. Notably, this pattern shows up in most plots regardless of which gene is on the y-axis.

tmchartrand · October 7, 2024, 8:49pm

Ok I’m stumped, @zizheny any ideas?

jeremyinseattle · October 7, 2024, 11:31pm

This likely has something to do with the normalization. Each cell is normalized as counts per million and then log-scaled. This means that, for each cell independently, every count in that cell is multiplied by a single value such that the total reads for that cell is 1 million. So for every cell there is only a discrete set of possible values for each gene, which is the same for every gene within a cell (both before and after log-scaling) but differs between cells. This probably explains why these diagonal stripes have what looks like a slope of 1: as the values within a cell for both genes increment by 1, the point goes equally higher on both axes. All that said, I’m still surprised to see these stripes at all, and I’m also just making an educated guess. (And maybe Zizhen has a better answer…).
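As a rough illustration of the normalization described above, here is a minimal sketch in Python. It assumes the transform is log2(CPM + 1), as mentioned later in this thread. The toy counts and the function names `cpm` and `log2_cpm` are made up for the example and are not taken from the ABC atlas code.

```python
import numpy as np

def cpm(counts):
    """Scale each cell (row) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * 1e6

def log2_cpm(counts):
    """log2(CPM + 1), the transform assumed in this sketch."""
    return np.log2(cpm(counts) + 1.0)

# Toy data (hypothetical): 2 cells x 3 genes, same raw counts for the
# first two genes but different sequencing depth.
counts = np.array([
    [3, 1, 1996],   # total reads = 2000
    [3, 1, 4996],   # total reads = 5000
])

# One raw count is worth 1e6 / total CPM, so each cell has its own grid
# of possible values, shared by every gene in that cell.
step = 1e6 / counts.sum(axis=1)
values = cpm(counts)
logged = log2_cpm(counts)

print(step)            # CPM increment per raw count, per cell
print(values[:, :2])   # CPM of the first two genes in each cell
```

The two cells have identical raw counts for the first two genes but land on different CPM values, because the per-cell step size depends only on that cell's total.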

gouwens · October 9, 2024, 5:00pm

Yeah, it’s because the original data are integer counts, so any given pair of genes can only exist in integer ratios compared to each other (which then get transformed by the normalization). Some ratios are more common than others, so they appear as the “lines” in these graphs.

You can see it more easily if you transform from log2(CPM + 1) back to CPM. Here is (roughly) the same data shown in just CPM. The right panel adds lines for some common fixed integer ratios (from 1:1 to 5:1), and you can see that they account for many of the most prominent lines.

Of course if you look at the raw values you can see how only certain pairs are possible (the lower graph just adjusts the scale of the y-axis):

So the lines are produced because you have some common ratios of values in the data, which then get “smeared” into a line by the CPM normalization: a given location in the raw-count graphs becomes a line of points in the CPM graphs, with a slope equal to the original ratio. The log2(CPM + 1) transformation then shifts things so the lines no longer obviously go through the origin.
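The “smearing” can be reproduced with simulated data. The sketch below is only an illustration under stated assumptions: the per-cell totals are drawn at random (hypothetical depths, not real atlas values), and every simulated cell is forced to have an exact 2:1 ratio between two made-up genes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Assumption: per-cell total reads vary widely (hypothetical range).
totals = rng.integers(2_000, 20_000, size=n_cells)

# Raw integer counts with an exact 2:1 ratio in every cell.
gene_x = rng.integers(1, 6, size=n_cells)   # 1..5
gene_y = 2 * gene_x

# CPM: each cell is multiplied by its own scale factor.
scale = 1e6 / totals
cpm_x = gene_x * scale
cpm_y = gene_y * scale

# log2(CPM + 1)
log_x = np.log2(cpm_x + 1)
log_y = np.log2(cpm_y + 1)

# Raw space: only a handful of distinct (x, y) pairs exist.
n_raw_pairs = len(set(zip(gene_x.tolist(), gene_y.tolist())))

# CPM space: the points spread along the line y = 2x through the origin.
# Log space: y - x = log2((2*cpm_x + 1) / (cpm_x + 1)), which is just
# under log2(2) = 1 once CPM is much larger than 1.
offset = log_y - log_x

print(n_raw_pairs, offset.min(), offset.max())
```

In log space the fixed ratio therefore shows up as a stripe with a slope close to 1 that is shifted by about log2 of the ratio, rather than as a line through the origin.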

gouwens · October 9, 2024, 5:12pm

I’m not sure if this makes it easier to understand, but here I am coloring in red all the cells in which the Vamp2 counts are exactly twice the Slc17a6 counts, in each of raw counts (left), CPM (middle), and log2(CPM + 1) (right), so you can see how sets of discrete pairs of values on the left become lines in the middle and right.
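For anyone wanting to try the same highlighting, a selection like this can be built as a boolean mask on the raw counts and then reused for the transformed values. This is a sketch with made-up numbers, not the code used for the figures. The helper name `exact_ratio_mask` is hypothetical, and excluding cells where both counts are zero is a choice made here, not something stated in the post.

```python
import numpy as np

def exact_ratio_mask(num_counts, den_counts, ratio=2):
    """True where num == ratio * den on raw integer counts.

    Cells with a zero denominator are excluded (an assumption made
    here, since 0 == 2 * 0 would otherwise match trivially).
    """
    num = np.asarray(num_counts)
    den = np.asarray(den_counts)
    return (den > 0) & (num == ratio * den)

# Toy raw counts for six cells (hypothetical values).
vamp2   = np.array([4, 3, 2, 0, 10, 7])
slc17a6 = np.array([2, 3, 1, 0, 5, 3])
totals  = np.array([4000, 5000, 8000, 2500, 10000, 5000])

mask = exact_ratio_mask(vamp2, slc17a6)

# The mask is computed once on raw counts and applied to every view.
cpm_vamp2 = vamp2 / totals * 1e6
cpm_slc   = slc17a6 / totals * 1e6
log_vamp2 = np.log2(cpm_vamp2 + 1)
log_slc   = np.log2(cpm_slc + 1)

print(mask)
print(cpm_vamp2[mask], cpm_slc[mask])
```

The masked arrays can then be passed to any plotting library as a second, red scatter layer on top of the full data.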

alwaysnow · October 9, 2024, 6:41pm

Very helpful, thank you for the info!
