Strange pattern in copy number plots

Hello!
I have been working with the transcriptomic data in the ABC atlas and noticed something interesting: when using the log-transformed data and plotting the copy number of one transcript against another in a handful of cells, strict linear relationships seem to appear in the midst of the larger data, even when the broader correlation is negative. What explains these small trends?

This is very odd indeed! Which dataset is this in specifically?

The data are the Chromium 10x v2 log-transformed counts, with cells filtered by subclass “MB Dopa”, and I am plotting the counts of each transcript per cell. Notably, this pattern shows up in most plots regardless of which gene is on the y-axis.

Ok I’m stumped, @zizheny any ideas?

This likely has something to do with the normalization. Each cell is normalized to counts per million (CPM) and then log-scaled. This means that, for each cell independently, every gene's count is multiplied by a single value such that the total reads for that cell sum to 1 million. So within a cell there is only a discrete set of possible values, which is the same for every gene in that cell (both before and after log-scaling) but which differs between cells. This probably explains why these diagonal stripes have what looks like a slope of 1: when the raw counts for both genes in a cell increment by 1, the point goes equally higher on both axes. All that said, I’m still surprised to see these stripes at all, and I’m also just making an educated guess. (And maybe Zizhen has a better answer…).
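
To make the normalization concrete, here is a minimal sketch of log2(CPM + 1) on a toy matrix. The function names and the toy counts are made up for illustration, and this is not the atlas pipeline itself, just the arithmetic described above.

```python
import numpy as np

def cpm(counts):
    """Scale each cell (row) so that its counts sum to 1 million."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # total reads per cell
    return counts / totals * 1e6

def log2_cpm(counts):
    """log2(CPM + 1), applied cell by cell."""
    return np.log2(cpm(counts) + 1)

# Hypothetical cells x genes matrix of raw integer counts.
toy_counts = np.array([
    [1, 2, 997],    # total 1000: one raw count is worth 1000 CPM
    [1, 2, 1997],   # total 2000: one raw count is worth 500 CPM
])

toy_cpm = cpm(toy_counts)
toy_log = log2_cpm(toy_counts)
print(toy_cpm[:, :2])
print(toy_log[:, :2])
```

Both toy cells have the same raw counts for the first two genes, but they land at different CPM values because each cell has its own scale factor. Within a cell, every gene can only take values that are integer multiples of that factor.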

Yeah, it’s because the original data are integer counts, so any given pair of genes can only exist in integer ratios compared to each other (which then get transformed by the normalization). Some ratios are more common than others, so they appear as the “lines” in these graphs.

You can see it more easily if you transform from log2(CPM + 1) back to CPM. Here is (roughly) the same data shown in just CPM. The right-hand plot overlays lines for some common fixed integer ratios (from 1:1 to 5:1), and you can see that these account for many of the most prominent lines.

Of course if you look at the raw values you can see how only certain pairs are possible (the lower graph just adjusts the scale of the y-axis):

So the lines are produced because you have some common ratios of values in the data, which then get “smeared” into a line by the CPM normalization: a given location in the raw-count graphs becomes a line of points in the CPM graphs, with a slope equal to the original ratio. The log2(CPM + 1) transformation then shifts things so that the lines no longer obviously go through the origin.
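
Here is a small simulation of that “smearing”, as a sketch rather than a reproduction of the plots above. The library sizes, the count range, and the 2:1 ratio are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 1000

# Hypothetical total reads per cell (library sizes).
totals = rng.integers(2000, 20000, size=n_cells)

# Raw counts for two genes held at a fixed 2:1 integer ratio.
ratio = 2
raw_x = rng.integers(1, 6, size=n_cells)  # raw counts 1..5
raw_y = ratio * raw_x

# CPM: each cell gets its own scale factor, so the few discrete
# raw (x, y) pairs spread out along the line y = ratio * x.
cpm_x = raw_x / totals * 1e6
cpm_y = raw_y / totals * 1e6

# log2(CPM + 1): the line becomes roughly y = x + log2(ratio),
# i.e. slope close to 1 with a vertical offset.
log_x = np.log2(cpm_x + 1)
log_y = np.log2(cpm_y + 1)
offset = log_y - log_x

print(len(np.unique(raw_x)), "distinct raw x values")
print(len(np.unique(cpm_x)), "distinct CPM x values")
print(offset.min(), offset.max(), "vs log2(ratio) =", np.log2(ratio))
```

In the raw counts there are at most five distinct points, while in CPM the same cells are spread along a single line through the origin. After the log transform the offset between the two axes stays close to log2(ratio), which is why the stripes look like slope-1 diagonals that miss the origin.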

I’m not sure if this makes it easier to understand, but here I am coloring in red all the cells in which the Vamp2 counts are exactly twice the Slc17a6 counts, shown in raw counts (left), CPM (middle), and log2(CPM + 1) (right), so you can see how sets of discrete pairs of values on the left become lines in the middle and right.
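
If you want to try the same highlighting yourself, a minimal sketch of picking out those cells could look like this. The helper name and the toy counts are hypothetical, and excluding cells where both counts are zero is my own assumption.

```python
import numpy as np

def fixed_ratio_mask(y_counts, x_counts, ratio=2):
    """True for cells whose raw y count is exactly ratio * raw x count.

    Cells with x == 0 are excluded (an assumption), since 0 == ratio * 0
    would otherwise match every cell that expresses neither gene.
    """
    y = np.asarray(y_counts)
    x = np.asarray(x_counts)
    return (x > 0) & (y == ratio * x)

# Hypothetical raw counts for five cells.
vamp2 = np.array([4, 3, 0, 10, 6])
slc17a6 = np.array([2, 3, 0, 5, 1])

mask = fixed_ratio_mask(vamp2, slc17a6, ratio=2)
print(mask)
```

The mask is computed once on the raw counts and can then be used to color the same cells in the CPM and log2(CPM + 1) plots, since normalization does not change which cell is which.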

Very helpful, thank you for the info!