Question on PatchSeq jupyter example

Straightforward q: in the example jupyter notebook from the Multimodal Characterization in Mouse Visual Cortex where is the select_markers.csv file or do you have an example of information needed to create my own similar file?

import umap
marker_genes_for_umap = pd.read_csv(“select_markers.csv”, index_col=0)
embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[marker_genes_for_umap[“Gene”], :].values.T + 1)
)
Thanks!

The select_markers.csv file is a list of differentially expressed genes taken from the Tasic et al. (2018) paper. It was calculated by combining the top 50 differentially expressed genes from each pair of clusters in that data set, which resulted in a list of 4,020 genes.

After a bit of searching, I wasn’t able to find it in one of the public data or code repositories - maybe @yzizhen knows if it’s available somewhere? But it was calculated using the scrattch.hicat software package - specifically with the select_markers() function. You might be interested in taking a look at a vignette overview of scrattch.hicat for more information.

It is just a list of genes, though, so if you have your own list of genes of interest for another data set, you could use that instead. Or you can perform the UMAP on all of the genes in the data set, if you’re just trying to work through the notebook (or take a random sample of ~4,000 genes, so the UMAP finishes faster).

Ah ok, thank you, so looking at the scrattch.hicat code, it returns a csv list of gene names in standard NCBI naming convention in the first column and a statistic (?) in the second column with column header named ‘Gene’? <-return(list(markers=markers, de.genes=de.genes[select.pairs])
With this dataset, I would like to look at patched cell pairs. I would like to see for a given cell that was stimulated, the response of the other patched cells and the expression for a list of genes within these cells, and their MET type. Is this possible?
Many thanks for your reply!

Hi again,
I have tried to run it on all the genes and it does not really cluster and then there are matrix errors down the line:
plt.figure(figsize=(8, 8))
plt.scatter(
embedding.T,
s=1,
c=gene_data.loc[“Calb2”, :].values,
vmin=0,
vmax=5e3,
cmap=“viridis”,
edgecolor=“none”
)
ValueError: ‘c’ argument has 4435 elements, which is inconsistent with ‘x’ and ‘y’ with size 45768.

And so I cannot finish the analysis.
I tried to make a file a few different ways, a simple csv list with column header ‘Gene’, a gene lists including quoted “”, and transposed into row version. I keep getting key errors.

markerGenes.head(3)
Gene
Lhx6
Myo5b
Pdlim3
embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[markerGenes[‘Gene’]].values.T + 1)
)
KeyError: ‘Gene’

And not specifying a header

embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[markerGenes[0]].values.T + 1)
)
KeyError: 0

I tried to follow the scratch.hicat demo in R and my script fails on:
cl.clean= cl.droplevels() with the error:
Error in UseMethod(“droplevels”) :
no applicable method for ‘droplevels’ applied to an object of class “c(‘double’, ‘numeric’)”

Head(cl)
Gad2_tdTpositive_cell_18 Gad2_tdTpositive_cell_25 Gad2_tdTpositive_cell_26
5 5 6
Gad2_tdTpositive_cell_40 Gad2_tdTpositive_cell_48 d2_tdTpositive_cell_49
6 6 5

Tried to get around this by just keeping outliers
cl.clean=cl
select.markers = select_markers(norm.dat, cl.clean, de.genes=de.genes,n.markers=50)$markers

head(select.markers)
[1] “Lhx6” “Myo5b” “Pdlim3” “Cdca7” “Odz3” “Plxdc2”

But this file does not work in jupyter either, so I thought need to continue until the end of the script to get the proper output but then cl.df errors

rank ← setNames(1:nrow(cl.df), row.names(cl.df))
Error in row.names(cl.df) : object ‘cl.df’ not found

I tried
rank ← setNames(1:nrow(ref.cl.df), row.names(ref.cl.df))

As I thought that might be what cl.df refers to

Error in cor(cl.dat) : ‘x’ has a zero dimension.
In addition: Warning message:
package ‘dplyr’ was built under R version 4.0.5

dend ← dend.result$dend
Error: object ‘dend.result’ not found
cl.clean ← setNames(factor(as.character(cl.clean), levels = labels(dend)), names(cl.clean))
Error in labels(dend) : object ‘dend’ not found

the tSNE clustering works but the Dendrograms fail and thus cannot finish the script.
Do you have an idea why my srattch.hicat is failing? What should the cl.df look like?

Do you think you could post a few lines of the select markers data to jupyter so I can then make a file of gene names?

Finally, I had hoped to have cell pairs of cells that were patched in the same experiment. And to identify which cell types they are and what their response was to stimulation in one of the cells in the clusters patched. From there I would like to look at the gene expression in the paired cells separated by MET type. Do you think this is possible with this data? Were the patched cells in the same set of patchings of different cell type? Is there a way to tell in the data which cells were patched in the same experiment?

Many thanks in advance for your reply, I am really keen to use this data :slightly_smiling_face:

With this dataset, I would like to look at patched cell pairs. I would like to see for a given cell that was stimulated, the response of the other patched cells and the expression for a list of genes within these cells, and their MET type. Is this possible?

I’ll respond to the other post, as well, but I just wanted to clear up that our published Patch-seq data set is just from individual cells recorded with synaptic blockers present, so there aren’t any connected pairs of cell in this data set. We don’t have gene expression data for cells in the synaptic physiology data set, although that is something we are interested in doing in the future.

This suggests to me that the UMAP embedding might have been done on a transposed gene by cell matrix - there are 45,768 genes in the matrix and 4,435 elements, so your UMAP result should have 4,435 data points, not 45,768. That might also be why the clusters look off?

I think there might be a missing part of that vignette for defining cl. I was able to get a list of marker genes, though, just using the reference clustering. Here’s the code that produced it:

library(tasic2016data)
library(dendextend)
library(matrixStats)
library(Matrix)
library(scrattch.hicat)

# Load sample annotations (anno)
anno <- tasic_2016_anno

# Make a data.frame of unique cluster id, type, color, and broad type
ref.cl.df <- as.data.frame(unique(anno[,c("primary_type_id", "primary_type_label", "primary_type_color", "broad_type")]))

# Standardize cluster annoation with cluster_id, cluster_label and cluster_color. These are the required fields to visualize clusters properly.
colnames(ref.cl.df)[1:3] <- c("cluster_id", "cluster_label", "cluster_color")

# Sort by cluster_id
ref.cl.df <- ref.cl.df[order(ref.cl.df$cluster_id),]
row.names(ref.cl.df) <- ref.cl.df$cluster_id

ref.cl <- setNames(factor(anno$primary_type_id), anno$sample_name)

norm.dat <- log2(cpm(tasic_2016_counts)+1)

# Here's where I'm just using the reference clusters to find marker genes
select.markers <- select_markers(norm.dat, ref.cl, n.markers=50)$markers

Then you get a list of genes like this:

> select.markers[1:10]
 [1] "Nrn1"          "Slc17a7"       "Runx1t1"       "Neurod6"      
 [5] "Baiap2"        "Pcp4"          "Satb2"         "Ptn"          
 [9] "Lingo1"        "3110035E14Rik"

When I ran it, I got 2,210 genes from the Tasic 2016 data set, so you can then save that to a CSV in the format shown below as follows:
> write.csv(select.markers, "select_markers.csv")

Sure, the file is (as you suspected), just a list of gene names - here’s the first few lines:

"","Gene"
"1","Scrg1"
"2","Nxph1"
"3","Krt73"
"4","Tmem45a"
"5","Piezo2"
"6","Pax6"
"7","Frmd7"
"8","Luzp2"
"9","Frem1"

Note that there is an index column, which I’m accounting for when I load the file by:
marker_genes_for_umap = pd.read_csv("select_markers.csv", index_col=0)
If your file doesn’t have that column, and you are still using the index_col=0 option, maybe that’s why there isn’t a “Gene” column in your data frame?

As I mentioned in the previous response, for the Patch-seq data set, we didn’t patch cells at the same time or measure synaptic connections (and in fact used synaptic blockers), so that kind of analysis isn’t possible with this data set.