Question on PatchSeq jupyter example

Straightforward q: in the example jupyter notebook from the Multimodal Characterization in Mouse Visual Cortex where is the select_markers.csv file or do you have an example of information needed to create my own similar file?

import umap
marker_genes_for_umap = pd.read_csv(“select_markers.csv”, index_col=0)
embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[marker_genes_for_umap[“Gene”], :].values.T + 1)
)
Thanks!

The select_markers.csv file is a list of differentially expressed genes taken from the Tasic et al. (2018) paper. It was calculated by combining the top 50 differentially expressed genes from each pair of clusters in that data set, which resulted in a list of 4,020 genes.

After a bit of searching, I wasn’t able to find it in one of the public data or code repositories - maybe @yzizhen knows if it’s available somewhere? But it was calculated using the scrattch.hicat software package - specifically with the select_markers() function. You might be interested in taking a look at a vignette overview of scrattch.hicat for more information.

It is just a list of genes, though, so if you have your own list of genes of interest for another data set, you could use that instead. Or you can perform the UMAP on all of the genes in the data set, if you’re just trying to work through the notebook (or take a random sample of ~4,000 genes, so the UMAP finishes faster).

Ah ok, thank you, so looking at the scrattch.hicat code, it returns a csv list of gene names in standard NCBI naming convention in the first column and a statistic (?) in the second column with column header named ‘Gene’? <-return(list(markers=markers, de.genes=de.genes[select.pairs])
With this dataset, I would like to look at patched cell pairs. I would like to see for a given cell that was stimulated, the response of the other patched cells and the expression for a list of genes within these cells, and their MET type. Is this possible?
Many thanks for your reply!

Hi again,
I have tried to run it on all the genes and it does not really cluster and then there are matrix errors down the line:
plt.figure(figsize=(8, 8))
plt.scatter(
embedding.T,
s=1,
c=gene_data.loc[“Calb2”, :].values,
vmin=0,
vmax=5e3,
cmap=“viridis”,
edgecolor=“none”
)
ValueError: ‘c’ argument has 4435 elements, which is inconsistent with ‘x’ and ‘y’ with size 45768.

And so I cannot finish the analysis.
I tried to make a file a few different ways, a simple csv list with column header ‘Gene’, a gene lists including quoted “”, and transposed into row version. I keep getting key errors.

markerGenes.head(3)
Gene
Lhx6
Myo5b
Pdlim3
embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[markerGenes[‘Gene’]].values.T + 1)
)
KeyError: ‘Gene’

And not specifying a header

embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[markerGenes[0]].values.T + 1)
)
KeyError: 0

I tried to follow the scratch.hicat demo in R and my script fails on:
cl.clean= cl.droplevels() with the error:
Error in UseMethod(“droplevels”) :
no applicable method for ‘droplevels’ applied to an object of class “c(‘double’, ‘numeric’)”

Head(cl)
Gad2_tdTpositive_cell_18 Gad2_tdTpositive_cell_25 Gad2_tdTpositive_cell_26
5 5 6
Gad2_tdTpositive_cell_40 Gad2_tdTpositive_cell_48 d2_tdTpositive_cell_49
6 6 5

Tried to get around this by just keeping outliers
cl.clean=cl
select.markers = select_markers(norm.dat, cl.clean, de.genes=de.genes,n.markers=50)$markers

head(select.markers)
[1] “Lhx6” “Myo5b” “Pdlim3” “Cdca7” “Odz3” “Plxdc2”

But this file does not work in jupyter either, so I thought need to continue until the end of the script to get the proper output but then cl.df errors

rank ← setNames(1:nrow(cl.df), row.names(cl.df))
Error in row.names(cl.df) : object ‘cl.df’ not found

I tried
rank ← setNames(1:nrow(ref.cl.df), row.names(ref.cl.df))

As I thought that might be what cl.df refers to

Error in cor(cl.dat) : ‘x’ has a zero dimension.
In addition: Warning message:
package ‘dplyr’ was built under R version 4.0.5

dend ← dend.result$dend
Error: object ‘dend.result’ not found
cl.clean ← setNames(factor(as.character(cl.clean), levels = labels(dend)), names(cl.clean))
Error in labels(dend) : object ‘dend’ not found

the tSNE clustering works but the Dendrograms fail and thus cannot finish the script.
Do you have an idea why my srattch.hicat is failing? What should the cl.df look like?

Do you think you could post a few lines of the select markers data to jupyter so I can then make a file of gene names?

Finally, I had hoped to have cell pairs of cells that were patched in the same experiment. And to identify which cell types they are and what their response was to stimulation in one of the cells in the clusters patched. From there I would like to look at the gene expression in the paired cells separated by MET type. Do you think this is possible with this data? Were the patched cells in the same set of patchings of different cell type? Is there a way to tell in the data which cells were patched in the same experiment?

Many thanks in advance for your reply, I am really keen to use this data :slightly_smiling_face:

With this dataset, I would like to look at patched cell pairs. I would like to see for a given cell that was stimulated, the response of the other patched cells and the expression for a list of genes within these cells, and their MET type. Is this possible?

I’ll respond to the other post, as well, but I just wanted to clear up that our published Patch-seq data set is just from individual cells recorded with synaptic blockers present, so there aren’t any connected pairs of cell in this data set. We don’t have gene expression data for cells in the synaptic physiology data set, although that is something we are interested in doing in the future.

This suggests to me that the UMAP embedding might have been done on a transposed gene by cell matrix - there are 45,768 genes in the matrix and 4,435 elements, so your UMAP result should have 4,435 data points, not 45,768. That might also be why the clusters look off?

I think there might be a missing part of that vignette for defining cl. I was able to get a list of marker genes, though, just using the reference clustering. Here’s the code that produced it:

library(tasic2016data)
library(dendextend)
library(matrixStats)
library(Matrix)
library(scrattch.hicat)

# Load sample annotations (anno)
anno <- tasic_2016_anno

# Make a data.frame of unique cluster id, type, color, and broad type
ref.cl.df <- as.data.frame(unique(anno[,c("primary_type_id", "primary_type_label", "primary_type_color", "broad_type")]))

# Standardize cluster annoation with cluster_id, cluster_label and cluster_color. These are the required fields to visualize clusters properly.
colnames(ref.cl.df)[1:3] <- c("cluster_id", "cluster_label", "cluster_color")

# Sort by cluster_id
ref.cl.df <- ref.cl.df[order(ref.cl.df$cluster_id),]
row.names(ref.cl.df) <- ref.cl.df$cluster_id

ref.cl <- setNames(factor(anno$primary_type_id), anno$sample_name)

norm.dat <- log2(cpm(tasic_2016_counts)+1)

# Here's where I'm just using the reference clusters to find marker genes
select.markers <- select_markers(norm.dat, ref.cl, n.markers=50)$markers

Then you get a list of genes like this:

> select.markers[1:10]
 [1] "Nrn1"          "Slc17a7"       "Runx1t1"       "Neurod6"      
 [5] "Baiap2"        "Pcp4"          "Satb2"         "Ptn"          
 [9] "Lingo1"        "3110035E14Rik"

When I ran it, I got 2,210 genes from the Tasic 2016 data set, so you can then save that to a CSV in the format shown below as follows:
> write.csv(select.markers, "select_markers.csv")

Sure, the file is (as you suspected), just a list of gene names - here’s the first few lines:

"","Gene"
"1","Scrg1"
"2","Nxph1"
"3","Krt73"
"4","Tmem45a"
"5","Piezo2"
"6","Pax6"
"7","Frmd7"
"8","Luzp2"
"9","Frem1"

Note that there is an index column, which I’m accounting for when I load the file by:
marker_genes_for_umap = pd.read_csv("select_markers.csv", index_col=0)
If your file doesn’t have that column, and you are still using the index_col=0 option, maybe that’s why there isn’t a “Gene” column in your data frame?

As I mentioned in the previous response, for the Patch-seq data set, we didn’t patch cells at the same time or measure synaptic connections (and in fact used synaptic blockers), so that kind of analysis isn’t possible with this data set.

Hello!
Thank you so much for your replies! I hope you are still monitoring this question, I had a baby so did not have a chance to work on this for a while.
I am still having trouble with the input genelist to the clustering and I really dont know why.
My R output file from the Tasic example seems to work fine:
The data file looks like yours
“”,“Gene”
“1”,“Nrn1”
“2”,“Slc17a7”
“3”,“Runx1t1”
“4”,“Neurod6”
“5”,“Baiap2”

and jupyter reads it fine and outputs a table with .head() which seems to be appropriate

Gene
1 Nrn1
2 Slc17a7
3 Runx1t1

but it fails on
“ValueError: Input contains NaN, infinity or a value too large for dtype(‘float32’).” I tried Na.omit in R when writing the file in R and it does not make a difference but I just wonder should I be having so many problems on this? did you have to do any post processing to get it to work in jupyter? or did your file work straight away? It seems like such a silly roadblock and I have no idea what I am doing wrong but the clustering is the only thing not working all the other steps work well. I would really appreciate your help once again…
Many thanks! :slight_smile:

Congratulations!

Hmm, I wonder if this isn’t directly related to the gene file and might be something else, since (1) you said that it seems like it loads into Jupyter without a problem and (2) there isn’t any numerical information in that file, so I’m not sure why it would be referring to a float32. Could you post the complete error message?

My guess at the moment is that perhaps there’s a gene name in that list that’s not present in the Patch-seq data, and you’re getting a NaN returned when you try to get a missing gene value, but it’s hard to say without the full error.

Thank you! :slight_smile:

This was my code and error:
import umap.umap_ as umap

marker_genes_for_umap = pd.read_csv("/select_markers_smallClusterNaRNov.csv", index_col=0)

embedding = umap.UMAP(n_neighbors=25).fit_transform(
np.log2(gene_data.loc[marker_genes_for_umap[“Gene”], :].values.T + 1)
)
and the error is:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)


ValueError Traceback (most recent call last)
in
6
7 embedding = umap.UMAP(n_neighbors=25).fit_transform(
----> 8 np.log2(gene_data.loc[marker_genes_for_umap[“Gene”], :].values.T + 1)
9 )
10 #fails Nan ValueError: Input contains NaN, infinity or a value too large for dtype(‘float32’).

c:\users—\appdata\local\programs\python\python37\lib\site-packages\umap\umap_.py in fit_transform(self, X, y)
2632 Local radii of data points in the embedding (log-transformed).
2633 “”"
→ 2634 self.fit(X, y)
2635 if self.transform_mode == “embedding”:
2636 if self.output_dens:

c:\users—\appdata\local\programs\python\python37\lib\site-packages\umap\umap_.py in fit(self, X, y)
2140 “”"
2141
→ 2142 X = check_array(X, dtype=np.float32, accept_sparse=“csr”, order=“C”)
2143 self._raw_data = X
2144

c:\users—\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
—> 63 return f(*args, **kwargs)
64
65 # extra_args > 0

c:\users—\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
662 if force_all_finite:
663 _assert_all_finite(array,
→ 664 allow_nan=force_all_finite == ‘allow-nan’)
665
666 if ensure_min_samples > 0:

c:\users—\appdata\local\programs\python\python37\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
104 msg_err.format
105 (type_err,
→ 106 msg_dtype if msg_dtype is not None else X.dtype)
107 )
108 # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype(‘float32’).

I am just wondering why your Tasic R output genelist file would work in jupyter and mine wouldnt? if it is a matter of a genename error? Has anyone else had problems that you know of? Many thanks :slight_smile:

For the example notebook that I went through, I didn’t actually use the select genes file directly from that Tasic 2016 example (I used another file I already had), so I think that’s why there’s a discrepancy. But it does look like the issue is that there are genes in the marker file that aren’t in the Patch-seq data. That’s what the warning about “Passing list-likes to .loc or [] with any missing label” is about. It’s returning a NaN value for the missing labels, and the UMAP library can’t handle that.

You should be able to get just the ones that are actually in gene_data by something like:

marker_genes_for_umap = pd.read_csv("/select_markers_smallClusterNaRNov.csv", index_col=0)

shared_genes = gene_data.index.intersection(marker_genes_for_umap["Gene"])

embedding = umap.UMAP(n_neighbors=25).fit_transform(
    np.log2(gene_data.loc[shared_genes, :].values.T + 1)
)