Open questions from recent workshops on cell ontology and taxonomy

The Allen Institute recently hosted two workshops on Cell Ontology & Taxonomy, which brought together scientists studying brain cell types with subject matter experts in ontology building. In my opinion, these workshops provided three useful outcomes: (1) they got people thinking about this topic, (2) they led to a cell naming and organization convention that will be used for single-cell RNA-sequencing data in our upcoming data release (stay tuned…), and (3) they highlighted some of the outstanding questions and challenges that still need to be tackled moving forward. I will focus the remainder of this post on a subset of these questions and challenges.

  • How do we name cell types in a systematic way? - Discussion revolved around how to balance the use of high-throughput gene expression information with canonical, generally morphologically-based names (e.g., Chandelier cells), or whether to use more generic cell type names (e.g., Neuron 12).
  • How do we organize cell types? - In some brain regions, cell types can be hierarchically organized, while in others the organization is more complicated. One outcome of our first workshop was a proposal for how to put these cell types in a probabilistic framework, and how to map cells within this framework. Discussion at the second workshop focused on how to organize cell types into new or existing ontologies.
  • To what extent can we take advantage of existing ontology work? - Several widely used ontologies already exist in this space (e.g., the Cell Ontology and the Gene Ontology). Discussion revolved around situations where we could use or expand these existing tools, and situations where novel strategies would make more sense.
  • How do we extend to multiple modalities or organ systems? - While the primary topic of discussion was related to gene expression in brain tissue, many people felt that an ontology system should allow inclusion of data from other organ systems (e.g., kidney or liver) or from other modalities (e.g., electrophysiology or morphology).

Whether you are a workshop attendee, an Allen Institute employee, a student, or any other interested scientist, I would encourage you to reply with any thoughts you have on this topic. I would also encourage you to explore the cell taxonomy data and publications from the Allen Institute.


The workshop participants proposed the following convention for cell type nomenclature in the adult mammalian brain.

| Descriptor | Description | Example | Comment |
|---|---|---|---|
| Cell type accession ID - Leaf | An ID uniquely identifying a cell type across all possible data sets or taxonomies. | CS1910120001 | |
| Cell type alias | Current reasonably-accurate names for cell types, based on gene expression and other anatomic features (published). | Inh L1-2 PAX6 CDH12 | @scheuerm suggests Inh L1-2 FBXL7 TGFBR2 |
| Cell type label | A short name with a header that is part of a dictionary of broad cell types (e.g., Inh = inhibitory) followed by a number. | Neuron 1; Non-neuron 3 | |
| Cell set accession ID - Node | An ID uniquely identifying a node across all possible data sets or taxonomies. | CS1910120014 | |
| Cell set alias | Either NULL or a flexible descriptor that approximates the features of the included cell types. | ADARB2 (CGE); FEZF2 | |
| Cell set label | A concatenation of the included cell type names (non-hierarchical). | Neuron 1-6; Non-neuron 1-4 | |
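
To make the convention above easier to picture, here is a minimal Python sketch of how one entry might be represented. The class name `CellSet`, the helper `node_label`, and the rule for collapsing contiguous numbers into a range are assumptions made for illustration; they are not part of the workshop proposal. The example values come from the table, but pairing them in one record is illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CellSet:
    """One entry in the convention: a leaf (cell type) or a node (cell set)."""
    accession_id: str            # e.g. "CS1910120001"
    label: str                   # e.g. "Neuron 1" (leaf) or "Neuron 1-6" (node)
    alias: Optional[str] = None  # e.g. "Inh L1-2 PAX6 CDH12"; None stands in for NULL


def node_label(prefix: str, numbers: List[int]) -> str:
    """Build a cell set label from the numbers of the included cell types.

    Assumption: contiguous numbers collapse to a range ("Neuron 1-6");
    non-contiguous numbers are listed with commas.
    """
    nums = sorted(set(numbers))
    if not nums:
        raise ValueError("a cell set must include at least one cell type")
    if len(nums) == 1:
        return f"{prefix} {nums[0]}"
    if nums == list(range(nums[0], nums[-1] + 1)):
        return f"{prefix} {nums[0]}-{nums[-1]}"
    return f"{prefix} " + ",".join(str(n) for n in nums)


# Illustrative records only; the table does not say these values belong together.
leaf = CellSet("CS1910120001", "Neuron 1", "Inh L1-2 PAX6 CDH12")
node = CellSet("CS1910120014", node_label("Neuron", [1, 2, 3, 4, 5, 6]), "ADARB2 (CGE)")
```

Keeping the alias optional mirrors the "Either NULL or a flexible descriptor" wording for cell sets, while the accession ID and label are always required.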

I don’t see the value of having separate “cell sets” that are distinct from cell types. The cell ontology handles these hierarchical relationships through the type-subtype hierarchy. In the immune system, a central memory helper T cell is_a helper T cell is_a T cell is_a lymphocyte is_a leukocyte. These are all cell types, and some include collections (sets) of cell types. Is this what is being sought using the proposed cell set nomenclature? If so, I don’t think it is necessary.
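
The type-subtype point can be illustrated with a small Python sketch built only from the is_a chain quoted above. The dictionary `IS_A` and the functions `ancestors` and `subtypes` are hypothetical names for this example, and the sketch assumes each type has a single parent, which a real ontology need not satisfy.

```python
from typing import Dict, List, Optional

# The is_a chain from the immune system example; None marks the top of this chain.
IS_A: Dict[str, Optional[str]] = {
    "central memory helper T cell": "helper T cell",
    "helper T cell": "T cell",
    "T cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "leukocyte": None,
}


def ancestors(cell_type: str, is_a: Dict[str, Optional[str]] = IS_A) -> List[str]:
    """Follow is_a links upward and return every parent type, nearest first."""
    chain = []
    parent = is_a[cell_type]
    while parent is not None:
        chain.append(parent)
        parent = is_a[parent]
    return chain


def subtypes(cell_type: str, is_a: Dict[str, Optional[str]] = IS_A) -> List[str]:
    """Return every type whose is_a chain passes through `cell_type`.

    This is the "set" of cell types that a parent type implicitly contains.
    """
    return [t for t in is_a if cell_type in ancestors(t, is_a)]
```

Here `subtypes("T cell")` returns the two T cell subtypes in the chain, which is the sense in which a parent cell type already acts as a collection of cell types.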

There is a desire that this naming convention would be “extensible to the larger cell typing community” but it is not clear what this would look like for cells in other organs.

I’m more concerned with cell type definitions than cell type names. As you know, we have developed the NS-Forest method to objectively select necessary and sufficient marker genes, with statistical measures of their classification accuracy, that can be used to construct reproducible cell type definitions. However, we have found that these optimal marker genes are different from the genes currently being used for these cell type names, which are being selected by procedures that are unclear to me. This could lead to some confusion for users. To make the proposed nomenclature consistent with the cell type definition, we would select the best marker gene for the granular cell type and for the parent cell type, if that is what is desired. Thus the name for the Inh L1-2 PAX6 CDH12 cell type listed above would be Inh L1-2 FBXL7 TGFBR2. This may not mesh precisely with prior knowledge, but it is the most logical way to support automated naming and definition construction.
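
As a rough illustration of what automated naming could look like, the sketch below assembles a name from a broad class prefix, a layer range, and two marker genes. The function `compose_alias` and its parameter names are hypothetical. The post does not say which of the two genes is the parent marker and which is the granular marker, so the ordering used here is an assumption, and the marker selection itself (e.g., by NS-Forest) is not shown.

```python
def compose_alias(class_prefix: str, layers: str,
                  parent_marker: str, leaf_marker: str) -> str:
    """Join the name parts with spaces.

    Assumption: the parent cell type's marker precedes the granular
    cell type's marker; the source does not state this ordering.
    """
    parts = [class_prefix, layers, parent_marker, leaf_marker]
    if any(not p or " " in p for p in parts):
        raise ValueError("each name part must be a non-empty token without spaces")
    return " ".join(parts)


# Reproduces the suggested name, under the assumed marker ordering.
suggested = compose_alias("Inh", "L1-2", "FBXL7", "TGFBR2")
```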