Open questions from recent workshops on cell ontology and taxonomy

The Allen Institute recently hosted two workshops on Cell Ontology & Taxonomy, which brought together scientists studying brain cell types with subject matter experts in ontology building. In my opinion, these workshops provided three useful outcomes: (1) they got people thinking about this topic, (2) they led to a cell naming and organization convention that will be used for single cell RNA-sequencing data in our upcoming data release (stay tuned…), and (3) they highlighted some of the outstanding questions and challenges that still need to be tackled moving forward. I will focus the remainder of this post on a subset of these questions and challenges.

  • How do we name cell types in a systematic way? - Discussion revolved around how to balance the use of high-throughput gene expression information with canonical, generally morphologically-based names (e.g., Chandelier cells), or whether to use more generic cell type names (e.g., Neuron 12).
  • How do we organize cell types? - In some brain regions, cell types can be hierarchically organized, while in others the organization is more complicated. One outcome of our first workshop was a proposal for how to put these cell types in a probabilistic framework, and how to map cells within this framework. Discussion at the second workshop focused on how to organize cell types into new or existing ontologies.
  • To what extent can we take advantage of existing ontology work? - Several widely used ontologies already exist in this space (e.g., Cell Ontology, Gene Ontology). Discussion revolved around situations where we could use or expand these existing tools, and situations where novel strategies would make more sense.
  • How do we extend to multiple modalities or organ systems? - While the primary topic of discussion was related to gene expression in brain tissue, many people felt that an ontology system should allow inclusion of data from other organ systems (e.g., kidney or liver) or from other modalities (e.g., electrophysiology or morphology).

Whether you are a workshop attendee, an Allen Institute employee, a student, or any other interested scientist, I would encourage you to reply with any thoughts you have on this topic. I would also encourage you to explore information related to cell taxonomy data and publications from the Allen Institute.


The workshop participants proposed the following convention for cell type nomenclature in the adult mammalian brain.

| Descriptor | Description | Example | Comment |
|---|---|---|---|
| Cell type accession ID - Leaf | An ID uniquely identifying a cell type across all possible data sets or taxonomies. | CS1910120001 | |
| Cell type alias | Current reasonably-accurate names for cell types, based on gene expression and other anatomic features (published). | Inh L1-2 PAX6 CDH12 | @scheuerm suggests Inh L1-2 FBXL7 TGFBR2 |
| Cell type label | A short name with a header that is part of a dictionary of broad cell types (e.g., Inh = inhibitory) followed by a number. | Neuron 1; Non-neuron 3 | |
| Cell set accession ID - Node | An ID uniquely identifying a node across all possible data sets or taxonomies. | CS1910120014 | |
| Cell set alias | Either NULL or a flexible descriptor that approximates the features of the included cell types. | ADARB2 (CGE); FEZF2 | |
| Cell set label | A concatenation of the included cell type names (non-hierarchical). | Neuron 1-6; Non-neuron 1-4 | |
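To make the "cell set label" row concrete, here is a minimal Python sketch of how such a label might be built from cell type labels. The function name `cell_set_label` and the rule of collapsing consecutive numbers into ranges are assumptions made for illustration; the convention itself only states that the label is a non-hierarchical concatenation of the included cell type names.

```python
from collections import defaultdict


def cell_set_label(type_labels):
    """Build a cell set label from cell type labels such as 'Neuron 1'.

    Hypothetical helper: assumes every label is '<header> <integer>' and
    that consecutive integers are written as a range (an assumption based
    on the example 'Neuron 1-6; Non-neuron 1-4').
    """
    numbers_by_header = defaultdict(list)
    for label in type_labels:
        header, number = label.rsplit(" ", 1)
        numbers_by_header[header].append(int(number))

    parts = []
    # Headers are emitted in the order they first appear in the input.
    for header, numbers in numbers_by_header.items():
        numbers = sorted(set(numbers))
        runs = []
        start = prev = numbers[0]
        for n in numbers[1:]:
            if n == prev + 1:
                prev = n
                continue
            runs.append((start, prev))
            start = prev = n
        runs.append((start, prev))
        text = ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)
        parts.append(f"{header} {text}")
    return "; ".join(parts)


labels = [f"Neuron {i}" for i in range(1, 7)]
labels += [f"Non-neuron {i}" for i in range(1, 5)]
print(cell_set_label(labels))
```

How gaps in the numbering should be written (here, "Neuron 1-2, 4") is not specified by the proposal, so that part in particular should be read as a placeholder.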

I don’t see the value of having separate “cell sets” that are distinct from cell types. The cell ontology handles these hierarchical relationships through the type-subtype hierarchy. In the immune system, a central memory helper T cell is_a helper T cell is_a T cell is_a lymphocyte is_a leukocyte. These are all cell types, and some include collections (sets) of cell types. Is this what is being sought with the proposed cell set nomenclature? If so, I don’t think it is necessary.
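The is_a chain above can be sketched as a small parent map in Python. This is a toy illustration only: the dictionary `IS_A` and the helpers `ancestors` and `subtypes` are hypothetical names, each type is given a single parent, and the term names are taken from the sentence above rather than from the actual ontology file.

```python
# Toy single-parent is_a map built from the immune-cell example above.
IS_A = {
    "central memory helper T cell": "helper T cell",
    "helper T cell": "T cell",
    "T cell": "lymphocyte",
    "lymphocyte": "leukocyte",
}


def ancestors(cell_type, is_a=IS_A):
    """Return the chain of broader types, nearest parent first."""
    chain = []
    while cell_type in is_a:
        cell_type = is_a[cell_type]
        chain.append(cell_type)
    return chain


def subtypes(cell_type, is_a=IS_A):
    """Return every type that reaches cell_type by following is_a links."""
    return {child for child in is_a if cell_type in ancestors(child, is_a)}


print(ancestors("central memory helper T cell"))
print(subtypes("T cell"))
```

In this view a "set" is simply whatever `subtypes` returns for a broader type, which is the sense in which the hierarchy already covers collections of cell types.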

There is a desire that this naming convention would be “extensible to the larger cell typing community” but it is not clear what this would look like for cells in other organs.

I’m more concerned with cell type definitions than cell type names. As you know, we have developed the NS-Forest method to objectively select necessary and sufficient marker genes with statistical measures of their classification accuracy that can be used to construct reproducible cell type definitions. However, we have found that these optimal marker genes are different from the genes currently being used for these cell type names, which are being selected by procedures that are unclear to me. This could lead to some confusion for the users. To make the proposed nomenclature consistent with the cell type definition, we would select the best marker gene for the granular cell type and the parent cell type, if that is what is desired. Thus the name for the Inh L1-2 PAX6 CDH12 cell type listed above would be Inh L1-2 FBXL7 TGFBR2. This may not mesh precisely with prior knowledge, but this is the most logical way to support automated naming and definition construction.

Hi @scheuerm, thank you for your comments. It is important to note that our current naming convention completely sidesteps the question of what you call “cell type definitions”, and therefore would be compatible with NS-Forest or any other strategy used in other organs for defining cell types. With our proposed convention, Inh L1-2 FBXL7 TGFBR2 would go in the cell set alias slot. More generally, our schema is intended to allow for a likely lengthy process of deciding how to name cell types (which may vary in different organ systems). The cell type label slot should also allow for linking across organ systems, as these terms (e.g., “Neuron”) should be broad enough to be indisputable in the relevant community and would (ideally) link up with cell ontology, or another existing ontology. We also intend to release a descriptor that we are currently calling a cell type alternative alias which is meant to house the common usage term for a given cell type (e.g., Chandelier cell).

Finally, we definitely appreciate your input regarding whether “cell sets” and “cell types” should be distinct or treated the same; we have had much debate on this issue at the Allen Institute as well! From an ontological perspective your point is well taken: cell types are just part of the hierarchy, and our schema does treat cell types the same way as cell sets by providing a label and an accession ID of the same format for both. The only difference is in the alias, for which (as mentioned above) standard conventions still need to be established. That said, from both an analysis and a presentation perspective, cell types have very special meaning in the space of cell sets. For example, when we cluster single cell RNA-sequencing data we define cell types first and then construct a hierarchy post-hoc. These cell types represent the most reliable parceling of data into types (for a given analysis and data set) and allow us to give a quantitative answer to the much-asked question: “How many cell types are there in the brain?” We hope that our schema will be compatible with both of these goals (consistency in the ontology with some distinction of cell types) and would love feedback as to how this could be improved.
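As a rough sketch of why the leaves are singled out, the following Python toy counts cell types as the leaves of a hierarchy built over them. The tree contents, the `CHILDREN` map, and the `leaves` helper are all invented for illustration and do not correspond to any released taxonomy.

```python
# Hypothetical hierarchy: keys are cell sets (nodes), values are children.
# Anything that never appears as a key is a cell type (leaf).
CHILDREN = {
    "Neuron 1-3; Non-neuron 1": ["Neuron 1-3", "Non-neuron 1"],
    "Neuron 1-3": ["Neuron 1-2", "Neuron 3"],
    "Neuron 1-2": ["Neuron 1", "Neuron 2"],
}


def leaves(node, children=CHILDREN):
    """Return the cell types (leaves) under a node, left to right."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves(kid, children))
    return found


root = "Neuron 1-3; Non-neuron 1"
cell_types = leaves(root)
print(len(cell_types), cell_types)
```

The number of leaves is the "how many cell types" answer for this one toy taxonomy; rebuilding the hierarchy above the same leaves would change the nodes but not that count.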