Error in build_orthology_table

I’m looking to use MapMyCells on a naked mole-rat dataset, so I’m following the process for finding orthologous genes described on the GeneOrthology GitHub page. I really only need orthologs between mice and NMRs, so I ran this:

taxIDs <- setNames(c(10181, 10090),
                   c("Naked.mole.rat", "Mouse"))

build_orthology_table(taxIDs = taxIDs, primaryTaxID = c(10181, 10090),  
                       outputFilePrefix="nmr_orthologs",verbose=TRUE,
                       includeNonMammalianSpecies = FALSE)

and got the following output/error:

[1] "The following species are not included in NCBI's gene ortholog table and will be omitted:"
[1] "Mouse"
[1] "Building the output orthology table..."
[1] "...putting first primary taxonomy"
[1] "...building the table"
Error in `[.data.frame`(orthologTable, , 1) : undefined columns selected

Any advice for how to fix this would be appreciated!

Hi @lizchcase,

I’m glad to see you are interested in this library! The issue is that naked mole-rat is not one of the options for primary taxonomies (i.e., species that orthologs are compared against), but it is one of the options for taxIDs (i.e., species included in the NCBI ortholog table at all). This code should work instead:

library(GeneOrthology)
options(timeout = 300)  # To avoid time out on downloading large files
taxIDs <- setNames(c(10181, 10090),
                   c("Naked.mole.rat", "Mouse"))

build_orthology_table(taxIDs = taxIDs, primaryTaxID = 10090,  # Note edit here
                       outputFilePrefix="nmr_orthologs",verbose=TRUE,
                       includeNonMammalianSpecies = FALSE)

This worked for me, but let me know if you run into issues.

Best,
Jeremy

Hi Jeremy,

Thanks so much! A couple more questions:

  1. This ran, but the resulting orthology table only has three rows. I would have expected the output to include many more genes. What should the orthology table look like?

  2. There have been quite a few NMR genomes released over the past 14 years. My data is mapped to the genome released in November of 2024 (Sokolowski … Wilson 2024, bioRxiv), which is the most recent genome on ENSEMBL. Do you know which genome was used for this? Does it matter if my data is mapped to a different one?

Thanks!

@jeremyinseattle pinging on this!

Hi @lizchcase,

Thanks for retagging me; somehow I missed your reply. You are right about the three rows. It seems something weird is going on with the Mouse genome (I think on the NCBI end, but possibly with my code). If I run with human as well (code below), I get a much more reasonable output file (2.6 MB). Hopefully, this will work for you!

Updated code to get this ^:

library(GeneOrthology)
options(timeout = 300)  # To avoid time out on downloading large files
taxIDs <- setNames(c(10181, 10090, 9606),
                   c("Naked.mole.rat", "Mouse", "Human"))

build_orthology_table(taxIDs = taxIDs, primaryTaxID = c(9606,10090),  
                       outputFilePrefix="nmr_orthologs",verbose=TRUE,
                       includeNonMammalianSpecies = FALSE)

For your second question, the scripts pull whatever is in this file at the time you run them: “https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz”. I believe all of these DATA files on NCBI are updated daily, but I don’t know the details of exactly how. I suspect there is metadata in that file saying which versions are being used, but I have not looked.
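One way to see what that file currently holds for a species is to filter it by taxonomy ID. The sketch below is not part of GeneOrthology: the helper name `count_ensembl_genes` is made up for illustration, and it assumes the table has columns for the taxonomy ID, the NCBI GeneID, and the Ensembl gene identifier (named here `tax_id`, `GeneID`, and `Ensembl_gene_identifier`), which is worth confirming against the file's header line. It runs on a tiny toy table with made-up identifiers; the commented lines show how the real file might be read.

```r
# Hypothetical helper (not part of GeneOrthology): count distinct NCBI genes
# and distinct Ensembl gene IDs recorded for one taxonomy ID.
# Assumes columns named tax_id, GeneID and Ensembl_gene_identifier.
count_ensembl_genes <- function(gene2ensembl, taxID) {
  rows <- gene2ensembl[gene2ensembl$tax_id == taxID, , drop = FALSE]
  c(ncbi_genes    = length(unique(rows$GeneID)),
    ensembl_genes = length(unique(rows$Ensembl_gene_identifier)))
}

# Toy table with made-up identifiers, standing in for the real download
toy <- data.frame(
  tax_id = c(10181, 10181, 10181, 10090),
  GeneID = c(1, 1, 2, 3),
  Ensembl_gene_identifier = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  stringsAsFactors = FALSE
)

nmr_counts <- count_ensembl_genes(toy, 10181)
print(nmr_counts)

# To try it on the real file (a large download), something like this
# should work, assuming the header line begins with "#tax_id":
# options(timeout = 300)
# download.file("https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz",
#               "gene2ensembl.gz", mode = "wb")
# g2e <- read.delim(gzfile("gene2ensembl.gz"), check.names = FALSE)
# names(g2e)[1] <- "tax_id"
# count_ensembl_genes(g2e, 10181)
```

If the count for 10181 comes back very small, or the Ensembl IDs do not look like those in your annotation, that would be a hint that the versions differ.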

The versions may matter in two cases: if there is a MAJOR genome update that differs from what you have, or if you get unlucky and the specific gene you want is one of the differences. That said, overall I think you are unlikely to have an issue if versions disagree.

Best,
Jeremy
