taxize use case: Resolving species names when you have a lot of them
Species names can be a pain in the ass, especially if you are an ecologist. We ecologists aren’t trained in taxonomy, yet we often end up with huge species lists. Of course we want to correct any spelling errors in the names, and get the newest names for our species, resolve any synonyms, etc.
We are building tools into our R package taxize that let you check your species names to make sure they are correct.
An important use case is when you have a lot of species. Someone wrote to us recently, saying that they had thousands of species, and they wanted to know how to check their species names efficiently in R.
Below is an example of how to do this.
Get some species, in this case all species in the Scrophulariaceae family from theplantlist.org.
Let's grab the species and concatenate genus and species epithet into the form genus_species.
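As a minimal sketch of that concatenation step: the toy data frame below stands in for the table downloaded from theplantlist.org, and the column names `Genus` and `Species` are an assumption about how that table is laid out.

```r
# Toy stand-in for the table downloaded from theplantlist.org;
# the column names `Genus` and `Species` are assumed here.
dat <- data.frame(
  Genus   = c("Buddleja", "Buddleja", "Scrophularia"),
  Species = c("montana", "davidii", "nodosa"),
  stringsAsFactors = FALSE
)

# Join genus and epithet with an underscore, e.g. "Buddleja_montana"
species <- paste(dat$Genus, dat$Species, sep = "_")
species
```

With the real download you would only swap `dat` for the data frame you read in.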
It’s better to make many smaller calls to a web API than a few big ones, to be nice to the database maintainers.
Query your large species list in batches, pausing 3 seconds between calls so you don't hit the web service too hard. We use the POST method here instead of GET, which is required when you have a lot of species.
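The batching-with-pauses idea can be sketched roughly like this. The helper names `make_batches` and `resolve_in_batches`, the batch size of 100, and the `echo_resolver` function are all made up for illustration; `echo_resolver` is only a placeholder so the sketch runs offline, and in practice you would pass a function that wraps whichever name-resolution call you are using.

```r
# Split a character vector into batches of at most `size` names
make_batches <- function(x, size) {
  split(x, ceiling(seq_along(x) / size))
}

# Call `resolver` on each batch, pausing `pause` seconds between calls.
# `resolver` is a hypothetical stand-in for the name-resolution call;
# it must take a character vector and return a data.frame.
resolve_in_batches <- function(x, resolver, size = 100, pause = 3) {
  batches <- make_batches(x, size)
  out <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    out[[i]] <- resolver(batches[[i]])
    if (i < length(batches)) Sys.sleep(pause)  # no pause after the last batch
  }
  do.call(rbind, out)
}

# Dummy resolver so the sketch runs offline: it just echoes the names back
echo_resolver <- function(nms) {
  data.frame(submittedName = nms, stringsAsFactors = FALSE)
}

species <- paste0("Genus_species", 1:250)
# pause = 0 only to keep the demo fast; use a real pause against a live service
outdf <- resolve_in_batches(species, echo_resolver, size = 100, pause = 0)
nrow(outdf)
```

Collecting the per-batch data frames in a list and binding them once at the end avoids growing a data frame inside the loop.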
Note that there are multiple names for some species because data sources have different names for the same species (resulting in more than one row in the data.frame ‘outdf’ for a species). We are leaving it up to the user to decide which to use. For example, for the species Buddleja montana there are two names in the output:
The source iPlant matched the name, but NCBI actually gave back a genus of cricket (follow the link under the column uri for Montana). If you look at the page for Buddleja on NCBI here there is no Buddleja montana at all.
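One way to find every species that came back with more than one row is sketched below. The rows are made up to imitate the shape of ‘outdf’, and the column names (`submittedName`, `matchedName`, `sourceId`) are assumptions, so adjust them to whatever your output actually contains.

```r
# Made-up rows imitating the shape of `outdf`; column names are assumed
outdf <- data.frame(
  submittedName = c("Buddleja montana", "Buddleja montana", "Scrophularia nodosa"),
  matchedName   = c("Buddleja montana", "Montana", "Scrophularia nodosa"),
  sourceId      = c("iPlant_TNRS", "NCBI", "iPlant_TNRS"),
  stringsAsFactors = FALSE
)

# Count rows per submitted name and keep the names that occur more than once
counts <- table(outdf$submittedName)
dups <- names(counts[counts > 1])

# All rows for those names, for manual inspection
conflicts <- outdf[outdf$submittedName %in% dups, ]
conflicts
```

This only flags the candidates; deciding which source to trust is still a manual step.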
Another thing we could do is look at the score that is returned. Let’s look at those that are less than 1:
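A minimal sketch of that filter, assuming the scores sit in a column called `score`. The rows and score values below are invented for illustration; the `as.numeric()` call is there in case the scores arrive as text.

```r
# Made-up rows and scores for illustration; the `score` column name is assumed
outdf <- data.frame(
  submittedName = c("Buddleja montana", "Buddleja montana", "Scrophularia nodosa"),
  sourceId      = c("iPlant_TNRS", "NCBI", "iPlant_TNRS"),
  score         = c("1", "0.5", "1"),
  stringsAsFactors = FALSE
)

# Convert to numeric in case the scores were returned as character strings
outdf$score <- as.numeric(outdf$score)

# Keep rows with a score below 1; which() drops any NA scores
low <- outdf[which(outdf$score < 1), ]

# Worst matches first
low <- low[order(low$score), ]
low
```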
As we got this species list from theplantlist.org, there aren’t that many mistakes, but if it were my species list you know there would be many :)