A missed chat on the rOpenSci website the other day asked:
Hi there, i am trying to use the taxize package and have a .csv file of species names to run through taxize updating them. What would be the code i would need to run to achieve this?
One way to answer this is to talk about the basic approach to importing data, doing stuff to the data, then recombining data. There are many ways to do this, but I’ll go over a few of them.
Install taxize
install.packages("taxize")
install.packages("downloader")
library("taxize")
Import data
We’ll use Winston Chang’s new downloader package to avoid problems with https, and get a dataset from our rOpenSci datasets repo https://github.com/ropensci/datasets
downloader::download("https://raw.githubusercontent.com/ropensci/datasets/master/planttraits/morphological.csv", "morphological.csv")
dat <- read.csv("morphological.csv", stringsAsFactors = FALSE)
head(dat)
#> species log_SLA leaf_water_content log_wood_density
#> 1 Abies concolor 3.46 0.51 -0.52
#> 2 Abies grandis 3.58 0.49 -0.51
#> 3 Abies magnifica 3.87 0.62 -0.53
#> 4 Acacia farnesiana NA NA NA
#> 5 Acer glabrum 5.07 0.69 -0.54
#> 6 Adenostoma fasciculata 3.56 0.46 -0.31
#> log_ht log_N
#> 1 7.72 0.02
#> 2 7.51 -0.31
#> 3 7.58 -0.14
#> 4 5.70 NA
#> 5 3.25 1.02
#> 6 5.33 0.29
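If you are starting from your own .csv file rather than our example dataset, the import step might look like the sketch below. The file contents, the `csv_path` and `my_dat` names, and the assumption that the names live in a column called `species` are all made up for illustration; swap in your own path and column name.

```r
# Hypothetical stand-in for your own file: write a tiny CSV to a temp path
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "species,log_ht",
  "Abies concolor,7.72",
  " Acer glabrum ,3.25",
  "Abies concolor,7.51"
), csv_path)

my_dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# Confirm the column holding the names exists before going further
stopifnot("species" %in% names(my_dat))

# Trim stray whitespace, then keep each name once so it is only looked up once
my_dat$species <- trimws(my_dat$species)
unique_names <- unique(my_dat$species)
unique_names
```

Working from the unique names keeps the number of web service requests down when the same species appears on many rows.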
After importing data, there are a variety of approaches you could take:
- Vector: Take the species names as a vector from your data.frame, clean them, then re-attach them to the data.frame later, or
- In-Place: Use for loops or lapply family functions to iterate over each name while simultaneously re-inserting it into the data.frame
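To make the difference concrete, here is a minimal sketch of both approaches on toy data, without any web service calls. The `toy` data.frame and the `tidy_name` helper are hypothetical; in practice the cleaning step would be a taxize function.

```r
# Toy data standing in for the imported data.frame
toy <- data.frame(
  species = c("abies concolor", "Acer  glabrum"),
  log_ht = c(7.72, 3.25),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning step: squeeze repeated spaces and capitalise the genus
tidy_name <- function(x) {
  x <- gsub("\\s+", " ", trimws(x))
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

# Vector approach: pull the column out, clean it, then put it back
splist_toy <- toy$species
toy$species_vec <- tidy_name(splist_toy)

# In-place approach: iterate over the names and write the result
# straight into the data.frame
toy$species_inplace <- vapply(toy$species, tidy_name, character(1),
                              USE.NAMES = FALSE)
toy
```

Both columns end up identical here; the choice is mostly about whether you want the intermediate vector around for inspection.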
1. Vector
Make a vector of names
splist <- dat$species
Then proceed to do name cleaning, e.g., we can use the tnrs function to see if any names are potentially misspelled.
tnrs_out <- tnrs(splist, source = "iPlant_TNRS")
head(tnrs_out)
#> submittedname acceptedname sourceid score
#> 1 Ceanothus prostratus Ceanothus prostratus iPlant_TNRS 1
#> 2 Abies magnifica Abies magnifica iPlant_TNRS 1
#> 3 Arctostaphylos canescens Arctostaphylos canescens iPlant_TNRS 1
#> 4 Berberis nervosa Berberis nervosa iPlant_TNRS 1
#> 5 Arbutus menziesii Arbutus menziesii iPlant_TNRS 1
#> 6 Calocedrus decurrens Calocedrus decurrens iPlant_TNRS 1
#> matchedname authority
#> 1 Ceanothus prostratus Benth.
#> 2 Abies magnifica A. Murray bis
#> 3 Arctostaphylos canescens Eastw.
#> 4 Berberis nervosa Pursh
#> 5 Arbutus menziesii Pursh
#> 6 Calocedrus decurrens (Torr.) Florin
#> uri
#> 1 http://www.tropicos.org/Name/27500276
#> 2 http://www.tropicos.org/Name/24900142
#> 3 http://www.tropicos.org/Name/12302547
#> 4 http://www.tropicos.org/Name/3500175
#> 5 http://www.tropicos.org/Name/12302436
#> 6 http://www.tropicos.org/Name/9400069
Names with a score of less than 1 may be misspelled:
tnrs_out[ tnrs_out$score < 1, ]
#> submittedname acceptedname sourceid score
#> 23 Adenostoma fasciculata Adenostoma fasciculatum iPlant_TNRS 0.97
#> 24 Arctostaphylos glandulosus Arctostaphylos glandulosa iPlant_TNRS 0.97
#> 36 Chamaebatia foliosa Chamaebatia foliolosa iPlant_TNRS 0.95
#> 38 Juniperus californicus Juniperus californica iPlant_TNRS 0.97
#> 77 Prunus illicifolia Prunus ilicifolia iPlant_TNRS 0.99
#> 78 Prunus subcordatus Prunus subcordata iPlant_TNRS 0.97
#> matchedname authority
#> 23 Adenostoma fasciculatum Hook. & Arn.
#> 24 Arctostaphylos glandulosa Eastw.
#> 36 Chamaebatia foliolosa Benth.
#> 38 Juniperus californica Carrière
#> 77 Prunus ilicifolia (Nutt. ex Hook. & Arn.) D. Dietr.
#> 78 Prunus subcordata Benth.
#> uri
#> 23 http://www.tropicos.org/Name/27801458
#> 24 http://www.tropicos.org/Name/12300542
#> 36 http://www.tropicos.org/Name/27801486
#> 38 http://www.tropicos.org/Name/9400374
#> 77 http://www.tropicos.org/Name/27801102
#> 78 http://www.tropicos.org/Name/27801124
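It can help to keep a small review table of just the names that changed, so you can check the corrections by eye before applying them. A minimal sketch follows, using a toy data.frame built from a few rows of the printout above rather than a live call. Storing the score as text here is an assumption made only to show the defensive conversion; I have not checked how a given taxize version returns it.

```r
# Toy stand-in for the tnrs() output, using rows from the printout above
toy_tnrs <- data.frame(
  submittedname = c("Abies magnifica", "Prunus illicifolia",
                    "Juniperus californicus"),
  acceptedname  = c("Abies magnifica", "Prunus ilicifolia",
                    "Juniperus californica"),
  score         = c("1", "0.99", "0.97"),
  stringsAsFactors = FALSE
)

# Coerce the score defensively in case it arrives as text
toy_tnrs$score <- as.numeric(toy_tnrs$score)

# Keep only rows where the accepted name differs from what was submitted
changed <- toy_tnrs[toy_tnrs$submittedname != toy_tnrs$acceptedname,
                    c("submittedname", "acceptedname", "score")]
changed
```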
So let’s take the acceptedname column as the new names and assign it to a new vector
cleaned_names <- tnrs_out$acceptedname
Then join the names back to the data.frame, either replacing the original column or adding them as a new column
Replace
dat$species <- cleaned_names
head(dat)
#> species log_SLA leaf_water_content log_wood_density
#> 1 Ceanothus prostratus 3.46 0.51 -0.52
#> 2 Abies magnifica 3.58 0.49 -0.51
#> 3 Arctostaphylos canescens 3.87 0.62 -0.53
#> 4 Berberis nervosa NA NA NA
#> 5 Arbutus menziesii 5.07 0.69 -0.54
#> 6 Calocedrus decurrens 3.56 0.46 -0.31
#> log_ht log_N
#> 1 7.72 0.02
#> 2 7.51 -0.31
#> 3 7.58 -0.14
#> 4 5.70 NA
#> 5 3.25 1.02
#> 6 5.33 0.29
New column
dat$species_cleaned <- cleaned_names
head(dat)
#> species log_SLA leaf_water_content log_wood_density
#> 1 Ceanothus prostratus 3.46 0.51 -0.52
#> 2 Abies magnifica 3.58 0.49 -0.51
#> 3 Arctostaphylos canescens 3.87 0.62 -0.53
#> 4 Berberis nervosa NA NA NA
#> 5 Arbutus menziesii 5.07 0.69 -0.54
#> 6 Calocedrus decurrens 3.56 0.46 -0.31
#> log_ht log_N species_cleaned
#> 1 7.72 0.02 Ceanothus prostratus
#> 2 7.51 -0.31 Abies magnifica
#> 3 7.58 -0.14 Arctostaphylos canescens
#> 4 5.70 NA Berberis nervosa
#> 5 3.25 1.02 Arbutus menziesii
#> 6 5.33 0.29 Calocedrus decurrens
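One caveat on the assignment above: the head(tnrs_out) printout starts with Ceanothus prostratus, while dat starts with Abies concolor, which suggests the service does not necessarily return names in the order they were submitted. If so, assigning the vector directly pairs names with the wrong rows. A minimal sketch of re-aligning by submitted name with match(), using toy data frames (the `toy_dat` and `toy_tnrs` objects and the `row_id` column are placeholders, not real output):

```r
# Toy data; row_id is a placeholder column, not real trait data
toy_dat <- data.frame(
  species = c("Abies magnifica", "Adenostoma fasciculata",
              "Prunus illicifolia"),
  row_id  = 1:3,
  stringsAsFactors = FALSE
)

# Toy resolver output, deliberately in a different order than toy_dat
toy_tnrs <- data.frame(
  submittedname = c("Prunus illicifolia", "Abies magnifica",
                    "Adenostoma fasciculata"),
  acceptedname  = c("Prunus ilicifolia", "Abies magnifica",
                    "Adenostoma fasciculatum"),
  stringsAsFactors = FALSE
)

# For each row of toy_dat, find the row of toy_tnrs with the same submitted name
idx <- match(toy_dat$species, toy_tnrs$submittedname)
toy_dat$species_cleaned <- toy_tnrs$acceptedname[idx]

# Names the resolver did not return come back as NA; fall back to the original
missing <- is.na(toy_dat$species_cleaned)
toy_dat$species_cleaned[missing] <- toy_dat$species[missing]
toy_dat
```

Adding a new column rather than overwriting makes it easy to compare old and new names side by side.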
2. In-place
You can use functions from the dplyr package to split-apply-combine, where split means splitting apart your vector for each taxon, apply means applying a function or functions to do name cleaning, and combine means putting the results back together.
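As a minimal offline sketch of that pattern, assuming a reasonably recent dplyr, the example below runs a per-row function on toy data. The `genus_of` helper is hypothetical and simply stands in for a taxize call that takes one name at a time.

```r
library("dplyr")

# Toy data standing in for the imported data.frame
toy <- data.frame(
  species = c("Abies concolor", "Acer glabrum"),
  stringsAsFactors = FALSE
)

# Hypothetical per-name function: return the first word of the name
genus_of <- function(x) strsplit(x, " ", fixed = TRUE)[[1]][1]

toy_new <- toy %>%
  rowwise() %>%                          # split: treat each row as its own group
  mutate(genus = genus_of(species)) %>%  # apply: run the function once per row;
                                         # mutate gathers the results into one column
  ungroup()                              # drop the row-wise grouping afterwards
toy_new
```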
Here, we’ll attach taxonomic ids from the Catalogue of Life to each species (that is, each row), using just a subset of the data to save time:
library("dplyr")
tbl_df(dat)[1:5,] %>%
  rowwise() %>%
  mutate(colid = get_colid(species)) %>%
  select(species, colid)
#> Source: local data frame [5 x 2]
#> Groups: <by row>
#>
#> species colid
#> 1 Ceanothus prostratus 19544732
#> 2 Abies magnifica 18158318
#> 3 Arctostaphylos canescens 19358934
#> 4 Berberis nervosa 19374077
#> 5 Arbutus menziesii 19358819
Let’s do something a bit more complicated. Get common names for each taxon in a new column; if there is more than one, concatenate them into a single character string for easy inclusion in a data.frame
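sci2comm_concat <- function(x){
  # look up common names for one scientific name in EOL
  temp <- sci2comm(x, db = "eol")
  # NA when nothing comes back, otherwise one comma-separated string
  if(length(temp) == 0 || length(temp[[1]]) == 0) NA_character_ else paste0(temp[[1]], collapse = ", ")
}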
dat_new <- tbl_df(dat)[1:5,] %>%
  rowwise() %>%
  mutate(comm = sci2comm_concat(species))
To see the new column, do
dat_new %>% select(comm)
#> Source: local data frame [5 x 1]
#> Groups: <by row>
#>
#> comm
#> 1 Mahala-mat Ceanothus, prostrate ceanothus, squawcarpet
#> 2 Prächtige Tanne, Goldtanne (Gold-Tanne), Kalifornische Rot-Tanne, Pracht-Ta
#> 3 hoary manzanita, hoary manzanita, Sonoma manzanita
#> 4 Longleaf Oregon-grape, Cascade barberry, Dull Oregon grape, Oregon grape-ho
#> 5 pacific madrone, Madrona, madrone, Kalifornianmansikkapuu
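Row 3 above lists "hoary manzanita" twice, so you may want to drop duplicates before concatenating. A minimal sketch of just the collapsing step, separated from the web lookup so it can be tried offline (`collapse_names` is a hypothetical helper, not part of taxize):

```r
# Collapse a character vector of common names into one string,
# dropping NAs, empty strings, and exact duplicates first
collapse_names <- function(x) {
  x <- unique(x[!is.na(x) & nzchar(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ", ")
}

collapse_names(c("hoary manzanita", "hoary manzanita", "Sonoma manzanita"))
#> [1] "hoary manzanita, Sonoma manzanita"

collapse_names(character(0))
#> [1] NA
```

This only removes exact matches; names that differ in case, such as "Madrona" and "madrone", are kept as separate entries.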